Tensor Parallelism: Sometimes It's Not the Compute but the Memory
Whenever someone explains how AI models work, it always comes down to this: 'matrix multiplications'. But here's the thing: what happens when the matrices are just too big to fit in a single machine's memory, whether for training or inference?
Let's talk some real numbers
Say we have a model with 100 billion parameters. Breaking things down: we're using FP16/BF16 (mixed-precision training), which is pretty standard these days.
Model weights:
- 100B * 2 bytes (FP16) = 200GB
Gradients:
- 100B * 2 bytes (FP16) = 200GB
Optimizer states (Adam, let's say):
This is where it gets heavy. The Adam optimizer doesn't just store your parameters; it keeps 2 additional states per parameter:
- Momentum
- Variance
Both stored in FP32 for stability
- 100B * 8 bytes (2 FP32 states) = 800GB
Total so far ~1.2TB
But wait, there's more!
During forward and backward passes, you need to store intermediate values (activations).
The memory required depends on:
- Batch size
- Sequence length
- Model architecture
- Whether you are using gradient checkpointing
For a 100B model with a reasonable batch size and sequence length, that's roughly 200-400GB of activation memory.
So total VRAM needed: 1.4-1.6TB.
Now, speaking realistically, a single GPU has 80-90GB of VRAM (A100 80GB, H100 80GB), so to train this 100B model we would need at least 20-25 GPUs. How do we distribute and coordinate the work?
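If you want to sanity-check that arithmetic, here's a quick back-of-the-envelope sketch in Python (the activation range is the rough assumption from above, not an exact number):

```python
# Rough memory budget for training a 100B-parameter model with Adam in mixed precision.
params = 100e9

weights_gb   = params * 2 / 1e9      # FP16 weights: 2 bytes each
grads_gb     = params * 2 / 1e9      # FP16 gradients: 2 bytes each
optimizer_gb = params * 8 / 1e9      # Adam momentum + variance in FP32: 4 + 4 bytes each
activations_gb = (200, 400)          # rough range; depends on batch, seq length, checkpointing

static_gb = weights_gb + grads_gb + optimizer_gb
print(f"weights {weights_gb:.0f} GB, grads {grads_gb:.0f} GB, optimizer {optimizer_gb:.0f} GB")
print(f"total: {static_gb + activations_gb[0]:.0f}-{static_gb + activations_gb[1]:.0f} GB")
print(f"80GB GPUs needed: ~{(static_gb + activations_gb[1]) / 80:.0f}")
```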
The Other Solutions (That don't really work)
Before we get into tensor parallelism, let's quickly look at the 'obvious' solutions.
Data Parallelism
Replicate the entire model on each GPU, but split the data batch across GPUs. Each GPU processes different examples, then you average the gradients.
You can see the problem here: you still need the full model on each GPU. If your model doesn't fit on 1 GPU, this doesn't help at all! Data parallelism only speeds things up when the model already fits.
Pipeline Parallelism
Split the model by layers: put layers 1-25 on GPU 0, layers 26-50 on GPU 1, and so on.
There are a couple of problems here: each GPU still needs to fit entire layers in memory, and GPUs spend a lot of time sitting idle, because each GPU's input depends on the previous GPU's output.
Enter: Tensor Parallelism
This doesn't split the model by layers. It splits individual layers across multiple GPUs.
As I mentioned earlier, it's just 'matrix multiplication'. But how? How do we distribute a single matrix multiplication across different GPUs?
Keeping this simple, we have
C = A * B
- A has shape (m, n)
- B has shape (n, p)
- C will be the output and is (m, p)
On a single machine it is straightforward: load A and B into memory, multiply them, get C. Done.
But now let's say we have 2 GPUs and we want to split the work between them. How do we split it?
We have 2 ways to do that:
- Column wise tensor parallelism
- Row wise tensor parallelism
Let's take 2 simple matrices and see how it works.

Matrices A and B
1. Column wise tensor parallelism
Split matrix B into 2 parts column-wise and duplicate matrix A on both GPUs. Each GPU computes its part independently.
- GPU0: C1 = A * B1 -> shape (2, 1)
- GPU1: C2 = A * B2 -> shape (2, 1)
At last, we concatenate the results C1 and C2 and get matrix C. This is called an 'all-gather' operation: we are not summing the values, just gathering (concatenating) the results from different GPUs.

Column Parallelism
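Here's a minimal sketch of the column-wise split, assuming PyTorch and using tensor chunks to stand in for the two GPUs:

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 2)          # input, replicated on both "GPUs"
B = torch.randn(2, 2)          # weight, split column-wise

B1, B2 = B.chunk(2, dim=1)     # each "GPU" holds half of B's columns

C1 = A @ B1                    # GPU 0: shape (2, 1)
C2 = A @ B2                    # GPU 1: shape (2, 1)

# all-gather: concatenate the partial results along the column dimension
C = torch.cat([C1, C2], dim=1)
assert torch.allclose(C, A @ B, atol=1e-6)
```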
2. Row wise tensor parallelism
Split matrix B row-wise into 2 parts. To keep the matrix multiplication valid, we must also split the input matrix A into 2 parts by its columns.
- GPU0: C1 = A1 × B1 -> shape (2, 2)
- GPU1: C2 = A2 × B2 -> shape (2, 2)
Now C will be C1 + C2 (element-wise sum).
This is called an 'all-reduce' operation: here we do an element-wise sum (reduce) of the values from different GPUs.

Row Parallelism
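And the same kind of sketch for the row-wise split, again simulating the two GPUs with chunks of the matrices:

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 2)
B = torch.randn(2, 2)

A1, A2 = A.chunk(2, dim=1)     # split A by columns
B1, B2 = B.chunk(2, dim=0)     # split B by rows

C1 = A1 @ B1                   # GPU 0: shape (2, 2)
C2 = A2 @ B2                   # GPU 1: shape (2, 2)

# all-reduce: element-wise sum of the partial results
C = C1 + C2
assert torch.allclose(C, A @ B, atol=1e-6)
```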
Why both methods matter
Here is the thing: you alternate between them.
In a neural network, you stack multiple layers, and the output of one layer becomes the input to the next. If you alternate column and row parallelism, the output format of one layer matches the input format of the next layer, so you avoid extra communication in between.
Real stuff
Let's see how this actually works in the transformer architecture.
We are going to parallelize 2 main components:
- Multi Head Attention
- MLP (feed forward network)
1. Multi Head Attention
I feel it is almost perfect for tensor parallelism. It's like it was designed for this.
Standard Attention Mechanism:
- Q = X @ Wq
- K = X @ Wk
- V = X @ Wv
Attention = softmax(QK^T / sqrt(d_k)) @ V
Output = Attention @ Wo
For MHA with h heads: d_model = h * d_head
for each head i:
- Qi = X @ Wq_i
- Ki = X @ Wk_i
- Vi = X @ Wv_i
- head_i = softmax(Qi @ Ki^T / sqrt(d_head)) @ Vi
Concatenate all heads:
MH = Concat(head_1, head_2, ..., head_h)
Notice something? Each head is computed completely independently. Head 1 doesn't need to know what head 2 is doing.
So here's the idea: if you've got 8 heads and 4 GPUs, put 2 heads on each GPU. Each GPU computes its heads independently, then we concatenate results at the end.
After this, we split the Wo matrix row-wise and distribute it across GPUs. Each GPU computes:
- partial_output_i = MH_i @ Wo_i
Each GPU produces a tensor of the full output size, so we sum them all:
- final_output = AllReduce(partial_output_0, partial_output_1, ...), one partial per GPU
Input X: (batch, seq, 12288) [replicated on all GPUs]
↓
QKV Projections (column-parallel, no communication)
↓
Q_i, K_i, V_i on each GPU: (batch, seq, 1536)
↓
Attention Computation (local, no communication)
↓
attn_output_i on each GPU: (batch, seq, 1536)
↓
Output Projection (row-parallel)
↓
partial_output_i on each GPU: (batch, seq, 12288)
↓
All-Reduce (SUM all partial outputs)
↓
Output: (batch, seq, 12288) [replicated on all GPUs]
Communication cost: 1 all-reduce of size (batch * seq * d_model) after the output projection.
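Here's a minimal single-process sketch of that flow, assuming PyTorch, with a Python loop standing in for the GPUs and toy dimensions instead of 12288 (real implementations do the all-reduce with torch.distributed; this just checks the math):

```python
import torch

torch.manual_seed(0)
batch, seq, d_model, n_heads, n_gpus = 2, 4, 32, 8, 4
d_head = d_model // n_heads
heads_per_gpu = n_heads // n_gpus

X = torch.randn(batch, seq, d_model)            # input, replicated on all "GPUs"
Wq, Wk, Wv, Wo = [torch.randn(d_model, d_model) / d_model**0.5 for _ in range(4)]

def split_heads(t, n):
    # (batch, seq, n * d_head) -> (batch, n, seq, d_head)
    return t.view(batch, seq, n, d_head).transpose(1, 2)

def merge_heads(t):
    # (batch, n, seq, d_head) -> (batch, seq, n * d_head)
    return t.transpose(1, 2).reshape(batch, seq, -1)

def attention(q, k, v):
    return ((q @ k.transpose(-2, -1)) / d_head**0.5).softmax(dim=-1) @ v

# Column-parallel QKV, local attention, row-parallel output projection.
partials = []
for g in range(n_gpus):
    cols = slice(g * heads_per_gpu * d_head, (g + 1) * heads_per_gpu * d_head)
    # this "GPU" holds only its column slice of Wq/Wk/Wv and row slice of Wo
    q = split_heads(X @ Wq[:, cols], heads_per_gpu)
    k = split_heads(X @ Wk[:, cols], heads_per_gpu)
    v = split_heads(X @ Wv[:, cols], heads_per_gpu)
    local = merge_heads(attention(q, k, v))     # (batch, seq, heads_per_gpu * d_head)
    partials.append(local @ Wo[cols, :])        # full-size partial output

output = torch.stack(partials).sum(dim=0)       # the all-reduce (sum)

# sanity check against the unparallelized computation
mh = merge_heads(attention(split_heads(X @ Wq, n_heads),
                           split_heads(X @ Wk, n_heads),
                           split_heads(X @ Wv, n_heads)))
assert torch.allclose(output, mh @ Wo, atol=1e-5)
```

The only cross-GPU communication in this forward pass is the final sum over the partial outputs.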
2. MLP block
This feels even simpler
Normally
- Hidden = GELU(X @ W1 + b1) #expansion: d_model -> 4*d_model
- Output = Hidden @ W2 + b2 #contraction: 4*d_model -> d_model
Let's say:
d_model = 12288
d_ff (feedforward dimension) = 4 * 12288 = 49152
N = 8 GPUs
STEP 1: First linear layer (column parallelism)
Split W1 (12288, 49152) column-wise:
- GPU 0 gets: W1_0: (12288, 6144)
- GPU 1 gets: W1_1: (12288, 6144)
- ...
- GPU 7 gets: W1_7: (12288, 6144)
each GPU computes:
- hidden_i = X @ W1_i + b1_i
Input X is duplicated; each GPU gets a slice of the expanded hidden dimension. No communication needed.
STEP 2: Activation function
hidden_i = GELU(hidden_i)
No communication needed again.
STEP 3: Second linear layer (row parallelism)
Split W2 (49152, 12288) row-wise:
- GPU 0 gets: W2_0: (6144, 12288)
- GPU 1 gets: W2_1: (6144, 12288)
- ...
- GPU 7 gets: W2_7: (6144, 12288)
each GPU computes:
- partial_output_i = hidden_i @ W2_i
then all-reduce:
- final_output = AllReduce(partial_output_0, ..., partial_output_7) + b2
Input X: (batch, seq, 12288) [replicated on all GPUs]
↓
Linear1 (column-parallel, no communication)
↓
hidden_i on each GPU: (batch, seq, 6144)
↓
GELU Activation (local, no communication)
↓
hidden_i on each GPU: (batch, seq, 6144)
↓
Linear2 (row-parallel)
↓
partial_output_i on each GPU: (batch, seq, 12288)
↓
All-Reduce (SUM all partial outputs)
↓
Output: (batch, seq, 12288) [replicated on all GPUs]
Communication cost: 1 all-reduce of size (batch * seq * d_model) after the second linear layer.
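Same flavor of sketch for the MLP block, again assuming PyTorch, simulating the 8 GPUs in one process and shrinking 12288/49152 down to toy sizes so it runs instantly:

```python
import torch

torch.manual_seed(0)
batch, seq, d_model, n_gpus = 2, 4, 32, 8
d_ff = 4 * d_model
shard = d_ff // n_gpus

X = torch.randn(batch, seq, d_model)            # input, replicated on all "GPUs"
W1, b1 = torch.randn(d_model, d_ff) / d_model**0.5, torch.randn(d_ff)
W2, b2 = torch.randn(d_ff, d_model) / d_ff**0.5, torch.randn(d_model)

partials = []
for g in range(n_gpus):
    cols = slice(g * shard, (g + 1) * shard)
    # STEP 1: column-parallel first linear (this "GPU" holds W1[:, cols] and b1[cols])
    hidden_g = X @ W1[:, cols] + b1[cols]
    # STEP 2: GELU is element-wise, so it stays local
    hidden_g = torch.nn.functional.gelu(hidden_g)
    # STEP 3: row-parallel second linear -> full-size partial output
    partials.append(hidden_g @ W2[cols, :])

# all-reduce (sum) the partials, then add b2 once
output = torch.stack(partials).sum(dim=0) + b2

# sanity check against the unparallelized MLP
reference = torch.nn.functional.gelu(X @ W1 + b1) @ W2 + b2
assert torch.allclose(output, reference, atol=1e-5)
```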
The Backward pass:
I should mention the backward pass too... because that's where the other all-reduces come in.
Column Parallel Backward:
- Forward: no communication
- Backward: all-reduce on input gradients
But why?
In the forward pass, each GPU computed a part of the output using the full input. In the backward pass, each GPU only has gradients for its part of the output, so to get the gradient for the full input we need to sum the contributions from all GPUs.
Row Parallel Backward:
- Forward: all-reduce on outputs
- Backward: no communication
It makes sense, right? In the forward pass we already summed the partial outputs. In the backward pass, gradients flow back through that sum, and the gradient of a sum just passes through unchanged to each term, so each GPU already has the gradient for its partial output. No communication needed.
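To see the column-parallel case concretely, here's a tiny autograd sketch (my own toy example, assuming PyTorch, with two weight shards standing in for two GPUs): each shard only sees the gradient contribution from its slice of the output, and summing those contributions recovers the full input gradient, which is exactly what the backward all-reduce does.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)
W = torch.randn(8, 6)
W1, W2 = W.chunk(2, dim=1)            # column-parallel shards on "GPU 0" and "GPU 1"

# full (unparallelized) input gradient for a dummy loss: sum of all outputs
X_full = X.clone().requires_grad_(True)
(X_full @ W).sum().backward()

# per-shard input gradients: each "GPU" only backprops through its output slice
shard_grads = []
for W_shard in (W1, W2):
    X_local = X.clone().requires_grad_(True)
    (X_local @ W_shard).sum().backward()
    shard_grads.append(X_local.grad)

# the backward all-reduce: summing per-shard input gradients gives the full gradient
assert torch.allclose(sum(shard_grads), X_full.grad, atol=1e-6)
```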