Tensor Parallelism: Sometimes It's Not the Compute but the Memory
Whenever someone explains how AI models work, it always comes down to this: 'matrix multiplications'. But here's the thing: what happens when the matrices are just too big to fit in a single machine's memory, whether for training or inference?
Let's talk some real numbers
Say we have a model with 100 billion parameters. Breaking things down: we're using FP16/BF16 (mixed-precision training), which is pretty standard these days.
Model weights:
- 100B * 2 bytes (FP16) = 200GB
Gradients:
- 100B * 2 bytes (FP16) = 200GB
Optimizer states (Adam, let's say):
This is where it gets heavy. The Adam optimizer doesn't just store your parameters; it keeps 2 additional states per parameter:
- Momentum
- Variance
Both stored in FP32 for stability
- 100B * 8 bytes (2 FP32 states) = 800GB
Total so far ~1.2TB
But wait, there's more!
During forward and backward passes, you need to store intermediate values (activations).
The memory required depends on:
- Batch size
- Sequence length
- Model architecture
- Whether you are using gradient checkpointing
For a 100B model with a reasonable batch size and sequence length, that's roughly 200-400GB of activation memory.
So total VRAM needed: 1.4-1.6TB.
Now, speaking realistically, a single GPU has 80-90GB of VRAM (A100 80GB, H100 80GB), so to train this 100B model we would need at least 20-25 GPUs. How do we distribute and coordinate the work?
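If you want to sanity-check that arithmetic, here's a quick back-of-the-envelope sketch in Python (the activation range is the rough assumption from above, not an exact number):

```python
# Rough memory budget for training a 100B-parameter model with Adam in mixed precision.
params = 100e9

weights_gb   = params * 2 / 1e9      # FP16 weights: 2 bytes each
grads_gb     = params * 2 / 1e9      # FP16 gradients: 2 bytes each
optimizer_gb = params * 8 / 1e9      # Adam momentum + variance in FP32: 4 + 4 bytes each
activations_gb = (200, 400)          # rough range; depends on batch, seq length, checkpointing

static_gb = weights_gb + grads_gb + optimizer_gb
print(f"weights {weights_gb:.0f} GB, grads {grads_gb:.0f} GB, optimizer {optimizer_gb:.0f} GB")
print(f"total: {static_gb + activations_gb[0]:.0f}-{static_gb + activations_gb[1]:.0f} GB")
print(f"80GB GPUs needed: ~{(static_gb + activations_gb[1]) / 80:.0f}")
```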
The Other Solutions (That don't really work)
Before we get into tensor parallelism, let's quickly look at the 'obvious' solutions.
Data Parallelism
Replicate the entire model on each GPU, but split the data batch across GPUs. Each GPU processes different examples, then you average the gradients.
You can see the problem here: you still need the full model on each GPU. If your model doesn't fit on 1 GPU, this doesn't help at all! Data parallelism only speeds things up when the model already fits.
Pipeline Parallelism
Split the model by layers: put layers 1-25 on GPU 0, layers 26-50 on GPU 1, and so on.
There are a couple of problems here: each GPU still needs to fit entire layers in memory, and GPUs spend a lot of time sitting idle, because each GPU's input depends on the previous GPU's output.
Enter: Tensor Parallelism
This doesn't split the model by layers. It splits individual layers across multiple GPUs.
As I mentioned earlier, it's just 'matrix multiplication'. But how? How do we distribute a single matrix multiplication across different GPUs?
Keeping this simple, we have
C = A * B
- A has shape (m, n)
- B has shape (n, p)
- C will be the output and is (m, p)
On a single machine it is straightforward: load A and B into memory, multiply them, get C. Done.
But now let's say we have 2 GPUs and we want to split the work between them. How do we split it?
We have 2 ways to do that:
- Column wise tensor parallelism
- Row wise tensor parallelism
Let's take 2 simple matrices and see how it works.

Matrices A and B
1. Column wise tensor parallelism
Split matrix B into 2 parts column-wise and duplicate matrix A on both GPUs. Each GPU computes its part independently.
- GPU0: C1 = A * B1 -> shape (2, 1)
- GPU1: C2 = A * B2 -> shape (2, 1)
At last, we concatenate the results C1 and C2 and get matrix C. This is called an 'all-gather' operation: we are not summing the values, just gathering (concatenating) the results from different GPUs.

Column Parallelism
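Here's a minimal sketch of the column-wise split, assuming PyTorch and using tensor chunks to stand in for the two GPUs:

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 2)          # input, replicated on both "GPUs"
B = torch.randn(2, 2)          # weight, split column-wise

B1, B2 = B.chunk(2, dim=1)     # each "GPU" holds half of B's columns

C1 = A @ B1                    # GPU 0: shape (2, 1)
C2 = A @ B2                    # GPU 1: shape (2, 1)

# all-gather: concatenate the partial results along the column dimension
C = torch.cat([C1, C2], dim=1)
assert torch.allclose(C, A @ B, atol=1e-6)
```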
2. Row wise tensor parallelism
Split matrix B row-wise into 2 parts. To keep the matrix multiplication valid, we must also split the input matrix A into 2 parts by its columns.
- GPU0: C1 = A1 × B1 -> shape (2, 2)
- GPU1: C2 = A2 × B2 -> shape (2, 2)
Now C will be C1 + C2 (element-wise sum).
This is called an 'all-reduce' operation: here we do an element-wise sum (reduce) of the values from different GPUs.

Row Parallelism
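And the same kind of sketch for the row-wise split, again simulating the two GPUs with chunks of the matrices:

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 2)
B = torch.randn(2, 2)

A1, A2 = A.chunk(2, dim=1)     # split A by columns
B1, B2 = B.chunk(2, dim=0)     # split B by rows

C1 = A1 @ B1                   # GPU 0: shape (2, 2)
C2 = A2 @ B2                   # GPU 1: shape (2, 2)

# all-reduce: element-wise sum of the partial results
C = C1 + C2
assert torch.allclose(C, A @ B, atol=1e-6)
```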
Why both methods matter
Here is the thing: you alternate between them.
In a neural network, you stack multiple layers, and the output of one layer becomes the input to the next. If you alternate column and row parallelism, the output format of one layer matches the input format of the next layer, so you avoid extra communication in between.
Real stuff
Let's see how this actually works in the transformer architecture.
We are going to parallelize 2 main components:
- Multi Head Attention
- MLP (feed forward network)
1. Multi Head Attention
I feel it is almost perfect for tensor parallelism. It's like it was designed for this.
Standard Attention Mechanism:
- Q = X @ Wq
- K = X @ Wk
- V = X @ Wv
Attention = softmax(QK^T / sqrt(d_k)) @ V
Output = Attention @ Wo
For MHA with h heads: d_model = h * d_head
for each head i:
- Qi = X @ Wq_i
- Ki = X @ Wk_i
- Vi = X @ Wv_i
- head_i = softmax(Qi @ Ki^T / sqrt(d_head)) @ Vi
Concatenate all heads:
MH = Concat(head_1, head_2, ..., head_h)
Notice something? Each head is computed completely independently. Head 1 doesn't need to know what head 2 is doing.
So here's the idea: if you've got 8 heads and 4 GPUs, put 2 heads on each GPU. Each GPU computes its heads independently, then we concatenate results at the end.
After this, we split the Wo matrix row-wise and distribute it across GPUs. Each GPU computes:
- partial_output_i = MH_i @ Wo_i
Each GPU produces a tensor of the full output size, so we sum them all:
- final_output = AllReduce(partial_output_0, partial_output_1, ...), one partial per GPU
Input X: (batch, seq, 12288) [replicated on all GPUs]
↓
QKV Projections (column-parallel, no communication)
↓
Q_i, K_i, V_i on each GPU: (batch, seq, 1536)
↓
Attention Computation (local, no communication)
↓
attn_output_i on each GPU: (batch, seq, 1536)
↓
Output Projection (row-parallel)
↓
partial_output_i on each GPU: (batch, seq, 12288)
↓
All-Reduce (SUM all partial outputs)
↓
Output: (batch, seq, 12288) [replicated on all GPUs]
Communication cost: 1 all-reduce of size (batch * seq * d_model) after the output projection.
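Here's a minimal single-process sketch of that flow, assuming PyTorch, with a Python loop standing in for the GPUs and toy dimensions instead of 12288 (real implementations do the all-reduce with torch.distributed; this just checks the math):

```python
import torch

torch.manual_seed(0)
batch, seq, d_model, n_heads, n_gpus = 2, 4, 32, 8, 4
d_head = d_model // n_heads
heads_per_gpu = n_heads // n_gpus

X = torch.randn(batch, seq, d_model)            # input, replicated on all "GPUs"
Wq, Wk, Wv, Wo = [torch.randn(d_model, d_model) / d_model**0.5 for _ in range(4)]

def split_heads(t, n):
    # (batch, seq, n * d_head) -> (batch, n, seq, d_head)
    return t.view(batch, seq, n, d_head).transpose(1, 2)

def merge_heads(t):
    # (batch, n, seq, d_head) -> (batch, seq, n * d_head)
    return t.transpose(1, 2).reshape(batch, seq, -1)

def attention(q, k, v):
    return ((q @ k.transpose(-2, -1)) / d_head**0.5).softmax(dim=-1) @ v

# Column-parallel QKV, local attention, row-parallel output projection.
partials = []
for g in range(n_gpus):
    cols = slice(g * heads_per_gpu * d_head, (g + 1) * heads_per_gpu * d_head)
    # this "GPU" holds only its column slice of Wq/Wk/Wv and row slice of Wo
    q = split_heads(X @ Wq[:, cols], heads_per_gpu)
    k = split_heads(X @ Wk[:, cols], heads_per_gpu)
    v = split_heads(X @ Wv[:, cols], heads_per_gpu)
    local = merge_heads(attention(q, k, v))     # (batch, seq, heads_per_gpu * d_head)
    partials.append(local @ Wo[cols, :])        # full-size partial output

output = torch.stack(partials).sum(dim=0)       # the all-reduce (sum)

# sanity check against the unparallelized computation
mh = merge_heads(attention(split_heads(X @ Wq, n_heads),
                           split_heads(X @ Wk, n_heads),
                           split_heads(X @ Wv, n_heads)))
assert torch.allclose(output, mh @ Wo, atol=1e-5)
```

The only cross-GPU communication in this forward pass is the final sum over the partial outputs.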
2. MLP block
This feels even simpler
Normally
- Hidden = GELU(X @ W1 + b1) #expansion: d_model -> 4*d_model
- Output = Hidden @ W2 + b2 #contraction: 4*d_model -> d_model
Let's say:
d_model = 12288
d_ff (feedforward dimension) = 4 * 12288 = 49152
N = 8 GPUs
STEP 1: First linear layer (column parallelism)
Split W1 (12288, 49152) column-wise:
- GPU 0 gets: W1_0: (12288, 6144)
- GPU 1 gets: W1_1: (12288, 6144)
- ...
- GPU 7 gets: W1_7: (12288, 6144)
each GPU computes:
- hidden_i = X @ W1_i + b1_i
Input X is duplicated; each GPU gets a slice of the expanded hidden dimension. No communication needed.
STEP 2: Activation function
hidden_i = GELU(hidden_i)
No communication needed again.
STEP 3: Second linear layer (row parallelism)
Split W2 (49152, 12288) row-wise:
- GPU 0 gets: W2_0: (6144, 12288)
- GPU 1 gets: W2_1: (6144, 12288)
- ...
- GPU 7 gets: W2_7: (6144, 12288)
each GPU computes:
- partial_output_i = hidden_i @ W2_i
then all-reduce:
- final_output = AllReduce(partial_output_0, ..., partial_output_7) + b2
Input X: (batch, seq, 12288) [replicated on all GPUs]
↓
Linear1 (column-parallel, no communication)
↓
hidden_i on each GPU: (batch, seq, 6144)
↓
GELU Activation (local, no communication)
↓
hidden_i on each GPU: (batch, seq, 6144)
↓
Linear2 (row-parallel)
↓
partial_output_i on each GPU: (batch, seq, 12288)
↓
All-Reduce (SUM all partial outputs)
↓
Output: (batch, seq, 12288) [replicated on all GPUs]
Communication cost: 1 all-reduce of size (batch * seq * d_model) after the second linear layer.
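Same flavor of sketch for the MLP block, again assuming PyTorch, simulating the 8 GPUs in one process and shrinking 12288/49152 down to toy sizes so it runs instantly:

```python
import torch

torch.manual_seed(0)
batch, seq, d_model, n_gpus = 2, 4, 32, 8
d_ff = 4 * d_model
shard = d_ff // n_gpus

X = torch.randn(batch, seq, d_model)            # input, replicated on all "GPUs"
W1, b1 = torch.randn(d_model, d_ff) / d_model**0.5, torch.randn(d_ff)
W2, b2 = torch.randn(d_ff, d_model) / d_ff**0.5, torch.randn(d_model)

partials = []
for g in range(n_gpus):
    cols = slice(g * shard, (g + 1) * shard)
    # STEP 1: column-parallel first linear (this "GPU" holds W1[:, cols] and b1[cols])
    hidden_g = X @ W1[:, cols] + b1[cols]
    # STEP 2: GELU is element-wise, so it stays local
    hidden_g = torch.nn.functional.gelu(hidden_g)
    # STEP 3: row-parallel second linear -> full-size partial output
    partials.append(hidden_g @ W2[cols, :])

# all-reduce (sum) the partials, then add b2 once
output = torch.stack(partials).sum(dim=0) + b2

# sanity check against the unparallelized MLP
reference = torch.nn.functional.gelu(X @ W1 + b1) @ W2 + b2
assert torch.allclose(output, reference, atol=1e-5)
```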
The Backward pass:
I should mention the backward pass too... because that's where the other all-reduces come in.
Column Parallel Backward:
- Forward: no communication
- Backward: all-reduce on input gradients
But why?
In the forward pass, each GPU computed a part of the output using the full input. In the backward pass, each GPU only has gradients for its part of the output, so to get the gradient for the full input we need to sum the contributions from all GPUs.
Row Parallel Backward:
- Forward: all-reduce on outputs
- Backward: no communication
It makes sense, right? In the forward pass we already summed the partial outputs. In the backward pass, gradients flow back through that sum, and the gradient of a sum just passes through unchanged to each term, so each GPU already has the gradient for its partial output. No communication needed.
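To see the column-parallel case concretely, here's a tiny autograd sketch (my own toy example, assuming PyTorch, with two weight shards standing in for two GPUs): each shard only sees the gradient contribution from its slice of the output, and summing those contributions recovers the full input gradient, which is exactly what the backward all-reduce does.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)
W = torch.randn(8, 6)
W1, W2 = W.chunk(2, dim=1)            # column-parallel shards on "GPU 0" and "GPU 1"

# full (unparallelized) input gradient for a dummy loss: sum of all outputs
X_full = X.clone().requires_grad_(True)
(X_full @ W).sum().backward()

# per-shard input gradients: each "GPU" only backprops through its output slice
shard_grads = []
for W_shard in (W1, W2):
    X_local = X.clone().requires_grad_(True)
    (X_local @ W_shard).sum().backward()
    shard_grads.append(X_local.grad)

# the backward all-reduce: summing per-shard input gradients gives the full gradient
assert torch.allclose(sum(shard_grads), X_full.grad, atol=1e-6)
```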