Chapter 7: Pipeline and Expert Parallelism
“When a single layer is too big for one GPU, we split inside the layer (tensor parallelism). When the full stack of layers is too big, we split the model by layers (pipeline parallelism).”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain how pipeline parallelism splits models across GPUs
- Implement 1F1B scheduling for efficient pipeline execution
- Understand Mixture-of-Experts (MoE) and Expert Parallelism
- Calculate the optimal parallelism strategy for a given model
Prerequisites
- Completed Chapters 5-6 (Data and Tensor Parallelism)
- Understanding of transformer architecture
- Basic knowledge of GPU memory hierarchy
Concept Overview
Pipeline Parallelism: Splitting by Layers
While tensor parallelism splits individual layers horizontally, pipeline parallelism splits the model vertically—each GPU holds a contiguous group of layers.
Full Model: [Embed] [Layer 0-5] [Layer 6-11] [Layer 12-17] [Layer 18-23] [Head]
               ↓         ↓           ↓             ↓             ↓         ↓
Pipeline:   ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
            │ GPU 0 │ │ GPU 1 │ │ GPU 2 │ │ GPU 3 │ │ GPU 4 │ │ GPU 5 │
            │Stage 0│ │Stage 1│ │Stage 2│ │Stage 3│ │Stage 4│ │Stage 5│
            └───────┘ └───────┘ └───────┘ └───────┘ └───────┘ └───────┘
Communication: Activations flow forward, gradients flow backward (point-to-point send/recv).
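A minimal sketch of this stage-to-stage hand-off in the forward direction, using torch.distributed point-to-point calls. The backend, shapes, and the Linear stand-in for “a stage's layers” are illustrative assumptions, and the matching gradient send/recv for the backward pass is omitted:

```python
# Sketch: pass activations from one pipeline stage to the next.
# Launch with: torchrun --nproc_per_node=2 pp_p2p_sketch.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")      # "nccl" on GPUs
    rank, world = dist.get_rank(), dist.get_world_size()

    hidden = 16
    stage = torch.nn.Linear(hidden, hidden)      # stand-in for this stage's layers
    x = torch.zeros(4, hidden)                   # one microbatch of activations

    if rank == 0:
        x = torch.randn(4, hidden)               # first stage reads the input batch
    else:
        dist.recv(x, src=rank - 1)               # receive activations from previous stage

    y = stage(x)                                 # run this stage's layers

    if rank < world - 1:
        dist.send(y.detach(), dst=rank + 1)      # hand activations to the next stage
    else:
        print("last stage output:", y.shape)     # a full loop would also send grads back

if __name__ == "__main__":
    main()
```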
The Pipeline Bubble Problem
Naive pipeline execution has a fatal flaw: bubbles.
Time →
GPU 0: [F0] [F1] [F2] [F3]  ·    ·    ·    ·    ·    ·   [B3] [B2] [B1] [B0]
GPU 1:  ·   [F0] [F1] [F2] [F3]  ·    ·    ·    ·   [B3] [B2] [B1] [B0]  ·
GPU 2:  ·    ·   [F0] [F1] [F2] [F3]  ·    ·   [B3] [B2] [B1] [B0]  ·    ·
GPU 3:  ·    ·    ·   [F0] [F1] [F2] [F3] [B3] [B2] [B1] [B0]  ·    ·    ·
(· = bubble: the GPU sits idle)
Bubble fraction = (P-1) / (M + P - 1), where P = pipeline stages, M = microbatches.
For P=4, M=4: Bubble = 3/7 = 43% wasted time!
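That formula is easy to sanity-check in code; a tiny hypothetical helper:

```python
def bubble_fraction(P: int, M: int) -> float:
    """Idle fraction of a fill-drain pipeline: (P - 1) / (M + P - 1)."""
    return (P - 1) / (M + P - 1)

for M in (4, 8, 32):
    print(f"P=4, M={M:>2}: bubble = {bubble_fraction(4, M):.1%}")
# P=4, M= 4: bubble = 42.9%
# P=4, M= 8: bubble = 27.3%
# P=4, M=32: bubble = 8.6%
```

The cure for bubbles is therefore more microbatches per pipeline flush, which is exactly where the memory pressure addressed by 1F1B comes from.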
1F1B Scheduling: The Solution
1F1B (One Forward, One Backward) interleaves forward and backward passes: after a short warm-up, each stage alternates one forward with one backward. For the same M the bubble fraction is unchanged, but each stage keeps far fewer activations in flight, which is what makes large M (and therefore small bubbles) affordable:
Time →
GPU 0: [F0] [F1] [F2] [F3]  ·    ·    ·   [B0]  ·   [B1]  ·   [B2]  ·   [B3]
GPU 1:  ·   [F0] [F1] [F2]  ·    ·   [B0] [F3] [B1]  ·   [B2]  ·   [B3]  ·
GPU 2:  ·    ·   [F0] [F1]  ·   [B0] [F2] [B1] [F3] [B2]  ·   [B3]  ·    ·
GPU 3:  ·    ·    ·   [F0] [B0] [F1] [B1] [F2] [B2] [F3] [B3]  ·    ·    ·
(Same 14 time slots and same bubble as the fill-drain schedule above, but each GPU frees a microbatch's activations as soon as its backward completes.)
Key insight: Once the pipeline is “full,” each GPU does one forward then one backward, so the number of microbatches whose activations it holds stays constant no matter how large M grows.
Memory in Pipeline Parallelism
Each GPU stores:
- Model parameters for its stages
- Activations from forward pass (needed for backward)
1F1B memory advantage: A stage holds activations for at most P in-flight microbatches (P for the first stage, fewer for later ones), not all M.
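To make the warm-up / steady-state / cool-down structure concrete, here is a small illustrative helper (not a real schedule implementation) that lists each stage's operation order and the peak number of microbatches whose activations it must keep around:

```python
def one_f_one_b(stage: int, P: int, M: int):
    """Per-stage operation order under non-interleaved 1F1B (each op takes one slot)."""
    warmup = min(P - stage - 1, M)        # forwards issued before the first backward
    ops = [f"F{i}" for i in range(warmup)]
    f, b = warmup, 0
    while b < M:                          # steady state: one forward, then one backward
        if f < M:
            ops.append(f"F{f}")
            f += 1
        ops.append(f"B{b}")               # once forwards run out, this is the cool-down
        b += 1
    peak_in_flight = min(P - stage, M)    # microbatches whose activations are live at once
    return ops, peak_in_flight

P, M = 4, 8
for s in range(P):
    ops, peak = one_f_one_b(s, P, M)
    print(f"stage {s}: {' '.join(ops)}   (peak in-flight activations: {peak})")
```

Earlier stages hold more in-flight microbatches (up to P for stage 0), which is still far better than holding all M as the fill-drain schedule does.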
Mixture of Experts (MoE)
MoE replaces the standard FFN with multiple “expert” FFNs:
Standard FFN:
Input → FFN → Output
MoE FFN:
                  ┌─→ Expert 0 ─┐
  Input → Router ─┼─→ Expert 1 ─┼─→ Weighted Sum → Output
                  ├─→ Expert 2 ─┤
                  └─→ Expert 3 ─┘
The router (a small neural network) decides which experts process each token. Typically, only top-K experts (K=1 or 2) are activated per token.
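A minimal top-K router sketch in PyTorch; the sizes and the renormalization over the selected experts are illustrative choices, not any particular model's recipe:

```python
import torch
import torch.nn.functional as F

def topk_route(x, w_router, k=2):
    """x: [tokens, hidden], w_router: [hidden, num_experts].
    Returns expert indices [tokens, k] and routing weights [tokens, k]."""
    logits = x @ w_router                                # score every expert per token
    probs = F.softmax(logits, dim=-1)
    weights, experts = torch.topk(probs, k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)    # renormalize over the k winners
    return experts, weights

tokens, hidden, num_experts = 8, 16, 4
x = torch.randn(tokens, hidden)
w_router = torch.randn(hidden, num_experts)
experts, weights = topk_route(x, w_router, k=2)
# Token t's output = sum_j weights[t, j] * Expert_{experts[t, j]}(x[t])
print(experts)
print(weights)
```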
Why MoE?
- More parameters without more FLOPs
- Each token only activates a fraction of parameters
- DeepSeek-V3: 671B parameters but only 37B activated per token!
Expert Parallelism (EP)
With 64+ experts, the expert weights quickly stop fitting on one GPU. Expert Parallelism distributes experts across GPUs:
                Token Routing
                      │
          ┌───────────┼───────────┐
          ▼           ▼           ▼
     ┌─────────┐ ┌─────────┐ ┌─────────┐
     │  GPU 0  │ │  GPU 1  │ │  GPU 2  │
     │ E0,E1,E2│ │ E3,E4,E5│ │ E6,E7,E8│
     └─────────┘ └─────────┘ └─────────┘
          │           │           │
          └───────────┴───────────┘
                 All-to-All
      (dispatch tokens, collect results)
Communication pattern: All-to-All. Each GPU sends its tokens to the GPUs hosting their selected experts (dispatch), and a second all-to-all brings the expert outputs back (combine).
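Before those all-to-alls can run, each rank has to work out how many of its local tokens go to every other rank. A single-process sketch of that bookkeeping (the contiguous expert-to-GPU placement and the sizes are assumptions for illustration):

```python
import torch

num_experts, num_gpus, k = 8, 4, 2               # assume 2 experts per GPU
experts_per_gpu = num_experts // num_gpus

# Pretend the router picked k experts for each of 16 local tokens.
expert_ids = torch.randint(num_experts, (16, k))

# Which GPU hosts each selected expert (contiguous placement assumed).
dest_gpu = expert_ids // experts_per_gpu          # [tokens, k]

# send_counts[g] = how many (token, expert) pairs this rank ships to GPU g.
send_counts = torch.bincount(dest_gpu.flatten(), minlength=num_gpus)
print("token copies sent to each GPU:", send_counts.tolist())

# In a real EP layer these counts become the split sizes for the dispatch
# all-to-all (e.g. torch.distributed.all_to_all_single), and a second
# all-to-all with the reversed counts brings the expert outputs back.
```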
EP vs TP: A Critical Comparison
For MoE models, EP is often better than TP:
| Aspect | Tensor Parallelism | Expert Parallelism |
|---|---|---|
| What’s split | Each expert matrix | Whole experts |
| Communication | 2 all-reduces per layer | 2 all-to-alls per layer |
| Volume (per GPU, per layer) | 2 × batch × seq × hidden | 2 × k × batch × seq × hidden / N |
| Compute efficiency | Low (small GEMMs) | High (full expert GEMMs) |
Key insight: TP slices already small expert matrices, making GEMMs inefficient. EP keeps expert matrices whole.
Communication Volume Deep Dive
Let S = batch × seq × hidden, the activation volume of the full batch.
For TP with degree T on an MoE layer, every GPU processes the full batch, and the layer needs 2 all-reduces over activations of size S:
Volume ≈ 2S per GPU per layer
For EP across a group of N GPUs (each holding its share of the experts and 1/N of the tokens), dispatch and combine are 2 all-to-alls, each sending the local tokens' activations to at most k destinations:
Volume ≈ 2kS/N per GPU per layer
When k << N (sparse activation), EP wins on communication too!
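Plugging these simplified formulas into code (a hypothetical helper that, like the text above, counts each collective as one pass over the activations it touches):

```python
def tp_volume(tokens, hidden):
    """Per-GPU, per-layer traffic under TP: two all-reduces over the full batch."""
    return 2 * tokens * hidden

def ep_volume(tokens, hidden, k, ep_size):
    """Per-GPU, per-layer traffic under EP: two all-to-alls over the local shard,
    with each token dispatched to k experts."""
    return 2 * k * tokens * hidden / ep_size

tokens, hidden = 8192, 8192                  # illustrative sizes
tp = tp_volume(tokens, hidden)
ep = ep_volume(tokens, hidden, k=2, ep_size=8)
print(f"TP: {tp / 1e6:.0f}M activation elements moved")
print(f"EP: {ep / 1e6:.0f}M activation elements moved ({tp / ep:.0f}x less)")
```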
Combining Parallelisms: The 3D Approach
Real large-model training uses multiple parallelism strategies:
┌─────────────────────────────────────────────────────────────┐
│ DATA PARALLELISM │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ PIPELINE PARALLELISM │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │TENSOR PARA- │ │TENSOR PARA- │ │TENSOR PARA- │ │ │
│ │ │LLELISM │ │LLELISM │ │LLELISM │ │ │
│ │ │ (Stage 0) │→│ (Stage 1) │→│ (Stage 2) │ │ │
│ │ │ 8 GPUs │ │ 8 GPUs │ │ 8 GPUs │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └──────────────────────────────────────────────────────┘ │
│ (Replica 0) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ... more replicas ... │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Rule of thumb:
- TP: Within a node (needs NVLink)
- PP: Across nodes (lower bandwidth OK)
- DP: Outermost; scales the widest, limited mainly by gradient sync cost and global batch size (see the sketch below)
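A small sketch of how those rules turn into a concrete layout: enumerate the (TP, PP, DP) factorizations of a cluster, keeping TP inside one node. The node size, layer count, and candidate degrees below are illustrative assumptions:

```python
def layouts(total_gpus=64, gpus_per_node=8, num_layers=80):
    """Yield (tp, pp, dp) splits that respect the rules of thumb above."""
    for tp in (1, 2, 4, 8):                      # TP must fit inside one node
        if tp > gpus_per_node or total_gpus % tp:
            continue
        for pp in (1, 2, 4, 8):                  # PP stages may span nodes
            if (total_gpus // tp) % pp or num_layers % pp:
                continue
            dp = total_gpus // (tp * pp)         # whatever is left becomes DP replicas
            yield tp, pp, dp

for tp, pp, dp in layouts():
    print(f"TP={tp}  PP={pp}  DP={dp}  (TP x PP x DP = {tp * pp * dp})")
```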
Code Walkthrough
Script 1: pipeline_schedule_viz.py
Visualizes different pipeline schedules:
- Naive (fill-drain)
- 1F1B
- Interleaved 1F1B
Shows bubble ratios and memory usage.
Script 2: parallel_strategy_calculator.py
Given model specs and hardware, calculates:
- Memory per GPU for each parallelism strategy
- Communication volume
- Recommended configuration
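The heart of such a calculator is a memory estimate along these lines (a rough sketch, not the actual script, assuming mixed-precision Adam at roughly 16 bytes per parameter and ignoring activations and fragmentation):

```python
def weights_and_optimizer_gb(params_b, tp=1, pp=1, bytes_per_param=16):
    """Approximate per-GPU memory for parameters, gradients, and Adam state.
    params_b: model size in billions of parameters; TP and PP both shard weights."""
    total_bytes = params_b * 1e9 * bytes_per_param
    return total_bytes / (tp * pp) / 1e9

for tp, pp in [(1, 1), (8, 1), (8, 4)]:
    gb = weights_and_optimizer_gb(params_b=70, tp=tp, pp=pp)
    print(f"70B model, TP={tp}, PP={pp}: ~{gb:,.0f} GB per GPU before activations")
```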
Try It Yourself
Exercise 1: Calculate Bubble Fraction
For a 4-stage pipeline with 16 microbatches:
- What’s the bubble fraction with naive scheduling?
- What changes with 1F1B? Consider both the bubble fraction and peak activation memory.
Exercise 2: MoE Communication Analysis
For an MoE layer with:
- 64 experts
- top-2 routing (k=2)
- 4096 hidden dimension
- Batch of 4 × 1024 tokens
Calculate communication volume for:
- 8-way TP (splitting each expert)
- 8-way EP (8 experts per GPU)
Exercise 3: Design a Parallelism Strategy
You have:
- 70B parameter dense model
- 64 H100 GPUs (8 per node)
- 80GB memory per GPU
Design a parallelism strategy. Consider:
- Model size: ~140GB in FP16
- Activations and gradients
- Communication patterns
Key Takeaways
- PP splits the model by layers - Point-to-point communication only
- Bubbles are the enemy - more microbatches shrink them, and 1F1B keeps the memory cost of doing so in check
- MoE = sparse activation - More parameters, same compute
- EP beats TP for MoE - Keeps expert matrices whole
- Combine strategies - Real systems use TP + PP + DP + EP
The Parallelism Decision Tree
Is one layer too big for one GPU?
├─ Yes → Use Tensor Parallelism (within node)
└─ No
└─ Is the full model too big for one GPU?
├─ Yes → Use Pipeline Parallelism (across nodes OK)
│ + Use Tensor Parallelism if layers are large
└─ No
└─ Use Data Parallelism (no model splitting needed)
Is the model MoE?
├─ Yes → Add Expert Parallelism (across nodes OK)
└─ No → Continue with above strategy
What’s Next?
In Part III, we’ll dive into LLM Inference Systems—how to efficiently serve models after training. This includes KV cache management, batching strategies, and speculative decoding.