Chapter 4: NCCL Algorithms and GPU Topology
“Understanding your hardware is half the battle. The other half is making NCCL do what you want.”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain how Ring and Tree algorithms work for all_reduce
- Inspect GPU topology and NVLink connections
- Understand why communication patterns matter for performance
- Choose the right NCCL settings for your hardware
Prerequisites
- Completed Chapters 1-3
- Understanding of all_reduce and collective operations
- Access to a machine with NVIDIA GPU (for hands-on topology inspection)
Concept Overview
Why Does the Algorithm Matter?
When you call dist.all_reduce(tensor), NCCL doesn’t just magically synchronize data. It runs a carefully designed algorithm that determines:
- Who sends to whom - The communication pattern
- What data flows - Partial aggregates vs full tensors
- How much bandwidth is used - Network saturation
- How long it takes - Latency characteristics
Different algorithms excel in different scenarios:
- Ring: Great bandwidth utilization, scales with data size
- Tree: Lower latency for small messages, scales better with node count
- Double Binary Tree: Best of both worlds for large clusters
The Ring Algorithm
Ring is the most intuitive collective algorithm. Picture GPUs arranged in a circle:

          ┌──────► GPU 1 ──────┐
          │                    │
          │                    ▼
        GPU 0                GPU 2
          ▲                    │
          │                    │
          └────── GPU 3 ◄──────┘
How Ring All-Reduce Works:
Phase 1: Scatter-Reduce (each GPU accumulates partial sums)
Step 1: GPU0 sends chunk0 to GPU1, GPU1 sends chunk1 to GPU2, ...
Step 2: Each recipient adds the received chunk to its own copy and forwards the partial sum
... (N-1 steps total; afterwards each GPU holds exactly one fully reduced chunk)
Phase 2: All-Gather (distribute the fully reduced chunks)
Step 1: Each GPU sends its fully reduced chunk to the next GPU, ...
... (N-1 steps total)
Ring Complexity:
- Total steps: 2(N-1) where N is number of GPUs
- Data per step: D/N where D is total data size
- Total data sent per GPU: 2D(N-1)/N ≈ 2D for large N
Ring’s Superpower: Bandwidth utilization is nearly 100%! Each GPU is always sending and receiving.
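To make the two phases concrete, here is a toy, single-process simulation of ring all-reduce on plain Python lists. It only mimics the chunk movement; it is not how NCCL is implemented, and the helper name `ring_all_reduce` exists only for this sketch.

```python
def ring_all_reduce(per_gpu):
    """Simulate ring all-reduce over N 'GPUs'. per_gpu: N equal-length lists."""
    n = len(per_gpu)
    size = len(per_gpu[0])
    assert size % n == 0, "keep it simple: data size must divide evenly"
    c = size // n
    # chunks[g][i] is GPU g's current copy of chunk i
    chunks = [[x[i * c:(i + 1) * c] for i in range(n)] for x in per_gpu]

    # Phase 1: scatter-reduce. At step s, GPU g sends chunk (g - s) % n to its
    # right neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(g, (g - s) % n, chunks[g][(g - s) % n]) for g in range(n)]
        for g, i, data in sends:
            dst = (g + 1) % n
            chunks[dst][i] = [a + b for a, b in zip(chunks[dst][i], data)]

    # After phase 1, GPU g holds the fully reduced chunk (g + 1) % n.
    # Phase 2: all-gather. Pass the finished chunks once around the ring.
    for s in range(n - 1):
        sends = [(g, (g + 1 - s) % n, chunks[g][(g + 1 - s) % n]) for g in range(n)]
        for g, i, data in sends:
            chunks[(g + 1) % n][i] = list(data)

    return [[v for chunk in gpu for v in chunk] for gpu in chunks]


if __name__ == "__main__":
    # 4 "GPUs", 8 elements each; every GPU should end up with the elementwise sum.
    print(ring_all_reduce([[rank] * 8 for rank in range(4)]))  # every value becomes 6
```

Counting the loop iterations recovers the complexity above: 2(N-1) steps, each moving D/N elements per GPU.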
The Tree Algorithm
For large clusters, Ring’s latency (O(N) steps) becomes problematic. Tree algorithms use a hierarchical structure:

                GPU 0 (root)
               /            \
          GPU 1              GPU 2
         /     \            /     \
     GPU 3    GPU 4     GPU 5    GPU 6
How Tree Reduce Works:
Step 1: Leaves (3,4,5,6) send to parents (1,2)
Step 2: Parents combine, send to root (0)
Step 3: Root has the final result and broadcasts it back down the tree (another log2(N) steps)
Tree Complexity:
- Total steps: 2 * log2(N) (reduce up + broadcast down)
- Much better latency for small messages
Tree’s Tradeoff: Lower bandwidth utilization. At any given step only one level of the tree is sending, so many GPUs and links sit idle.
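For intuition about the O(log N) step count, here is a toy sketch in the same spirit as the ring simulation. Note it uses a binomial-tree schedule rather than the exact binary tree NCCL builds, and the name `tree_all_reduce` exists only for this sketch.

```python
def tree_all_reduce(values):
    """Toy tree-based all-reduce over one number per simulated GPU.
    Returns (reduced values, number of communication rounds)."""
    n = len(values)
    vals = list(values)
    rounds = 0

    # Reduce phase: in each round, half of the remaining senders pass their
    # partial sum up to a parent (a binomial tree).
    stride = 1
    while stride < n:
        for child in range(stride, n, 2 * stride):
            vals[child - stride] += vals[child]   # parent absorbs child's partial sum
        stride *= 2
        rounds += 1

    # Broadcast phase: walk back down the same tree.
    stride //= 2
    while stride >= 1:
        for child in range(stride, n, 2 * stride):
            vals[child] = vals[child - stride]
        stride //= 2
        rounds += 1

    return vals, rounds


if __name__ == "__main__":
    result, rounds = tree_all_reduce(list(range(8)))
    print(result, rounds)   # eight copies of 28, in 2 * log2(8) = 6 rounds
```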
Double Binary Tree (for 24,000+ GPUs)
At scale (think training GPT-4-class models across thousands of GPUs), a single tree becomes a bandwidth bottleneck: leaves only send and the root only receives, so half the links sit idle. Double Binary Tree uses two complementary trees to keep all links busy:

    Tree A:                Tree B:
         0                      7
       /   \                  /   \
      1     2                6     5
     / \   / \              / \   / \
    3   4 5   6            0   1 2   3
A GPU that is an interior node in one tree is a leaf in the other, so every GPU both sends and receives and the load stays balanced.
NVLink: The Speed Demon
NVLink is NVIDIA’s high-bandwidth interconnect for GPU-to-GPU communication:
| Generation | Bandwidth per link (per direction) | Links per GPU | Aggregate per GPU (bidirectional) |
|---|---|---|---|
| NVLink 1.0 | 20 GB/s | 4 | 160 GB/s |
| NVLink 2.0 | 25 GB/s | 6 | 300 GB/s |
| NVLink 3.0 | 25 GB/s | 12 | 600 GB/s |
| NVLink 4.0 | 25 GB/s | 18 | 900 GB/s |
For comparison, PCIe 4.0 x16 tops out at roughly 32 GB/s in each direction.
On an NVLink 4.0 system such as the DGX H100, each GPU therefore has 900 GB/s of aggregate NVLink bandwidth (18 links × 25 GB/s × 2 directions), more than ten times what PCIe offers. This is why DGX systems are so fast for training.
GPU Topology: The Key to Understanding Performance
Not all GPU pairs are connected equally! Use nvidia-smi topo -m to see your topology:

            GPU0    GPU1    GPU2    GPU3    CPU Affinity
    GPU0     X      NV4     NV4     NV4     0-31
    GPU1    NV4      X      NV4     NV4     0-31
    GPU2    NV4     NV4      X      NV4     0-31
    GPU3    NV4     NV4     NV4      X      0-31
Legend:
- X: Self
- NV#: Connected by a bonded set of # NVLinks
- SYS: Traffic crosses the interconnect between NUMA nodes / CPU sockets, e.g. QPI/UPI (slowest)
- NODE: Same NUMA node, but traffic crosses the link between PCIe host bridges (no NVLink)
- PHB: Traffic crosses a PCIe host bridge (typically the CPU)
- PXB: Traffic crosses multiple PCIe bridges, but not a host bridge
- PIX: At most one PCIe bridge between the GPUs (same PCIe switch)
Rule of thumb: More NVLinks = faster. SYS = slow, avoid if possible.
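If you prefer to query this programmatically instead of reading the matrix, the NVML Python bindings expose per-link NVLink state. A minimal sketch, assuming the nvidia-ml-py package (`pip install nvidia-ml-py`) is installed:

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        active = 0
        for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
            try:
                if pynvml.nvmlDeviceGetNvLinkState(handle, link):
                    active += 1                  # this link is up
            except pynvml.NVMLError:
                continue                         # link index not supported on this GPU
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):              # older pynvml versions return bytes
            name = name.decode()
        print(f"GPU {i} ({name}): {active} active NVLink(s)")
finally:
    pynvml.nvmlShutdown()
```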
Code Walkthrough
Script 1: topology_inspector.py
This script inspects your GPU topology and reports:
- How many GPUs you have
- NVLink connections between GPUs
- PCIe topology
- NUMA affinity
It also suggests optimal process placement.
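The full script ships with the chapter's code. If you just want the gist, a minimal sketch might look like the following: it shells out to nvidia-smi topo -m and summarizes the GPU-to-GPU connection types. The simplified parsing and the helper name `get_topology_matrix` are assumptions for illustration; the exact output format varies across driver versions.

```python
import subprocess

def get_topology_matrix():
    """Return {(gpu_i, gpu_j): link_type} parsed from `nvidia-smi topo -m`."""
    out = subprocess.run(["nvidia-smi", "topo", "-m"],
                         capture_output=True, text=True, check=True).stdout
    # Keep only the GPU rows; the header and legend lines don't start with "GPU".
    rows = [line.split() for line in out.splitlines() if line.startswith("GPU")]
    links = {}
    for i, row in enumerate(rows):
        # Columns 1..N of a GPU row are the GPU-to-GPU cells (e.g. NV4, PIX, SYS).
        for j, cell in enumerate(row[1:len(rows) + 1]):
            if i != j:
                links[(i, j)] = cell
    return links

if __name__ == "__main__":
    links = get_topology_matrix()
    nvlink_pairs = {pair: t for pair, t in links.items() if t.startswith("NV")}
    pcie_only = {pair: t for pair, t in links.items() if not t.startswith("NV")}
    print("NVLink-connected pairs:", nvlink_pairs)
    print("PCIe/SYS-only pairs:   ", pcie_only)
```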
Script 2: benchmark_algorithms.py
This script benchmarks different NCCL algorithms on your hardware:
- Measures all_reduce throughput
- Compares Ring vs Tree
- Shows how performance scales with message size
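Again, the real script does more (algorithm sweeps via NCCL_ALGO, result logging), but the core measurement loop might look roughly like this sketch, launched with torchrun (e.g. `torchrun --nproc_per_node=4 benchmark_algorithms.py`):

```python
import os
import time
import torch
import torch.distributed as dist

def bench_all_reduce(numel, iters=20, warmup=5):
    """Average all_reduce time (seconds) for a float32 tensor of `numel` elements."""
    x = torch.randn(numel, device="cuda")
    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("nccl")
    for numel in (256, 256 * 1024, 25 * 1024 * 1024):   # ~1 KB, ~1 MB, ~100 MB of fp32
        secs = bench_all_reduce(numel)
        if dist.get_rank() == 0:
            size_gb = numel * 4 / 1e9
            print(f"{size_gb * 1e3:9.3f} MB -> {secs * 1e3:8.3f} ms "
                  f"({size_gb / secs:6.1f} GB/s algorithm bandwidth)")
    dist.destroy_process_group()
```

Run it once with `NCCL_ALGO=Ring` and once with `NCCL_ALGO=Tree` to compare the two algorithms across message sizes.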
NCCL Environment Variables
You can tune NCCL behavior with environment variables:
| Variable | Description | Default |
|---|---|---|
| NCCL_ALGO | Algorithm: Ring, Tree, CollNetChain | Auto |
| NCCL_PROTO | Protocol: Simple, LL, LL128 | Auto |
| NCCL_NTHREADS | CUDA threads per block | Auto |
| NCCL_DEBUG | Logging level: VERSION, WARN, INFO, TRACE | Unset (no output) |
| NCCL_DEBUG_SUBSYS | Subsystems to log | All |
Example: Force ring algorithm and show debug info:

    NCCL_ALGO=Ring NCCL_DEBUG=INFO python train.py
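You can also set these from Python, as long as you do it before the process group (and therefore the NCCL communicator) is created, since NCCL reads its environment at initialization:

```python
import os
import torch.distributed as dist

# Set NCCL variables before init_process_group(), or export them in the shell as above.
os.environ["NCCL_ALGO"] = "Ring"    # force the ring algorithm
os.environ["NCCL_DEBUG"] = "INFO"   # log which algorithm/protocol NCCL picks

dist.init_process_group(backend="nccl")
```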
Try It Yourself
Exercise 1: Inspect Your Topology
Run topology_inspector.py on a GPU machine and answer:
- How many NVLinks connect GPU 0 to GPU 1?
- Are any GPU pairs connected only via PCIe?
- What’s the CPU affinity for each GPU?
Exercise 2: Benchmark All-Reduce
Run benchmark_algorithms.py with different message sizes:
- 1 KB
- 1 MB
- 100 MB
When does Ring outperform Tree? When does Tree win?
Exercise 3: Measure the NVLink Advantage
If you have GPUs connected via NVLink AND PCIe:
- Run all_reduce between NVLink-connected GPUs
- Run all_reduce between PCIe-connected GPUs
- Calculate the speedup
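If you want a starting point, one approach is to build two-GPU subgroups with `dist.new_group()` and time `all_reduce` inside each. The GPU indices below are placeholders; pick the pairs from your own `nvidia-smi topo -m` output.

```python
import os
import time
import torch
import torch.distributed as dist

def time_pair_all_reduce(group, ranks, numel=25_000_000, iters=10):
    """Average all_reduce time within `group`; ranks outside the pair skip."""
    if dist.get_rank() not in ranks:
        return None
    x = torch.randn(numel, device="cuda")
    dist.all_reduce(x, group=group)            # warmup
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x, group=group)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("nccl")

    nvlink_pair = [0, 1]   # placeholder: an NVLink-connected pair on your machine
    pcie_pair = [0, 3]     # placeholder: a pair that only shares PCIe / SYS

    # new_group() must be called by every rank, even ranks outside the group.
    nv_group = dist.new_group(nvlink_pair)
    pcie_group = dist.new_group(pcie_pair)

    t_nv = time_pair_all_reduce(nv_group, nvlink_pair)
    t_pcie = time_pair_all_reduce(pcie_group, pcie_pair)
    if dist.get_rank() == 0:
        print(f"NVLink pair: {t_nv * 1e3:.2f} ms   PCIe pair: {t_pcie * 1e3:.2f} ms   "
              f"speedup: {t_pcie / t_nv:.1f}x")
    dist.destroy_process_group()
```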
Key Takeaways
- Ring excels at large messages - Nearly 100% bandwidth utilization
- Tree excels at low latency - O(log N) steps vs O(N)
- NVLink is crucial - 10x+ faster than PCIe
- Topology determines performance - Know your hardware!
- NCCL auto-selects - But you can override for specific cases
Performance Intuition
For a 1 GB all_reduce on 8 GPUs:
| Connection | Ring Bandwidth | Approximate Time |
|---|---|---|
| NVLink 4.0 (900 GB/s) | ~450 GB/s effective | ~2.2 ms |
| PCIe 4.0 x16 (32 GB/s) | ~16 GB/s effective | ~62 ms |
That’s a 28x difference just from interconnect!
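The arithmetic behind the table is just the ring formula from earlier: each GPU moves about 2D bytes for a D-byte tensor (exactly 2D(N-1)/N), so the effective bandwidth is roughly half the raw per-GPU link bandwidth.

```python
def ring_all_reduce_time_ms(data_gb, link_bandwidth_gbps):
    # Large-N approximation: each GPU moves ~2x the tensor, so the effective
    # bandwidth is about half the raw per-GPU link bandwidth.
    effective_gbps = link_bandwidth_gbps / 2        # ~450 GB/s or ~16 GB/s
    return data_gb / effective_gbps * 1e3

t_nvlink = ring_all_reduce_time_ms(1.0, 900)        # ~2.2 ms
t_pcie = ring_all_reduce_time_ms(1.0, 32)           # ~62.5 ms
print(f"NVLink: {t_nvlink:.1f} ms   PCIe: {t_pcie:.1f} ms   ratio: {t_pcie / t_nvlink:.0f}x")
```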
What’s Next?
In Part II, we’ll use these primitives to implement actual parallelism strategies:
- Chapter 5: Data Parallelism (DDP, FSDP, ZeRO)
- Chapter 6: Tensor Parallelism (splitting layers)
- Chapter 7: Pipeline Parallelism (splitting models)