ML Systems Infrastructure Tutorial
From distributed primitives to production RLHF: A hands-on journey through ML infrastructure
This tutorial takes you from zero to understanding how large-scale ML systems work. If you’re comfortable with PyTorch and understand transformers but wonder “how do people actually train GPT-4?”, this is for you.
Who This Is For
- Strong ML background: You know PyTorch, can train models, understand attention
- New to systems: You haven’t done distributed training, don’t know NCCL from TCP
- Curious about scale: You want to understand how 1000-GPU training actually works
What You’ll Learn
By the end of this tutorial, you’ll understand:
- How GPUs talk to each other - Communication primitives that enable distributed training
- How to parallelize training - Data, tensor, and pipeline parallelism strategies
- How inference servers work - KV cache, batching, and speculative decoding
- How RLHF systems are built - The four-model dance that makes ChatGPT possible
Tutorial Structure
Part I: Foundations of Distributed Computing (Chapters 1-4)
Start here. These concepts are the alphabet of distributed systems.
| Chapter | Topic | Key Concepts |
|---|---|---|
| Chapter 1 | Your First Distributed Program | rank, world_size, process groups |
| Chapter 2 | Point-to-Point Communication | send/recv, deadlock avoidance |
| Chapter 3 | Collective Operations | all_reduce, broadcast, scatter |
| Chapter 4 | NCCL and GPU Topology | Ring/Tree algorithms, NVLink |
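As a preview of where Part I lands, here is a minimal sketch (not one of the tutorial's scripts): four processes form a process group and run an all_reduce on CPU with the gloo backend, so it needs no GPU.

```python
# Minimal sketch: 4 processes, one process group, one all_reduce.
# Addresses, port, and world size are illustrative choices.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo runs on CPU, so no GPU is required for this example
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    x = torch.tensor([float(rank)])
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # every rank ends up with 0+1+2+3 = 6
    print(f"rank {rank}/{world_size}: {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)
```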
Part II: Parallelism Strategies (Chapters 5-7)
Now you know the primitives. Let’s use them to train models that don’t fit on one GPU.
| Chapter | Topic | Key Concepts |
|---|---|---|
| Chapter 5 | Data Parallelism Deep Dive | DDP, FSDP, ZeRO stages |
| Chapter 6 | Tensor Parallelism | Column/row parallel, Megatron-style |
| Chapter 7 | Pipeline & Expert Parallelism | 1F1B scheduling, MoE |
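To make the jump from primitives to parallelism concrete, here is a rough sketch of data parallelism with DDP. It assumes a process group is already initialized (as in the Part I example); the model and sizes are placeholders, not anything from the tutorial's scripts.

```python
# Sketch: data parallelism with DDP, assuming dist.init_process_group() already ran.
# Each rank holds a full replica; DDP all_reduces gradients during backward().
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

model = nn.Linear(1024, 1024)              # full copy of the model on every rank
model = DDP(model)                          # hooks gradient all_reduce into backward()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 1024)                    # each rank sees a different shard of the batch
loss = model(x).pow(2).mean()
loss.backward()                             # gradients are averaged across ranks here
opt.step()
```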
Part III: LLM Inference Systems (Chapters 8-11)
Training is half the story. Serving models efficiently is the other half.
| Chapter | Topic | Key Concepts |
|---|---|---|
| Chapter 8 | Server Anatomy | Request lifecycle, prefill/decode |
| Chapter 9 | KV Cache Management | PagedAttention, RadixCache |
| Chapter 10 | Scheduling & CUDA Graphs | Zero-overhead scheduling |
| Chapter 11 | Speculative & Constraint Decoding | Draft models, structured output |
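As a taste of Chapter 9, here is a toy sketch of the paging idea behind PagedAttention: the KV cache is carved into fixed-size blocks, and each request keeps a table mapping its token positions to physical blocks. The block size, pool size, and names below are made up for illustration, not any engine's actual API.

```python
# Toy sketch of a paged KV cache: fixed-size blocks + a per-request block table.
BLOCK_SIZE = 4

free_blocks = list(range(16))        # pool of physical cache blocks
block_tables = {}                    # request_id -> list of physical block ids

def append_token(request_id: str, position: int):
    """Reserve cache space for one new token of a request."""
    table = block_tables.setdefault(request_id, [])
    if position % BLOCK_SIZE == 0:   # first token of a new block: grab one from the pool
        table.append(free_blocks.pop())
    block = table[position // BLOCK_SIZE]
    slot = position % BLOCK_SIZE
    return block, slot               # where this token's K/V vectors would be written

for pos in range(6):                 # a 6-token prompt spans two blocks
    print("req-A token", pos, "->", append_token("req-A", pos))
```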
Part IV: RLHF Systems (Chapters 12-14)
The grand finale: training models with human feedback.
| Chapter | Topic | Key Concepts |
|---|---|---|
| Chapter 12 | RL Fundamentals for LLMs | PPO, GAE, policy gradients |
| Chapter 13 | RLHF Computation Flow | Four models, reward calculation |
| Chapter 14 | RLHF System Architecture | Co-located vs disaggregated |
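As a preview of Chapter 12, here is a sketch of PPO's clipped surrogate loss. It assumes per-token log-probabilities and advantages have already been computed elsewhere; eps=0.2 is a common default, not a requirement.

```python
# Sketch of PPO's clipped surrogate loss, assuming log-probs and advantages are given.
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)             # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()       # maximize surrogate => minimize negative

loss = ppo_clip_loss(torch.randn(8), torch.randn(8), torch.randn(8))
```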
How to Use This Tutorial
Prerequisites
pip install torch # Core requirement
pip install gymnasium # For RL chapter (optional)
No GPU required! All scripts have a CPU fallback via the gloo backend.
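The fallback pattern is roughly the snippet below (a sketch of the idiom, not an excerpt from the scripts): pick nccl when CUDA is available, gloo otherwise.

```python
# Sketch of the CPU-fallback pattern: NCCL for GPUs, gloo otherwise.
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend)   # rank/world_size read from the environment (e.g. via torchrun)
```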
Learning Path
Recommended order: Follow chapters sequentially. Each builds on the previous.
Hands-on learning: Each chapter has:
- Conceptual explanation (the chapter page)
- Runnable scripts (linked as sub-pages)
- Exercises to try
Running the Scripts
# Chapter 1: Your first distributed program
python tutorial/part1-distributed/chapter01-first-program/scripts/verify_setup.py
python tutorial/part1-distributed/chapter01-first-program/scripts/hello_distributed.py
# Chapter 3: Collective operations
python tutorial/part1-distributed/chapter03-collectives/scripts/collective_cheatsheet.py
Quick Start: See Something Work!
Want to jump in immediately? Run this:
python tutorial/part1-distributed/chapter01-first-program/scripts/verify_setup.py # Check your environment
python tutorial/part1-distributed/chapter01-first-program/scripts/hello_distributed.py # Your first distributed program!
You should see 4 processes talking to each other!
Core Mental Models
The Parallelism Zoo
Problem: Training too slow, but the model fits on one GPU?
└── Data Parallelism (replicate the model, split the batch)
    └── Optimizer state blows up memory → ZeRO/FSDP (shard states, grads, params)
Problem: Model too big for one GPU?
├── One layer too big → Tensor Parallelism (split individual layers)
└── All layers too big → Pipeline Parallelism (split the model into stages)
Problem: Model is MoE?
└── Add Expert Parallelism (distribute experts)
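A back-of-envelope memory estimate makes the decision tree concrete. The sketch below assumes mixed-precision Adam (roughly 16 bytes of model state per parameter, activations excluded); the numbers are ballpark, not exact.

```python
# Back-of-envelope memory check to drive the decision tree above.
# Assumes mixed-precision Adam: bf16 params + grads (2+2 bytes) plus
# fp32 master weights and two moments (4+4+4 bytes) ≈ 16 bytes/param, activations excluded.
def training_state_gb(n_params: float, bytes_per_param: int = 16) -> float:
    return n_params * bytes_per_param / 1e9

for n in (7e9, 70e9):
    print(f"{n/1e9:.0f}B params -> ~{training_state_gb(n):.0f} GB of model state")
# ~112 GB for 7B and ~1120 GB for 70B: neither fits in a single 80 GB GPU,
# which is where ZeRO/FSDP, tensor, and pipeline parallelism come in.
```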
The Memory Hierarchy
| Fast → Slow | GPU L2 | GPU HBM | CPU RAM | NVMe SSD | Network |
|---|---|---|---|---|---|
| Approx. bandwidth | ~90 TB/s | ~3 TB/s | ~200 GB/s | ~7 GB/s | ~50 GB/s |
Goal: Keep computation in fast memory
Strategy: Overlap communication with computation
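In code, overlap usually means launching a collective asynchronously and doing useful work before waiting on it. A minimal sketch, assuming a process group is already initialized (see Part I); shapes are illustrative.

```python
# Sketch: overlapping communication with computation via an async collective.
import torch
import torch.distributed as dist

grad = torch.randn(1024, 1024)
work = dist.all_reduce(grad, async_op=True)    # launch the collective, don't block
other = torch.randn(1024, 1024) @ torch.randn(1024, 1024)  # useful compute in the meantime
work.wait()                                    # block only when the reduced gradient is needed
```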
The Inference Pipeline
Request → Tokenizer → Scheduler → Model Runner → Detokenizer → Response
Inside the Model Runner:
- Prefill: process the prompt in one pass
- Decode: generate tokens one at a time
- KV Cache: remember context across decode steps
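Here is a toy sketch of prefill vs. decode with a growing KV cache. The "attention step" is a stand-in for a real transformer layer and the shapes are made up, but the loop structure (prefill the whole prompt once, then decode one token at a time against the cache) is the real one.

```python
# Toy sketch of prefill vs. decode with a growing KV cache (not a real model).
import torch

d_model = 16
k_cache = torch.empty(0, d_model)   # keys for all tokens seen so far
v_cache = torch.empty(0, d_model)   # values for all tokens seen so far

def toy_attention_step(x, k_cache, v_cache):
    """Append this chunk's K/V to the cache and attend over everything cached."""
    k_cache = torch.cat([k_cache, x])          # pretend x is also the new key/value
    v_cache = torch.cat([v_cache, x])
    scores = torch.softmax(x @ k_cache.T / d_model**0.5, dim=-1)
    return scores @ v_cache, k_cache, v_cache

prompt = torch.randn(5, d_model)               # Prefill: process the whole prompt at once
out, k_cache, v_cache = toy_attention_step(prompt, k_cache, v_cache)

token = out[-1:]                               # Decode: one token per step, reusing the cache
for _ in range(3):
    token, k_cache, v_cache = toy_attention_step(token, k_cache, v_cache)
print("cached tokens:", k_cache.shape[0])      # 5 prompt tokens + 3 generated = 8
```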
“The best way to understand distributed systems is to build one. The second best way is this tutorial.”
Happy learning!