ML Systems Infrastructure Tutorial

From distributed primitives to production RLHF: A hands-on journey through ML infrastructure

This tutorial takes you from zero to understanding how large-scale ML systems work. If you’re comfortable with PyTorch and understand transformers but wonder “how do people actually train GPT-4?”, this is for you.

Who This Is For

  • Strong ML background: You know PyTorch, can train models, understand attention
  • New to systems: You haven’t done distributed training, don’t know NCCL from TCP
  • Curious about scale: You want to understand how 1000-GPU training actually works

What You’ll Learn

By the end of this tutorial, you’ll understand:

  1. How GPUs talk to each other - Communication primitives that enable distributed training
  2. How to parallelize training - Data, tensor, and pipeline parallelism strategies
  3. How inference servers work - KV cache, batching, and speculative decoding
  4. How RLHF systems are built - The four-model dance that makes ChatGPT possible

Tutorial Structure

Part I: Foundations of Distributed Computing (Chapters 1-4)

Start here. These concepts are the alphabet of distributed systems.

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| Chapter 1 | Your First Distributed Program | rank, world_size, process groups |
| Chapter 2 | Point-to-Point Communication | send/recv, deadlock avoidance |
| Chapter 3 | Collective Operations | all_reduce, broadcast, scatter |
| Chapter 4 | NCCL and GPU Topology | Ring/Tree algorithms, NVLink |
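
For a preview of what these primitives look like in code, here is a minimal sketch (not one of the tutorial's scripts) that spawns four CPU processes on the gloo backend and sums a tensor across them with all_reduce:

# Minimal sketch of the Part I primitives: four processes, one all_reduce.
# Illustrative only -- not one of the tutorial scripts.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # gloo runs on CPU, so no GPU is needed for this example
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    x = torch.tensor([float(rank)])
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # every rank now holds 0+1+2+3 = 6
    print(f"rank {rank}/{world_size}: {x.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(4,), nprocs=4)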

Part II: Parallelism Strategies (Chapters 5-7)

Now you know the primitives. Let’s use them to train models that don’t fit on one GPU.

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| Chapter 5 | Data Parallelism Deep Dive | DDP, FSDP, ZeRO stages |
| Chapter 6 | Tensor Parallelism | Column/row parallel, Megatron-style |
| Chapter 7 | Pipeline & Expert Parallelism | 1F1B scheduling, MoE |
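
As a taste of the tensor-parallel chapter, here is a single-process sketch of Megatron-style column parallelism: the weight matrix is split along its output dimension, each simulated rank computes a partial result, and concatenating the partials (an all_gather in a real multi-GPU setup) reproduces the full linear layer:

# Single-process sketch of column (tensor) parallelism.
# In a real setup each shard lives on its own GPU and the final
# concatenation is an all_gather; here we only simulate the math.
import torch

torch.manual_seed(0)
batch, d_in, d_out, tp = 2, 8, 16, 4              # tp = tensor-parallel "ranks"

x = torch.randn(batch, d_in)
w = torch.randn(d_in, d_out)

shards = w.chunk(tp, dim=1)                       # split W along the output dim: 4 shards of (8, 4)
partials = [x @ w_shard for w_shard in shards]    # each rank's local matmul
y_parallel = torch.cat(partials, dim=1)           # "all_gather" of the partial outputs

y_reference = x @ w
print(torch.allclose(y_parallel, y_reference))    # True: same result, split compute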

Part III: LLM Inference Systems (Chapters 8-11)

Training is half the story. Serving models efficiently is the other half.

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| Chapter 8 | Server Anatomy | Request lifecycle, prefill/decode |
| Chapter 9 | KV Cache Management | PagedAttention, RadixCache |
| Chapter 10 | Scheduling & CUDA Graphs | Zero-overhead scheduling |
| Chapter 11 | Speculative & Constrained Decoding | Draft models, structured output |
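
To make the Chapter 9 ideas concrete, here is a toy sketch of PagedAttention-style bookkeeping; the class and method names are illustrative, not vLLM's actual API. The KV cache is carved into fixed-size blocks, and each request keeps a block table mapping its logical token positions to physical blocks:

# Toy sketch of paged KV-cache bookkeeping (illustrative names, not a real library API).
# Physical memory is a pool of fixed-size blocks; each request owns a block table.
BLOCK_SIZE = 16

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # request id -> [block ids]
        self.lengths = {}                            # request id -> tokens stored

    def append_token(self, req_id):
        """Reserve space for one more token, allocating a new block when needed."""
        table = self.block_tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length % BLOCK_SIZE == 0:                 # current block full (or first token)
            table.append(self.free_blocks.pop())
        self.lengths[req_id] = length + 1
        # physical slot where this token's K/V vectors would be written
        return table[-1], length % BLOCK_SIZE

    def free(self, req_id):
        """Return a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

cache = PagedKVCache(num_blocks=8)
print([cache.append_token("req-0") for _ in range(18)])  # spills into a second block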

Part IV: RLHF Systems (Chapters 12-14)

The grand finale: training models with human feedback.

| Chapter | Topic | Key Concepts |
| --- | --- | --- |
| Chapter 12 | RL Fundamentals for LLMs | PPO, GAE, policy gradients |
| Chapter 13 | RLHF Computation Flow | Four models, reward calculation |
| Chapter 14 | RLHF System Architecture | Co-located vs. disaggregated |
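
The core equation of Chapter 12, the PPO clipped surrogate loss, fits in a few lines. Here is a minimal per-token sketch with illustrative shapes:

# Minimal sketch of the PPO clipped surrogate objective (per-token, illustrative shapes).
import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s), computed in log space for stability
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # take the pessimistic (minimum) objective; negate because we minimize
    return -torch.min(unclipped, clipped).mean()

# Example: 3 response tokens from one rollout
new_lp = torch.tensor([-1.0, -0.8, -2.1], requires_grad=True)
old_lp = torch.tensor([-1.1, -0.9, -2.0])
adv    = torch.tensor([ 0.5, -0.2,  1.0])
loss = ppo_policy_loss(new_lp, old_lp, adv)
loss.backward()
print(loss.item())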

How to Use This Tutorial

Prerequisites

pip install torch  # Core requirement
pip install gymnasium  # For RL chapter (optional)

No GPU required! All scripts have CPU fallback with the gloo backend.
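
The fallback follows the usual PyTorch pattern (a sketch of the idea, not the exact code in the scripts): pick nccl when CUDA is available and gloo otherwise.

# Sketch of the CPU-fallback pattern: nccl on GPU, gloo on CPU.
import torch
import torch.distributed as dist

backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)  # rank/world_size come from the launcher's env vars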

Learning Path

Recommended order: Follow chapters sequentially. Each builds on the previous.

Hands-on learning: Each chapter has:

  • Conceptual explanation (the chapter page)
  • Runnable scripts (linked as sub-pages)
  • Exercises to try

Running the Scripts

# Chapter 1: Your first distributed program
python tutorial/part1-distributed/chapter01-first-program/scripts/verify_setup.py
python tutorial/part1-distributed/chapter01-first-program/scripts/hello_distributed.py

# Chapter 3: Collective operations
python tutorial/part1-distributed/chapter03-collectives/scripts/collective_cheatsheet.py

Quick Start: See Something Work!

Want to jump in immediately? Run this:

python tutorial/part1-distributed/chapter01-first-program/scripts/verify_setup.py  # Check your environment
python tutorial/part1-distributed/chapter01-first-program/scripts/hello_distributed.py  # Your first distributed program!

You should see 4 processes talking to each other!

Core Mental Models

The Parallelism Zoo

Problem: Training too slow or model too big?
├── Model fits, training too slow → Data Parallelism (replicate model)
│   └── Replicas don't fit → ZeRO/FSDP (shard optimizer, grads, params)
├── One layer too big → Tensor Parallelism (split layers)
└── All layers too big → Pipeline Parallelism (split model)

Problem: Model is MoE?
└── Add Expert Parallelism (distribute experts)

The Memory Hierarchy

Fast ──────────────────────────────────────────► Slow
GPU L2   GPU HBM   CPU RAM   NVMe SSD   Network

90TB/s   3TB/s     200GB/s   7GB/s      50GB/s

Goal: Keep computation in fast memory
Strategy: Overlap communication with computation
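
In code, overlapping boils down to launching communication asynchronously and doing useful work before waiting on it. A minimal sketch (assuming a process group is already initialized):

# Sketch of overlapping communication with computation.
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket, next_layer_input, weight):
    # Kick off the gradient all_reduce without blocking...
    handle = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)
    # ...and do useful compute while the bytes are on the wire.
    activation = next_layer_input @ weight
    handle.wait()          # block only when the reduced gradients are actually needed
    return activation, grad_bucket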

The Inference Pipeline

Request → Tokenizer → Scheduler → Model Runner → Detokenizer → Response
                         ↓
              [Prefill: Process prompt]
                         ↓
              [Decode: Generate tokens]
                         ↓
              [KV Cache: Remember context]
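
To see why the prefill/decode split and the KV cache matter, here is a toy single-head attention sketch: prefill computes keys and values for the whole prompt once, and each decode step computes K/V only for the one new token and appends it to the cache:

# Toy single-head attention showing prefill vs. decode with a KV cache.
import torch
import torch.nn.functional as F

d = 64
wq, wk, wv = (torch.randn(d, d) for _ in range(3))

def prefill(prompt_embeds):                            # (prompt_len, d)
    # Compute K/V for the entire prompt in one pass.
    return prompt_embeds @ wk, prompt_embeds @ wv

def decode_step(new_token_embed, k_cache, v_cache):    # (1, d) plus caches
    # Only the new token's K/V are computed; the prompt's are reused from the cache.
    k_cache = torch.cat([k_cache, new_token_embed @ wk])
    v_cache = torch.cat([v_cache, new_token_embed @ wv])
    q = new_token_embed @ wq
    attn = F.softmax(q @ k_cache.T / d ** 0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache            # output + updated cache

k, v = prefill(torch.randn(8, d))                      # 8-token prompt
out, k, v = decode_step(torch.randn(1, d), k, v)       # one generated token
print(out.shape, k.shape)                              # torch.Size([1, 64]) torch.Size([9, 64])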

“The best way to understand distributed systems is to build one. The second best way is this tutorial.”

Happy learning!