Chapter 10: Advanced Scheduling and CUDA Graphs
“The fastest code is code that doesn’t run. The second fastest is code you ran yesterday.”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why CPU scheduling overhead matters for inference
- Understand CUDA Graphs and when to use them
- Describe zero-overhead scheduling with FutureMap
- Identify scenarios where CUDA Graphs help vs hurt
Prerequisites
- Completed Chapters 8-9 (Server Anatomy, KV Cache)
- Basic understanding of CPU-GPU interaction
- Familiarity with asynchronous programming
Concept Overview
The Scheduling Overhead Problem
LLM inference involves many small operations:
- Schedule which requests to process
- Prepare batch metadata
- Launch GPU kernels
- Wait for results
- Process outputs
The problem? Steps 1, 2, and 5 happen on the CPU while the GPU waits!
Traditional scheduling:
```
Time →
CPU: [Schedule][Prepare]............[Process]............[Schedule][Prepare]
GPU: ............[Compute]............[idle].............[Compute]..........
                     ↑                    ↑
                GGPU working       GPU waiting for CPU!
```
For small decode batches, CPU overhead can exceed GPU compute time.
How Bad Is It?
On a high-end setup (H100 + fast CPU):
- GPU decode step: ~5-10ms
- CPU scheduling overhead: ~2-5ms
That’s 20-50% overhead! For latency-sensitive applications, this is unacceptable.
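You can see the gap yourself. The chapter's scheduling_overhead_benchmark.py script measures this thoroughly; below is a minimal sketch (assuming PyTorch and a CUDA device) that uses a Linear layer as a stand-in for a decode step and compares wall-clock time against pure GPU time reported by CUDA events:

```python
import time
import torch

# Minimal sketch: the wall-clock/GPU-time gap approximates CPU-side overhead.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")
_ = model(x)                        # warmup so CUDA init doesn't skew timing
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
t0 = time.perf_counter()
start.record()
y = model(x)                        # one "decode step"
end.record()
torch.cuda.synchronize()

wall_ms = (time.perf_counter() - t0) * 1000
gpu_ms = start.elapsed_time(end)    # GPU execution time only
print(f"wall {wall_ms:.3f} ms | gpu {gpu_ms:.3f} ms | cpu-side {wall_ms - gpu_ms:.3f} ms")
```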
CUDA Graphs: Recording Once, Playing Many
CUDA Graphs capture a sequence of GPU operations and replay them with minimal CPU overhead.
```python
# Traditional approach: the CPU launches every kernel, every iteration
for _ in range(1000):
    output = model(static_input)        # per-call kernel launch overhead

# CUDA Graph approach (sketch: real code also needs a warmup pass and
# pre-allocated static input/output tensors)
graph = torch.cuda.CUDAGraph()

# Step 1: Capture
with torch.cuda.graph(graph):
    static_output = model(static_input)  # operations recorded, not executed

# Step 2: Replay
for _ in range(1000):
    graph.replay()                      # single CPU call replays the entire graph
```
Why it’s fast:
- One CPU→GPU launch instead of many
- GPU executes pre-optimized kernel sequence
- No kernel launch latency per operation
CUDA Graphs: The Constraints
CUDA Graphs require static computation:
| Allowed | Not Allowed |
|---|---|
| Fixed tensor shapes | Dynamic shapes |
| Deterministic operations | Random dropout |
| Pre-allocated memory | Dynamic allocation |
| Fixed control flow | Data-dependent branching |
This is perfect for inference (fixed model) but problematic for training.
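The "pre-allocated memory" row is worth seeing concretely. The chapter's cuda_graph_simple.py script demonstrates this; here is a minimal capture/replay sketch (assuming PyTorch and a CUDA device): a captured graph always reads and writes the same tensors, so fresh data must be copied into the capture-time buffers before each replay.

```python
import torch

g = torch.cuda.CUDAGraph()
static_in = torch.zeros(8, 1024, device="cuda")

# PyTorch recommends a warmup pass on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in * 2.0
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    static_out = static_in * 2.0          # recorded, not executed

static_in.copy_(torch.randn(8, 1024, device="cuda"))  # same shape: OK
g.replay()                                # result appears in static_out
# A (4, 1024) batch cannot reuse this graph: capture one graph per shape,
# or pad inputs up to an already-captured shape.
```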
Why Training Rarely Uses CUDA Graphs
- Dynamic optimizer updates: Gradient clipping changes behavior
- Learning rate scheduling: Different computations each step
- Gradient accumulation: Variable number of backward passes per optimizer step
- Dropout: Random behavior
- Dynamic memory: Activation checkpointing allocates/frees
Zero-Overhead Scheduling: The FutureMap
SGLang’s innovation: overlap CPU scheduling with GPU compute.
```
Traditional:
  Batch N:    [CPU Schedule N] → [GPU Compute N] → [CPU Process N]
                                                          ↓
  Batch N+1:                                      [CPU Schedule N+1] → [GPU Compute N+1]

Overlapped:
  Batch N:    [CPU Schedule N] → [GPU Compute N] ───────────────→
  Batch N+1:                     [CPU Schedule N+1] → [GPU Compute N+1]
                                          ↑
                           runs in parallel with GPU Compute N!
```
The challenge: Batch N+1’s inputs might depend on Batch N’s outputs!
FutureMap: Speculative Scheduling
FutureMap solves this with symbolic references:
1. CPU pre-allocates slots for Batch N's output
2. CPU schedules Batch N+1 using slot references (not actual values)
3. GPU runs Batch N, writes to pre-allocated slots
4. GPU's "resolve" kernel substitutes symbolic refs with real values
5. GPU runs Batch N+1
```
┌────────────────────────────────────────────────────────────────────┐
│                        FutureMap Mechanism                         │
│                                                                    │
│  CPU Thread:                                                       │
│    1. Reserve slots for Batch N output                             │
│    2. Build Batch N+1 input with symbolic refs: [slot_42, slot_43] │
│    3. Continue scheduling (no blocking!)                           │
│                                                                    │
│  GPU Thread:                                                       │
│    1. Compute Batch N                                              │
│    2. Write results to reserved slots (42, 43)                     │
│    3. Resolve kernel: [slot_42, slot_43] → [actual_token_ids]      │
│    4. Compute Batch N+1                                            │
└────────────────────────────────────────────────────────────────────┘
```
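To make the indirection concrete, here is a toy FutureMap sketch. It is hypothetical and simplified: `allocate`, `store`, and `resolve` are illustrative names, not SGLang's API, and the real system performs the resolve step in a GPU kernel rather than in Python. Negative ids encode "token not produced yet."

```python
import torch

class FutureMap:
    """Toy future-token table: negative ids mean 'not produced yet'."""

    def __init__(self, size: int, device: str = "cpu"):  # "cuda" in a real server
        self.token_slots = torch.zeros(size, dtype=torch.int64, device=device)
        self.next_slot = 0

    def allocate(self, n: int) -> torch.Tensor:
        """CPU side: reserve n slots, returning symbolic refs (slot k -> -(k+1))."""
        refs = -(torch.arange(self.next_slot, self.next_slot + n,
                              device=self.token_slots.device) + 1)
        self.next_slot += n
        return refs

    def store(self, refs: torch.Tensor, token_ids: torch.Tensor) -> None:
        """Batch N finishes: write its sampled tokens into the reserved slots."""
        self.token_slots[-refs - 1] = token_ids

    def resolve(self, input_ids: torch.Tensor) -> torch.Tensor:
        """The 'resolve' step: replace negative refs with the real token ids."""
        out = input_ids.clone()
        mask = out < 0
        out[mask] = self.token_slots[-out[mask] - 1]
        return out

fm = FutureMap(1024)
refs = fm.allocate(2)                     # CPU: schedule batch N+1 with these refs
fm.store(refs, torch.tensor([101, 102]))  # GPU: batch N produces tokens 101, 102
print(fm.resolve(refs))                   # tensor([101, 102])
```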
The Complete Overlap Scheduler
```python
def overlap_scheduler_loop(self):
    """Simplified sketch of SGLang's overlapped scheduling loop."""
    last_batch = None
    last_result = None
    while True:
        # Step 1: Schedule the NEXT batch (CPU).
        # This happens WHILE the previous batch is still computing on GPU!
        next_batch = self.get_next_batch_to_run()

        # Step 2: Launch the next batch (GPU, non-blocking).
        next_result = self.run_batch(next_batch)

        # Step 3: Process the PREVIOUS batch's results (CPU).
        # By now, the previous batch is likely done.
        if last_batch is not None:
            self.process_batch_result(last_batch, last_result)

        last_batch = next_batch
        last_result = next_result
```
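This only works because CUDA kernel launches are asynchronous: run_batch enqueues GPU work and returns immediately, so the CPU can process batch N's results and loop back to scheduling while batch N+1 is still computing on the GPU.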
Combining CUDA Graphs with Overlap Scheduling
The ultimate optimization:
- CUDA Graphs for decode batches (fixed shape, repeated)
- Overlap scheduling for prefill/mixed batches
- FutureMap to bridge the gap
Decode path (CUDA Graph):
- Capture graph for batch sizes [1, 2, 4, 8, 16, ...]
- Replay the appropriate graph based on batch size (see the sketch below)
- Near-zero CPU overhead
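A sketch of how that graph selection might look (hypothetical; `CAPTURED_SIZES` and the padding policy are illustrative, not any particular server's configuration):

```python
import bisect

CAPTURED_SIZES = [1, 2, 4, 8, 16, 32]    # graphs captured at server startup

def pick_captured_size(batch_size: int):
    """Smallest captured size >= batch_size, or None (fall back to eager)."""
    i = bisect.bisect_left(CAPTURED_SIZES, batch_size)
    return CAPTURED_SIZES[i] if i < len(CAPTURED_SIZES) else None

assert pick_captured_size(3) == 4        # pad a 3-request batch to size 4
assert pick_captured_size(16) == 16      # exact hit
assert pick_captured_size(40) is None    # too large: run eagerly
```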
Prefill path (Overlap):
- Variable prompt lengths
- Use overlap scheduling with FutureMap
- Reduced but not eliminated CPU overhead
Code Walkthrough
Script 1: cuda_graph_simple.py
Demonstrates CUDA Graphs:
- Captures a simple model forward pass
- Compares replay vs normal execution
- Shows the constraints
Script 2: scheduling_overhead_benchmark.py
Measures scheduling overhead:
- Time breakdown: scheduling vs compute
- Impact of batch size
- Benefits of overlap scheduling
Try It Yourself
Exercise 1: Measure Kernel Launch Overhead
Write a benchmark that:
- Runs 100 small matrix multiplications normally
- Captures them in a CUDA Graph
- Compares total time
Exercise 2: Understand Shape Constraints
Try to capture a CUDA Graph with:
- Fixed input shape → works
- Different input shapes → observe behavior
- How do real systems handle multiple shapes?
Exercise 3: Simulate Overlap Scheduling
Implement a simple overlap scheduler:
- Queue of “batches” (just sleep timers)
- Measure throughput with vs without overlap
- What’s the speedup?
Key Takeaways
- CPU overhead is real - Can be 20-50% of decode time
- CUDA Graphs eliminate kernel launch overhead - But need static shapes
- Overlap scheduling hides CPU work - Schedule N+1 while computing N
- FutureMap enables speculation - Pre-allocate outputs, resolve later
- Real systems combine techniques - CUDA Graphs for decode, overlap for prefill
The Speed Hierarchy
From fastest to slowest:
- CUDA Graph replay: ~0.01ms overhead
- Overlap scheduled: ~0.5ms (hidden)
- Normal scheduling: ~2-5ms
- Naive Python loop: ~10ms+
When Not to Use CUDA Graphs
- Variable sequence lengths (prefill)
- Dynamic batch sizes (requests finishing)
- Debugging (graphs hide errors)
- Memory-constrained deployments (each captured graph consumes GPU memory)
What’s Next?
In Chapter 11, we’ll explore Speculative and Constraint Decoding—using draft models to speed up generation and grammar constraints to ensure structured output.