Chapter 10: Advanced Scheduling and CUDA Graphs
“The fastest code is code that doesn’t run. The second fastest is code you ran yesterday.”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why CPU scheduling overhead matters for inference
- Understand CUDA Graphs and when to use them
- Describe zero-overhead scheduling with FutureMap
- Identify scenarios where CUDA Graphs help vs hurt
Prerequisites
- Completed Chapters 8-9 (Server Anatomy, KV Cache)
- Basic understanding of CPU-GPU interaction
- Familiarity with asynchronous programming
Concept Overview
The Scheduling Overhead Problem
LLM inference involves many small operations:
- Schedule which requests to process
- Prepare batch metadata
- Launch GPU kernels
- Wait for results
- Process outputs
The problem? Steps 1, 2, and 5 happen on the CPU while the GPU waits!
Traditional scheduling:
```
Time →
CPU: [Schedule][Prepare]............[Process]............[Schedule][Prepare]
GPU: ............[Compute]............[idle].............[Compute]..........
                     ↑                    ↑
                GGPU working       GPU waiting for CPU!
```
For small decode batches, CPU overhead can exceed GPU compute time.
How Bad Is It?
On a high-end setup (H100 + fast CPU):
- GPU decode step: ~5-10ms
- CPU scheduling overhead: ~2-5ms
That’s 20-50% overhead! For latency-sensitive applications, this is unacceptable.
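You can see the gap yourself. The chapter's scheduling_overhead_benchmark.py script measures this thoroughly; below is a minimal sketch (assuming PyTorch and a CUDA device) that uses a Linear layer as a stand-in for a decode step and compares wall-clock time against pure GPU time reported by CUDA events:

```python
import time
import torch

# Minimal sketch: the wall-clock/GPU-time gap approximates CPU-side overhead.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")
_ = model(x)                        # warmup so CUDA init doesn't skew timing
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
t0 = time.perf_counter()
start.record()
y = model(x)                        # one "decode step"
end.record()
torch.cuda.synchronize()

wall_ms = (time.perf_counter() - t0) * 1000
gpu_ms = start.elapsed_time(end)    # GPU execution time only
print(f"wall {wall_ms:.3f} ms | gpu {gpu_ms:.3f} ms | cpu-side {wall_ms - gpu_ms:.3f} ms")
```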
CUDA Graphs: Recording Once, Playing Many
CUDA Graphs capture a sequence of GPU operations and replay them with minimal CPU overhead.
```python
# Traditional approach: the CPU launches every kernel, every iteration
for _ in range(1000):
    output = model(static_input)        # per-call kernel launch overhead

# CUDA Graph approach (sketch: real code also needs a warmup pass and
# pre-allocated static input/output tensors)
graph = torch.cuda.CUDAGraph()

# Step 1: Capture
with torch.cuda.graph(graph):
    static_output = model(static_input)  # operations recorded, not executed

# Step 2: Replay
for _ in range(1000):
    graph.replay()                      # single CPU call replays the entire graph
```
Why it’s fast:
- One CPU→GPU launch instead of many
- GPU executes pre-optimized kernel sequence
- No kernel launch latency per operation
CUDA Graphs: The Constraints
CUDA Graphs require static computation:
| Allowed | Not Allowed |
|---|---|
| Fixed tensor shapes | Dynamic shapes |
| Deterministic operations | Random dropout |
| Pre-allocated memory | Dynamic allocation |
| Fixed control flow | Data-dependent branching |
This is perfect for inference (fixed model) but problematic for training.
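The "pre-allocated memory" row is worth seeing concretely. The chapter's cuda_graph_simple.py script demonstrates this; here is a minimal capture/replay sketch (assuming PyTorch and a CUDA device): a captured graph always reads and writes the same tensors, so fresh data must be copied into the capture-time buffers before each replay.

```python
import torch

g = torch.cuda.CUDAGraph()
static_in = torch.zeros(8, 1024, device="cuda")

# PyTorch recommends a warmup pass on a side stream before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out = static_in * 2.0
torch.cuda.current_stream().wait_stream(s)

with torch.cuda.graph(g):
    static_out = static_in * 2.0          # recorded, not executed

static_in.copy_(torch.randn(8, 1024, device="cuda"))  # same shape: OK
g.replay()                                # result appears in static_out
# A (4, 1024) batch cannot reuse this graph: capture one graph per shape,
# or pad inputs up to an already-captured shape.
```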
Why Training Rarely Uses CUDA Graphs
- Dynamic optimizer updates: Gradient clipping changes behavior
- Learning rate scheduling: Different computations each step
- Gradient accumulation: Variable number of backward passes per optimizer step
- Dropout: Random behavior
- Dynamic memory: Activation checkpointing allocates/frees
Zero-Overhead Scheduling: The FutureMap
SGLang’s innovation: overlap CPU scheduling with GPU compute.
```
Traditional:
  Batch N:    [CPU Schedule N] → [GPU Compute N] → [CPU Process N]
                                                          ↓
  Batch N+1:                                      [CPU Schedule N+1] → [GPU Compute N+1]

Overlapped:
  Batch N:    [CPU Schedule N] → [GPU Compute N] ───────────────→
  Batch N+1:                     [CPU Schedule N+1] → [GPU Compute N+1]
                                          ↑
                           runs in parallel with GPU Compute N!
```
The challenge: Batch N+1’s inputs might depend on Batch N’s outputs!
FutureMap: Speculative Scheduling
FutureMap solves this with symbolic references:
1. CPU pre-allocates slots for Batch N's output
2. CPU schedules Batch N+1 using slot references (not actual values)
3. GPU runs Batch N, writes to pre-allocated slots
4. GPU's "resolve" kernel substitutes symbolic refs with real values
5. GPU runs Batch N+1
```
┌────────────────────────────────────────────────────────────────────┐
│                        FutureMap Mechanism                         │
│                                                                    │
│  CPU Thread:                                                       │
│    1. Reserve slots for Batch N output                             │
│    2. Build Batch N+1 input with symbolic refs: [slot_42, slot_43] │
│    3. Continue scheduling (no blocking!)                           │
│                                                                    │
│  GPU Thread:                                                       │
│    1. Compute Batch N                                              │
│    2. Write results to reserved slots (42, 43)                     │
│    3. Resolve kernel: [slot_42, slot_43] → [actual_token_ids]      │
│    4. Compute Batch N+1                                            │
└────────────────────────────────────────────────────────────────────┘
```
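To make the indirection concrete, here is a toy FutureMap sketch. It is hypothetical and simplified: `allocate`, `store`, and `resolve` are illustrative names, not SGLang's API, and the real system performs the resolve step in a GPU kernel rather than in Python. Negative ids encode "token not produced yet."

```python
import torch

class FutureMap:
    """Toy future-token table: negative ids mean 'not produced yet'."""

    def __init__(self, size: int, device: str = "cpu"):  # "cuda" in a real server
        self.token_slots = torch.zeros(size, dtype=torch.int64, device=device)
        self.next_slot = 0

    def allocate(self, n: int) -> torch.Tensor:
        """CPU side: reserve n slots, returning symbolic refs (slot k -> -(k+1))."""
        refs = -(torch.arange(self.next_slot, self.next_slot + n,
                              device=self.token_slots.device) + 1)
        self.next_slot += n
        return refs

    def store(self, refs: torch.Tensor, token_ids: torch.Tensor) -> None:
        """Batch N finishes: write its sampled tokens into the reserved slots."""
        self.token_slots[-refs - 1] = token_ids

    def resolve(self, input_ids: torch.Tensor) -> torch.Tensor:
        """The 'resolve' step: replace negative refs with the real token ids."""
        out = input_ids.clone()
        mask = out < 0
        out[mask] = self.token_slots[-out[mask] - 1]
        return out

fm = FutureMap(1024)
refs = fm.allocate(2)                     # CPU: schedule batch N+1 with these refs
fm.store(refs, torch.tensor([101, 102]))  # GPU: batch N produces tokens 101, 102
print(fm.resolve(refs))                   # tensor([101, 102])
```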
The Complete Overlap Scheduler
```python
def overlap_scheduler_loop(self):
    """Simplified sketch of SGLang's overlapped scheduling loop."""
    last_batch = None
    last_result = None
    while True:
        # Step 1: Schedule the NEXT batch (CPU).
        # This happens WHILE the previous batch is still computing on GPU!
        next_batch = self.get_next_batch_to_run()

        # Step 2: Launch the next batch (GPU, non-blocking).
        next_result = self.run_batch(next_batch)

        # Step 3: Process the PREVIOUS batch's results (CPU).
        # By now, the previous batch is likely done.
        if last_batch is not None:
            self.process_batch_result(last_batch, last_result)

        last_batch = next_batch
        last_result = next_result
```
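This only works because CUDA kernel launches are asynchronous: run_batch enqueues GPU work and returns immediately, so the CPU can process batch N's results and loop back to scheduling while batch N+1 is still computing on the GPU.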
Combining CUDA Graphs with Overlap Scheduling
The ultimate optimization:
- CUDA Graphs for decode batches (fixed shape, repeated)
- Overlap scheduling for prefill/mixed batches
- FutureMap to bridge the gap
Decode path (CUDA Graph):
- Capture graph for batch sizes [1, 2, 4, 8, 16, ...]
- Replay the appropriate graph based on batch size (see the sketch below)
- Near-zero CPU overhead
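A sketch of how that graph selection might look (hypothetical; `CAPTURED_SIZES` and the padding policy are illustrative, not any particular server's configuration):

```python
import bisect

CAPTURED_SIZES = [1, 2, 4, 8, 16, 32]    # graphs captured at server startup

def pick_captured_size(batch_size: int):
    """Smallest captured size >= batch_size, or None (fall back to eager)."""
    i = bisect.bisect_left(CAPTURED_SIZES, batch_size)
    return CAPTURED_SIZES[i] if i < len(CAPTURED_SIZES) else None

assert pick_captured_size(3) == 4        # pad a 3-request batch to size 4
assert pick_captured_size(16) == 16      # exact hit
assert pick_captured_size(40) is None    # too large: run eagerly
```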
Prefill path (Overlap):
- Variable prompt lengths
- Use overlap scheduling with FutureMap
- Reduced but not eliminated CPU overhead
Code Walkthrough
Script 1: cuda_graph_simple.py
Demonstrates CUDA Graphs:
- Captures a simple model forward pass
- Compares replay vs normal execution
- Shows the constraints
Script 2: scheduling_overhead_benchmark.py
Measures scheduling overhead:
- Time breakdown: scheduling vs compute
- Impact of batch size
- Benefits of overlap scheduling
Try It Yourself
Exercise 1: Measure Kernel Launch Overhead
Write a benchmark that:
- Runs 100 small matrix multiplications normally
- Captures them in a CUDA Graph
- Compares total time
Exercise 2: Understand Shape Constraints
Try to capture a CUDA Graph with:
- Fixed input shape → works
- Different input shapes → observe behavior
- How do real systems handle multiple shapes?
Exercise 3: Simulate Overlap Scheduling
Implement a simple overlap scheduler:
- Queue of “batches” (just sleep timers)
- Measure throughput with vs without overlap
- What’s the speedup?
Key Takeaways
- CPU overhead is real - Can be 20-50% of decode time
- CUDA Graphs eliminate kernel launch overhead - But need static shapes
- Overlap scheduling hides CPU work - Schedule N+1 while computing N
- FutureMap enables speculation - Pre-allocate outputs, resolve later
- Real systems combine techniques - CUDA Graphs for decode, overlap for prefill
The Speed Hierarchy
From fastest to slowest:
- CUDA Graph replay: ~0.01ms overhead
- Overlap scheduled: ~0.5ms (hidden)
- Normal scheduling: ~2-5ms
- Naive Python loop: ~10ms+
When Not to Use CUDA Graphs
- Variable sequence lengths (prefill)
- Dynamic batch sizes (requests finishing)
- Debugging (graphs hide errors)
- Memory-constrained deployments (each captured graph consumes GPU memory)
What’s Next?
In Chapter 11, we’ll explore Speculative and Constraint Decoding—using draft models to speed up generation and grammar constraints to ensure structured output.