Chapter 8: Anatomy of an LLM Inference Server
“Training is a sprint. Inference is a marathon that never ends.”
Learning Objectives
By the end of this chapter, you will be able to:
- Trace the lifecycle of a request through an inference server
- Explain the roles of Tokenizer, Scheduler, and Model Runner
- Understand why inference is fundamentally different from training
- Identify bottlenecks in inference serving
Prerequisites
- Completed Part II (Parallelism Strategies)
- Basic understanding of transformer architecture
- Familiarity with REST APIs
Concept Overview
Training vs Inference: A Tale of Two Challenges
| Aspect | Training | Inference |
|---|---|---|
| Goal | Update model weights | Generate tokens |
| Batch size | Fixed (large) | Dynamic (varies) |
| Latency | Not user-facing | Critical (users are waiting) |
| Throughput | Samples/second | Tokens/second |
| Memory | Dominated by gradients | Dominated by KV cache |
| Workload | Homogeneous | Heterogeneous |
Training processes fixed batches for hours or days. Inference must answer unpredictable requests within milliseconds to seconds.
The Inference Pipeline
When you send a prompt to an LLM, here’s what happens:
```
                     LLM INFERENCE SERVER

HTTP Request
     │
     ▼
┌─────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ API Adapter │───►│   TokenizerMgr   │───►│    Scheduler     │
│             │    │    (tokenize)    │    │ (batch requests) │
└─────────────┘    └──────────────────┘    └────────┬─────────┘
                                                    │
                                                    ▼
                                           ┌──────────────────┐
                                           │   Model Runner   │
                                           │   (GPU compute)  │
                                           └────────┬─────────┘
                                                    │
                                                    ▼
                                           ┌──────────────────┐
HTTP Response ◄────────────────────────────│  DetokenizerMgr  │
                                           │ (tokens → text)  │
                                           └──────────────────┘
```
Component Deep Dive
1. API Adapter
Translates HTTP requests into internal format:
- Parses JSON body
- Validates parameters (temperature, max_tokens, etc.)
- Creates a `GenerateRequest` object
```python
@app.post("/v1/chat/completions")
async def chat_completion(request: ChatRequest):
    # Validate parameters and convert to the internal request format
    generate_request = convert_to_internal(request)
    # Hand off to the tokenizer manager and await the generated response
    return await tokenizer_manager.generate(generate_request)
```
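For concreteness, here is a minimal sketch of what the internal request object and the conversion might look like; `GenerateRequest`, `ChatRequest`, and `convert_to_internal` are illustrative names for this chapter, not a specific framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class GenerateRequest:
    """Internal representation of one generation request (illustrative)."""
    prompt: str
    max_new_tokens: int = 256
    temperature: float = 1.0
    top_p: float = 1.0
    stop: list[str] = field(default_factory=list)
    stream: bool = False

def convert_to_internal(request: "ChatRequest") -> GenerateRequest:
    # Flatten the chat messages into a single prompt string and copy over
    # the validated sampling parameters.
    prompt = "\n".join(f"{m.role}: {m.content}" for m in request.messages)
    return GenerateRequest(
        prompt=prompt,
        max_new_tokens=request.max_tokens or 256,
        temperature=request.temperature if request.temperature is not None else 1.0,
        stream=bool(request.stream),
    )
```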
2. Tokenizer Manager
Handles text ↔ token conversion:
- Tokenizes input prompt
- Manages vocabulary and special tokens
- Queues tokenized requests for scheduler
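A sketch of what a tokenizer manager could look like, using Hugging Face's `AutoTokenizer` for illustration; the `TokenizedRequest` shape and the asyncio queue are assumptions, not a particular server's internals:

```python
import asyncio
from dataclasses import dataclass
from transformers import AutoTokenizer

@dataclass
class TokenizedRequest:
    request_id: str
    input_ids: list[int]
    max_new_tokens: int

class TokenizerManager:
    def __init__(self, model_name: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, request_id: str, prompt: str, max_new_tokens: int):
        # Text -> token IDs (special tokens such as BOS are added here)
        input_ids = self.tokenizer.encode(prompt)
        # Hand the tokenized request to the scheduler's input queue
        await self.queue.put(TokenizedRequest(request_id, input_ids, max_new_tokens))
```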
3. Scheduler
The brain of the inference server:
- Manages request queue
- Decides which requests to batch together
- Allocates KV cache memory
- Chooses between prefill and decode
The scheduler is so important it gets its own chapters (9-10)!
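As a preview, here is a heavily simplified scheduling loop that only shows where the batching decision sits; real schedulers also track KV cache blocks, priorities, and preemption, which Chapters 9-10 cover:

```python
from collections import deque

class Scheduler:
    def __init__(self, max_batch_tokens: int = 8192):
        self.waiting = deque()          # tokenized requests not yet prefilled
        self.running = []               # requests currently generating tokens
        self.max_batch_tokens = max_batch_tokens

    def add(self, request):
        self.waiting.append(request)

    def next_batch(self):
        # Requests already running each contribute one token to this step.
        decode = list(self.running)
        budget = self.max_batch_tokens - len(decode)
        # Greedily admit waiting requests while their prompts fit the token budget.
        prefill = []
        while self.waiting and len(self.waiting[0].input_ids) <= budget:
            req = self.waiting.popleft()
            budget -= len(req.input_ids)
            prefill.append(req)
        self.running.extend(prefill)
        return prefill, decode
```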
4. Model Runner
Executes the actual neural network:
- Loads model weights
- Runs forward pass
- Samples next token
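A toy model runner, using a Hugging Face causal LM for illustration; production runners replace this with custom attention kernels, CUDA graphs, and batched sampling:

```python
import torch
from transformers import AutoModelForCausalLM

class ModelRunner:
    def __init__(self, model_name: str, device: str = "cuda"):
        # Load weights once, in half precision, onto the GPU
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16
        ).to(device).eval()
        self.device = device

    @torch.inference_mode()
    def step(self, input_ids: torch.Tensor, past_key_values=None, temperature: float = 1.0):
        # One forward pass: prefill if past_key_values is None, decode otherwise
        out = self.model(input_ids=input_ids.to(self.device),
                         past_key_values=past_key_values, use_cache=True)
        logits = out.logits[:, -1, :] / max(temperature, 1e-5)
        # Sample the next token from the softmax distribution
        next_token = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        return next_token, out.past_key_values
```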
5. Detokenizer Manager
Converts tokens back to text:
- Decodes token IDs to strings
- Handles streaming output
- Manages stop sequences
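One simple (not maximally efficient) way to implement streaming detokenization is to re-decode the accumulated output each step and emit only the new suffix, which avoids splitting multi-byte characters at token boundaries. A sketch, assuming a Hugging Face-style tokenizer:

```python
class DetokenizerManager:
    def __init__(self, tokenizer, stop_sequences=None):
        self.tokenizer = tokenizer
        self.stop_sequences = stop_sequences or []

    def stream_decode(self, output_ids: list[int], prev_text: str) -> tuple[str, bool]:
        # Decode everything generated so far, then emit only the newly added text
        text = self.tokenizer.decode(output_ids, skip_special_tokens=True)
        delta = text[len(prev_text):]
        # Check stop sequences against the accumulated text
        finished = any(stop in text for stop in self.stop_sequences)
        return delta, finished
```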
The Two Phases of Inference
LLM inference has two distinct phases:
Phase 1: Prefill (Prompt Processing)
Input: "What is the capital of France?"
[token_0, token_1, token_2, ..., token_n]
Output: KV cache for all tokens + first generated token
Compute: Parallelizable (all tokens at once)
Memory: Write n entries to KV cache
Phase 2: Decode (Token Generation)
Input: Previously generated token + KV cache
[token_i]
Output: Next token
Compute: Sequential (one token at a time)
Memory: Read from KV cache, write 1 entry
```
Time →
Prefill: [===================]   (process all prompt tokens)
                  ↓
Decode:  [=] [=] [=] [=] [=] [=] [=] [=] ...
          t₁  t₂  t₃  t₄  t₅  t₆  t₇  t₈
```
Key insight: prefill is compute-bound; decode is memory-bandwidth-bound, because each decode step re-reads the model weights and the KV cache to produce a single token.
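Using the toy `ModelRunner` sketched earlier (and assuming a GPU is available; "gpt2" stands in for any causal LM), the two phases look like this:

```python
import torch
from transformers import AutoTokenizer

# Assumes the ModelRunner class from the Model Runner section above
tokenizer = AutoTokenizer.from_pretrained("gpt2")
runner = ModelRunner("gpt2")

# Prefill: process the entire prompt in ONE forward pass; this fills the KV cache
prompt_ids = torch.tensor([tokenizer.encode("What is the capital of France?")])
next_token, kv_cache = runner.step(prompt_ids)                  # compute-bound

# Decode: ONE token per forward pass, reusing (and growing) the KV cache
generated = [int(next_token)]
for _ in range(32):
    next_token, kv_cache = runner.step(next_token, kv_cache)    # memory-bandwidth-bound
    generated.append(int(next_token))

print(tokenizer.decode(generated))
```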
Why Batching is Complicated
Training batches are simple: same sequence length, process together.
Inference batches are hard:
- Requests arrive at different times
- Different prompt lengths
- Different desired output lengths
- Some requests finish mid-batch
Continuous batching solves this:
```
Time →
Request A: [====prefill====][d][d][d][d][d][done]
Request B: [prefill][d][d][d][d][d][d][d][d]...
Request C:           [====prefill====][d][d]...

Batched execution:
[A+B prefill] [A+B decode] [A+B+C decode] [B+C decode] ...
```
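A minimal continuous-batching loop, assuming the toy `Scheduler`, `ModelRunner`, and `DetokenizerManager` sketched above (and their hypothetical request fields); a real engine runs all sequences in one batched forward pass rather than a Python loop:

```python
import torch

def serve_forever(scheduler, runner, detokenizer):
    """Toy continuous-batching loop: batch membership changes every step."""
    while True:
        prefill_reqs, decode_reqs = scheduler.next_batch()

        # Newly admitted requests: run their whole prompt once to fill the KV cache
        for req in prefill_reqs:
            ids = torch.tensor([req.input_ids])
            req.next_token, req.kv_cache = runner.step(ids)
            req.output_ids, req.emitted_text = [int(req.next_token)], ""

        # Already-running requests: advance each one by exactly one token
        for req in decode_reqs:
            req.next_token, req.kv_cache = runner.step(req.next_token, req.kv_cache)
            req.output_ids.append(int(req.next_token))
            delta, finished = detokenizer.stream_decode(req.output_ids, req.emitted_text)
            req.emitted_text += delta          # stream `delta` back to the client here
            if finished or len(req.output_ids) >= req.max_new_tokens:
                scheduler.running.remove(req)  # request leaves the batch mid-flight
```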
Memory: The Inference Bottleneck
For a 70B parameter model serving requests:
| Component | Memory |
|---|---|
| Model weights (FP16) | 140 GB |
| KV cache (per request) | ~2 GB for 32K context |
| Activations | ~1 GB |
With 140 GB of weights and only 80 GB of memory on a single GPU… we need tensor parallelism just to load the model!
And each request needs its own KV cache. Serving 100 concurrent requests at 32K context would need 200 GB just for KV cache!
This is why KV cache management (Chapter 9) is critical.
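The underlying arithmetic is simple enough to keep in a helper. The configurations below are hypothetical, chosen only to show how strongly the KV-head count (GQA/MQA) and the KV-cache dtype move the per-request footprint:

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    # Each layer stores a K and a V vector of size num_kv_heads * head_dim per token
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token

# Hypothetical configs, for illustration only (not any vendor's published spec)
mha_fp16 = kv_cache_bytes(32_768, num_layers=64, num_kv_heads=64, head_dim=128)
gqa_fp16 = kv_cache_bytes(32_768, num_layers=64, num_kv_heads=8,  head_dim=128)
gqa_fp8  = kv_cache_bytes(32_768, num_layers=64, num_kv_heads=8,  head_dim=128,
                          bytes_per_elem=1)
print(f"MHA/FP16: {mha_fp16/1e9:.0f} GB, GQA-8/FP16: {gqa_fp16/1e9:.0f} GB, "
      f"GQA-8/FP8: {gqa_fp8/1e9:.0f} GB per 32K-token request")
```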
Code Walkthrough
Script: minimal_inference_server.py
A simplified inference server showing the core components:
- Request queue management
- Simple batching
- Token-by-token generation
This isn’t production-ready but demonstrates the architecture.
Key Metrics
When evaluating inference servers:
| Metric | Definition | Target |
|---|---|---|
| TTFT | Time To First Token | < 500ms |
| ITL | Inter-Token Latency | < 50ms |
| Throughput | Tokens/second | Maximize |
| Concurrency | Simultaneous requests | Maximize |
Trade-offs:
- Higher concurrency → larger batches → better GPU utilization → higher throughput
- Higher concurrency → more KV cache → potential OOM
- Larger batches → higher latency per request
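TTFT and ITL are easy to measure from the client side against any OpenAI-compatible streaming endpoint; this is a rough sketch, and the URL and model name are placeholders:

```python
import time
import requests  # pip install requests

def measure_latency(url: str, prompt: str, model: str = "my-model"):
    """Rough client-side TTFT / ITL measurement against a streaming chat endpoint."""
    payload = {"model": model, "stream": True, "max_tokens": 128,
               "messages": [{"role": "user", "content": prompt}]}
    start = time.perf_counter()
    chunk_times = []
    with requests.post(url, json=payload, stream=True) as resp:
        for line in resp.iter_lines():
            # Each server-sent-event chunk roughly corresponds to one generated token
            if line.startswith(b"data: ") and line != b"data: [DONE]":
                chunk_times.append(time.perf_counter())
    ttft = chunk_times[0] - start
    itl = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    print(f"TTFT: {ttft * 1000:.0f} ms | mean ITL: {1000 * sum(itl) / len(itl):.1f} ms")

# measure_latency("http://localhost:8000/v1/chat/completions", "Tell me a joke.")
```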
Try It Yourself
Exercise 1: Trace a Request
Using the minimal_inference_server.py script:
- Add logging to each component
- Trace a single request through the system
- Measure time spent in each stage
Exercise 2: Measure Prefill vs Decode
Profile inference to measure:
- Time for prefill (prompt processing)
- Time per decode step
- How does prompt length affect prefill time?
- How does batch size affect decode time?
Exercise 3: Calculate KV Cache Size
For a model with:
- 32 layers
- 8192 hidden dimension
- 128 heads
- 32K max sequence length
Calculate:
- KV cache size per token
- KV cache size for one 32K request
- Max concurrent requests with 80 GB memory (after model weights)
Key Takeaways
- Inference is a pipeline: multiple stages, and any of them can become the bottleneck
- Prefill vs decode: different compute characteristics, different optimizations
- Memory dominates: the KV cache, not compute, usually limits concurrency
- Batching is complex: continuous batching is what enables high throughput
- Latency matters: unlike training, users are actively waiting
The Inference Optimization Hierarchy
```
Level 0: Model works (correctness)
   ↓
Level 1: Model fits in memory (quantization, TP)
   ↓
Level 2: Efficient memory management (KV cache, paging)
   ↓
Level 3: Efficient batching (continuous batching)
   ↓
Level 4: Kernel optimizations (FlashAttention, CUDA graphs)
   ↓
Level 5: Speculative decoding (draft models)
```
What’s Next?
In Chapter 9, we’ll dive deep into KV Cache Management: how systems like PagedAttention and RadixCache enable serving many concurrent requests efficiently.