Chapter 9: KV Cache Management
“In LLM inference, memory is the new compute. And KV cache is the memory hog.”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why KV cache exists and how it accelerates inference
- Calculate KV cache size for different models and contexts
- Understand PagedAttention and its benefits
- Explain how RadixCache enables prefix sharing
Prerequisites
- Completed Chapter 8 (Server Anatomy)
- Understanding of transformer attention mechanism
- Basic knowledge of memory management
Concept Overview
Why KV Cache?
In transformers, each attention layer computes:
Attention(Q, K, V) = softmax(QK^T / √d) V
During generation, we produce one token at a time. Without caching:
- Token 1: Compute K, V for position 0
- Token 2: Compute K, V for positions 0, 1 (recompute position 0!)
- Token 3: Compute K, V for positions 0, 1, 2 (recompute again!)
- …
- Token N: O(N²) total computations!
With KV cache:
- Token 1: Compute K₀, V₀, store in cache
- Token 2: Compute K₁, V₁, concatenate with cached K₀, V₀
- Token 3: Compute K₂, V₂, concatenate with cached
- …
- Token N: O(N) computations
KV cache trades memory for compute.
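To make this concrete, here is a minimal NumPy sketch of a decode loop that caches K and V (single head, no batching; the projection matrices and dimensions are illustrative, not taken from any real model):

```python
import numpy as np

def attention(q, K, V):
    """Single-query attention: softmax(q·K^T / sqrt(d)) · V."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)              # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 64
W_k = np.random.randn(d, d)                  # toy K/V projection matrices
W_v = np.random.randn(d, d)

K_cache, V_cache = [], []                    # the KV cache
for step in range(8):
    x = np.random.randn(d)                   # hidden state of the newest token
    K_cache.append(x @ W_k)                  # compute K, V for this token ONLY
    V_cache.append(x @ W_v)
    # x plays the role of the query here (query projection omitted for brevity)
    out = attention(x, np.stack(K_cache), np.stack(V_cache))
```

Each iteration projects K and V only for the newest token and reads the rest from the cache; without the cache, every step would re-run the K/V projections for all earlier positions.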
KV Cache Size Calculation
For each token, we store K and V for every layer:
KV per token = 2 × num_layers × num_kv_heads × head_dim × dtype_size
The factor of 2 covers K and V. With grouped-query attention (GQA), it is the number of KV heads that matters, not the number of query heads.
Example (LLaMA-70B):
Layers: 80
KV heads: 8 (the model has 64 query heads, but with GQA only the 8 KV heads are cached)
Head dim: 128
Dtype: FP16 (2 bytes)
KV per token = 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 320 KB
For a 32K-token context:
KV per request = 320 KB × 32,768 tokens ≈ 10 GB
A single request needs 10 GB of KV cache! This is why memory management is critical.
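You can sanity-check these numbers with a few lines of Python (the function and parameter names here are just for illustration):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # factor of 2 = one K vector plus one V vector per layer
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_token)                      # 327680 bytes ≈ 320 KB
print(per_token * 32_768 / 2**30)     # 10.0 GiB for one 32K-token request
```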
The Memory Fragmentation Problem
Traditional approach: Pre-allocate maximum context length per request.
Request A (needs 1K): [■■■■□□□□□□□□□□□□□□□□□□□□□□□□□□□□] 32K allocated
Request B (needs 2K): [■■■■■■■■□□□□□□□□□□□□□□□□□□□□□□□□] 32K allocated
Request C (needs 1K): [■■■■□□□□□□□□□□□□□□□□□□□□□□□□□□□□] 32K allocated
Total allocated: 96K tokens worth of memory
Actually used: 4K tokens
Waste: 96%!
This is internal fragmentation—memory reserved but unused.
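A quick back-of-envelope check of that waste figure, assuming the three requests above and a 32K pre-allocation:

```python
max_ctx = 32_768                        # tokens pre-allocated per request
used = [1_024, 2_048, 1_024]            # tokens actually used by A, B, C

allocated = max_ctx * len(used)         # 98,304 token slots reserved
waste = 1 - sum(used) / allocated
print(f"waste: {waste:.0%}")            # waste: 96%
```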
PagedAttention: The Solution
PagedAttention (from vLLM) applies OS-style virtual memory to KV cache:
Physical Memory (Pages):
[Page 0][Page 1][Page 2][Page 3][Page 4][Page 5][Page 6][Page 7]
Request A (logical view):
  [Tokens 0-255]   → Page 2 (physical)
  [Tokens 256-511] → Page 5 (physical)

Request B (logical view):
  [Tokens 0-255]   → Page 0 (physical)
  [Tokens 256-511] → Page 3 (physical)
  [Tokens 512-767] → Page 7 (physical)
Key insight: Allocate physical pages only when needed. Different requests can share the same physical memory pool.
Benefits:
- Near-zero fragmentation
- Memory utilization > 95%
- More concurrent requests
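Below is a toy sketch of the idea, not vLLM's actual implementation: pages (vLLM calls them blocks) come from a shared free list and are handed out only when a request crosses a page boundary. Class and method names are invented for this example.

```python
PAGE_SIZE = 256  # tokens per page

class PagedKVAllocator:
    """Toy allocator: a request gets a new physical page only when it
    actually crosses a page boundary, so unused context is never reserved."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_table = {}    # request_id -> list of physical page ids
        self.num_tokens = {}    # request_id -> tokens stored so far

    def append_token(self, request_id):
        n = self.num_tokens.get(request_id, 0)
        if n % PAGE_SIZE == 0:  # starting a new page
            self.page_table.setdefault(request_id, []).append(self.free_pages.pop())
        self.num_tokens[request_id] = n + 1

    def slot(self, request_id, position):
        """Translate a logical token position to (physical_page, offset)."""
        page = self.page_table[request_id][position // PAGE_SIZE]
        return page, position % PAGE_SIZE

alloc = PagedKVAllocator(num_pages=8)
for _ in range(300):
    alloc.append_token("request_A")      # 300 tokens -> only 2 physical pages
print(alloc.page_table["request_A"])     # e.g. [7, 6]
```

Because pages are fixed-size and drawn from a shared pool, the only waste is the unused tail of each request's last page: at most PAGE_SIZE − 1 tokens per request.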
The Three-Level KV Cache Hierarchy
Modern systems like SGLang use a three-level structure:
Level 1: RadixCache (Logical)
    A radix tree of token sequences:

        [root]
        ├── "What is"
        │   └── "the capital"
        └── "Tell me"
            └── "a joke"

    Purpose: Detect prefix sharing opportunities.

Level 2: ReqToTokenPool (Mapping)
    (request_id, token_position) → physical_slot_index

        (req_0, pos_0) → slot_42
        (req_0, pos_1) → slot_17
        (req_1, pos_0) → slot_42   ← Same slot! Prefix sharing!

    Purpose: Map logical token positions to physical memory slots.

Level 3: TokenToKVPool (Physical GPU Memory)
    slot_index → actual K, V tensors on GPU

        Slot 0: [K tensor][V tensor]
        Slot 1: [K tensor][V tensor]
        ...
        Slot N: [K tensor][V tensor]

    Purpose: Store the actual KV values.
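A highly simplified sketch of levels 2 and 3 (the names follow the terminology above, but the data structures here are purely illustrative, with tiny dimensions so the demo runs anywhere):

```python
import numpy as np

# Tiny dimensions for the demo; a real pool sizes these from the model
# config and the available GPU memory.
NUM_SLOTS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 256, 4, 8, 128

# Level 3: TokenToKVPool — actual K/V storage, one slot per cached token.
k_pool = np.zeros((NUM_SLOTS, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM), dtype=np.float16)
v_pool = np.zeros_like(k_pool)

# Level 2: ReqToTokenPool — maps (request_id, position) to a physical slot.
req_to_token = {}

req_to_token[("req_0", 0)] = 42
req_to_token[("req_0", 1)] = 17
req_to_token[("req_1", 0)] = 42   # same slot as req_0's first token: prefix sharing

# Reading req_1's first K vectors just indexes into the shared slot:
k_shared = k_pool[req_to_token[("req_1", 0)]]   # shape (4, 8, 128)
```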
RadixCache: Automatic Prefix Sharing
RadixCache enables automatic prefix caching. If multiple requests share a prompt prefix, they share KV cache:
Request 1: "What is the capital of France?"
Request 2: "What is the capital of Germany?"
Request 3: "What is the largest planet?"
Radix Tree:

    [root]
      └── "What is the"            (shared by all 3!)
            ├── "capital of"
            │     ├── "France"     → Request 1
            │     └── "Germany"    → Request 2
            └── "largest planet"   → Request 3
Memory savings: "What is the" KV cache stored ONCE, used by 3 requests!
This is huge for:
- System prompts (shared across all requests)
- Few-shot examples
- Chat history prefixes
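Here is a toy version of the idea. A real RadixCache compresses runs of tokens into radix-tree edges and tracks KV slots and reference counts, but even a plain one-token-per-node trie shows how reuse is detected (the token IDs below are made up):

```python
class PrefixCacheNode:
    def __init__(self):
        self.children = {}      # token_id -> PrefixCacheNode
        self.kv_slot = None     # where this token's K/V would live

root = PrefixCacheNode()

def insert(tokens):
    """Cache a token sequence; returns how many tokens were already cached."""
    node, reused = root, 0
    for tok in tokens:
        if tok in node.children:
            reused += 1                          # K/V already cached: reuse it
        else:
            node.children[tok] = PrefixCacheNode()
        node = node.children[tok]
    return reused

insert([1, 2, 3, 4, 5])          # "What is the capital of France?"  -> 0 reused
print(insert([1, 2, 3, 4, 6]))   # "... of Germany?"      -> 4 tokens reused
print(insert([1, 2, 3, 7, 8]))   # "... largest planet?"  -> 3 tokens reused
```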
Cache Eviction: LRU with Reference Counting
When memory is full, we need to evict cached entries. RadixCache uses:
- Reference counting: Don’t evict entries in use
- LRU (Least Recently Used): Evict oldest unused entries first
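A minimal sketch of that eviction policy (names are illustrative; real implementations evict at radix-tree-node granularity rather than per entry):

```python
import time

class CacheEntry:
    def __init__(self):
        self.ref_count = 0                     # requests currently using this entry
        self.last_access = time.monotonic()

def pick_eviction_victims(entries, num_needed):
    """Evict only unreferenced entries, least recently used first."""
    evictable = [(e.last_access, key) for key, e in entries.items() if e.ref_count == 0]
    evictable.sort()                           # oldest access time first
    return [key for _, key in evictable[:num_needed]]
```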
Code Walkthrough
Script 1: kv_cache_calculator.py
Calculates KV cache sizes for different model configurations:
- Shows memory requirements per token, per request
- Estimates concurrent request capacity
- Compares with and without paging
Script 2: prefix_sharing_demo.py
Demonstrates how prefix sharing works:
- Shows memory savings from shared prefixes
- Visualizes the radix tree structure
- Calculates sharing efficiency
Memory Budget Planning
For a 70B model on 8× H100 (640 GB total):
| Component | Memory |
|---|---|
| Model weights (FP16) | 140 GB |
| Runtime overhead (CUDA context, activations, etc.) | ~20 GB |
| Available for KV cache | ~480 GB |
With 320 KB per token per request:
- Max tokens in cache: 480 GB / 320 KB = 1.5M tokens
- At 4K avg context: ~375 concurrent requests
- At 32K context: ~47 concurrent requests
This is why context length dramatically affects capacity!
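The same budgeting arithmetic as a small helper (numbers from the table above; using exact byte counts gives slightly lower figures than the rounded 1.5M tokens / ~375 requests):

```python
def kv_budget(total_gb, weights_gb, overhead_gb, kv_bytes_per_token, avg_context):
    free_gb = total_gb - weights_gb - overhead_gb
    max_tokens = free_gb * 1e9 / kv_bytes_per_token
    return free_gb, int(max_tokens), int(max_tokens / avg_context)

# 8x H100 serving a 70B model in FP16
print(kv_budget(total_gb=640, weights_gb=140, overhead_gb=20,
                kv_bytes_per_token=327_680, avg_context=4_096))
# (480, 1464843, 357) — close to the rounded estimates above
```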
Try It Yourself
Exercise 1: Calculate Your Model’s KV Cache
For your favorite model (LLaMA, Mistral, etc.):
- Find: num_layers, num_kv_heads, head_dim
- Calculate: KV bytes per token
- Calculate: Max requests at 8K context with 80GB GPU
Exercise 2: Measure Prefix Sharing Savings
Design a benchmark:
- Create 100 requests with shared system prompt
- Calculate memory with individual caching
- Calculate memory with prefix sharing
- What’s the savings percentage?
Exercise 3: Implement Simple LRU Cache
Implement a basic LRU cache for KV entries:
- Fixed capacity
- Reference counting
- Eviction when full
Key Takeaways
- KV cache is massive - Often larger than model weights for long contexts
- Fragmentation wastes memory - Pre-allocation is inefficient
- PagedAttention solves fragmentation - Near-100% memory utilization
- Prefix sharing saves memory - Especially for system prompts
- Memory limits concurrency - More memory = more concurrent requests
Trade-offs
| Approach | Pros | Cons |
|---|---|---|
| Pre-allocation | Simple, no overhead | Massive fragmentation |
| PagedAttention | Low fragmentation | Page table overhead |
| RadixCache | Prefix sharing | Tree management overhead |
| Quantized KV | Less memory | Slight quality loss |
What’s Next?
In Chapter 10, we’ll explore Advanced Scheduling and CUDA Graphs—how to hide scheduling overhead and maximize GPU utilization.