Chapter 11: Speculative and Constraint Decoding
“Predict multiple tokens, verify in parallel. It’s like spell-checking as you type, but for LLMs.”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain how speculative decoding accelerates generation
- Understand the acceptance/rejection mechanism
- Use constraint decoding for structured output (JSON, code)
- Choose when to apply these techniques
Prerequisites
- Completed Chapters 8-10 (Inference Systems)
- Understanding of autoregressive generation
- Basic probability concepts
Concept Overview
The Problem: Sequential Generation
Autoregressive LLMs generate one token at a time:
Prompt: "The capital of France is"
Step 1: → "Paris"
Step 2: → "."
Step 3: → " It"
Step 4: → " is"
...
Each step requires a full model forward pass, as the decode-loop sketch below makes explicit. For a 70B model generating 100 tokens:
- 100 strictly sequential forward passes
- No parallelism within a single request (token t+1 depends on token t)
- Decode is memory-bandwidth bound: each pass streams all model weights to produce a single token
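To make that concrete, here is a minimal greedy decode loop. It assumes a Hugging Face-style causal LM whose forward pass returns `.logits`; the names are illustrative, not tied to any specific library version.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=100):
    """Naive autoregressive decoding: one full forward pass per token."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits              # full forward pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # Step t+1 cannot start until step t finishes: the loop is
        # inherently sequential within this one request.
    return input_ids
```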
Speculative Decoding: The Key Insight
What if we could verify multiple tokens at once?
Speculative decoding uses a small “draft” model to guess multiple tokens, then verifies them in parallel with the large “target” model.
Draft model (small, fast):
Input: "The capital of France is"
Draft: ["Paris", ".", " It", " is", " known"]
(5 tokens in one pass)
Target model (large, accurate):
Verify all 5 tokens in ONE parallel forward pass
Accept: ["Paris", ".", " It"] (3 accepted)
Reject: [" is", " known"] (distribution mismatch)
Result: 4 tokens (3 accepted, plus 1 corrected token sampled at the rejection point) from a single target pass, instead of 4 sequential passes!
How Verification Works
The target model doesn’t just check “right or wrong”; it applies a probabilistic acceptance criterion (a form of rejection sampling):
For each drafted position i:
    p_target = target model probability of token_i
    p_draft  = draft model probability of token_i

    If p_target >= p_draft:
        ACCEPT (the target likes the token at least as much as the draft)
    Else:
        ACCEPT with probability p_target / p_draft

    If REJECTED:
        Sample a replacement token from the residual distribution
        norm(max(0, p_target - p_draft)), then stop accepting further tokens
This scheme guarantees that the output distribution exactly matches sampling from the target model alone!
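Here is a minimal sketch of that acceptance test for a single drafted token, assuming `p_target` and `p_draft` are full probability vectors over the vocabulary at that position (illustrative code, not any particular library’s API):

```python
import torch

def verify_token(token_id, p_target, p_draft):
    """Accept/reject one drafted token; return (accepted, replacement_or_None).

    p_target, p_draft: probability vectors over the vocabulary at this position.
    """
    ratio = p_target[token_id] / p_draft[token_id]
    if torch.rand(()) < min(ratio, 1.0):
        return True, None  # accepted; move on to the next drafted position
    # Rejected: resample from the normalized residual distribution
    # max(0, p_target - p_draft), which corrects for the draft's bias.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    replacement = torch.multinomial(residual, num_samples=1).item()
    return False, replacement
```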
Speedup Analysis
Let:
- γ = acceptance rate (typically 0.7-0.9)
- k = draft length (tokens generated by draft)
- c = cost ratio (target_time / draft_time, typically 10-50x)
Expected tokens per target forward pass:
E[tokens] = 1 + γ + γ² + ... + γ^k = (1 - γ^(k+1)) / (1 - γ)
For γ=0.8, k=5:
E[tokens] = (1 - 0.8^6) / (1 - 0.8) ≈ 3.69 tokens per pass
(The leading 1 in the sum is the token the target pass always contributes: a corrected sample on rejection, or a bonus token when all drafts are accepted.)
Roughly a 3.7x theoretical speedup! Accounting for draft cost, the net speedup is E[tokens] / (1 + k/c).
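To sanity-check the arithmetic and fold in the draft overhead defined by c above, here is a small calculator; the cycle-cost model 1 + k/c is a standard back-of-the-envelope approximation that assumes one target pass verifies all k drafted tokens:

```python
def speculative_speedup(gamma: float, k: int, c: float) -> tuple[float, float]:
    """Expected tokens per target pass and net speedup.

    gamma: per-token acceptance rate
    k:     draft length (tokens per speculative cycle)
    c:     cost ratio, target_time / draft_time
    """
    expected_tokens = (1 - gamma ** (k + 1)) / (1 - gamma)
    # One cycle costs k draft passes plus one target pass,
    # measured in units of one target pass:
    cycle_cost = 1 + k / c
    return expected_tokens, expected_tokens / cycle_cost

tokens, speedup = speculative_speedup(gamma=0.8, k=5, c=20)
print(f"{tokens:.2f} tokens/pass, {speedup:.2f}x net speedup")
# -> 3.69 tokens/pass, 2.95x net speedup
```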
Draft Model Selection
Good draft models:
- Use the same tokenizer as the target (required, so drafted token ids line up with the target’s vocabulary)
- Were trained on similar data, which raises the acceptance rate
- Are much smaller (e.g., a 7B draft for a 70B target)
Common pairings:
- LLaMA-70B target + LLaMA-7B draft
- GPT-4 target + GPT-3.5 draft
- Mixtral target + Mistral draft
Constraint Decoding: Structured Output
Sometimes we need output to follow a specific format:
- JSON schema
- SQL query
- Function calls
- Code in specific language
Constraint decoding restricts token probabilities to only valid continuations.
Grammar-Based Constraints
Define valid output using a grammar:
json_value := object | array | string | number | "true" | "false" | "null"
object := "{" (pair ("," pair)*)? "}"
pair := string ":" json_value
...
At each generation step:
- Get logits from model
- Identify tokens that lead to valid states
- Mask invalid tokens (set probability to 0)
- Sample from valid tokens only
import torch

def constrained_sample(logits, grammar_state):
    # Ask the grammar which token ids are valid in the current state
    valid_tokens = grammar_state.get_valid_tokens()

    # Mask invalid tokens by setting their logits to -inf.
    # (Note: multiplying by a 0/1 mask would produce 0 * -inf = NaN,
    # so we use masked_fill instead.)
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask[valid_tokens] = True
    logits = logits.masked_fill(~mask, float('-inf'))

    # Sample from the renormalized distribution over valid tokens only
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
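For illustration, here is a toy stand-in for `grammar_state` (hypothetical, not a real library): at the very first token of a JSON value, only tokens that can begin a value are valid.

```python
class ToyJsonStartState:
    """Hypothetical grammar state for the first token of a JSON value.

    Assumes a toy vocabulary where each listed string is exactly one token;
    real implementations must handle tokens that span grammar boundaries.
    """
    def __init__(self, vocab):  # vocab: dict mapping token string -> id
        openers = ['{', '[', '"', 'true', 'false', 'null', '-',
                   '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
        self.valid = [vocab[t] for t in openers if t in vocab]

    def get_valid_tokens(self):
        return self.valid
```

In practice, libraries such as Outlines or llama.cpp’s GBNF grammar support precompile the grammar against the tokenizer, so this valid-token lookup becomes a cheap automaton transition at each step.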
Regex Constraints
For simpler patterns, regex constraints work well:
# Only generate valid email addresses
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# At each step, check if current output + candidate token
# can still match the pattern
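One way to implement that check is with the third-party `regex` package, whose `fullmatch(..., partial=True)` reports whether a string is a prefix of some possible match (assumed available here; the standard-library `re` module has no partial matching):

```python
import regex  # third-party package: pip install regex

pattern = regex.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_viable(generated_text: str, candidate_token: str) -> bool:
    """True if generated_text + candidate_token can still become a full match."""
    m = pattern.fullmatch(generated_text + candidate_token, partial=True)
    return m is not None

# At each step, filter the vocabulary down to viable continuations:
# valid_ids = [i for i, tok in enumerate(vocab_strings)
#              if is_viable(output_so_far, tok)]
```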
Combining Speculative + Constraint Decoding
Can we get both speedup AND structured output? Yes!
The draft model also generates under constraints:
- Draft generates k constrained tokens
- Target verifies (also checking constraints)
- All accepted tokens are guaranteed valid
Tricky part: the draft must track exactly the same constraint state as the target, and that state must be rolled back whenever a drafted token is rejected. A sketch of the combined loop:
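This sketch reuses the `verify_token` helper from earlier; `draft_step`, `target_verify`, and the grammar-state API are hypothetical placeholders for machinery a real system would fuse with KV-cache management.

```python
def speculative_constrained_generate(draft, target, grammar, prompt_ids, k=5):
    """Sketch: speculative decoding where draft AND target obey the grammar."""
    output = list(prompt_ids)
    state = grammar.initial_state()
    while not state.is_complete():
        # 1. Draft k tokens, masking invalid tokens at every draft step;
        #    keep the grammar state reached after each drafted token.
        draft_tokens, draft_probs, states = draft_step(draft, output, state, k)
        # 2. One target forward pass scores all k positions, with the same
        #    masks applied, so target probabilities cover valid tokens only.
        target_probs = target_verify(target, output, draft_tokens, states)
        # 3. Accept left to right; on the first rejection, resample from the
        #    residual distribution and re-advance the grammar state.
        for i, tok in enumerate(draft_tokens):
            accepted, replacement = verify_token(tok, target_probs[i],
                                                 draft_probs[i])
            if accepted:
                output.append(tok)
                state = states[i]          # state after accepting token i
            else:
                output.append(replacement)
                state = state.advance(replacement)  # roll back + re-advance
                break
    return output
```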
Code Walkthrough
Script 1: speculative_demo.py
Demonstrates speculative decoding:
- Simulates draft/target model interaction
- Shows acceptance/rejection process
- Calculates speedup
Script 2: json_constraint_demo.py
Demonstrates constraint decoding:
- Simple JSON schema
- Token masking
- Valid output generation
When to Use What
| Technique | Best For | Avoid When |
|---|---|---|
| Speculative | Long generations, high acceptance rate | Very different draft/target, short outputs |
| Constraint | Structured output, API responses | Free-form text |
| Combined | Long structured generations | Complex grammars + low acceptance rate |
Try It Yourself
Exercise 1: Calculate Speedup
For a system with:
- Acceptance rate: 75%
- Draft length: 4 tokens
- Draft cost: 5% of target cost
Calculate:
- Expected tokens per target pass
- Overall speedup including draft cost
Exercise 2: Design a Grammar
Write a simple grammar for:
- Python function definitions
- Email addresses
- Phone numbers
Exercise 3: Acceptance Rate Experiment
If you had access to models:
- Measure acceptance rate for different draft lengths
- Find the optimal draft length
- How does temperature affect acceptance?
Key Takeaways
- Speculative decoding parallelizes verification - Multiple tokens checked in one forward pass
- Acceptance criterion preserves distribution - Output is identical to non-speculative
- Draft model selection matters - Same tokenizer, similar distribution
- Constraint decoding ensures validity - Grammar-based token masking
- Both can combine - Speedup + structure
The Trade-off Triangle
         Latency
            /\
           /  \
          /    \
         /      \
        /________\
  Quality        Structure
- Speculative decoding: Latency ↓, Quality = (output distribution unchanged), Structure =
- Constraint decoding: Latency ↑ slightly (per-step masking overhead), Quality ≈, Structure ↑
- Combined: Latency ↓, Quality ≈, Structure ↑
What’s Next?
In Part IV, we’ll explore RLHF Systems—how to train LLMs with human feedback, including the complex multi-model orchestration required for PPO training.