Chapter 11: Speculative and Constraint Decoding
“Predict multiple tokens, verify in parallel. It’s like spell-checking as you type, but for LLMs.”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain how speculative decoding accelerates generation
- Understand the acceptance/rejection mechanism
- Use constraint decoding for structured output (JSON, code)
- Choose when to apply these techniques
Prerequisites
- Completed Chapters 8-10 (Inference Systems)
- Understanding of autoregressive generation
- Basic probability concepts
Concept Overview
The Problem: Sequential Generation
Autoregressive LLMs generate one token at a time:
Prompt: "The capital of France is"
Step 1: → "Paris"
Step 2: → "."
Step 3: → " It"
Step 4: → " is"
...
Each step requires a full model forward pass, as the decode-loop sketch below makes explicit. For a 70B model generating 100 tokens:
- 100 strictly sequential forward passes
- No parallelism within a single request (token t+1 depends on token t)
- Decode is memory-bandwidth bound: each pass streams all model weights to produce a single token
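To make that concrete, here is a minimal greedy decode loop. It assumes a Hugging Face-style causal LM whose forward pass returns `.logits`; the names are illustrative, not tied to any specific library version.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=100):
    """Naive autoregressive decoding: one full forward pass per token."""
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits              # full forward pass
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        # Step t+1 cannot start until step t finishes: the loop is
        # inherently sequential within this one request.
    return input_ids
```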
Speculative Decoding: The Key Insight
What if we could verify multiple tokens at once?
Speculative decoding uses a small “draft” model to guess multiple tokens, then verifies them in parallel with the large “target” model.
Draft model (small, fast):
Input: "The capital of France is"
Draft: ["Paris", ".", " It", " is", " known"]
(5 tokens in one pass)
Target model (large, accurate):
Verify all 5 tokens in ONE parallel forward pass
Accept: ["Paris", ".", " It"] (3 accepted)
Reject: [" is", " known"] (distribution mismatch)
Result: 4 tokens (3 accepted, plus 1 corrected token sampled at the rejection point) from a single target pass, instead of 4 sequential passes!
How Verification Works
The target model doesn’t just check “right or wrong”; it applies a probabilistic acceptance criterion (a form of rejection sampling):
For each drafted position i:
    p_target = target model probability of token_i
    p_draft  = draft model probability of token_i

    If p_target >= p_draft:
        ACCEPT (the target likes the token at least as much as the draft)
    Else:
        ACCEPT with probability p_target / p_draft

    If REJECTED:
        Sample a replacement token from the residual distribution
        norm(max(0, p_target - p_draft)), then stop accepting further tokens
This scheme guarantees that the output distribution exactly matches sampling from the target model alone!
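Here is a minimal sketch of that acceptance test for a single drafted token, assuming `p_target` and `p_draft` are full probability vectors over the vocabulary at that position (illustrative code, not any particular library’s API):

```python
import torch

def verify_token(token_id, p_target, p_draft):
    """Accept/reject one drafted token; return (accepted, replacement_or_None).

    p_target, p_draft: probability vectors over the vocabulary at this position.
    """
    ratio = p_target[token_id] / p_draft[token_id]
    if torch.rand(()) < min(ratio, 1.0):
        return True, None  # accepted; move on to the next drafted position
    # Rejected: resample from the normalized residual distribution
    # max(0, p_target - p_draft), which corrects for the draft's bias.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    replacement = torch.multinomial(residual, num_samples=1).item()
    return False, replacement
```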
Speedup Analysis
Let:
- γ = acceptance rate (typically 0.7-0.9)
- k = draft length (tokens generated by draft)
- c = cost ratio (target_time / draft_time, typically 10-50x)
Expected tokens per target forward pass:
E[tokens] = 1 + γ + γ² + ... + γ^k = (1 - γ^(k+1)) / (1 - γ)
For γ=0.8, k=5:
E[tokens] = (1 - 0.8^6) / (1 - 0.8) ≈ 3.69 tokens per pass
(The leading 1 in the sum is the token the target pass always contributes: a corrected sample on rejection, or a bonus token when all drafts are accepted.)
Roughly a 3.7x theoretical speedup! Accounting for draft cost, the net speedup is E[tokens] / (1 + k/c).
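To sanity-check the arithmetic and fold in the draft overhead defined by c above, here is a small calculator; the cycle-cost model 1 + k/c is a standard back-of-the-envelope approximation that assumes one target pass verifies all k drafted tokens:

```python
def speculative_speedup(gamma: float, k: int, c: float) -> tuple[float, float]:
    """Expected tokens per target pass and net speedup.

    gamma: per-token acceptance rate
    k:     draft length (tokens per speculative cycle)
    c:     cost ratio, target_time / draft_time
    """
    expected_tokens = (1 - gamma ** (k + 1)) / (1 - gamma)
    # One cycle costs k draft passes plus one target pass,
    # measured in units of one target pass:
    cycle_cost = 1 + k / c
    return expected_tokens, expected_tokens / cycle_cost

tokens, speedup = speculative_speedup(gamma=0.8, k=5, c=20)
print(f"{tokens:.2f} tokens/pass, {speedup:.2f}x net speedup")
# -> 3.69 tokens/pass, 2.95x net speedup
```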
Draft Model Selection
Good draft models:
- Use the same tokenizer as the target (required, so drafted token ids line up with the target’s vocabulary)
- Were trained on similar data, which raises the acceptance rate
- Are much smaller (e.g., a 7B draft for a 70B target)
Common pairings:
- LLaMA-70B target + LLaMA-7B draft
- GPT-4 target + GPT-3.5 draft
- Mixtral target + Mistral draft
Constraint Decoding: Structured Output
Sometimes we need output to follow a specific format:
- JSON schema
- SQL query
- Function calls
- Code in specific language
Constraint decoding restricts token probabilities to only valid continuations.
Grammar-Based Constraints
Define valid output using a grammar:
json_value := object | array | string | number | "true" | "false" | "null"
object := "{" (pair ("," pair)*)? "}"
pair := string ":" json_value
...
At each generation step:
- Get logits from model
- Identify tokens that lead to valid states
- Mask invalid tokens (set probability to 0)
- Sample from valid tokens only
import torch

def constrained_sample(logits, grammar_state):
    # Ask the grammar which token ids are valid in the current state
    valid_tokens = grammar_state.get_valid_tokens()

    # Mask invalid tokens by setting their logits to -inf.
    # (Note: multiplying by a 0/1 mask would produce 0 * -inf = NaN,
    # so we use masked_fill instead.)
    mask = torch.zeros_like(logits, dtype=torch.bool)
    mask[valid_tokens] = True
    logits = logits.masked_fill(~mask, float('-inf'))

    # Sample from the renormalized distribution over valid tokens only
    return torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
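For illustration, here is a toy stand-in for `grammar_state` (hypothetical, not a real library): at the very first token of a JSON value, only tokens that can begin a value are valid.

```python
class ToyJsonStartState:
    """Hypothetical grammar state for the first token of a JSON value.

    Assumes a toy vocabulary where each listed string is exactly one token;
    real implementations must handle tokens that span grammar boundaries.
    """
    def __init__(self, vocab):  # vocab: dict mapping token string -> id
        openers = ['{', '[', '"', 'true', 'false', 'null', '-',
                   '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
        self.valid = [vocab[t] for t in openers if t in vocab]

    def get_valid_tokens(self):
        return self.valid
```

In practice, libraries such as Outlines or llama.cpp’s GBNF grammar support precompile the grammar against the tokenizer, so this valid-token lookup becomes a cheap automaton transition at each step.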
Regex Constraints
For simpler patterns, regex constraints work well:
# Only generate valid email addresses
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# At each step, check if current output + candidate token
# can still match the pattern
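One way to implement that check is with the third-party `regex` package, whose `fullmatch(..., partial=True)` reports whether a string is a prefix of some possible match (assumed available here; the standard-library `re` module has no partial matching):

```python
import regex  # third-party package: pip install regex

pattern = regex.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_viable(generated_text: str, candidate_token: str) -> bool:
    """True if generated_text + candidate_token can still become a full match."""
    m = pattern.fullmatch(generated_text + candidate_token, partial=True)
    return m is not None

# At each step, filter the vocabulary down to viable continuations:
# valid_ids = [i for i, tok in enumerate(vocab_strings)
#              if is_viable(output_so_far, tok)]
```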
Combining Speculative + Constraint Decoding
Can we get both speedup AND structured output? Yes!
The draft model also generates under constraints:
- Draft generates k constrained tokens
- Target verifies (also checking constraints)
- All accepted tokens are guaranteed valid
Tricky part: the draft must track exactly the same constraint state as the target, and that state must be rolled back whenever a drafted token is rejected. A sketch of the combined loop:
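This sketch reuses the `verify_token` helper from earlier; `draft_step`, `target_verify`, and the grammar-state API are hypothetical placeholders for machinery a real system would fuse with KV-cache management.

```python
def speculative_constrained_generate(draft, target, grammar, prompt_ids, k=5):
    """Sketch: speculative decoding where draft AND target obey the grammar."""
    output = list(prompt_ids)
    state = grammar.initial_state()
    while not state.is_complete():
        # 1. Draft k tokens, masking invalid tokens at every draft step;
        #    keep the grammar state reached after each drafted token.
        draft_tokens, draft_probs, states = draft_step(draft, output, state, k)
        # 2. One target forward pass scores all k positions, with the same
        #    masks applied, so target probabilities cover valid tokens only.
        target_probs = target_verify(target, output, draft_tokens, states)
        # 3. Accept left to right; on the first rejection, resample from the
        #    residual distribution and re-advance the grammar state.
        for i, tok in enumerate(draft_tokens):
            accepted, replacement = verify_token(tok, target_probs[i],
                                                 draft_probs[i])
            if accepted:
                output.append(tok)
                state = states[i]          # state after accepting token i
            else:
                output.append(replacement)
                state = state.advance(replacement)  # roll back + re-advance
                break
    return output
```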
Code Walkthrough
Script 1: speculative_demo.py
Demonstrates speculative decoding:
- Simulates draft/target model interaction
- Shows acceptance/rejection process
- Calculates speedup
Script 2: json_constraint_demo.py
Demonstrates constraint decoding:
- Simple JSON schema
- Token masking
- Valid output generation
When to Use What
| Technique | Best For | Avoid When |
|---|---|---|
| Speculative | Long generations, high acceptance rate | Very different draft/target, short outputs |
| Constraint | Structured output, API responses | Free-form text |
| Combined | Long structured generations | Complex grammars + low acceptance rate |
Try It Yourself
Exercise 1: Calculate Speedup
For a system with:
- Acceptance rate: 75%
- Draft length: 4 tokens
- Draft cost: 5% of target cost
Calculate:
- Expected tokens per target pass
- Overall speedup including draft cost
Exercise 2: Design a Grammar
Write a simple grammar for:
- Python function definitions
- Email addresses
- Phone numbers
Exercise 3: Acceptance Rate Experiment
If you had access to models:
- Measure acceptance rate for different draft lengths
- Find the optimal draft length
- How does temperature affect acceptance?
Key Takeaways
- Speculative decoding parallelizes verification - Multiple tokens checked in one forward pass
- Acceptance criterion preserves distribution - Output is identical to non-speculative
- Draft model selection matters - Same tokenizer, similar distribution
- Constraint decoding ensures validity - Grammar-based token masking
- Both can combine - Speedup + structure
The Trade-off Triangle
         Latency
            /\
           /  \
          /    \
         /      \
        /________\
  Quality        Structure
- Speculative decoding: Latency ↓, Quality = (output distribution unchanged), Structure =
- Constraint decoding: Latency ↑ slightly (per-step masking overhead), Quality ≈, Structure ↑
- Combined: Latency ↓, Quality ≈, Structure ↑
What’s Next?
In Part IV, we’ll explore RLHF Systems—how to train LLMs with human feedback, including the complex multi-model orchestration required for PPO training.