Chapter 11: Speculative and Constraint Decoding

“Predict multiple tokens, verify in parallel. It’s like spell-checking as you type, but for LLMs.”

Learning Objectives

By the end of this chapter, you will be able to:

  • Explain how speculative decoding accelerates generation
  • Understand the acceptance/rejection mechanism
  • Use constraint decoding for structured output (JSON, code)
  • Choose when to apply these techniques

Prerequisites

  • Completed Chapters 8-10 (Inference Systems)
  • Understanding of autoregressive generation
  • Basic probability concepts

Concept Overview

The Problem: Sequential Generation

Autoregressive LLMs generate one token at a time:

Prompt: "The capital of France is"
Step 1: → "Paris"
Step 2: → "."
Step 3: → " It"
Step 4: → " is"
...

Each step requires a full model forward pass. For a 70B model generating 100 tokens:

  • 100 sequential forward passes
  • Can’t parallelize within a request
  • Memory bandwidth limited during decode
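To make the bottleneck concrete, here is a minimal sketch of the decode loop (model and tokenizer are hypothetical stand-ins for any causal LM interface):

def generate(model, tokenizer, prompt, max_new_tokens=100):
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)     # one full forward pass per token
        next_token = logits[-1].argmax()   # greedy: take the most likely token
        tokens.append(next_token)          # nothing here can run in parallel
    return tokenizer.decode(tokens)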

Speculative Decoding: The Key Insight

What if we could verify multiple tokens at once?

Speculative decoding uses a small “draft” model to guess multiple tokens, then verifies them in parallel with the large “target” model.

Draft model (small, fast):
  Input: "The capital of France is"
  Draft: ["Paris", ".", " It", " is", " known"]
  (5 tokens in one pass)

Target model (large, accurate):
  Verify all 5 tokens in ONE parallel forward pass
  Accept: ["Paris", ".", " It"] (3 accepted)
  Reject: [" is", " known"] (distribution mismatch)

Result: one target model pass yields 3 accepted tokens, plus a corrected 4th sampled at the rejection point (explained below), instead of 4 sequential passes!

How Verification Works

The target model doesn’t just check “right or wrong”—it uses a probabilistic acceptance criterion:

For each position i:
  p_target = target_model_probability(token_i)
  p_draft = draft_model_probability(token_i)

  If p_target >= p_draft:
    ACCEPT (draft was conservative)
  Else:
    ACCEPT with probability p_target / p_draft
    (randomly accept based on ratio)

  If REJECT:
    Sample new token from adjusted distribution
    Stop accepting further tokens

This ensures the output distribution exactly matches the target model!
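Here is a runnable sketch of that acceptance test, where p_draft and p_target hold each model's full next-token distribution at every drafted position (the bonus token sampled when all k drafts are accepted is omitted for brevity):

import torch

def verify(draft_tokens, p_draft, p_target):
    accepted = []
    for i, tok in enumerate(draft_tokens):
        ratio = p_target[i][tok] / p_draft[i][tok]
        if torch.rand(()) < torch.clamp(ratio, max=1.0):
            accepted.append(tok)  # always accepted when p_target >= p_draft
        else:
            # Rejected: resample from the adjusted distribution
            # max(0, p_target - p_draft), renormalized
            adjusted = torch.clamp(p_target[i] - p_draft[i], min=0)
            new_tok = torch.multinomial(adjusted / adjusted.sum(), 1).item()
            accepted.append(new_tok)
            break  # later draft tokens no longer extend a valid prefix
    return accepted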

Speedup Analysis

Let:

  • γ = acceptance rate (typically 0.7-0.9)
  • k = draft length (tokens generated by draft)
  • c = cost ratio (target_time / draft_time, typically 10-50x)

Expected tokens per target forward pass:

E[tokens] = 1 + γ + γ² + ... + γ^k = (1 - γ^(k+1)) / (1 - γ)

For γ=0.8, k=5:

E[tokens] = (1 - 0.8^6) / (1 - 0.8) ≈ 3.69 tokens per pass

A ~3.7x theoretical speedup, before accounting for the draft model's cost!
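The same arithmetic in code, with the draft's overhead folded in (a small helper assuming each round always drafts k tokens and each draft pass costs 1/c of a target pass):

def expected_tokens(gamma: float, k: int) -> float:
    # Geometric series from above: (1 - gamma^(k+1)) / (1 - gamma)
    return (1 - gamma ** (k + 1)) / (1 - gamma)

def net_speedup(gamma: float, k: int, c: float) -> float:
    # One round costs 1 target pass plus k draft passes at 1/c each
    round_cost = 1 + k / c
    return expected_tokens(gamma, k) / round_cost

print(expected_tokens(0.8, 5))    # ≈ 3.69
print(net_speedup(0.8, 5, c=20))  # ≈ 2.95 once draft overhead is included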

Draft Model Selection

Good draft models:

  • Same tokenizer as target (required!)
  • Similar training data
  • Much smaller (7B for 70B target)

Common pairings:

  • LLaMA-70B target + LLaMA-7B draft
  • GPT-4 target + GPT-3.5 draft
  • Mixtral target + Mistral draft

Constraint Decoding: Structured Output

Sometimes we need output to follow a specific format:

  • JSON schema
  • SQL query
  • Function calls
  • Code in specific language

Constraint decoding restricts token probabilities to only valid continuations.

Grammar-Based Constraints

Define valid output using a grammar:

json_value := object | array | string | number | "true" | "false" | "null"
object := "{" (pair ("," pair)*)? "}"
pair := string ":" json_value
...

At each generation step:

  1. Get logits from model
  2. Identify tokens that lead to valid states
  3. Mask invalid tokens (set probability to 0)
  4. Sample from valid tokens only

import torch

def constrained_sample(logits, grammar_state):
    # Get valid next tokens from grammar
    valid_tokens = grammar_state.get_valid_tokens()

    # Mask invalid tokens by setting their logits to -inf
    # (a multiplicative 0/1 mask would produce NaNs from 0 * -inf)
    masked = torch.full_like(logits, float('-inf'))
    masked[valid_tokens] = logits[valid_tokens]

    # Sample from the renormalized distribution over valid tokens
    probs = torch.softmax(masked, dim=-1)
    return torch.multinomial(probs, num_samples=1)

Regex Constraints

For simpler patterns, regex constraints work well:

# Only generate valid email addresses
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"

# At each step, check if current output + candidate token
# can still match the pattern
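One way to implement that prefix check is with the third-party regex package, whose partial=True flag (not available in the standard re module) matches any string that could still be extended into a full match:

import regex  # pip install regex

EMAIL = regex.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def can_still_match(prefix: str) -> bool:
    # partial=True also succeeds on prefixes of a possible full match
    return EMAIL.fullmatch(prefix, partial=True) is not None

def filter_tokens(prefix: str, candidates: list[str]) -> list[str]:
    # Keep only candidate tokens that leave the pattern satisfiable
    return [t for t in candidates if can_still_match(prefix + t)]

print(filter_tokens("alice@exa", ["mple.com", " hello", "mple."]))
# ['mple.com', 'mple.'] -- ' hello' can never lead to a valid email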

Combining Speculative + Constraint Decoding

Can we get both speedup AND structured output? Yes!

The draft model also generates under constraints:

  1. Draft generates k constrained tokens
  2. Target verifies (also checking constraints)
  3. All accepted tokens are guaranteed valid

Tricky part: Draft must use same constraint state as target.
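A sketch of one combined round, reusing verify from the earlier snippet; draft_model, target_model, and the grammar-state interface (copy, advance, get_valid_tokens) are hypothetical stand-ins:

import torch

def masked_probs(logits, state):
    # Renormalized distribution over only the grammar-valid tokens
    # (the same masking used in constrained_sample above)
    masked = torch.full_like(logits, float('-inf'))
    valid = state.get_valid_tokens()
    masked[valid] = logits[valid]
    return torch.softmax(masked, dim=-1)

def combined_round(draft_model, target_model, tokens, grammar_state, k=5):
    # 1. Draft k constrained tokens, recording the grammar state and the
    #    masked draft distribution at every position
    state = grammar_state.copy()
    sequence, draft_tokens, states, p_draft = list(tokens), [], [], []
    for _ in range(k):
        probs = masked_probs(draft_model.forward(sequence), state)
        tok = torch.multinomial(probs, 1).item()
        states.append(state.copy())
        p_draft.append(probs)
        state.advance(tok)
        draft_tokens.append(tok)
        sequence.append(tok)

    # 2. Target scores all k positions in ONE pass, masked with the SAME
    #    grammar states so both distributions cover the same tokens
    target_logits = target_model.forward(sequence)
    p_target = [masked_probs(target_logits[len(tokens) + i - 1], states[i])
                for i in range(k)]

    # 3. Standard accept/reject; even a resampled replacement token is
    #    grammar-valid, so the real state can advance past everything
    accepted = verify(draft_tokens, p_draft, p_target)
    for tok in accepted:
        grammar_state.advance(tok)
    return tokens + accepted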

Code Walkthrough

Script 1: speculative_demo.py

Demonstrates speculative decoding:

  • Simulates draft/target model interaction
  • Shows acceptance/rejection process
  • Calculates speedup

Script 2: json_constraint_demo.py

Demonstrates constraint decoding:

  • Simple JSON schema
  • Token masking
  • Valid output generation

When to Use What

| Technique   | Best For                               | Avoid When                                 |
|-------------|----------------------------------------|--------------------------------------------|
| Speculative | Long generations, high acceptance rate | Very different draft/target, short outputs |
| Constraint  | Structured output, API responses       | Free-form text                             |
| Combined    | Long structured outputs                | Complex grammars + low acceptance          |

Try It Yourself

Exercise 1: Calculate Speedup

For a system with:

  • Acceptance rate: 75%
  • Draft length: 4 tokens
  • Draft cost: 5% of target cost

Calculate:

  1. Expected tokens per target pass
  2. Overall speedup including draft cost

Exercise 2: Design a Grammar

Write a simple grammar for:

  • Python function definitions
  • Email addresses
  • Phone numbers

Exercise 3: Acceptance Rate Experiment

If you had access to models:

  1. Measure acceptance rate for different draft lengths
  2. Find the optimal draft length
  3. How does temperature affect acceptance?

Key Takeaways

  1. Speculative decoding parallelizes verification - Multiple tokens checked in one forward pass
  2. Acceptance criterion preserves distribution - Output is identical to non-speculative
  3. Draft model selection matters - Same tokenizer, similar distribution
  4. Constraint decoding ensures validity - Grammar-based token masking
  5. Both can combine - Speedup + structure

The Trade-off Triangle

        Latency
         /\
        /  \
       /    \
      /      \
     /________\
Quality    Structure

  • Speculative decoding: Latency ↓, Quality =, Structure =
  • Constraint decoding: Latency ↑, Quality ≈, Structure ↑
  • Combined: Latency ↓, Quality ≈, Structure ↑

What’s Next?

In Part IV, we’ll explore RLHF Systems—how to train LLMs with human feedback, including the complex multi-model orchestration required for PPO training.

Further Reading