Chapter 12: RL Fundamentals for LLMs
“Before you can teach a model with human feedback, you need to speak the language of reinforcement learning.”
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the core RL concepts: states, actions, rewards, policies
- Understand value functions and the Bellman equation
- Implement policy gradients and the REINFORCE algorithm
- Explain PPO (Proximal Policy Optimization) and why it’s used for LLMs
Prerequisites
- Completed Part III (LLM Inference)
- Basic calculus (derivatives)
- Familiarity with neural network training
Concept Overview
RL in 60 Seconds
Supervised Learning: Given input X, predict label Y (teacher provides answer)
Reinforcement Learning: Given state S, take action A, observe reward R (learn from trial and error)
RL FRAMEWORK

┌─────────┐          action a          ┌─────────────┐
│  Agent  │ ─────────────────────────► │ Environment │
│ (Policy)│ ◄───────────────────────── │   (World)   │
└─────────┘     state s, reward r      └─────────────┘

Goal: Learn policy π(a|s) that maximizes cumulative reward
The LLM as an RL Agent
| RL Concept | LLM Interpretation |
|---|---|
| State | Prompt + generated tokens so far |
| Action | Next token to generate |
| Policy | The LLM itself (token probabilities) |
| Reward | Human preference score (or reward model) |
| Episode | One complete generation |
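To make this mapping concrete, here is a toy sketch of one generation episode viewed as an MDP; the prompt, the tokens, and the 0.87 score are invented for illustration only:

# Illustrative only: the prompt, tokens, and final score below are made up.
prompt = "Translate to French: cat"
tokens = ["chat", "</s>"]                  # actions chosen by the policy (the LLM)

episode = []
for t, token in enumerate(tokens):
    state = prompt + "".join(tokens[:t])   # state: prompt + tokens generated so far
    episode.append((state, token))         # action: the next token

# The reward usually arrives only once, for the complete response:
rewards = [0.0] * (len(tokens) - 1) + [0.87]   # 0.87: hypothetical preference score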
Value Functions: Predicting Future Rewards
State Value V(s): Expected total reward starting from state s
V(s) = E[R₀ + γR₁ + γ²R₂ + ... | S₀ = s]
Action Value Q(s,a): Expected total reward after taking action a in state s
Q(s,a) = E[R₀ + γR₁ + γ²R₂ + ... | S₀ = s, A₀ = a]
γ (gamma): Discount factor (0-1). Lower γ = short-sighted, higher γ = long-term thinking.
For LLMs, we typically use γ ≈ 1 (care equally about all future rewards).
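As a quick numeric sketch (the rewards and γ values below are arbitrary), this is the discounted sum that V and Q take expectations over:

def discounted_return(rewards, gamma):
    """G = R0 + gamma*R1 + gamma^2*R2 + ...  (one Monte Carlo sample of V(s0))."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [0.0, 0.0, 1.0]                  # reward only at the end, as in RLHF
print(discounted_return(rewards, 0.5))     # 0.25 -- low gamma: the late reward barely counts
print(discounted_return(rewards, 1.0))     # 1.0  -- gamma ~ 1: future rewards count in full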
The Bellman Equation
The fundamental equation of RL:
V(s) = E[R + γV(s') | S = s]
= Σₐ π(a|s) [R(s,a) + γ Σₛ' P(s'|s,a) V(s')]
“The value of a state is the immediate reward plus the discounted value of the next state.”
This recursive structure enables dynamic programming solutions.
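To see the recursion at work, here is a minimal value-iteration sketch on a made-up two-state MDP (the transition probabilities and rewards are arbitrary):

# Toy MDP under a fixed policy; P[s] lists (prob, next_state, reward) tuples.
P = {
    0: [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    1: [(1.0, 0, 0.0)],
}
gamma = 0.9
V = {0: 0.0, 1: 0.0}

# Repeated Bellman backups: V(s) <- E[R + gamma * V(s')]
for _ in range(100):
    V = {s: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s]) for s in P}

print(V)   # converges to the fixed point of the Bellman equation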
Policy Gradients: Learning by Gradient Ascent
Instead of computing values, directly optimize the policy!
Objective: Maximize expected reward
J(θ) = E[Σₜ R(sₜ, aₜ)]
Policy Gradient Theorem:
∇J(θ) = E[Σₜ ∇log π_θ(aₜ|sₜ) · Gₜ]
Where Gₜ = total future reward from time t.
Intuition:
- If action led to high reward: increase its probability (positive gradient)
- If action led to low reward: decrease its probability (negative gradient)
REINFORCE Algorithm
The simplest policy gradient algorithm:
for episode in episodes:
    # Collect one trajectory by running the current policy
    states, actions, rewards = collect_episode(policy)
    # Compute discounted returns-to-go G_t for every timestep
    returns = compute_returns(rewards, gamma)
    # Update policy: raise log-probs of actions in proportion to their return
    optimizer.zero_grad()
    for s, a, G in zip(states, actions, returns):
        loss = -log_prob(policy(s), a) * G
        loss.backward()        # gradients accumulate across the episode
    optimizer.step()
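The compute_returns helper used above isn't shown; a minimal sketch that computes discounted returns-to-go backwards over the episode:

def compute_returns(rewards, gamma):
    """Returns-to-go: G_t = r_t + gamma * G_{t+1}, computed back to front."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]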
Problem: High variance! Returns can vary wildly between episodes.
Variance Reduction: Baselines
Subtract a baseline from returns to reduce variance:
∇J(θ) = E[Σₜ ∇log π_θ(aₜ|sₜ) · (Gₜ - b(sₜ))]
Common baseline: Value function V(s) — learn to predict expected return.
This gives us the Advantage:
A(s,a) = Q(s,a) - V(s)
≈ R + γV(s') - V(s) (TD error)
“How much better is this action compared to the average?”
Actor-Critic: Best of Both Worlds
Actor: Policy network π_θ(a|s)
Critic: Value network V_φ(s)

# Actor update: policy gradient weighted by the advantage (critic held fixed)
advantage = reward + gamma * V(next_state) - V(state)
actor_loss = -log_prob(action) * advantage.detach()

# Critic update: regress V(state) toward the one-step TD target
td_target = reward + gamma * V(next_state).detach()
critic_loss = (V(state) - td_target) ** 2
Generalized Advantage Estimation (GAE)
GAE smoothly interpolates between:
- Low bias, high variance (full returns)
- High bias, low variance (TD error)
A^GAE_t = Σₖ (γλ)^k δₜ₊ₖ
Where δₜ = rₜ + γV(sₜ₊₁) - V(sₜ) (TD error)
λ controls the tradeoff:
- λ = 0: Just TD error (high bias, low variance)
- λ = 1: Full returns (low bias, high variance)
Typical: λ = 0.95
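A minimal sketch of the backward GAE recursion (the extra bootstrap value and the default hyperparameters are assumptions):

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """values holds one extra entry, V(s_T), to bootstrap the final step."""
    advantages, gae = [], 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * lam * gae     # A_t = delta_t + gamma*lambda*A_{t+1}
        advantages.append(gae)
    return advantages[::-1]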
PPO: The Industry Standard
PPO (Proximal Policy Optimization) approximates a trust-region constraint with a simple clipped objective:
“Don’t change the policy too much in one update.”
PPO-Clip objective:
L^CLIP(θ) = E[min(rₜ(θ)Aₜ, clip(rₜ(θ), 1-ε, 1+ε)Aₜ)]
Where rₜ(θ) = π_θ(aₜ|sₜ) / π_θold(aₜ|sₜ) (probability ratio)
Intuition:
- If the advantage is positive and the ratio exceeds 1+ε: the objective is clipped, so there is no incentive to push the action's probability up any further
- If the advantage is negative and the ratio falls below 1-ε: the objective is clipped, so there is no incentive to push the probability down any further
- Either way, each update keeps the policy close to the old one
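In code the clipped objective is only a few lines. A PyTorch-style sketch (the tensor shapes and the ε = 0.2 default are assumptions):

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Negative PPO-Clip objective (a loss to minimize); inputs are per-action tensors."""
    ratio = torch.exp(log_probs - old_log_probs)                 # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()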
Why PPO for LLMs?
- Stable training: The clipped objective keeps each update small, preventing destructive policy changes
- Sample efficient: Each batch of rollouts is reused for several gradient steps within the clipping region
- Proven at scale: Used by OpenAI, Anthropic, DeepMind
- Simple to implement: No second-order optimization
Code Walkthrough
Script 1: ppo_cartpole.py
A minimal PPO implementation on CartPole:
- Actor-Critic networks
- GAE advantage computation
- PPO-Clip objective
This isn’t for LLMs but shows PPO mechanics clearly.
Script 2: gae_visualizer.py
Visualizes how GAE works:
- Shows TD errors over trajectory
- Compares different λ values
- Demonstrates bias-variance tradeoff
The RLHF Connection
In RLHF:
- State: Prompt + partial response
- Action: Next token
- Reward: Comes from reward model (trained on human preferences)
- Episode: Complete response generation
The PPO objective becomes:
max E[R_reward_model(response) - β * KL(π || π_ref)]
Where:
- R_reward_model: Score from reward model
- KL term: Penalty for diverging from reference policy
- β: KL coefficient (prevents reward hacking)
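In practice the KL term is often applied per token and folded into the reward that PPO sees. A minimal sketch (the β default and the per-token KL estimate lp - ref_lp are simplifying assumptions):

def rlhf_rewards(logprobs, ref_logprobs, reward_model_score, beta=0.1):
    """Per-token rewards: -beta * KL estimate, plus the sequence score on the last token."""
    rewards = [-beta * (lp - ref_lp) for lp, ref_lp in zip(logprobs, ref_logprobs)]
    rewards[-1] += reward_model_score      # reward model scores the whole response once
    return rewards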
Try It Yourself
Exercise 1: Implement REINFORCE
Implement REINFORCE for a simple environment:
- Collect episodes
- Compute returns
- Update policy
- Track learning curves
Exercise 2: Add a Baseline
Modify your REINFORCE to use a learned baseline:
- Add a value network
- Compute advantages
- Compare variance with/without baseline
Exercise 3: Understand PPO Clipping
For different advantage signs and probability ratios:
- Compute clipped and unclipped objectives
- Determine which is used
- Explain why clipping helps stability
Key Takeaways
- RL learns from rewards, not labels - Trial and error, not supervision
- Value functions predict future rewards - Enables credit assignment
- Policy gradients directly optimize the policy - No explicit value model required (though one helps as a baseline)
- Baselines reduce variance - Critical for practical training
- PPO is stable and scalable - The go-to algorithm for RLHF
The RL Hierarchy
Simple ────────────────────────────────────────────► Complex

REINFORCE  →  Actor-Critic  →  A2C  →  PPO  →  RLHF with PPO
    ↓              ↓            ↓       ↓            ↓
  High          Value as     Parallel  Trust    Multi-model
variance        baseline     training  region  orchestration
What’s Next?
In Chapter 13, we’ll dive into RLHF Computation Flow—how the Actor, Critic, Reward, and Reference models work together during training.