Chapter 13: RLHF Computation Flow
“Four models, one update. Orchestrating RLHF is like conducting a symphony of neural networks.”
Learning Objectives
By the end of this chapter, you will be able to:
- Name the four models in RLHF and their roles
- Trace the data flow through one RLHF training step
- Explain why we need a reference model
- Calculate memory requirements for RLHF training
Prerequisites
- Completed Chapter 12 (RL Fundamentals)
- Understanding of PPO and advantage estimation
- Familiarity with model architecture (transformers)
Concept Overview
The Four Models of RLHF
| Model | Role | Updates? | Size |
|---|---|---|---|
| Actor (Policy) | Generates responses | Yes | Full LLM |
| Critic (Value) | Predicts expected reward | Yes | Full LLM or smaller |
| Reward | Scores responses | No | Separate model, often smaller |
| Reference | Anchors the KL penalty (prevents reward hacking) | No | Frozen copy of initial actor |
┌─────────────────────────────────────────────────────────────────────────┐
│ RLHF MODEL ORCHESTRA │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Actor │ │ Critic │ │ Reward │ │Reference│ │
│ │(Policy) │ │(Value) │ │ Model │ │ Policy │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │ │
│ │ │ │ │ │
│ Generates Estimates Evaluates Anchors │
│ responses future reward quality updates │
│ │ │ │ │ │
│ └────────────────┴────────────────┴────────────────┘ │
│ │ │
│ PPO Update │
│ │
└─────────────────────────────────────────────────────────────────────────┘
The RLHF Training Loop
One step of RLHF training (sketched in code after the list):
1. SAMPLE PROMPTS
└─► Get batch of prompts from dataset
2. GENERATE RESPONSES (Actor)
└─► Actor generates responses for each prompt
└─► Save per-token log-probabilities
3. SCORE RESPONSES (Reward Model)
└─► Reward model scores each response
└─► This is the "human feedback" signal
4. COMPUTE KL PENALTY (Reference)
└─► Compare actor probabilities to reference
└─► Penalize divergence (prevent reward hacking)
5. COMPUTE ADVANTAGES (Critic + GAE)
└─► Critic estimates values
└─► GAE computes advantages
6. PPO UPDATE (Actor + Critic)
└─► Update actor using PPO objective
└─► Update critic to better predict returns (expected future reward)
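A minimal Python sketch of one such step. The model and helper interfaces (generate, score, logprobs, compute_gae, ppo_update) are illustrative names for this sketch, not a specific library's API:

```python
# Illustrative sketch of one RLHF training step (all helper names are hypothetical).
def rlhf_step(prompts, actor, critic, reward_model, reference, beta=0.1):
    # 1-2. Actor generates responses; keep its per-token log-probabilities.
    responses, actor_logprobs = actor.generate(prompts, return_logprobs=True)

    # 3. Reward model scores each complete (prompt, response) pair.
    rm_scores = reward_model.score(prompts, responses)

    # 4. Reference log-probs on the same tokens give a per-token KL estimate.
    ref_logprobs = reference.logprobs(prompts, responses)
    kl = actor_logprobs - ref_logprobs

    # Combine: KL penalty on every token, RM score added at the final token.
    rewards = -beta * kl
    rewards[:, -1] += rm_scores

    # 5. Critic values and GAE advantages for credit assignment.
    values = critic.values(prompts, responses)
    advantages, returns = compute_gae(rewards, values)

    # 6. PPO update of actor and critic.
    ppo_update(actor, critic, responses, actor_logprobs, advantages, returns)
```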
Detailed Data Flow
Prompt
│
▼
┌──────────────────────────┐
│ ACTOR │
│ Generate response │
│ Output: tokens, logits │
└───────────┬──────────────┘
│
┌─────────────┼─────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ REWARD │ │REFERENCE │ │ CRITIC │
│ MODEL │ │ │ │ │
│Score: 0.8│ │ logits │ │ values │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
└─────────────┴─────────────┘
│
▼
┌─────────────────┐
│ COMPUTE REWARD │
│ R = R_rm - β*KL │
└────────┬────────┘
│
▼
┌─────────────────┐
│ COMPUTE GAE │
│ advantages │
└────────┬────────┘
│
▼
┌─────────────────┐
│ PPO UPDATE │
│ actor, critic │
└─────────────────┘
The Reward Calculation
The reward for each response combines:
R_total = R_reward_model - β * KL(π_actor || π_reference)
- R_reward_model: score from the reward model (trained on human preferences)
- β * KL(π_actor || π_reference): penalty that keeps the actor close to the reference policy, preventing “reward hacking”
Without KL penalty, the model might find degenerate solutions:
- Repeating phrases that game the reward model
- Producing unnatural but high-scoring outputs
- Catastrophic forgetting of language capabilities
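As a minimal sketch of the combination itself (assuming a scalar reward-model score per response and a per-token KL estimate already computed):

```python
import torch

def combined_reward(rm_score, kl_per_token, beta=0.1):
    """Illustrative only: fold the KL penalty into the response-level reward.

    rm_score:     reward-model score for the full response (scalar)
    kl_per_token: KL(actor || reference) estimate at each generated token
    """
    return rm_score - beta * kl_per_token.sum()

# Example: a 5-token response with a small divergence from the reference.
r = combined_reward(torch.tensor(0.8), torch.tensor([0.02, 0.01, 0.03, 0.00, 0.04]))
print(r)  # 0.8 - 0.1 * 0.10 = 0.79
```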
Why a Reference Model?
The reference model is a frozen copy of the initial policy. It serves as an anchor:
Without reference:
Actor → "AMAZING! INCREDIBLE! BEST EVER!" (reward hacks)
With reference:
Actor → Natural response similar to reference
If too different → KL penalty reduces total reward
KL divergence measures how different the actor’s distribution is from the reference:
KL(π_actor || π_ref) = Σ π_actor(token) * log(π_actor(token) / π_ref(token))
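A minimal PyTorch sketch of this formula at each position, computed from full-vocabulary logits (many implementations instead use the cheaper sampled-token approximation log π_actor(token) - log π_ref(token)):

```python
import torch
import torch.nn.functional as F

def kl_actor_vs_ref(actor_logits, ref_logits):
    """Exact KL(pi_actor || pi_ref) per position, summed over the vocabulary."""
    actor_logp = F.log_softmax(actor_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    # sum_v pi_actor(v) * (log pi_actor(v) - log pi_ref(v))
    return (actor_logp.exp() * (actor_logp - ref_logp)).sum(dim=-1)

# Shapes: [batch, seq_len, vocab_size] logits -> [batch, seq_len] KL values.
```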
Per-Token vs Per-Response Rewards
In practice, rewards can be assigned:
Per-response (most common; sketched in code below):
- Reward model scores complete response
- Reward assigned to last token
- Other tokens get 0 reward
- GAE propagates signal backwards
Per-token (process reward):
- Each token gets a score
- More fine-grained signal
- Harder to obtain labels
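A sketch of the per-response scheme, assuming fixed-length responses without padding: the reward-model score lands on the last generated token, while the KL penalty applies at every token.

```python
import torch

def build_token_rewards(rm_scores, kl, beta=0.1):
    """rm_scores: [batch]; kl: [batch, seq_len] -> per-token rewards [batch, seq_len]."""
    rewards = -beta * kl                          # KL penalty at every position
    rewards[:, -1] = rewards[:, -1] + rm_scores   # RM score only on the final token
    return rewards

rewards = build_token_rewards(torch.tensor([0.8, 0.3]), torch.zeros(2, 4))
# tensor([[0.0, 0.0, 0.0, 0.8],
#         [0.0, 0.0, 0.0, 0.3]])
```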
Memory Requirements
For a 7B parameter model with RLHF:
| Component | Memory (FP16) |
|---|---|
| Actor | 14 GB |
| Critic | 14 GB |
| Reward Model | 14 GB |
| Reference | 14 GB |
| Optimizer states | 56 GB |
| Activations | ~20 GB |
| Total | ~130 GB |
For 70B: multiply by 10 → ~1.3 TB!
This is why RLHF needs careful system design.
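A back-of-the-envelope estimator for the table above. The assumptions are mine, not exact: FP16 weights at 2 bytes/parameter for all four models, roughly 4 bytes/parameter of optimizer state across the two trainable models, and a flat activation budget; real numbers depend heavily on optimizer precision, sequence length, and checkpointing.

```python
def rlhf_memory_gb(params_billion, opt_bytes_per_param=4.0, activations_gb=20.0):
    """Rough RLHF memory estimate in GB under the assumptions stated above."""
    p = params_billion * 1e9
    weights_gb = 4 * p * 2 / 1e9                       # actor + critic + reward + reference (FP16)
    optimizer_gb = 2 * p * opt_bytes_per_param / 1e9   # states for actor + critic
    return weights_gb + optimizer_gb + activations_gb

print(rlhf_memory_gb(7))                          # ~132 GB, close to the ~130 GB above
print(rlhf_memory_gb(70, activations_gb=200.0))   # ~1320 GB, i.e. roughly 1.3 TB
```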
Code Walkthrough
Script 1: rlhf_loop_pseudo.py
Pseudocode implementation of the RLHF loop:
- Shows exact data flow
- Demonstrates each computation
- Explains intermediate values
Script 2: reward_calculator.py
Implements reward calculation:
- Reward model scoring
- KL divergence computation
- Total reward with penalty
Common Questions
Q: Why not just fine-tune on high-reward responses?
Supervised fine-tuning on selected responses (rejection sampling) works, but:
- Wastes low-reward samples
- No gradient signal about “how bad” something is
- PPO makes more efficient use of data
Q: Can the critic share weights with the actor?
Yes! Common approaches:
- Separate critic: Full model, independent
- Shared backbone: Same transformer, different heads
- Value head: Small MLP on top of actor’s hidden states
Shared approaches save memory but may have optimization conflicts.
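A minimal PyTorch sketch of the value-head variant, assuming the actor exposes its final hidden states (names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Small MLP mapping the actor's hidden states to one scalar value per token."""
    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, hidden_states):                 # [batch, seq_len, hidden]
        return self.mlp(hidden_states).squeeze(-1)    # [batch, seq_len] values

# Usage: values = ValueHead(hidden_size=4096)(actor_hidden_states)
```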
Q: How is the reward model trained?
Before RLHF:
- Collect comparison data: “Response A is better than B”
- Train reward model with ranking loss
- Reward model learns human preferences
The reward model is then frozen during RLHF.
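A sketch of the standard pairwise ranking loss for this step (Bradley-Terry style), assuming the reward model already returns a scalar score per response:

```python
import torch
import torch.nn.functional as F

def ranking_loss(score_chosen, score_rejected):
    """-log sigmoid(r_chosen - r_rejected): pushes preferred responses above rejected ones."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Example: the chosen responses already score higher, so the loss is small.
loss = ranking_loss(torch.tensor([1.2, 0.5]), torch.tensor([0.3, -0.1]))
```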
Try It Yourself
Exercise 1: Trace Data Flow
For a batch of 4 prompts with max response length 100:
- What are the tensor shapes at each stage?
- How many forward passes per training step?
- What’s the communication pattern?
Exercise 2: KL Penalty Tuning
The KL coefficient β controls the penalty:
- β too low: reward hacking
- β too high: no learning
Experiment (conceptually):
- What happens if β = 0?
- What happens if β = 10?
- How would you find the right β?
Exercise 3: Memory Optimization
You have 8× 80GB GPUs and want to train a 70B model with RLHF.
- What parallelism strategies would you use?
- Can you fit all 4 models?
- What trade-offs would you make?
Key Takeaways
- Four models, one loop - Actor, Critic, Reward, Reference
- KL penalty is crucial - Prevents reward hacking
- GAE for credit assignment - Propagates reward signal
- Memory is the bottleneck - 4× model weights minimum
- Reference stays frozen - Anchors the learning
The RLHF Equation
The complete PPO-RLHF objective (maximized with respect to the actor and critic parameters):
L = E[
L^PPO(actor_params) # Policy improvement
- c₁ * L^VF(critic_params) # Value function loss
+ c₂ * Entropy(actor) # Exploration bonus
]
Where:
L^PPO = min(ratio * A, clip(ratio, 1-ε, 1+ε) * A)
L^VF = (V_predicted - R_observed)²   # R_observed: the observed return, not the raw per-token reward
A = GAE(rewards, values)
rewards = R_reward_model - β * KL
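A compact PyTorch sketch of these terms, written as a loss to minimize (i.e., the negative of L above); real implementations add masking, advantage normalization, and minibatching:

```python
import torch

def ppo_rlhf_loss(logp_new, logp_old, advantages, values_pred, returns,
                  entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped policy surrogate + value loss - entropy bonus (illustrative)."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(ratio * advantages,
                          torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    policy_loss = -surrogate.mean()                 # maximizing L^PPO = minimizing its negative
    value_loss = (values_pred - returns).pow(2).mean()
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```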
What’s Next?
In Chapter 14, we’ll explore RLHF System Architecture—how to efficiently orchestrate these models across GPUs with co-location, disaggregation, and hybrid approaches.