What is Reverse Attention?

The Big Picture

Imagine you're reading this sentence and trying to understand the last word. Your brain doesn't look at that word in isolation—it draws connections back to earlier words:

"The quick brown fox jumps over the lazy dog."

When you read "dog," your attention probably traces back through "lazy," maybe "jumps," and definitely connects to "fox" for that classic sentence structure.

Transformers work similarly. The attention mechanism lets tokens "look at" earlier tokens, and those attention weights tell us how much each token influenced the current one.
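Those attention weights come out of a softmax, so each token's weights over earlier tokens sum to 1. A minimal sketch of scaled dot-product attention with a causal mask (toy random inputs, numpy only):

```python
import numpy as np

def attention_weights(Q, K):
    """Scaled dot-product attention weights: softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: a token may only attend to itself and earlier tokens.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    # Softmax over the key dimension; each row sums to 1.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))  # 5 tokens, head dimension 8
K = rng.normal(size=(5, 8))
A = attention_weights(Q, K)
print(A.sum(axis=1))  # every row sums to 1.0
```

Row `i` of `A` is the distribution of token `i`'s attention over tokens `0..i`; those rows are the raw material that tracing works with.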

The Chain Effect

Here's the key insight: attention can chain.

Token 10 might mostly attend to token 7, which mostly attended to token 3. To really understand what influenced your output, you need to trace these paths backward through multiple steps.

Token 10 → Token 7 → Token 3 → Token 1
   (0.8)      (0.6)      (0.9)

A single attention matrix shows direct connections. Reverse attention tracing follows the flow of influence through the entire sequence.
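One simple way to score the chain above is to multiply the attention weights along it (a simplification, using the toy numbers from the diagram):

```python
import math

# Attention weights along the chain Token 10 -> 7 -> 3 -> 1 (from the diagram).
chain = [0.8, 0.6, 0.9]

# Score a path as the product of its edge weights.
path_score = math.prod(chain)
print(round(path_score, 3))  # 0.432
```

Because every weight is at most 1, longer chains naturally score lower unless each hop carries strong attention.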

Why "Reverse"?

Traditional attention analysis looks at the attention matrix and asks: "What is this token attending to?"

Reverse attention starts at a target token (usually the last one) and traces backward: "What sequence of tokens led to the most attention flowing here?"

This is like:

  • Forward: "Where is this water flowing to?"
  • Reverse: "Where did this water come from?"
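In matrix terms, the two questions correspond to reading a row versus a column of the attention matrix. A toy sketch (illustrative weights):

```python
import numpy as np

# Toy causal attention matrix: A[i, j] = attention from token i to token j.
A = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.7, 0.3, 0.0, 0.0],
    [0.2, 0.5, 0.3, 0.0],
    [0.1, 0.6, 0.1, 0.2],
])

# Forward: what is token 3 attending to?  -> read row 3
print(A[3])     # [0.1 0.6 0.1 0.2]

# Reverse: which tokens attend to token 1?  -> read column 1
print(A[:, 1])  # [0.  0.3 0.5 0.6]
```

Reverse tracing repeats the column question step by step: start at the target, find who fed attention into it, then ask the same question of those sources.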

Attention as Flow

Think of attention weights as water flow through pipes:

[The] ←─── 0.1 ───┐
[quick] ←─ 0.2 ───┤
[brown] ←─ 0.15 ──┤
                  ├──→ [dog]
[fox] ←─── 0.25 ──┤
[lazy] ←── 0.3 ───┘

Some paths carry more "flow" than others. Reverse attention tracing finds the high-flow paths—the routes where attention concentrates.
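For a single step, finding where the flow concentrates is just a sort over the incoming edges (weights taken from the pipe diagram above):

```python
# Incoming attention flow into "dog", from the diagram above.
flow_into_dog = {"The": 0.1, "quick": 0.2, "brown": 0.15, "fox": 0.25, "lazy": 0.3}

# Rank source tokens by how much flow they carry.
ranked = sorted(flow_into_dog.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:2])  # [('lazy', 0.3), ('fox', 0.25)]
```

Multi-step tracing extends this idea: instead of ranking single edges, it ranks whole chains of edges by their combined flow.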

Why Not Just Look at Max Attention?

A greedy approach (always following the single highest attention weight at each step) misses important alternative paths:

Example:

  • Path A: Target → 0.6 → 0.6 → 0.6 (product: 0.216)
  • Path B: Target → 0.9 → 0.2 → start (product: 0.18)

Greedy follows Path B because it starts with 0.9. But Path A has higher cumulative probability!

This is why we use beam search—it keeps multiple hypotheses alive and finds globally better paths.
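A minimal beam-search sketch of this backward tracing (toy weights chosen to reproduce the Path A / Path B numbers above; the rows are illustrative, not normalized attention distributions):

```python
def trace_back(A, target, beam_width):
    """Beam search over backward attention paths.

    A[i][j] is the attention weight from token i to earlier token j.
    A path is scored by the product of its edge weights; tracing
    stops when a path reaches token 0. Returns the best (score, path).
    """
    beams = [(1.0, (target,))]   # (cumulative product, path so far)
    finished = []
    while beams:
        candidates = []
        for score, path in beams:
            pos = path[-1]
            if pos == 0:                  # reached the start: path complete
                finished.append((score, path))
                continue
            for prev in range(pos):       # causal: only earlier tokens
                w = A[pos][prev]
                if w > 0:
                    candidates.append((score * w, path + (prev,)))
        # Keep only the top-k partial paths by cumulative score.
        beams = sorted(candidates, reverse=True)[:beam_width]
    return max(finished)

# Toy weights matching the example above.
N = 5
A = [[0.0] * N for _ in range(N)]
A[4][3] = 0.6; A[4][1] = 0.9   # two choices leaving the target (token 4)
A[3][2] = 0.6; A[2][0] = 0.6   # Path A: 0.6 * 0.6 * 0.6 = 0.216
A[1][0] = 0.2                  # Path B: 0.9 * 0.2 = 0.18

greedy_score, greedy_path = trace_back(A, target=4, beam_width=1)
beam_score, beam_path = trace_back(A, target=4, beam_width=3)
print(greedy_path, round(greedy_score, 3))  # (4, 1, 0) 0.18
print(beam_path, round(beam_score, 3))      # (4, 3, 2, 0) 0.216
```

With `beam_width=1` the search degenerates into the greedy strategy and gets stuck on Path B; widening the beam keeps Path A's prefix alive long enough for its higher cumulative product to win.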

What You Can Learn

Reverse attention tracing reveals:

  1. Information sources: Which early tokens most influence the output?
  2. Attention chains: How does information propagate through the sequence?
  3. Redundancy: Do multiple paths converge on the same critical tokens?
  4. Layer behavior: How do attention patterns differ across layers?

Practical Applications

Debugging Model Outputs

When a model generates something unexpected, trace back to see what it was "looking at":

result = tracer.trace_text("The capital of France is Berlin")
# Trace from "Berlin" to understand why the model made this error

Interpretability Research

Visualize attention flow patterns across different types of inputs:

  • Do named entities form distinct attention clusters?
  • How do syntactic structures affect attention routing?
  • What tokens act as "attention hubs"?

Understanding Context Windows

See how information from early tokens reaches late positions:

result = tracer.trace_text(very_long_text, target_pos=-1)
# Do early tokens still influence the end?

Next Steps