# Parameter Tuning

This guide explains each parameter and how to tune it for your use case.

## Quick Reference
| Parameter | Default | Range | Effect |
|---|---|---|---|
| `top_beam` | 5 | 1-20 | Number of paths to discover |
| `top_k` | 5 | 1-20 | Exploration breadth per step |
| `min_attn` | 0.0 | 0.0-1.0 | Attention threshold filter |
| `length_norm` | `avg_logprob` | see below | Path length compensation |
| `agg_heads` | `mean` | `mean`/`max` | Head aggregation method |
| `layer` | -1 | any layer | Which layer to analyze |
| `target_pos` | -1 | any position | Starting position for trace |
## `top_beam` - Number of Beams
Controls how many distinct paths to discover.
```python
# Minimal - single best path
result = tracer.trace_text(text, top_beam=1)

# Balanced (default)
result = tracer.trace_text(text, top_beam=5)

# Comprehensive
result = tracer.trace_text(text, top_beam=10)
```
Recommendations:

- `1`: When you only want the dominant path
- `3-5`: General exploration (recommended)
- `10+`: Research/detailed analysis
Higher values increase computation time and visualization complexity.
## `top_k` - Predecessors Per Step
Controls how many candidates to consider at each position.
```python
# Focused - follow only the strongest attention
result = tracer.trace_text(text, top_k=2)

# Balanced (default)
result = tracer.trace_text(text, top_k=5)

# Exploratory
result = tracer.trace_text(text, top_k=10)
```
Recommendations:

- `1`: Greedy search (not recommended; misses alternatives)
- `3-5`: Good balance
- `10+`: When you want to discover less-obvious paths
Note: `top_k` × `top_beam` determines memory usage per step.
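The note above can be made concrete with quick arithmetic (a sketch; the tracer's exact memory model is an assumption):

```python
# Each live beam expands into top_k candidate predecessors, so the
# search examines roughly top_beam * top_k partial paths per step
# before pruning back down to top_beam.
top_beam = 5   # default
top_k = 5      # default
candidates_per_step = top_beam * top_k
print(candidates_per_step)  # 25
```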
## `min_attn` - Attention Threshold
Filters out weak attention connections.
```python
# Include all attention (default)
result = tracer.trace_text(text, min_attn=0.0)

# Only meaningful attention
result = tracer.trace_text(text, min_attn=0.05)

# Only strong attention
result = tracer.trace_text(text, min_attn=0.1)
```
Recommendations:

- `0.0`: Include everything (default)
- `0.01-0.05`: Filter noise, keep signal
- `0.1+`: Focus on dominant connections only
**Warning:** Too high a threshold may cause paths to terminate early because no predecessor meets the threshold.
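To see why, consider a toy attention row and the thresholding it implies (an illustrative sketch; the tracer's actual filtering logic may differ):

```python
# One token's attention over its four predecessors (sums to 1.0)
attn_row = [0.30, 0.28, 0.22, 0.20]

def candidates(row, min_attn):
    """Indices of predecessors whose attention clears the threshold."""
    return [i for i, a in enumerate(row) if a >= min_attn]

print(candidates(attn_row, 0.05))  # [0, 1, 2, 3] - everything survives
print(candidates(attn_row, 0.35))  # [] - no predecessor qualifies, so the trace stops
```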
## `length_norm` - Score Normalization
Controls how path scores are normalized for length.
```python
# Raw scores (favor shorter paths)
result = tracer.trace_text(text, length_norm="none")

# Geometric mean (fair comparison - default)
result = tracer.trace_text(text, length_norm="avg_logprob")

# Moderate normalization
result = tracer.trace_text(text, length_norm="sqrt")

# Custom exponent
result = tracer.trace_text(text, length_norm="pow:0.7")
```
### Comparison
| Method | Score Formula | Behavior |
|---|---|---|
| `none` | Σ log(attn) | Prefers shorter paths |
| `avg_logprob` | Σ log(attn) / n | Equal treatment |
| `sqrt` | Σ log(attn) / √n | Slight short preference |
| `pow:α` | Σ log(attn) / n^α | Tunable |
Recommendations:

- Use `avg_logprob` for general analysis (default)
- Use `none` when path length matters (absolute influence)
- Use `pow:0.7` or `sqrt` as a middle ground
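The formulas in the table can be sketched in a few lines of plain Python (an illustration of the scoring math, not the library's actual implementation):

```python
import math

def path_score(attns, length_norm="avg_logprob"):
    """Score a path from its per-step attention weights."""
    log_sum = sum(math.log(a) for a in attns)  # Σ log(attn)
    n = len(attns)
    if length_norm == "none":
        return log_sum
    if length_norm == "avg_logprob":
        return log_sum / n
    if length_norm == "sqrt":
        return log_sum / math.sqrt(n)
    if length_norm.startswith("pow:"):
        alpha = float(length_norm.split(":", 1)[1])
        return log_sum / n ** alpha
    raise ValueError(f"unknown length_norm: {length_norm}")

short = [0.5, 0.5]             # short path, moderate attention
long = [0.6, 0.6, 0.6, 0.6]    # longer path, slightly stronger steps

# Raw sums penalize the longer path; the per-step average does not.
print(path_score(short, "none") > path_score(long, "none"))                # True
print(path_score(short, "avg_logprob") > path_score(long, "avg_logprob"))  # False
```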
## `agg_heads` - Head Aggregation
Controls how attention from multiple heads is combined.
```python
# Average across heads (default)
result = tracer.trace_text(text, agg_heads="mean")

# Maximum across heads
result = tracer.trace_text(text, agg_heads="max")
```
Comparison:

| Method | Formula | Effect |
|---|---|---|
| `mean` | avg(head₁, head₂, ...) | Consensus attention |
| `max` | max(head₁, head₂, ...) | Strongest head signal |
Recommendations:

- `mean`: General use, shows where heads agree
- `max`: When looking for specialized head behavior
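A minimal sketch of the two aggregation modes over a toy per-head attention matrix (plain Python for illustration; the library operates on real attention tensors):

```python
# Attention from one query position to three keys, for two heads
heads = [
    [0.5, 0.25, 0.25],  # head 1 attends mostly to the first key
    [0.25, 0.25, 0.5],  # head 2 attends mostly to the last key
]

# "mean": consensus - average each key's weight across heads
mean_attn = [sum(col) / len(heads) for col in zip(*heads)]
# "max": strongest signal - keep each key's best weight from any head
max_attn = [max(col) for col in zip(*heads)]

print(mean_attn)  # [0.375, 0.25, 0.375] - the disagreement averages out
print(max_attn)   # [0.5, 0.25, 0.5] - both specializations survive
```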
**Note:** `agg_heads="none"` is not supported for beam search because it would create a separate attention matrix per head.
## `layer` - Layer Selection
Choose which transformer layer to analyze.
```python
# Last layer (most processed - default)
result = tracer.trace_text(text, layer=-1)

# First layer (raw attention patterns)
result = tracer.trace_text(text, layer=0)

# Middle layer
result = tracer.trace_text(text, layer=12)
```
Negative indices work like Python lists:

- `-1`: Last layer
- `-2`: Second-to-last layer
Recommendations:

- `-1`: Default, shows final attention patterns
- Early layers: Syntactic/local patterns
- Middle layers: Semantic relationships
- Late layers: Task-specific patterns
## `target_pos` - Starting Position
Choose where to start the backward trace.
```python
# Last token (default)
result = tracer.trace_text(text, target_pos=-1)

# Second-to-last
result = tracer.trace_text(text, target_pos=-2)

# Specific position
result = tracer.trace_text(text, target_pos=5)
```
Common use cases:

- `-1`: What influenced the final output?
- Specific position: What influenced token X?
- Loop through positions: Analyze entire sequence
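The loop-through-positions use case can be sketched as follows (a hypothetical fragment: `num_tokens` stands for the tokenized length of `text` and is not defined in this guide):

```python
# Hypothetical: trace the influences on every position in the sequence
results = {
    pos: tracer.trace_text(text, target_pos=pos)
    for pos in range(num_tokens)
}
```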
## Combining Parameters
### Quick Overview
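For a quick first pass, it is reasonable to stay close to the defaults and keep the beam small (a suggested starting point, not an official preset):

```python
# Fast scan - small beam, light noise filtering
result = tracer.trace_text(
    text,
    top_beam=3,
    min_attn=0.05,
)
```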
### Deep Dive
```python
# Thorough analysis
result = tracer.trace_text(
    text,
    top_beam=10,
    top_k=8,
    min_attn=0.01,
    length_norm="avg_logprob",
)
```
### Debugging Specific Behavior
```python
# Focus on strong signals in early layers
result = tracer.trace_text(
    text,
    layer=0,
    min_attn=0.1,
    top_beam=3,
    agg_heads="max",
)
```
## Performance Tips
- Start small: Begin with defaults, adjust as needed
- Use `min_attn`: Filter noise for cleaner visualizations
- Reduce `top_beam`: Lower beam count = faster execution
- Early layers are faster: Less computation needed
## Troubleshooting

### Empty or short paths
- Lower `min_attn`
- Increase `top_k`
- Check if `stop_at_bos` is terminating early
### Too many similar paths
- Reduce `top_beam`
- Increase `min_attn`
- Use beam filtering in visualization
### Slow performance
- Reduce `top_beam` and `top_k`
- Use shorter input sequences
- Try a smaller model