Parameter Tuning

This guide explains each parameter and how to tune it for your use case.

Quick Reference

| Parameter | Default | Range | Effect |
|---|---|---|---|
| top_beam | 5 | 1-20 | Number of paths to discover |
| top_k | 5 | 1-20 | Exploration breadth per step |
| min_attn | 0.0 | 0.0-1.0 | Attention threshold filter |
| length_norm | avg_logprob | see below | Path length compensation |
| agg_heads | mean | mean/max | Head aggregation method |
| layer | -1 | any layer | Which layer to analyze |
| target_pos | -1 | any position | Starting position for trace |

top_beam - Number of Beams

Controls how many distinct paths to discover.

# Minimal - single best path
result = tracer.trace_text(text, top_beam=1)

# Balanced (default)
result = tracer.trace_text(text, top_beam=5)

# Comprehensive
result = tracer.trace_text(text, top_beam=10)

Recommendations:

  • 1: When you only want the dominant path
  • 3-5: General exploration (recommended)
  • 10+: Research/detailed analysis

Higher values increase computation time and visualization complexity.

top_k - Predecessors Per Step

Controls how many candidates to consider at each position.

# Focused - follow only the strongest attention
result = tracer.trace_text(text, top_k=2)

# Balanced (default)
result = tracer.trace_text(text, top_k=5)

# Exploratory
result = tracer.trace_text(text, top_k=10)

Recommendations:

  • 1: Greedy search (not recommended—misses alternatives)
  • 3-5: Good balance
  • 10+: When you want to discover less-obvious paths

Note: the product top_k × top_beam bounds the number of candidate paths expanded at each step, and therefore memory usage.
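One step of the search can be sketched as follows. This is a toy illustration of how top_k and top_beam interact, not the library's internals: each active path is extended by its top_k strongest predecessors, then the pool is pruned back to top_beam. The attention values and positions are made up.

```python
import math

top_k, top_beam = 2, 2

# (score, path) pairs: two active paths ending at positions 7 and 6.
beams = [(0.0, [7]), (-0.5, [6])]

# Toy attention weights from each path's tail to earlier positions.
attn_to_predecessors = {7: {3: 0.6, 2: 0.3}, 6: {4: 0.8, 1: 0.1}}

candidates = []
for score, path in beams:
    weights = attn_to_predecessors[path[-1]]
    # Extend the path by its top_k strongest predecessors only.
    for pred, w in sorted(weights.items(), key=lambda kv: -kv[1])[:top_k]:
        candidates.append((score + math.log(w), path + [pred]))

# At most top_beam * top_k candidates exist here; prune back to top_beam.
beams = sorted(candidates, key=lambda c: -c[0])[:top_beam]
```

With these numbers the surviving paths are [7, 3] and [6, 4]: the weak 6 → 1 connection is considered (top_k=2) but pruned (top_beam=2).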

min_attn - Attention Threshold

Filters out weak attention connections.

# Include all attention (default)
result = tracer.trace_text(text, min_attn=0.0)

# Only meaningful attention
result = tracer.trace_text(text, min_attn=0.05)

# Only strong attention
result = tracer.trace_text(text, min_attn=0.1)

Recommendations:

  • 0.0: Include everything (default)
  • 0.01-0.05: Filter noise, keep signal
  • 0.1+: Focus on dominant connections only

Warning

Too high a threshold may cause paths to terminate early because no predecessor meets the threshold.
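The failure mode in the warning is easy to see in a standalone sketch (toy numbers, independent of the tracer): if no predecessor's weight clears the threshold, the candidate list is empty and the path cannot extend.

```python
# Toy attention weights from the current token to earlier positions.
attn_weights = {0: 0.04, 1: 0.03, 2: 0.08}
min_attn = 0.1

candidates = [pos for pos, w in attn_weights.items() if w >= min_attn]
print(candidates)  # [] -> no predecessor qualifies, so the path terminates here
```

With min_attn=0.05, position 2 (weight 0.08) would survive and the trace could continue.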

length_norm - Score Normalization

Controls how path scores are normalized for length.

# Raw scores (favor shorter paths)
result = tracer.trace_text(text, length_norm="none")

# Geometric mean (fair comparison - default)
result = tracer.trace_text(text, length_norm="avg_logprob")

# Moderate normalization
result = tracer.trace_text(text, length_norm="sqrt")

# Custom exponent
result = tracer.trace_text(text, length_norm="pow:0.7")

Comparison

| Method | Score Formula | Behavior |
|---|---|---|
| none | Σ log(attn) | Prefers shorter paths |
| avg_logprob | Σ log(attn) / n | Equal treatment of all lengths |
| sqrt | Σ log(attn) / √n | Slight preference for shorter paths |
| pow:α | Σ log(attn) / n^α | Tunable between the two |
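These formulas can be checked with plain Python. Since attention weights are at most 1, every log term is negative, so the unnormalized sum penalizes each extra step; dividing by n removes that penalty. The helper below is an illustrative reimplementation of the scoring rules, not the library's code.

```python
import math

def path_score(attn_weights, length_norm="avg_logprob"):
    """Score a path from its attention weights using the formulas above."""
    total = sum(math.log(a) for a in attn_weights)
    n = len(attn_weights)
    if length_norm == "none":
        return total
    if length_norm == "avg_logprob":
        return total / n
    if length_norm == "sqrt":
        return total / math.sqrt(n)
    if length_norm.startswith("pow:"):
        return total / n ** float(length_norm.split(":")[1])
    raise ValueError(f"unknown length_norm: {length_norm}")

short = [0.5, 0.5]              # 2 steps of moderate attention
long = [0.6, 0.6, 0.6, 0.6]     # 4 steps of slightly stronger attention

# Raw sums favor the short path; per-step averaging favors the long one.
print(path_score(short, "none"), path_score(long, "none"))
print(path_score(short, "avg_logprob"), path_score(long, "avg_logprob"))
```

The ranking flips between the two settings, which is exactly the behavior the table describes.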

Recommendations:

  • Use avg_logprob for general analysis (default)
  • Use none when path length matters (absolute influence)
  • Use pow:0.7 or sqrt as a middle ground

agg_heads - Head Aggregation

Controls how attention from multiple heads is combined.

# Average across heads (default)
result = tracer.trace_text(text, agg_heads="mean")

# Maximum across heads
result = tracer.trace_text(text, agg_heads="max")

Comparison:

| Method | Formula | Effect |
|---|---|---|
| mean | avg(head₁, head₂, ...) | Consensus attention |
| max | max(head₁, head₂, ...) | Strongest head signal |
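A minimal sketch of the two aggregations on a toy two-head attention row (plain Python; the real library presumably operates on full attention matrices):

```python
# Attention from one query position to three earlier positions, for two
# heads that attend to opposite ends of the context (toy numbers).
head1 = [0.7, 0.2, 0.1]
head2 = [0.1, 0.2, 0.7]

mean_agg = [round((a + b) / 2, 3) for a, b in zip(head1, head2)]  # consensus
max_agg = [max(a, b) for a, b in zip(head1, head2)]               # strongest head

print(mean_agg)  # [0.4, 0.2, 0.4]
print(max_agg)   # [0.7, 0.2, 0.7]
```

Note how max preserves both heads' peaks at full strength, while mean dilutes them: that is why max is better for spotting specialized head behavior and mean for consensus.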

Recommendations:

  • mean: General use, shows where heads agree
  • max: When looking for specialized head behavior

Note

agg_heads="none" is not supported for beam search, because leaving heads unaggregated would yield a separate attention matrix per head instead of the single matrix the search requires.

layer - Layer Selection

Choose which transformer layer to analyze.

# Last layer (most processed - default)
result = tracer.trace_text(text, layer=-1)

# First layer (raw attention patterns)
result = tracer.trace_text(text, layer=0)

# Middle layer
result = tracer.trace_text(text, layer=12)

Negative indices work like Python lists:

  • -1: Last layer
  • -2: Second-to-last layer

Recommendations:

  • -1: Default, shows final attention patterns
  • Early layers: Syntactic/local patterns
  • Middle layers: Semantic relationships
  • Late layers: Task-specific patterns

target_pos - Starting Position

Choose where to start the backward trace.

# Last token (default)
result = tracer.trace_text(text, target_pos=-1)

# Second-to-last
result = tracer.trace_text(text, target_pos=-2)

# Specific position
result = tracer.trace_text(text, target_pos=5)

Common use cases:

  • -1: What influenced the final output?
  • Specific position: What influenced token X?
  • Loop through positions: Analyze entire sequence
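The "loop through positions" pattern might look like the sketch below. The StubTracer is a hypothetical stand-in so the snippet is self-contained; with the real library you would use your actual tracer and the tokenizer's token count.

```python
class StubTracer:
    """Hypothetical stand-in for the real tracer, only to show the loop shape."""
    def trace_text(self, text, target_pos=-1):
        return {"target_pos": target_pos}

tracer = StubTracer()
text = "the cat sat on the mat"
num_tokens = len(text.split())  # with a real model, use the tokenizer's count

# Trace backward from every position to analyze the whole sequence.
results = [tracer.trace_text(text, target_pos=pos) for pos in range(num_tokens)]
```

Each element of results then describes what influenced one token, giving a position-by-position view of the sequence.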

Combining Parameters

Quick Overview

# Fast, simple view
result = tracer.trace_text(text, top_beam=2, top_k=3)

Deep Dive

# Thorough analysis
result = tracer.trace_text(
    text,
    top_beam=10,
    top_k=8,
    min_attn=0.01,
    length_norm="avg_logprob",
)

Debugging Specific Behavior

# Focus on strong signals in early layers
result = tracer.trace_text(
    text,
    layer=0,
    min_attn=0.1,
    top_beam=3,
    agg_heads="max",
)

Performance Tips

  1. Start small: Begin with defaults, adjust as needed
  2. Use min_attn: Filter noise for cleaner visualizations
  3. Reduce top_beam: Lower beam count = faster execution
  4. Early layers are faster: Less computation needed

Troubleshooting

Empty or short paths

  • Lower min_attn
  • Increase top_k
  • Check if stop_at_bos is terminating early

Too many similar paths

  • Reduce top_beam
  • Increase min_attn
  • Use beam filtering in visualization

Slow performance

  • Reduce top_beam and top_k
  • Use shorter input sequences
  • Try a smaller model