When a transformer model translates 'The cat sat on the mat' to French, how does it know that 'cat' corresponds to 'chat'? When BERT classifies a movie review as positive, which words drove that decision? The answer lies in attention mechanisms—and more importantly, in our ability to visualize and interpret them.
Attention mechanisms are the backbone of modern NLP and increasingly vision models. Unlike recurrent networks that process sequences step-by-step, attention allows models to directly connect any input position to any output position. This creates a natural interpretability opportunity: we can literally see which parts of the input the model 'attends to' when producing each part of the output.
But attention visualization is not as simple as 'bright colors mean important'. Multi-head attention, layer depth, and the difference between attention weights and information flow create subtle interpretation challenges. This page gives you the complete framework to correctly visualize and interpret attention patterns.
This page covers: (1) The mechanics of attention and why it creates interpretable patterns, (2) Visualizing single-head and multi-head attention, (3) Layer-by-layer attention analysis, (4) BertViz and other visualization tools, (5) Attention in vision transformers, (6) Cross-attention for multi-modal models, (7) Critical limitations of attention as explanation, and (8) When attention tells the truth and when it misleads.
Before visualizing attention, we must understand what attention computes. The scaled dot-product attention from 'Attention Is All You Need' (Vaswani et al., 2017) operates as follows:
Query-Key-Value (QKV) Framework:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $Q$ (queries), $K$ (keys), and $V$ (values) are linear projections of the input embeddings, and $d_k$ is the key dimension, used to scale the dot products before the softmax.
The Attention Weights Matrix:
The term $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ produces an attention weight matrix $A$ of shape [sequence_length × sequence_length]. Each row sums to 1: row $i$ is a probability distribution over source (key) positions, describing how much each of them contributes to target (query) position $i$.
$A_{ij}$ = attention weight from position $i$ to position $j$ = "how much position $i$ attends to position $j$"
High attention weight means the model 'looked' at that position. It doesn't necessarily mean that position was 'important' for the final prediction. Attention is one step in a complex computation—the model still applies transformations after aggregating attended values. Keep this distinction in mind throughout.
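To ground these definitions, here is a minimal PyTorch sketch of scaled dot-product attention on random tensors (not a trained model); the helper name and toy dimensions are illustrative. It returns both the aggregated output and the weight matrix $A$, and the final print confirms that each row of $A$ sums to 1.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V and also return the attention weights."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [seq_q, seq_k]
    weights = F.softmax(scores, dim=-1)             # each row is a distribution
    return weights @ V, weights

# Toy example: 5 tokens, 8-dimensional queries/keys/values
torch.manual_seed(0)
seq_len, d_k = 5, 8
Q, K, V = (torch.randn(seq_len, d_k) for _ in range(3))

output, A = scaled_dot_product_attention(Q, K, V)
print(A.shape)        # torch.Size([5, 5]) -- [seq_len, seq_len]
print(A.sum(dim=-1))  # each row sums to ~1.0
```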
The simplest attention visualization shows the attention matrix as a heatmap. Given input tokens, we color cells based on attention weight values.
Basic Heatmap Visualization:
For self-attention (input = output positions), the matrix is square. For cross-attention (e.g., decoder attending to encoder), it's rectangular.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from transformers import BertTokenizer, BertModel

# Load pre-trained BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

# Sample sentence
sentence = "The cat sat on the mat because it was tired"
inputs = tokenizer(sentence, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Forward pass with attention outputs
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple of length num_layers
# each layer: [batch, num_heads, seq_len, seq_len]
attentions = torch.stack(outputs.attentions).squeeze(1)  # [layers, heads, seq, seq]
print(f"Attention shape: {attentions.shape}")

# Visualize single head from single layer
layer_idx = 5  # Middle layer
head_idx = 0   # First head

attention_matrix = attentions[layer_idx, head_idx].numpy()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(
    attention_matrix,
    xticklabels=tokens,
    yticklabels=tokens,
    cmap='Blues',
    ax=ax,
    vmin=0,
    vmax=1
)
ax.set_xlabel('Source Token (Key)')
ax.set_ylabel('Target Token (Query)')
ax.set_title(f'Attention Weights: Layer {layer_idx}, Head {head_idx}')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.savefig('attention_heatmap.png', dpi=150)
plt.show()

# Line/arc visualization for a specific token
def plot_token_attention(attention_matrix, tokens, target_idx, ax=None):
    """Visualize attention FROM a specific target token."""
    if ax is None:
        fig, ax = plt.subplots(figsize=(12, 4))

    weights = attention_matrix[target_idx]
    positions = range(len(tokens))

    # Bar chart of attention weights
    bars = ax.bar(positions, weights, color='steelblue', alpha=0.7)

    # Highlight the source token
    bars[target_idx].set_color('darkred')

    ax.set_xticks(positions)
    ax.set_xticklabels(tokens, rotation=45, ha='right')
    ax.set_ylabel('Attention Weight')
    ax.set_title(f'Attention from "{tokens[target_idx]}" to all tokens')
    ax.set_ylim(0, 1)

    return ax

# What does 'it' attend to? (pronoun resolution)
it_idx = tokens.index('it')
fig, ax = plt.subplots(figsize=(12, 4))
plot_token_attention(attention_matrix, tokens, it_idx, ax)
plt.tight_layout()
plt.savefig('pronoun_attention.png', dpi=150)
plt.show()
```

Real transformer models use multi-head attention: multiple attention mechanisms running in parallel. Each head can learn to attend differently:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
where each $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ and $W^O$ is the learned output projection.
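To see where the per-head attention matrices that we visualize come from, here is a toy, shape-level sketch of multi-head attention (random weights; names and sizes are illustrative, and real implementations fuse the per-head projections into single matrices). Each head produces its own [seq, seq] attention matrix before the outputs are concatenated and projected.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, n_heads = 10, 768, 12
d_head = d_model // n_heads  # 64, as in BERT-base

x = torch.randn(seq_len, d_model)

# One projection per head (illustrative)
W_q = torch.randn(n_heads, d_model, d_head) * 0.02
W_k = torch.randn(n_heads, d_model, d_head) * 0.02
W_v = torch.randn(n_heads, d_model, d_head) * 0.02
W_o = torch.randn(d_model, d_model) * 0.02

head_outputs, head_weights = [], []
for h in range(n_heads):
    Q, K, V = x @ W_q[h], x @ W_k[h], x @ W_v[h]        # [seq, d_head] each
    A = F.softmax(Q @ K.T / d_head ** 0.5, dim=-1)      # [seq, seq], one per head
    head_weights.append(A)
    head_outputs.append(A @ V)                          # [seq, d_head]

multi_head = torch.cat(head_outputs, dim=-1) @ W_o      # [seq, d_model]
print(multi_head.shape)                 # torch.Size([10, 768])
print(torch.stack(head_weights).shape)  # [12, seq, seq]: what we visualize per head
```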
Why Multiple Heads? A single attention distribution can capture only one kind of relationship per token. Running several heads in parallel lets each head specialize (for example in syntax, coreference, or position), as the head-analysis code later in this section shows.
Visualizing Multiple Heads:
Displaying 12 heads × 12 layers = 144 attention matrices is overwhelming. Common strategies: plot all heads of a single layer as a grid of small multiples, average attention across heads within a layer, or compute summary statistics (diagonal ratio, previous/next-token focus, [CLS] focus, entropy) to automatically flag specialized heads. The code below applies the first and third strategies.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import BertTokenizer, BertModel

# Load model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

sentence = "The lawyer questioned the witness because she thought he was lying"
inputs = tokenizer(sentence, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = model(**inputs)

attentions = torch.stack(outputs.attentions).squeeze(1)  # [12, 12, seq, seq]
num_layers, num_heads, seq_len, _ = attentions.shape

# Visualize all 12 heads from layer 8 (often captures semantic relations)
layer = 8
fig, axes = plt.subplots(3, 4, figsize=(16, 12))

for head in range(num_heads):
    ax = axes[head // 4, head % 4]
    attn = attentions[layer, head].numpy()

    im = ax.imshow(attn, cmap='Blues', vmin=0, vmax=attn.max())
    ax.set_title(f'Head {head}', fontsize=10)
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))

    if head % 4 == 0:
        ax.set_yticklabels(tokens, fontsize=6)
    else:
        ax.set_yticklabels([])

    if head >= 8:
        ax.set_xticklabels(tokens, fontsize=6, rotation=90)
    else:
        ax.set_xticklabels([])

plt.suptitle(f'All 12 Attention Heads - Layer {layer}', fontsize=14)
plt.tight_layout()
plt.savefig('all_heads_layer8.png', dpi=150)
plt.show()

# Identify specialized heads
def analyze_head_patterns(attentions, tokens):
    """Analyze what patterns different heads capture."""
    n_layers, n_heads, seq_len, _ = attentions.shape

    patterns = []
    for layer in range(n_layers):
        for head in range(n_heads):
            attn = attentions[layer, head].numpy()

            # Metrics for head specialization
            diagonal_ratio = np.trace(attn) / seq_len  # Self-attention

            # Previous token attention (diagonal shifted by 1)
            prev_token = np.mean([attn[i, i - 1] for i in range(1, seq_len)])

            # Next token attention
            next_token = np.mean([attn[i, i + 1] for i in range(seq_len - 1)])

            # Special token ([CLS]) attention
            cls_attention = np.mean(attn[:, 0])  # Attention TO [CLS]

            # Entropy (spread of attention)
            entropy = -np.sum(attn * np.log(attn + 1e-10)) / seq_len

            patterns.append({
                'layer': layer,
                'head': head,
                'self_attn': diagonal_ratio,
                'prev_token': prev_token,
                'next_token': next_token,
                'cls_attn': cls_attention,
                'entropy': entropy
            })

    return patterns

patterns = analyze_head_patterns(attentions, tokens)

# Find specialized heads
print("Head Specialization Analysis:")
print("=" * 60)

# Highest self-attention
self_attn = max(patterns, key=lambda x: x['self_attn'])
print(f"Position-aware (self): L{self_attn['layer']}H{self_attn['head']} ({self_attn['self_attn']:.3f})")

# Previous token focus
prev_focus = max(patterns, key=lambda x: x['prev_token'])
print(f"Previous token focus: L{prev_focus['layer']}H{prev_focus['head']} ({prev_focus['prev_token']:.3f})")

# [CLS] focus
cls_focus = max(patterns, key=lambda x: x['cls_attn'])
print(f"[CLS] aggregator: L{cls_focus['layer']}H{cls_focus['head']} ({cls_focus['cls_attn']:.3f})")
```

Research has found that specific heads consistently capture specific linguistic phenomena: syntactic heads track grammatical relations, coreference heads link pronouns to antecedents, and positional heads focus on adjacent tokens. The paper 'What Does BERT Look At?' (Clark et al., 2019) provides detailed analysis of BERT's attention heads.
Manual heatmap creation quickly becomes tedious. BertViz (Jesse Vig, 2019) provides interactive attention visualization that has become the standard tool for exploring transformer attention.
BertViz Views:
Head View: All heads from all layers in one interface. Click any head to see its attention pattern. Lines connect attended tokens with width proportional to attention weight.
Model View: Aggregated attention across all heads in a layer. Useful for seeing overall layer behavior.
Neuron View: Traces how individual query and key neurons contribute to attention. Most granular analysis.
Key Features: interactive exploration of every layer and head, inline rendering in Jupyter notebooks (from scripts the views open in a browser), and direct consumption of the attention tensors returned by HuggingFace models with output_attentions=True.
```python
# Install BertViz: pip install bertviz

import torch
import numpy as np
from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

# Load model with attention outputs
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

# Analyze a sentence pair
sentence_a = "The cat sat on the mat."
sentence_b = "It was very comfortable."

# Tokenize
inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids'][0]
tokens = tokenizer.convert_ids_to_tokens(input_ids)

# Get attention
with torch.no_grad():
    outputs = model(**inputs)
    attention = outputs.attentions  # Tuple of (batch, heads, seq, seq), one per layer

# HEAD VIEW: Interactive visualization of all heads
# Shows connections between tokens with lines
# Width of line = attention weight
head_view(attention, tokens)

# MODEL VIEW: Aggregated view across heads
# Useful for seeing layer-by-layer patterns
model_view(attention, tokens)

# In Jupyter notebooks these render inline as interactive widgets;
# from scripts they open in a browser.

# Programmatic analysis alongside visualization
def summarize_attention_patterns(attention, tokens):
    """Generate a human-readable summary of attention patterns."""
    attention_stack = torch.stack(attention).squeeze(1)  # [layers, heads, seq, seq]

    summaries = []
    for layer_idx in range(attention_stack.shape[0]):
        layer_attn = attention_stack[layer_idx]  # [heads, seq, seq]

        # Average across heads
        avg_attn = layer_attn.mean(dim=0).numpy()

        # Find strongest non-diagonal attention
        np.fill_diagonal(avg_attn, 0)
        max_attn = np.unravel_index(np.argmax(avg_attn), avg_attn.shape)
        source, target = max_attn
        weight = avg_attn[source, target]

        summaries.append(f"Layer {layer_idx}: '{tokens[source]}' → '{tokens[target]}' ({weight:.3f})")

    return summaries

summaries = summarize_attention_patterns(attention, tokens)
print("Strongest Attention Connections per Layer:")
for s in summaries:
    print(s)
```

Beyond BertViz, several other tools support attention-based interpretation:

| Tool | Main Use Case | Interactivity | Installation |
|---|---|---|---|
| BertViz | General transformer attention exploration | High (web-based) | pip install bertviz |
| Transformers Interpret | Attribution + attention for HuggingFace | Medium | pip install transformers-interpret |
| Ecco | NLP interpretation with attention + embeddings | High | pip install ecco |
| AllenNLP Interpret | NLP interpretation suite | High (demo) | pip install allennlp-interpret |
| LIT (Language Interpretability Tool) | Google's interactive ML analysis | Very High | pip install lit-nlp |
Deep transformers stack many layers, and attention patterns evolve dramatically from early to late layers. Understanding this evolution reveals how models build up representations.
Layer Progression Patterns:
Early Layers (1-3): Attention often focuses on positional patterns—adjacent tokens, special tokens ([CLS], [SEP], [PAD]). Local syntactic structure.
Middle Layers (4-8): Increasingly semantic attention. Pronouns attend to antecedents. Related concepts connect. Long-range dependencies emerge.
Late Layers (9-12): Task-specific patterns. For classification, attention concentrates on [CLS]. For generation, attention patterns become more diffuse and abstract.
Aggregating Across Layers:
No single layer tells the complete story. Information flows through the network, transforming at each layer. Recent work suggests that methods which trace attention paths through all layers, such as attention rollout or attention flow, reflect information flow better than any single layer's attention map.
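In equation form, mirroring the rollout code below, the head-averaged attention $\bar{A}^{(\ell)}$ of layer $\ell$ is blended with the identity matrix to account for the residual connection, row-normalized, and the resulting matrices are multiplied from the first layer up to the last:

$$\tilde{A}^{(\ell)} = \operatorname{norm}\!\left(\tfrac{1}{2}\,\bar{A}^{(\ell)} + \tfrac{1}{2}\,I\right), \qquad \text{Rollout} = \tilde{A}^{(L)}\,\tilde{A}^{(L-1)}\cdots\tilde{A}^{(1)}$$

where $\operatorname{norm}(\cdot)$ renormalizes each row to sum to 1 and $L$ is the number of layers.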
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

sentence = "John gave Mary a book because she asked for it"
inputs = tokenizer(sentence, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = model(**inputs)

attentions = torch.stack(outputs.attentions).squeeze()  # [12, 12, seq, seq]

# Analyze attention entropy (spread) across layers
def attention_entropy(attn_matrix):
    """Calculate entropy of attention distribution per row, averaged."""
    # attn_matrix: [seq, seq], each row sums to 1
    epsilon = 1e-10
    entropy_per_row = -torch.sum(attn_matrix * torch.log(attn_matrix + epsilon), dim=-1)
    return entropy_per_row.mean().item()

# Track metrics across layers
layer_metrics = []
for layer in range(12):
    layer_attn = attentions[layer].mean(dim=0)  # Average across heads

    entropy = attention_entropy(layer_attn)
    diagonal = torch.trace(layer_attn).item() / len(tokens)

    # Attention to [CLS]
    cls_attn = layer_attn[:, 0].mean().item()

    layer_metrics.append({
        'layer': layer,
        'entropy': entropy,
        'self_attention': diagonal,
        'cls_attention': cls_attn
    })

# Plot layer progression
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

layers = range(12)
metrics = layer_metrics

axes[0].plot(layers, [m['entropy'] for m in metrics], 'o-', color='steelblue')
axes[0].set_xlabel('Layer')
axes[0].set_ylabel('Attention Entropy')
axes[0].set_title('Attention Spread (Entropy)')

axes[1].plot(layers, [m['self_attention'] for m in metrics], 'o-', color='darkorange')
axes[1].set_xlabel('Layer')
axes[1].set_ylabel('Self-Attention Ratio')
axes[1].set_title('Diagonal Attention')

axes[2].plot(layers, [m['cls_attention'] for m in metrics], 'o-', color='forestgreen')
axes[2].set_xlabel('Layer')
axes[2].set_ylabel('[CLS] Attention')
axes[2].set_title('Attention to [CLS]')

plt.suptitle('Attention Patterns Across BERT Layers', fontsize=14)
plt.tight_layout()
plt.savefig('layer_progression.png', dpi=150)
plt.show()

# ATTENTION ROLLOUT: Aggregate attention across layers
def attention_rollout(attentions, add_residual=True):
    """
    Compute attention rollout as described in Abnar & Zuidema (2020).
    Aggregates attention across layers accounting for residual connections.
    """
    # attentions: [n_layers, n_heads, seq_len, seq_len]
    n_layers, n_heads, seq_len, _ = attentions.shape

    # Average across heads
    attn_avg = attentions.mean(dim=1)  # [n_layers, seq_len, seq_len]

    # Add identity matrix for residual connection
    if add_residual:
        eye = torch.eye(seq_len)
        attn_avg = 0.5 * attn_avg + 0.5 * eye

    # Normalize rows
    attn_avg = attn_avg / attn_avg.sum(dim=-1, keepdim=True)

    # Rollout: multiply attention matrices
    rollout = attn_avg[0]
    for layer in range(1, n_layers):
        rollout = torch.matmul(attn_avg[layer], rollout)

    return rollout

rollout_attn = attention_rollout(attentions)

# Compare single layer vs rollout
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Single layer (last layer)
single_layer = attentions[-1].mean(dim=0).numpy()
im1 = axes[0].imshow(single_layer, cmap='Blues')
axes[0].set_title('Single Layer (Layer 12, avg heads)')
axes[0].set_xticks(range(len(tokens)))
axes[0].set_xticklabels(tokens, rotation=90, fontsize=8)
axes[0].set_yticks(range(len(tokens)))
axes[0].set_yticklabels(tokens, fontsize=8)
plt.colorbar(im1, ax=axes[0])

# Attention rollout
im2 = axes[1].imshow(rollout_attn.numpy(), cmap='Blues')
axes[1].set_title('Attention Rollout (All Layers)')
axes[1].set_xticks(range(len(tokens)))
axes[1].set_xticklabels(tokens, rotation=90, fontsize=8)
axes[1].set_yticks(range(len(tokens)))
axes[1].set_yticklabels(tokens, fontsize=8)
plt.colorbar(im2, ax=axes[1])

plt.tight_layout()
plt.savefig('rollout_comparison.png', dpi=150)
plt.show()
```

Attention rollout assumes attention matrices can be simply multiplied across layers. This ignores the non-linear transformations (FFN, LayerNorm) between attention layers. More sophisticated methods like 'Attention Flow' (Abnar & Zuidema, 2020) or gradient-based methods may better capture true information flow.
Vision Transformers (ViT) apply the same attention mechanism to images by treating image patches as 'tokens'. This creates a powerful opportunity: we can visualize which parts of an image the model attends to for each patch.
ViT Patch Structure: The input image is split into fixed-size patches (16×16 pixels for ViT-Base), each patch is linearly projected into an embedding, and a learnable [CLS] token is prepended. The resulting sequence is processed by standard transformer layers, exactly as if the patches were word tokens.
Attention Visualization for ViT:
For a 224×224 image with 16×16 patches, we get 196 patches + 1 [CLS] = 197 tokens. The attention from [CLS] to all patches can be reshaped back to a 14×14 spatial grid and overlaid on the original image.
This creates attention maps showing which regions the model 'looks at' for its classification decision.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from transformers import ViTImageProcessor, ViTModel
import requests

# Load ViT model
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTModel.from_pretrained('google/vit-base-patch16-224', output_attentions=True)

# Load sample image
url = 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Cat_November_2010-1a.jpg/1200px-Cat_November_2010-1a.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')

# Process image
inputs = processor(images=image, return_tensors="pt")

# Get attention
with torch.no_grad():
    outputs = model(**inputs)

attentions = torch.stack(outputs.attentions).squeeze()  # [12, 12, 197, 197]
# 197 = 196 patches (14x14) + 1 [CLS] token

# Extract attention from [CLS] token to all patches
def get_attention_map(attentions, layer, head, cls_idx=0, patch_size=14):
    """Get spatial attention map from [CLS] to patches."""
    attn = attentions[layer, head]  # [197, 197]

    # Attention FROM [CLS] TO patches (excluding [CLS] itself)
    cls_to_patches = attn[cls_idx, 1:].numpy()  # [196]

    # Reshape to spatial grid
    attn_map = cls_to_patches.reshape(patch_size, patch_size)
    return attn_map

# Visualize attention for different heads
fig, axes = plt.subplots(3, 4, figsize=(16, 12))

for head in range(12):
    ax = axes[head // 4, head % 4]

    # Get attention from last layer
    attn_map = get_attention_map(attentions, layer=-1, head=head)

    # Resize to original image size
    attn_resized = np.array(Image.fromarray(attn_map).resize((224, 224), Image.BILINEAR))

    # Overlay on original image
    ax.imshow(image.resize((224, 224)))
    ax.imshow(attn_resized, cmap='hot', alpha=0.6)
    ax.set_title(f'Head {head}')
    ax.axis('off')

plt.suptitle('ViT Attention Maps (Last Layer, [CLS] → Patches)', fontsize=14)
plt.tight_layout()
plt.savefig('vit_attention_heads.png', dpi=150)
plt.show()

# Aggregate across heads and layers for overall attention
def aggregate_vit_attention(attentions, method='mean'):
    """Aggregate attention across heads and layers."""
    # attentions: [n_layers, n_heads, 197, 197]
    if method == 'mean':
        # Simple average
        avg_attn = attentions.mean(dim=(0, 1))  # [197, 197]
        cls_to_patches = avg_attn[0, 1:].numpy()
    elif method == 'rollout':
        # Attention rollout across layers
        n_layers, n_heads, seq_len, _ = attentions.shape
        attn_avg = attentions.mean(dim=1)  # [n_layers, 197, 197]
        eye = torch.eye(seq_len)
        attn_avg = 0.5 * attn_avg + 0.5 * eye
        attn_avg = attn_avg / attn_avg.sum(dim=-1, keepdim=True)

        rollout = attn_avg[0]
        for layer in range(1, n_layers):
            rollout = torch.matmul(attn_avg[layer], rollout)
        cls_to_patches = rollout[0, 1:].numpy()

    return cls_to_patches.reshape(14, 14)

# Compare aggregation methods
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].imshow(image.resize((224, 224)))
axes[0].set_title('Original Image')
axes[0].axis('off')

mean_attn = aggregate_vit_attention(attentions, 'mean')
mean_resized = np.array(Image.fromarray(mean_attn).resize((224, 224), Image.BILINEAR))
axes[1].imshow(image.resize((224, 224)))
axes[1].imshow(mean_resized, cmap='hot', alpha=0.6)
axes[1].set_title('Mean Attention')
axes[1].axis('off')

rollout_attn = aggregate_vit_attention(attentions, 'rollout')
rollout_resized = np.array(Image.fromarray(rollout_attn).resize((224, 224), Image.BILINEAR))
axes[2].imshow(image.resize((224, 224)))
axes[2].imshow(rollout_resized, cmap='hot', alpha=0.6)
axes[2].set_title('Attention Rollout')
axes[2].axis('off')

plt.tight_layout()
plt.savefig('vit_aggregation.png', dpi=150)
plt.show()
```
Some of the most interpretable attention patterns emerge in cross-attention, where one sequence (e.g., generated text) attends to another (e.g., image patches or a source sentence). This is central to encoder-decoder machine translation, image captioning (as in the BLIP example below), and other multi-modal models in which a decoder attends to a separate encoder.
Why Cross-Attention is More Interpretable:
In cross-attention, we're explicitly asking: 'When generating word X, which parts of the image does the model look at?' This is more directly interpretable than self-attention, where the same sequence attends to itself.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
import requests

# Load BLIP model (image captioning with cross-attention)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base",
    output_attentions=True
)

# Load image
url = 'https://upload.wikimedia.org/wikipedia/commons/b/bc/Juvenile_Ragdoll.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert('RGB')

# Generate caption with attention
inputs = processor(images=image, return_tensors="pt")

# Generate with output attentions
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        return_dict_in_generate=True,
        output_attentions=True
    )

# The cross-attentions show decoder tokens attending to encoder (image) tokens
# outputs.cross_attentions is complex for generation (one per generated token)

# Decode generated caption
caption = processor.decode(outputs.sequences[0], skip_special_tokens=True)
print(f"Generated caption: {caption}")

# For detailed cross-attention analysis, use a single forward pass.
# Below is a simplified example showing the concept.

def visualize_translation_alignment(src_tokens, tgt_tokens, attention_matrix):
    """
    Visualize cross-attention alignment for translation or similar tasks.
    attention_matrix: [tgt_len, src_len]
    """
    fig, ax = plt.subplots(figsize=(12, 8))

    im = ax.imshow(attention_matrix, cmap='Blues')

    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45, ha='right')
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)

    ax.set_xlabel('Source (Encoder)')
    ax.set_ylabel('Target (Decoder)')
    ax.set_title('Cross-Attention Alignment')

    plt.colorbar(im)
    plt.tight_layout()
    return fig

# Simulated translation alignment (English-French)
src_tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
tgt_tokens = ['Le', 'chat', 's\'est', 'assis', 'sur', 'le', 'tapis', '.']

# Simulated attention matrix (would come from model in practice)
np.random.seed(42)
attention = np.zeros((len(tgt_tokens), len(src_tokens)))

# Create plausible alignment
alignments = [(0, 0), (1, 1), (2, 2), (3, 2), (4, 3), (5, 4), (6, 5), (7, 6)]
for tgt_idx, src_idx in alignments:
    attention[tgt_idx, src_idx] = np.random.uniform(0.6, 0.9)
    # Add some noise to neighbors
    for offset in [-1, 1]:
        if 0 <= src_idx + offset < len(src_tokens):
            attention[tgt_idx, src_idx + offset] = np.random.uniform(0.05, 0.15)

# Normalize rows
attention = attention / attention.sum(axis=1, keepdims=True)

fig = visualize_translation_alignment(src_tokens, tgt_tokens, attention)
plt.savefig('translation_alignment.png', dpi=150)
plt.show()

# Key insight: Cross-attention creates word-level or patch-level alignments
# that are often directly interpretable as "word X corresponds to word Y"
# or "word X looks at image region Z"
```

Cross-attention visualizations are particularly valuable for debugging multi-modal models. If an image captioning model says 'a dog on the couch' when the image shows a cat, checking cross-attention can reveal whether the model looked at the wrong region (attention error) or correctly attended but misclassified the object (recognition error).
Despite its intuitive appeal, attention visualization has fundamental limitations that every practitioner must understand. The seminal paper 'Attention is Not Explanation' (Jain & Wallace, 2019) demonstrated that attention weights do not reliably indicate feature importance.
Key Findings:
Alternative Attention Distributions: Many different attention patterns can produce the same prediction. High attention on a word doesn't mean that word was necessary.
Gradient Mismatch: Attention weights often don't align with gradient-based importance measures. The model might attend to a word but not be sensitive to changes in that word.
Attention is Input to Computation: Attention determines what gets aggregated, but the subsequent transformations (FFN layers, LayerNorm) determine how that information affects output.
Adversarial Attention: It's possible to create models with misleading attention that still perform well on tasks.
| Aspect | What Attention Shows | What Explanation Requires |
|---|---|---|
| Definition | Which inputs were weighted highly | Which inputs caused the output |
| Counterfactual | No: doesn't show what happens if input changed | Yes: requires sensitivity analysis |
| Sufficiency | No: aggregated values still transformed | Yes: should identify sufficient features |
| Uniqueness | No: multiple attention patterns give same output | Ideally: canonical explanation |
| Relationship | Correlation with output computation | Causal contribution to output |
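One simple counterfactual check implied by the table above is token occlusion: mask each token in turn, measure how much the predicted-class probability drops, and compare those drops with the attention the token receives from [CLS]. The sketch below assumes the same 'textattack/bert-base-uncased-SST-2' checkpoint used in the comparison code that follows; it illustrates the idea rather than providing a rigorous faithfulness test.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('textattack/bert-base-uncased-SST-2')
model = BertForSequenceClassification.from_pretrained(
    'textattack/bert-base-uncased-SST-2', output_attentions=True
)
model.eval()

sentence = "This movie was absolutely terrible and a waste of time"
inputs = tokenizer(sentence, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits.softmax(dim=-1)
pred_class = probs.argmax(dim=-1).item()
base_prob = probs[0, pred_class].item()

# Attention FROM [CLS] in the last layer, averaged over heads
attn_from_cls = outputs.attentions[-1][0].mean(dim=0)[0]

# Occlusion: replace each token with [MASK] and measure the probability drop
mask_id = tokenizer.mask_token_id
drops = []
for i in range(len(tokens)):
    if tokens[i] in ('[CLS]', '[SEP]'):
        drops.append(0.0)
        continue
    occluded = inputs['input_ids'].clone()
    occluded[0, i] = mask_id
    with torch.no_grad():
        occ_logits = model(input_ids=occluded,
                           attention_mask=inputs['attention_mask']).logits
    occ_prob = occ_logits.softmax(dim=-1)[0, pred_class].item()
    drops.append(base_prob - occ_prob)

print(f"{'token':<12}{'attn from [CLS]':>18}{'prob drop if masked':>22}")
for tok, a, d in zip(tokens, attn_from_cls.tolist(), drops):
    print(f"{tok:<12}{a:>18.3f}{d:>22.3f}")
# If the two columns rank tokens very differently, attention alone is a
# poor guide to which tokens actually drive the prediction.
```

A complementary check, shown next, compares attention against gradient-based importance on the same sentence.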
```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import spearmanr
from transformers import BertTokenizer, BertForSequenceClassification

# Load sentiment classification model
tokenizer = BertTokenizer.from_pretrained('textattack/bert-base-uncased-SST-2')
model = BertForSequenceClassification.from_pretrained(
    'textattack/bert-base-uncased-SST-2',
    output_attentions=True
)

sentence = "This movie was absolutely terrible and a waste of time"
inputs = tokenizer(sentence, return_tensors='pt', padding=True)
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

# Get attention (no gradients needed for this pass)
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
attentions = torch.stack(outputs.attentions).squeeze()  # [12, 12, seq, seq]

# Attention in the last layer, averaged across heads
last_layer_attn = attentions[-1].mean(dim=0)
attn_to_cls = last_layer_attn[:, 0].numpy()    # Attention FROM each token TO [CLS]
attn_from_cls = last_layer_attn[0, :].numpy()  # Attention FROM [CLS] TO each token

# Now compute gradient-based importance
embeddings = model.bert.embeddings.word_embeddings(inputs['input_ids'])
embeddings.retain_grad()

# Forward pass with gradient tracking, using the model's own classification head
bert_outputs = model.bert(inputs_embeds=embeddings)
pooled = bert_outputs.pooler_output       # tanh-pooled [CLS] representation
logits_cls = model.classifier(pooled)     # classification head (dropout is a no-op in eval mode)

# Compute gradient w.r.t. predicted class
predicted_class = logits_cls.argmax().item()
logits_cls[0, predicted_class].backward()

# Gradient importance: L2 norm of embedding gradients
gradient_importance = embeddings.grad[0].norm(dim=-1).detach().numpy()

# Compare attention vs gradient
fig, axes = plt.subplots(3, 1, figsize=(12, 10))

x = range(len(tokens))

axes[0].bar(x, attn_from_cls, color='steelblue')
axes[0].set_xticks(x)
axes[0].set_xticklabels(tokens, rotation=45, ha='right')
axes[0].set_ylabel('Attention Weight')
axes[0].set_title('Attention FROM [CLS] (Last Layer, Avg Heads)')

axes[1].bar(x, gradient_importance, color='darkorange')
axes[1].set_xticks(x)
axes[1].set_xticklabels(tokens, rotation=45, ha='right')
axes[1].set_ylabel('Gradient Magnitude')
axes[1].set_title('Gradient-Based Importance')

# Correlation
corr, pval = spearmanr(attn_from_cls, gradient_importance)

axes[2].scatter(attn_from_cls, gradient_importance, alpha=0.7)
for i, tok in enumerate(tokens):
    axes[2].annotate(tok, (attn_from_cls[i], gradient_importance[i]), fontsize=8)
axes[2].set_xlabel('Attention Weight')
axes[2].set_ylabel('Gradient Importance')
axes[2].set_title(f'Attention vs Gradient (Spearman r={corr:.3f}, p={pval:.3f})')

plt.tight_layout()
plt.savefig('attention_vs_gradient.png', dpi=150)
plt.show()

# Key insight: If correlation is low, attention doesn't predict importance well
```

The follow-up paper 'Attention is Not Not Explanation' (Wiegreffe & Pinter, 2019) argues that the critique is too strong. Attention can be a reasonable explanation in many contexts, especially when combined with other evidence. The takeaway: attention is one useful signal among many, not a complete explanation.
Attention visualization provides a window into transformer computations, but that window must be used carefully with full awareness of its limitations. Here's the essential framework: treat attention weights as a descriptive record of where the model looked, not as a causal explanation; inspect multiple heads and layers rather than a single matrix; use aggregation methods such as attention rollout when you care about information flow across layers; and corroborate attention patterns with gradient- or perturbation-based importance before drawing conclusions.
What's Next:
Attention visualization explained what models 'look at' in a soft, distributed way. In the next page, we'll explore Saliency Maps—gradient-based methods that reveal which input features the model is most sensitive to. Saliency provides a complementary view based on counterfactual reasoning: 'How would the output change if this input changed?'
You now have a comprehensive understanding of attention visualization in transformer models. You can visualize single-head and multi-head patterns, analyze layer progression, interpret cross-attention for multi-modal tasks, and critically evaluate when attention does and doesn't provide reliable explanations.