When a neural network classifies an image as 'zebra', we naturally ask: Is it using stripes? Is it recognizing the body shape? Or is it picking up on something else entirely? Gradient-based saliency tells us which pixels matter, but not in terms of human-understandable concepts like 'stripes' or 'savanna'.
Concept Activation Vectors (CAVs) bridge this gap. Instead of explaining predictions in terms of low-level inputs (pixels, tokens), CAVs test whether models use high-level, human-interpretable concepts. We can ask and answer questions like: Does the model rely on 'stripes' when predicting 'zebra'? Does a gender-related concept influence an occupation classifier's predictions?
This shifts interpretation from where the model looks to what concepts the model uses—a profoundly more actionable form of understanding.
This page covers: (1) The intuition and mathematics behind CAVs, (2) Testing with CAVs (TCAV) for quantitative concept importance, (3) Practical implementation with image classifiers, (4) Choosing and validating concepts, (5) Relative concept importance and sensitivity analysis, (6) CAVs for bias detection, (7) Limitations and extensions, and (8) Best practices for concept-based interpretation.
Neural networks learn internal representations—activations at each layer encode information about the input. The key insight of CAVs (Kim et al., 2018) is that human-interpretable concepts have directions in activation space.
The Core Idea: Collect a set of examples of a concept (e.g., images of striped patterns) and a set of random counterexamples, record the network's activations at a chosen layer for both sets, and train a linear classifier to separate them. The CAV is the unit vector normal to the classifier's decision boundary, pointing toward the concept examples.
This CAV represents the 'direction of stripedness' in that layer's activation space. Moving in that direction corresponds to adding more stripe-like qualities as understood by the network.
Why Linear Classifiers?
The linearity assumption is philosophically motivated: if we need a complex non-linear classifier to separate 'striped' from 'not striped' activations, then 'stripes' isn't a meaningful concept in that layer's representation. Concepts should be linearly separable if the network has learned them.
Think of the layer's activation space as a high-dimensional room. Each input creates a point in this room (its activation vector). If 'striped' is a meaningful concept, all striped inputs cluster in a certain region, and the CAV points toward that region. Testing if the model 'uses stripes' means asking if moving toward that cluster affects the model's output.
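To make this concrete, here is a minimal sketch of the construction, assuming we already have activation matrices for concept and random images (the shapes, offsets, and variable names below are illustrative placeholders, not outputs of a real network):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder activations from some layer: 50 concept (e.g., striped) images
# and 50 random images, 128 units each
concept_acts = rng.normal(loc=1.0, size=(50, 128))
random_acts = rng.normal(loc=0.0, size=(50, 128))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))

# Linear classifier separating concept from random activations
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the classifier's weight vector, normalized: one direction in
# the layer's activation space pointing toward the concept examples
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
print(cav.shape)  # (128,)
```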
Testing with Concept Activation Vectors (TCAV) provides a quantitative measure of concept importance. The key metric is the TCAV score, which measures what fraction of inputs have predictions that increase when moving in the concept direction.
Formal Definition:
Let $f_l(x)$ be the activation vector of layer $l$ for input $x$, $h_{k,l}$ the function mapping layer-$l$ activations to the logit of class $k$, $v_C^l$ the unit CAV for concept $C$ at layer $l$, and $X_k$ the set of inputs belonging to class $k$.
The directional derivative of the prediction w.r.t. the concept direction:
$$S_{C,k,l}(x) = \nabla h_{k,l}(f_l(x)) \cdot v_C^l$$
This tells us: if we move the activation in the concept direction, does the prediction for class $k$ increase or decrease?
TCAV Score:
$$\text{TCAV}_{C,k,l} = \frac{|\{x \in X_k : S_{C,k,l}(x) > 0\}|}{|X_k|}$$
This is the fraction of class-$k$ examples where the concept positively influences the prediction. A TCAV score of 0.7 means 70% of examples have predictions that would increase if we added more of concept $C$.
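As a toy illustration of these two formulas (all numbers below are made up), the directional derivative is just a dot product between the gradient and the CAV, and the TCAV score is the fraction of positive dot products:

```python
import numpy as np

rng = np.random.default_rng(0)

# A unit concept direction v_C^l in an 8-dimensional activation space
cav = rng.normal(size=8)
cav /= np.linalg.norm(cav)

# Stand-in gradients ∇h_{k,l}(f_l(x)) for 5 class-k examples (one per row)
grads = rng.normal(size=(5, 8))

# Directional derivatives S_{C,k,l}(x): one dot product per example
directional_derivs = grads @ cav

# TCAV score: fraction of examples with positive sensitivity to the concept
tcav_score = np.mean(directional_derivs > 0)
print(directional_derivs.round(3), tcav_score)
```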
| TCAV Score | Interpretation | Implication |
|---|---|---|
| ~0.5 | Random / No relationship | Concept not consistently used for this class |
| > 0.6 | Positive association | Concept tends to increase class prediction |
| > 0.8 | Strong positive association | Concept is important for the class |
| < 0.4 | Negative association | Concept tends to decrease class prediction |
| < 0.2 | Strong negative association | Concept actively opposes the class |
Statistical Significance Testing:
A critical aspect of TCAV is testing whether the measured score is statistically different from chance (0.5). The paper introduces two approaches:
Random CAVs: Train multiple CAVs using random concept sets. If the real CAV's score is significantly different from random CAVs, the concept is meaningful.
Bootstrap Testing: Resample concept examples and retrain CAVs. Compute confidence intervals for the TCAV score.
This statistical rigor distinguishes TCAV from simple correlation measures.
```python
import numpy as np
from scipy import stats

def compute_tcav_score(directional_derivatives):
    """
    Compute TCAV score from directional derivatives.

    Args:
        directional_derivatives: Array of S_{C,k,l}(x) for all x in class k

    Returns:
        TCAV score (fraction with positive derivative)
    """
    positive_count = np.sum(directional_derivatives > 0)
    total_count = len(directional_derivatives)
    return positive_count / total_count

def tcav_significance_test(tcav_score, n_samples, null_hypothesis=0.5, alpha=0.05):
    """
    Test if TCAV score significantly differs from random, using a binomial test.

    Args:
        tcav_score: Observed TCAV score
        n_samples: Number of samples
        null_hypothesis: Expected score under null (0.5 for random)
        alpha: Significance level

    Returns:
        is_significant, p_value
    """
    k = int(tcav_score * n_samples)  # Number of positive examples
    # Two-tailed binomial test
    p_value = stats.binomtest(k, n_samples, null_hypothesis, alternative='two-sided').pvalue
    is_significant = p_value < alpha
    return is_significant, p_value

def tcav_with_random_baselines(real_tcav_score, random_tcav_scores):
    """
    Compare real TCAV to random concept baselines.

    Args:
        real_tcav_score: TCAV score from real concept
        random_tcav_scores: Array of TCAV scores from random concepts

    Returns:
        is_significant, percentile
    """
    percentile = stats.percentileofscore(random_tcav_scores, real_tcav_score)
    # Significant if outside the central 95% interval of random scores
    is_significant = percentile < 2.5 or percentile > 97.5
    return is_significant, percentile

# Example
np.random.seed(42)

# Simulate: concept has a moderate positive effect (70% positive influence)
n_samples = 200

directional_derivatives = np.random.randn(n_samples)
# Make 70% of the derivatives positive
directional_derivatives[int(n_samples * 0.3):] = np.abs(directional_derivatives[int(n_samples * 0.3):])

tcav_score = compute_tcav_score(directional_derivatives)
print(f"TCAV Score: {tcav_score:.3f}")

is_sig, p_val = tcav_significance_test(tcav_score, n_samples)
print(f"Significantly different from 0.5? {is_sig} (p={p_val:.4f})")

# Compare to random baselines
random_scores = [compute_tcav_score(np.random.randn(n_samples)) for _ in range(100)]
is_sig_rand, pct = tcav_with_random_baselines(tcav_score, random_scores)
print(f"Significant vs random concepts? {is_sig_rand} (percentile={pct:.1f}%)")
```

Let's implement CAVs step by step for an image classification model. We'll use a pre-trained CNN and test whether it uses the concept of 'stripes' to classify zebras.
Implementation Steps: (1) register a hook to capture activations at the target layer, (2) collect concept and random example images, (3) train a linear classifier on their activations to obtain the CAV, (4) compute directional derivatives of the target-class logit along the CAV for test images, and (5) aggregate them into a TCAV score and test its significance.
```python
import torch
import numpy as np
from torchvision import models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

class CAVExtractor:
    """Extract and use Concept Activation Vectors."""

    def __init__(self, model, target_layer_name):
        self.model = model
        self.model.eval()
        self.target_layer_name = target_layer_name
        self.activations = {}

        # Register hook to capture activations at the target layer
        for name, module in model.named_modules():
            if name == target_layer_name:
                module.register_forward_hook(self._save_activation(name))
                break

    def _save_activation(self, name):
        def hook(module, inputs, output):
            # Keep the raw layer output so gradients can flow through it
            self.activations[name] = output
        return hook

    def _pool(self, acts):
        # Global average pool spatial dimensions: [B, C, H, W] -> [B, C]
        return acts.mean(dim=[2, 3]) if acts.dim() == 4 else acts

    def get_activations(self, images):
        """Get pooled activations for a batch of images as a numpy array."""
        with torch.no_grad():
            _ = self.model(images)
        return self._pool(self.activations[self.target_layer_name]).cpu().numpy()

    def train_cav(self, concept_images, random_images):
        """
        Train a CAV by learning to separate concept from random activations.

        Returns:
            cav: Unit vector representing the concept direction
            accuracy: Classification accuracy (validation split)
        """
        concept_acts = self.get_activations(concept_images)
        random_acts = self.get_activations(random_images)

        X = np.vstack([concept_acts, random_acts])
        y = np.array([1] * len(concept_acts) + [0] * len(random_acts))

        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        clf = LogisticRegression(max_iter=1000, C=1.0)
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_val, y_val)

        # The CAV is the classifier's weight vector, normalized to unit length
        cav = clf.coef_[0]
        cav = cav / np.linalg.norm(cav)

        return cav, accuracy

    def compute_directional_derivatives(self, images, cav, target_class):
        """
        Compute S_{C,k,l}(x) = ∇h_{k,l}(f_l(x)) · v_C for each image:
        the gradient of the target-class logit w.r.t. the layer activation,
        projected onto the CAV direction.
        """
        cav_t = torch.tensor(cav, dtype=torch.float32)
        derivatives = []

        for img in images:
            img = img.unsqueeze(0).requires_grad_(True)

            output = self.model(img)  # single forward pass (fills the hook)
            activation = self.activations[self.target_layer_name]
            class_logit = output[0, target_class]

            # Gradient of the class logit w.r.t. the layer activation
            grad = torch.autograd.grad(class_logit, activation)[0]

            # Pool the spatial dimensions of the gradient; this only rescales the
            # dot product by a positive constant, so the sign (which TCAV uses)
            # is unaffected
            if grad.dim() == 4:
                grad = grad.sum(dim=[2, 3])

            derivatives.append(torch.dot(grad[0], cav_t).item())

        return np.array(derivatives)

    def compute_tcav_score(self, test_images, cav, target_class):
        """
        TCAV score: fraction of test images whose target-class logit
        increases when the activation moves in the concept direction.
        """
        derivs = self.compute_directional_derivatives(test_images, cav, target_class)
        return float(np.mean(derivs > 0))

# Example usage (pseudocode - requires actual image data)
"""
# Load model
model = models.inception_v3(pretrained=True)
model.eval()

# Setup CAV extractor
extractor = CAVExtractor(model, 'Mixed_6e')  # Middle layer

# Load concept images (e.g., 50 striped images, 50 random)
concept_images = load_striped_images()  # Tensor [50, 3, 299, 299]
random_images = load_random_images()    # Tensor [50, 3, 299, 299]

# Train CAV
cav, accuracy = extractor.train_cav(concept_images, random_images)
print(f"CAV accuracy: {accuracy:.3f}")

if accuracy < 0.6:
    print("Warning: Low accuracy suggests concept not well-represented at this layer")

# Load test images of zebras
zebra_images = load_zebra_images()  # Tensor [100, 3, 299, 299]
zebra_class = 340  # ImageNet class for zebra

# Compute TCAV score
tcav_score = extractor.compute_tcav_score(zebra_images, cav, zebra_class)
print(f"TCAV score (stripes → zebra): {tcav_score:.3f}")

# Statistical test
from scipy import stats
n = len(zebra_images)
k = int(tcav_score * n)
p_value = stats.binomtest(k, n, 0.5, alternative='greater').pvalue
print(f"p-value: {p_value:.4f}")
"""
```

Google provides an official TCAV implementation at github.com/tensorflow/tcav. It handles many implementation details, including proper gradient computation, multiple CAV training, and statistical testing. For production use, start with this reference implementation rather than building from scratch.
The quality of CAV analysis depends critically on how concepts are defined and represented. Poor concept choices lead to meaningless or misleading results.
Concept Dataset Requirements: enough examples to train a stable linear classifier (typically a few dozen to a few hundred images), diversity so that the only property shared across examples is the target concept, and a comparable set of random (negative) examples drawn from the same general image distribution.
Validation Checks:
CAV Accuracy: If the linear classifier achieves < 60% accuracy, the concept may not be linearly represented at that layer. Try a different layer.
Random Concept Comparison: CAVs trained on random subsets of images should have TCAV scores near 0.5. If your real concept has a similar score, it may not be meaningful.
Concept Purity: Ensure your concept images don't confound with other concepts. Striped images that are all indoor scenes might learn 'indoor' instead of 'stripes'.
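One possible purity diagnostic, sketched below under the assumption that we can also collect examples of the suspected confound (e.g., 'indoor'): train a second CAV for the confound and check how similar the two directions are. This is not part of the original TCAV recipe, just a quick sanity check, and the activations here are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cav(pos_acts, neg_acts):
    """Fit a unit-norm CAV separating positive from negative activations."""
    X = np.vstack([pos_acts, neg_acts])
    y = np.array([1] * len(pos_acts) + [0] * len(neg_acts))
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Placeholder activations; in practice these come from your target layer
striped_acts = rng.normal(loc=1.0, size=(50, 64))
indoor_acts = rng.normal(loc=0.8, size=(50, 64))
random_acts = rng.normal(loc=0.0, size=(50, 64))

cav_striped = fit_cav(striped_acts, random_acts)
cav_indoor = fit_cav(indoor_acts, random_acts)

# Cosine similarity between the two concept directions
cosine_sim = float(cav_striped @ cav_indoor)
print(f"Similarity between 'striped' and 'indoor' CAVs: {cosine_sim:.2f}")
# Values close to 1 suggest the two concepts are entangled at this layer,
# i.e., the 'striped' set may really be encoding 'indoor'.
```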
| Source | Examples | Pros | Cons |
|---|---|---|---|
| BRODEN Dataset | Textures, colors, materials, scenes | Curated, diverse, semantic | May not have specific concepts needed |
| ImageNet Classes | Use specific class as concept | Large, diverse images | Class ≠ concept (zebra ≠ stripes) |
| Custom Collection | Web scraping, manual curation | Exactly what you need | Time-consuming, may lack diversity |
| Synthetic Generation | Generate with specific properties | Controlled, pure concepts | May not match real data distribution |
| Attribute Datasets | CelebA, AwA, CUB attributes | Well-annotated attributes | Domain-specific |
```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def validate_cav_quality(concept_activations, random_activations, n_random_cavs=20):
    """
    Validate CAV quality through multiple checks.

    Returns:
        dict with validation metrics
    """
    # Check 1: CAV classifier accuracy (cross-validated)
    X = np.vstack([concept_activations, random_activations])
    y = np.array([1] * len(concept_activations) + [0] * len(random_activations))

    clf = LogisticRegression(max_iter=1000)
    cv_scores = cross_val_score(clf, X, y, cv=5)
    mean_accuracy = cv_scores.mean()
    std_accuracy = cv_scores.std()

    # Check 2: Compare to random-concept CAVs (random labelings of the same data)
    random_cav_accuracies = []
    for _ in range(n_random_cavs):
        random_y = np.random.permutation(y)
        clf_random = LogisticRegression(max_iter=1000)
        random_scores = cross_val_score(clf_random, X, random_y, cv=5)
        random_cav_accuracies.append(random_scores.mean())

    # One-sided test: are the random-label accuracies significantly BELOW the real accuracy?
    _, p_value = stats.ttest_1samp(random_cav_accuracies, mean_accuracy, alternative='less')
    significantly_better = p_value < 0.05

    # Check 3: Stability across resamples of the concept/random sets
    resample_accuracies = []
    for _ in range(10):
        idx_c = np.random.choice(len(concept_activations), len(concept_activations), replace=True)
        idx_r = np.random.choice(len(random_activations), len(random_activations), replace=True)

        X_resample = np.vstack([concept_activations[idx_c], random_activations[idx_r]])
        y_resample = np.array([1] * len(idx_c) + [0] * len(idx_r))

        clf_resample = LogisticRegression(max_iter=1000)
        clf_resample.fit(X_resample, y_resample)
        resample_accuracies.append(clf_resample.score(X_resample, y_resample))

    stability = 1 - np.std(resample_accuracies) / np.mean(resample_accuracies)

    validation_results = {
        'accuracy_mean': mean_accuracy,
        'accuracy_std': std_accuracy,
        'random_accuracy_mean': np.mean(random_cav_accuracies),
        'significantly_better_than_random': significantly_better,
        'p_value': p_value,
        'stability_score': stability,
        'recommendations': []
    }

    # Generate recommendations
    if mean_accuracy < 0.6:
        validation_results['recommendations'].append(
            "Low accuracy: Concept may not be linearly separable at this layer. "
            "Try a different layer or refine concept examples."
        )
    if not validation_results['significantly_better_than_random']:
        validation_results['recommendations'].append(
            "Not significantly better than random: Concept may not be meaningful "
            "or examples may not be representative."
        )
    if stability < 0.9:
        validation_results['recommendations'].append(
            "Low stability: CAV varies across resamples. Add more concept examples "
            "or ensure greater diversity."
        )
    if len(validation_results['recommendations']) == 0:
        validation_results['recommendations'].append(
            "CAV appears valid. Proceed with TCAV analysis."
        )

    return validation_results

# Example
np.random.seed(42)

# Simulate a good concept (well-separated from random activations)
concept_acts = np.random.randn(50, 512) + 1.0
random_acts = np.random.randn(50, 512)

results = validate_cav_quality(concept_acts, random_acts)

print("CAV Validation Results:")
print("=" * 50)
print(f"Classifier Accuracy: {results['accuracy_mean']:.3f} ± {results['accuracy_std']:.3f}")
print(f"Random CAV Accuracy: {results['random_accuracy_mean']:.3f}")
print(f"Significantly better? {results['significantly_better_than_random']} (p={results['p_value']:.4f})")
print(f"Stability Score: {results['stability_score']:.3f}")
print()
print("Recommendations:")
for rec in results['recommendations']:
    print(f"  • {rec}")
```

Different layers in a neural network capture different levels of abstraction. The same concept may be represented differently (or not at all) across layers.
Layer Hierarchy: early layers encode low-level features such as edges, colors, and simple textures; middle layers encode textures, patterns, and object parts; late layers encode whole objects and class-level semantics.
Matching Concepts to Layers:
A concept like 'striped texture' might be well-represented in early/middle layers, while 'dog breed' is better captured in later layers.
Practical Approach: extract concept and random activations at several candidate layers, train a CAV at each, and use the layer where the concept classifier is most accurate (well above the 0.5 random baseline), as in the code below.
```python
import numpy as np
import matplotlib.pyplot as plt

def analyze_concept_across_layers(concept_data, random_data, layer_names):
    """
    Analyze where a concept is best represented by training CAVs at each layer.

    Args:
        concept_data: dict mapping layer_name -> concept activations
        random_data: dict mapping layer_name -> random activations
        layer_names: List of layers to analyze

    Returns:
        dict with layer-wise metrics
    """
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    results = {}

    for layer in layer_names:
        X = np.vstack([concept_data[layer], random_data[layer]])
        y = np.array([1] * len(concept_data[layer]) + [0] * len(random_data[layer]))

        clf = LogisticRegression(max_iter=1000)
        scores = cross_val_score(clf, X, y, cv=5)

        results[layer] = {
            'accuracy': scores.mean(),
            'std': scores.std(),
            'n_features': concept_data[layer].shape[1]
        }

    return results

# Simulation: Concept (texture) better represented in middle layers
np.random.seed(42)

layers = ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'fc1']
concept_data = {}
random_data = {}

# Simulate: concept has different separability at each layer (peak at conv3-conv4)
layer_separability = [0.55, 0.65, 0.85, 0.90, 0.75, 0.60]

for layer, sep in zip(layers, layer_separability):
    n_features = 256 if 'conv' in layer else 1024
    # Simulate concept activations with varying separability
    offset = (sep - 0.5) * 2  # Convert to mean offset
    concept_data[layer] = np.random.randn(50, n_features) + offset
    random_data[layer] = np.random.randn(50, n_features)

results = analyze_concept_across_layers(concept_data, random_data, layers)

# Find best layer
best_layer = max(results, key=lambda x: results[x]['accuracy'])

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

layer_positions = range(len(layers))
accuracies = [results[l]['accuracy'] for l in layers]
stds = [results[l]['std'] for l in layers]

ax.bar(layer_positions, accuracies, yerr=stds, capsize=5,
       color=['forestgreen' if l == best_layer else 'steelblue' for l in layers])

ax.axhline(y=0.5, color='red', linestyle='--', label='Random baseline')
ax.axhline(y=0.6, color='orange', linestyle='--', label='Minimum threshold')

ax.set_xticks(layer_positions)
ax.set_xticklabels(layers, rotation=45, ha='right')
ax.set_ylabel('CAV Classifier Accuracy')
ax.set_title('Concept Representation Across Layers')
ax.legend()
ax.set_ylim(0.4, 1.0)

plt.tight_layout()
plt.savefig('layer_selection.png', dpi=150)
plt.show()

print(f"\nBest layer for concept: {best_layer} (accuracy: {results[best_layer]['accuracy']:.3f})")
print("\nAll layers:")
for layer in layers:
    print(f"  {layer}: {results[layer]['accuracy']:.3f} ± {results[layer]['std']:.3f}")
```

For standard CNN architectures: textures/colors → early-mid layers; object parts → middle layers; object categories → late layers. In InceptionV3, 'mixed4' through 'mixed7' typically work well for most concepts. In ResNet, 'layer3' and 'layer4' are common choices.
One of the most valuable applications of TCAV is detecting unwanted biases in models. By testing whether protected attributes (gender, race, age) influence predictions that shouldn't depend on them, we can uncover hidden discrimination.
Example: Gender Bias in Occupation Classification
Suppose we have a model that classifies images of people by occupation (doctor, nurse, engineer, etc.). We can test: Does a 'female-presenting' concept increase the 'nurse' or 'secretary' prediction? Does a 'male-presenting' concept increase the 'CEO' or 'engineer' prediction? Does apparent age influence any occupation class?
TCAV scores well above 0.5 (or well below it) indicate the model has learned to use gender as a predictor of occupation—a clear bias.
Bias Detection Framework:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def analyze_bias_with_tcav(tcav_scores, classes, concepts, alpha=0.05):
    """
    Analyze potential biases using TCAV scores.

    Args:
        tcav_scores: Dict of {(concept, class): score}
        classes: List of class names
        concepts: List of protected concept names
        alpha: Significance level

    Returns:
        List of flagged bias records (concept, class, score, severity, ...)
    """
    flagged_biases = []

    for concept in concepts:
        for cls in classes:
            score = tcav_scores.get((concept, cls), 0.5)

            # Test if significantly different from 0.5
            # (assuming n=100 samples for the binomial test)
            n_samples = 100
            k = int(score * n_samples)
            p_value = stats.binomtest(k, n_samples, 0.5, alternative='two-sided').pvalue

            if p_value < alpha:
                # Determine severity from the deviation from 0.5
                deviation = abs(score - 0.5)
                if deviation > 0.3:
                    severity = 'High'
                elif deviation > 0.15:
                    severity = 'Medium'
                else:
                    severity = 'Low'

                direction = 'positive' if score > 0.5 else 'negative'

                flagged_biases.append({
                    'concept': concept,
                    'class': cls,
                    'tcav_score': score,
                    'p_value': p_value,
                    'severity': severity,
                    'direction': direction,
                    'interpretation': f"'{concept}' has {direction} influence on '{cls}' prediction"
                })

    return flagged_biases

# Simulate TCAV scores for an occupation classifier
np.random.seed(42)

classes = ['Doctor', 'Nurse', 'Engineer', 'CEO', 'Teacher', 'Secretary']
concepts = ['Female-presenting', 'Male-presenting', 'Young-appearing', 'Elderly-appearing']

# Simulate scores (embedding known biases for demonstration)
tcav_scores = {}

# Female-presenting concept
for cls in classes:
    if cls == 'Nurse':
        tcav_scores[('Female-presenting', cls)] = 0.78  # Biased!
    elif cls == 'CEO':
        tcav_scores[('Female-presenting', cls)] = 0.28  # Biased!
    elif cls == 'Secretary':
        tcav_scores[('Female-presenting', cls)] = 0.72  # Biased!
    else:
        tcav_scores[('Female-presenting', cls)] = np.random.uniform(0.45, 0.55)

# Male-presenting concept
for cls in classes:
    if cls == 'CEO':
        tcav_scores[('Male-presenting', cls)] = 0.75  # Biased!
    elif cls == 'Engineer':
        tcav_scores[('Male-presenting', cls)] = 0.68  # Biased!
    elif cls == 'Nurse':
        tcav_scores[('Male-presenting', cls)] = 0.25  # Biased!
    else:
        tcav_scores[('Male-presenting', cls)] = np.random.uniform(0.45, 0.55)

# Age concepts (less biased)
for concept in ['Young-appearing', 'Elderly-appearing']:
    for cls in classes:
        tcav_scores[(concept, cls)] = np.random.uniform(0.40, 0.60)

# Analyze
flagged = analyze_bias_with_tcav(tcav_scores, classes, concepts)

print("Bias Analysis Results")
print("=" * 70)
print(f"Found {len(flagged)} potential biases:\n")

for bias in sorted(flagged, key=lambda x: abs(x['tcav_score'] - 0.5), reverse=True):
    print(f"[{bias['severity']}] {bias['concept']} → {bias['class']}")
    print(f"  TCAV Score: {bias['tcav_score']:.3f} (p={bias['p_value']:.4f})")
    print(f"  {bias['interpretation']}\n")

# Visualization: heatmap of TCAV scores
fig, ax = plt.subplots(figsize=(12, 8))

score_matrix = np.zeros((len(concepts), len(classes)))
for i, concept in enumerate(concepts):
    for j, cls in enumerate(classes):
        score_matrix[i, j] = tcav_scores.get((concept, cls), 0.5)

im = ax.imshow(score_matrix, cmap='RdBu_r', vmin=0, vmax=1)

ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes, rotation=45, ha='right')
ax.set_yticks(range(len(concepts)))
ax.set_yticklabels(concepts)

# Add score annotations
for i in range(len(concepts)):
    for j in range(len(classes)):
        score = score_matrix[i, j]
        color = 'white' if abs(score - 0.5) > 0.2 else 'black'
        ax.text(j, i, f'{score:.2f}', ha='center', va='center', color=color, fontsize=9)

ax.set_title('TCAV Scores: Protected Concepts vs Occupation Classes\n'
             '(0.5 = no association, >0.5 = positive, <0.5 = negative)')
plt.colorbar(im, ax=ax, label='TCAV Score')
plt.tight_layout()
plt.savefig('bias_heatmap.png', dpi=150)
plt.show()
```

A high TCAV score doesn't automatically mean problematic bias—it might reflect real-world correlations in training data that are legitimate for the task. The question is whether the association is appropriate for the decision being made. Gender influencing 'requires_restroom_signage' might be legitimate; gender influencing 'should_get_loan' is not.
The original TCAV requires manually defining concepts with curated image sets. Automatic Concept Explanations (ACE) extends this by automatically discovering salient concepts in a model's latent space.
ACE Method (Ghorbani et al., 2019): (1) segment images of the target class into patches at multiple resolutions (e.g., superpixels); (2) pass the patches through the network, collect their activations at a chosen layer, and cluster them, so that each cluster becomes a candidate concept; (3) compute TCAV scores for each cluster to rank the discovered concepts by importance.
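The segmentation step is the piece not covered by the simplified discovery code later in this section; a rough sketch is below, assuming scikit-image is available (segment counts, the grey fill value, and resizing details vary across implementations):

```python
import numpy as np
from skimage.segmentation import slic

def extract_segments(image, n_segments=15):
    """Split an RGB image (H, W, 3, floats in [0, 1]) into superpixel patches."""
    seg_map = slic(image, n_segments=n_segments, compactness=20, start_label=0)
    patches = []
    for seg_id in np.unique(seg_map):
        mask = seg_map == seg_id
        patch = image.copy()
        patch[~mask] = 0.5  # fill everything outside the segment with a neutral grey
        patches.append(patch)
    return patches

# Toy image: random noise stands in for a real class image
image = np.random.rand(128, 128, 3)
patches = extract_segments(image)
print(f"{len(patches)} candidate concept patches")
# In ACE, each patch is resized to the network's input size, its activations
# are computed, patches from many images are clustered in activation space,
# and each cluster is then scored with TCAV.
```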
Advantages: no manual concept curation is needed, the discovered concepts reflect what the model actually encodes, and the approach can surface concepts a human analyst might not think to test.
Other Extensions:
| Method | Concept Source | Key Innovation | Best For |
|---|---|---|---|
| TCAV | Human-provided examples | Original quantitative concept testing | Hypothesis testing about specific concepts |
| ACE | Automatic clustering | No manual curation needed | Exploring what concepts model uses |
| Net2Vec | Word embeddings | Aligns visual and semantic spaces | Bridging vision and language |
| Concept Bottleneck | Supervision on concepts | End-to-end concept learning | When concept labels available |
| CRAFT | Recursive segmentation | Multi-scale concept discovery | Detailed concept decomposition |
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def discover_concepts_simplified(activations, n_concepts=10, random_state=42):
    """
    Simplified automatic concept discovery.

    In practice, this would operate on image segments/patches,
    not full-image activations. This is a demonstration.

    Args:
        activations: [n_samples, n_features] activation matrix
        n_concepts: Number of concepts to discover

    Returns:
        concept_centers: [n_concepts, n_features] cluster centers
        labels: [n_samples] cluster assignment for each sample
    """
    # Optional: reduce dimensionality for clustering
    pca = PCA(n_components=min(50, activations.shape[1]))
    activations_reduced = pca.fit_transform(activations)

    # Cluster activations
    kmeans = KMeans(n_clusters=n_concepts, random_state=random_state, n_init=10)
    labels = kmeans.fit_predict(activations_reduced)

    # Get cluster centers in the original activation space
    concept_centers = []
    for i in range(n_concepts):
        cluster_mask = labels == i
        if cluster_mask.sum() > 0:
            center = activations[cluster_mask].mean(axis=0)
            concept_centers.append(center)

    return np.array(concept_centers), labels

def rank_concepts_by_importance(concept_centers, class_gradient):
    """
    Rank discovered concepts by their influence on a target class.

    Args:
        concept_centers: [n_concepts, n_features]
        class_gradient: [n_features] gradient of class logit w.r.t. activations

    Returns:
        Sorted list of (concept_idx, importance_score)
    """
    importances = []

    for i, center in enumerate(concept_centers):
        # Normalize to get a direction (CAV-like)
        direction = center / (np.linalg.norm(center) + 1e-8)

        # Directional derivative along the concept direction = importance
        importance = np.dot(class_gradient, direction)
        importances.append((i, importance))

    # Sort by absolute importance
    return sorted(importances, key=lambda x: abs(x[1]), reverse=True)

# Simulation
np.random.seed(42)

# Simulate activations with hidden structure
n_samples = 500
n_features = 256

# Create ground-truth concept directions embedded in the activations
concept_directions = np.random.randn(5, n_features)
concept_directions = concept_directions / np.linalg.norm(concept_directions, axis=1, keepdims=True)

activations = np.random.randn(n_samples, n_features) * 0.1
for i in range(n_samples):
    # Add a random combination of the ground-truth concepts
    for direction in concept_directions:
        weight = np.random.rand() * 2
        activations[i] += weight * direction

# Discover concepts
discovered_centers, cluster_labels = discover_concepts_simplified(activations, n_concepts=8)

print(f"Discovered {len(discovered_centers)} concepts")
print(f"Cluster sizes: {[np.sum(cluster_labels == i) for i in range(len(discovered_centers))]}")

# Simulate a class gradient (some concepts important, others not)
class_gradient = np.zeros(n_features)
class_gradient += 0.5 * concept_directions[0]  # Concept 0 is important
class_gradient += 0.3 * concept_directions[2]  # Concept 2 is somewhat important
class_gradient += np.random.randn(n_features) * 0.1  # Noise

# Rank discovered concepts
rankings = rank_concepts_by_importance(discovered_centers, class_gradient)

print("\nConcept Rankings by Class Importance:")
for rank, (concept_idx, importance) in enumerate(rankings):
    print(f"  {rank+1}. Concept {concept_idx}: importance = {importance:.4f}")
```

While CAVs provide valuable insights, they have important limitations that practitioners must understand:
Fundamental Limitations:
Linearity Assumption: CAVs assume concepts are linear directions in activation space. Concepts with non-linear representations won't be captured.
Concept Leakage: Concept examples may encode correlated concepts. 'Striped' images might also be 'outdoor scenes', confounding the CAV.
Layer Dependence: CAVs at different layers may give different importance rankings. Which layer is 'correct'?
Human Concept Bias: We can only test concepts we think to define. Models might use different concepts than humans assume.
Statistical Sensitivity: With few concept examples or test images, TCAV scores can be noisy and unreliable.
A low TCAV score (near 0.5) could mean: (1) The model doesn't use this concept, (2) The concept isn't linearly represented at this layer, (3) Concept examples are poor, or (4) Positive and negative effects cancel out. Always validate with multiple approaches before concluding a concept is irrelevant.
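One way to distinguish case (1) from case (2) above is to compare a linear probe against a small non-linear probe on the same concept-vs-random activations; a large gap suggests the concept is present but not as a single linear direction. The sketch below uses synthetic activations purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic activations: the 'concept' is encoded in the magnitude of unit 0
# (values near ±2), a pattern a linear decision boundary cannot exploit.
concept_acts = rng.normal(size=(100, 64))
concept_acts[:, 0] = rng.choice([-2.0, 2.0], size=100) + rng.normal(scale=0.3, size=100)
random_acts = rng.normal(size=(100, 64))
random_acts[:, 0] = rng.normal(scale=0.3, size=100)

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)

linear_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
mlp_acc = cross_val_score(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0), X, y, cv=5
).mean()

print(f"Linear probe accuracy: {linear_acc:.2f}")
print(f"Non-linear probe accuracy: {mlp_acc:.2f}")
if mlp_acc - linear_acc > 0.1:
    print("Concept may be non-linearly encoded at this layer; a CAV would miss it.")
```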
Concept Activation Vectors provide a powerful framework for understanding neural networks in human-interpretable terms. Here's the essential knowledge:
- CAVs represent human-interpretable concepts as directions in a layer's activation space, obtained by training a linear classifier to separate concept activations from random activations.
- TCAV quantifies concept importance as the fraction of examples whose class prediction increases when the activation moves in the concept direction, backed by statistical tests against random concepts.
- Validate CAVs (classifier accuracy, random baselines, stability), choose layers where the concept is actually represented, and watch for confounded concept sets.
- CAVs are useful for bias detection, but keep the linearity assumption and other limitations in mind before drawing conclusions.

Module Summary:
Throughout this module, we've explored model-specific interpretability techniques: coefficients and feature effects for linear models, decision rules and feature importances for tree ensembles, attention patterns for transformers, and saliency maps and concept activation vectors for deep neural networks.
Together, these methods provide a comprehensive toolkit for understanding what models have learned and why they make specific predictions. The choice of method depends on the model architecture, the type of question being asked, and the audience for the explanation.
You have completed Module 3: Model-Specific Interpretability. You now have deep understanding of how to interpret predictions from linear models, tree ensembles, transformers, and deep neural networks using coefficients, decision rules, attention patterns, saliency maps, and concept activation vectors. This foundational knowledge prepares you for the fairness and practical interpretability topics in subsequent modules.