When a neural network classifies an image as 'zebra', we naturally ask: Is it using stripes? Is it recognizing the body shape? Or is it picking up on something else entirely? Gradient-based saliency tells us which pixels matter, but not in terms of human-understandable concepts like 'stripes' or 'savanna'.
Concept Activation Vectors (CAVs) bridge this gap. Instead of explaining predictions in terms of low-level inputs (pixels, tokens), CAVs test whether models use high-level, human-interpretable concepts. We can ask and answer questions like: Does the model rely on 'stripes' when predicting 'zebra'? Does a gender-related concept influence an occupation classifier's predictions?
This shifts interpretation from where the model looks to what concepts the model uses—a profoundly more actionable form of understanding.
This page covers: (1) The intuition and mathematics behind CAVs, (2) Testing with CAVs (TCAV) for quantitative concept importance, (3) Practical implementation with image classifiers, (4) Choosing and validating concepts, (5) Relative concept importance and sensitivity analysis, (6) CAVs for bias detection, (7) Limitations and extensions, and (8) Best practices for concept-based interpretation.
Neural networks learn internal representations—activations at each layer encode information about the input. The key insight of CAVs (Kim et al., 2018) is that human-interpretable concepts have directions in activation space.
The Core Idea: Collect a set of examples of a concept (e.g., images of striped patterns) and a set of random counterexamples, record the network's activations at a chosen layer for both sets, and train a linear classifier to separate them. The CAV is the unit vector normal to the classifier's decision boundary, pointing toward the concept examples.
This CAV represents the 'direction of stripedness' in that layer's activation space. Moving in that direction corresponds to adding more stripe-like qualities as understood by the network.
Why Linear Classifiers?
The linearity assumption is philosophically motivated: if we need a complex non-linear classifier to separate 'striped' from 'not striped' activations, then 'stripes' isn't a meaningful concept in that layer's representation. Concepts should be linearly separable if the network has learned them.
Think of the layer's activation space as a high-dimensional room. Each input creates a point in this room (its activation vector). If 'striped' is a meaningful concept, all striped inputs cluster in a certain region, and the CAV points toward that region. Testing if the model 'uses stripes' means asking if moving toward that cluster affects the model's output.
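To make this concrete, here is a minimal sketch of the construction, assuming we already have activation matrices for concept and random images (the shapes, offsets, and variable names below are illustrative placeholders, not outputs of a real network):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Placeholder activations from some layer: 50 concept (e.g., striped) images
# and 50 random images, 128 units each
concept_acts = rng.normal(loc=1.0, size=(50, 128))
random_acts = rng.normal(loc=0.0, size=(50, 128))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))

# Linear classifier separating concept from random activations
clf = LogisticRegression(max_iter=1000).fit(X, y)

# The CAV is the classifier's weight vector, normalized: one direction in
# the layer's activation space pointing toward the concept examples
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])
print(cav.shape)  # (128,)
```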
Testing with Concept Activation Vectors (TCAV) provides a quantitative measure of concept importance. The key metric is the TCAV score, which measures what fraction of inputs have predictions that increase when moving in the concept direction.
Formal Definition:
Let $f_l(x)$ be the activation vector of layer $l$ for input $x$, $h_{k,l}$ the function mapping layer-$l$ activations to the logit of class $k$, $v_C^l$ the unit CAV for concept $C$ at layer $l$, and $X_k$ the set of inputs belonging to class $k$.
The directional derivative of the prediction w.r.t. the concept direction:
$$S_{C,k,l}(x) = \nabla h_{k,l}(f_l(x)) \cdot v_C^l$$
This tells us: if we move the activation in the concept direction, does the prediction for class $k$ increase or decrease?
TCAV Score:
$$\text{TCAV}_{C,k,l} = \frac{|\{x \in X_k : S_{C,k,l}(x) > 0\}|}{|X_k|}$$
This is the fraction of class-$k$ examples where the concept positively influences the prediction. A TCAV score of 0.7 means 70% of examples have predictions that would increase if we added more of concept $C$.
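As a toy illustration of these two formulas (all numbers below are made up), the directional derivative is just a dot product between the gradient and the CAV, and the TCAV score is the fraction of positive dot products:

```python
import numpy as np

rng = np.random.default_rng(0)

# A unit concept direction v_C^l in an 8-dimensional activation space
cav = rng.normal(size=8)
cav /= np.linalg.norm(cav)

# Stand-in gradients ∇h_{k,l}(f_l(x)) for 5 class-k examples (one per row)
grads = rng.normal(size=(5, 8))

# Directional derivatives S_{C,k,l}(x): one dot product per example
directional_derivs = grads @ cav

# TCAV score: fraction of examples with positive sensitivity to the concept
tcav_score = np.mean(directional_derivs > 0)
print(directional_derivs.round(3), tcav_score)
```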
| TCAV Score | Interpretation | Implication |
|---|---|---|
| ~0.5 | Random / No relationship | Concept not consistently used for this class |
| > 0.6 | Positive association | Concept tends to increase class prediction |
| > 0.8 | Strong positive association | Concept is important for the class |
| < 0.4 | Negative association | Concept tends to decrease class prediction |
| < 0.2 | Strong negative association | Concept actively opposes the class |
Statistical Significance Testing:
A critical aspect of TCAV is testing whether the measured score is statistically different from chance (0.5). The paper introduces two approaches:
Random CAVs: Train multiple CAVs using random concept sets. If the real CAV's score is significantly different from random CAVs, the concept is meaningful.
Bootstrap Testing: Resample concept examples and retrain CAVs. Compute confidence intervals for the TCAV score.
This statistical rigor distinguishes TCAV from simple correlation measures.
```python
import numpy as np
from scipy import stats

def compute_tcav_score(directional_derivatives):
    """
    Compute TCAV score from directional derivatives.

    Args:
        directional_derivatives: Array of S_{C,k,l}(x) for all x in class k

    Returns:
        TCAV score (fraction with positive derivative)
    """
    positive_count = np.sum(directional_derivatives > 0)
    total_count = len(directional_derivatives)
    return positive_count / total_count

def tcav_significance_test(tcav_score, n_samples, null_hypothesis=0.5, alpha=0.05):
    """
    Test if TCAV score significantly differs from random, using a binomial test.

    Args:
        tcav_score: Observed TCAV score
        n_samples: Number of samples
        null_hypothesis: Expected score under null (0.5 for random)
        alpha: Significance level

    Returns:
        is_significant, p_value
    """
    k = int(tcav_score * n_samples)  # Number of positive examples
    # Two-tailed binomial test
    p_value = stats.binomtest(k, n_samples, null_hypothesis, alternative='two-sided').pvalue
    is_significant = p_value < alpha
    return is_significant, p_value

def tcav_with_random_baselines(real_tcav_score, random_tcav_scores):
    """
    Compare real TCAV to random concept baselines.

    Args:
        real_tcav_score: TCAV score from real concept
        random_tcav_scores: Array of TCAV scores from random concepts

    Returns:
        is_significant, percentile
    """
    percentile = stats.percentileofscore(random_tcav_scores, real_tcav_score)
    # Significant if outside the central 95% interval of random scores
    is_significant = percentile < 2.5 or percentile > 97.5
    return is_significant, percentile

# Example
np.random.seed(42)

# Simulate: concept has a moderate positive effect (70% positive influence)
n_samples = 200

directional_derivatives = np.random.randn(n_samples)
# Make 70% of the derivatives positive
directional_derivatives[int(n_samples * 0.3):] = np.abs(directional_derivatives[int(n_samples * 0.3):])

tcav_score = compute_tcav_score(directional_derivatives)
print(f"TCAV Score: {tcav_score:.3f}")

is_sig, p_val = tcav_significance_test(tcav_score, n_samples)
print(f"Significantly different from 0.5? {is_sig} (p={p_val:.4f})")

# Compare to random baselines
random_scores = [compute_tcav_score(np.random.randn(n_samples)) for _ in range(100)]
is_sig_rand, pct = tcav_with_random_baselines(tcav_score, random_scores)
print(f"Significant vs random concepts? {is_sig_rand} (percentile={pct:.1f}%)")
```

Let's implement CAVs step by step for an image classification model. We'll use a pre-trained CNN and test whether it uses the concept of 'stripes' to classify zebras.
Implementation Steps: (1) register a hook to capture activations at the target layer, (2) collect concept and random example images, (3) train a linear classifier on their activations to obtain the CAV, (4) compute directional derivatives of the target-class logit along the CAV for test images, and (5) aggregate them into a TCAV score and test its significance.
```python
import torch
import numpy as np
from torchvision import models
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

class CAVExtractor:
    """Extract and use Concept Activation Vectors."""

    def __init__(self, model, target_layer_name):
        self.model = model
        self.model.eval()
        self.target_layer_name = target_layer_name
        self.activations = {}

        # Register hook to capture activations at the target layer
        for name, module in model.named_modules():
            if name == target_layer_name:
                module.register_forward_hook(self._save_activation(name))
                break

    def _save_activation(self, name):
        def hook(module, inputs, output):
            # Keep the raw layer output so gradients can flow through it
            self.activations[name] = output
        return hook

    def _pool(self, acts):
        # Global average pool spatial dimensions: [B, C, H, W] -> [B, C]
        return acts.mean(dim=[2, 3]) if acts.dim() == 4 else acts

    def get_activations(self, images):
        """Get pooled activations for a batch of images as a numpy array."""
        with torch.no_grad():
            _ = self.model(images)
        return self._pool(self.activations[self.target_layer_name]).cpu().numpy()

    def train_cav(self, concept_images, random_images):
        """
        Train a CAV by learning to separate concept from random activations.

        Returns:
            cav: Unit vector representing the concept direction
            accuracy: Classification accuracy (validation split)
        """
        concept_acts = self.get_activations(concept_images)
        random_acts = self.get_activations(random_images)

        X = np.vstack([concept_acts, random_acts])
        y = np.array([1] * len(concept_acts) + [0] * len(random_acts))

        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.2, random_state=42
        )

        clf = LogisticRegression(max_iter=1000, C=1.0)
        clf.fit(X_train, y_train)
        accuracy = clf.score(X_val, y_val)

        # The CAV is the classifier's weight vector, normalized to unit length
        cav = clf.coef_[0]
        cav = cav / np.linalg.norm(cav)

        return cav, accuracy

    def compute_directional_derivatives(self, images, cav, target_class):
        """
        Compute S_{C,k,l}(x) = ∇h_{k,l}(f_l(x)) · v_C for each image:
        the gradient of the target-class logit w.r.t. the layer activation,
        projected onto the CAV direction.
        """
        cav_t = torch.tensor(cav, dtype=torch.float32)
        derivatives = []

        for img in images:
            img = img.unsqueeze(0).requires_grad_(True)

            output = self.model(img)  # single forward pass (fills the hook)
            activation = self.activations[self.target_layer_name]
            class_logit = output[0, target_class]

            # Gradient of the class logit w.r.t. the layer activation
            grad = torch.autograd.grad(class_logit, activation)[0]

            # Pool the spatial dimensions of the gradient; this only rescales the
            # dot product by a positive constant, so the sign (which TCAV uses)
            # is unaffected
            if grad.dim() == 4:
                grad = grad.sum(dim=[2, 3])

            derivatives.append(torch.dot(grad[0], cav_t).item())

        return np.array(derivatives)

    def compute_tcav_score(self, test_images, cav, target_class):
        """
        TCAV score: fraction of test images whose target-class logit
        increases when the activation moves in the concept direction.
        """
        derivs = self.compute_directional_derivatives(test_images, cav, target_class)
        return float(np.mean(derivs > 0))

# Example usage (pseudocode - requires actual image data)
"""
# Load model
model = models.inception_v3(pretrained=True)
model.eval()

# Setup CAV extractor
extractor = CAVExtractor(model, 'Mixed_6e')  # Middle layer

# Load concept images (e.g., 50 striped images, 50 random)
concept_images = load_striped_images()  # Tensor [50, 3, 299, 299]
random_images = load_random_images()    # Tensor [50, 3, 299, 299]

# Train CAV
cav, accuracy = extractor.train_cav(concept_images, random_images)
print(f"CAV accuracy: {accuracy:.3f}")

if accuracy < 0.6:
    print("Warning: Low accuracy suggests concept not well-represented at this layer")

# Load test images of zebras
zebra_images = load_zebra_images()  # Tensor [100, 3, 299, 299]
zebra_class = 340  # ImageNet class for zebra

# Compute TCAV score
tcav_score = extractor.compute_tcav_score(zebra_images, cav, zebra_class)
print(f"TCAV score (stripes → zebra): {tcav_score:.3f}")

# Statistical test
from scipy import stats
n = len(zebra_images)
k = int(tcav_score * n)
p_value = stats.binomtest(k, n, 0.5, alternative='greater').pvalue
print(f"p-value: {p_value:.4f}")
"""
```

Google provides an official TCAV implementation at github.com/tensorflow/tcav. It handles many implementation details, including proper gradient computation, multiple CAV training, and statistical testing. For production use, start with this reference implementation rather than building from scratch.
The quality of CAV analysis depends critically on how concepts are defined and represented. Poor concept choices lead to meaningless or misleading results.
Concept Dataset Requirements: enough examples to train a stable linear classifier (typically a few dozen to a few hundred images), diversity so that the only property shared across examples is the target concept, and a comparable set of random (negative) examples drawn from the same general image distribution.
Validation Checks:
CAV Accuracy: If the linear classifier achieves < 60% accuracy, the concept may not be linearly represented at that layer. Try a different layer.
Random Concept Comparison: CAVs trained on random subsets of images should have TCAV scores near 0.5. If your real concept has a similar score, it may not be meaningful.
Concept Purity: Ensure your concept images don't confound with other concepts. Striped images that are all indoor scenes might learn 'indoor' instead of 'stripes'.
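One possible purity diagnostic, sketched below under the assumption that we can also collect examples of the suspected confound (e.g., 'indoor'): train a second CAV for the confound and check how similar the two directions are. This is not part of the original TCAV recipe, just a quick sanity check, and the activations here are synthetic placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_cav(pos_acts, neg_acts):
    """Fit a unit-norm CAV separating positive from negative activations."""
    X = np.vstack([pos_acts, neg_acts])
    y = np.array([1] * len(pos_acts) + [0] * len(neg_acts))
    w = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Placeholder activations; in practice these come from your target layer
striped_acts = rng.normal(loc=1.0, size=(50, 64))
indoor_acts = rng.normal(loc=0.8, size=(50, 64))
random_acts = rng.normal(loc=0.0, size=(50, 64))

cav_striped = fit_cav(striped_acts, random_acts)
cav_indoor = fit_cav(indoor_acts, random_acts)

# Cosine similarity between the two concept directions
cosine_sim = float(cav_striped @ cav_indoor)
print(f"Similarity between 'striped' and 'indoor' CAVs: {cosine_sim:.2f}")
# Values close to 1 suggest the two concepts are entangled at this layer,
# i.e., the 'striped' set may really be encoding 'indoor'.
```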
| Source | Examples | Pros | Cons |
|---|---|---|---|
| BRODEN Dataset | Textures, colors, materials, scenes | Curated, diverse, semantic | May not have specific concepts needed |
| ImageNet Classes | Use specific class as concept | Large, diverse images | Class ≠ concept (zebra ≠ stripes) |
| Custom Collection | Web scraping, manual curation | Exactly what you need | Time-consuming, may lack diversity |
| Synthetic Generation | Generate with specific properties | Controlled, pure concepts | May not match real data distribution |
| Attribute Datasets | CelebA, AwA, CUB attributes | Well-annotated attributes | Domain-specific |
```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def validate_cav_quality(concept_activations, random_activations, n_random_cavs=20):
    """
    Validate CAV quality through multiple checks.

    Returns:
        dict with validation metrics
    """
    # Check 1: CAV classifier accuracy (cross-validated)
    X = np.vstack([concept_activations, random_activations])
    y = np.array([1] * len(concept_activations) + [0] * len(random_activations))

    clf = LogisticRegression(max_iter=1000)
    cv_scores = cross_val_score(clf, X, y, cv=5)
    mean_accuracy = cv_scores.mean()
    std_accuracy = cv_scores.std()

    # Check 2: Compare to random-concept CAVs (random labelings of the same data)
    random_cav_accuracies = []
    for _ in range(n_random_cavs):
        random_y = np.random.permutation(y)
        clf_random = LogisticRegression(max_iter=1000)
        random_scores = cross_val_score(clf_random, X, random_y, cv=5)
        random_cav_accuracies.append(random_scores.mean())

    # One-sided test: are the random-label accuracies significantly BELOW the real accuracy?
    _, p_value = stats.ttest_1samp(random_cav_accuracies, mean_accuracy, alternative='less')
    significantly_better = p_value < 0.05

    # Check 3: Stability across resamples of the concept/random sets
    resample_accuracies = []
    for _ in range(10):
        idx_c = np.random.choice(len(concept_activations), len(concept_activations), replace=True)
        idx_r = np.random.choice(len(random_activations), len(random_activations), replace=True)

        X_resample = np.vstack([concept_activations[idx_c], random_activations[idx_r]])
        y_resample = np.array([1] * len(idx_c) + [0] * len(idx_r))

        clf_resample = LogisticRegression(max_iter=1000)
        clf_resample.fit(X_resample, y_resample)
        resample_accuracies.append(clf_resample.score(X_resample, y_resample))

    stability = 1 - np.std(resample_accuracies) / np.mean(resample_accuracies)

    validation_results = {
        'accuracy_mean': mean_accuracy,
        'accuracy_std': std_accuracy,
        'random_accuracy_mean': np.mean(random_cav_accuracies),
        'significantly_better_than_random': significantly_better,
        'p_value': p_value,
        'stability_score': stability,
        'recommendations': []
    }

    # Generate recommendations
    if mean_accuracy < 0.6:
        validation_results['recommendations'].append(
            "Low accuracy: Concept may not be linearly separable at this layer. "
            "Try a different layer or refine concept examples."
        )
    if not validation_results['significantly_better_than_random']:
        validation_results['recommendations'].append(
            "Not significantly better than random: Concept may not be meaningful "
            "or examples may not be representative."
        )
    if stability < 0.9:
        validation_results['recommendations'].append(
            "Low stability: CAV varies across resamples. Add more concept examples "
            "or ensure greater diversity."
        )
    if len(validation_results['recommendations']) == 0:
        validation_results['recommendations'].append(
            "CAV appears valid. Proceed with TCAV analysis."
        )

    return validation_results

# Example
np.random.seed(42)

# Simulate a good concept (well-separated from random activations)
concept_acts = np.random.randn(50, 512) + 1.0
random_acts = np.random.randn(50, 512)

results = validate_cav_quality(concept_acts, random_acts)

print("CAV Validation Results:")
print("=" * 50)
print(f"Classifier Accuracy: {results['accuracy_mean']:.3f} ± {results['accuracy_std']:.3f}")
print(f"Random CAV Accuracy: {results['random_accuracy_mean']:.3f}")
print(f"Significantly better? {results['significantly_better_than_random']} (p={results['p_value']:.4f})")
print(f"Stability Score: {results['stability_score']:.3f}")
print()
print("Recommendations:")
for rec in results['recommendations']:
    print(f"  • {rec}")
```

Different layers in a neural network capture different levels of abstraction. The same concept may be represented differently (or not at all) across layers.
Layer Hierarchy: early layers encode low-level features such as edges, colors, and simple textures; middle layers encode textures, patterns, and object parts; late layers encode whole objects and class-level semantics.
Matching Concepts to Layers:
A concept like 'striped texture' might be well-represented in early/middle layers, while 'dog breed' is better captured in later layers.
Practical Approach: extract concept and random activations at several candidate layers, train a CAV at each, and use the layer where the concept classifier is most accurate (well above the 0.5 random baseline), as in the code below.
```python
import numpy as np
import matplotlib.pyplot as plt

def analyze_concept_across_layers(concept_data, random_data, layer_names):
    """
    Analyze where a concept is best represented by training CAVs at each layer.

    Args:
        concept_data: dict mapping layer_name -> concept activations
        random_data: dict mapping layer_name -> random activations
        layer_names: List of layers to analyze

    Returns:
        dict with layer-wise metrics
    """
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    results = {}

    for layer in layer_names:
        X = np.vstack([concept_data[layer], random_data[layer]])
        y = np.array([1] * len(concept_data[layer]) + [0] * len(random_data[layer]))

        clf = LogisticRegression(max_iter=1000)
        scores = cross_val_score(clf, X, y, cv=5)

        results[layer] = {
            'accuracy': scores.mean(),
            'std': scores.std(),
            'n_features': concept_data[layer].shape[1]
        }

    return results

# Simulation: Concept (texture) better represented in middle layers
np.random.seed(42)

layers = ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'fc1']
concept_data = {}
random_data = {}

# Simulate: concept has different separability at each layer (peak at conv3-conv4)
layer_separability = [0.55, 0.65, 0.85, 0.90, 0.75, 0.60]

for layer, sep in zip(layers, layer_separability):
    n_features = 256 if 'conv' in layer else 1024
    # Simulate concept activations with varying separability
    offset = (sep - 0.5) * 2  # Convert to mean offset
    concept_data[layer] = np.random.randn(50, n_features) + offset
    random_data[layer] = np.random.randn(50, n_features)

results = analyze_concept_across_layers(concept_data, random_data, layers)

# Find best layer
best_layer = max(results, key=lambda x: results[x]['accuracy'])

# Visualization
fig, ax = plt.subplots(figsize=(10, 6))

layer_positions = range(len(layers))
accuracies = [results[l]['accuracy'] for l in layers]
stds = [results[l]['std'] for l in layers]

ax.bar(layer_positions, accuracies, yerr=stds, capsize=5,
       color=['forestgreen' if l == best_layer else 'steelblue' for l in layers])

ax.axhline(y=0.5, color='red', linestyle='--', label='Random baseline')
ax.axhline(y=0.6, color='orange', linestyle='--', label='Minimum threshold')

ax.set_xticks(layer_positions)
ax.set_xticklabels(layers, rotation=45, ha='right')
ax.set_ylabel('CAV Classifier Accuracy')
ax.set_title('Concept Representation Across Layers')
ax.legend()
ax.set_ylim(0.4, 1.0)

plt.tight_layout()
plt.savefig('layer_selection.png', dpi=150)
plt.show()

print(f"\nBest layer for concept: {best_layer} (accuracy: {results[best_layer]['accuracy']:.3f})")
print("\nAll layers:")
for layer in layers:
    print(f"  {layer}: {results[layer]['accuracy']:.3f} ± {results[layer]['std']:.3f}")
```

For standard CNN architectures: textures/colors → early-mid layers; object parts → middle layers; object categories → late layers. In InceptionV3, 'mixed4' through 'mixed7' typically work well for most concepts. In ResNet, 'layer3' and 'layer4' are common choices.
One of the most valuable applications of TCAV is detecting unwanted biases in models. By testing whether protected attributes (gender, race, age) influence predictions that shouldn't depend on them, we can uncover hidden discrimination.
Example: Gender Bias in Occupation Classification
Suppose we have a model that classifies images of people by occupation (doctor, nurse, engineer, etc.). We can test: Does a 'female-presenting' concept increase the 'nurse' or 'secretary' prediction? Does a 'male-presenting' concept increase the 'CEO' or 'engineer' prediction? Does apparent age influence any occupation class?
TCAV scores well above 0.5 (or well below it) indicate the model has learned to use gender as a predictor of occupation—a clear bias.
Bias Detection Framework:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def analyze_bias_with_tcav(tcav_scores, classes, concepts, alpha=0.05):
    """
    Analyze potential biases using TCAV scores.

    Args:
        tcav_scores: Dict of {(concept, class): score}
        classes: List of class names
        concepts: List of protected concept names
        alpha: Significance level

    Returns:
        List of flagged bias records (concept, class, score, severity, ...)
    """
    flagged_biases = []

    for concept in concepts:
        for cls in classes:
            score = tcav_scores.get((concept, cls), 0.5)

            # Test if significantly different from 0.5
            # (assuming n=100 samples for the binomial test)
            n_samples = 100
            k = int(score * n_samples)
            p_value = stats.binomtest(k, n_samples, 0.5, alternative='two-sided').pvalue

            if p_value < alpha:
                # Determine severity from the deviation from 0.5
                deviation = abs(score - 0.5)
                if deviation > 0.3:
                    severity = 'High'
                elif deviation > 0.15:
                    severity = 'Medium'
                else:
                    severity = 'Low'

                direction = 'positive' if score > 0.5 else 'negative'

                flagged_biases.append({
                    'concept': concept,
                    'class': cls,
                    'tcav_score': score,
                    'p_value': p_value,
                    'severity': severity,
                    'direction': direction,
                    'interpretation': f"'{concept}' has {direction} influence on '{cls}' prediction"
                })

    return flagged_biases

# Simulate TCAV scores for an occupation classifier
np.random.seed(42)

classes = ['Doctor', 'Nurse', 'Engineer', 'CEO', 'Teacher', 'Secretary']
concepts = ['Female-presenting', 'Male-presenting', 'Young-appearing', 'Elderly-appearing']

# Simulate scores (embedding known biases for demonstration)
tcav_scores = {}

# Female-presenting concept
for cls in classes:
    if cls == 'Nurse':
        tcav_scores[('Female-presenting', cls)] = 0.78  # Biased!
    elif cls == 'CEO':
        tcav_scores[('Female-presenting', cls)] = 0.28  # Biased!
    elif cls == 'Secretary':
        tcav_scores[('Female-presenting', cls)] = 0.72  # Biased!
    else:
        tcav_scores[('Female-presenting', cls)] = np.random.uniform(0.45, 0.55)

# Male-presenting concept
for cls in classes:
    if cls == 'CEO':
        tcav_scores[('Male-presenting', cls)] = 0.75  # Biased!
    elif cls == 'Engineer':
        tcav_scores[('Male-presenting', cls)] = 0.68  # Biased!
    elif cls == 'Nurse':
        tcav_scores[('Male-presenting', cls)] = 0.25  # Biased!
    else:
        tcav_scores[('Male-presenting', cls)] = np.random.uniform(0.45, 0.55)

# Age concepts (less biased)
for concept in ['Young-appearing', 'Elderly-appearing']:
    for cls in classes:
        tcav_scores[(concept, cls)] = np.random.uniform(0.40, 0.60)

# Analyze
flagged = analyze_bias_with_tcav(tcav_scores, classes, concepts)

print("Bias Analysis Results")
print("=" * 70)
print(f"Found {len(flagged)} potential biases:\n")

for bias in sorted(flagged, key=lambda x: abs(x['tcav_score'] - 0.5), reverse=True):
    print(f"[{bias['severity']}] {bias['concept']} → {bias['class']}")
    print(f"  TCAV Score: {bias['tcav_score']:.3f} (p={bias['p_value']:.4f})")
    print(f"  {bias['interpretation']}\n")

# Visualization: heatmap of TCAV scores
fig, ax = plt.subplots(figsize=(12, 8))

score_matrix = np.zeros((len(concepts), len(classes)))
for i, concept in enumerate(concepts):
    for j, cls in enumerate(classes):
        score_matrix[i, j] = tcav_scores.get((concept, cls), 0.5)

im = ax.imshow(score_matrix, cmap='RdBu_r', vmin=0, vmax=1)

ax.set_xticks(range(len(classes)))
ax.set_xticklabels(classes, rotation=45, ha='right')
ax.set_yticks(range(len(concepts)))
ax.set_yticklabels(concepts)

# Add score annotations
for i in range(len(concepts)):
    for j in range(len(classes)):
        score = score_matrix[i, j]
        color = 'white' if abs(score - 0.5) > 0.2 else 'black'
        ax.text(j, i, f'{score:.2f}', ha='center', va='center', color=color, fontsize=9)

ax.set_title('TCAV Scores: Protected Concepts vs Occupation Classes\n'
             '(0.5 = no association, >0.5 = positive, <0.5 = negative)')
plt.colorbar(im, ax=ax, label='TCAV Score')
plt.tight_layout()
plt.savefig('bias_heatmap.png', dpi=150)
plt.show()
```

A high TCAV score doesn't automatically mean problematic bias—it might reflect real-world correlations in training data that are legitimate for the task. The question is whether the association is appropriate for the decision being made. Gender influencing 'requires_restroom_signage' might be legitimate; gender influencing 'should_get_loan' is not.
The original TCAV requires manually defining concepts with curated image sets. Automatic Concept Explanations (ACE) extends this by automatically discovering salient concepts in a model's latent space.
ACE Method (Ghorbani et al., 2019): (1) segment images of the target class into patches at multiple resolutions (e.g., superpixels); (2) pass the patches through the network, collect their activations at a chosen layer, and cluster them, so that each cluster becomes a candidate concept; (3) compute TCAV scores for each cluster to rank the discovered concepts by importance.
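The segmentation step is the piece not covered by the simplified discovery code later in this section; a rough sketch is below, assuming scikit-image is available (segment counts, the grey fill value, and resizing details vary across implementations):

```python
import numpy as np
from skimage.segmentation import slic

def extract_segments(image, n_segments=15):
    """Split an RGB image (H, W, 3, floats in [0, 1]) into superpixel patches."""
    seg_map = slic(image, n_segments=n_segments, compactness=20, start_label=0)
    patches = []
    for seg_id in np.unique(seg_map):
        mask = seg_map == seg_id
        patch = image.copy()
        patch[~mask] = 0.5  # fill everything outside the segment with a neutral grey
        patches.append(patch)
    return patches

# Toy image: random noise stands in for a real class image
image = np.random.rand(128, 128, 3)
patches = extract_segments(image)
print(f"{len(patches)} candidate concept patches")
# In ACE, each patch is resized to the network's input size, its activations
# are computed, patches from many images are clustered in activation space,
# and each cluster is then scored with TCAV.
```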
Advantages: no manual concept curation is needed, the discovered concepts reflect what the model actually encodes, and the approach can surface concepts a human analyst might not think to test.
Other Extensions:
| Method | Concept Source | Key Innovation | Best For |
|---|---|---|---|
| TCAV | Human-provided examples | Original quantitative concept testing | Hypothesis testing about specific concepts |
| ACE | Automatic clustering | No manual curation needed | Exploring what concepts model uses |
| Net2Vec | Word embeddings | Aligns visual and semantic spaces | Bridging vision and language |
| Concept Bottleneck | Supervision on concepts | End-to-end concept learning | When concept labels available |
| CRAFT | Recursive segmentation | Multi-scale concept discovery | Detailed concept decomposition |
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def discover_concepts_simplified(activations, n_concepts=10, random_state=42):
    """
    Simplified automatic concept discovery.

    In practice, this would operate on image segments/patches,
    not full-image activations. This is a demonstration.

    Args:
        activations: [n_samples, n_features] activation matrix
        n_concepts: Number of concepts to discover

    Returns:
        concept_centers: [n_concepts, n_features] cluster centers
        labels: [n_samples] cluster assignment for each sample
    """
    # Optional: reduce dimensionality for clustering
    pca = PCA(n_components=min(50, activations.shape[1]))
    activations_reduced = pca.fit_transform(activations)

    # Cluster activations
    kmeans = KMeans(n_clusters=n_concepts, random_state=random_state, n_init=10)
    labels = kmeans.fit_predict(activations_reduced)

    # Get cluster centers in the original activation space
    concept_centers = []
    for i in range(n_concepts):
        cluster_mask = labels == i
        if cluster_mask.sum() > 0:
            center = activations[cluster_mask].mean(axis=0)
            concept_centers.append(center)

    return np.array(concept_centers), labels

def rank_concepts_by_importance(concept_centers, class_gradient):
    """
    Rank discovered concepts by their influence on a target class.

    Args:
        concept_centers: [n_concepts, n_features]
        class_gradient: [n_features] gradient of class logit w.r.t. activations

    Returns:
        Sorted list of (concept_idx, importance_score)
    """
    importances = []

    for i, center in enumerate(concept_centers):
        # Normalize to get a direction (CAV-like)
        direction = center / (np.linalg.norm(center) + 1e-8)

        # Directional derivative along the concept direction = importance
        importance = np.dot(class_gradient, direction)
        importances.append((i, importance))

    # Sort by absolute importance
    return sorted(importances, key=lambda x: abs(x[1]), reverse=True)

# Simulation
np.random.seed(42)

# Simulate activations with hidden structure
n_samples = 500
n_features = 256

# Create ground-truth concept directions embedded in the activations
concept_directions = np.random.randn(5, n_features)
concept_directions = concept_directions / np.linalg.norm(concept_directions, axis=1, keepdims=True)

activations = np.random.randn(n_samples, n_features) * 0.1
for i in range(n_samples):
    # Add a random combination of the ground-truth concepts
    for direction in concept_directions:
        weight = np.random.rand() * 2
        activations[i] += weight * direction

# Discover concepts
discovered_centers, cluster_labels = discover_concepts_simplified(activations, n_concepts=8)

print(f"Discovered {len(discovered_centers)} concepts")
print(f"Cluster sizes: {[np.sum(cluster_labels == i) for i in range(len(discovered_centers))]}")

# Simulate a class gradient (some concepts important, others not)
class_gradient = np.zeros(n_features)
class_gradient += 0.5 * concept_directions[0]  # Concept 0 is important
class_gradient += 0.3 * concept_directions[2]  # Concept 2 is somewhat important
class_gradient += np.random.randn(n_features) * 0.1  # Noise

# Rank discovered concepts
rankings = rank_concepts_by_importance(discovered_centers, class_gradient)

print("\nConcept Rankings by Class Importance:")
for rank, (concept_idx, importance) in enumerate(rankings):
    print(f"  {rank+1}. Concept {concept_idx}: importance = {importance:.4f}")
```

While CAVs provide valuable insights, they have important limitations that practitioners must understand:
Fundamental Limitations:
Linearity Assumption: CAVs assume concepts are linear directions in activation space. Concepts with non-linear representations won't be captured.
Concept Leakage: Concept examples may encode correlated concepts. 'Striped' images might also be 'outdoor scenes', confounding the CAV.
Layer Dependence: CAVs at different layers may give different importance rankings. Which layer is 'correct'?
Human Concept Bias: We can only test concepts we think to define. Models might use different concepts than humans assume.
Statistical Sensitivity: With few concept examples or test images, TCAV scores can be noisy and unreliable.
A low TCAV score (near 0.5) could mean: (1) The model doesn't use this concept, (2) The concept isn't linearly represented at this layer, (3) Concept examples are poor, or (4) Positive and negative effects cancel out. Always validate with multiple approaches before concluding a concept is irrelevant.
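One way to distinguish case (1) from case (2) above is to compare a linear probe against a small non-linear probe on the same concept-vs-random activations; a large gap suggests the concept is present but not as a single linear direction. The sketch below uses synthetic activations purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic activations: the 'concept' is encoded in the magnitude of unit 0
# (values near ±2), a pattern a linear decision boundary cannot exploit.
concept_acts = rng.normal(size=(100, 64))
concept_acts[:, 0] = rng.choice([-2.0, 2.0], size=100) + rng.normal(scale=0.3, size=100)
random_acts = rng.normal(size=(100, 64))
random_acts[:, 0] = rng.normal(scale=0.3, size=100)

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)

linear_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
mlp_acc = cross_val_score(
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0), X, y, cv=5
).mean()

print(f"Linear probe accuracy: {linear_acc:.2f}")
print(f"Non-linear probe accuracy: {mlp_acc:.2f}")
if mlp_acc - linear_acc > 0.1:
    print("Concept may be non-linearly encoded at this layer; a CAV would miss it.")
```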
Concept Activation Vectors provide a powerful framework for understanding neural networks in human-interpretable terms. Here's the essential knowledge:
- CAVs represent human-interpretable concepts as directions in a layer's activation space, obtained by training a linear classifier to separate concept activations from random activations.
- TCAV quantifies concept importance as the fraction of examples whose class prediction increases when the activation moves in the concept direction, backed by statistical tests against random concepts.
- Validate CAVs (classifier accuracy, random baselines, stability), choose layers where the concept is actually represented, and watch for confounded concept sets.
- CAVs are useful for bias detection, but keep the linearity assumption and other limitations in mind before drawing conclusions.

Module Summary:
Throughout this module, we've explored model-specific interpretability techniques: coefficients and feature effects for linear models, decision rules and feature importances for tree ensembles, attention patterns for transformers, and saliency maps and concept activation vectors for deep neural networks.
Together, these methods provide a comprehensive toolkit for understanding what models have learned and why they make specific predictions. The choice of method depends on the model architecture, the type of question being asked, and the audience for the explanation.
You have completed Module 3: Model-Specific Interpretability. You now have deep understanding of how to interpret predictions from linear models, tree ensembles, transformers, and deep neural networks using coefficients, decision rules, attention patterns, saliency maps, and concept activation vectors. This foundational knowledge prepares you for the fairness and practical interpretability topics in subsequent modules.