For differentiable models like neural networks, gradients offer a natural measure of feature importance: if changing a feature slightly would significantly change the prediction, that feature must be important. This intuition underlies gradient-based attribution methods—a family of techniques that explain predictions using the model's gradients.
But vanilla gradients have a critical flaw: saturation. ReLU networks produce zero gradients for many inputs because neurons saturate: an inactive ReLU outputs exactly 0 and passes no gradient through. A feature might be crucial for a prediction, but its gradient could be zero simply because the network has saturated. This makes raw gradients unreliable for attribution.
Integrated Gradients (IG), introduced by Sundararajan, Taly, and Yan (2017), solves this problem elegantly. Instead of looking at gradients at a single point, IG integrates gradients along a path from a baseline (absence of information) to the actual input. This integration accumulates attributions even when instantaneous gradients are zero.
Remarkably, Integrated Gradients satisfies a set of axiomatic properties that make it uniquely suitable for attribution—including the same "efficiency" property that SHAP inherits from Shapley values. This makes IG the theoretically grounded standard for neural network interpretation.
This page covers Integrated Gradients from first principles: why vanilla gradients fail, how integration solves the problem, the axiomatic properties that make IG principled, baseline selection strategies, and practical implementation for various neural network architectures.
The simplest gradient-based attribution assigns each feature importance proportional to its gradient:
$$\text{Attribution}_i = \frac{\partial f(x)}{\partial x_i}$$
This measures the instantaneous sensitivity of the output to each input feature. Large gradient magnitude means small changes to that feature would significantly affect the prediction.
Consider a simple ReLU network:
$$f(x) = \max(0, w \cdot x - b)$$
If $w \cdot x > b$ (above threshold), then $\nabla_x f = w$: the gradient reflects the weights.
If $w \cdot x < b$ (below threshold), then $\nabla_x f = 0$: zero gradient regardless of how important $x$ is.
This is catastrophic for attribution. An input could be essential for achieving $w \cdot x > b$ (without it, the prediction would be different), but its gradient is zero because we're in the saturated, flat regime.
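This failure mode is easy to reproduce numerically. The sketch below (plain NumPy, with the gradient of this one-unit network written out analytically) shows an input below the threshold: its feature values still determine whether the threshold can be crossed, yet the gradient is identically zero.

```python
import numpy as np

# One ReLU unit: f(x) = max(0, w.x - b), with its gradient written analytically
w = np.array([2.0, 1.0])
b = 3.0

def f(x):
    return max(0.0, w @ x - b)

def grad_f(x):
    # w above the threshold, exactly zero below it
    return w if w @ x - b > 0 else np.zeros_like(w)

x_above = np.array([2.0, 1.0])   # w.x = 5.0 > b: unit is active
x_below = np.array([1.0, 0.5])   # w.x = 2.5 < b: unit is saturated

print(grad_f(x_above))  # [2. 1.] -- gradient reflects the weights
print(grad_f(x_below))  # [0. 0.] -- zero, though x still controls the threshold
```

Doubling `x_below` would activate the unit and change the prediction, yet vanilla gradients assign it zero importance.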
Imagine a cat classifier that is highly confident on a cat image. The fur and ears clearly drive the prediction, yet because the score has saturated, nudging those pixels barely changes the output, and their gradients are near zero. The same effect appears in small networks:
```python
import numpy as np
import torch
import torch.nn as nn

# Simple ReLU network
class SimpleReLUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 10)
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Initialize model
model = SimpleReLUNet()

# Input that puts some neurons into saturation
x = torch.tensor([[0.5, 0.3, 0.8, 0.1, 0.2]], requires_grad=True)

# Forward pass
output = model(x)

# Compute gradient (vanilla attribution)
output.backward()
vanilla_grad = x.grad.clone()

print("Input:", x.data.numpy().flatten())
print("Vanilla Gradient:", vanilla_grad.numpy().flatten())

# Check hidden activations
with torch.no_grad():
    hidden = model.fc1(x)
    relu_output = torch.relu(hidden)

print("Hidden layer (before ReLU):", hidden.numpy().flatten())
print("Hidden layer (after ReLU):", relu_output.numpy().flatten())

# Count saturated neurons (output = 0)
saturated = (relu_output == 0).sum().item()
total = relu_output.numel()
print(f"Saturated neurons: {saturated}/{total} ({100*saturated/total:.1f}%)")
print("Vanilla gradients IGNORE contributions through saturated neurons!")
```

Some methods multiply gradient by input (Gradient × Input). This helps with sign but doesn't solve saturation: zero gradient times any input is still zero. The fundamental problem is the gradient itself, not scaling.
Integrated Gradients addresses saturation by accumulating gradients along a path from a baseline (representing "absence") to the actual input.
For a differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, an input $x \in \mathbb{R}^n$, and a baseline $x' \in \mathbb{R}^n$, the Integrated Gradient for feature $i$ is:
$$\text{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial f(x' + \alpha \cdot (x - x'))}{\partial x_i} \, d\alpha$$
The path: $x' + \alpha \cdot (x - x')$ for $\alpha \in [0, 1]$
The integral: $\int_0^1 \frac{\partial f}{\partial x_i} d\alpha$
The scaling: $(x_i - x'_i)$
Consider a ReLU that saturates at the input but not at intermediate points:
The path integral "sweeps" through all intermediate states, accumulating attribution whether or not any particular point is saturated.
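To see the sweep in action, the sketch below approximates the IG integral with a midpoint Riemann sum for a single ReLU unit (plain NumPy, analytic gradient; a toy setup, not the paper's implementation). Although the gradient is zero on the first part of the path, the accumulated attributions are nonzero and sum to $f(x) - f(x')$:

```python
import numpy as np

# Single ReLU unit f(x) = max(0, w.x - b) with an analytic gradient
w = np.array([2.0, 1.0])
b = 3.0
f = lambda x: max(0.0, w @ x - b)
grad = lambda x: w * float(w @ x - b > 0)

x = np.array([2.0, 1.0])   # f(x) = 2.0
baseline = np.zeros(2)     # f(baseline) = 0.0

# Midpoint Riemann approximation of the path integral
m = 1000
alphas = (np.arange(m) + 0.5) / m
avg_grad = np.mean([grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
ig = (x - baseline) * avg_grad

print(ig)                            # nonzero attribution for both features
print(ig.sum(), f(x) - f(baseline))  # completeness: the two values match
```

The unit only activates for $\alpha > 0.6$ along this path, so the averaged gradient is $0.4\,w$, yet the attributions still account for the full output difference.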
Integrated Gradients connects to the fundamental theorem of calculus. Along the path:
$$f(x) - f(x') = \int_0^1 \frac{d}{d\alpha} f(x' + \alpha(x-x')) d\alpha$$
$$= \int_0^1 \sum_i \frac{\partial f}{\partial x_i} (x_i - x'_i) d\alpha = \sum_i \text{IG}_i(x)$$
This gives us a remarkable property:
$$\sum_i \text{IG}_i(x) = f(x) - f(x')$$
The Integrated Gradients sum exactly to the difference between the output at $x$ and the output at baseline. No attribution is lost or created—complete decomposition.
This is analogous to SHAP's local accuracy: the attributions sum exactly to the prediction difference. Integrated Gradients and SHAP share this property, making them both 'complete' attribution methods. This isn't coincidence—both satisfy efficiency/completeness axioms.
The power of Integrated Gradients lies in its axiomatic foundation. Sundararajan et al. proved that IG is the unique method satisfying certain desirable properties.
If the input and baseline differ in exactly one feature, and the output differs, then the differing feature should receive non-zero attribution.
Formally: If $x$ and $x'$ differ only in feature $i$ and $f(x) \neq f(x')$, then $\text{IG}_i(x) \neq 0$.
Why it matters: A feature that demonstrably affects the output should be attributed some importance. Vanilla gradients fail this—a feature could flip the output yet have zero gradient.
If a feature doesn't appear in the model's computation (mathematically doesn't affect output), its attribution should be zero.
Formally: If $\frac{\partial f}{\partial x_i} = 0$ everywhere, then $\text{IG}_i = 0$ for all inputs.
Why it matters: Don't attribute importance to unused features.
Two networks that compute the same function (same output for all inputs) should produce the same attributions, regardless of internal architecture.
Formally: If $f(x) = g(x)$ for all $x$, then $\text{IG}^f(x) = \text{IG}^g(x)$.
Why it matters: Attribution should depend on what the model does, not how it's implemented. A ReLU network and a Swish network computing the same function should give the same explanations.
The attributions sum to the difference between output at input and output at baseline.
$$\sum_i \text{IG}_i(x) = f(x) - f(x')$$
Why it matters: Complete accounting of the prediction. No unexplained residual.
| Method | Sensitivity | Implementation Invariance | Completeness |
|---|---|---|---|
| Vanilla Gradients | ❌ No | ✅ Yes | ❌ No |
| Gradient × Input | ❌ No | ✅ Yes | ❌ No |
| LRP (Layer-wise Relevance) | ✅ Yes | ❌ No | ✅ Yes |
| DeepLIFT | ✅ Yes | ❌ No | ✅ Yes |
| Integrated Gradients | ✅ Yes | ✅ Yes | ✅ Yes |
Sundararajan et al. proved that path methods are the only attribution methods satisfying Sensitivity and Implementation Invariance, and that Integrated Gradients, with its straight-line path, is the unique path method that also treats symmetric features symmetrically. Just as Shapley values are unique given their axioms, IG is unique given these axioms. If you want a gradient-based method with these guarantees, you must use IG.
The baseline $x'$ represents "absence of information." Its choice significantly affects attributions. Unlike SHAP (which averages over a background distribution), IG uses a single baseline.
For images:
- Black image (all zeros): Standard choice. Represents absence of visual information.
- White image: Alternative, especially for dark images.
- Random noise: Distributes baseline information across pixels.
- Blurred version: Removes sharp features while maintaining global structure.
For text (embeddings):
- All-zeros embedding: Vector of zeros for each token position.
- Padding token: Embed the padding/unknown token.
- Mean embedding: Average embedding across the vocabulary.
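For text, these choices amount to different rows (or averages) of the embedding matrix. A toy NumPy sketch, with a hypothetical embedding table `E` and padding id 0:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 16
E = rng.normal(size=(vocab_size, dim))  # toy embedding matrix
PAD_ID = 0                              # hypothetical padding token id

token_ids = np.array([5, 42, 7])  # a three-token input
x = E[token_ids]                  # (seq_len, dim) input embeddings

baseline_zeros = np.zeros_like(x)
baseline_pad = np.tile(E[PAD_ID], (len(token_ids), 1))
baseline_mean = np.tile(E.mean(axis=0), (len(token_ids), 1))

print(x.shape, baseline_pad.shape, baseline_mean.shape)  # all (3, 16)
```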
For tabular data:
- Zeros: If features are standardized (mean = 0).
- Feature means: Average value of each feature in the training data.
- Trained reference point: Learn an optimal baseline.
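As a small illustration for the tabular case (toy NumPy data; `X_train` is a hypothetical feature matrix), the zeros and feature-means strategies are built like this, and they coincide once features are standardized:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))  # toy training features

baseline_zeros = np.zeros(X_train.shape[1])  # sensible only for standardized data
baseline_means = X_train.mean(axis=0)        # "average row" reference point

# After standardization, the feature-means baseline becomes the zero vector
X_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.abs(X_std.mean(axis=0)).max())  # ~0: the two strategies now agree
```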
A good baseline should produce a near-neutral prediction, lie reasonably close to the data distribution, and represent a meaningful notion of "absence" to whoever reads the explanation.
```python
import numpy as np
import torch
import torch.nn as nn

def integrated_gradients(
    model,
    input_tensor,
    baseline,
    steps=50,
    target_class=None
):
    """
    Compute Integrated Gradients for a PyTorch model.

    Parameters
    ----------
    model : PyTorch model
    input_tensor : input to explain (requires_grad=True)
    baseline : baseline tensor
    steps : number of interpolation steps
    target_class : class to explain (for classification)

    Returns
    -------
    attributions : tensor of same shape as input
    """
    # Generate interpolated inputs
    alphas = torch.linspace(0, 1, steps).view(-1, 1)

    # Compute path: baseline + alpha * (input - baseline)
    # Shape: (steps, *input_shape)
    delta = input_tensor - baseline
    interpolated = baseline + alphas.unsqueeze(-1) * delta.unsqueeze(0)

    # Compute gradients at each interpolated point
    interpolated.requires_grad_(True)

    # Forward pass
    outputs = model(interpolated.view(-1, *input_tensor.shape))
    if target_class is not None:
        outputs = outputs[:, target_class]

    # Backward pass
    grads = torch.autograd.grad(
        outputs=outputs.sum(),
        inputs=interpolated,
        create_graph=False
    )[0]

    # Average gradients (Riemann approximation of integral)
    avg_grads = grads.mean(dim=0)

    # Scale by (input - baseline)
    attributions = delta * avg_grads
    return attributions

# Compare different baselines
def baseline_comparison(model, input_tensor):
    """Compare IG attributions with different baselines."""
    baselines = {
        'zeros': torch.zeros_like(input_tensor),
        'ones': torch.ones_like(input_tensor),
        'mean': torch.full_like(input_tensor, 0.5),
        'random': torch.rand_like(input_tensor)
    }

    results = {}
    for name, baseline in baselines.items():
        attributions = integrated_gradients(model, input_tensor, baseline)
        results[name] = {
            'attributions': attributions,
            'baseline_prediction': model(baseline.unsqueeze(0)).item(),
            'input_prediction': model(input_tensor.unsqueeze(0)).item(),
            'attribution_sum': attributions.sum().item()
        }

        # Verify completeness
        expected_diff = (results[name]['input_prediction']
                         - results[name]['baseline_prediction'])
        actual_sum = results[name]['attribution_sum']
        print(f"{name} baseline:")
        print(f"  Baseline prediction: {results[name]['baseline_prediction']:.4f}")
        print(f"  Input prediction: {results[name]['input_prediction']:.4f}")
        print(f"  Attribution sum: {actual_sum:.4f}")
        print(f"  Expected diff: {expected_diff:.4f}")
        print(f"  Completeness error: {abs(actual_sum - expected_diff):.6f}")

    return results
```

Different baselines can produce significantly different attributions. For important applications, conduct sensitivity analysis with multiple baselines, or use Expected Integrated Gradients (averaging over multiple baselines from a distribution).
The Riemann sum approximation of the integral requires choosing the number of steps. More steps = more accurate but slower.
The integral is approximated by:
$$\text{IG}_i(x) \approx (x_i - x'_i) \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_i}$$
where $m$ is the number of steps (typically 20-300).
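The effect of $m$ can be checked directly. The sketch below (NumPy, with hand-written gradients for the toy function $f(x_1, x_2) = x_1^2 x_2$ and a zero baseline, where the exact attribution sum is $f(x)$) shows the right-Riemann approximation error shrinking as the step count grows:

```python
import numpy as np

# Toy smooth function with an analytic gradient; exact IG sum = f(x) - f(0)
f = lambda x: x[0] ** 2 * x[1]
grad = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2])

x = np.array([2.0, 3.0])
exact = f(x)  # 12.0 (baseline is the zero vector, f(0) = 0)

def ig_sum(m):
    alphas = np.linspace(1 / m, 1, m)  # right Riemann rule with m steps
    avg_grad = np.mean([grad(a * x) for a in alphas], axis=0)
    return (x * avg_grad).sum()

for m in (10, 100, 1000):
    print(m, abs(ig_sum(m) - exact))  # error shrinks roughly like 1/m
```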
```python
import torch
import torch.nn as nn
import numpy as np
from typing import Optional, Tuple, Union

class IntegratedGradients:
    """
    Integrated Gradients attribution for PyTorch models.

    Implements the method from Sundararajan et al., 2017:
    "Axiomatic Attribution for Deep Networks"
    """

    def __init__(self, model: nn.Module):
        """
        Parameters
        ----------
        model : PyTorch model (must be differentiable)
        """
        self.model = model
        self.model.eval()

    def attribute(
        self,
        inputs: torch.Tensor,
        baselines: Optional[torch.Tensor] = None,
        target: Optional[int] = None,
        n_steps: int = 50,
        method: str = 'riemann_trapezoid',
        return_convergence_delta: bool = False
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, float]]:
        """
        Compute Integrated Gradients attributions.

        Parameters
        ----------
        inputs : input tensor to explain, shape (batch, *input_dims)
        baselines : baseline tensor, same shape as inputs (default: zeros)
        target : target class for classification (None for regression)
        n_steps : number of integration steps (higher = more accurate)
        method : integration method ('riemann_left', 'riemann_right',
                 'riemann_trapezoid')
        return_convergence_delta : if True, return approximation error

        Returns
        -------
        attributions : tensor of same shape as inputs
        delta : (optional) convergence error
        """
        if baselines is None:
            baselines = torch.zeros_like(inputs)

        assert inputs.shape == baselines.shape, \
            "Inputs and baselines must have same shape"

        # Generate alphas based on method
        if method == 'riemann_left':
            alphas = torch.linspace(0, 1 - 1/n_steps, n_steps)
        elif method == 'riemann_right':
            alphas = torch.linspace(1/n_steps, 1, n_steps)
        elif method == 'riemann_trapezoid':
            alphas = torch.linspace(0, 1, n_steps)
        else:
            raise ValueError(f"Unknown method: {method}")

        alphas = alphas.to(inputs.device)

        # Compute path inputs: baseline + alpha * (input - baseline)
        # Shape: (n_steps, batch, *input_dims)
        delta = inputs - baselines
        path_inputs = (baselines.unsqueeze(0)
                       + alphas.view(-1, *([1] * len(inputs.shape)))
                       * delta.unsqueeze(0))

        # Flatten steps and batch for a single forward pass
        batch_size = inputs.shape[0]
        path_inputs_flat = path_inputs.view(-1, *inputs.shape[1:])
        path_inputs_flat.requires_grad_(True)

        # Forward pass
        outputs = self.model(path_inputs_flat)

        # Select target class if specified
        if target is not None:
            if outputs.dim() > 1:
                outputs = outputs[:, target]
        else:
            if outputs.dim() > 1:
                outputs = outputs.max(dim=1).values

        # Compute gradients
        grads = torch.autograd.grad(
            outputs=outputs.sum(),
            inputs=path_inputs_flat,
            create_graph=False
        )[0]

        # Reshape back to (n_steps, batch, *input_dims)
        grads = grads.view(n_steps, batch_size, *inputs.shape[1:])

        # Integrate (average) gradients
        if method == 'riemann_trapezoid':
            # Trapezoidal rule: (f(0)/2 + f(1) + ... + f(n-1) + f(n)/2) / n
            grads = (grads[:-1] + grads[1:]) / 2
        avg_grads = grads.mean(dim=0)

        # Scale by (input - baseline)
        attributions = delta * avg_grads

        if return_convergence_delta:
            # Verify completeness: attributions should sum to the output difference
            with torch.no_grad():
                input_output = self.model(inputs)
                baseline_output = self.model(baselines)
                if target is not None:
                    input_output = input_output[:, target]
                    baseline_output = baseline_output[:, target]
                else:
                    if input_output.dim() > 1:
                        input_output = input_output.max(dim=1).values
                        baseline_output = baseline_output.max(dim=1).values
                expected_diff = input_output - baseline_output
                actual_sum = attributions.view(batch_size, -1).sum(dim=1)
                convergence_delta = (expected_diff - actual_sum).abs().mean().item()
            return attributions, convergence_delta

        return attributions

    def attribute_with_convergence_check(
        self,
        inputs: torch.Tensor,
        baselines: Optional[torch.Tensor] = None,
        target: Optional[int] = None,
        convergence_threshold: float = 0.01,
        initial_steps: int = 50,
        max_steps: int = 500,
        step_multiplier: float = 2
    ) -> Tuple[torch.Tensor, int]:
        """
        Compute attributions with automatic step adjustment for convergence.

        Returns
        -------
        attributions : converged attributions
        final_steps : number of steps used
        """
        n_steps = initial_steps
        while n_steps <= max_steps:
            attributions, delta = self.attribute(
                inputs, baselines, target, n_steps,
                return_convergence_delta=True
            )
            if delta < convergence_threshold:
                return attributions, n_steps
            n_steps = int(n_steps * step_multiplier)

        print(f"Warning: Did not converge within {max_steps} steps "
              f"(delta={delta:.4f})")
        return attributions, n_steps

# Example usage
if __name__ == "__main__":
    # Create a simple model
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 20),
        nn.ReLU(),
        nn.Linear(20, 2),
        nn.Softmax(dim=1)
    )

    # Random input
    x = torch.rand(1, 10)
    baseline = torch.zeros(1, 10)

    # Create explainer
    ig = IntegratedGradients(model)

    # Compute attributions
    attrs, delta = ig.attribute(
        x, baseline, target=1, n_steps=100,
        return_convergence_delta=True
    )

    print("Input shape:", x.shape)
    print("Attribution shape:", attrs.shape)
    print(f"Convergence delta: {delta:.6f}")
    print("Attributions:", attrs.numpy().flatten())
    print("Attribution sum:", attrs.sum().item())

    # Verify: should equal f(x) - f(baseline)
    with torch.no_grad():
        diff = model(x)[0, 1] - model(baseline)[0, 1]
    print(f"Output difference: {diff.item():.6f}")
```

For production use, Captum (by Facebook/Meta) provides highly optimized implementations of Integrated Gradients and many other attribution methods.
pip install captum
```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients, NoiseTunnel
from captum.attr import visualization as viz
import numpy as np

# Define model
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
model.eval()

# Create explainer
ig = IntegratedGradients(model)

# Random image input (batch, channels, height, width)
input_img = torch.rand(1, 3, 32, 32, requires_grad=True)
baseline = torch.zeros(1, 3, 32, 32)

# Compute attributions for class 5
attr, delta = ig.attribute(
    input_img,
    baselines=baseline,
    target=5,
    n_steps=50,
    return_convergence_delta=True
)

print(f"Attribution shape: {attr.shape}")
print(f"Convergence delta: {delta.item():.6f}")

# For noisy inputs, use NoiseTunnel for smoothed attributions
nt = NoiseTunnel(ig)
smoothed_attr = nt.attribute(
    input_img,
    baselines=baseline,
    nt_type='smoothgrad',
    nt_samples=10,
    target=5
)

print(f"Smoothed attribution shape: {smoothed_attr.shape}")
```
```python
import torch
from captum.attr import IntegratedGradients
from captum.attr import visualization as viz
import matplotlib.pyplot as plt
import numpy as np

def visualize_image_attributions(
    model,
    input_image,     # Shape: (1, C, H, W)
    target_class,
    baselines=None,
    n_steps=50
):
    """
    Visualize IG attributions for image classification.
    """
    if baselines is None:
        baselines = torch.zeros_like(input_image)

    # Compute attributions
    ig = IntegratedGradients(model)
    attributions = ig.attribute(
        input_image,
        baselines=baselines,
        target=target_class,
        n_steps=n_steps
    )

    # Convert to numpy for visualization
    # Sum across channels for a single attribution map
    attr_sum = attributions.squeeze().sum(dim=0).detach().numpy()

    # Original image (assuming normalized)
    img_np = input_image.squeeze().permute(1, 2, 0).detach().numpy()

    # Normalize attribution for visualization
    attr_normalized = attr_sum / np.max(np.abs(attr_sum))

    # Create visualization
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    # Original image
    axes[0].imshow(img_np)
    axes[0].set_title('Original Image')
    axes[0].axis('off')

    # Attribution heatmap
    im = axes[1].imshow(attr_normalized, cmap='RdBu_r', vmin=-1, vmax=1)
    axes[1].set_title('IG Attribution')
    axes[1].axis('off')
    plt.colorbar(im, ax=axes[1], fraction=0.046)

    # Positive attributions only
    attr_positive = np.maximum(attr_normalized, 0)
    axes[2].imshow(attr_positive, cmap='Reds')
    axes[2].set_title('Positive Attribution')
    axes[2].axis('off')

    # Overlay
    axes[3].imshow(img_np)
    axes[3].imshow(attr_normalized, cmap='RdBu_r', alpha=0.5, vmin=-1, vmax=1)
    axes[3].set_title('Overlay')
    axes[3].axis('off')

    plt.tight_layout()
    plt.savefig('ig_visualization.png', dpi=150, bbox_inches='tight')
    plt.show()

    return attributions

# For Captum's built-in visualization (more sophisticated)
def captum_builtin_viz(original_image, attributions, prediction_class):
    """
    Use Captum's built-in visualization utilities.
    """
    # Transpose to (H, W, C) for visualization
    attr_np = attributions.squeeze().permute(1, 2, 0).detach().numpy()
    img_np = original_image.squeeze().permute(1, 2, 0).detach().numpy()

    # Blended heat map
    fig, ax = viz.visualize_image_attr(
        attr_np,
        img_np,
        method='blended_heat_map',
        sign='all',
        show_colorbar=True,
        title=f'IG Attribution for class {prediction_class}'
    )
    return fig
```

Integrated Gradients belongs to a family of gradient-based attribution methods. Understanding their relationships clarifies when to use each.
Vanilla Gradients: $\nabla_x f(x)$
Gradient × Input: $x \odot \nabla_x f(x)$
SmoothGrad: Average gradients over noisy versions of input
Guided Backpropagation: Only backprop positive gradients through ReLUs
DeepLIFT: Reference-based, uses contribution rules per layer
Layer-wise Relevance Propagation (LRP): Propagate relevance backward through layers
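SmoothGrad is simple enough to sketch in a few lines. The toy below (NumPy, reusing the single ReLU unit from earlier with its analytic gradient; a sketch of the idea, not a library implementation) shows how averaging gradients over noisy copies of a saturated input yields a nonzero smoothed gradient even where the vanilla gradient is exactly zero:

```python
import numpy as np

# Single ReLU unit f(x) = max(0, w.x - b), analytic gradient
w = np.array([2.0, 1.0])
b = 3.0
grad = lambda x: w * float(w @ x - b > 0)

x = np.array([1.0, 0.5])  # saturated: w.x = 2.5 < b, vanilla gradient is zero

# SmoothGrad: average gradients over noise-perturbed copies of the input
rng = np.random.default_rng(0)
n_samples, sigma = 1000, 0.5
smooth = np.mean(
    [grad(x + rng.normal(0.0, sigma, size=x.shape)) for _ in range(n_samples)],
    axis=0,
)

print(grad(x))  # [0. 0.]
print(smooth)   # nonzero: noise pushes some samples past the threshold
```

Note that SmoothGrad still lacks completeness: the smoothed values are denoised gradients, not a decomposition of the prediction.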
| Method | Completeness | Sensitivity | Impl. Invariance | Speed |
|---|---|---|---|---|
| Vanilla Gradient | ❌ | ❌ | ✅ | Very fast |
| Gradient × Input | ❌ | ❌ | ✅ | Very fast |
| SmoothGrad | ❌ | ❌ | ✅ | Moderate |
| Guided Backprop | ❌ | ? | ❌ | Fast |
| DeepLIFT | ✅ | ✅ | ❌ | Fast |
| LRP | ✅ | ✅ | ❌ | Fast |
| Integrated Gradients | ✅ | ✅ | ✅ | Slow |
Use IG when axiomatic guarantees matter: research papers, regulatory explanations, and when you need provable properties. For quick exploratory visualization where speed matters more than theory, vanilla gradients or SmoothGrad may suffice.
| Aspect | Integrated Gradients | SHAP (DeepSHAP/GradientSHAP) |
|---|---|---|
| Theoretical basis | Path integral of gradients | Shapley values (game theory) |
| Baseline | Single explicit baseline | Distribution of baselines |
| Completeness | IG sums to f(x) - f(baseline) | SHAP sums to f(x) - E[f(X)] |
| Correlation handling | Assumes feature independence | Better with observational approach |
| Speed | O(steps × forward+backward) | Similar (multiple passes) |
| Uniqueness theorem | Yes (for path methods) | Yes (for Shapley) |
Practical advice: Both are principled. Use IG for straightforward implementation with explicit baseline. Use SHAP when you want importance relative to a data distribution (average prediction) rather than a single reference point.
Instead of a single baseline, average IG over multiple baselines sampled from a distribution:
$$\text{EIG}_i(x) = \mathbb{E}_{x' \sim D}\left[\text{IG}_i(x; x')\right]$$
This makes attributions more robust to baseline choice and connects IG closer to SHAP (which implicitly averages over a background distribution).
Captum provides GradientSHAP, which combines ideas from SHAP and IG: it samples baselines from a background distribution, adds small noise to inputs, and evaluates gradients at random points along each baseline-to-input path. This provides a fast approximation to SHAP values using gradients.
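The core of this idea can be sketched without Captum: sample a baseline and a random position on its path to the input, take the gradient there, scale by (input − baseline), and average. The toy below (NumPy, analytic gradient of a hypothetical smooth function; a Monte Carlo sketch of the sampling scheme, omitting the input noise GradientSHAP also adds) shows the attributions approximately summing to $f(x) - \mathbb{E}[f(x')]$:

```python
import numpy as np

# Hypothetical smooth model f(x1, x2) = x1^2 * x2 with an analytic gradient
f = lambda x: x[0] ** 2 * x[1]
grad = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2])

rng = np.random.default_rng(0)
x = np.array([2.0, 3.0])
baselines = rng.normal(0.0, 0.1, size=(200, 2))  # background distribution

# Monte Carlo: random baseline + random position along its path to x
n_samples = 5000
attrs = np.zeros_like(x)
for _ in range(n_samples):
    x0 = baselines[rng.integers(len(baselines))]
    a = rng.uniform()
    attrs += (x - x0) * grad(x0 + a * (x - x0))
attrs /= n_samples

expected = f(x) - np.mean([f(b) for b in baselines])
print(attrs.sum(), expected)  # approximately equal
```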
IG can also attribute to intermediate neurons and layers, not just input features, which helps reveal how internal representations contribute to a prediction (Captum exposes this as LayerIntegratedGradients and related conductance methods).
```python
import torch
from captum.attr import IntegratedGradients
import numpy as np

def expected_integrated_gradients(
    model,
    input_tensor,
    baseline_distribution,  # Tensor of shape (n_baselines, *input_shape)
    target=None,
    n_steps=50
):
    """
    Compute Expected Integrated Gradients by averaging over multiple baselines.
    """
    ig = IntegratedGradients(model)
    n_baselines = baseline_distribution.shape[0]

    all_attributions = []
    for i in range(n_baselines):
        baseline = baseline_distribution[i:i+1]
        attr = ig.attribute(
            input_tensor,
            baselines=baseline,
            target=target,
            n_steps=n_steps
        )
        all_attributions.append(attr)

    # Average across baselines
    avg_attribution = torch.stack(all_attributions).mean(dim=0)
    return avg_attribution

# Usage with training data as baseline distribution
def compute_eig_with_training_data(model, input_tensor, training_data,
                                   n_baselines=50, target=None):
    """
    Use random training samples as baselines for EIG.
    """
    # Sample random baselines from training data
    indices = np.random.choice(len(training_data), n_baselines, replace=False)
    baselines = training_data[indices]

    return expected_integrated_gradients(
        model, input_tensor, baselines, target
    )
```

Integrated Gradients provides the principled, theoretically grounded approach to neural network attribution. Let's consolidate the key insights:
You've mastered the core feature attribution methods: Permutation Importance, SHAP values (with Shapley theory), LIME, and Integrated Gradients. Each method has its strengths: Permutation for model-agnostic simplicity, SHAP for theoretical guarantees and universality, LIME for human-readable local approximations, and IG for principled neural network attribution. Choose based on your model type, theoretical requirements, and computational constraints.