For differentiable models like neural networks, gradients offer a natural measure of feature importance: if changing a feature slightly would significantly change the prediction, that feature must be important. This intuition underlies gradient-based attribution methods—a family of techniques that explain predictions using the model's gradients.
But vanilla gradients have a critical flaw: saturation. ReLU networks produce zero gradients for many inputs because neurons saturate: an inactive ReLU outputs exactly 0 and passes no gradient through. A feature might be crucial for a prediction, but its gradient could be zero simply because the network has saturated. This makes raw gradients unreliable for attribution.
Integrated Gradients (IG), introduced by Sundararajan, Taly, and Yan (2017), solves this problem elegantly. Instead of looking at gradients at a single point, IG integrates gradients along a path from a baseline (absence of information) to the actual input. This integration accumulates attributions even when instantaneous gradients are zero.
Remarkably, Integrated Gradients satisfies a set of axiomatic properties that make it uniquely suitable for attribution—including the same "efficiency" property that SHAP inherits from Shapley values. This makes IG the theoretically grounded standard for neural network interpretation.
This page covers Integrated Gradients from first principles: why vanilla gradients fail, how integration solves the problem, the axiomatic properties that make IG principled, baseline selection strategies, and practical implementation for various neural network architectures.
The simplest gradient-based attribution assigns each feature importance proportional to its gradient:
$$\text{Attribution}_i = \frac{\partial f(x)}{\partial x_i}$$
This measures the instantaneous sensitivity of the output to each input feature. Large gradient magnitude means small changes to that feature would significantly affect the prediction.
Consider a simple ReLU network:
$$f(x) = \max(0, w \cdot x - b)$$
If $w \cdot x > b$ (above threshold), then $\nabla_x f = w$: the gradient reflects the weights.
If $w \cdot x < b$ (below threshold), then $\nabla_x f = 0$: zero gradient regardless of how important $x$ is.
This is catastrophic for attribution. An input could be essential for achieving $w \cdot x > b$ (without it, the prediction would be different), but its gradient is zero because we're in the saturated, flat regime.
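This failure mode is easy to reproduce numerically. The sketch below (plain NumPy, with the gradient of this one-unit network written out analytically) shows an input below the threshold: its feature values still determine whether the threshold can be crossed, yet the gradient is identically zero.

```python
import numpy as np

# One ReLU unit: f(x) = max(0, w.x - b), with its gradient written analytically
w = np.array([2.0, 1.0])
b = 3.0

def f(x):
    return max(0.0, w @ x - b)

def grad_f(x):
    # w above the threshold, exactly zero below it
    return w if w @ x - b > 0 else np.zeros_like(w)

x_above = np.array([2.0, 1.0])   # w.x = 5.0 > b: unit is active
x_below = np.array([1.0, 0.5])   # w.x = 2.5 < b: unit is saturated

print(grad_f(x_above))  # [2. 1.] -- gradient reflects the weights
print(grad_f(x_below))  # [0. 0.] -- zero, though x still controls the threshold
```

Doubling `x_below` would activate the unit and change the prediction, yet vanilla gradients assign it zero importance.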
Imagine a cat classifier that is highly confident on a cat image. The fur and ears clearly drive the prediction, yet because the score has saturated, nudging those pixels barely changes the output, and their gradients are near zero. The same effect appears in small networks:
```python
import numpy as np
import torch
import torch.nn as nn

# Simple ReLU network
class SimpleReLUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(5, 10)
        self.fc2 = nn.Linear(10, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Initialize model
model = SimpleReLUNet()

# Input that puts some neurons into saturation
x = torch.tensor([[0.5, 0.3, 0.8, 0.1, 0.2]], requires_grad=True)

# Forward pass
output = model(x)

# Compute gradient (vanilla attribution)
output.backward()
vanilla_grad = x.grad.clone()

print("Input:", x.data.numpy().flatten())
print("Vanilla Gradient:", vanilla_grad.numpy().flatten())

# Check hidden activations
with torch.no_grad():
    hidden = model.fc1(x)
    relu_output = torch.relu(hidden)

print("Hidden layer (before ReLU):", hidden.numpy().flatten())
print("Hidden layer (after ReLU):", relu_output.numpy().flatten())

# Count saturated neurons (output = 0)
saturated = (relu_output == 0).sum().item()
total = relu_output.numel()
print(f"Saturated neurons: {saturated}/{total} ({100*saturated/total:.1f}%)")
print("Vanilla gradients IGNORE contributions through saturated neurons!")
```

Some methods multiply gradient by input (Gradient × Input). This helps with sign but doesn't solve saturation: zero gradient times any input is still zero. The fundamental problem is the gradient itself, not scaling.
Integrated Gradients addresses saturation by accumulating gradients along a path from a baseline (representing "absence") to the actual input.
For a differentiable function $f: \mathbb{R}^n \rightarrow \mathbb{R}$, an input $x \in \mathbb{R}^n$, and a baseline $x' \in \mathbb{R}^n$, the Integrated Gradient for feature $i$ is:
$$\text{IG}_i(x) = (x_i - x'_i) \times \int_{\alpha=0}^{1} \frac{\partial f(x' + \alpha \cdot (x - x'))}{\partial x_i} \, d\alpha$$
The path: $x' + \alpha \cdot (x - x')$ for $\alpha \in [0, 1]$
The integral: $\int_0^1 \frac{\partial f}{\partial x_i} d\alpha$
The scaling: $(x_i - x'_i)$
Consider a ReLU that saturates at the input but not at intermediate points:
The path integral "sweeps" through all intermediate states, accumulating attribution whether or not any particular point is saturated.
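To see the sweep in action, the sketch below approximates the IG integral with a midpoint Riemann sum for a single ReLU unit (plain NumPy, analytic gradient; a toy setup, not the paper's implementation). Although the gradient is zero on the first part of the path, the accumulated attributions are nonzero and sum to $f(x) - f(x')$:

```python
import numpy as np

# Single ReLU unit f(x) = max(0, w.x - b) with an analytic gradient
w = np.array([2.0, 1.0])
b = 3.0
f = lambda x: max(0.0, w @ x - b)
grad = lambda x: w * float(w @ x - b > 0)

x = np.array([2.0, 1.0])   # f(x) = 2.0
baseline = np.zeros(2)     # f(baseline) = 0.0

# Midpoint Riemann approximation of the path integral
m = 1000
alphas = (np.arange(m) + 0.5) / m
avg_grad = np.mean([grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
ig = (x - baseline) * avg_grad

print(ig)                            # nonzero attribution for both features
print(ig.sum(), f(x) - f(baseline))  # completeness: the two values match
```

The unit only activates for $\alpha > 0.6$ along this path, so the averaged gradient is $0.4\,w$, yet the attributions still account for the full output difference.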
Integrated Gradients connects to the fundamental theorem of calculus. Along the path:
$$f(x) - f(x') = \int_0^1 \frac{d}{d\alpha} f(x' + \alpha(x-x')) d\alpha$$
$$= \int_0^1 \sum_i \frac{\partial f}{\partial x_i} (x_i - x'_i) d\alpha = \sum_i \text{IG}_i(x)$$
This gives us a remarkable property:
$$\sum_i \text{IG}_i(x) = f(x) - f(x')$$
The Integrated Gradients sum exactly to the difference between the output at $x$ and the output at baseline. No attribution is lost or created—complete decomposition.
This is analogous to SHAP's local accuracy: the attributions sum exactly to the prediction difference. Integrated Gradients and SHAP share this property, making them both 'complete' attribution methods. This isn't coincidence—both satisfy efficiency/completeness axioms.
The power of Integrated Gradients lies in its axiomatic foundation. Sundararajan et al. proved that IG is the unique method satisfying certain desirable properties.
If the input and baseline differ in exactly one feature, and the output differs, then the differing feature should receive non-zero attribution.
Formally: If $x$ and $x'$ differ only in feature $i$ and $f(x) \neq f(x')$, then $\text{IG}_i(x) \neq 0$.
Why it matters: A feature that demonstrably affects the output should be attributed some importance. Vanilla gradients fail this—a feature could flip the output yet have zero gradient.
If a feature doesn't appear in the model's computation (mathematically doesn't affect output), its attribution should be zero.
Formally: If $\frac{\partial f}{\partial x_i} = 0$ everywhere, then $\text{IG}_i = 0$ for all inputs.
Why it matters: Don't attribute importance to unused features.
Two networks that compute the same function (same output for all inputs) should produce the same attributions, regardless of internal architecture.
Formally: If $f(x) = g(x)$ for all $x$, then $\text{IG}^f(x) = \text{IG}^g(x)$.
Why it matters: Attribution should depend on what the model does, not how it's implemented. A ReLU network and a Swish network computing the same function should give the same explanations.
The attributions sum to the difference between output at input and output at baseline.
$$\sum_i \text{IG}_i(x) = f(x) - f(x')$$
Why it matters: Complete accounting of the prediction. No unexplained residual.
| Method | Sensitivity | Implementation Invariance | Completeness |
|---|---|---|---|
| Vanilla Gradients | ❌ No | ✅ Yes | ❌ No |
| Gradient × Input | ❌ No | ✅ Yes | ❌ No |
| LRP (Layer-wise Relevance) | ✅ Yes | ❌ No | ✅ Yes |
| DeepLIFT | ✅ Yes | ❌ No | ✅ Yes |
| Integrated Gradients | ✅ Yes | ✅ Yes | ✅ Yes |
Sundararajan et al. proved that path methods are the only attribution methods satisfying Sensitivity and Implementation Invariance, and that Integrated Gradients, with its straight-line path, is the unique path method that also treats symmetric features symmetrically. Just as Shapley values are unique given their axioms, IG is unique given these axioms. If you want a gradient-based method with these guarantees, you must use IG.
The baseline $x'$ represents "absence of information." Its choice significantly affects attributions. Unlike SHAP (which averages over a background distribution), IG uses a single baseline.
For images:
- Black image (all zeros): Standard choice. Represents absence of visual information.
- White image: Alternative, especially for dark images.
- Random noise: Distributes baseline information across pixels.
- Blurred version: Removes sharp features while maintaining global structure.
For text (embeddings):
- All-zeros embedding: Vector of zeros for each token position.
- Padding token: Embed the padding/unknown token.
- Mean embedding: Average embedding across the vocabulary.
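For text, these choices amount to different rows (or averages) of the embedding matrix. A toy NumPy sketch, with a hypothetical embedding table `E` and padding id 0:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 16
E = rng.normal(size=(vocab_size, dim))  # toy embedding matrix
PAD_ID = 0                              # hypothetical padding token id

token_ids = np.array([5, 42, 7])  # a three-token input
x = E[token_ids]                  # (seq_len, dim) input embeddings

baseline_zeros = np.zeros_like(x)
baseline_pad = np.tile(E[PAD_ID], (len(token_ids), 1))
baseline_mean = np.tile(E.mean(axis=0), (len(token_ids), 1))

print(x.shape, baseline_pad.shape, baseline_mean.shape)  # all (3, 16)
```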
For tabular data:
- Zeros: If features are standardized (mean = 0).
- Feature means: Average value of each feature in the training data.
- Trained reference point: Learn an optimal baseline.
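As a small illustration for the tabular case (toy NumPy data; `X_train` is a hypothetical feature matrix), the zeros and feature-means strategies are built like this, and they coincide once features are standardized:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))  # toy training features

baseline_zeros = np.zeros(X_train.shape[1])  # sensible only for standardized data
baseline_means = X_train.mean(axis=0)        # "average row" reference point

# After standardization, the feature-means baseline becomes the zero vector
X_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.abs(X_std.mean(axis=0)).max())  # ~0: the two strategies now agree
```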
A good baseline should produce a near-neutral prediction, lie reasonably close to the data distribution, and represent a meaningful notion of "absence" to whoever reads the explanation.
```python
import numpy as np
import torch
import torch.nn as nn

def integrated_gradients(
    model,
    input_tensor,
    baseline,
    steps=50,
    target_class=None
):
    """
    Compute Integrated Gradients for a PyTorch model.

    Parameters
    ----------
    model : PyTorch model
    input_tensor : input to explain (requires_grad=True)
    baseline : baseline tensor
    steps : number of interpolation steps
    target_class : class to explain (for classification)

    Returns
    -------
    attributions : tensor of same shape as input
    """
    # Generate interpolated inputs
    alphas = torch.linspace(0, 1, steps).view(-1, 1)

    # Compute path: baseline + alpha * (input - baseline)
    # Shape: (steps, *input_shape)
    delta = input_tensor - baseline
    interpolated = baseline + alphas.unsqueeze(-1) * delta.unsqueeze(0)

    # Compute gradients at each interpolated point
    interpolated.requires_grad_(True)

    # Forward pass
    outputs = model(interpolated.view(-1, *input_tensor.shape))
    if target_class is not None:
        outputs = outputs[:, target_class]

    # Backward pass
    grads = torch.autograd.grad(
        outputs=outputs.sum(),
        inputs=interpolated,
        create_graph=False
    )[0]

    # Average gradients (Riemann approximation of integral)
    avg_grads = grads.mean(dim=0)

    # Scale by (input - baseline)
    attributions = delta * avg_grads
    return attributions

# Compare different baselines
def baseline_comparison(model, input_tensor):
    """Compare IG attributions with different baselines."""
    baselines = {
        'zeros': torch.zeros_like(input_tensor),
        'ones': torch.ones_like(input_tensor),
        'mean': torch.full_like(input_tensor, 0.5),
        'random': torch.rand_like(input_tensor)
    }

    results = {}
    for name, baseline in baselines.items():
        attributions = integrated_gradients(model, input_tensor, baseline)
        results[name] = {
            'attributions': attributions,
            'baseline_prediction': model(baseline.unsqueeze(0)).item(),
            'input_prediction': model(input_tensor.unsqueeze(0)).item(),
            'attribution_sum': attributions.sum().item()
        }

        # Verify completeness
        expected_diff = (results[name]['input_prediction']
                         - results[name]['baseline_prediction'])
        actual_sum = results[name]['attribution_sum']
        print(f"{name} baseline:")
        print(f"  Baseline prediction: {results[name]['baseline_prediction']:.4f}")
        print(f"  Input prediction: {results[name]['input_prediction']:.4f}")
        print(f"  Attribution sum: {actual_sum:.4f}")
        print(f"  Expected diff: {expected_diff:.4f}")
        print(f"  Completeness error: {abs(actual_sum - expected_diff):.6f}")

    return results
```

Different baselines can produce significantly different attributions. For important applications, conduct sensitivity analysis with multiple baselines, or use Expected Integrated Gradients (averaging over multiple baselines from a distribution).
The Riemann sum approximation of the integral requires choosing the number of steps. More steps = more accurate but slower.
The integral is approximated by:
$$\text{IG}_i(x) \approx (x_i - x'_i) \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial f\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_i}$$
where $m$ is the number of steps (typically 20-300).
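The effect of $m$ can be checked directly. The sketch below (NumPy, with hand-written gradients for the toy function $f(x_1, x_2) = x_1^2 x_2$ and a zero baseline, where the exact attribution sum is $f(x)$) shows the right-Riemann approximation error shrinking as the step count grows:

```python
import numpy as np

# Toy smooth function with an analytic gradient; exact IG sum = f(x) - f(0)
f = lambda x: x[0] ** 2 * x[1]
grad = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2])

x = np.array([2.0, 3.0])
exact = f(x)  # 12.0 (baseline is the zero vector, f(0) = 0)

def ig_sum(m):
    alphas = np.linspace(1 / m, 1, m)  # right Riemann rule with m steps
    avg_grad = np.mean([grad(a * x) for a in alphas], axis=0)
    return (x * avg_grad).sum()

for m in (10, 100, 1000):
    print(m, abs(ig_sum(m) - exact))  # error shrinks roughly like 1/m
```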
```python
import torch
import torch.nn as nn
import numpy as np
from typing import Optional, Tuple, Union

class IntegratedGradients:
    """
    Integrated Gradients attribution for PyTorch models.

    Implements the method from Sundararajan et al., 2017:
    "Axiomatic Attribution for Deep Networks"
    """

    def __init__(self, model: nn.Module):
        """
        Parameters
        ----------
        model : PyTorch model (must be differentiable)
        """
        self.model = model
        self.model.eval()

    def attribute(
        self,
        inputs: torch.Tensor,
        baselines: Optional[torch.Tensor] = None,
        target: Optional[int] = None,
        n_steps: int = 50,
        method: str = 'riemann_trapezoid',
        return_convergence_delta: bool = False
    ) -> Union[torch.Tensor, Tuple[torch.Tensor, float]]:
        """
        Compute Integrated Gradients attributions.

        Parameters
        ----------
        inputs : input tensor to explain, shape (batch, *input_dims)
        baselines : baseline tensor, same shape as inputs (default: zeros)
        target : target class for classification (None for regression)
        n_steps : number of integration steps (higher = more accurate)
        method : integration method ('riemann_left', 'riemann_right',
                 'riemann_trapezoid')
        return_convergence_delta : if True, return approximation error

        Returns
        -------
        attributions : tensor of same shape as inputs
        delta : (optional) convergence error
        """
        if baselines is None:
            baselines = torch.zeros_like(inputs)

        assert inputs.shape == baselines.shape, \
            "Inputs and baselines must have same shape"

        # Generate alphas based on method
        if method == 'riemann_left':
            alphas = torch.linspace(0, 1 - 1/n_steps, n_steps)
        elif method == 'riemann_right':
            alphas = torch.linspace(1/n_steps, 1, n_steps)
        elif method == 'riemann_trapezoid':
            alphas = torch.linspace(0, 1, n_steps)
        else:
            raise ValueError(f"Unknown method: {method}")

        alphas = alphas.to(inputs.device)

        # Compute path inputs: baseline + alpha * (input - baseline)
        # Shape: (n_steps, batch, *input_dims)
        delta = inputs - baselines
        path_inputs = (baselines.unsqueeze(0)
                       + alphas.view(-1, *([1] * len(inputs.shape)))
                       * delta.unsqueeze(0))

        # Flatten steps and batch for a single forward pass
        batch_size = inputs.shape[0]
        path_inputs_flat = path_inputs.view(-1, *inputs.shape[1:])
        path_inputs_flat.requires_grad_(True)

        # Forward pass
        outputs = self.model(path_inputs_flat)

        # Select target class if specified
        if target is not None:
            if outputs.dim() > 1:
                outputs = outputs[:, target]
        else:
            if outputs.dim() > 1:
                outputs = outputs.max(dim=1).values

        # Compute gradients
        grads = torch.autograd.grad(
            outputs=outputs.sum(),
            inputs=path_inputs_flat,
            create_graph=False
        )[0]

        # Reshape back to (n_steps, batch, *input_dims)
        grads = grads.view(n_steps, batch_size, *inputs.shape[1:])

        # Integrate (average) gradients
        if method == 'riemann_trapezoid':
            # Trapezoidal rule: (f(0)/2 + f(1) + ... + f(n-1) + f(n)/2) / n
            grads = (grads[:-1] + grads[1:]) / 2
        avg_grads = grads.mean(dim=0)

        # Scale by (input - baseline)
        attributions = delta * avg_grads

        if return_convergence_delta:
            # Verify completeness: attributions should sum to the output difference
            with torch.no_grad():
                input_output = self.model(inputs)
                baseline_output = self.model(baselines)
                if target is not None:
                    input_output = input_output[:, target]
                    baseline_output = baseline_output[:, target]
                else:
                    if input_output.dim() > 1:
                        input_output = input_output.max(dim=1).values
                        baseline_output = baseline_output.max(dim=1).values
                expected_diff = input_output - baseline_output
                actual_sum = attributions.view(batch_size, -1).sum(dim=1)
                convergence_delta = (expected_diff - actual_sum).abs().mean().item()
            return attributions, convergence_delta

        return attributions

    def attribute_with_convergence_check(
        self,
        inputs: torch.Tensor,
        baselines: Optional[torch.Tensor] = None,
        target: Optional[int] = None,
        convergence_threshold: float = 0.01,
        initial_steps: int = 50,
        max_steps: int = 500,
        step_multiplier: float = 2
    ) -> Tuple[torch.Tensor, int]:
        """
        Compute attributions with automatic step adjustment for convergence.

        Returns
        -------
        attributions : converged attributions
        final_steps : number of steps used
        """
        n_steps = initial_steps
        while n_steps <= max_steps:
            attributions, delta = self.attribute(
                inputs, baselines, target, n_steps,
                return_convergence_delta=True
            )
            if delta < convergence_threshold:
                return attributions, n_steps
            n_steps = int(n_steps * step_multiplier)

        print(f"Warning: Did not converge within {max_steps} steps "
              f"(delta={delta:.4f})")
        return attributions, n_steps

# Example usage
if __name__ == "__main__":
    # Create a simple model
    model = nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 20),
        nn.ReLU(),
        nn.Linear(20, 2),
        nn.Softmax(dim=1)
    )

    # Random input
    x = torch.rand(1, 10)
    baseline = torch.zeros(1, 10)

    # Create explainer
    ig = IntegratedGradients(model)

    # Compute attributions
    attrs, delta = ig.attribute(
        x, baseline, target=1, n_steps=100,
        return_convergence_delta=True
    )

    print("Input shape:", x.shape)
    print("Attribution shape:", attrs.shape)
    print(f"Convergence delta: {delta:.6f}")
    print("Attributions:", attrs.numpy().flatten())
    print("Attribution sum:", attrs.sum().item())

    # Verify: should equal f(x) - f(baseline)
    with torch.no_grad():
        diff = model(x)[0, 1] - model(baseline)[0, 1]
    print(f"Output difference: {diff.item():.6f}")
```

For production use, Captum (by Facebook/Meta) provides highly optimized implementations of Integrated Gradients and many other attribution methods.
pip install captum
```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients, NoiseTunnel
from captum.attr import visualization as viz
import numpy as np

# Define model
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
model.eval()

# Create explainer
ig = IntegratedGradients(model)

# Random image input (batch, channels, height, width)
input_img = torch.rand(1, 3, 32, 32, requires_grad=True)
baseline = torch.zeros(1, 3, 32, 32)

# Compute attributions for class 5
attr, delta = ig.attribute(
    input_img,
    baselines=baseline,
    target=5,
    n_steps=50,
    return_convergence_delta=True
)

print(f"Attribution shape: {attr.shape}")
print(f"Convergence delta: {delta.item():.6f}")

# For noisy inputs, use NoiseTunnel for smoothed attributions
nt = NoiseTunnel(ig)
smoothed_attr = nt.attribute(
    input_img,
    baselines=baseline,
    nt_type='smoothgrad',
    nt_samples=10,
    target=5
)

print(f"Smoothed attribution shape: {smoothed_attr.shape}")
```
```python
import torch
from captum.attr import IntegratedGradients
from captum.attr import visualization as viz
import matplotlib.pyplot as plt
import numpy as np

def visualize_image_attributions(
    model,
    input_image,     # Shape: (1, C, H, W)
    target_class,
    baselines=None,
    n_steps=50
):
    """
    Visualize IG attributions for image classification.
    """
    if baselines is None:
        baselines = torch.zeros_like(input_image)

    # Compute attributions
    ig = IntegratedGradients(model)
    attributions = ig.attribute(
        input_image,
        baselines=baselines,
        target=target_class,
        n_steps=n_steps
    )

    # Convert to numpy for visualization
    # Sum across channels for a single attribution map
    attr_sum = attributions.squeeze().sum(dim=0).detach().numpy()

    # Original image (assuming normalized)
    img_np = input_image.squeeze().permute(1, 2, 0).detach().numpy()

    # Normalize attribution for visualization
    attr_normalized = attr_sum / np.max(np.abs(attr_sum))

    # Create visualization
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    # Original image
    axes[0].imshow(img_np)
    axes[0].set_title('Original Image')
    axes[0].axis('off')

    # Attribution heatmap
    im = axes[1].imshow(attr_normalized, cmap='RdBu_r', vmin=-1, vmax=1)
    axes[1].set_title('IG Attribution')
    axes[1].axis('off')
    plt.colorbar(im, ax=axes[1], fraction=0.046)

    # Positive attributions only
    attr_positive = np.maximum(attr_normalized, 0)
    axes[2].imshow(attr_positive, cmap='Reds')
    axes[2].set_title('Positive Attribution')
    axes[2].axis('off')

    # Overlay
    axes[3].imshow(img_np)
    axes[3].imshow(attr_normalized, cmap='RdBu_r', alpha=0.5, vmin=-1, vmax=1)
    axes[3].set_title('Overlay')
    axes[3].axis('off')

    plt.tight_layout()
    plt.savefig('ig_visualization.png', dpi=150, bbox_inches='tight')
    plt.show()

    return attributions

# For Captum's built-in visualization (more sophisticated)
def captum_builtin_viz(original_image, attributions, prediction_class):
    """
    Use Captum's built-in visualization utilities.
    """
    # Transpose to (H, W, C) for visualization
    attr_np = attributions.squeeze().permute(1, 2, 0).detach().numpy()
    img_np = original_image.squeeze().permute(1, 2, 0).detach().numpy()

    # Blended heat map
    fig, ax = viz.visualize_image_attr(
        attr_np,
        img_np,
        method='blended_heat_map',
        sign='all',
        show_colorbar=True,
        title=f'IG Attribution for class {prediction_class}'
    )
    return fig
```

Integrated Gradients belongs to a family of gradient-based attribution methods. Understanding their relationships clarifies when to use each.
Vanilla Gradients: $\nabla_x f(x)$
Gradient × Input: $x \odot \nabla_x f(x)$
SmoothGrad: Average gradients over noisy versions of input
Guided Backpropagation: Only backprop positive gradients through ReLUs
DeepLIFT: Reference-based, uses contribution rules per layer
Layer-wise Relevance Propagation (LRP): Propagate relevance backward through layers
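SmoothGrad is simple enough to sketch in a few lines. The toy below (NumPy, reusing the single ReLU unit from earlier with its analytic gradient; a sketch of the idea, not a library implementation) shows how averaging gradients over noisy copies of a saturated input yields a nonzero smoothed gradient even where the vanilla gradient is exactly zero:

```python
import numpy as np

# Single ReLU unit f(x) = max(0, w.x - b), analytic gradient
w = np.array([2.0, 1.0])
b = 3.0
grad = lambda x: w * float(w @ x - b > 0)

x = np.array([1.0, 0.5])  # saturated: w.x = 2.5 < b, vanilla gradient is zero

# SmoothGrad: average gradients over noise-perturbed copies of the input
rng = np.random.default_rng(0)
n_samples, sigma = 1000, 0.5
smooth = np.mean(
    [grad(x + rng.normal(0.0, sigma, size=x.shape)) for _ in range(n_samples)],
    axis=0,
)

print(grad(x))  # [0. 0.]
print(smooth)   # nonzero: noise pushes some samples past the threshold
```

Note that SmoothGrad still lacks completeness: the smoothed values are denoised gradients, not a decomposition of the prediction.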
| Method | Completeness | Sensitivity | Impl. Invariance | Speed |
|---|---|---|---|---|
| Vanilla Gradient | ❌ | ❌ | ✅ | Very fast |
| Gradient × Input | ❌ | ❌ | ✅ | Very fast |
| SmoothGrad | ❌ | ❌ | ✅ | Moderate |
| Guided Backprop | ❌ | ? | ❌ | Fast |
| DeepLIFT | ✅ | ✅ | ❌ | Fast |
| LRP | ✅ | ✅ | ❌ | Fast |
| Integrated Gradients | ✅ | ✅ | ✅ | Slow |
Use IG when axiomatic guarantees matter: research papers, regulatory explanations, and when you need provable properties. For quick exploratory visualization where speed matters more than theory, vanilla gradients or SmoothGrad may suffice.
| Aspect | Integrated Gradients | SHAP (DeepSHAP/GradientSHAP) |
|---|---|---|
| Theoretical basis | Path integral of gradients | Shapley values (game theory) |
| Baseline | Single explicit baseline | Distribution of baselines |
| Completeness | IG sums to f(x) - f(baseline) | SHAP sums to f(x) - E[f(X)] |
| Correlation handling | Assumes feature independence | Better with observational approach |
| Speed | O(steps × forward+backward) | Similar (multiple passes) |
| Uniqueness theorem | Yes (for path methods) | Yes (for Shapley) |
Practical advice: Both are principled. Use IG for straightforward implementation with explicit baseline. Use SHAP when you want importance relative to a data distribution (average prediction) rather than a single reference point.
Instead of a single baseline, average IG over multiple baselines sampled from a distribution:
$$\text{EIG}_i(x) = \mathbb{E}_{x' \sim D}\left[\text{IG}_i(x; x')\right]$$
This makes attributions more robust to baseline choice and connects IG closer to SHAP (which implicitly averages over a background distribution).
Captum provides GradientSHAP, which combines ideas from SHAP and IG: it samples baselines from a background distribution, adds small noise to inputs, and evaluates gradients at random points along each baseline-to-input path. This provides a fast approximation to SHAP values using gradients.
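The core of this idea can be sketched without Captum: sample a baseline and a random position on its path to the input, take the gradient there, scale by (input − baseline), and average. The toy below (NumPy, analytic gradient of a hypothetical smooth function; a Monte Carlo sketch of the sampling scheme, omitting the input noise GradientSHAP also adds) shows the attributions approximately summing to $f(x) - \mathbb{E}[f(x')]$:

```python
import numpy as np

# Hypothetical smooth model f(x1, x2) = x1^2 * x2 with an analytic gradient
f = lambda x: x[0] ** 2 * x[1]
grad = lambda x: np.array([2 * x[0] * x[1], x[0] ** 2])

rng = np.random.default_rng(0)
x = np.array([2.0, 3.0])
baselines = rng.normal(0.0, 0.1, size=(200, 2))  # background distribution

# Monte Carlo: random baseline + random position along its path to x
n_samples = 5000
attrs = np.zeros_like(x)
for _ in range(n_samples):
    x0 = baselines[rng.integers(len(baselines))]
    a = rng.uniform()
    attrs += (x - x0) * grad(x0 + a * (x - x0))
attrs /= n_samples

expected = f(x) - np.mean([f(b) for b in baselines])
print(attrs.sum(), expected)  # approximately equal
```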
IG can also attribute to intermediate neurons and layers, not just input features, which helps reveal how internal representations contribute to a prediction (Captum exposes this as LayerIntegratedGradients and related conductance methods).
```python
import torch
from captum.attr import IntegratedGradients
import numpy as np

def expected_integrated_gradients(
    model,
    input_tensor,
    baseline_distribution,  # Tensor of shape (n_baselines, *input_shape)
    target=None,
    n_steps=50
):
    """
    Compute Expected Integrated Gradients by averaging over multiple baselines.
    """
    ig = IntegratedGradients(model)
    n_baselines = baseline_distribution.shape[0]

    all_attributions = []
    for i in range(n_baselines):
        baseline = baseline_distribution[i:i+1]
        attr = ig.attribute(
            input_tensor,
            baselines=baseline,
            target=target,
            n_steps=n_steps
        )
        all_attributions.append(attr)

    # Average across baselines
    avg_attribution = torch.stack(all_attributions).mean(dim=0)
    return avg_attribution

# Usage with training data as baseline distribution
def compute_eig_with_training_data(model, input_tensor, training_data,
                                   n_baselines=50, target=None):
    """
    Use random training samples as baselines for EIG.
    """
    # Sample random baselines from training data
    indices = np.random.choice(len(training_data), n_baselines, replace=False)
    baselines = training_data[indices]

    return expected_integrated_gradients(
        model, input_tensor, baselines, target
    )
```

Integrated Gradients provides the principled, theoretically grounded approach to neural network attribution. Let's consolidate the key insights:
You've mastered the core feature attribution methods: Permutation Importance, SHAP values (with Shapley theory), LIME, and Integrated Gradients. Each method has its strengths: Permutation for model-agnostic simplicity, SHAP for theoretical guarantees and universality, LIME for human-readable local approximations, and IG for principled neural network attribution. Choose based on your model type, theoretical requirements, and computational constraints.