How do you understand how a machine learning model makes decisions? There are fundamentally two approaches:
Use a model that is inherently understandable — Choose a model architecture that is transparent by design, where the decision-making process is directly inspectable.
Explain a complex model after training — Build whatever model achieves the best performance, then apply separate techniques to explain its behavior.
This distinction—intrinsic versus post-hoc interpretability—is the most fundamental taxonomy in the field. Every interpretability method falls into one of these categories, and the choice between them has profound implications for model development, deployment, and trust.
By the end of this page, you will understand: the precise distinction between intrinsic and post-hoc interpretability, the classes of models that are inherently interpretable, the major categories of post-hoc methods, the tradeoffs involved in each approach, and when to choose one over the other.
Intrinsic interpretability refers to models whose structure is inherently understandable to humans. The model's internal mechanics directly reveal how inputs are transformed into outputs. No additional explanation technique is needed—the model is the explanation.
This is interpretability by construction: we constrain the model architecture to forms that humans can comprehend, and we accept whatever accuracy limitations that constraint imposes.
Interpretability roughly correlates with: (1) Linearity — additive contributions are easy to understand; (2) Monotonicity — 'more of X means more of Y' is intuitive; (3) Sparsity — few features are easier to track than many; (4) Decomposability — contributions can be separated and inspected individually; (5) Simulatability — humans can mentally trace the decision process.
The case for intrinsic interpretability:
When you use an intrinsically interpretable model, explanations are guaranteed to be faithful to the model's actual reasoning. There's no gap between what the model does and what the explanation claims it does—because they are the same thing.
This matters enormously in high-stakes domains. When a linear model says that 'age' has coefficient 0.03 in a mortality prediction, that's exactly how age influences the prediction. There's no approximation, no simplification, no potential for the explanation to mislead.
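The additivity that makes this possible is easy to see concretely: a linear model's prediction is nothing more than a sum of per-feature contributions. A minimal sketch (the coefficients and patient values below are illustrative, not from a real mortality model):

```python
# Illustrative coefficients for a hypothetical linear mortality-risk score,
# echoing the age coefficient of 0.03 mentioned above
coefficients = {"age": 0.03, "smoker": 0.80, "bmi": 0.02}
intercept = -3.0

patient = {"age": 65, "smoker": 1, "bmi": 27.0}

# The prediction decomposes exactly into per-feature contributions
contributions = {f: coefficients[f] * patient[f] for f in coefficients}
score = intercept + sum(contributions.values())

for f, c in contributions.items():
    print(f"{f}: {c:+.2f}")
print(f"linear score: {score:.2f}")

# The intercept plus the contributions IS the model output --
# the explanation and the model are the same object.
```

There is no separate explanation step: reading off `contributions` is reading the model itself.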
The limitations:
The core limitation is performance. Intrinsically interpretable models impose structural constraints that can limit their ability to capture complex patterns:
```python
# Assumes X_train, y_train, X_test, and feature_names are already defined
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text
from interpret.glassbox import ExplainableBoostingClassifier

# Example: Credit approval model with intrinsic interpretability

# Option 1: Logistic Regression - Linear, directly interpretable
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

print("Logistic Regression Coefficients:")
for feature, coef in zip(feature_names, log_reg.coef_[0]):
    print(f"  {feature}: {coef:+.3f}")
# Output: Each coefficient directly shows feature influence
#   income: +0.482 (higher income → approval)
#   debt_to_income: -0.891 (higher ratio → denial)
#   credit_score: +0.634 (higher score → approval)

# Option 2: Decision Tree - Rule-based, directly readable
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

print("Decision Tree Rules:")
print(export_text(tree, feature_names=list(feature_names)))
# Output: Human-readable tree
# |--- credit_score <= 620
# |   |--- debt_to_income >  0.4
# |   |   |--- class: denied
# |   |--- debt_to_income <= 0.4
# |   |   |--- income <= 35000
# |   |   |   |--- class: denied
# |   |   |--- income >  35000
# |   |   |   |--- class: approved

# Option 3: Explainable Boosting Machine (EBM) - GAM with higher accuracy
ebm = ExplainableBoostingClassifier(random_state=42)
ebm.fit(X_train, y_train)

# EBM provides per-feature shape functions
from interpret import show
ebm_global = ebm.explain_global()
show(ebm_global)  # Interactive visualization of each feature's contribution

# For a single prediction, we can see exact contributions
ebm_local = ebm.explain_local(X_test[[0]])
show(ebm_local)  # Shows: income contributed +0.3, debt contributed -0.45, etc.

# All three models are intrinsically interpretable:
# - The explanation IS the model
# - No approximation or post-hoc analysis required
# - Guaranteed faithfulness to actual decision process
```

Post-hoc interpretability refers to techniques applied after a model has been trained to explain its behavior. The model itself remains a black box—we develop separate methods to probe, test, and characterize its decision-making.
This approach decouples model selection from interpretability. We can use whatever model achieves the best performance—deep neural networks, gradient boosting, random forests, transformers—and then apply explanation techniques independently.
Post-hoc explanations face a fundamental challenge: they are approximations of the true model behavior, not the behavior itself. An explanation might be plausible and convincing yet fail to capture what the model actually does. This gap between explanation and reality—the faithfulness problem—is the central concern of post-hoc interpretability research.
Why post-hoc methods are necessary:
Performance requirements — In many domains, the accuracy difference between interpretable and black-box models is substantial. If a neural network achieves 95% accuracy and a decision tree achieves 78%, the performance gap may be unacceptable.
Pre-existing models — Organizations often have deployed models they cannot replace. Post-hoc methods provide understanding of models that already exist.
Complex data modalities — For images, text, and audio, deep learning is often the only approach that works well. Intrinsic interpretability isn't a viable option.
Research understanding — Studying how neural networks work requires post-hoc analysis tools. Understanding emergent behaviors requires probing.
The spectrum of faithfulness:
Not all post-hoc methods are equally faithful:
| Method | Faithfulness Level | Why |
|---|---|---|
| Permutation Importance (global) | High | Directly measures prediction changes |
| SHAP (exact) | High | Grounded in game theory with uniqueness guarantees |
| SHAP (approximate) | Medium-High | Approximation introduces some error |
| LIME | Medium | Local linear approximation; may not capture non-linearities |
| Saliency Maps | Variable | Can be noisy, manipulable, and sensitive to implementation |
| Attention Weights | Variable | Attention may not correspond to importance |
| Surrogate Models | Medium-Low | Surrogate may diverge from original in important regions |
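The surrogate divergence in the last row can be measured directly: train a simple surrogate on the black box's own predictions and check how often the two agree. A minimal sketch with synthetic data (the dataset and model choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Black-box model
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

# Global surrogate: a shallow tree trained to mimic the black box's labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often does the surrogate agree with the black box?
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"Surrogate fidelity: {fidelity:.2%}")
# Any agreement below 100% marks regions of input space where the
# surrogate's "explanation" diverges from the model it claims to explain.
```

Reporting this fidelity score alongside the surrogate is a simple guard against over-trusting the explanation.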
```python
# Assumes X_train, y_train, X_test, y_test, and feature_names are already defined
import numpy as np
import shap
import lime
import lime.lime_tabular
from sklearn.ensemble import GradientBoostingClassifier

# Train a complex, high-performing but opaque model
model = GradientBoostingClassifier(n_estimators=200, max_depth=6, random_state=42)
model.fit(X_train, y_train)
print(f"Model accuracy: {model.score(X_test, y_test):.3f}")

# The model is a black box - let's apply post-hoc interpretability

# Method 1: SHAP (SHapley Additive exPlanations)
# Theoretically grounded, consistent, accurate
explainer = shap.TreeExplainer(model)
# For a binary GradientBoostingClassifier, TreeExplainer returns a single
# array of SHAP values on the raw-margin scale (not one array per class)
shap_values = explainer.shap_values(X_test)

# Global interpretation: Which features matter overall?
print("SHAP Global Feature Importance:")
importance = np.abs(shap_values).mean(axis=0)
for i, (name, imp) in enumerate(sorted(
        zip(feature_names, importance), key=lambda x: -x[1])):
    print(f"  {i+1}. {name}: {imp:.4f}")

# Local interpretation: Why did this specific example get this prediction?
idx = 0  # First test example
print(f"SHAP Local Explanation for example {idx}:")
print(f"  Prediction: {'Approved' if model.predict(X_test[[idx]])[0] else 'Denied'}")
print(f"  Base value: {explainer.expected_value:.4f}")
for name, val, contrib in sorted(
        zip(feature_names, X_test[idx], shap_values[idx]),
        key=lambda x: -abs(x[2]))[:5]:
    direction = "→ Approval" if contrib > 0 else "→ Denial"
    print(f"  {name}={val:.2f}: {contrib:+.4f} {direction}")

# Method 2: LIME (Local Interpretable Model-agnostic Explanations)
# Approximates model locally with a linear model
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=['Denied', 'Approved'], mode='classification')

explanation = lime_explainer.explain_instance(
    X_test[idx], model.predict_proba, num_features=5)

print(f"LIME Local Explanation for example {idx}:")
for feature, weight in explanation.as_list():
    print(f"  {feature}: {weight:+.4f}")

# Method 3: Permutation Importance (Global)
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    model, X_test, y_test, n_repeats=30, random_state=42)

print("Permutation Importance (decrease in accuracy when feature shuffled):")
for name, imp in sorted(
        zip(feature_names, perm_importance.importances_mean),
        key=lambda x: -x[1]):
    print(f"  {name}: {imp:.4f}")

# Note: All three methods approximate the model's behavior
# They may give slightly different answers - this is the faithfulness challenge
```

The choice between intrinsic and post-hoc interpretability represents a fundamental tradeoff in ML system design. Neither approach dominates the other—each has distinct advantages.
| Factor | Favors Intrinsic | Favors Post-hoc |
|---|---|---|
| Performance requirements | Accuracy gap is small (<5%) | Accuracy gap is large (>10%) |
| Regulatory environment | Strict auditability requirements | Flexible compliance options |
| Decision stakes | High (life, liberty, large financial) | Medium-low (recommendations, non-critical) |
| Data type | Tabular with meaningful features | Images, text, audio, complex signals |
| Feature engineering | Features are well-understood | Features are learned (embeddings) |
| Explanation audience | Non-technical (regulators, users) | Technical (ML engineers, researchers) |
| Model lifecycle | New model, clean-slate design | Existing model, retrofitting |
| Explanation fidelity | Must be 100% faithful | Approximate explanations acceptable |
In practice, many systems use hybrid approaches: an intrinsically interpretable model for high-stakes decisions with post-hoc analysis for edge cases, or a black-box model for scoring with an interpretable model providing reason codes. The choice isn't always binary.
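The second hybrid pattern mentioned above can be sketched in a few lines: a gradient-boosted model produces the score, while a separate logistic model trained on the same data supplies reason codes. All names and model choices here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

feature_names = [f"f{i}" for i in range(8)]  # hypothetical feature names
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)

scorer = GradientBoostingClassifier(random_state=1).fit(X, y)   # scoring model
reason_model = LogisticRegression(max_iter=1000).fit(X, y)      # reason-code model

def decide(x):
    """Black-box score plus interpretable reason codes for one applicant."""
    score = scorer.predict_proba(x.reshape(1, -1))[0, 1]
    # Reason codes: the linear model's largest per-feature contributions
    contribs = reason_model.coef_[0] * x
    top = np.argsort(-np.abs(contribs))[:3]
    reasons = [(feature_names[i], float(contribs[i])) for i in top]
    return score, reasons

score, reasons = decide(X[0])
print(f"score={score:.3f}, reasons={reasons}")
```

Note the caveat: the reason codes come from a different model than the score, so they are only as trustworthy as the two models' agreement, which is exactly the faithfulness concern discussed earlier.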
Let's examine the major classes of intrinsically interpretable models in detail, understanding their interpretability mechanisms and limitations.
Linear models. Form: ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Interpretability mechanism: each coefficient βᵢ gives the change in the prediction per unit change in xᵢ with other features held fixed, so every prediction decomposes exactly into per-feature contributions βᵢxᵢ.
Strengths: fully decomposable and simulatable; explanations are exactly faithful by construction; coefficients support classical statistical inference.
Limitations: cannot represent non-linear effects or feature interactions unless they are engineered in by hand, which limits accuracy on genuinely complex relationships.
When to use: Tabular data with approximately linear relationships, when explanation of individual feature effects is critical, when statistical inference (p-values, confidence intervals) is needed.
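The statistical-inference point can be made concrete with the closed-form OLS machinery: the coefficient covariance is s²(XᵀX)⁻¹, which yields standard errors and confidence intervals directly. A sketch on synthetic data with a known ground truth (all values here are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 10, n)
bmi = rng.normal(26, 4, n)
# Synthetic outcome with known linear structure plus noise
y = 0.03 * age + 0.02 * bmi + rng.normal(0, 0.5, n)

# Design matrix with intercept column
X = np.column_stack([np.ones(n), age, bmi])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS standard errors: cov(beta) = s^2 * (X'X)^-1
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

for name, b, e in zip(["intercept", "age", "bmi"], beta, se):
    lo, hi = b - 1.96 * e, b + 1.96 * e  # normal-approximation 95% CI
    print(f"{name}: {b:+.4f}  [{lo:+.4f}, {hi:+.4f}]")
```

The fitted coefficients recover the ground-truth 0.03 and 0.02 within their confidence intervals, and a CI that excludes zero is evidence the feature genuinely matters.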
Post-hoc methods are diverse, each with different theoretical foundations, computational requirements, and types of explanations produced. Understanding this landscape is essential for choosing appropriate methods.
| Category | Key Methods | Output Type | Best For |
|---|---|---|---|
| Feature Attribution | SHAP, LIME, Integrated Gradients, DeepLIFT | Importance scores per feature | Understanding individual predictions |
| Surrogate Models | Global: Tree distillation; Local: LIME | Interpretable model mimicking black box | Overall behavior approximation |
| Example-based | Prototypes, Counterfactuals, Influential Instances | Training examples explaining prediction | Intuitive, case-based explanations |
| Visualization | Saliency maps, GradCAM, Activation maximization | Visual highlights on input | Image and visual data |
| Concept-based | TCAV, Concept Bottleneck Models | Scores for human-defined concepts | Higher-level semantic understanding |
| Probing | Classifier probes, Behavioral tests | Model capability assessments | Understanding learned representations |
Feature Attribution in Detail:
Feature attribution methods assign a numerical importance score to each input feature for a specific prediction. These scores indicate how much each feature contributed to pushing the prediction toward a particular outcome.
Key methods:
SHAP (SHapley Additive exPlanations): distributes a prediction across features using Shapley values from cooperative game theory; uniquely satisfies consistency and additivity axioms, with exact algorithms available for tree models.
LIME (Local Interpretable Model-agnostic Explanations): samples perturbed inputs around one example, queries the black box, and fits a weighted linear model that approximates its behavior in that neighborhood.
Integrated Gradients: accumulates model gradients along a straight-line path from a baseline input to the actual input; attributions satisfy completeness, summing to the difference between the two outputs.
Permutation Importance: shuffles one feature's values across a dataset and measures the resulting drop in predictive performance; a global, model-agnostic measure of the model's reliance on that feature.
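Integrated Gradients is simple enough to implement from scratch for a differentiable model. A sketch using a toy logistic scorer with fixed, hypothetical weights, approximating the path integral with a Riemann sum:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy differentiable model: logistic scorer with fixed (hypothetical) weights
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def f(x):
    return sigmoid(x @ w + b)

def grad_f(x):
    p = f(x)
    return p * (1 - p) * w  # analytic gradient of sigmoid(w·x + b)

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann-sum approximation of the path integral of gradients
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([2.0, 1.0, -1.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)

print("attributions:", attr)
# Completeness axiom: attributions sum to f(x) - f(baseline)
print(attr.sum(), f(x) - f(baseline))
```

The completeness check at the end is the method's built-in sanity test: if the attributions do not sum to the output difference, the approximation needs more steps.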
Different post-hoc methods answer different questions. SHAP tells you 'how did each feature contribute?' LIME tells you 'what linear rule approximates behavior locally?' Counterfactuals tell you 'what would need to change?' Choose methods based on the question you're trying to answer.
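The counterfactual question can be answered by search: find the smallest change to the input that flips the prediction. A brute-force sketch over one feature at a time, using a hypothetical approval rule in place of a trained model (real counterfactual methods search more cleverly and enforce plausibility constraints):

```python
import numpy as np

def predict_denied(x):
    # Hypothetical approval rule standing in for a trained model:
    # denied if debt-to-income is high AND credit score is low
    debt_to_income, credit_score = x
    return debt_to_income > 0.4 and credit_score < 680

applicant = np.array([0.55, 640.0])  # currently denied
assert predict_denied(applicant)

def counterfactual(x, feature, deltas):
    """Return the first (smallest) delta to one feature that flips the outcome."""
    for d in deltas:
        x2 = x.copy()
        x2[feature] += d
        if not predict_denied(x2):
            return feature, float(d)
    return None

candidates = [
    counterfactual(applicant, 0, -np.arange(0.01, 0.5, 0.01)),  # lower debt ratio
    counterfactual(applicant, 1, np.arange(5, 300, 5.0)),       # raise credit score
]
print([c for c in candidates if c is not None])
# Each hit reads as actionable advice: "reduce debt_to_income by ~0.15"
# or "raise credit_score by 40"
```

Each counterfactual translates directly into a recourse statement for the affected person, which is why this explanation style is popular in lending contexts.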
A critical question for post-hoc methods: How do we know the explanation is correct? Unlike intrinsic interpretability, where the model is the explanation, post-hoc explanations are approximations that can be wrong.
Evaluation approaches:
Removal-based evaluation: delete or mask the features an explanation ranks as most important and check that the model's prediction degrades accordingly; removing supposedly unimportant features should change little.
Synthetic ground truth: construct datasets where the true feature influences are known by design, then check whether the explanation method recovers them.
Human evaluation: measure whether explanations actually help people predict model behavior or catch model errors, rather than merely looking plausible.
Sensitivity analysis: verify that explanations remain stable under small, meaning-preserving perturbations of the input and across retraining with different seeds.
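Removal-based evaluation can be sketched as a deletion test: neutralize the top-k features ranked by an attribution (here, replaced with their mean) and watch accuracy fall. The ranking below comes from permutation importance purely for illustration; any attribution method can be evaluated the same way:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1500, n_features=10, n_informative=4,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by a candidate explanation (permutation importance here)
rank = np.argsort(-permutation_importance(
    model, X, y, n_repeats=10, random_state=0).importances_mean)

# Deletion test: replace top-k features with their mean and measure accuracy
means = X.mean(axis=0)
for k in [0, 2, 4, 6]:
    Xk = X.copy()
    Xk[:, rank[:k]] = means[rank[:k]]
    print(f"top-{k} features removed: accuracy {model.score(Xk, y):.3f}")
# A faithful ranking produces a steep deletion curve;
# a ranking no better than random produces a shallow one.
```

Comparing deletion curves across methods gives a quantitative, model-grounded way to choose between competing explanations.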
Research has shown that many post-hoc explanations can be manipulated to produce misleading results without changing model predictions. Adversarial attacks can force explanations to hide discriminatory features or highlight decoy features. This is especially concerning in regulated domains where explanations might be used to create an appearance of fairness while hiding actual bias.
Given the tradeoffs between intrinsic and post-hoc interpretability, how should practitioners make decisions? The factors summarized in the comparison table earlier (performance gap, regulatory environment, decision stakes, data type, explanation audience, and fidelity requirements) provide a structured framework for that choice.
We've established the fundamental taxonomy of interpretability methods: intrinsic approaches build understanding into the model itself and are faithful by construction, while post-hoc approaches explain an already-trained black box and are only ever approximations of its true behavior.
What's next:
The intrinsic/post-hoc distinction concerns when interpretability is built into the system. Next, we'll explore another crucial dimension: scope. Do we explain individual predictions (local interpretability) or overall model behavior (global interpretability)? This local vs. global distinction shapes what questions interpretability can answer.
You now understand the fundamental distinction between intrinsic and post-hoc interpretability. This taxonomy structures the entire field—every interpretability technique falls into one of these categories. Next, we'll explore the equally important distinction between local and global explanations.