How do you understand how a machine learning model makes decisions? There are fundamentally two approaches:
Use a model that is inherently understandable — Choose a model architecture that is transparent by design, where the decision-making process is directly inspectable.
Explain a complex model after training — Build whatever model achieves the best performance, then apply separate techniques to explain its behavior.
This distinction—intrinsic versus post-hoc interpretability—is the most fundamental taxonomy in the field. Every interpretability method falls into one of these categories, and the choice between them has profound implications for model development, deployment, and trust.
By the end of this page, you will understand: the precise distinction between intrinsic and post-hoc interpretability, the classes of models that are inherently interpretable, the major categories of post-hoc methods, the tradeoffs involved in each approach, and when to choose one over the other.
Intrinsic interpretability refers to models whose structure is inherently understandable to humans. The model's internal mechanics directly reveal how inputs are transformed into outputs. No additional explanation technique is needed—the model is the explanation.
This is interpretability by construction: we constrain the model architecture to forms that humans can comprehend, and we accept whatever accuracy limitations that constraint imposes.
Interpretability roughly correlates with: (1) Linearity — additive contributions are easy to understand; (2) Monotonicity — 'more of X means more of Y' is intuitive; (3) Sparsity — few features are easier to track than many; (4) Decomposability — contributions can be separated and inspected individually; (5) Simulatability — humans can mentally trace the decision process.
The case for intrinsic interpretability:
When you use an intrinsically interpretable model, explanations are guaranteed to be faithful to the model's actual reasoning. There's no gap between what the model does and what the explanation claims it does—because they are the same thing.
This matters enormously in high-stakes domains. When a linear model says that 'age' has coefficient 0.03 in a mortality prediction, that's exactly how age influences the prediction. There's no approximation, no simplification, no potential for the explanation to mislead.
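The additivity that makes this possible is easy to see concretely: a linear model's prediction is nothing more than a sum of per-feature contributions. A minimal sketch (the coefficients and patient values below are illustrative, not from a real mortality model):

```python
# Illustrative coefficients for a hypothetical linear mortality-risk score,
# echoing the age coefficient of 0.03 mentioned above
coefficients = {"age": 0.03, "smoker": 0.80, "bmi": 0.02}
intercept = -3.0

patient = {"age": 65, "smoker": 1, "bmi": 27.0}

# The prediction decomposes exactly into per-feature contributions
contributions = {f: coefficients[f] * patient[f] for f in coefficients}
score = intercept + sum(contributions.values())

for f, c in contributions.items():
    print(f"{f}: {c:+.2f}")
print(f"linear score: {score:.2f}")

# The intercept plus the contributions IS the model output --
# the explanation and the model are the same object.
```

There is no separate explanation step: reading off `contributions` is reading the model itself.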
The limitations:
The core limitation is performance. Intrinsically interpretable models impose structural constraints that can limit their ability to capture complex patterns:
```python
# Assumes X_train, y_train, X_test, and feature_names are already defined
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text
from interpret.glassbox import ExplainableBoostingClassifier

# Example: Credit approval model with intrinsic interpretability

# Option 1: Logistic Regression - Linear, directly interpretable
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

print("Logistic Regression Coefficients:")
for feature, coef in zip(feature_names, log_reg.coef_[0]):
    print(f"  {feature}: {coef:+.3f}")
# Output: Each coefficient directly shows feature influence
#   income: +0.482 (higher income → approval)
#   debt_to_income: -0.891 (higher ratio → denial)
#   credit_score: +0.634 (higher score → approval)

# Option 2: Decision Tree - Rule-based, directly readable
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)

print("Decision Tree Rules:")
print(export_text(tree, feature_names=list(feature_names)))
# Output: Human-readable tree
# |--- credit_score <= 620
# |   |--- debt_to_income >  0.4
# |   |   |--- class: denied
# |   |--- debt_to_income <= 0.4
# |   |   |--- income <= 35000
# |   |   |   |--- class: denied
# |   |   |--- income >  35000
# |   |   |   |--- class: approved

# Option 3: Explainable Boosting Machine (EBM) - GAM with higher accuracy
ebm = ExplainableBoostingClassifier(random_state=42)
ebm.fit(X_train, y_train)

# EBM provides per-feature shape functions
from interpret import show
ebm_global = ebm.explain_global()
show(ebm_global)  # Interactive visualization of each feature's contribution

# For a single prediction, we can see exact contributions
ebm_local = ebm.explain_local(X_test[[0]])
show(ebm_local)  # Shows: income contributed +0.3, debt contributed -0.45, etc.

# All three models are intrinsically interpretable:
# - The explanation IS the model
# - No approximation or post-hoc analysis required
# - Guaranteed faithfulness to actual decision process
```

Post-hoc interpretability refers to techniques applied after a model has been trained to explain its behavior. The model itself remains a black box—we develop separate methods to probe, test, and characterize its decision-making.
This approach decouples model selection from interpretability. We can use whatever model achieves the best performance—deep neural networks, gradient boosting, random forests, transformers—and then apply explanation techniques independently.
Post-hoc explanations face a fundamental challenge: they are approximations of the true model behavior, not the behavior itself. An explanation might be plausible and convincing yet fail to capture what the model actually does. This gap between explanation and reality—the faithfulness problem—is the central concern of post-hoc interpretability research.
Why post-hoc methods are necessary:
Performance requirements — In many domains, the accuracy difference between interpretable and black-box models is substantial. If a neural network achieves 95% accuracy and a decision tree achieves 78%, the performance gap may be unacceptable.
Pre-existing models — Organizations often have deployed models they cannot replace. Post-hoc methods provide understanding of models that already exist.
Complex data modalities — For images, text, and audio, deep learning is often the only approach that works well. Intrinsic interpretability isn't a viable option.
Research understanding — Studying how neural networks work requires post-hoc analysis tools. Understanding emergent behaviors requires probing.
The spectrum of faithfulness:
Not all post-hoc methods are equally faithful:
| Method | Faithfulness Level | Why |
|---|---|---|
| Permutation Importance (global) | High | Directly measures prediction changes |
| SHAP (exact) | High | Grounded in game theory with uniqueness guarantees |
| SHAP (approximate) | Medium-High | Approximation introduces some error |
| LIME | Medium | Local linear approximation; may not capture non-linearities |
| Saliency Maps | Variable | Can be noisy, manipulable, and sensitive to implementation |
| Attention Weights | Variable | Attention may not correspond to importance |
| Surrogate Models | Medium-Low | Surrogate may diverge from original in important regions |
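The surrogate divergence in the last row can be measured directly: train a simple surrogate on the black box's own predictions and check how often the two agree. A minimal sketch with synthetic data (the dataset and model choices are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Black-box model
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

# Global surrogate: a shallow tree trained to mimic the black box's labels
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

# Fidelity: how often does the surrogate agree with the black box?
fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
print(f"Surrogate fidelity: {fidelity:.2%}")
# Any agreement below 100% marks regions of input space where the
# surrogate's "explanation" diverges from the model it claims to explain.
```

Reporting this fidelity score alongside the surrogate is a simple guard against over-trusting the explanation.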
```python
# Assumes X_train, y_train, X_test, y_test, and feature_names are already defined
import numpy as np
import shap
import lime
import lime.lime_tabular
from sklearn.ensemble import GradientBoostingClassifier

# Train a complex, high-performing but opaque model
model = GradientBoostingClassifier(n_estimators=200, max_depth=6, random_state=42)
model.fit(X_train, y_train)
print(f"Model accuracy: {model.score(X_test, y_test):.3f}")

# The model is a black box - let's apply post-hoc interpretability

# Method 1: SHAP (SHapley Additive exPlanations)
# Theoretically grounded, consistent, accurate
explainer = shap.TreeExplainer(model)
# For a binary GradientBoostingClassifier, TreeExplainer returns a single
# array of SHAP values on the raw-margin scale (not one array per class)
shap_values = explainer.shap_values(X_test)

# Global interpretation: Which features matter overall?
print("SHAP Global Feature Importance:")
importance = np.abs(shap_values).mean(axis=0)
for i, (name, imp) in enumerate(sorted(
        zip(feature_names, importance), key=lambda x: -x[1])):
    print(f"  {i+1}. {name}: {imp:.4f}")

# Local interpretation: Why did this specific example get this prediction?
idx = 0  # First test example
print(f"SHAP Local Explanation for example {idx}:")
print(f"  Prediction: {'Approved' if model.predict(X_test[[idx]])[0] else 'Denied'}")
print(f"  Base value: {explainer.expected_value:.4f}")
for name, val, contrib in sorted(
        zip(feature_names, X_test[idx], shap_values[idx]),
        key=lambda x: -abs(x[2]))[:5]:
    direction = "→ Approval" if contrib > 0 else "→ Denial"
    print(f"  {name}={val:.2f}: {contrib:+.4f} {direction}")

# Method 2: LIME (Local Interpretable Model-agnostic Explanations)
# Approximates model locally with a linear model
lime_explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=['Denied', 'Approved'], mode='classification')

explanation = lime_explainer.explain_instance(
    X_test[idx], model.predict_proba, num_features=5)

print(f"LIME Local Explanation for example {idx}:")
for feature, weight in explanation.as_list():
    print(f"  {feature}: {weight:+.4f}")

# Method 3: Permutation Importance (Global)
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    model, X_test, y_test, n_repeats=30, random_state=42)

print("Permutation Importance (decrease in accuracy when feature shuffled):")
for name, imp in sorted(
        zip(feature_names, perm_importance.importances_mean),
        key=lambda x: -x[1]):
    print(f"  {name}: {imp:.4f}")

# Note: All three methods approximate the model's behavior
# They may give slightly different answers - this is the faithfulness challenge
```

The choice between intrinsic and post-hoc interpretability represents a fundamental tradeoff in ML system design. Neither approach dominates the other—each has distinct advantages.
| Factor | Favors Intrinsic | Favors Post-hoc |
|---|---|---|
| Performance requirements | Accuracy gap is small (<5%) | Accuracy gap is large (>10%) |
| Regulatory environment | Strict auditability requirements | Flexible compliance options |
| Decision stakes | High (life, liberty, large financial) | Medium-low (recommendations, non-critical) |
| Data type | Tabular with meaningful features | Images, text, audio, complex signals |
| Feature engineering | Features are well-understood | Features are learned (embeddings) |
| Explanation audience | Non-technical (regulators, users) | Technical (ML engineers, researchers) |
| Model lifecycle | New model, clean-slate design | Existing model, retrofitting |
| Explanation fidelity | Must be 100% faithful | Approximate explanations acceptable |
In practice, many systems use hybrid approaches: an intrinsically interpretable model for high-stakes decisions with post-hoc analysis for edge cases, or a black-box model for scoring with an interpretable model providing reason codes. The choice isn't always binary.
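The second hybrid pattern mentioned above can be sketched in a few lines: a gradient-boosted model produces the score, while a separate logistic model trained on the same data supplies reason codes. All names and model choices here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

feature_names = [f"f{i}" for i in range(8)]  # hypothetical feature names
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)

scorer = GradientBoostingClassifier(random_state=1).fit(X, y)   # scoring model
reason_model = LogisticRegression(max_iter=1000).fit(X, y)      # reason-code model

def decide(x):
    """Black-box score plus interpretable reason codes for one applicant."""
    score = scorer.predict_proba(x.reshape(1, -1))[0, 1]
    # Reason codes: the linear model's largest per-feature contributions
    contribs = reason_model.coef_[0] * x
    top = np.argsort(-np.abs(contribs))[:3]
    reasons = [(feature_names[i], float(contribs[i])) for i in top]
    return score, reasons

score, reasons = decide(X[0])
print(f"score={score:.3f}, reasons={reasons}")
```

Note the caveat: the reason codes come from a different model than the score, so they are only as trustworthy as the two models' agreement, which is exactly the faithfulness concern discussed earlier.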
Let's examine the major classes of intrinsically interpretable models in detail, understanding their interpretability mechanisms and limitations.
Linear models. Form: ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
Interpretability mechanism: each coefficient βᵢ gives the change in the prediction per unit change in xᵢ with other features held fixed, so every prediction decomposes exactly into per-feature contributions βᵢxᵢ.
Strengths: fully decomposable and simulatable; explanations are exactly faithful by construction; coefficients support classical statistical inference.
Limitations: cannot represent non-linear effects or feature interactions unless they are engineered in by hand, which limits accuracy on genuinely complex relationships.
When to use: Tabular data with approximately linear relationships, when explanation of individual feature effects is critical, when statistical inference (p-values, confidence intervals) is needed.
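The statistical-inference point can be made concrete with the closed-form OLS machinery: the coefficient covariance is s²(XᵀX)⁻¹, which yields standard errors and confidence intervals directly. A sketch on synthetic data with a known ground truth (all values here are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.normal(50, 10, n)
bmi = rng.normal(26, 4, n)
# Synthetic outcome with known linear structure plus noise
y = 0.03 * age + 0.02 * bmi + rng.normal(0, 0.5, n)

# Design matrix with intercept column
X = np.column_stack([np.ones(n), age, bmi])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS standard errors: cov(beta) = s^2 * (X'X)^-1
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

for name, b, e in zip(["intercept", "age", "bmi"], beta, se):
    lo, hi = b - 1.96 * e, b + 1.96 * e  # normal-approximation 95% CI
    print(f"{name}: {b:+.4f}  [{lo:+.4f}, {hi:+.4f}]")
```

The fitted coefficients recover the ground-truth 0.03 and 0.02 within their confidence intervals, and a CI that excludes zero is evidence the feature genuinely matters.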
Post-hoc methods are diverse, each with different theoretical foundations, computational requirements, and types of explanations produced. Understanding this landscape is essential for choosing appropriate methods.
| Category | Key Methods | Output Type | Best For |
|---|---|---|---|
| Feature Attribution | SHAP, LIME, Integrated Gradients, DeepLIFT | Importance scores per feature | Understanding individual predictions |
| Surrogate Models | Global: Tree distillation; Local: LIME | Interpretable model mimicking black box | Overall behavior approximation |
| Example-based | Prototypes, Counterfactuals, Influential Instances | Training examples explaining prediction | Intuitive, case-based explanations |
| Visualization | Saliency maps, GradCAM, Activation maximization | Visual highlights on input | Image and visual data |
| Concept-based | TCAV, Concept Bottleneck Models | Scores for human-defined concepts | Higher-level semantic understanding |
| Probing | Classifier probes, Behavioral tests | Model capability assessments | Understanding learned representations |
Feature Attribution in Detail:
Feature attribution methods assign a numerical importance score to each input feature for a specific prediction. These scores indicate how much each feature contributed to pushing the prediction toward a particular outcome.
Key methods:
SHAP (SHapley Additive exPlanations): distributes a prediction across features using Shapley values from cooperative game theory; uniquely satisfies consistency and additivity axioms, with exact algorithms available for tree models.
LIME (Local Interpretable Model-agnostic Explanations): samples perturbed inputs around one example, queries the black box, and fits a weighted linear model that approximates its behavior in that neighborhood.
Integrated Gradients: accumulates model gradients along a straight-line path from a baseline input to the actual input; attributions satisfy completeness, summing to the difference between the two outputs.
Permutation Importance: shuffles one feature's values across a dataset and measures the resulting drop in predictive performance; a global, model-agnostic measure of the model's reliance on that feature.
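Integrated Gradients is simple enough to implement from scratch for a differentiable model. A sketch using a toy logistic scorer with fixed, hypothetical weights, approximating the path integral with a Riemann sum:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy differentiable model: logistic scorer with fixed (hypothetical) weights
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def f(x):
    return sigmoid(x @ w + b)

def grad_f(x):
    p = f(x)
    return p * (1 - p) * w  # analytic gradient of sigmoid(w·x + b)

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann-sum approximation of the path integral of gradients
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([2.0, 1.0, -1.0])
baseline = np.zeros(3)
attr = integrated_gradients(x, baseline)

print("attributions:", attr)
# Completeness axiom: attributions sum to f(x) - f(baseline)
print(attr.sum(), f(x) - f(baseline))
```

The completeness check at the end is the method's built-in sanity test: if the attributions do not sum to the output difference, the approximation needs more steps.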
Different post-hoc methods answer different questions. SHAP tells you 'how did each feature contribute?' LIME tells you 'what linear rule approximates behavior locally?' Counterfactuals tell you 'what would need to change?' Choose methods based on the question you're trying to answer.
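The counterfactual question can be answered by search: find the smallest change to the input that flips the prediction. A brute-force sketch over one feature at a time, using a hypothetical approval rule in place of a trained model (real counterfactual methods search more cleverly and enforce plausibility constraints):

```python
import numpy as np

def predict_denied(x):
    # Hypothetical approval rule standing in for a trained model:
    # denied if debt-to-income is high AND credit score is low
    debt_to_income, credit_score = x
    return debt_to_income > 0.4 and credit_score < 680

applicant = np.array([0.55, 640.0])  # currently denied
assert predict_denied(applicant)

def counterfactual(x, feature, deltas):
    """Return the first (smallest) delta to one feature that flips the outcome."""
    for d in deltas:
        x2 = x.copy()
        x2[feature] += d
        if not predict_denied(x2):
            return feature, float(d)
    return None

candidates = [
    counterfactual(applicant, 0, -np.arange(0.01, 0.5, 0.01)),  # lower debt ratio
    counterfactual(applicant, 1, np.arange(5, 300, 5.0)),       # raise credit score
]
print([c for c in candidates if c is not None])
# Each hit reads as actionable advice: "reduce debt_to_income by ~0.15"
# or "raise credit_score by 40"
```

Each counterfactual translates directly into a recourse statement for the affected person, which is why this explanation style is popular in lending contexts.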
A critical question for post-hoc methods: How do we know the explanation is correct? Unlike intrinsic interpretability, where the model is the explanation, post-hoc explanations are approximations that can be wrong.
Evaluation approaches:
Removal-based evaluation: delete or mask the features an explanation ranks as most important and check that the model's prediction degrades accordingly; removing supposedly unimportant features should change little.
Synthetic ground truth: construct datasets where the true feature influences are known by design, then check whether the explanation method recovers them.
Human evaluation: measure whether explanations actually help people predict model behavior or catch model errors, rather than merely looking plausible.
Sensitivity analysis: verify that explanations remain stable under small, meaning-preserving perturbations of the input and across retraining with different seeds.
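Removal-based evaluation can be sketched as a deletion test: neutralize the top-k features ranked by an attribution (here, replaced with their mean) and watch accuracy fall. The ranking below comes from permutation importance purely for illustration; any attribution method can be evaluated the same way:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=1500, n_features=10, n_informative=4,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Rank features by a candidate explanation (permutation importance here)
rank = np.argsort(-permutation_importance(
    model, X, y, n_repeats=10, random_state=0).importances_mean)

# Deletion test: replace top-k features with their mean and measure accuracy
means = X.mean(axis=0)
for k in [0, 2, 4, 6]:
    Xk = X.copy()
    Xk[:, rank[:k]] = means[rank[:k]]
    print(f"top-{k} features removed: accuracy {model.score(Xk, y):.3f}")
# A faithful ranking produces a steep deletion curve;
# a ranking no better than random produces a shallow one.
```

Comparing deletion curves across methods gives a quantitative, model-grounded way to choose between competing explanations.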
Research has shown that many post-hoc explanations can be manipulated to produce misleading results without changing model predictions. Adversarial attacks can force explanations to hide discriminatory features or highlight decoy features. This is especially concerning in regulated domains where explanations might be used to create an appearance of fairness while hiding actual bias.
Given the tradeoffs between intrinsic and post-hoc interpretability, how should practitioners make decisions? The factors summarized in the comparison table earlier (performance gap, regulatory environment, decision stakes, data type, explanation audience, and fidelity requirements) provide a structured framework for that choice.
We've established the fundamental taxonomy of interpretability methods: intrinsic approaches build understanding into the model itself and are faithful by construction, while post-hoc approaches explain an already-trained black box and are only ever approximations of its true behavior.
What's next:
The intrinsic/post-hoc distinction concerns when interpretability is built into the system. Next, we'll explore another crucial dimension: scope. Do we explain individual predictions (local interpretability) or overall model behavior (global interpretability)? This local vs. global distinction shapes what questions interpretability can answer.
You now understand the fundamental distinction between intrinsic and post-hoc interpretability. This taxonomy structures the entire field—every interpretability technique falls into one of these categories. Next, we'll explore the equally important distinction between local and global explanations.