At the heart of every logistic regression model lie its parameters: the weight vector $\mathbf{w} \in \mathbb{R}^d$ and the bias term $b \in \mathbb{R}$ (also called the intercept). Together, these parameters transform any input feature vector into a probability of class membership.
Understanding these parameters goes beyond mere mathematical formalism. Each weight encodes how strongly a feature influences the classification decision. The bias determines the default tendency toward one class or another. Their geometric arrangement defines a decision boundary that separates the classes in feature space.
This page explores model parameters from every angle: their mathematical role, geometric interpretation, estimation procedure, and the practical challenges of finding optimal values.
By the end of this page, you will understand: (1) the mathematical role of weights and bias in logistic regression, (2) how parameters define a linear decision boundary, (3) the maximum likelihood estimation framework, (4) why regularization is essential for parameter estimation, and (5) practical considerations including initialization and feature scaling.
Let's precisely define the parameters and their roles in the logistic regression model.
The Complete Model
Given a feature vector $\mathbf{x} \in \mathbb{R}^d$ with $d$ features, the logistic regression model predicts:
$$P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \sigma\left(\sum_{j=1}^d w_j x_j + b\right)$$
where:

- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function,
- $w_j$ is the weight attached to feature $x_j$, and
- $b$ is the bias (intercept) term.
The Linear Predictor
The quantity inside the sigmoid is called the linear predictor (or logit):
$$z = \mathbf{w}^T\mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
This is a linear combination of inputs. The sigmoid then squashes this to (0, 1).
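As a minimal sketch of this two-step computation (the parameters and input below are illustrative, not from a trained model):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters for d = 3 features
w = np.array([0.8, -1.2, 0.4])
b = 0.5
x = np.array([1.0, 0.5, 2.0])

z = w @ x + b   # linear predictor (log-odds): 0.8 - 0.6 + 0.8 + 0.5 = 1.5
p = sigmoid(z)  # squashed to (0, 1)
print(z, p)     # 1.5, ~0.818
```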
Parameter Count
For a problem with $d$ features:

- $d$ weights (one per feature)
- $1$ bias term

for a total of $d + 1$ parameters.
This linear parameter count (compared to, say, neural networks) is one reason logistic regression is interpretable and computationally efficient.
| Parameter | Symbol | Dimension | Role |
|---|---|---|---|
| Weight for feature j | w_j | Scalar | Controls influence of feature j on prediction |
| Weight vector | w | d × 1 | Defines direction of steepest probability change |
| Bias (intercept) | b | Scalar | Controls decision boundary offset from origin |
| Linear predictor | z = w^T x + b | Scalar | Input to sigmoid; log-odds of positive class |
Compact Notation with Augmented Vectors
We often absorb the bias into the weight vector using augmented notation:
$$\tilde{\mathbf{w}} = \begin{pmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix} \in \mathbb{R}^{d+1}, \quad \tilde{\mathbf{x}} = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \in \mathbb{R}^{d+1}$$
Then the model simplifies to:
$$z = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}$$
This notation is convenient for matrix formulations and derivations, while conceptually the bias plays a distinct role from the feature weights.
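A quick numerical check (with arbitrary illustrative parameters) confirms that the augmented form computes the same linear predictor:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.5, 3.0])

# Augmented forms: prepend the bias to w and a constant 1 to x
w_aug = np.concatenate(([b], w))
x_aug = np.concatenate(([1.0], x))

z_plain = w @ x + b
z_aug = w_aug @ x_aug
print(z_plain, z_aug)  # identical: 0.5, 0.5
```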
The bias b can be interpreted as the log-odds of the positive class when all features are zero. If b > 0, the model has a 'prior preference' for class 1; if b < 0, for class 0. This becomes the prediction when we have no informative features.
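A small sketch of this baseline interpretation, evaluating $\sigma(b)$ for a few illustrative bias values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# With all features zero, z = b, so the prediction is sigma(b)
for b in [-2.0, 0.0, 2.0]:
    print(f"b = {b:+.1f}  ->  baseline P(Y=1) = {sigmoid(b):.4f}")
# b < 0 favors class 0, b = 0 is neutral (0.5), b > 0 favors class 1
```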
The parameters have a beautiful geometric interpretation that illuminates how logistic regression 'sees' the classification problem.
The Weight Vector as Direction
The weight vector $\mathbf{w}$ points in the direction of steepest increase in probability of class 1. Moving in the direction of $\mathbf{w}$ increases the probability most rapidly; moving opposite to $\mathbf{w}$ decreases it.
The magnitude $\|\mathbf{w}\|$ controls how quickly probabilities transition from near 0 to near 1. A larger $\|\mathbf{w}\|$ means a sharper transition (more confident predictions); a smaller $\|\mathbf{w}\|$ means a gentler, more uncertain transition.
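A tiny illustration of this effect, using two weight vectors with the same direction but different norms (the values are arbitrary, with $b = 0$):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w_small = np.array([0.5, 0.25])
w_large = 10 * w_small          # same direction, 10x the norm
x = np.array([0.4, 0.2])        # a point slightly on the class-1 side

p_small = sigmoid(w_small @ x)  # ~0.56: mild confidence
p_large = sigmoid(w_large @ x)  # ~0.92: same point, much sharper prediction
print(p_small, p_large)
```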
The Decision Boundary
The decision boundary is the set of points where $P(Y=1|\mathbf{x}) = 0.5$, equivalently where the linear predictor equals zero:
$$\mathcal{B} = \{\mathbf{x} : \mathbf{w}^T\mathbf{x} + b = 0\}$$
This is a hyperplane in $\mathbb{R}^d$:

- In 2D ($d = 2$): a line
- In 3D ($d = 3$): a plane
- In general: a $(d-1)$-dimensional flat surface
Normal Vector and Distance
The weight vector $\mathbf{w}$ is the normal vector to the decision boundary—it's perpendicular to the boundary surface.
The signed distance from a point $\mathbf{x}$ to the decision boundary is:
$$\text{distance} = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|} = \frac{z}{\|\mathbf{w}\|}$$
Positive distance means $\mathbf{x}$ is on the class-1 side; negative means class-0 side.
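A short sketch of the signed-distance formula, reusing the illustrative parameters $\mathbf{w} = (2, 1)$, $b = -1$:

```python
import numpy as np

w = np.array([2.0, 1.0])
b = -1.0

def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

x_pos = np.array([2.0, 2.0])    # class-1 side: z = 5 > 0
x_neg = np.array([-1.0, -1.0])  # class-0 side: z = -4 < 0
x_on = np.array([0.5, 0.0])     # on the boundary: 2*0.5 + 0 - 1 = 0

print(signed_distance(x_pos, w, b))  # positive
print(signed_distance(x_neg, w, b))  # negative
print(signed_distance(x_on, w, b))   # zero
```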
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define model parameters for 2D case
w = np.array([2.0, 1.0])  # Weight vector
b = -1.0                  # Bias

# Decision boundary: w^T x + b = 0
# In 2D: w1*x1 + w2*x2 + b = 0
# Solving for x2: x2 = -(w1*x1 + b) / w2

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create a grid of points
x1_range = np.linspace(-3, 3, 100)
x2_range = np.linspace(-3, 3, 100)
X1, X2 = np.meshgrid(x1_range, x2_range)

# Compute probabilities on the grid
Z = w[0] * X1 + w[1] * X2 + b
P = sigmoid(Z)

# Left plot: Probability contours
ax1 = axes[0]
contour = ax1.contourf(X1, X2, P, levels=20, cmap='RdBu_r', alpha=0.8)
plt.colorbar(contour, ax=ax1, label='P(Y=1)')

# Decision boundary (P = 0.5, z = 0)
x1_boundary = x1_range
x2_boundary = -(w[0] * x1_boundary + b) / w[1]
ax1.plot(x1_boundary, x2_boundary, 'k-', linewidth=2,
         label='Decision Boundary (P=0.5)')

# Weight vector (normalized for visualization)
w_norm = w / np.linalg.norm(w)
origin = np.array([0, -b / w[1]])  # Point on boundary
ax1.quiver(*origin, w_norm[0], w_norm[1], color='green', scale=5,
           width=0.02, label='w direction', zorder=5)

ax1.set_xlabel('x₁')
ax1.set_ylabel('x₂')
ax1.set_title('Probability Contours and Decision Boundary')
ax1.legend(loc='upper left')
ax1.set_xlim(-3, 3)
ax1.set_ylim(-3, 3)
ax1.set_aspect('equal')

# Right plot: 3D surface (remove the placeholder 2D axes first)
axes[1].remove()
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X1, X2, P, cmap='RdBu_r', alpha=0.8, linewidth=0)
ax2.set_xlabel('x₁')
ax2.set_ylabel('x₂')
ax2.set_zlabel('P(Y=1)')
ax2.set_title('Probability Surface')

plt.tight_layout()
plt.savefig('geometric_interpretation.png', dpi=150)
plt.show()

# Print geometric quantities
print("Geometric Properties")
print("=" * 50)
print(f"Weight vector w: {w}")
print(f"Bias b: {b}")
print(f"||w||: {np.linalg.norm(w):.4f}")
print(f"Normal direction: {w / np.linalg.norm(w)}")
print(f"Distance from origin to boundary: {abs(b) / np.linalg.norm(w):.4f}")
```

Large $\|\mathbf{w}\|$ creates sharp probability transitions (confident predictions near the boundary). Small $\|\mathbf{w}\|$ creates gradual transitions (uncertain predictions spread over a wide region). Regularization, which penalizes large $\|\mathbf{w}\|$, encourages gentler, less overconfident probability estimates.
Each weight $w_j$ controls how feature $x_j$ contributes to the final prediction. The contribution mechanism is quite intuitive once we understand the linear predictor.
Additive Contributions in Log-Odds Space
The linear predictor is a sum of contributions:
$$z = \underbrace{b}_{\text{baseline}} + \underbrace{w_1 x_1}_{\text{feature 1}} + \underbrace{w_2 x_2}_{\text{feature 2}} + \cdots + \underbrace{w_d x_d}_{\text{feature d}}$$
Each term $w_j x_j$ is the contribution of feature $j$ to the log-odds. The contributions sum to form the total log-odds, which is then transformed to probability.
Interpreting Weight Signs and Magnitudes
| $w_j$ | Effect | Interpretation |
|---|---|---|
| $w_j > 0$ | Increases log-odds | Higher $x_j$ → more likely class 1 |
| $w_j < 0$ | Decreases log-odds | Higher $x_j$ → more likely class 0 |
| $w_j = 0$ | No effect | Feature $j$ is irrelevant |
| $\lvert w_j \rvert$ large | Strong effect per unit of $x_j$ | Feature $j$ heavily influences the prediction |
| $\lvert w_j \rvert$ small | Weak effect per unit of $x_j$ | Feature $j$ barely influences the prediction |
Feature Attribution Example
Consider a spam classifier with features 'contains_offer', 'from_contact', 'has_attachment':
$$z = -1.0 + 2.5 \cdot \text{contains\_offer} - 3.0 \cdot \text{from\_contact} + 0.5 \cdot \text{has\_attachment}$$
For an email with contains_offer=1, from_contact=0, has_attachment=1: $$z = -1.0 + 2.5(1) - 3.0(0) + 0.5(1) = 2.0$$ $$P(\text{spam}) = \sigma(2.0) \approx 0.88$$
We can decompose why: the 'contains_offer' feature contributed most (+2.5), pushing toward spam.
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Spam classifier example
feature_names = ['contains_offer', 'from_contact', 'has_attachment',
                 'link_count', 'word_count']
weights = np.array([2.5, -3.0, 0.5, 0.3, -0.01])
bias = -1.0

def predict_with_attribution(x, feature_names, weights, bias):
    """Predict and show each feature's contribution."""
    contributions = weights * x
    z = bias + np.sum(contributions)
    prob = sigmoid(z)

    print("Feature Attribution Analysis")
    print("=" * 60)
    print(f"{'Feature':<20} | {'Value':>8} | {'Weight':>8} | {'Contrib':>10}")
    print("-" * 60)
    print(f"{'(bias)':<20} | {'—':>8} | {bias:>8.2f} | {bias:>10.2f}")
    for fname, val, w, c in zip(feature_names, x, weights, contributions):
        print(f"{fname:<20} | {val:>8.2f} | {w:>8.2f} | {c:>10.2f}")
    print("-" * 60)
    print(f"{'Total log-odds (z)':<20} | {'':>8} | {'':>8} | {z:>10.2f}")
    print(f"{'P(spam)':<20} | {'':>8} | {'':>8} | {prob:>10.4f}")
    print()
    return prob, contributions

# Example 1: Likely spam
print(">>> Example 1: Suspicious email")
x1 = np.array([1, 0, 1, 5, 50])  # has offer, not from contact, attachment, 5 links
prob1, _ = predict_with_attribution(x1, feature_names, weights, bias)

# Example 2: Likely legitimate
print(">>> Example 2: Email from known contact")
x2 = np.array([0, 1, 1, 1, 200])  # no offer, from contact, attachment, 1 link
prob2, _ = predict_with_attribution(x2, feature_names, weights, bias)

# Feature importance by magnitude
print(">>> Feature Importance (by |weight|)")
print("-" * 40)
importance_order = np.argsort(-np.abs(weights))
for idx in importance_order:
    print(f"{feature_names[idx]:<20}: |w| = {np.abs(weights[idx]):.4f}")
```

Raw weight magnitudes are only comparable when features are on the same scale. If 'word_count' ranges 0-1000 while 'contains_offer' is binary 0/1, comparing their weights directly is misleading. Always standardize features before comparing weight magnitudes, or use standardized coefficients.
Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where do the optimal parameters come from? The answer is Maximum Likelihood Estimation (MLE)—we find parameters that make the observed data most probable.
The Likelihood Function
For a single example $(\mathbf{x}, y)$, the probability of observing that outcome is:
$$P(Y=y | \mathbf{x}; \mathbf{w}, b) = \begin{cases} \sigma(z) & \text{if } y = 1 \\ 1 - \sigma(z) & \text{if } y = 0 \end{cases}$$
This can be written compactly as:
$$P(Y=y | \mathbf{x}; \mathbf{w}, b) = \sigma(z)^y \cdot (1 - \sigma(z))^{1-y}$$
Assuming independent observations, the likelihood (probability of all data) is:
$$L(\mathbf{w}, b) = \prod_{i=1}^n \sigma(z_i)^{y_i} \cdot (1 - \sigma(z_i))^{1-y_i}$$
where $z_i = \mathbf{w}^T\mathbf{x}_i + b$.
The Log-Likelihood (Cross-Entropy Loss)
Taking the logarithm converts products to sums:
$$\ell(\mathbf{w}, b) = \sum_{i=1}^n \left[ y_i \log \sigma(z_i) + (1-y_i) \log(1 - \sigma(z_i)) \right]$$
Maximizing this is equivalent to minimizing the negative log-likelihood, which is the binary cross-entropy loss:
$$\mathcal{L}(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \sigma(z_i) + (1-y_i) \log(1 - \sigma(z_i)) \right]$$
This is the standard loss function for logistic regression.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid."""
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def log_likelihood(w, b, X, y):
    """
    Compute log-likelihood for logistic regression.
    Higher is better.
    """
    z = X @ w + b
    p = sigmoid(z)
    # Clip probabilities to avoid log(0)
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)
    # Log-likelihood
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll

def cross_entropy_loss(w, b, X, y):
    """
    Compute mean cross-entropy loss. Lower is better.
    Equals -log_likelihood / n.
    """
    return -log_likelihood(w, b, X, y) / len(y)

# Generate synthetic data
np.random.seed(42)
n = 1000
X = np.random.randn(n, 2)
true_w = np.array([1.5, -1.0])
true_b = 0.5
z_true = X @ true_w + true_b
y = (np.random.rand(n) < sigmoid(z_true)).astype(int)

# Show loss landscape along one dimension
w_test_range = np.linspace(-1, 4, 50)
losses = []
for w1 in w_test_range:
    w_temp = np.array([w1, -1.0])
    loss = cross_entropy_loss(w_temp, 0.5, X, y)
    losses.append(loss)

print("Maximum Likelihood Estimation")
print("=" * 50)
print(f"True parameters: w = {true_w}, b = {true_b}")
print(f"Loss at true parameters: {cross_entropy_loss(true_w, true_b, X, y):.4f}")
print(f"Loss at w = [0, -1]: {cross_entropy_loss(np.array([0, -1]), 0.5, X, y):.4f}")
print(f"Loss at w = [3, -1]: {cross_entropy_loss(np.array([3, -1]), 0.5, X, y):.4f}")

# Verify: minimum loss near true w1 = 1.5
min_idx = np.argmin(losses)
print(f"Minimum loss at w1 = {w_test_range[min_idx]:.2f} (true: 1.5)")
```

Unlike linear regression (where the normal equations give a closed-form solution), logistic regression's nonlinear sigmoid means we must optimize numerically. The good news: the loss function is convex, so any local minimum is the global minimum, and gradient-based methods converge reliably.
To find optimal parameters, we use gradient descent—iteratively updating parameters in the direction that decreases the loss.
Gradient Derivation
The gradient of the cross-entropy loss with respect to parameters has an elegant form. For a single example:
$$\frac{\partial \mathcal{L}}{\partial w_j} = (\hat{p} - y) \cdot x_j$$
$$\frac{\partial \mathcal{L}}{\partial b} = (\hat{p} - y)$$
where $\hat{p} = \sigma(z)$ is the predicted probability.
In matrix form for all examples:
$$\nabla_\mathbf{w} \mathcal{L} = \frac{1}{n} \mathbf{X}^T (\hat{\mathbf{p}} - \mathbf{y})$$
$$\nabla_b \mathcal{L} = \frac{1}{n} \sum_{i=1}^n (\hat{p}_i - y_i)$$
The gradient is the average error times the feature values—beautifully intuitive.
The Update Rule
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \cdot \nabla_\mathbf{w} \mathcal{L}$$ $$b^{(t+1)} = b^{(t)} - \eta \cdot \nabla_b \mathcal{L}$$
where $\eta$ is the learning rate, a hyperparameter controlling step size.
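One way to sanity-check the gradient formulas above is to compare them against finite differences; this sketch (using arbitrary random data, not from the examples elsewhere on this page) does exactly that for the weight gradient:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, X, y):
    """Mean cross-entropy loss."""
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.2

# Analytic gradient: (1/n) X^T (p_hat - y)
error = sigmoid(X @ w + b) - y
grad_w = X.T @ error / len(y)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    grad_fd[j] = (loss(w + e, b, X, y) - loss(w - e, b, X, y)) / (2 * eps)

print(np.max(np.abs(grad_w - grad_fd)))  # tiny: the formulas agree
```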
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def compute_loss_and_gradient(w, b, X, y):
    """Compute loss and gradients for logistic regression."""
    n = len(y)
    z = X @ w + b
    p_hat = sigmoid(z)

    # Cross-entropy loss
    eps = 1e-15
    p_clipped = np.clip(p_hat, eps, 1 - eps)
    loss = -np.mean(y * np.log(p_clipped) + (1 - y) * np.log(1 - p_clipped))

    # Gradients
    error = p_hat - y
    grad_w = (1 / n) * X.T @ error
    grad_b = (1 / n) * np.sum(error)

    return loss, grad_w, grad_b

def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000, verbose=True):
    """Train logistic regression via gradient descent."""
    n_features = X.shape[1]

    # Initialize parameters
    w = np.zeros(n_features)
    b = 0.0
    losses = []

    for i in range(n_iterations):
        loss, grad_w, grad_b = compute_loss_and_gradient(w, b, X, y)
        losses.append(loss)

        # Update parameters
        w = w - learning_rate * grad_w
        b = b - learning_rate * grad_b

        if verbose and (i + 1) % 100 == 0:
            print(f"Iteration {i+1:4d}: Loss = {loss:.6f}")

    return w, b, losses

# Generate data
np.random.seed(42)
n = 1000
X = np.random.randn(n, 3)
true_w = np.array([1.0, -0.5, 0.8])
true_b = 0.3
y = (sigmoid(X @ true_w + true_b) > np.random.rand(n)).astype(int)

# Train
print("Gradient Descent Training")
print("=" * 50)
w_learned, b_learned, losses = gradient_descent(X, y, learning_rate=0.5,
                                                n_iterations=500)

print(f"True parameters: w = {true_w}, b = {true_b}")
print(f"Learned parameters: w = {np.round(w_learned, 4)}, b = {b_learned:.4f}")

# Accuracy
predictions = (sigmoid(X @ w_learned + b_learned) > 0.5).astype(int)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy:.2%}")
```

The gradient (p̂ - y) · x says: if we predicted too high (p̂ > y), decrease the weight for positive features and increase for negative features. If we predicted too low (p̂ < y), do the opposite. Features with larger magnitudes get larger updates. This makes perfect intuitive sense.
Pure maximum likelihood can lead to problems, especially with linearly separable data or high-dimensional features. Regularization adds a penalty on parameter magnitudes to prevent overfitting.
L2 Regularization (Ridge)
The regularized loss function is:
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 = \mathcal{L} + \frac{\lambda}{2} \sum_{j=1}^d w_j^2$$
where $\lambda > 0$ is the regularization strength (often written as $1/C$ in sklearn).
The gradient becomes: $$\nabla_\mathbf{w} \mathcal{L}_{\text{reg}} = \nabla_\mathbf{w} \mathcal{L} + \lambda \mathbf{w}$$
This shrinks weights toward zero, reducing model complexity.
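The L2 gradient update can equivalently be read as "weight decay": shrink the weights multiplicatively, then take the usual data-gradient step. A small sketch with purely illustrative numbers (`grad_data` stands in for a computed data gradient):

```python
import numpy as np

# One L2-regularized gradient step (illustrative numbers)
w = np.array([1.0, -2.0, 0.5])
grad_data = np.array([0.1, -0.2, 0.05])  # hypothetical data gradient
lam, eta = 0.5, 0.1                      # regularization strength, learning rate

# Standard form: step along the regularized gradient
w_new = w - eta * (grad_data + lam * w)

# Equivalent "weight decay" form: shrink w, then take the data step
w_decay = (1 - eta * lam) * w - eta * grad_data

print(np.allclose(w_new, w_decay))  # True: the two views are identical
```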
L1 Regularization (Lasso)
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\mathbf{w}\|_1 = \mathcal{L} + \lambda \sum_{j=1}^d |w_j|$$
L1 regularization promotes sparsity—driving some weights exactly to zero, effectively performing feature selection.
Why Regularization Matters for Logistic Regression
For linearly separable data, MLE solutions diverge to $\pm \infty$. The sigmoid can achieve perfect classification by making confidence arbitrarily high, but this leads to:

- weights that grow without bound as optimization continues,
- predicted probabilities saturating at exactly 0 or 1 (extreme overconfidence), and
- numerical instability and failure of the solver to converge.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Create linearly separable data (dangerous for unregularized LR)
np.random.seed(42)
n = 100
X = np.vstack([
    np.random.randn(n//2, 2) + np.array([2, 2]),
    np.random.randn(n//2, 2) + np.array([-2, -2])
])
y = np.array([1] * (n//2) + [0] * (n//2))

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare different regularization strengths
print("Effect of Regularization Strength on Parameters")
print("=" * 70)
print(f"{'C (1/λ)':<10} | {'||w||₂':>10} | {'w':>25} | {'Train Acc':>10}")
print("-" * 70)

for C in [1000, 10, 1, 0.1, 0.01]:
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    model.fit(X_scaled, y)
    w = model.coef_[0]
    w_norm = np.linalg.norm(w)
    accuracy = model.score(X_scaled, y)
    print(f"{C:<10} | {w_norm:>10.4f} | "
          f"[{w[0]:>10.4f}, {w[1]:>10.4f}] | {accuracy:>10.2%}")

# Show L1 vs L2
print("\n" + "=" * 70)
print("L1 (Lasso) vs L2 (Ridge) Regularization")
print("=" * 70)

# Add some irrelevant features
X_noisy = np.hstack([X_scaled, np.random.randn(n, 5)])

for penalty, solver in [('l2', 'lbfgs'), ('l1', 'saga')]:
    model = LogisticRegression(C=0.1, penalty=penalty, solver=solver,
                               max_iter=5000)
    model.fit(X_noisy, y)
    w = model.coef_[0]
    n_zero = np.sum(np.abs(w) < 0.01)
    print(f"{penalty.upper()} Regularization:")
    print(f"  Weights: {np.round(w, 3)}")
    print(f"  Weights near zero: {n_zero}/7")
```

When data is perfectly linearly separable, unregularized logistic regression has no finite MLE solution—weights grow infinitely to achieve 'perfect' sigmoid predictions of exactly 0 or 1. Regularization (or early stopping) is not optional in such cases; it's necessary for the algorithm to converge.
Successful logistic regression requires attention to several practical details beyond the core algorithm.
Feature Scaling
Gradient descent converges faster when features are on similar scales. Standardization ($x \to \frac{x-\mu}{\sigma}$) is the most common preprocessing:

- it conditions the loss surface, so gradient-based solvers converge in fewer iterations,
- it makes weight magnitudes comparable across features, and
- it ensures regularization penalizes all features evenly.
Initialization
Typical initialization:

- Weights: all zeros (the loss is convex, so no symmetry breaking is needed) or small random values
- Bias: zero, or the log prior odds of the positive class
For imbalanced datasets, initializing bias to reflect prior probabilities speeds convergence.
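A sketch of this trick: for a hypothetical dataset with a 5% positive rate, setting $b$ to the log prior odds makes the initial prediction (with zero weights) match the base rate exactly:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical imbalanced labels: 5% positives
y = np.array([1] * 50 + [0] * 950)
p = y.mean()

# Initialize b to the log prior odds, so sigma(b) equals the base rate
b0 = np.log(p / (1 - p))

print(b0, sigmoid(b0))  # sigma(b0) recovers the 5% base rate
```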
Convergence Criteria
Stop training when:

- the change in loss between iterations falls below a tolerance,
- the gradient norm falls below a threshold, or
- a maximum iteration count is reached.
Advanced Optimizers
Beyond vanilla gradient descent:

- Newton's method / IRLS: uses second-order curvature information for faster convergence
- L-BFGS: a quasi-Newton method, the default solver in scikit-learn's `LogisticRegression`
- SGD and mini-batch variants: scale to datasets too large to fit in memory
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Simulated data
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
true_w = np.array([1, -0.5, 0.8, 0, 0, 0, 0.3, -0.2, 0, 0])  # Sparse weights
y = (1 / (1 + np.exp(-(X @ true_w + 0.3))) > np.random.rand(n)).astype(int)

print("Practical Logistic Regression Pipeline")
print("=" * 60)

# Create pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Grid search for best hyperparameters
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print(f"Best C: {grid_search.best_params_['classifier__C']}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")

# Examine final model
best_model = grid_search.best_estimator_
coefficients = best_model.named_steps['classifier'].coef_[0]

print("Learned coefficients (scaled features):")
for i, (true, learned) in enumerate(zip(true_w, coefficients)):
    marker = "✓" if (np.abs(true) > 0.1) == (np.abs(learned) > 0.05) else "✗"
    print(f"  Feature {i}: true={true:>6.2f}, learned={learned:>6.3f} {marker}")
```

Always use sklearn's Pipeline to chain preprocessing with the classifier. This ensures that scaling parameters are fit only on training data during cross-validation, preventing data leakage. It also makes deployment simpler—the entire pipeline can be serialized as one object.
We've comprehensively explored the parameters of logistic regression—from their mathematical definition through their geometric meaning to their estimation and regularization. Let's consolidate the key insights:

- The weights $\mathbf{w}$ and bias $b$ fully determine the model: $P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)$.
- Geometrically, $\mathbf{w}$ is the normal to the linear decision boundary, and $\|\mathbf{w}\|$ controls how sharply probabilities transition across it.
- Parameters are estimated by maximum likelihood, equivalently by minimizing the convex cross-entropy loss with gradient-based methods.
- Regularization keeps weights finite (essential for separable data) and controls overfitting; L1 additionally induces sparsity.
- In practice: scale features, initialize sensibly, and tune regularization strength via cross-validation.
What's Next:
With parameters understood, we now examine the decision boundary in detail—how the linear equation w^T x + b = 0 defines class separation, how it relates to margin and confidence, and how to visualize and interpret decision surfaces in various dimensions.
You now have a thorough understanding of logistic regression parameters—what they are, what they mean geometrically, how they're learned, and how to work with them in practice. This foundation is essential for model interpretation and debugging.