Gradient boosting began as a regression technique, iteratively fitting residuals to reduce squared error. But what about classification—predicting discrete class labels rather than continuous values?
The key insight is that classification can be reframed as regression: instead of predicting labels directly, we predict log-odds (the logarithm of odds of class membership). This prediction is continuous, enabling the same gradient descent machinery we've developed. The logistic loss (also called log loss, cross-entropy loss, or Bernoulli loss) provides the bridge.
Logistic loss is to classification what squared loss is to regression: the default, theoretically grounded choice. It arises naturally from maximum likelihood estimation under the Bernoulli model and produces well-calibrated probability predictions. Understanding logistic loss deeply is essential for anyone using gradient boosting for classification tasks—which in practice means most real-world applications.
By the end of this page, you will understand logistic loss from first principles—its derivation from maximum likelihood, the sigmoid function, gradient and Hessian computation on the log-odds scale, and implementation in gradient boosting classifiers. You'll grasp why gradients target residuals on the probability scale and how to extend to multiclass problems.
In binary classification, we observe pairs $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i \in \{0, 1\}$ is a binary class label.
Our goal is to predict the probability that $y = 1$ given $x$:
$$p(x) = P(Y = 1 | X = x)$$
Why Probabilities Instead of Labels?

Probabilities carry more information than hard labels: they express how confident the model is, they let you tune the decision threshold to your application, and well-calibrated probabilities feed directly into downstream decision-making.
The Challenge:
Probabilities must lie in $[0, 1]$, but gradient boosting predicts unbounded real values. We need a transformation.
The Solution: Log-Odds (Logit) Scale
Gradient boosting predicts the log-odds (logit):
$$F(x) = \log\frac{p(x)}{1 - p(x)}$$
This transforms $p \in (0, 1)$ to $F \in (-\infty, +\infty)$, which gradient boosting can freely predict. To recover probabilities, we apply the sigmoid function:
$$p(x) = \sigma(F(x)) = \frac{1}{1 + e^{-F(x)}} = \frac{e^{F(x)}}{1 + e^{F(x)}}$$
The sigmoid σ(z) = 1/(1+e^(-z)) maps any real number to (0,1). For z → +∞, σ(z) → 1. For z → -∞, σ(z) → 0. At z = 0, σ(0) = 0.5. The sigmoid is the inverse of the logit function, creating a bijection between probabilities and log-odds.
| Log-Odds F(x) | Probability p(x) | Interpretation |
|---|---|---|
| -∞ | 0 | Certain negative class |
| -2.2 | 0.10 | 90% likely negative |
| -1.1 | 0.25 | 75% likely negative |
| 0 | 0.50 | Equal probability (uncertain) |
| +1.1 | 0.75 | 75% likely positive |
| +2.2 | 0.90 | 90% likely positive |
| +∞ | 1 | Certain positive class |
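To make the mapping concrete, here is a minimal sketch (the helper names `sigmoid` and `logit` are illustrative, not from any library) that reproduces a few rows of the table by converting log-odds to probabilities and back:

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability back to log-odds (inverse of sigmoid)."""
    return np.log(p / (1.0 - p))

# Reproduce a few rows of the table above
for F in [-2.2, -1.1, 0.0, 1.1, 2.2]:
    p = sigmoid(F)
    print(f"F = {F:+.1f}  ->  p = {p:.2f}  ->  logit(p) = {logit(p):+.2f}")
```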
Given true label $y \in \{0, 1\}$ and predicted log-odds $F$, the logistic loss is:
$$L(y, F) = y \cdot \log(1 + e^{-F}) + (1 - y) \cdot \log(1 + e^{F})$$
This can be rewritten in several equivalent forms:
In terms of probability $p = \sigma(F)$: $$L(y, F) = -y \log(p) - (1-y) \log(1-p)$$
This is the cross-entropy between the true distribution (point mass at $y$) and predicted distribution.
Compact form for $y \in \{-1, +1\}$: $$L(y, F) = \log(1 + e^{-yF})$$
This version treats positive and negative classes symmetrically.
Unified Loss Expression:
For $y \in \{0, 1\}$, let $\tilde{y} = 2y - 1 \in \{-1, +1\}$. Then: $$L(y, F) = \log(1 + e^{-\tilde{y}F})$$
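As a quick sanity check, the sketch below (with illustrative helper names) evaluates the three formulations above at a few points and confirms they agree numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_logodds(y, F):
    """Form 1: written directly in terms of the log-odds F, with y in {0, 1}."""
    return y * np.log1p(np.exp(-F)) + (1 - y) * np.log1p(np.exp(F))

def loss_prob(y, F):
    """Form 2: cross-entropy in terms of p = sigmoid(F), with y in {0, 1}."""
    p = sigmoid(F)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def loss_pm1(y, F):
    """Form 3: compact form with labels recoded to {-1, +1}."""
    y_tilde = 2 * y - 1
    return np.log1p(np.exp(-y_tilde * F))

for y in (0, 1):
    for F in (-2.0, 0.0, 1.5):
        vals = (loss_logodds(y, F), loss_prob(y, F), loss_pm1(y, F))
        print(f"y={y}, F={F:+.1f}: {vals[0]:.6f} {vals[1]:.6f} {vals[2]:.6f}")
```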
Logistic loss arises from maximizing likelihood under the Bernoulli model: P(Y=y|x) = p^y(1-p)^(1-y). Taking negative log-likelihood gives the logistic loss. This is why minimizing logistic loss gives maximum likelihood estimates—with all the associated statistical guarantees.
Understanding the Loss:
Let's examine the loss for specific cases (using $y \in \{0, 1\}$):
Case: True positive ($y = 1$): $$L(1, F) = \log(1 + e^{-F})$$
Case: True negative ($y = 0$): $$L(0, F) = \log(1 + e^{F})$$
Key Property: Confident wrong predictions incur unbounded loss. This is much harsher than squared loss for regression and drives the model to avoid confident mistakes.
| Log-Odds F | Probability p | Loss L(1, F) | Interpretation |
|---|---|---|---|
| -3 | 0.047 | 3.05 | Very wrong (confident neg) |
| -1 | 0.269 | 1.31 | Moderately wrong |
| 0 | 0.500 | 0.69 | Uncertain |
| +1 | 0.731 | 0.31 | Moderately right |
| +3 | 0.953 | 0.05 | Very right (confident pos) |
| +5 | 0.993 | 0.007 | Extremely confident |
To use gradient boosting, we need the gradient of logistic loss with respect to $F$.
Starting from: $$L(y, F) = y \log(1 + e^{-F}) + (1-y) \log(1 + e^{F})$$
Computing the derivative:
$$\frac{\partial L}{\partial F} = y \cdot \frac{-e^{-F}}{1 + e^{-F}} + (1-y) \cdot \frac{e^{F}}{1 + e^{F}}$$
Simplifying using $\sigma(F) = \frac{1}{1 + e^{-F}}$:
$$\frac{\partial L}{\partial F} = y \cdot (-(1 - \sigma(F))) + (1-y) \cdot \sigma(F)$$
$$= -y + y\sigma(F) + \sigma(F) - y\sigma(F)$$
$$= \sigma(F) - y = p - y$$
The Beautiful Result:
$$\frac{\partial L}{\partial F} = p - y$$
The gradient is simply predicted probability minus true label!
For logistic loss, the gradient equals (p - y), the difference between predicted probability and true label. The pseudo-residual is -(p - y) = y - p. This is remarkably similar to regression: instead of (ŷ - y) for regression, we have (p - y) for classification. Trees learn to predict the probability residual!
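If you want to verify the derivation numerically, a finite-difference check like the following sketch compares $\frac{L(y, F+\epsilon) - L(y, F-\epsilon)}{2\epsilon}$ against $p - y$ (helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y, F):
    p = sigmoid(F)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# Central finite difference vs. the analytic gradient p - y
eps = 1e-6
for y in (0, 1):
    for F in (-2.0, 0.0, 1.5):
        numeric = (logistic_loss(y, F + eps) - logistic_loss(y, F - eps)) / (2 * eps)
        analytic = sigmoid(F) - y
        print(f"y={y}, F={F:+.1f}: numeric={numeric:+.6f}, analytic={analytic:+.6f}")
```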
Understanding the Gradient:
For $y \in \{0, 1\}$ and $p = \sigma(F)$:

If $y = 1$ (positive class): the gradient is $p - 1 \in (-1, 0)$, so the pseudo-residual $1 - p$ is positive and pushes $F$ (and therefore $p$) upward toward 1.

If $y = 0$ (negative class): the gradient is $p \in (0, 1)$, so the pseudo-residual $-p$ is negative and pushes $F$ downward toward lower probability.
Magnitude Matters:
Unlike absolute loss, whose gradients are always ±1, logistic loss gradients scale with confidence: a confidently wrong prediction yields a gradient near ±1 (a large correction), while a confidently correct prediction yields a gradient near 0 (barely any update).
This automatic scaling is one reason logistic loss works so well.
| Predicted p (for true y = 1) | Gradient (p - 1) | Pseudo-residual (1 - p) | Interpretation |
|---|---|---|---|
| 0.01 | -0.99 | +0.99 | Strong push toward positive |
| 0.10 | -0.90 | +0.90 | Significant correction needed |
| 0.50 | -0.50 | +0.50 | Moderate correction |
| 0.90 | -0.10 | +0.10 | Minor adjustment |
| 0.99 | -0.01 | +0.01 | Nearly correct, tiny update |
XGBoost and LightGBM use the Hessian (second derivative) for more efficient tree construction. Let's derive it.
Starting from the gradient: $$g = \frac{\partial L}{\partial F} = \sigma(F) - y = p - y$$
Computing the Hessian: $$h = \frac{\partial^2 L}{\partial F^2} = \frac{\partial p}{\partial F}$$
Using the derivative of sigmoid $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$:
$$h = p(1 - p)$$
The Hessian is $p(1-p)$—the variance of a Bernoulli random variable!
Properties of the Hessian:
The Hessian p(1-p) tells us how much "leverage" a point has in determining the split. Uncertain points (p ≈ 0.5) have maximum leverage—they can swing either way and contribute most to optimization. Confident points (p ≈ 0 or 1) have low leverage—they're already well-classified and contribute less.
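The sketch below illustrates both points: it tabulates $p(1-p)$ across the probability range and confirms by finite differences that the Hessian is the derivative of the gradient (helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hessian h = p(1 - p): largest for uncertain points, tiny for confident ones
for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"p = {p:.2f}  ->  h = p(1-p) = {p * (1 - p):.4f}")

# Finite-difference check: d(p - y)/dF equals p(1 - p) (y drops out)
eps = 1e-6
for F in (-3.0, 0.0, 2.0):
    numeric = (sigmoid(F + eps) - sigmoid(F - eps)) / (2 * eps)
    analytic = sigmoid(F) * (1 - sigmoid(F))
    print(f"F = {F:+.1f}: numeric = {numeric:.6f}, analytic = {analytic:.6f}")
```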
XGBoost Split Finding:
In XGBoost, the optimal split is found by maximizing the reduction in the objective:
$$\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
Where:

- $G_L = \sum_{i \in L} g_i$ and $G_R = \sum_{i \in R} g_i$ are the sums of gradients in the left and right children
- $H_L = \sum_{i \in L} h_i$ and $H_R = \sum_{i \in R} h_i$ are the sums of Hessians in the left and right children
- $\lambda$ is the L2 regularization on leaf weights and $\gamma$ penalizes adding a leaf
The Hessian $H$ acts as a weight: points with higher Hessian (uncertain predictions) contribute more to determining the split.
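As a toy illustration of the gain formula (not XGBoost's actual split search, which scans all features and thresholds), the sketch below scores a single hand-picked split using per-sample gradients $g_i = p_i - y_i$ and Hessians $h_i = p_i(1 - p_i)$:

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain formula from above: children's scores minus the parent's, minus gamma."""
    def score(G, H):
        return G ** 2 / (H + reg_lambda)
    G_L, H_L = g_left.sum(), h_left.sum()
    G_R, H_R = g_right.sum(), h_right.sum()
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Toy node: per-sample predicted probabilities and true labels
p = np.array([0.9, 0.8, 0.2, 0.3, 0.6, 0.4])
y = np.array([1, 1, 0, 0, 1, 0])
g, h = p - y, p * (1 - p)

# Candidate split: the first three samples go left, the rest go right
print("gain:", split_gain(g[:3], h[:3], g[3:], h[3:]))
```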
Optimal Leaf Value:
For logistic loss, the optimal leaf value is:
$$w^* = -\frac{\sum_i g_i}{\sum_i h_i + \lambda} = \frac{\sum_i (y_i - p_i)}{\sum_i p_i(1-p_i) + \lambda}$$
This is a Newton step: gradient divided by Hessian (plus regularization).
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class LogisticLossGradientBoosting:
    """
    Gradient Boosting for binary classification using logistic loss.

    Key concepts:
    1. Predict log-odds F, convert to probability via sigmoid
    2. Gradient = (p - y), the probability residual
    3. Hessian = p(1-p), the Bernoulli variance
    4. Leaf values via Newton step
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def _sigmoid(self, z):
        """Sigmoid function with numerical stability."""
        return np.where(
            z >= 0,
            1 / (1 + np.exp(-z)),
            np.exp(z) / (1 + np.exp(z))
        )

    def _logistic_loss(self, y, F):
        """Compute logistic loss (cross-entropy)."""
        p = self._sigmoid(F)
        # Clip for numerical stability
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def _gradient(self, y, F):
        """
        Compute gradient of logistic loss.
        g = p - y where p = sigmoid(F)
        """
        p = self._sigmoid(F)
        return p - y  # Probability residual

    def _hessian(self, F):
        """
        Compute Hessian of logistic loss.
        h = p(1-p), the Bernoulli variance
        """
        p = self._sigmoid(F)
        return p * (1 - p)

    def _compute_leaf_values(self, tree, X, gradients, hessians, reg_lambda=1.0):
        """
        Compute optimal leaf values using Newton step.
        w* = -sum(g) / (sum(h) + lambda)
        """
        leaf_indices = tree.apply(X)
        unique_leaves = np.unique(leaf_indices)

        leaf_values = {}
        for leaf in unique_leaves:
            mask = leaf_indices == leaf
            g_sum = gradients[mask].sum()
            h_sum = hessians[mask].sum()
            # Newton step with regularization
            leaf_values[leaf] = -g_sum / (h_sum + reg_lambda)

        return leaf_values, leaf_indices

    def fit(self, X, y):
        """Fit gradient boosting classifier."""
        n_samples = len(y)

        # Initialize with log-odds of class proportions
        p_positive = np.mean(y)
        self.initial_prediction = np.log(p_positive / (1 - p_positive))

        # Current log-odds predictions
        F = np.full(n_samples, self.initial_prediction)

        print("Gradient Boosting Classification with Logistic Loss")
        print("=" * 60)
        print(f"Class balance: {p_positive:.2%} positive")
        print(f"Initial log-odds: {self.initial_prediction:.4f}")
        print(f"Initial loss: {self._logistic_loss(y, F):.4f}")
        print()

        for m in range(self.n_estimators):
            # Compute gradient and Hessian
            gradients = self._gradient(y, F)  # p - y
            hessians = self._hessian(F)       # p(1-p)

            # Fit tree to negative gradient (pseudo-residual)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, -gradients)  # Fit to (y - p)

            # Compute optimal leaf values with Newton step
            leaf_values, leaf_indices = self._compute_leaf_values(
                tree, X, gradients, hessians
            )

            # Get predictions
            predictions = np.array([leaf_values[l] for l in leaf_indices])

            # Update log-odds
            F += self.learning_rate * predictions

            # Store tree
            self.trees.append((tree, leaf_values))

            # Track progress
            if (m + 1) % 20 == 0 or m == 0:
                loss = self._logistic_loss(y, F)
                p = self._sigmoid(F)
                acc = np.mean((p >= 0.5) == y)
                print(f"Iter {m+1:3d}: Loss = {loss:.4f}, Accuracy = {acc:.2%}")

        print(f"\nFinal Loss: {self._logistic_loss(y, F):.4f}")
        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        F = np.full(len(X), self.initial_prediction)
        for tree, leaf_values in self.trees:
            leaf_indices = tree.apply(X)
            preds = np.array([leaf_values.get(l, 0) for l in leaf_indices])
            F += self.learning_rate * preds
        p = self._sigmoid(F)
        return np.column_stack([1 - p, p])

    def predict(self, X):
        """Predict class labels."""
        proba = self.predict_proba(X)
        return (proba[:, 1] >= 0.5).astype(int)


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate classification data
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train our model
    model = LogisticLossGradientBoosting(n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    test_acc = np.mean(y_pred == y_test)
    print(f"\nTest Accuracy: {test_acc:.2%}")
```

Like regression, we need an initial prediction $F_0$ before iterating. For logistic loss, the optimal constant minimizes:
$$F_0 = \arg\min_c \sum_{i=1}^{n} L(y_i, c)$$
Derivation:
Setting the derivative to zero:
$$\sum_{i=1}^{n} (\sigma(c) - y_i) = 0$$
$$n \cdot \sigma(c) = \sum_{i=1}^{n} y_i = n \cdot \bar{y}$$
$$\sigma(c) = \bar{y} \implies c = \log\frac{\bar{y}}{1 - \bar{y}}$$
The optimal initial prediction is the log-odds of the base rate!
If 70% of examples are positive ($\bar{y} = 0.7$): $$F_0 = \log\frac{0.7}{0.3} = \log(2.33) \approx 0.85$$
This corresponds to predicting probability 0.7 for all examples initially—a sensible baseline.
The log-odds initialization automatically handles class imbalance. For rare positive classes (e.g., 1% positive), F₀ ≈ -4.6, corresponding to p ≈ 0.01. The model starts with the correct base rate and only needs to learn adjustments from there. This is much better than starting at F₀ = 0 (p = 0.5).
| Positive Rate | Initial Log-Odds F₀ | Initial Probability p₀ |
|---|---|---|
| 1% | -4.60 | 0.01 |
| 10% | -2.20 | 0.10 |
| 25% | -1.10 | 0.25 |
| 50% | 0.00 | 0.50 |
| 75% | +1.10 | 0.75 |
| 90% | +2.20 | 0.90 |
| 99% | +4.60 | 0.99 |
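The table can be reproduced directly from the formula $F_0 = \log\frac{\bar{y}}{1 - \bar{y}}$; a small sketch, with an illustrative `sigmoid` helper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reproduce the table: F0 = log(rate / (1 - rate)) and the probability it encodes
for rate in (0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99):
    F0 = np.log(rate / (1 - rate))
    print(f"positive rate = {rate:.2f}: F0 = {F0:+.2f}, sigmoid(F0) = {sigmoid(F0):.2f}")
```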
Logistic loss involves exponentials and logarithms, which can cause numerical issues:
Problem 1: Exponential Overflow
For large $|F|$, $e^{F}$ or $e^{-F}$ can overflow: in double precision, $e^{z}$ overflows once $z$ exceeds roughly 709, producing `inf` and then `NaN` in downstream arithmetic.
Solution: Numerically stable sigmoid
```python
import numpy as np

def stable_sigmoid(z):
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),        # For z >= 0
        np.exp(z) / (1 + np.exp(z))  # For z < 0
    )
```
This avoids computing $e^z$ for large positive $z$ or $e^{-z}$ for large negative $z$.
Problem 2: Log of Zero
When $p = 0$ or $p = 1$, log terms become $-\infty$.
Solution: Clip probabilities
```python
p = np.clip(p, 1e-15, 1 - 1e-15)
loss = -y * np.log(p) - (1 - y) * np.log(1 - p)
```
Problem 3: Small Hessian
When $p \approx 0$ or $p \approx 1$, the Hessian $p(1-p) \approx 0$, leading to division issues.
Solution: Add regularization or floor
```python
hessian = np.maximum(p * (1 - p), 1e-6)
```
Many gradient boosting bugs stem from numerical issues with logistic loss. Symptoms include: NaN losses, predictions stuck at 0 or 1, enormous leaf values, or training that suddenly diverges. Always use numerically stable implementations and add appropriate clipping/regularization.
```python
import numpy as np


def stable_logistic_loss(y, F, eps=1e-15):
    """
    Numerically stable logistic loss computation.
    Uses the log-sum-exp trick to avoid overflow/underflow.
    """
    # For y=1: loss = log(1 + exp(-F))
    # For y=0: loss = log(1 + exp(F))

    # Stable version of log(1 + exp(z))
    def softplus(z):
        return np.where(
            z > 20,  # For large z, log(1+exp(z)) ≈ z
            z,
            np.log1p(np.exp(np.minimum(z, 20)))
        )

    loss = y * softplus(-F) + (1 - y) * softplus(F)
    return np.mean(loss)


def stable_gradient_hessian(y, F, min_hessian=1e-6):
    """
    Compute gradient and Hessian with numerical stability.
    """
    # Stable sigmoid
    p = np.where(
        F >= 0,
        1 / (1 + np.exp(-F)),
        np.exp(F) / (1 + np.exp(F))
    )

    # Gradient: p - y
    gradient = p - y

    # Hessian: p(1-p), with floor to prevent division issues
    hessian = np.maximum(p * (1 - p), min_hessian)

    return gradient, hessian


# Test edge cases
print("Numerical Stability Test")
print("=" * 50)

test_cases = [
    ("Normal case", 1, 1.0),
    ("Confident correct", 1, 10.0),
    ("Very confident correct", 1, 100.0),
    ("Confident wrong", 1, -10.0),
    ("Very confident wrong", 1, -100.0),
]

for name, y, F in test_cases:
    loss = stable_logistic_loss(np.array([y]), np.array([F]))
    g, h = stable_gradient_hessian(np.array([y]), np.array([F]))
    print(f"{name:25s}: F={F:6.1f}, Loss={loss:.6f}, g={g[0]:.6f}, h={h[0]:.6f}")
```

Binary logistic loss extends naturally to multiclass problems with $K$ classes using softmax and cross-entropy.
Softmax for Multiple Classes:
Instead of one log-odds value, we predict $K$ values $F_1(x), \ldots, F_K(x)$. Probabilities are computed via softmax:
$$p_k(x) = \frac{e^{F_k(x)}}{\sum_{j=1}^{K} e^{F_j(x)}}$$
Multiclass Cross-Entropy Loss:
$$L(y, F) = -\sum_{k=1}^{K} \mathbf{1}_{y=k} \log(p_k)$$
For one-hot encoded labels $y \in \{0, 1\}^K$:
$$L = -\sum_{k=1}^{K} y_k \log(p_k)$$
Gradient for Multiclass:
For each class $k$:
$$\frac{\partial L}{\partial F_k} = p_k - y_k$$
This is identical to binary! Each class gets gradient (predicted probability - indicator).
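Here is a minimal sketch of the multiclass case for a single example with $K = 3$ classes, using an illustrative `softmax` helper; the per-class gradient is simply $p_k - y_k$:

```python
import numpy as np

def softmax(F):
    """Row-wise softmax over K class scores, shifted for numerical stability."""
    z = F - F.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# One sample, K = 3 classes, true class k = 2 (one-hot encoded)
F = np.array([[0.5, -1.0, 2.0]])
y = np.array([[0.0, 0.0, 1.0]])

p = softmax(F)
grad = p - y  # per-class gradient, exactly as in the binary case
print("probabilities:", np.round(p, 3))
print("gradients:    ", np.round(grad, 3))
```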
There are two main approaches to multiclass gradient boosting: (1) One-vs-All: Train K separate binary classifiers. (2) Native multiclass: Train trees that predict K-dimensional outputs. XGBoost and LightGBM support both. Native multiclass is generally more efficient and accurate.
Hessian for Multiclass:
The Hessian becomes a $K \times K$ matrix:
$$h_{ij} = \frac{\partial^2 L}{\partial F_i \partial F_j} = p_i(\mathbf{1}_{i=j} - p_j)$$
Diagonal elements: $h_{kk} = p_k(1 - p_k)$. Off-diagonal elements ($i \neq j$): $h_{ij} = -p_i p_j$.
This is the covariance matrix of a multinomial distribution!
Practical Implementation:
Most implementations use a diagonal approximation for efficiency: $$h_{kk} = p_k(1 - p_k)$$
This ignores cross-class dependencies but is much faster and works well in practice.
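To see what the approximation drops, the sketch below builds the full $K \times K$ Hessian $p_i(\mathbf{1}_{i=j} - p_j)$ for one example and compares it with its diagonal (illustrative code, not any library's internals):

```python
import numpy as np

def softmax(F):
    z = F - F.max()
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([0.5, -1.0, 2.0]))

# Full K x K Hessian: h_ij = p_i * (1[i == j] - p_j)
H_full = np.diag(p) - np.outer(p, p)

# Diagonal approximation used by most implementations: h_kk = p_k * (1 - p_k)
H_diag = np.diag(p * (1 - p))

print("full Hessian:\n", np.round(H_full, 3))
print("diagonal approximation:\n", np.round(H_diag, 3))
```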
Library Support:
```python
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingClassifier

K = 3  # number of classes (example value)

# XGBoost
model = xgb.XGBClassifier(objective='multi:softmax', num_class=K)

# LightGBM
model = lgb.LGBMClassifier(objective='multiclass', num_class=K)

# Scikit-learn
model = GradientBoostingClassifier()  # Handles multiclass automatically
```
We've thoroughly explored logistic loss as the foundation for classification in gradient boosting. Let's consolidate the key insights:

- Classification is reframed as regression on the log-odds scale; the sigmoid maps log-odds back to probabilities.
- The gradient of logistic loss is $p - y$, so each tree fits the probability residual $y - p$.
- The Hessian is $p(1 - p)$, the Bernoulli variance; uncertain points carry the most weight in Newton-style updates.
- The optimal initial prediction is the log-odds of the base rate, which handles class imbalance automatically.
- Numerical stability (stable sigmoid, clipped probabilities, Hessian floors) matters in practice.
- Multiclass problems use softmax and cross-entropy, with the same per-class gradient $p_k - y_k$.
What's Next:
We've now covered the major built-in loss functions: squared, absolute, Huber for regression, and logistic for classification. But what if none of these fit your problem exactly? The final page explores custom loss functions—how to design, implement, and use your own loss functions in gradient boosting frameworks. This is where the true flexibility of gradient boosting shines.
You now understand logistic loss from first principles—connecting probability theory, maximum likelihood, and gradient optimization. This knowledge enables you to use gradient boosting classifiers effectively and understand their behavior at a deep level.