Gradient boosting began as a regression technique, iteratively fitting residuals to reduce squared error. But what about classification—predicting discrete class labels rather than continuous values?
The key insight is that classification can be reframed as regression: instead of predicting labels directly, we predict log-odds (the logarithm of odds of class membership). This prediction is continuous, enabling the same gradient descent machinery we've developed. The logistic loss (also called log loss, cross-entropy loss, or Bernoulli loss) provides the bridge.
Logistic loss is to classification what squared loss is to regression: the default, theoretically grounded choice. It arises naturally from maximum likelihood estimation under the Bernoulli model and produces well-calibrated probability predictions. Understanding logistic loss deeply is essential for anyone using gradient boosting for classification tasks—which in practice means most real-world applications.
By the end of this page, you will understand logistic loss from first principles—its derivation from maximum likelihood, the sigmoid function, gradient and Hessian computation on the log-odds scale, and implementation in gradient boosting classifiers. You'll grasp why gradients target residuals on the probability scale and how to extend to multiclass problems.
In binary classification, we observe pairs $(x_i, y_i)$, where $x_i$ is a feature vector and $y_i \in \{0, 1\}$ is a binary class label.
Our goal is to predict the probability that $y = 1$ given $x$:
$$p(x) = P(Y = 1 | X = x)$$
Why Probabilities Instead of Labels?

Probabilities carry more information than hard labels: they express how confident the model is, they let you tune the decision threshold to your application, and well-calibrated probabilities feed directly into downstream decision-making.
The Challenge:
Probabilities must lie in $[0, 1]$, but gradient boosting predicts unbounded real values. We need a transformation.
The Solution: Log-Odds (Logit) Scale
Gradient boosting predicts the log-odds (logit):
$$F(x) = \log\frac{p(x)}{1 - p(x)}$$
This transforms $p \in (0, 1)$ to $F \in (-\infty, +\infty)$, which gradient boosting can freely predict. To recover probabilities, we apply the sigmoid function:
$$p(x) = \sigma(F(x)) = \frac{1}{1 + e^{-F(x)}} = \frac{e^{F(x)}}{1 + e^{F(x)}}$$
The sigmoid σ(z) = 1/(1+e^(-z)) maps any real number to (0,1). For z → +∞, σ(z) → 1. For z → -∞, σ(z) → 0. At z = 0, σ(0) = 0.5. The sigmoid is the inverse of the logit function, creating a bijection between probabilities and log-odds.
| Log-Odds F(x) | Probability p(x) | Interpretation |
|---|---|---|
| -∞ | 0 | Certain negative class |
| -2.2 | 0.10 | 90% likely negative |
| -1.1 | 0.25 | 75% likely negative |
| 0 | 0.50 | Equal probability (uncertain) |
| +1.1 | 0.75 | 75% likely positive |
| +2.2 | 0.90 | 90% likely positive |
| +∞ | 1 | Certain positive class |
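To make the mapping concrete, here is a minimal sketch (the helper names `sigmoid` and `logit` are illustrative, not from any library) that reproduces a few rows of the table by converting log-odds to probabilities and back:

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Map a probability back to log-odds (inverse of sigmoid)."""
    return np.log(p / (1.0 - p))

# Reproduce a few rows of the table above
for F in [-2.2, -1.1, 0.0, 1.1, 2.2]:
    p = sigmoid(F)
    print(f"F = {F:+.1f}  ->  p = {p:.2f}  ->  logit(p) = {logit(p):+.2f}")
```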
Given true label $y \in \{0, 1\}$ and predicted log-odds $F$, the logistic loss is:
$$L(y, F) = y \cdot \log(1 + e^{-F}) + (1 - y) \cdot \log(1 + e^{F})$$
This can be rewritten in several equivalent forms:
In terms of probability $p = \sigma(F)$: $$L(y, F) = -y \log(p) - (1-y) \log(1-p)$$
This is the cross-entropy between the true distribution (point mass at $y$) and predicted distribution.
Compact form for $y \in \{-1, +1\}$: $$L(y, F) = \log(1 + e^{-yF})$$
This version treats positive and negative classes symmetrically.
Unified Loss Expression:
For $y \in \{0, 1\}$, let $\tilde{y} = 2y - 1 \in \{-1, +1\}$. Then: $$L(y, F) = \log(1 + e^{-\tilde{y}F})$$
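As a quick sanity check, the sketch below (with illustrative helper names) evaluates the three formulations above at a few points and confirms they agree numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_logodds(y, F):
    """Form 1: written directly in terms of the log-odds F, with y in {0, 1}."""
    return y * np.log1p(np.exp(-F)) + (1 - y) * np.log1p(np.exp(F))

def loss_prob(y, F):
    """Form 2: cross-entropy in terms of p = sigmoid(F), with y in {0, 1}."""
    p = sigmoid(F)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def loss_pm1(y, F):
    """Form 3: compact form with labels recoded to {-1, +1}."""
    y_tilde = 2 * y - 1
    return np.log1p(np.exp(-y_tilde * F))

for y in (0, 1):
    for F in (-2.0, 0.0, 1.5):
        vals = (loss_logodds(y, F), loss_prob(y, F), loss_pm1(y, F))
        print(f"y={y}, F={F:+.1f}: {vals[0]:.6f} {vals[1]:.6f} {vals[2]:.6f}")
```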
Logistic loss arises from maximizing likelihood under the Bernoulli model: P(Y=y|x) = p^y(1-p)^(1-y). Taking negative log-likelihood gives the logistic loss. This is why minimizing logistic loss gives maximum likelihood estimates—with all the associated statistical guarantees.
Understanding the Loss:
Let's examine the loss for specific cases (using $y \in \{0, 1\}$):
Case: True positive ($y = 1$): $$L(1, F) = \log(1 + e^{-F})$$
Case: True negative ($y = 0$): $$L(0, F) = \log(1 + e^{F})$$
Key Property: Confident wrong predictions incur unbounded loss. This is much harsher than squared loss for regression and drives the model to avoid confident mistakes.
| Log-Odds F | Probability p | Loss L(1, F) | Interpretation |
|---|---|---|---|
| -3 | 0.047 | 3.05 | Very wrong (confident neg) |
| -1 | 0.269 | 1.31 | Moderately wrong |
| 0 | 0.500 | 0.69 | Uncertain |
| +1 | 0.731 | 0.31 | Moderately right |
| +3 | 0.953 | 0.05 | Very right (confident pos) |
| +5 | 0.993 | 0.007 | Extremely confident |
To use gradient boosting, we need the gradient of logistic loss with respect to $F$.
Starting from: $$L(y, F) = y \log(1 + e^{-F}) + (1-y) \log(1 + e^{F})$$
Computing the derivative:
$$\frac{\partial L}{\partial F} = y \cdot \frac{-e^{-F}}{1 + e^{-F}} + (1-y) \cdot \frac{e^{F}}{1 + e^{F}}$$
Simplifying using $\sigma(F) = \frac{1}{1 + e^{-F}}$:
$$\frac{\partial L}{\partial F} = y \cdot (-(1 - \sigma(F))) + (1-y) \cdot \sigma(F)$$
$$= -y + y\sigma(F) + \sigma(F) - y\sigma(F)$$
$$= \sigma(F) - y = p - y$$
The Beautiful Result:
$$\frac{\partial L}{\partial F} = p - y$$
The gradient is simply predicted probability minus true label!
For logistic loss, the gradient equals (p - y), the difference between predicted probability and true label. The pseudo-residual is -(p - y) = y - p. This is remarkably similar to regression: instead of (ŷ - y) for regression, we have (p - y) for classification. Trees learn to predict the probability residual!
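If you want to verify the derivation numerically, a finite-difference check like the following sketch compares $\frac{L(y, F+\epsilon) - L(y, F-\epsilon)}{2\epsilon}$ against $p - y$ (helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y, F):
    p = sigmoid(F)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# Central finite difference vs. the analytic gradient p - y
eps = 1e-6
for y in (0, 1):
    for F in (-2.0, 0.0, 1.5):
        numeric = (logistic_loss(y, F + eps) - logistic_loss(y, F - eps)) / (2 * eps)
        analytic = sigmoid(F) - y
        print(f"y={y}, F={F:+.1f}: numeric={numeric:+.6f}, analytic={analytic:+.6f}")
```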
Understanding the Gradient:
For $y \in \{0, 1\}$ and $p = \sigma(F)$:

If $y = 1$ (positive class): the gradient is $p - 1 \in (-1, 0)$, so the pseudo-residual $1 - p$ is positive and pushes $F$ (and therefore $p$) upward toward 1.

If $y = 0$ (negative class): the gradient is $p \in (0, 1)$, so the pseudo-residual $-p$ is negative and pushes $F$ downward toward lower probability.
Magnitude Matters:
Unlike absolute loss, whose gradients are always ±1, logistic loss gradients scale with confidence: a confidently wrong prediction yields a gradient near ±1 (a large correction), while a confidently correct prediction yields a gradient near 0 (barely any update).
This automatic scaling is one reason logistic loss works so well.
| Predicted p (for true y = 1) | Gradient (p - 1) | Pseudo-residual (1 - p) | Interpretation |
|---|---|---|---|
| 0.01 | -0.99 | +0.99 | Strong push toward positive |
| 0.10 | -0.90 | +0.90 | Significant correction needed |
| 0.50 | -0.50 | +0.50 | Moderate correction |
| 0.90 | -0.10 | +0.10 | Minor adjustment |
| 0.99 | -0.01 | +0.01 | Nearly correct, tiny update |
XGBoost and LightGBM use the Hessian (second derivative) for more efficient tree construction. Let's derive it.
Starting from the gradient: $$g = \frac{\partial L}{\partial F} = \sigma(F) - y = p - y$$
Computing the Hessian: $$h = \frac{\partial^2 L}{\partial F^2} = \frac{\partial p}{\partial F}$$
Using the derivative of sigmoid $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$:
$$h = p(1 - p)$$
The Hessian is $p(1-p)$—the variance of a Bernoulli random variable!
Properties of the Hessian:
The Hessian p(1-p) tells us how much "leverage" a point has in determining the split. Uncertain points (p ≈ 0.5) have maximum leverage—they can swing either way and contribute most to optimization. Confident points (p ≈ 0 or 1) have low leverage—they're already well-classified and contribute less.
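The sketch below illustrates both points: it tabulates $p(1-p)$ across the probability range and confirms by finite differences that the Hessian is the derivative of the gradient (helper names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hessian h = p(1 - p): largest for uncertain points, tiny for confident ones
for p in (0.01, 0.1, 0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"p = {p:.2f}  ->  h = p(1-p) = {p * (1 - p):.4f}")

# Finite-difference check: d(p - y)/dF equals p(1 - p) (y drops out)
eps = 1e-6
for F in (-3.0, 0.0, 2.0):
    numeric = (sigmoid(F + eps) - sigmoid(F - eps)) / (2 * eps)
    analytic = sigmoid(F) * (1 - sigmoid(F))
    print(f"F = {F:+.1f}: numeric = {numeric:.6f}, analytic = {analytic:.6f}")
```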
XGBoost Split Finding:
In XGBoost, the optimal split is found by maximizing the reduction in the objective:
$$\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
Where:

- $G_L = \sum_{i \in L} g_i$ and $G_R = \sum_{i \in R} g_i$ are the sums of gradients in the left and right children
- $H_L = \sum_{i \in L} h_i$ and $H_R = \sum_{i \in R} h_i$ are the sums of Hessians in the left and right children
- $\lambda$ is the L2 regularization on leaf weights and $\gamma$ penalizes adding a leaf
The Hessian $H$ acts as a weight: points with higher Hessian (uncertain predictions) contribute more to determining the split.
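As a toy illustration of the gain formula (not XGBoost's actual split search, which scans all features and thresholds), the sketch below scores a single hand-picked split using per-sample gradients $g_i = p_i - y_i$ and Hessians $h_i = p_i(1 - p_i)$:

```python
import numpy as np

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain formula from above: children's scores minus the parent's, minus gamma."""
    def score(G, H):
        return G ** 2 / (H + reg_lambda)
    G_L, H_L = g_left.sum(), h_left.sum()
    G_R, H_R = g_right.sum(), h_right.sum()
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma

# Toy node: per-sample predicted probabilities and true labels
p = np.array([0.9, 0.8, 0.2, 0.3, 0.6, 0.4])
y = np.array([1, 1, 0, 0, 1, 0])
g, h = p - y, p * (1 - p)

# Candidate split: the first three samples go left, the rest go right
print("gain:", split_gain(g[:3], h[:3], g[3:], h[3:]))
```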
Optimal Leaf Value:
For logistic loss, the optimal leaf value is:
$$w^* = -\frac{\sum_i g_i}{\sum_i h_i + \lambda} = \frac{\sum_i (y_i - p_i)}{\sum_i p_i(1-p_i) + \lambda}$$
This is a Newton step: gradient divided by Hessian (plus regularization).
```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


class LogisticLossGradientBoosting:
    """
    Gradient Boosting for binary classification using logistic loss.

    Key concepts:
    1. Predict log-odds F, convert to probability via sigmoid
    2. Gradient = (p - y), the probability residual
    3. Hessian = p(1-p), the Bernoulli variance
    4. Leaf values via Newton step
    """

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def _sigmoid(self, z):
        """Sigmoid function with numerical stability."""
        return np.where(
            z >= 0,
            1 / (1 + np.exp(-z)),
            np.exp(z) / (1 + np.exp(z))
        )

    def _logistic_loss(self, y, F):
        """Compute logistic loss (cross-entropy)."""
        p = self._sigmoid(F)
        # Clip for numerical stability
        p = np.clip(p, 1e-15, 1 - 1e-15)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def _gradient(self, y, F):
        """
        Compute gradient of logistic loss.
        g = p - y where p = sigmoid(F)
        """
        p = self._sigmoid(F)
        return p - y  # Probability residual

    def _hessian(self, F):
        """
        Compute Hessian of logistic loss.
        h = p(1-p), the Bernoulli variance
        """
        p = self._sigmoid(F)
        return p * (1 - p)

    def _compute_leaf_values(self, tree, X, gradients, hessians, reg_lambda=1.0):
        """
        Compute optimal leaf values using Newton step.
        w* = -sum(g) / (sum(h) + lambda)
        """
        leaf_indices = tree.apply(X)
        unique_leaves = np.unique(leaf_indices)

        leaf_values = {}
        for leaf in unique_leaves:
            mask = leaf_indices == leaf
            g_sum = gradients[mask].sum()
            h_sum = hessians[mask].sum()
            # Newton step with regularization
            leaf_values[leaf] = -g_sum / (h_sum + reg_lambda)

        return leaf_values, leaf_indices

    def fit(self, X, y):
        """Fit gradient boosting classifier."""
        n_samples = len(y)

        # Initialize with log-odds of class proportions
        p_positive = np.mean(y)
        self.initial_prediction = np.log(p_positive / (1 - p_positive))

        # Current log-odds predictions
        F = np.full(n_samples, self.initial_prediction)

        print("Gradient Boosting Classification with Logistic Loss")
        print("=" * 60)
        print(f"Class balance: {p_positive:.2%} positive")
        print(f"Initial log-odds: {self.initial_prediction:.4f}")
        print(f"Initial loss: {self._logistic_loss(y, F):.4f}")
        print()

        for m in range(self.n_estimators):
            # Compute gradient and Hessian
            gradients = self._gradient(y, F)  # p - y
            hessians = self._hessian(F)       # p(1-p)

            # Fit tree to negative gradient (pseudo-residual)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, -gradients)  # Fit to (y - p)

            # Compute optimal leaf values with Newton step
            leaf_values, leaf_indices = self._compute_leaf_values(
                tree, X, gradients, hessians
            )

            # Get predictions
            predictions = np.array([leaf_values[l] for l in leaf_indices])

            # Update log-odds
            F += self.learning_rate * predictions

            # Store tree
            self.trees.append((tree, leaf_values))

            # Track progress
            if (m + 1) % 20 == 0 or m == 0:
                loss = self._logistic_loss(y, F)
                p = self._sigmoid(F)
                acc = np.mean((p >= 0.5) == y)
                print(f"Iter {m+1:3d}: Loss = {loss:.4f}, Accuracy = {acc:.2%}")

        print(f"\nFinal Loss: {self._logistic_loss(y, F):.4f}")
        return self

    def predict_proba(self, X):
        """Predict class probabilities."""
        F = np.full(len(X), self.initial_prediction)
        for tree, leaf_values in self.trees:
            leaf_indices = tree.apply(X)
            preds = np.array([leaf_values.get(l, 0) for l in leaf_indices])
            F += self.learning_rate * preds
        p = self._sigmoid(F)
        return np.column_stack([1 - p, p])

    def predict(self, X):
        """Predict class labels."""
        proba = self.predict_proba(X)
        return (proba[:, 1] >= 0.5).astype(int)


# Demonstration
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # Generate classification data
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train our model
    model = LogisticLossGradientBoosting(n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    test_acc = np.mean(y_pred == y_test)
    print(f"\nTest Accuracy: {test_acc:.2%}")
```

Like regression, we need an initial prediction $F_0$ before iterating. For logistic loss, the optimal constant minimizes:
$$F_0 = \arg\min_c \sum_{i=1}^{n} L(y_i, c)$$
Derivation:
Setting the derivative to zero:
$$\sum_{i=1}^{n} (\sigma(c) - y_i) = 0$$
$$n \cdot \sigma(c) = \sum_{i=1}^{n} y_i = n \cdot \bar{y}$$
$$\sigma(c) = \bar{y} \implies c = \log\frac{\bar{y}}{1 - \bar{y}}$$
The optimal initial prediction is the log-odds of the base rate!
If 70% of examples are positive ($\bar{y} = 0.7$): $$F_0 = \log\frac{0.7}{0.3} = \log(2.33) \approx 0.85$$
This corresponds to predicting probability 0.7 for all examples initially—a sensible baseline.
The log-odds initialization automatically handles class imbalance. For rare positive classes (e.g., 1% positive), F₀ ≈ -4.6, corresponding to p ≈ 0.01. The model starts with the correct base rate and only needs to learn adjustments from there. This is much better than starting at F₀ = 0 (p = 0.5).
| Positive Rate | Initial Log-Odds F₀ | Initial Probability p₀ |
|---|---|---|
| 1% | -4.60 | 0.01 |
| 10% | -2.20 | 0.10 |
| 25% | -1.10 | 0.25 |
| 50% | 0.00 | 0.50 |
| 75% | +1.10 | 0.75 |
| 90% | +2.20 | 0.90 |
| 99% | +4.60 | 0.99 |
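The table can be reproduced directly from the formula $F_0 = \log\frac{\bar{y}}{1 - \bar{y}}$; a small sketch, with an illustrative `sigmoid` helper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reproduce the table: F0 = log(rate / (1 - rate)) and the probability it encodes
for rate in (0.01, 0.10, 0.25, 0.50, 0.75, 0.90, 0.99):
    F0 = np.log(rate / (1 - rate))
    print(f"positive rate = {rate:.2f}: F0 = {F0:+.2f}, sigmoid(F0) = {sigmoid(F0):.2f}")
```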
Logistic loss involves exponentials and logarithms, which can cause numerical issues:
Problem 1: Exponential Overflow
For large $|F|$, $e^{F}$ or $e^{-F}$ can overflow: in double precision, $e^{z}$ overflows once $z$ exceeds roughly 709, producing `inf` and then `NaN` in downstream arithmetic.
Solution: Numerically stable sigmoid
```python
import numpy as np

def stable_sigmoid(z):
    return np.where(
        z >= 0,
        1 / (1 + np.exp(-z)),        # For z >= 0
        np.exp(z) / (1 + np.exp(z))  # For z < 0
    )
```
This avoids computing $e^z$ for large positive $z$ or $e^{-z}$ for large negative $z$.
Problem 2: Log of Zero
When $p = 0$ or $p = 1$, log terms become $-\infty$.
Solution: Clip probabilities
```python
p = np.clip(p, 1e-15, 1 - 1e-15)
loss = -y * np.log(p) - (1 - y) * np.log(1 - p)
```
Problem 3: Small Hessian
When $p \approx 0$ or $p \approx 1$, the Hessian $p(1-p) \approx 0$, leading to division issues.
Solution: Add regularization or floor
```python
hessian = np.maximum(p * (1 - p), 1e-6)
```
Many gradient boosting bugs stem from numerical issues with logistic loss. Symptoms include: NaN losses, predictions stuck at 0 or 1, enormous leaf values, or training that suddenly diverges. Always use numerically stable implementations and add appropriate clipping/regularization.
```python
import numpy as np


def stable_logistic_loss(y, F, eps=1e-15):
    """
    Numerically stable logistic loss computation.
    Uses the log-sum-exp trick to avoid overflow/underflow.
    """
    # For y=1: loss = log(1 + exp(-F))
    # For y=0: loss = log(1 + exp(F))

    # Stable version of log(1 + exp(z))
    def softplus(z):
        return np.where(
            z > 20,  # For large z, log(1+exp(z)) ≈ z
            z,
            np.log1p(np.exp(np.minimum(z, 20)))
        )

    loss = y * softplus(-F) + (1 - y) * softplus(F)
    return np.mean(loss)


def stable_gradient_hessian(y, F, min_hessian=1e-6):
    """
    Compute gradient and Hessian with numerical stability.
    """
    # Stable sigmoid
    p = np.where(
        F >= 0,
        1 / (1 + np.exp(-F)),
        np.exp(F) / (1 + np.exp(F))
    )

    # Gradient: p - y
    gradient = p - y

    # Hessian: p(1-p), with floor to prevent division issues
    hessian = np.maximum(p * (1 - p), min_hessian)

    return gradient, hessian


# Test edge cases
print("Numerical Stability Test")
print("=" * 50)

test_cases = [
    ("Normal case", 1, 1.0),
    ("Confident correct", 1, 10.0),
    ("Very confident correct", 1, 100.0),
    ("Confident wrong", 1, -10.0),
    ("Very confident wrong", 1, -100.0),
]

for name, y, F in test_cases:
    loss = stable_logistic_loss(np.array([y]), np.array([F]))
    g, h = stable_gradient_hessian(np.array([y]), np.array([F]))
    print(f"{name:25s}: F={F:6.1f}, Loss={loss:.6f}, g={g[0]:.6f}, h={h[0]:.6f}")
```

Binary logistic loss extends naturally to multiclass problems with $K$ classes using softmax and cross-entropy.
Softmax for Multiple Classes:
Instead of one log-odds value, we predict $K$ values $F_1(x), \ldots, F_K(x)$. Probabilities are computed via softmax:
$$p_k(x) = \frac{e^{F_k(x)}}{\sum_{j=1}^{K} e^{F_j(x)}}$$
Multiclass Cross-Entropy Loss:
$$L(y, F) = -\sum_{k=1}^{K} \mathbf{1}_{y=k} \log(p_k)$$
For one-hot encoded labels $y \in \{0, 1\}^K$:
$$L = -\sum_{k=1}^{K} y_k \log(p_k)$$
Gradient for Multiclass:
For each class $k$:
$$\frac{\partial L}{\partial F_k} = p_k - y_k$$
This is identical to binary! Each class gets gradient (predicted probability - indicator).
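Here is a minimal sketch of the multiclass case for a single example with $K = 3$ classes, using an illustrative `softmax` helper; the per-class gradient is simply $p_k - y_k$:

```python
import numpy as np

def softmax(F):
    """Row-wise softmax over K class scores, shifted for numerical stability."""
    z = F - F.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# One sample, K = 3 classes, true class k = 2 (one-hot encoded)
F = np.array([[0.5, -1.0, 2.0]])
y = np.array([[0.0, 0.0, 1.0]])

p = softmax(F)
grad = p - y  # per-class gradient, exactly as in the binary case
print("probabilities:", np.round(p, 3))
print("gradients:    ", np.round(grad, 3))
```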
There are two main approaches to multiclass gradient boosting: (1) One-vs-All: Train K separate binary classifiers. (2) Native multiclass: Train trees that predict K-dimensional outputs. XGBoost and LightGBM support both. Native multiclass is generally more efficient and accurate.
Hessian for Multiclass:
The Hessian becomes a $K \times K$ matrix:
$$h_{ij} = \frac{\partial^2 L}{\partial F_i \partial F_j} = p_i(\mathbf{1}_{i=j} - p_j)$$
Diagonal elements: $h_{kk} = p_k(1 - p_k)$. Off-diagonal elements ($i \neq j$): $h_{ij} = -p_i p_j$.
This is the covariance matrix of a multinomial distribution!
Practical Implementation:
Most implementations use a diagonal approximation for efficiency: $$h_{kk} = p_k(1 - p_k)$$
This ignores cross-class dependencies but is much faster and works well in practice.
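To see what the approximation drops, the sketch below builds the full $K \times K$ Hessian $p_i(\mathbf{1}_{i=j} - p_j)$ for one example and compares it with its diagonal (illustrative code, not any library's internals):

```python
import numpy as np

def softmax(F):
    z = F - F.max()
    e = np.exp(z)
    return e / e.sum()

p = softmax(np.array([0.5, -1.0, 2.0]))

# Full K x K Hessian: h_ij = p_i * (1[i == j] - p_j)
H_full = np.diag(p) - np.outer(p, p)

# Diagonal approximation used by most implementations: h_kk = p_k * (1 - p_k)
H_diag = np.diag(p * (1 - p))

print("full Hessian:\n", np.round(H_full, 3))
print("diagonal approximation:\n", np.round(H_diag, 3))
```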
Library Support:
```python
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingClassifier

K = 3  # number of classes (example value)

# XGBoost
model = xgb.XGBClassifier(objective='multi:softmax', num_class=K)

# LightGBM
model = lgb.LGBMClassifier(objective='multiclass', num_class=K)

# Scikit-learn
model = GradientBoostingClassifier()  # Handles multiclass automatically
```
We've thoroughly explored logistic loss as the foundation for classification in gradient boosting. Let's consolidate the key insights:

- Classification is reframed as regression on the log-odds scale; the sigmoid maps log-odds back to probabilities.
- The gradient of logistic loss is $p - y$, so each tree fits the probability residual $y - p$.
- The Hessian is $p(1 - p)$, the Bernoulli variance; uncertain points carry the most weight in Newton-style updates.
- The optimal initial prediction is the log-odds of the base rate, which handles class imbalance automatically.
- Numerical stability (stable sigmoid, clipped probabilities, Hessian floors) matters in practice.
- Multiclass problems use softmax and cross-entropy, with the same per-class gradient $p_k - y_k$.
What's Next:
We've now covered the major built-in loss functions: squared, absolute, Huber for regression, and logistic for classification. But what if none of these fit your problem exactly? The final page explores custom loss functions—how to design, implement, and use your own loss functions in gradient boosting frameworks. This is where the true flexibility of gradient boosting shines.
You now understand logistic loss from first principles—connecting probability theory, maximum likelihood, and gradient optimization. This knowledge enables you to use gradient boosting classifiers effectively and understand their behavior at a deep level.