At the heart of every logistic regression model lie its parameters: the weight vector $\mathbf{w} \in \mathbb{R}^d$ and the bias term $b \in \mathbb{R}$ (also called the intercept). Together, these parameters transform any input feature vector into a probability of class membership.
Understanding these parameters goes beyond mere mathematical formalism. Each weight encodes how strongly a feature influences the classification decision. The bias determines the default tendency toward one class or another. Their geometric arrangement defines a decision boundary that separates the classes in feature space.
This page explores model parameters from every angle: their mathematical role, geometric interpretation, estimation procedure, and the practical challenges of finding optimal values.
By the end of this page, you will understand: (1) the mathematical role of weights and bias in logistic regression, (2) how parameters define a linear decision boundary, (3) the maximum likelihood estimation framework, (4) why regularization is essential for parameter estimation, and (5) practical considerations including initialization and feature scaling.
Let's precisely define the parameters and their roles in the logistic regression model.
The Complete Model
Given a feature vector $\mathbf{x} \in \mathbb{R}^d$ with $d$ features, the logistic regression model predicts:
$$P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \sigma\left(\sum_{j=1}^d w_j x_j + b\right)$$
where:

- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function,
- $w_j$ is the weight attached to feature $x_j$, and
- $b$ is the bias (intercept) term.
The Linear Predictor
The quantity inside the sigmoid is called the linear predictor (or logit):
$$z = \mathbf{w}^T\mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
This is a linear combination of inputs. The sigmoid then squashes this to (0, 1).
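As a minimal sketch of this two-step computation (the parameters and input below are illustrative, not from a trained model):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical parameters for d = 3 features
w = np.array([0.8, -1.2, 0.4])
b = 0.5
x = np.array([1.0, 0.5, 2.0])

z = w @ x + b   # linear predictor (log-odds): 0.8 - 0.6 + 0.8 + 0.5 = 1.5
p = sigmoid(z)  # squashed to (0, 1)
print(z, p)     # 1.5, ~0.818
```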
Parameter Count
For a problem with $d$ features:

- $d$ weights (one per feature)
- $1$ bias term

for a total of $d + 1$ parameters.
This linear parameter count (compared to, say, neural networks) is one reason logistic regression is interpretable and computationally efficient.
| Parameter | Symbol | Dimension | Role |
|---|---|---|---|
| Weight for feature j | w_j | Scalar | Controls influence of feature j on prediction |
| Weight vector | w | d × 1 | Defines direction of steepest probability change |
| Bias (intercept) | b | Scalar | Controls decision boundary offset from origin |
| Linear predictor | z = w^T x + b | Scalar | Input to sigmoid; log-odds of positive class |
Compact Notation with Augmented Vectors
We often absorb the bias into the weight vector using augmented notation:
$$\tilde{\mathbf{w}} = \begin{pmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix} \in \mathbb{R}^{d+1}, \quad \tilde{\mathbf{x}} = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \in \mathbb{R}^{d+1}$$
Then the model simplifies to:
$$z = \tilde{\mathbf{w}}^T \tilde{\mathbf{x}}$$
This notation is convenient for matrix formulations and derivations, while conceptually the bias plays a distinct role from the feature weights.
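A quick numerical check (with arbitrary illustrative parameters) confirms that the augmented form computes the same linear predictor:

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.5, 3.0])

# Augmented forms: prepend the bias to w and a constant 1 to x
w_aug = np.concatenate(([b], w))
x_aug = np.concatenate(([1.0], x))

z_plain = w @ x + b
z_aug = w_aug @ x_aug
print(z_plain, z_aug)  # identical: 0.5, 0.5
```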
The bias b can be interpreted as the log-odds of the positive class when all features are zero. If b > 0, the model has a 'prior preference' for class 1; if b < 0, for class 0. This becomes the prediction when we have no informative features.
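A small sketch of this baseline interpretation, evaluating $\sigma(b)$ for a few illustrative bias values:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# With all features zero, z = b, so the prediction is sigma(b)
for b in [-2.0, 0.0, 2.0]:
    print(f"b = {b:+.1f}  ->  baseline P(Y=1) = {sigmoid(b):.4f}")
# b < 0 favors class 0, b = 0 is neutral (0.5), b > 0 favors class 1
```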
The parameters have a beautiful geometric interpretation that illuminates how logistic regression 'sees' the classification problem.
The Weight Vector as Direction
The weight vector $\mathbf{w}$ points in the direction of steepest increase in probability of class 1. Moving in the direction of $\mathbf{w}$ increases the probability most rapidly; moving opposite to $\mathbf{w}$ decreases it.
The magnitude $\|\mathbf{w}\|$ controls how quickly probabilities transition from near 0 to near 1. A larger $\|\mathbf{w}\|$ means a sharper transition (more confident predictions); a smaller $\|\mathbf{w}\|$ means a gentler, more uncertain transition.
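A tiny illustration of this effect, using two weight vectors with the same direction but different norms (the values are arbitrary, with $b = 0$):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w_small = np.array([0.5, 0.25])
w_large = 10 * w_small          # same direction, 10x the norm
x = np.array([0.4, 0.2])        # a point slightly on the class-1 side

p_small = sigmoid(w_small @ x)  # ~0.56: mild confidence
p_large = sigmoid(w_large @ x)  # ~0.92: same point, much sharper prediction
print(p_small, p_large)
```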
The Decision Boundary
The decision boundary is the set of points where $P(Y=1|\mathbf{x}) = 0.5$, equivalently where the linear predictor equals zero:
$$\mathcal{B} = \{\mathbf{x} : \mathbf{w}^T\mathbf{x} + b = 0\}$$
This is a hyperplane in $\mathbb{R}^d$:

- In 2D ($d = 2$): a line
- In 3D ($d = 3$): a plane
- In general: a $(d-1)$-dimensional flat surface
Normal Vector and Distance
The weight vector $\mathbf{w}$ is the normal vector to the decision boundary—it's perpendicular to the boundary surface.
The signed distance from a point $\mathbf{x}$ to the decision boundary is:
$$\text{distance} = \frac{\mathbf{w}^T\mathbf{x} + b}{\|\mathbf{w}\|} = \frac{z}{\|\mathbf{w}\|}$$
Positive distance means $\mathbf{x}$ is on the class-1 side; negative means class-0 side.
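A short sketch of the signed-distance formula, reusing the illustrative parameters $\mathbf{w} = (2, 1)$, $b = -1$:

```python
import numpy as np

w = np.array([2.0, 1.0])
b = -1.0

def signed_distance(x, w, b):
    """Signed distance from x to the hyperplane w^T x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

x_pos = np.array([2.0, 2.0])    # class-1 side: z = 5 > 0
x_neg = np.array([-1.0, -1.0])  # class-0 side: z = -4 < 0
x_on = np.array([0.5, 0.0])     # on the boundary: 2*0.5 + 0 - 1 = 0

print(signed_distance(x_pos, w, b))  # positive
print(signed_distance(x_neg, w, b))  # negative
print(signed_distance(x_on, w, b))   # zero
```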
```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Define model parameters for 2D case
w = np.array([2.0, 1.0])  # Weight vector
b = -1.0                  # Bias

# Decision boundary: w^T x + b = 0
# In 2D: w1*x1 + w2*x2 + b = 0
# Solving for x2: x2 = -(w1*x1 + b) / w2

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Create a grid of points
x1_range = np.linspace(-3, 3, 100)
x2_range = np.linspace(-3, 3, 100)
X1, X2 = np.meshgrid(x1_range, x2_range)

# Compute probabilities on the grid
Z = w[0] * X1 + w[1] * X2 + b
P = sigmoid(Z)

# Left plot: Probability contours
ax1 = axes[0]
contour = ax1.contourf(X1, X2, P, levels=20, cmap='RdBu_r', alpha=0.8)
plt.colorbar(contour, ax=ax1, label='P(Y=1)')

# Decision boundary (P = 0.5, z = 0)
x1_boundary = x1_range
x2_boundary = -(w[0] * x1_boundary + b) / w[1]
ax1.plot(x1_boundary, x2_boundary, 'k-', linewidth=2,
         label='Decision Boundary (P=0.5)')

# Weight vector (normalized for visualization)
w_norm = w / np.linalg.norm(w)
origin = np.array([0, -b / w[1]])  # Point on boundary
ax1.quiver(*origin, w_norm[0], w_norm[1], color='green', scale=5,
           width=0.02, label='w direction', zorder=5)

ax1.set_xlabel('x₁')
ax1.set_ylabel('x₂')
ax1.set_title('Probability Contours and Decision Boundary')
ax1.legend(loc='upper left')
ax1.set_xlim(-3, 3)
ax1.set_ylim(-3, 3)
ax1.set_aspect('equal')

# Right plot: 3D surface (remove the placeholder 2D axes first)
axes[1].remove()
ax2 = fig.add_subplot(122, projection='3d')
ax2.plot_surface(X1, X2, P, cmap='RdBu_r', alpha=0.8, linewidth=0)
ax2.set_xlabel('x₁')
ax2.set_ylabel('x₂')
ax2.set_zlabel('P(Y=1)')
ax2.set_title('Probability Surface')

plt.tight_layout()
plt.savefig('geometric_interpretation.png', dpi=150)
plt.show()

# Print geometric quantities
print("Geometric Properties")
print("=" * 50)
print(f"Weight vector w: {w}")
print(f"Bias b: {b}")
print(f"||w||: {np.linalg.norm(w):.4f}")
print(f"Normal direction: {w / np.linalg.norm(w)}")
print(f"Distance from origin to boundary: {abs(b) / np.linalg.norm(w):.4f}")
```

Large $\|\mathbf{w}\|$ creates sharp probability transitions (confident predictions near the boundary). Small $\|\mathbf{w}\|$ creates gradual transitions (uncertain predictions spread over a wide region). Regularization, which penalizes large $\|\mathbf{w}\|$, encourages gentler, less overconfident probability estimates.
Each weight $w_j$ controls how feature $x_j$ contributes to the final prediction. The contribution mechanism is quite intuitive once we understand the linear predictor.
Additive Contributions in Log-Odds Space
The linear predictor is a sum of contributions:
$$z = \underbrace{b}_{\text{baseline}} + \underbrace{w_1 x_1}_{\text{feature 1}} + \underbrace{w_2 x_2}_{\text{feature 2}} + \cdots + \underbrace{w_d x_d}_{\text{feature d}}$$
Each term $w_j x_j$ is the contribution of feature $j$ to the log-odds. The contributions sum to form the total log-odds, which is then transformed to probability.
Interpreting Weight Signs and Magnitudes
| $w_j$ | Effect | Interpretation |
|---|---|---|
| $w_j > 0$ | Increases log-odds | Higher $x_j$ → more likely class 1 |
| $w_j < 0$ | Decreases log-odds | Higher $x_j$ → more likely class 0 |
| $w_j = 0$ | No effect | Feature $j$ is irrelevant |
| $\lvert w_j \rvert$ large | Strong effect per unit of $x_j$ | Feature $j$ heavily influences the prediction |
| $\lvert w_j \rvert$ small | Weak effect per unit of $x_j$ | Feature $j$ barely influences the prediction |
Feature Attribution Example
Consider a spam classifier with features 'contains_offer', 'from_contact', 'has_attachment':
$$z = -1.0 + 2.5 \cdot \text{contains\_offer} - 3.0 \cdot \text{from\_contact} + 0.5 \cdot \text{has\_attachment}$$
For an email with contains_offer=1, from_contact=0, has_attachment=1: $$z = -1.0 + 2.5(1) - 3.0(0) + 0.5(1) = 2.0$$ $$P(\text{spam}) = \sigma(2.0) \approx 0.88$$
We can decompose why: the 'contains_offer' feature contributed most (+2.5), pushing toward spam.
```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Spam classifier example
feature_names = ['contains_offer', 'from_contact', 'has_attachment',
                 'link_count', 'word_count']
weights = np.array([2.5, -3.0, 0.5, 0.3, -0.01])
bias = -1.0

def predict_with_attribution(x, feature_names, weights, bias):
    """Predict and show each feature's contribution."""
    contributions = weights * x
    z = bias + np.sum(contributions)
    prob = sigmoid(z)

    print("Feature Attribution Analysis")
    print("=" * 60)
    print(f"{'Feature':<20} | {'Value':>8} | {'Weight':>8} | {'Contrib':>10}")
    print("-" * 60)
    print(f"{'(bias)':<20} | {'—':>8} | {bias:>8.2f} | {bias:>10.2f}")
    for fname, val, w, c in zip(feature_names, x, weights, contributions):
        print(f"{fname:<20} | {val:>8.2f} | {w:>8.2f} | {c:>10.2f}")
    print("-" * 60)
    print(f"{'Total log-odds (z)':<20} | {'':>8} | {'':>8} | {z:>10.2f}")
    print(f"{'P(spam)':<20} | {'':>8} | {'':>8} | {prob:>10.4f}")
    print()
    return prob, contributions

# Example 1: Likely spam
print(">>> Example 1: Suspicious email")
x1 = np.array([1, 0, 1, 5, 50])  # has offer, not from contact, attachment, 5 links
prob1, _ = predict_with_attribution(x1, feature_names, weights, bias)

# Example 2: Likely legitimate
print(">>> Example 2: Email from known contact")
x2 = np.array([0, 1, 1, 1, 200])  # no offer, from contact, attachment, 1 link
prob2, _ = predict_with_attribution(x2, feature_names, weights, bias)

# Feature importance by magnitude
print(">>> Feature Importance (by |weight|)")
print("-" * 40)
importance_order = np.argsort(-np.abs(weights))
for idx in importance_order:
    print(f"{feature_names[idx]:<20}: |w| = {np.abs(weights[idx]):.4f}")
```

Raw weight magnitudes are only comparable when features are on the same scale. If 'word_count' ranges 0-1000 while 'contains_offer' is binary 0/1, comparing their weights directly is misleading. Always standardize features before comparing weight magnitudes, or use standardized coefficients.
Given training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, where do the optimal parameters come from? The answer is Maximum Likelihood Estimation (MLE)—we find parameters that make the observed data most probable.
The Likelihood Function
For a single example $(\mathbf{x}, y)$, the probability of observing that outcome is:
$$P(Y=y | \mathbf{x}; \mathbf{w}, b) = \begin{cases} \sigma(z) & \text{if } y = 1 \\ 1 - \sigma(z) & \text{if } y = 0 \end{cases}$$
This can be written compactly as:
$$P(Y=y | \mathbf{x}; \mathbf{w}, b) = \sigma(z)^y \cdot (1 - \sigma(z))^{1-y}$$
Assuming independent observations, the likelihood (probability of all data) is:
$$L(\mathbf{w}, b) = \prod_{i=1}^n \sigma(z_i)^{y_i} \cdot (1 - \sigma(z_i))^{1-y_i}$$
where $z_i = \mathbf{w}^T\mathbf{x}_i + b$.
The Log-Likelihood (Cross-Entropy Loss)
Taking the logarithm converts products to sums:
$$\ell(\mathbf{w}, b) = \sum_{i=1}^n \left[ y_i \log \sigma(z_i) + (1-y_i) \log(1 - \sigma(z_i)) \right]$$
Maximizing this is equivalent to minimizing the negative log-likelihood, which is the binary cross-entropy loss:
$$\mathcal{L}(\mathbf{w}, b) = -\frac{1}{n} \sum_{i=1}^n \left[ y_i \log \sigma(z_i) + (1-y_i) \log(1 - \sigma(z_i)) \right]$$
This is the standard loss function for logistic regression.
```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid."""
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def log_likelihood(w, b, X, y):
    """
    Compute log-likelihood for logistic regression.
    Higher is better.
    """
    z = X @ w + b
    p = sigmoid(z)
    # Clip probabilities to avoid log(0)
    eps = 1e-15
    p = np.clip(p, eps, 1 - eps)
    # Log-likelihood
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return ll

def cross_entropy_loss(w, b, X, y):
    """
    Compute mean cross-entropy loss. Lower is better.
    Equals -log_likelihood / n.
    """
    return -log_likelihood(w, b, X, y) / len(y)

# Generate synthetic data
np.random.seed(42)
n = 1000
X = np.random.randn(n, 2)
true_w = np.array([1.5, -1.0])
true_b = 0.5
z_true = X @ true_w + true_b
y = (np.random.rand(n) < sigmoid(z_true)).astype(int)

# Show loss landscape along one dimension
w_test_range = np.linspace(-1, 4, 50)
losses = []
for w1 in w_test_range:
    w_temp = np.array([w1, -1.0])
    loss = cross_entropy_loss(w_temp, 0.5, X, y)
    losses.append(loss)

print("Maximum Likelihood Estimation")
print("=" * 50)
print(f"True parameters: w = {true_w}, b = {true_b}")
print(f"Loss at true parameters: {cross_entropy_loss(true_w, true_b, X, y):.4f}")
print(f"Loss at w = [0, -1]: {cross_entropy_loss(np.array([0, -1]), 0.5, X, y):.4f}")
print(f"Loss at w = [3, -1]: {cross_entropy_loss(np.array([3, -1]), 0.5, X, y):.4f}")

# Verify: minimum loss near true w1 = 1.5
min_idx = np.argmin(losses)
print(f"Minimum loss at w1 = {w_test_range[min_idx]:.2f} (true: 1.5)")
```

Unlike linear regression (where the normal equations give a closed-form solution), logistic regression's nonlinear sigmoid means we must optimize numerically. The good news: the loss function is convex, so any local minimum is the global minimum, and gradient-based methods converge reliably.
To find optimal parameters, we use gradient descent—iteratively updating parameters in the direction that decreases the loss.
Gradient Derivation
The gradient of the cross-entropy loss with respect to parameters has an elegant form. For a single example:
$$\frac{\partial \mathcal{L}}{\partial w_j} = (\hat{p} - y) \cdot x_j$$
$$\frac{\partial \mathcal{L}}{\partial b} = (\hat{p} - y)$$
where $\hat{p} = \sigma(z)$ is the predicted probability.
In matrix form for all examples:
$$\nabla_\mathbf{w} \mathcal{L} = \frac{1}{n} \mathbf{X}^T (\hat{\mathbf{p}} - \mathbf{y})$$
$$\nabla_b \mathcal{L} = \frac{1}{n} \sum_{i=1}^n (\hat{p}_i - y_i)$$
The gradient is the average error times the feature values—beautifully intuitive.
The Update Rule
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \cdot \nabla_\mathbf{w} \mathcal{L}$$ $$b^{(t+1)} = b^{(t)} - \eta \cdot \nabla_b \mathcal{L}$$
where $\eta$ is the learning rate, a hyperparameter controlling step size.
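One way to sanity-check the gradient formulas above is to compare them against finite differences; this sketch (using arbitrary random data, not from the examples elsewhere on this page) does exactly that for the weight gradient:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, b, X, y):
    """Mean cross-entropy loss."""
    p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.2

# Analytic gradient: (1/n) X^T (p_hat - y)
error = sigmoid(X @ w + b) - y
grad_w = X.T @ error / len(y)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = eps
    grad_fd[j] = (loss(w + e, b, X, y) - loss(w - e, b, X, y)) / (2 * eps)

print(np.max(np.abs(grad_w - grad_fd)))  # tiny: the formulas agree
```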
```python
import numpy as np

def sigmoid(z):
    return np.where(z >= 0,
                    1 / (1 + np.exp(-z)),
                    np.exp(z) / (1 + np.exp(z)))

def compute_loss_and_gradient(w, b, X, y):
    """Compute loss and gradients for logistic regression."""
    n = len(y)
    z = X @ w + b
    p_hat = sigmoid(z)

    # Cross-entropy loss
    eps = 1e-15
    p_clipped = np.clip(p_hat, eps, 1 - eps)
    loss = -np.mean(y * np.log(p_clipped) + (1 - y) * np.log(1 - p_clipped))

    # Gradients
    error = p_hat - y
    grad_w = (1 / n) * X.T @ error
    grad_b = (1 / n) * np.sum(error)

    return loss, grad_w, grad_b

def gradient_descent(X, y, learning_rate=0.1, n_iterations=1000, verbose=True):
    """Train logistic regression via gradient descent."""
    n_features = X.shape[1]

    # Initialize parameters
    w = np.zeros(n_features)
    b = 0.0
    losses = []

    for i in range(n_iterations):
        loss, grad_w, grad_b = compute_loss_and_gradient(w, b, X, y)
        losses.append(loss)

        # Update parameters
        w = w - learning_rate * grad_w
        b = b - learning_rate * grad_b

        if verbose and (i + 1) % 100 == 0:
            print(f"Iteration {i+1:4d}: Loss = {loss:.6f}")

    return w, b, losses

# Generate data
np.random.seed(42)
n = 1000
X = np.random.randn(n, 3)
true_w = np.array([1.0, -0.5, 0.8])
true_b = 0.3
y = (sigmoid(X @ true_w + true_b) > np.random.rand(n)).astype(int)

# Train
print("Gradient Descent Training")
print("=" * 50)
w_learned, b_learned, losses = gradient_descent(X, y, learning_rate=0.5,
                                                n_iterations=500)

print(f"True parameters: w = {true_w}, b = {true_b}")
print(f"Learned parameters: w = {np.round(w_learned, 4)}, b = {b_learned:.4f}")

# Accuracy
predictions = (sigmoid(X @ w_learned + b_learned) > 0.5).astype(int)
accuracy = np.mean(predictions == y)
print(f"Training accuracy: {accuracy:.2%}")
```

The gradient (p̂ - y) · x says: if we predicted too high (p̂ > y), decrease the weight for positive features and increase for negative features. If we predicted too low (p̂ < y), do the opposite. Features with larger magnitudes get larger updates. This makes perfect intuitive sense.
Pure maximum likelihood can lead to problems, especially with linearly separable data or high-dimensional features. Regularization adds a penalty on parameter magnitudes to prevent overfitting.
L2 Regularization (Ridge)
The regularized loss function is:
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 = \mathcal{L} + \frac{\lambda}{2} \sum_{j=1}^d w_j^2$$
where $\lambda > 0$ is the regularization strength (often written as $1/C$ in sklearn).
The gradient becomes: $$\nabla_\mathbf{w} \mathcal{L}_{\text{reg}} = \nabla_\mathbf{w} \mathcal{L} + \lambda \mathbf{w}$$
This shrinks weights toward zero, reducing model complexity.
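The L2 gradient update can equivalently be read as "weight decay": shrink the weights multiplicatively, then take the usual data-gradient step. A small sketch with purely illustrative numbers (`grad_data` stands in for a computed data gradient):

```python
import numpy as np

# One L2-regularized gradient step (illustrative numbers)
w = np.array([1.0, -2.0, 0.5])
grad_data = np.array([0.1, -0.2, 0.05])  # hypothetical data gradient
lam, eta = 0.5, 0.1                      # regularization strength, learning rate

# Standard form: step along the regularized gradient
w_new = w - eta * (grad_data + lam * w)

# Equivalent "weight decay" form: shrink w, then take the data step
w_decay = (1 - eta * lam) * w - eta * grad_data

print(np.allclose(w_new, w_decay))  # True: the two views are identical
```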
L1 Regularization (Lasso)
$$\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda \|\mathbf{w}\|_1 = \mathcal{L} + \lambda \sum_{j=1}^d |w_j|$$
L1 regularization promotes sparsity—driving some weights exactly to zero, effectively performing feature selection.
Why Regularization Matters for Logistic Regression
For linearly separable data, MLE solutions diverge to $\pm \infty$. The sigmoid can achieve perfect classification by making confidence arbitrarily high, but this leads to:

- weights that grow without bound as optimization continues,
- predicted probabilities saturating at exactly 0 or 1 (extreme overconfidence), and
- numerical instability and failure of the solver to converge.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Create linearly separable data (dangerous for unregularized LR)
np.random.seed(42)
n = 100
X = np.vstack([
    np.random.randn(n//2, 2) + np.array([2, 2]),
    np.random.randn(n//2, 2) + np.array([-2, -2])
])
y = np.array([1] * (n//2) + [0] * (n//2))

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compare different regularization strengths
print("Effect of Regularization Strength on Parameters")
print("=" * 70)
print(f"{'C (1/λ)':<10} | {'||w||₂':>10} | {'w':>25} | {'Train Acc':>10}")
print("-" * 70)

for C in [1000, 10, 1, 0.1, 0.01]:
    model = LogisticRegression(C=C, solver='lbfgs', max_iter=1000)
    model.fit(X_scaled, y)
    w = model.coef_[0]
    w_norm = np.linalg.norm(w)
    accuracy = model.score(X_scaled, y)
    print(f"{C:<10} | {w_norm:>10.4f} | "
          f"[{w[0]:>10.4f}, {w[1]:>10.4f}] | {accuracy:>10.2%}")

# Show L1 vs L2
print("\n" + "=" * 70)
print("L1 (Lasso) vs L2 (Ridge) Regularization")
print("=" * 70)

# Add some irrelevant features
X_noisy = np.hstack([X_scaled, np.random.randn(n, 5)])

for penalty, solver in [('l2', 'lbfgs'), ('l1', 'saga')]:
    model = LogisticRegression(C=0.1, penalty=penalty, solver=solver,
                               max_iter=5000)
    model.fit(X_noisy, y)
    w = model.coef_[0]
    n_zero = np.sum(np.abs(w) < 0.01)
    print(f"{penalty.upper()} Regularization:")
    print(f"  Weights: {np.round(w, 3)}")
    print(f"  Weights near zero: {n_zero}/7")
```

When data is perfectly linearly separable, unregularized logistic regression has no finite MLE solution—weights grow infinitely to achieve 'perfect' sigmoid predictions of exactly 0 or 1. Regularization (or early stopping) is not optional in such cases; it's necessary for the algorithm to converge.
Successful logistic regression requires attention to several practical details beyond the core algorithm.
Feature Scaling
Gradient descent converges faster when features are on similar scales. Standardization ($x \to \frac{x-\mu}{\sigma}$) is the most common preprocessing:

- it conditions the loss surface, so gradient-based solvers converge in fewer iterations,
- it makes weight magnitudes comparable across features, and
- it ensures regularization penalizes all features evenly.
Initialization
Typical initialization:

- Weights: all zeros (the loss is convex, so no symmetry breaking is needed) or small random values
- Bias: zero, or the log prior odds of the positive class
For imbalanced datasets, initializing bias to reflect prior probabilities speeds convergence.
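A sketch of this trick: for a hypothetical dataset with a 5% positive rate, setting $b$ to the log prior odds makes the initial prediction (with zero weights) match the base rate exactly:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical imbalanced labels: 5% positives
y = np.array([1] * 50 + [0] * 950)
p = y.mean()

# Initialize b to the log prior odds, so sigma(b) equals the base rate
b0 = np.log(p / (1 - p))

print(b0, sigmoid(b0))  # sigma(b0) recovers the 5% base rate
```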
Convergence Criteria
Stop training when:

- the change in loss between iterations falls below a tolerance,
- the gradient norm falls below a threshold, or
- a maximum iteration count is reached.
Advanced Optimizers
Beyond vanilla gradient descent:

- Newton's method / IRLS: uses second-order curvature information for faster convergence
- L-BFGS: a quasi-Newton method, the default solver in scikit-learn's `LogisticRegression`
- SGD and mini-batch variants: scale to datasets too large to fit in memory
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Simulated data
np.random.seed(42)
n = 500
X = np.random.randn(n, 10)
true_w = np.array([1, -0.5, 0.8, 0, 0, 0, 0.3, -0.2, 0, 0])  # Sparse weights
y = (1 / (1 + np.exp(-(X @ true_w + 0.3))) > np.random.rand(n)).astype(int)

print("Practical Logistic Regression Pipeline")
print("=" * 60)

# Create pipeline with preprocessing
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Grid search for best hyperparameters
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10, 100],
    'classifier__penalty': ['l2']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)

print(f"Best C: {grid_search.best_params_['classifier__C']}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")

# Examine final model
best_model = grid_search.best_estimator_
coefficients = best_model.named_steps['classifier'].coef_[0]

print("Learned coefficients (scaled features):")
for i, (true, learned) in enumerate(zip(true_w, coefficients)):
    marker = "✓" if (np.abs(true) > 0.1) == (np.abs(learned) > 0.05) else "✗"
    print(f"  Feature {i}: true={true:>6.2f}, learned={learned:>6.3f} {marker}")
```

Always use sklearn's Pipeline to chain preprocessing with the classifier. This ensures that scaling parameters are fit only on training data during cross-validation, preventing data leakage. It also makes deployment simpler—the entire pipeline can be serialized as one object.
We've comprehensively explored the parameters of logistic regression—from their mathematical definition through their geometric meaning to their estimation and regularization. Let's consolidate the key insights:

- The weights $\mathbf{w}$ and bias $b$ fully determine the model: $P(Y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)$.
- Geometrically, $\mathbf{w}$ is the normal to the linear decision boundary, and $\|\mathbf{w}\|$ controls how sharply probabilities transition across it.
- Parameters are estimated by maximum likelihood, equivalently by minimizing the convex cross-entropy loss with gradient-based methods.
- Regularization keeps weights finite (essential for separable data) and controls overfitting; L1 additionally induces sparsity.
- In practice: scale features, initialize sensibly, and tune regularization strength via cross-validation.
What's Next:
With parameters understood, we now examine the decision boundary in detail—how the linear equation w^T x + b = 0 defines class separation, how it relates to margin and confidence, and how to visualize and interpret decision surfaces in various dimensions.
You now have a thorough understanding of logistic regression parameters—what they are, what they mean geometrically, how they're learned, and how to work with them in practice. This foundation is essential for model interpretation and debugging.