With the mathematical foundations of multinomial logistic regression established—softmax, cross-entropy, and multi-class strategies—we now face the engineering reality: How do we train these models at scale?
Real-world multi-class problems can involve millions of training samples, high-dimensional feature spaces, and thousands of output classes. Understanding the computational complexity, optimization algorithms, and practical tradeoffs is essential for applying multi-class classification in production settings.
By the end of this page, you will understand: the computational complexity of training and inference; gradient descent variants for optimization; mini-batch training dynamics; regularization techniques; handling very large K (thousands of classes); and practical implementation considerations for production deployment.
Let's analyze the computational cost of multinomial logistic regression with $K$ classes, $d$ features, and $n$ training samples.
Parameter Count
Weight matrix $W \in \mathbb{R}^{K \times d}$ and bias vector $\mathbf{b} \in \mathbb{R}^K$: $$\text{Total parameters} = K \cdot d + K = K(d + 1)$$
For reference class parameterization: $(K-1)(d+1)$ parameters.
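For example, $K = 10$ classes with $d = 100$ features gives $10 \times 101 = 1{,}010$ parameters, or $9 \times 101 = 909$ with a reference class.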
Forward Pass (Single Sample)
Compute logits: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ — matrix-vector product, $O(K \cdot d)$
Compute softmax: $\mathbf{p} = \text{softmax}(\mathbf{z})$ — $K$ exponentials and a normalization, $O(K)$
Total forward pass per sample: $O(K \cdot d)$
For $n$ samples: $O(n \cdot K \cdot d)$
Backward Pass (Gradient Computation)
For a single sample with true class $c$:
Compute gradient w.r.t. logits: $\nabla_{\mathbf{z}} \mathcal{L} = \hat{\mathbf{p}} - \mathbf{y}$ — $O(K)$
Compute gradient w.r.t. weights: $\nabla_W \mathcal{L} = (\hat{\mathbf{p}} - \mathbf{y}) \mathbf{x}^T$ — outer product, $O(K \cdot d)$
Compute gradient w.r.t. bias: $\nabla_{\mathbf{b}} \mathcal{L} = \hat{\mathbf{p}} - \mathbf{y}$ — $O(K)$
Total backward pass per sample: $O(K \cdot d)$
Full Training Epoch
One pass through $n$ samples:
Total per epoch: $O(n \cdot K \cdot d)$
With $T$ epochs: $O(T \cdot n \cdot K \cdot d)$
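As a rough sanity check, the short sketch below plugs in illustrative values for $n$, $d$, $K$, and $T$ (assumed for this example, not taken from the text) and counts parameters and approximate multiply-accumulate operations:

```python
# Back-of-envelope cost estimate (illustrative values, not from the text)
n, d, K, T = 100_000, 1_000, 100, 50    # samples, features, classes, epochs

params = K * (d + 1)                    # weights + biases
macs_per_sample = 2 * K * d             # ~K*d multiply-adds forward, ~K*d backward
macs_per_epoch = n * macs_per_sample
total_macs = T * macs_per_epoch

print(f"parameters:       {params:,}")          # 100,100
print(f"per epoch (MACs): {macs_per_epoch:,}")  # ~2e10
print(f"total (MACs):     {total_macs:,}")      # ~1e12
```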
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Store parameters | — | $O(K \cdot d)$ |
| Forward pass (1 sample) | $O(K \cdot d)$ | $O(K)$ activations |
| Backward pass (1 sample) | $O(K \cdot d)$ | $O(K \cdot d)$ gradients |
| Forward pass (batch B) | $O(B \cdot K \cdot d)$ | $O(B \cdot K)$ |
| Full epoch | $O(n \cdot K \cdot d)$ | $O(K \cdot d)$ |
| Inference (1 sample) | $O(K \cdot d)$ | $O(K)$ |
Complexity is linear in $K$, which becomes problematic for very large class counts. For $K = 50,000$ (LLM vocabulary), a single softmax computation involves 50K exponentials and a sum over 50K terms. Specialized techniques (hierarchical softmax, sampled softmax) address this.
Training multinomial logistic regression requires minimizing the cross-entropy loss. Several optimization algorithms are available, each with different tradeoffs.
Batch Gradient Descent
Classic approach: compute gradient on all training data, then update.
$$\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \eta \cdot \frac{1}{n} \sum_{i=1}^{n} \nabla_{\mathbf{\theta}} \mathcal{L}_i$$
Properties:
- Uses the exact gradient of the full objective, so updates are deterministic
- Each update costs $O(n \cdot K \cdot d)$ — expensive for large $n$
- Converges smoothly, but requires a full pass over the dataset per update
Stochastic Gradient Descent (SGD)
Update after each sample (or small batch):
$$\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \eta \cdot \nabla_{\mathbf{\theta}} \mathcal{L}_i$$
Properties:
- Each update costs only $O(K \cdot d)$, so updates are cheap and frequent
- Gradients are noisy estimates, so the loss fluctuates and a decaying learning rate is needed
- Well suited to very large or streaming (online) datasets
Mini-batch SGD
The practical standard: update on batches of size $B$:
$$\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - \eta \cdot \frac{1}{B} \sum_{i \in \text{batch}} \nabla_{\mathbf{\theta}} \mathcal{L}_i$$
Properties:
- Balances gradient noise against per-update cost: $O(B \cdot K \cdot d)$ per update
- Maps naturally onto vectorized CPU/GPU hardware (matrix-matrix products)
- The de facto standard for training at scale; typical $B$ is 32-1024
Second-Order Methods
Use curvature (Hessian) information for faster convergence:
Newton's Method: $$\mathbf{\theta}^{(t+1)} = \mathbf{\theta}^{(t)} - H^{-1} \nabla_{\mathbf{\theta}} \mathcal{L}$$
where $H$ is the Hessian matrix.
Challenge: Hessian is $Kd \times Kd$—infeasible to compute/invert for large $K$ or $d$.
L-BFGS (Limited-memory BFGS):
- Approximates curvature from the last $m$ gradient differences instead of forming the Hessian explicitly
- Memory cost $O(m \cdot K \cdot d)$ rather than $O((Kd)^2)$
- The default solver in scikit-learn's `LogisticRegression`; very effective on small-to-medium datasets
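For comparison with the SGD variants above, here is a minimal sketch of fitting the same regularized cross-entropy objective with SciPy's L-BFGS implementation; the toy data and the helper function are illustrative, not from the original text.

```python
import numpy as np
from scipy.optimize import minimize

def nll_and_grad(theta, X, y, K, l2=1e-3):
    """L2-regularized cross-entropy loss and gradient, flattened for the optimizer."""
    n, d = X.shape
    W, b = theta[:K * d].reshape(K, d), theta[K * d:]
    logits = X @ W.T + b
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    log_Z = np.log(np.exp(logits).sum(axis=1))
    loss = np.mean(log_Z - logits[np.arange(n), y]) + 0.5 * l2 * np.sum(W**2)
    P = np.exp(logits - log_Z[:, None])                  # softmax probabilities
    P[np.arange(n), y] -= 1                              # (p - y)
    grad_W = P.T @ X / n + l2 * W
    grad_b = P.mean(axis=0)
    return loss, np.concatenate([grad_W.ravel(), grad_b])

# Toy data: 3 classes, 5 features
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = rng.integers(0, 3, size=200)
K, d = 3, 5

res = minimize(nll_and_grad, np.zeros(K * d + K), args=(X, y, K),
               jac=True, method='L-BFGS-B')
print(res.fun, res.nit)   # final loss and number of L-BFGS iterations
```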
| Algorithm | Per-iteration Cost | Iterations to Converge | Best For |
|---|---|---|---|
| Batch GD | $O(n \cdot K \cdot d)$ | Low | Small datasets |
| SGD | $O(K \cdot d)$ | High | Very large datasets, online |
| Mini-batch SGD | $O(B \cdot K \cdot d)$ | Medium-High | Large datasets, GPU |
| L-BFGS | $O(n \cdot K \cdot d + m \cdot K \cdot d)$ | Very Low | Medium datasets, sklearn default |
| Newton | $O(n \cdot K \cdot d + (Kd)^3)$ | Very Low | Small $K$ and $d$ |
For deep learning with softmax output layers, adaptive methods like Adam, AdaGrad, and RMSprop are standard. They adapt learning rates per-parameter based on gradient history, combining benefits of momentum and adaptive learning rates. For pure logistic regression, L-BFGS often outperforms adaptive methods.
Mini-batch training is the workhorse of modern machine learning. Understanding its dynamics is crucial for effective training.
Batch Size Selection
The batch size $B$ creates a fundamental tradeoff:
| Smaller B (e.g., 16-32) | Larger B (e.g., 256-1024) |
|---|---|
| More gradient noise | Smoother gradients |
| More frequent updates | Fewer updates per epoch |
| Often better generalization | Often worse generalization (sharper minima) |
| Lower GPU utilization | Better parallelism |
| Smaller memory footprint | Larger memory required |
The Generalization-Optimization Tradeoff:
Recent research (Keskar et al., 2017) suggests smaller batches often find 'flatter' minima that generalize better. Larger batches converge to 'sharper' minima with poorer generalization.
Linear Scaling Rule: When increasing batch size by factor $k$, also increase learning rate by factor $k$ (approximately). This maintains similar training dynamics.
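For example, if training is stable at $B = 64$ with $\eta = 0.1$, scaling to $B = 256$ (a factor of 4) suggests trying $\eta \approx 0.4$, often combined with a short warmup period.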
```python
import numpy as np

def softmax_stable(z):
    """Numerically stable softmax."""
    z_max = np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z - z_max)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

def cross_entropy_loss(logits, labels):
    """Compute cross-entropy loss."""
    n = logits.shape[0]
    log_probs = logits - np.max(logits, axis=1, keepdims=True)
    log_probs = log_probs - np.log(np.sum(np.exp(log_probs), axis=1, keepdims=True))
    loss = -np.mean(log_probs[np.arange(n), labels])
    return loss

def minibatch_train_multinomial_lr(
        X, y, K, batch_size=64, epochs=100,
        learning_rate=0.1, l2_reg=0.01, verbose=True):
    """
    Train multinomial logistic regression with mini-batch SGD.

    Args:
        X: Features (n, d)
        y: Labels (n,) integers in [0, K-1]
        K: Number of classes
        batch_size: Mini-batch size
        epochs: Number of training epochs
        learning_rate: Base learning rate
        l2_reg: L2 regularization strength
        verbose: Print progress

    Returns:
        W: Weight matrix (K, d)
        b: Bias vector (K,)
        history: Training history
    """
    n, d = X.shape

    # Initialize parameters (small random)
    W = 0.01 * np.random.randn(K, d)
    b = np.zeros(K)

    # Training history
    history = {'loss': [], 'accuracy': []}

    # Number of batches per epoch
    n_batches = int(np.ceil(n / batch_size))

    for epoch in range(epochs):
        # Shuffle data each epoch
        perm = np.random.permutation(n)
        X_shuffled = X[perm]
        y_shuffled = y[perm]

        epoch_loss = 0

        for batch_idx in range(n_batches):
            # Get batch
            start = batch_idx * batch_size
            end = min(start + batch_size, n)
            X_batch = X_shuffled[start:end]
            y_batch = y_shuffled[start:end]
            actual_batch_size = end - start

            # Forward pass
            logits = X_batch @ W.T + b      # (batch, K)
            probs = softmax_stable(logits)  # (batch, K)

            # Compute batch loss
            batch_loss = cross_entropy_loss(logits, y_batch)
            batch_loss += 0.5 * l2_reg * np.sum(W**2)  # L2 penalty
            epoch_loss += batch_loss * actual_batch_size

            # Backward pass: gradient = (p - y) for each sample
            grad_logits = probs.copy()  # (batch, K)
            grad_logits[np.arange(actual_batch_size), y_batch] -= 1
            grad_logits /= actual_batch_size

            # Gradients w.r.t. W and b
            grad_W = grad_logits.T @ X_batch  # (K, d)
            grad_W += l2_reg * W              # L2 regularization gradient
            grad_b = grad_logits.sum(axis=0)  # (K,)

            # Parameter update (SGD)
            W -= learning_rate * grad_W
            b -= learning_rate * grad_b

        # Epoch statistics
        epoch_loss /= n

        # Compute accuracy
        logits_all = X @ W.T + b
        predictions = np.argmax(logits_all, axis=1)
        accuracy = np.mean(predictions == y)

        history['loss'].append(epoch_loss)
        history['accuracy'].append(accuracy)

        if verbose and (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs}: "
                  f"Loss={epoch_loss:.4f}, Acc={accuracy:.4f}")

    return W, b, history

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # Generate data
    X, y = make_classification(
        n_samples=5000, n_features=20, n_informative=10,
        n_classes=5, n_clusters_per_class=1, random_state=42
    )

    # Split and scale
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Train
    print("Training with mini-batch SGD...")
    W, b, history = minibatch_train_multinomial_lr(
        X_train, y_train, K=5, batch_size=64, epochs=100,
        learning_rate=0.5, l2_reg=0.001
    )

    # Test accuracy
    test_logits = X_test @ W.T + b
    test_preds = np.argmax(test_logits, axis=1)
    test_acc = np.mean(test_preds == y_test)
    print(f"\nTest Accuracy: {test_acc:.4f}")
```

Learning Rate Scheduling
Fixed learning rates often underperform. Common schedules:
Step decay: Reduce LR by factor every $N$ epochs $$\eta_t = \eta_0 \cdot \gamma^{\lfloor t/N \rfloor}$$
Exponential decay: Smooth continuous reduction $$\eta_t = \eta_0 \cdot e^{-\lambda t}$$
Cosine annealing: Smooth cycle $$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\frac{\pi t}{T}))$$
Warmup + decay: Start low, ramp up, then decay (common in deep learning)
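A minimal sketch of these schedules as plain functions of the epoch index `t` (the constants are illustrative assumptions, not values from the text):

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, N=20):
    """Multiply the learning rate by gamma every N epochs."""
    return eta0 * (gamma ** (t // N))

def exponential_decay(t, eta0=0.1, lam=0.05):
    """Smooth exponential decay."""
    return eta0 * math.exp(-lam * t)

def cosine_annealing(t, T=100, eta_min=1e-4, eta_max=0.1):
    """Cosine schedule from eta_max down to eta_min over T epochs."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

# Example: learning rate at a few epochs under each schedule
for t in (0, 25, 50, 75):
    print(t, round(step_decay(t), 4), round(exponential_decay(t), 4),
          round(cosine_annealing(t), 4))
```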
Choosing the initial learning rate:
- Start from a moderate value (e.g., 0.1 for plain SGD on standardized features) and adjust by factors of 3-10
- If the loss diverges or oscillates wildly, decrease it; if training is stable but slow, increase it
- A short learning-rate sweep over a few epochs is usually enough to bracket a good value
Regularization is crucial for preventing overfitting, especially with high-dimensional features or limited data.
L2 Regularization (Ridge)
Add squared weight penalty: $$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{CE}} + \frac{\lambda}{2} \sum_{k=1}^{K} \|\mathbf{w}_k\|_2^2$$
Properties:
- Shrinks all weights smoothly toward zero but rarely to exactly zero
- Keeps every feature in the model; works well when many features are weakly informative
- Differentiable everywhere, so it works with any solver; the default penalty in scikit-learn
In scikit-learn, the regularization strength is specified as `C = 1/λ` (larger `C` means weaker regularization).
L1 Regularization (Lasso)
$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{CE}} + \lambda \sum_{k=1}^{K} \|\mathbf{w}_k\|_1$$
Properties:
- Drives many weights exactly to zero, performing implicit feature selection
- Produces sparse, more interpretable models and smaller artifacts
- Non-differentiable at zero, so it requires solvers such as `saga` or `liblinear` in scikit-learn
Elastic Net
Combines L1 and L2: $$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{CE}} + \lambda_1 \|W\|_1 + \lambda_2 \|W\|_2^2$$
Benefits:
- Retains the sparsity of L1 with the stability of L2
- Handles groups of correlated features more gracefully than L1 alone
In scikit-learn, the `l1_ratio` parameter controls the L1/L2 mix (with `penalty='elasticnet'` and the `saga` solver).
Early Stopping
Implicit regularization through limited training: monitor performance on a held-out validation set and stop when it stops improving.
Properties:
- Requires no change to the loss function, only a validation split and a patience criterion
- Limits effective model capacity by limiting how far weights move from their initialization
- Essentially free to combine with any of the explicit penalties above
Dropout (for Neural Networks with Softmax)
Randomly zero activations during training: each unit is dropped with some probability $p$ and activations are rescaled at test time. Dropout is not used for plain logistic regression, but it is standard in neural networks that end in a softmax layer.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import numpy as np

def compare_regularization(X, y, cv=5):
    """
    Compare different regularization strategies.
    """
    results = {}

    # No regularization (very high C)
    model_none = LogisticRegression(
        C=1e10, multi_class='multinomial',
        solver='lbfgs', max_iter=1000, random_state=42
    )
    scores = cross_val_score(model_none, X, y, cv=cv)
    results['No regularization'] = (scores.mean(), scores.std())

    # L2 regularization (default)
    for C in [0.01, 0.1, 1.0, 10.0]:
        model = LogisticRegression(
            C=C, penalty='l2', multi_class='multinomial',
            solver='lbfgs', max_iter=1000, random_state=42
        )
        scores = cross_val_score(model, X, y, cv=cv)
        results[f'L2 (C={C})'] = (scores.mean(), scores.std())

    # L1 regularization
    for C in [0.01, 0.1, 1.0]:
        model = LogisticRegression(
            C=C, penalty='l1', solver='saga',
            multi_class='multinomial', max_iter=1000, random_state=42
        )
        scores = cross_val_score(model, X, y, cv=cv)
        results[f'L1 (C={C})'] = (scores.mean(), scores.std())

    # Elastic Net
    for C, l1_ratio in [(0.1, 0.5), (1.0, 0.5)]:
        model = LogisticRegression(
            C=C, penalty='elasticnet', l1_ratio=l1_ratio,
            solver='saga', multi_class='multinomial',
            max_iter=1000, random_state=42
        )
        scores = cross_val_score(model, X, y, cv=cv)
        results[f'ElasticNet (C={C}, l1={l1_ratio})'] = (scores.mean(), scores.std())

    # Print results
    print("=== Regularization Comparison ===\n")
    for name, (mean, std) in sorted(results.items(), key=lambda x: -x[1][0]):
        print(f"{name:35s}: {mean:.4f} (+/- {std*2:.4f})")

    return results

def analyze_sparsity(X, y, C_values=[0.001, 0.01, 0.1, 1.0]):
    """
    Analyze sparsity induced by L1 regularization.
    """
    print("\n=== L1 Sparsity Analysis ===\n")

    for C in C_values:
        model = LogisticRegression(
            C=C, penalty='l1', solver='saga',
            multi_class='multinomial', max_iter=1000, random_state=42
        )
        model.fit(X, y)

        # Count non-zero weights
        n_nonzero = np.sum(model.coef_ != 0)
        total = model.coef_.size
        sparsity = 100 * (1 - n_nonzero / total)

        print(f"C={C:6.4f}: {n_nonzero:4d}/{total} non-zero "
              f"({sparsity:.1f}% sparse)")

if __name__ == "__main__":
    from sklearn.datasets import make_classification

    # High-dimensional data (more features than samples for some splits)
    X, y = make_classification(
        n_samples=500, n_features=100, n_informative=20,
        n_redundant=10, n_classes=5, random_state=42
    )

    compare_regularization(X, y)
    analyze_sparsity(X, y)
```

Use cross-validation to select regularization strength. Grid search over a logarithmic range (e.g., C ∈ {0.001, 0.01, 0.1, 1, 10, 100}) is common. For production, consider Bayesian optimization for efficient hyperparameter search.
When $K$ becomes very large (thousands or more), standard softmax becomes prohibitively expensive. Several techniques address this challenge.
The Bottleneck
For each training sample, the softmax and its gradient touch all $K$ logits: the forward pass costs $O(K \cdot d)$, and so does the backward pass.
For $K = 50,000$ (typical vocabulary), one forward-backward pass touches 50K output neurons—even if only one is the true class.
Technique 1: Hierarchical Softmax
Organize classes in a tree (typically binary). Each leaf is a class; internal nodes are binary classifiers.
Prediction: follow the path from the root to a leaf, multiplying the probabilities of the binary decisions along the way—only $O(\log K)$ nodes are evaluated.
Training: only the $O(\log K)$ node classifiers on the path to the true class receive gradient updates.
Complexity reduction:
| Approach | Per-sample complexity |
|---|---|
| Full softmax | $O(K \cdot d)$ |
| Hierarchical softmax | $O(\log K \cdot d)$ |
For $K = 50,000$: from 50,000 logit computations down to roughly 16 node evaluations ($\log_2 50{,}000 \approx 15.6$), each costing $O(d)$.
Challenge: the tree structure matters. Common approaches:
- Huffman trees built from class frequencies (as in word2vec), so frequent classes get short paths
- Balanced or random binary trees when no frequency information is available
- Trees built by clustering semantically similar classes together
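To make the $O(\log K \cdot d)$ cost concrete, below is a minimal sketch of scoring a single class under a binary-tree softmax; the node parameters, path encoding, and toy tree are hypothetical illustrations, not part of the original material.

```python
import numpy as np

def hs_class_logprob(x, node_vectors, path_nodes, path_signs):
    """
    Log-probability of one class under a binary-tree (hierarchical) softmax.

    x            : feature vector (d,)
    node_vectors : parameters of the internal nodes, shape (n_internal_nodes, d)
    path_nodes   : indices of the internal nodes on the root-to-leaf path for the class
    path_signs   : +1 / -1 encoding of the left/right decision taken at each node
    """
    scores = node_vectors[path_nodes] @ x                  # O(log K * d) work
    # log sigma(t) = -log(1 + exp(-t)); product of branch probs -> sum of log-sigmoids
    return np.sum(-np.log1p(np.exp(-path_signs * scores)))

# Toy example: 4 classes in a depth-2 binary tree with 3 internal nodes
rng = np.random.default_rng(0)
d = 8
nodes = rng.standard_normal((3, d))
x = rng.standard_normal(d)
# Hypothetical path for one class: root (go right), then node 2 (go left)
print(hs_class_logprob(x, nodes, np.array([0, 2]), np.array([-1.0, 1.0])))
```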
Technique 2: Sampled Softmax / Negative Sampling
Instead of computing the full softmax, keep the true class and sample a small set of $m$ negative classes from a proposal distribution, normalizing only over this subset:
Loss: $$\mathcal{L}_{\text{sampled}} = -\log \frac{e^{z_c}}{e^{z_c} + \sum_{k \in \text{neg}} e^{z_k}}$$
Complexity: $O(m \cdot d)$ instead of $O(K \cdot d)$. Typically $m \sim 100$-1000.
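A minimal NumPy sketch of this idea, assuming uniform negative sampling (real systems usually draw negatives from a frequency-based proposal distribution); the function and toy sizes are illustrative, not from the original text.

```python
import numpy as np

def sampled_softmax_loss(x, W, b, true_class, n_neg=100, rng=None):
    """Approximate the full-softmax loss using the true class plus sampled negatives."""
    rng = rng or np.random.default_rng()
    K = W.shape[0]
    neg = rng.choice(K, size=n_neg, replace=False)   # uniform negatives (simplification)
    neg = neg[neg != true_class]                     # drop the true class if sampled
    idx = np.concatenate(([true_class], neg))
    logits = W[idx] @ x + b[idx]                     # m+1 logits instead of K
    logits -= logits.max()                           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]                             # true class sits at position 0

# Toy example: K = 50,000 classes, d = 64 features
rng = np.random.default_rng(0)
K, d = 50_000, 64
W = rng.standard_normal((K, d)).astype(np.float32)
b = np.zeros(K, dtype=np.float32)
x = rng.standard_normal(d).astype(np.float32)
print(sampled_softmax_loss(x, W, b, true_class=123, n_neg=200, rng=rng))
```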
Technique 3: Noise Contrastive Estimation (NCE)
Frame as binary classification: distinguish real data from noise.
NCE objective: $$\mathcal{L}_{\text{NCE}} = \log \sigma\big(z_c - \log k Q(c)\big) + \sum_{j \sim Q} \log \sigma\big(-z_j + \log k Q(j)\big)$$
where $Q$ is the noise distribution and $k$ is the number of noise samples.
As $k \to \infty$, NCE approaches MLE. Used in Word2Vec (negative sampling variant).
Technique 4: Class-based Softmax
Group classes into clusters. Two-stage softmax:
$$P(y | \mathbf{x}) = P(\text{cluster}(y) | \mathbf{x}) \cdot P(y | \text{cluster}(y), \mathbf{x})$$
If there are $C$ clusters with $K/C$ classes each, the per-sample cost drops to $O\big((C + K/C) \cdot d\big)$, which is minimized at $C \approx \sqrt{K}$, giving $O(\sqrt{K} \cdot d)$.
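For example, with $K = 10{,}000$ and $C = 100$ clusters of 100 classes each, a prediction evaluates about $100 + 100 = 200$ logits instead of 10,000—a 50x reduction.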
For $K < 1000$: Full softmax is usually fine. For $K \sim 1000$-$10000$: Consider class-based or sampled softmax. For $K > 10000$: Hierarchical or sampled softmax almost essential. For $K > 100000$: Specialized architectures and efficient implementations required.
Memory constraints often limit model size and batch size. Several techniques help manage memory efficiently.
Memory Breakdown
For multinomial logistic regression, the main consumers of memory are:
- Parameters: $K(d+1)$ floats for $W$ and $\mathbf{b}$
- Gradients: the same size again during training
- Activations: $O(B \cdot K)$ logits and probabilities per mini-batch
- Optimizer state: adaptive methods such as Adam keep one or two extra copies of the parameters
Example: $K = 10{,}000$, $d = 1000$ gives $10{,}000 \times 1001 \approx 10$ million parameters, about 40 MB in float32—and roughly the same again for gradients.
For deep networks, activations dominate—for logistic regression, parameters do.
Strategy 1: Gradient Accumulation
Simulate a large effective batch without holding all of its activations in memory at once (PyTorch-style pseudocode, assuming `model` returns the loss):
```python
accum_steps = 4
for i, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accum_steps
    loss.backward()  # Accumulates gradients
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
Effect: Same effective batch size $B \times \text{accum_steps}$ with memory for batch $B$.
Strategy 2: Mixed Precision Training
Use float16 for most computations and float32 for sensitive operations:
- Keep a float32 master copy of the weights; cast to float16 for the forward/backward pass
- Scale the loss before backpropagation so small gradients do not underflow in float16
- Accumulate reductions (sums, loss values) in float32
Frameworks: PyTorch AMP (torch.cuda.amp), TensorFlow mixed precision.
Speedup: 2-3x on modern GPUs (Tensor Cores), plus memory reduction.
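A minimal mixed-precision sketch using PyTorch's AMP utilities; the single linear layer, synthetic tensors, and hyperparameters are stand-ins, and a CUDA-capable GPU is assumed.

```python
import torch
import torch.nn.functional as F

device = "cuda"
K, d = 1000, 256
model = torch.nn.Linear(d, K).to(device)        # logistic regression = one linear layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(512, d, device=device)
y = torch.randint(0, K, (512,), device=device)

for step in range(10):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # run the forward pass in float16 where safe
        loss = F.cross_entropy(model(x), y)
    scaler.scale(loss).backward()               # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                      # unscale gradients and apply the update
    scaler.update()
```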
Strategy 3: Sparse Features
For text/categorical data, features are often sparse (bag-of-words, one-hot).
Dense representation: $\mathbf{x} \in \mathbb{R}^{1000000}$ (vocabulary) consumes 4 MB per sample.
Sparse representation: Store only non-zero indices and values. Average document: ~100 unique words. Memory: ~1 KB per sample (100 indices + 100 values).
Sparse matrix-vector multiplication:
- Cost scales with the number of non-zeros: roughly $O(\text{nnz}(\mathbf{x}) \cdot K)$ per sample instead of $O(d \cdot K)$
- Supported directly by `scipy.sparse` matrices, which scikit-learn estimators accept as input
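A small sketch of the idea with `scipy.sparse`; the feature and class counts are illustrative assumptions, not values from the text.

```python
import numpy as np
from scipy import sparse

d, K = 100_000, 50                      # hypothetical vocabulary size and class count
rng = np.random.default_rng(0)

# A sparse "document": ~100 non-zero term counts out of 100,000 features
cols = rng.choice(d, size=100, replace=False)
x = sparse.csr_matrix(
    (np.ones(100), (np.zeros(100, dtype=int), cols)), shape=(1, d)
)

W = rng.standard_normal((K, d)).astype(np.float32)
logits = x @ W.T                        # only the 100 non-zero columns are touched
print(logits.shape)                     # (1, 50)
```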
Strategy 4: Model Parallel for Very Large K
When $K$ is so large that the output layer doesn't fit on one GPU: split the rows of $W$ across devices, let each device compute its slice of the logits, and combine the partial results (e.g., the softmax normalizer) with a collective all-reduce.
Used in large language models with 50K+ vocabulary.
| Technique | Memory Reduction | Tradeoff |
|---|---|---|
| Smaller batch size | Linear | More gradient noise, slower training |
| Gradient accumulation | Effective batch grows $k\times$ at fixed memory | $k$ forward/backward passes per update |
| Mixed precision (FP16) | ~2x | Requires loss scaling; minor numerical care |
| Sparse features | Proportional to sparsity | Requires sparse-aware operations |
| Gradient checkpointing | Trade compute for memory | Slower training (~30%) |
Deploying multi-class classifiers in production requires attention to several engineering concerns.
1. Inference Latency
For online serving, prediction latency matters:
- A single prediction is one $K \times d$ matrix-vector product plus a softmax (or just an argmax)
- Latency budgets are often in the low single-digit milliseconds, so memory layout and vectorized BLAS calls dominate
Optimizations:
- Skip the softmax when only the predicted label is needed (argmax over logits is sufficient)
- Keep weights contiguous in memory and in float32 (or float16) to exploit vectorized BLAS
- Batch concurrent requests and return top-$k$ classes instead of the full distribution
2. Batch Inference
For offline scoring (e.g., daily batch jobs):
- Throughput matters more than per-request latency
- Score large chunks at once as matrix-matrix products and parallelize across workers
3. Model Size and Transport
For edge deployment or frequent model updates:
- The artifact size is essentially $K(d+1)$ floats; float16 or int8 quantization halves or quarters it
- L1-induced sparsity further shrinks what has to be shipped
- A compact, versioned serialization format makes updates fast to roll out and roll back
4. Calibration
Predicted probabilities should match observed frequencies: among samples assigned probability 0.8 for a class, roughly 80% should actually belong to that class.
Calibration techniques:
- Temperature scaling: divide the logits by a scalar $T$ fitted on a validation set
- Platt scaling and isotonic regression (e.g., via scikit-learn's `CalibratedClassifierCV`)
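As an illustration of temperature scaling, here is a minimal sketch that fits a single temperature on held-out logits; the helper and the synthetic data are illustrative, not from the original text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(val_logits, val_labels):
    """Fit one temperature T on held-out logits by minimizing NLL (temperature scaling)."""
    def nll(T):
        z = val_logits / T
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(val_labels)), val_labels].mean()
    res = minimize_scalar(nll, bounds=(0.05, 10.0), method='bounded')
    return res.x

# Toy example: fit a temperature on synthetic validation logits
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=1000)
logits = rng.standard_normal((1000, 5)) + 3.0 * np.eye(5)[labels]
print(fit_temperature(logits, labels))
```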
5. Monitoring in Production
Metrics to track:
- Distribution of predicted classes and of predicted confidences (drift relative to training data)
- Latency percentiles (P50/P99) and error rates
- Accuracy and calibration on whatever labeled feedback becomes available
Alerts:
- Sudden shifts in the predicted class distribution or confidence profile
- Latency or error-rate spikes
- Degrading accuracy or calibration on fresh labeled data
```python
import numpy as np
import time

class OptimizedMultinomialClassifier:
    """
    Production-ready multinomial classifier with optimizations.
    """

    def __init__(self, W, b):
        """
        Args:
            W: Weight matrix (K, d), float32 or float16
            b: Bias vector (K,)
        """
        self.W = np.ascontiguousarray(W)  # Ensure memory layout
        self.b = b
        self.K, self.d = W.shape

    @classmethod
    def from_sklearn(cls, sklearn_model):
        """Load from trained sklearn LogisticRegression."""
        return cls(sklearn_model.coef_, sklearn_model.intercept_)

    def predict(self, X):
        """
        Predict class labels.

        Args:
            X: Features (n, d) or (d,) for single sample

        Returns:
            Predicted class indices
        """
        single = X.ndim == 1
        if single:
            X = X.reshape(1, -1)

        logits = X @ self.W.T + self.b
        predictions = np.argmax(logits, axis=1)

        return predictions[0] if single else predictions

    def predict_proba(self, X, temperature=1.0):
        """
        Predict class probabilities with optional temperature.
        """
        single = X.ndim == 1
        if single:
            X = X.reshape(1, -1)

        logits = (X @ self.W.T + self.b) / temperature

        # Stable softmax
        logits_max = np.max(logits, axis=1, keepdims=True)
        exp_logits = np.exp(logits - logits_max)
        probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)

        return probs[0] if single else probs

    def predict_top_k(self, X, k=5):
        """
        Return top-k predictions with probabilities.
        """
        probs = self.predict_proba(X)
        if probs.ndim == 1:
            probs = probs.reshape(1, -1)

        # Get top k indices
        top_k_idx = np.argsort(probs, axis=1)[:, -k:][:, ::-1]

        # Get corresponding probabilities
        n = probs.shape[0]
        top_k_probs = probs[np.arange(n)[:, None], top_k_idx]

        return top_k_idx, top_k_probs

def benchmark_inference(model, X, n_runs=100):
    """Benchmark inference latency."""
    # Warmup
    for _ in range(10):
        _ = model.predict(X)

    # Benchmark
    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        _ = model.predict(X)
        latencies.append((time.perf_counter() - start) * 1000)  # ms

    latencies = np.array(latencies)
    print(f"Inference latency (n={X.shape[0]} samples):")
    print(f"  Mean: {latencies.mean():.3f} ms")
    print(f"  P50:  {np.percentile(latencies, 50):.3f} ms")
    print(f"  P99:  {np.percentile(latencies, 99):.3f} ms")

# Example
if __name__ == "__main__":
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification

    # Create and train model
    X, y = make_classification(n_samples=1000, n_features=100,
                               n_classes=10, n_informative=20,
                               random_state=42)
    sklearn_model = LogisticRegression(
        multi_class='multinomial', max_iter=1000
    ).fit(X, y)

    # Create optimized version
    model = OptimizedMultinomialClassifier.from_sklearn(sklearn_model)

    # Benchmark
    X_test = np.random.randn(1, 100)     # Single sample
    benchmark_inference(model, X_test)

    X_batch = np.random.randn(100, 100)  # Batch
    benchmark_inference(model, X_batch)
```

We have explored the computational landscape of training and deploying multi-class logistic regression. Let's consolidate the key insights:
- Training and inference cost $O(K \cdot d)$ per sample; everything scales linearly in the number of classes and features
- Mini-batch SGD with a sensible learning-rate schedule is the practical workhorse; L-BFGS excels on small-to-medium problems
- Regularization (L2, L1, elastic net) and early stopping are essential with high-dimensional features or limited data
- Very large $K$ calls for approximate softmax techniques (hierarchical, sampled, class-based)
- Production deployment adds latency, memory, calibration, and monitoring concerns on top of raw accuracy
Module Complete:
With this page, we have completed Module 4 on Multi-class Logistic Regression. You now have a comprehensive understanding of:
- The softmax function and cross-entropy loss
- One-vs-rest and one-vs-one strategies versus the multinomial formulation
- The optimization, regularization, and scaling techniques needed to train and deploy these models in practice
This knowledge forms the foundation for understanding classification in neural networks and more advanced multi-class methods.
Congratulations! You have mastered multi-class logistic regression from mathematical foundations through practical deployment. This knowledge directly transfers to understanding the output layers of deep neural networks and serves as a powerful baseline for any multi-class classification problem.