The softmax classifier (also known as multinomial logistic regression) is a fundamental algorithm for multi-class classification problems. It generalizes binary logistic regression to handle K > 2 mutually exclusive classes by computing a probability distribution over all possible outcomes.
Given a dataset with n samples and d features, where each sample belongs to one of K classes, the softmax classifier learns a weight matrix W of dimensions (d × K). For an input feature vector x, the model computes class probabilities using the softmax function:
$$P(y = k | x) = \frac{e^{x^T w_k}}{\sum_{j=1}^{K} e^{x^T w_j}}$$
where w_k is the k-th column of the weight matrix, representing the learned parameters for class k.
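As a concrete sketch of this formula, a minimal NumPy version for a single sample (the function name `softmax_probs` is illustrative, not part of any required interface):

```python
import numpy as np

def softmax_probs(x, W):
    """P(y = k | x) for every class k, given features x (d,) and weights W (d, K)."""
    logits = x @ W               # x^T w_k for each column w_k of W
    exp_logits = np.exp(logits)  # unnormalized scores
    return exp_logits / exp_logits.sum()  # normalize to a probability distribution

# With zero weights every class gets the same score,
# so the probabilities are uniform: [1/3, 1/3, 1/3]
p = softmax_probs(np.array([0.5, -1.2]), np.zeros((2, 3)))
```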
The objective is to minimize the cross-entropy loss (also called negative log-likelihood), which measures the dissimilarity between predicted probabilities and true labels:
$$L(W) = -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i | x_i)$$
For a true label y_i = k, this becomes -log(P(y = k | x_i)), penalizing low probability assignments to the correct class.
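A minimal sketch of this loss in NumPy (assuming `P` holds the row-wise predicted probabilities and `y` holds integer class labels):

```python
import numpy as np

def cross_entropy(P, y):
    """Mean negative log-probability of the true class.
    P: (n, K) predicted probabilities; y: (n,) integer class labels."""
    n = len(y)
    return -np.mean(np.log(P[np.arange(n), y]))  # pick out P(y_i | x_i) per sample

# Uniform predictions over 3 classes give loss -log(1/3) ≈ 1.0986
loss = cross_entropy(np.full((3, 3), 1/3), np.array([0, 1, 2]))
```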
To minimize the loss, we compute the gradient with respect to the weights and iteratively update them:
$$W \leftarrow W - \alpha \cdot \nabla_W L$$
where α is the learning rate. The gradient has an elegant closed form, ∇_W L = X^T (P - Y) / n, where P is the n × K matrix of predicted probabilities and Y is the one-hot encoded label matrix; the k-th column of this gradient matrix updates the weights w_k for class k.
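One gradient-descent step built from these pieces might look like the following (a sketch; `gradient_step` is an illustrative name, and the softmax here is the plain version without any stability trick):

```python
import numpy as np

def gradient_step(X, y, W, lr):
    """One full-batch update: W <- W - lr * X^T (P - Y) / n."""
    n, K = X.shape[0], W.shape[1]
    P = np.exp(X @ W)
    P /= P.sum(axis=1, keepdims=True)  # row-wise softmax probabilities
    Y = np.eye(K)[y]                   # one-hot encoded labels
    return W - lr * (X.T @ (P - Y)) / n
```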
Your implementation should use a numerically stable softmax: subtract the maximum value in each row of the logits before exponentiating to prevent numerical overflow: $$\text{softmax}(z)_k = \frac{e^{z_k - \max(z)}}{\sum_j e^{z_j - \max(z)}}$$
This mathematically equivalent formulation prevents inf values from appearing during computation.
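A sketch of the stabilized version, applied row-wise over a batch of logits:

```python
import numpy as np

def stable_softmax(Z):
    """Row-wise softmax with the max-subtraction trick."""
    Z = Z - Z.max(axis=1, keepdims=True)   # largest logit in each row becomes 0
    E = np.exp(Z)                          # no overflow: all exponents are <= 0
    return E / E.sum(axis=1, keepdims=True)

# np.exp(1002.0) overflows to inf, but the shifted version stays finite
probs = stable_softmax(np.array([[1000.0, 1001.0, 1002.0]]))
```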
X = [[0.5, -1.2], [-0.3, 1.1], [0.8, -0.6]]
y = [0, 1, 2]
learning_rate = 0.01
n_iterations = 10

weights = [[0.0053, -0.0207, 0.0154], [-0.0317, 0.0436, -0.0119]]
losses = [1.0986, 1.0947, 1.0909, 1.0871, 1.0833, 1.0795, 1.0758, 1.0721, 1.0685, 1.0648]

With 3 samples, 2 features, and 3 classes, we train for 10 iterations at learning rate 0.01.
Initial State: • Weights initialized to zeros: shape (2, 3) • Initial softmax outputs uniform probabilities: [1/3, 1/3, 1/3] for each sample • Initial cross-entropy loss: -log(1/3) ≈ 1.0986
Training Progress: • Each iteration, gradients push weights to increase probability of true classes • Loss decreases monotonically from 1.0986 to 1.0648 • Final weights encode learned feature-class relationships
The weight matrix shows that feature 0 correlates positively with class 0 (0.0053) and negatively with class 1 (-0.0207), while feature 1 shows opposite patterns.
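Putting the pieces together, a minimal full-batch training loop in the spirit of this example (zero-initialized weights, loss logged at the start of each iteration; `train_softmax` is our own name, not a required interface) should reproduce the trace above to rounding:

```python
import numpy as np

def train_softmax(X, y, K, lr, n_iter):
    """Full-batch gradient descent for the softmax classifier."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    W = np.zeros((d, K))                    # zero initialization, as in the examples
    Y = np.eye(K)[y]                        # one-hot labels
    losses = []
    for _ in range(n_iter):
        Z = X @ W
        Z -= Z.max(axis=1, keepdims=True)   # numerical-stability shift
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)   # row-wise softmax
        losses.append(-np.mean(np.log(P[np.arange(n), y])))  # loss before the update
        W -= lr * (X.T @ (P - Y)) / n
    return W, losses

W, losses = train_softmax([[0.5, -1.2], [-0.3, 1.1], [0.8, -0.6]],
                          [0, 1, 2], K=3, lr=0.01, n_iter=10)
# losses[0] ≈ 1.0986 (= log 3) and the sequence decreases monotonically
```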
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [0, 1, 0, 1]
learning_rate = 0.1
n_iterations = 5

weights = [[0.119, -0.119], [-0.003, 0.003]]
losses = [0.6931, 0.6808, 0.6691, 0.6579, 0.6473]

This is a binary classification problem (K=2) with 4 samples.
Dataset Analysis: • Samples [1.0, 0.0] and [1.0, 1.0] belong to class 0 • Samples [0.0, 1.0] and [0.0, 0.0] belong to class 1
Training Dynamics: • Initial loss: -log(0.5) ≈ 0.6931 (random guessing between 2 classes) • The higher learning rate (0.1) speeds convergence relative to the first example • Loss decreases steadily to 0.6473 after 5 iterations
Weight Interpretation: • Feature 0 (first feature) strongly predicts class 0 (weight 0.119 vs -0.119) • Feature 1 has minimal discriminative power (weights near zero) • This aligns with the data: class 0 samples have higher first feature values
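As a side note on this K = 2 case, a two-column softmax is mathematically equivalent to a sigmoid over the weight difference w_0 - w_1; a quick numerical check using the trained weights from this example:

```python
import numpy as np

W = np.array([[0.119, -0.119], [-0.003, 0.003]])  # trained weights from this example
x = np.array([1.0, 0.0])                          # a class-0 sample

z = x @ W
p_softmax = np.exp(z) / np.exp(z).sum()           # softmax over the 2 classes
p_sigmoid = 1.0 / (1.0 + np.exp(-x @ (W[:, 0] - W[:, 1])))  # sigmoid of x^T (w_0 - w_1)
# p_softmax[0] and p_sigmoid agree; both exceed 0.5, correctly favoring class 0
```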
X = [[1.0, 0.5], [0.5, 1.0], [0.0, 0.0], [-0.5, -0.5], [0.5, -0.5], [-0.5, 0.5]]
y = [0, 1, 2, 0, 1, 2]
learning_rate = 0.05
n_iterations = 8

weights = [[0.0111, 0.0435, -0.0546], [-0.0219, 0.0106, 0.0113]]
losses = [1.0986, 1.0968, 1.0949, 1.0931, 1.0913, 1.0896, 1.0878, 1.0861]

A 3-class problem with 6 samples distributed across classes.
Data Distribution: • Class 0: [1.0, 0.5] and [-0.5, -0.5] (both features share a sign) • Class 1: [0.5, 1.0] and [0.5, -0.5] (positive first feature) • Class 2: [0.0, 0.0] and [-0.5, 0.5] (the origin and an off-diagonal point)
Convergence Analysis: • Initial loss: log(3) ≈ 1.0986 (uniform distribution over 3 classes) • Moderate learning rate (0.05) yields gradual, stable descent • Loss decreases smoothly by ~0.0125 over 8 iterations
Weight Patterns: • Class 2 has negative weight for feature 0 (-0.0546), positive for feature 1 (0.0113) • Class 1 shows positive association with feature 0 (0.0435) • The overlapping class boundaries make full separation challenging
Constraints