The softmax classifier (also known as multinomial logistic regression) is a fundamental algorithm for multi-class classification problems. It generalizes binary logistic regression to handle K > 2 mutually exclusive classes by computing a probability distribution over all possible outcomes.
Given a dataset with n samples and d features, where each sample belongs to one of K classes, the softmax classifier learns a weight matrix W of dimensions (d × K). For an input feature vector x, the model computes class probabilities using the softmax function:
$$P(y = k | x) = \frac{e^{x^T w_k}}{\sum_{j=1}^{K} e^{x^T w_j}}$$
where w_k is the k-th column of the weight matrix, representing the learned parameters for class k.
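As a concrete sketch of this formula, a minimal NumPy version for a single sample (the function name `softmax_probs` is illustrative, not part of any required interface):

```python
import numpy as np

def softmax_probs(x, W):
    """P(y = k | x) for every class k, given features x (d,) and weights W (d, K)."""
    logits = x @ W               # x^T w_k for each column w_k of W
    exp_logits = np.exp(logits)  # unnormalized scores
    return exp_logits / exp_logits.sum()  # normalize to a probability distribution

# With zero weights every class gets the same score,
# so the probabilities are uniform: [1/3, 1/3, 1/3]
p = softmax_probs(np.array([0.5, -1.2]), np.zeros((2, 3)))
```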
The objective is to minimize the cross-entropy loss (also called negative log-likelihood), which measures the dissimilarity between predicted probabilities and true labels:
$$L(W) = -\frac{1}{n} \sum_{i=1}^{n} \log P(y_i | x_i)$$
For a true label y_i = k, this becomes -log(P(y = k | x_i)), penalizing low probability assignments to the correct class.
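A minimal sketch of this loss in NumPy (assuming `P` holds the row-wise predicted probabilities and `y` holds integer class labels):

```python
import numpy as np

def cross_entropy(P, y):
    """Mean negative log-probability of the true class.
    P: (n, K) predicted probabilities; y: (n,) integer class labels."""
    n = len(y)
    return -np.mean(np.log(P[np.arange(n), y]))  # pick out P(y_i | x_i) per sample

# Uniform predictions over 3 classes give loss -log(1/3) ≈ 1.0986
loss = cross_entropy(np.full((3, 3), 1/3), np.array([0, 1, 2]))
```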
To minimize the loss, we compute the gradient with respect to the weights and iteratively update them:
$$W \leftarrow W - \alpha \cdot \nabla_W L$$
where α is the learning rate. The gradient has an elegant closed form, ∇_W L = X^T (P - Y) / n, where P is the n × K matrix of predicted probabilities and Y is the one-hot encoded label matrix; the k-th column of this gradient matrix updates the weights w_k for class k.
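One gradient-descent step built from these pieces might look like the following (a sketch; `gradient_step` is an illustrative name, and the softmax here is the plain version without any stability trick):

```python
import numpy as np

def gradient_step(X, y, W, lr):
    """One full-batch update: W <- W - lr * X^T (P - Y) / n."""
    n, K = X.shape[0], W.shape[1]
    P = np.exp(X @ W)
    P /= P.sum(axis=1, keepdims=True)  # row-wise softmax probabilities
    Y = np.eye(K)[y]                   # one-hot encoded labels
    return W - lr * (X.T @ (P - Y)) / n
```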
Your implementation should use a numerically stable softmax: subtract the maximum value in each row of the logits before exponentiating to prevent numerical overflow: $$\text{softmax}(z)_k = \frac{e^{z_k - \max(z)}}{\sum_j e^{z_j - \max(z)}}$$
This mathematically equivalent formulation prevents inf values from appearing during computation.
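A sketch of the stabilized version, applied row-wise over a batch of logits:

```python
import numpy as np

def stable_softmax(Z):
    """Row-wise softmax with the max-subtraction trick."""
    Z = Z - Z.max(axis=1, keepdims=True)   # largest logit in each row becomes 0
    E = np.exp(Z)                          # no overflow: all exponents are <= 0
    return E / E.sum(axis=1, keepdims=True)

# np.exp(1002.0) overflows to inf, but the shifted version stays finite
probs = stable_softmax(np.array([[1000.0, 1001.0, 1002.0]]))
```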
X = [[0.5, -1.2], [-0.3, 1.1], [0.8, -0.6]]
y = [0, 1, 2]
learning_rate = 0.01
n_iterations = 10

weights = [[0.0053, -0.0207, 0.0154], [-0.0317, 0.0436, -0.0119]]
losses = [1.0986, 1.0947, 1.0909, 1.0871, 1.0833, 1.0795, 1.0758, 1.0721, 1.0685, 1.0648]

With 3 samples, 2 features, and 3 classes, we train for 10 iterations at learning rate 0.01.
Initial State: • Weights initialized to zeros: shape (2, 3) • Initial softmax outputs uniform probabilities: [1/3, 1/3, 1/3] for each sample • Initial cross-entropy loss: -log(1/3) ≈ 1.0986
Training Progress: • Each iteration, gradients push weights to increase probability of true classes • Loss decreases monotonically from 1.0986 to 1.0648 • Final weights encode learned feature-class relationships
The weight matrix shows that feature 0 correlates positively with class 0 (0.0053) and negatively with class 1 (-0.0207), while feature 1 shows opposite patterns.
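Putting the pieces together, a minimal full-batch training loop in the spirit of this example (zero-initialized weights, loss logged at the start of each iteration; `train_softmax` is our own name, not a required interface) should reproduce the trace above to rounding:

```python
import numpy as np

def train_softmax(X, y, K, lr, n_iter):
    """Full-batch gradient descent for the softmax classifier."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    W = np.zeros((d, K))                    # zero initialization, as in the examples
    Y = np.eye(K)[y]                        # one-hot labels
    losses = []
    for _ in range(n_iter):
        Z = X @ W
        Z -= Z.max(axis=1, keepdims=True)   # numerical-stability shift
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)   # row-wise softmax
        losses.append(-np.mean(np.log(P[np.arange(n), y])))  # loss before the update
        W -= lr * (X.T @ (P - Y)) / n
    return W, losses

W, losses = train_softmax([[0.5, -1.2], [-0.3, 1.1], [0.8, -0.6]],
                          [0, 1, 2], K=3, lr=0.01, n_iter=10)
# losses[0] ≈ 1.0986 (= log 3) and the sequence decreases monotonically
```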
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
y = [0, 1, 0, 1]
learning_rate = 0.1
n_iterations = 5

weights = [[0.119, -0.119], [-0.003, 0.003]]
losses = [0.6931, 0.6808, 0.6691, 0.6579, 0.6473]

This is a binary classification problem (K=2) with 4 samples.
Dataset Analysis: • Samples [1.0, 0.0] and [1.0, 1.0] belong to class 0 • Samples [0.0, 1.0] and [0.0, 0.0] belong to class 1
Training Dynamics: • Initial loss: -log(0.5) ≈ 0.6931 (random guessing between 2 classes) • The higher learning rate (0.1) speeds convergence relative to the first example • Loss decreases steadily to 0.6473 after 5 iterations
Weight Interpretation: • Feature 0 (first feature) strongly predicts class 0 (weight 0.119 vs -0.119) • Feature 1 has minimal discriminative power (weights near zero) • This aligns with the data: class 0 samples have higher first feature values
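As a side note on this K = 2 case, a two-column softmax is mathematically equivalent to a sigmoid over the weight difference w_0 - w_1; a quick numerical check using the trained weights from this example:

```python
import numpy as np

W = np.array([[0.119, -0.119], [-0.003, 0.003]])  # trained weights from this example
x = np.array([1.0, 0.0])                          # a class-0 sample

z = x @ W
p_softmax = np.exp(z) / np.exp(z).sum()           # softmax over the 2 classes
p_sigmoid = 1.0 / (1.0 + np.exp(-x @ (W[:, 0] - W[:, 1])))  # sigmoid of x^T (w_0 - w_1)
# p_softmax[0] and p_sigmoid agree; both exceed 0.5, correctly favoring class 0
```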
X = [[1.0, 0.5], [0.5, 1.0], [0.0, 0.0], [-0.5, -0.5], [0.5, -0.5], [-0.5, 0.5]]
y = [0, 1, 2, 0, 1, 2]
learning_rate = 0.05
n_iterations = 8

weights = [[0.0111, 0.0435, -0.0546], [-0.0219, 0.0106, 0.0113]]
losses = [1.0986, 1.0968, 1.0949, 1.0931, 1.0913, 1.0896, 1.0878, 1.0861]

A 3-class problem with 6 samples distributed across classes.
Data Distribution: • Class 0: [1.0, 0.5] and [-0.5, -0.5] (both features share a sign) • Class 1: [0.5, 1.0] and [0.5, -0.5] (positive first feature) • Class 2: [0.0, 0.0] and [-0.5, 0.5] (the origin and an off-diagonal point)
Convergence Analysis: • Initial loss: log(3) ≈ 1.0986 (uniform distribution over 3 classes) • Moderate learning rate (0.05) yields gradual, stable descent • Loss decreases smoothly by ~0.0125 over 8 iterations
Weight Patterns: • Class 2 has negative weight for feature 0 (-0.0546), positive for feature 1 (0.0113) • Class 1 shows positive association with feature 0 (0.0435) • The overlapping class boundaries make full separation challenging
Constraints