With the softmax function in our toolkit, we can now construct the complete multinomial logistic regression model—a principled extension of binary logistic regression to any number of classes. This model is foundational: it underpins the output layers of virtually every classification neural network and remains a powerful baseline for multi-class problems.
Multinomial logistic regression is known by several names:
- Softmax regression (common in machine learning and deep learning)
- Multinomial logit model (statistics and econometrics)
- Maximum entropy (MaxEnt) classifier (natural language processing)
Despite this nomenclature diversity, all these terms describe the same model. In this page, we develop multinomial logistic regression completely—from model specification through parameterization choices, geometric interpretation, and practical considerations.
By the end of this page, you will understand: the complete probabilistic model specification; how to parameterize the model with reference class vs. full parameterization; the geometric interpretation of decision boundaries; the relationship between logits and log-odds; and practical implementation patterns for producing well-calibrated multi-class predictions.
The Setting
We have:
- A feature vector $\mathbf{x} \in \mathbb{R}^d$ for each example
- A label $y \in \{1, 2, \ldots, K\}$ taking one of $K$ discrete classes
- A training set of $n$ labeled pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$
Our goal is to model the conditional probability distribution $P(y | \mathbf{x})$ over all $K$ classes given features $\mathbf{x}$.
The Model
Multinomial logistic regression models the conditional probabilities as:
$$P(y = k | \mathbf{x}; \boldsymbol{\Theta}) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} = \text{softmax}(\mathbf{z})_k$$
where the logit (or score) for class $k$ is a linear function of features:
$$z_k = \mathbf{w}_k^T \mathbf{x} + b_k = \sum_{i=1}^{d} w_{ki} x_i + b_k$$
Here:
- $\mathbf{w}_k \in \mathbb{R}^d$ is the weight vector for class $k$
- $b_k \in \mathbb{R}$ is the bias (intercept) for class $k$
- $\mathbf{z} = (z_1, \ldots, z_K)$ collects the logits for all $K$ classes
- $\boldsymbol{\Theta} = \{(\mathbf{w}_k, b_k)\}_{k=1}^K$ denotes the full parameter set
We often write the model compactly as: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ where $W \in \mathbb{R}^{K \times d}$ (rows are weight vectors) and $\mathbf{b} \in \mathbb{R}^K$. Or, with augmented features $\tilde{\mathbf{x}} = [1, \mathbf{x}]$: $\mathbf{z} = \tilde{W}\tilde{\mathbf{x}}$ where $\tilde{W} \in \mathbb{R}^{K \times (d+1)}$ absorbs biases into the first column.
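To make the pieces concrete, here is a minimal NumPy sketch (with made-up toy numbers) that computes the logits $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ and passes them through softmax:

```python
import numpy as np

# Toy setup (illustrative numbers): K = 3 classes, d = 2 features
W = np.array([[ 1.0, -0.5],   # w_1
              [ 0.2,  0.8],   # w_2
              [-1.0,  0.3]])  # w_3; shape (K, d)
b = np.array([0.1, 0.0, -0.2])   # biases, shape (K,)
x = np.array([0.5, 1.5])         # a single feature vector

# Logits: z = W x + b, one score per class
z = W @ x + b

# Softmax turns logits into a probability distribution over the K classes
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print("logits z:", z)
print("probabilities p:", p, "sum =", p.sum())  # probabilities sum to 1
```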
The Modeling Assumption
Multinomial logistic regression makes a specific assumption about how class probabilities depend on features: the log-odds between any two classes is a linear function of features.
For classes $k$ and $l$:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = \log \frac{e^{z_k}}{e^{z_l}} = z_k - z_l = (\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l)$$
This is the log-linear model assumption: log-odds are linear in features. The assumption is strong but often reasonable and leads to a model with many desirable properties:
- A convex negative log-likelihood, so training finds a global optimum
- Coefficients that are directly interpretable as effects on log-odds
- Linear (hyperplane) decision boundaries that are easy to analyze
- Probabilistic outputs that tend to be reasonably well calibrated
Formal Distributional Form
The model specifies that $y | \mathbf{x}$ follows a categorical distribution (generalization of Bernoulli to $K$ outcomes):
$$y | \mathbf{x}; \boldsymbol{\Theta} \sim \text{Categorical}(p_1, p_2, \ldots, p_K)$$
where $p_k = P(y=k|\mathbf{x}) = \text{softmax}(\mathbf{z})_k$.
Equivalently, writing $y$ as a one-hot vector $\mathbf{y} \in \{0,1\}^K$ with $y_k = 1$ iff $y = k$:
$$P(\mathbf{y}|\mathbf{x}; \boldsymbol{\Theta}) = \prod_{k=1}^{K} p_k^{y_k} = \prod_{k=1}^{K} \left(\frac{e^{z_k}}{\sum_j e^{z_j}}\right)^{y_k}$$
This product form is crucial for likelihood-based learning.
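As a quick illustration of why the one-hot product form is convenient, the following sketch (using arbitrary illustrative logits) shows that the product collapses to the probability of the true class, and its log to the familiar cross-entropy term:

```python
import numpy as np

# Illustrative logits for K = 4 classes and a one-hot label (hypothetical values)
z = np.array([2.0, -1.0, 0.5, 0.0])
y_onehot = np.array([0, 0, 1, 0])        # true class is k = 2 (0-indexed)

p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Product form: prod_k p_k^{y_k} selects exactly the true class's probability
likelihood = np.prod(p ** y_onehot)

# Its log collapses to log p_{true class}, the basis of the cross-entropy loss
log_likelihood = np.sum(y_onehot * np.log(p))

print(likelihood, p[2])               # identical
print(log_likelihood, np.log(p[2]))   # identical
```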
Due to the translation invariance of softmax, multinomial logistic regression is fundamentally overparameterized: adding a constant to all logits doesn't change the probability distribution. This creates both opportunities and complications.
The Identifiability Problem
Consider parameters $\boldsymbol{\Theta} = \{(\mathbf{w}_k, b_k)\}_{k=1}^K$ and shifted parameters $\boldsymbol{\Theta}' = \{(\mathbf{w}_k + \mathbf{c}, b_k + d)\}_{k=1}^K$ for any $\mathbf{c} \in \mathbb{R}^d$, $d \in \mathbb{R}$:
$$z'_k = z_k + \mathbf{c}^T\mathbf{x} + d$$
Since the shift is identical for all classes: $$\text{softmax}(\mathbf{z}') = \text{softmax}(\mathbf{z})$$
These infinitely many parameter settings produce identical predictions. The model is not identifiable without constraints.
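A short numerical check of this shift invariance, using randomly generated illustrative parameters, confirms that the original and shifted parameter sets yield identical probabilities:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
K, d = 3, 4
W, b = rng.normal(size=(K, d)), rng.normal(size=K)
x = rng.normal(size=d)

# Shift every class by the same c and d: z'_k = z_k + c^T x + d
c, shift = rng.normal(size=d), 2.7
W_shifted, b_shifted = W + c, b + shift

p  = softmax(W @ x + b)
p2 = softmax(W_shifted @ x + b_shifted)
print(np.allclose(p, p2))  # True: the two parameter settings are indistinguishable
```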
Resolution 1: Reference Class Parameterization
Fix one class (say class $K$) as the reference with $\mathbf{w}_K = \mathbf{0}$ and $b_K = 0$:
$$z_K = 0 \quad \text{(always)}$$ $$z_k = \mathbf{w}_k^T \mathbf{x} + b_k \quad \text{for } k = 1, \ldots, K-1$$
Probabilities under reference class parameterization:
$$P(y=k|\mathbf{x}) = \frac{e^{z_k}}{1 + \sum_{j=1}^{K-1} e^{z_j}} \quad \text{for } k = 1, \ldots, K-1$$
$$P(y=K|\mathbf{x}) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{z_j}}$$
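The following sketch (random illustrative parameters) shows how any full parameterization can be converted to the reference-class form by subtracting class $K$'s parameters from every class, without changing any probability:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
K, d = 4, 3
W, b = rng.normal(size=(K, d)), rng.normal(size=K)
x = rng.normal(size=d)

# Subtract the last class's parameters from every class:
# this sets w_K = 0 and b_K = 0 while leaving all probabilities unchanged
W_ref = W - W[-1]
b_ref = b - b[-1]

print(np.allclose(softmax(W @ x + b), softmax(W_ref @ x + b_ref)))  # True
print(W_ref[-1], b_ref[-1])  # all zeros: class K is now the reference
```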
Note the similarity to binary logistic regression: these are exactly the sigmoid-style expressions, with the reference class playing the role of the negative class (the $K=2$ case is worked out in detail below).
Advantages:
- Exactly identifiable: each parameter setting produces a distinct model
- Fewer parameters: $(K-1)(d+1)$ instead of $K(d+1)$
- Coefficients have a clean interpretation as log-odds versus the reference class
- Matches the convention used in statistics and econometrics software
Disadvantages:
- Classes are treated asymmetrically: the reference class is special
- Coefficient values (though not predictions) change when the reference is changed
- Less natural for neural-network implementations, which prefer symmetric output layers
The reference class choice doesn't affect predictions—only interpretation. Common choices: (1) Most frequent class for stable estimation, (2) Natural baseline (e.g., 'healthy' vs. disease classes), (3) Alphabetically first for reproducibility. Statistical packages often use the first or last level.
Resolution 2: Full Parameterization with Regularization
Alternatively, keep all $K$ parameter sets but add regularization that implicitly breaks symmetry:
$$\mathcal{L}_{\text{regularized}} = \mathcal{L}_{\text{NLL}} + \lambda \sum_{k=1}^{K} \|\mathbf{w}_k\|^2$$
The L2 penalty prefers solutions where parameters are small, effectively centering them around zero. While still technically overparameterized, regularization:
- Selects a unique solution (the minimum-norm representative) among the equivalent shifted parameter sets
- Keeps the optimization well conditioned and numerically stable
- Prevents coefficients from drifting along the unidentifiable direction during training

A small numerical sketch of this centering effect follows.
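Among the equivalent shifted weight matrices $W + \mathbf{c}$, the L2 penalty is smallest when the class weight vectors are centered around zero. A minimal sketch with random illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
K, d = 3, 5
W = rng.normal(size=(K, d))

def penalty(W):
    return np.sum(W ** 2)

# Every shift W + c yields identical probabilities, but the L2 penalty differs.
# It is minimized when the class weight vectors have mean zero across classes.
W_centered = W - W.mean(axis=0)

for c in [np.zeros(d), rng.normal(size=d), -W.mean(axis=0)]:
    print(f"shift norm {np.linalg.norm(c):.2f} -> penalty {penalty(W + c):.3f}")
print("centered penalty:", penalty(W_centered))  # the minimum over all shifts
```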
Neural Network Convention:
Deep learning frameworks typically use full parameterization:
- The output layer is a dense (linear) layer with $K$ units, one logit per class
- All $K$ weight vectors and biases are learned; no class is singled out as a reference
- Softmax is applied to the $K$ logits (usually fused into the loss for numerical stability)
This symmetric treatment is convenient for implementation and doesn't cause optimization issues thanks to regularization.
| Aspect | Reference Class | Full + Regularization |
|---|---|---|
| Number of parameters | $(K-1)(d+1)$ | $K(d+1)$ |
| Identifiable | Yes | With regularization |
| Class symmetry | No (reference is special) | Yes |
| Interpretation | $\mathbf{w}_k$ = log-odds vs. reference | $\mathbf{w}_k - \mathbf{w}_l$ = log-odds of $k$ vs. $l$ |
| Common usage | Statistics, econometrics | Machine learning, deep learning |
| Optimization | Standard MLE | Regularized MLE |
The log-linear structure of multinomial logistic regression enables rich interpretation of parameters. Understanding log-odds is essential for model diagnostics and domain insights.
Log-Odds Definition
The log-odds (or logit) of class $k$ versus class $l$ is:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = z_k - z_l = (\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l)$$
This is the natural logarithm of the odds ratio—how many times more likely class $k$ is than class $l$.
Interpreting Coefficients (Reference Class)
With reference class $K$ (where $\mathbf{w}_K = 0$, $b_K = 0$):
$$\log \frac{P(y=k|\mathbf{x})}{P(y=K|\mathbf{x})} = \mathbf{w}_k^T \mathbf{x} + b_k$$
For a unit increase in feature $x_i$ (holding others constant):
$$\log \frac{P(y=k|\mathbf{x} + \mathbf{e}_i)}{P(y=K|\mathbf{x} + \mathbf{e}_i)} - \log \frac{P(y=k|\mathbf{x})}{P(y=K|\mathbf{x})} = w_{ki}$$
Thus:
- $w_{ki} > 0$: increasing feature $x_i$ raises the log-odds of class $k$ relative to the reference
- $w_{ki} < 0$: increasing feature $x_i$ lowers those log-odds
- $|w_{ki}|$ measures the strength of the effect (per unit of $x_i$, holding other features fixed)
Odds Ratio Interpretation
Exponentiating gives the odds ratio:
$$\text{OR}_{ki} = e^{w_{ki}}$$
For a unit increase in feature $x_i$:
- The odds of class $k$ versus the reference class are multiplied by $\text{OR}_{ki} = e^{w_{ki}}$
- $\text{OR}_{ki} > 1$ means the odds increase; $\text{OR}_{ki} < 1$ means they decrease
- $\text{OR}_{ki} = 1$ (i.e., $w_{ki} = 0$) means the feature has no effect on this comparison
Example:
Suppose we're classifying customer churn into: Stayed (reference), Churned to Competitor A, Churned to Competitor B.
If the coefficient for 'years as customer' for 'Churned to A' is $w = -0.3$:
$$\text{OR} = e^{-0.3} \approx 0.74$$
Interpretation: For each additional year as a customer, the odds of churning to Competitor A (versus staying) multiply by 0.74—i.e., they decrease by 26%.
If the same coefficient for 'Churned to B' is $w = -0.1$:
$$\text{OR} = e^{-0.1} \approx 0.90$$
Tenure reduces churn to B by only 10% per year—less protective than against A.
Coefficient interpretations assume other features are held constant. With correlated features, this 'all else equal' scenario may not occur in practice. Also, in regularized models, coefficients are biased toward zero and quantitative interpretation should be done cautiously.
Comparing Non-Reference Classes
To compare classes $k$ and $l$ (neither is reference):
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = (\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l)$$
The effect of feature $i$ on log-odds of $k$ vs. $l$ is $w_{ki} - w_{li}$.
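A small sketch with hypothetical fitted parameters confirms that the log of the probability ratio between any two classes equals the difference of their linear scores:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical fitted parameters for K = 3 classes, d = 2 features
W = np.array([[0.8, -0.4], [0.1, 0.6], [-0.9, -0.2]])
b = np.array([0.0, 0.3, -0.3])
x = np.array([1.0, 2.0])

p = softmax(W @ x + b)

k, l = 0, 1  # compare two non-reference classes
log_odds_direct = np.log(p[k] / p[l])
log_odds_coef   = (W[k] - W[l]) @ x + (b[k] - b[l])
print(np.isclose(log_odds_direct, log_odds_coef))   # True

# Effect of feature i on the k-vs-l log-odds is w_ki - w_li
print("per-unit effect of feature 0:", W[k, 0] - W[l, 0])
```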
Statistical Inference:
To test whether classes $k$ and $l$ differ in their relationship with feature $i$:
- The null hypothesis is $H_0: w_{ki} = w_{li}$, i.e., the difference $w_{ki} - w_{li}$ is zero
- A Wald-type test uses the estimated difference and its standard error, which requires the covariance between the two coefficient estimates
This is more complex than testing single coefficients and often requires simultaneous inference adjustments (Bonferroni, etc.) when comparing multiple pairs.
The geometry of multinomial logistic regression reveals a beautiful structure: the feature space is partitioned into regions by linear decision boundaries (hyperplanes).
Pairwise Decision Boundaries
The boundary between classes $k$ and $l$ occurs where their probabilities are equal:
$$P(y=k|\mathbf{x}) = P(y=l|\mathbf{x})$$
This implies $z_k = z_l$, so:
$$(\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l) = 0$$
This is a hyperplane in $\mathbb{R}^d$:
- Normal vector: $\mathbf{w}_k - \mathbf{w}_l$
- Offset determined by: $b_k - b_l$
- On one side class $k$ is more probable than class $l$; on the other side, class $l$ is
For $K$ classes, there are $\binom{K}{2} = \frac{K(K-1)}{2}$ pairwise boundaries.
The Voronoi-like Partition
Even though there are $\binom{K}{2}$ pairwise boundaries, the actual decision regions are simpler. Each class $k$ 'owns' the region where it has the highest logit:
$$\text{Predict class } k \text{ where } z_k > z_j \text{ for all } j \neq k$$
This creates a partition of feature space into (at most) $K$ convex polytopes—regions bounded by intersecting hyperplanes.
Properties of the Decision Regions:
- Each region is convex (an intersection of half-spaces)
- Regions may be unbounded, and a class can receive an empty region if it is never the argmax
- Only a subset of the $\binom{K}{2}$ pairwise hyperplanes actually appears as boundaries between neighboring regions
Soft vs. Hard Boundaries:
The model outputs probability distributions, not hard predictions. Near a decision boundary:
- The competing classes have nearly equal probabilities, so the predicted distribution expresses genuine uncertainty
- Far from all boundaries, one class's probability approaches 1 and the prediction is confident
```python
import numpy as np
import matplotlib.pyplot as plt


def softmax(z):
    """Numerically stable softmax."""
    z = np.asarray(z)
    z_max = np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z - z_max)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


def multinomial_logistic_prediction(X, W, b):
    """
    Compute class probabilities for multinomial logistic regression.

    Args:
        X: Feature matrix (n_samples, n_features)
        W: Weight matrix (n_classes, n_features)
        b: Bias vector (n_classes,)

    Returns:
        Probability matrix (n_samples, n_classes)
    """
    logits = X @ W.T + b  # (n_samples, n_classes)
    return softmax(logits)


def plot_decision_regions(W, b, class_names, ax=None, resolution=200):
    """
    Visualize decision regions for 2D features and K classes.

    Args:
        W: Weight matrix (K, 2)
        b: Bias vector (K,)
        class_names: List of class names
        ax: Matplotlib axis
        resolution: Grid resolution for plotting
    """
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(10, 8))

    # Create a grid
    x1_range = np.linspace(-3, 3, resolution)
    x2_range = np.linspace(-3, 3, resolution)
    X1, X2 = np.meshgrid(x1_range, x2_range)
    X_grid = np.column_stack([X1.ravel(), X2.ravel()])

    # Get predictions
    probs = multinomial_logistic_prediction(X_grid, W, b)
    predictions = np.argmax(probs, axis=1).reshape(X1.shape)

    # Also compute max probability (available for uncertainty shading; not plotted here)
    max_probs = np.max(probs, axis=1).reshape(X1.shape)

    # Plot decision regions with colors
    K = len(class_names)
    colors = plt.cm.Set2(np.linspace(0, 1, K))
    ax.contourf(X1, X2, predictions, levels=np.arange(-0.5, K, 1),
                colors=colors, alpha=0.4)

    # Plot decision boundaries (where probabilities are equal)
    for k in range(K):
        for l in range(k + 1, K):
            # Boundary: z_k = z_l  =>  (w_k - w_l)^T x + (b_k - b_l) = 0
            # For 2D: (w_k1 - w_l1)*x1 + (w_k2 - w_l2)*x2 + (b_k - b_l) = 0
            #      => x2 = -[(w_k1 - w_l1)*x1 + (b_k - b_l)] / (w_k2 - w_l2)
            w_diff = W[k] - W[l]
            b_diff = b[k] - b[l]
            if np.abs(w_diff[1]) > 1e-6:
                x2_boundary = -(w_diff[0] * x1_range + b_diff) / w_diff[1]
                mask = (x2_boundary >= -3) & (x2_boundary <= 3)
                ax.plot(x1_range[mask], x2_boundary[mask], 'k--',
                        linewidth=1.5, alpha=0.7)

    # Add legend
    for k in range(K):
        ax.scatter([], [], c=[colors[k]], s=100, label=class_names[k])
    ax.legend(loc='upper right')
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_title('Multinomial Logistic Regression Decision Regions')
    return ax


# Example with 3 classes
np.random.seed(42)

# Weight matrix: each row is the weight vector for a class
W = np.array([
    [1.5, 0.5],    # Class 0 favors high x1
    [-1.0, 1.5],   # Class 1 favors high x2
    [0.0, -1.5],   # Class 2 favors low x2
])
b = np.array([0.0, 0.5, -0.5])
class_names = ['Class A', 'Class B', 'Class C']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Decision regions
plot_decision_regions(W, b, class_names, ax=axes[0])

# Plot 2: Probability landscape for Class B
x1_range = np.linspace(-3, 3, 200)
x2_range = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(x1_range, x2_range)
X_grid = np.column_stack([X1.ravel(), X2.ravel()])
probs = multinomial_logistic_prediction(X_grid, W, b)
prob_B = probs[:, 1].reshape(X1.shape)

contour = axes[1].contourf(X1, X2, prob_B, levels=20, cmap='Blues')
plt.colorbar(contour, ax=axes[1], label='P(Class B)')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('Probability of Class B')

plt.tight_layout()
plt.savefig('multinomial_decision_boundaries.png', dpi=150,
            bbox_inches='tight', facecolor='white')
plt.show()

print("Decision boundaries (hyperplanes):")
for k in range(len(class_names)):
    for l in range(k + 1, len(class_names)):
        w_diff = W[k] - W[l]
        b_diff = b[k] - b[l]
        print(f"  {class_names[k]} vs {class_names[l]}: "
              f"{w_diff[0]:.2f}*x1 + {w_diff[1]:.2f}*x2 + {b_diff:.2f} = 0")
```

Multinomial logistic regression, like binary logistic regression, creates only linear decision boundaries. For nonlinear boundaries, use feature engineering (polynomial features, basis expansions) or switch to nonlinear models (neural networks, kernel methods, tree-based models).
Binary logistic regression is not merely a special case of multinomial logistic regression—it is multinomial logistic regression with $K=2$. Let's verify this correspondence precisely.
Binary Setup Recap
In binary logistic regression with classes $\{0, 1\}$:
$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$ $$P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x}) = \frac{1}{1 + e^{\mathbf{w}^T \mathbf{x} + b}}$$
Parameters: Single weight vector $\mathbf{w} \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$.
Multinomial with $K=2$
Using reference class parameterization with class 0 as reference:
- $z_0 = 0$ (i.e., $\mathbf{w}_0 = \mathbf{0}$, $b_0 = 0$)
- $z_1 = \mathbf{w}_1^T \mathbf{x} + b_1$
Then: $$P(y=1|\mathbf{x}) = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{e^{z_1}}{1 + e^{z_1}} = \frac{1}{1 + e^{-z_1}}$$
With $\mathbf{w}_1 = \mathbf{w}$ and $b_1 = b$, this is exactly the sigmoid formula.
Similarly: $$P(y=0|\mathbf{x}) = \frac{1}{1 + e^{z_1}} = \frac{1}{1 + e^{\mathbf{w}^T \mathbf{x} + b}}$$
The correspondence is exact.
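A quick numerical check (random illustrative parameters) confirms that the sigmoid and the $K=2$ softmax with a reference class agree exactly:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
d = 5
w, b = rng.normal(size=d), 0.4
x = rng.normal(size=d)

# Binary logistic regression
p1_binary = sigmoid(w @ x + b)

# Multinomial with K = 2, class 0 as reference (z_0 = 0, z_1 = w^T x + b)
z = np.array([0.0, w @ x + b])
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print(np.isclose(p1_binary, p[1]))       # True
print(np.isclose(1 - p1_binary, p[0]))   # True
```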
| Aspect | Binary ($K=2$) | Multinomial ($K > 2$) |
|---|---|---|
| Output activation | Sigmoid $\sigma(z)$ | Softmax$(\mathbf{z})$ |
| Output probabilities | $p, 1-p$ | $p_1, p_2, \ldots, p_K$ |
| Parameters (reference) | $(d+1)$ | $(K-1)(d+1)$ |
| Parameters (full, regularized) | $2(d+1)$ (redundant) | $K(d+1)$ |
| Loss function | Binary cross-entropy | Categorical cross-entropy |
| Decision boundary | Single hyperplane | $\binom{K}{2}$ hyperplanes |
| Library function | sigmoid + BCE | softmax + CE |
For binary classification, both formulations are equivalent, but sigmoid is more efficient (half the parameters). Use sigmoid + binary cross-entropy for $K=2$. For $K > 2$, you must use softmax + categorical cross-entropy. Some high-level libraries (e.g., scikit-learn's LogisticRegression) handle both cases automatically based on the number of classes.
Gradient Correspondence
The gradient structures also correspond. For binary logistic regression with $y \in \{0, 1\}$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) \mathbf{x}_i$$
For multinomial logistic regression with one-hot labels $\mathbf{y}_i$ (so $y_{ik} = 1$ when sample $i$ belongs to class $k$):
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_{ik} - y_{ik}) \mathbf{x}_i$$
The form $$(\text{predicted probability} - \text{true label}) \times \text{features}$$ is universal. This elegant gradient structure arises from the canonical link function property of logistic regression within the generalized linear model framework.
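The sketch below (synthetic random data) computes this gradient in vectorized form and verifies one entry against a finite-difference approximation:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
n, d, K = 100, 3, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)
Y = np.eye(K)[y]                 # one-hot labels, shape (n, K)

W = rng.normal(size=(K, d))
b = np.zeros(K)

P = softmax(X @ W.T + b)         # predicted probabilities, shape (n, K)

# Gradient of the mean NLL: (prediction - label) weighted by features
grad_W = (P - Y).T @ X / n       # shape (K, d), one row per dL/dw_k
grad_b = (P - Y).mean(axis=0)    # shape (K,)

# Sanity check one entry against a finite difference
def nll(W, b):
    P = softmax(X @ W.T + b)
    return -np.mean(np.log(P[np.arange(n), y]))

eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
print(np.isclose(grad_W[0, 0], (nll(W_pert, b) - nll(W, b)) / eps, atol=1e-4))  # True
```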
Understanding when multinomial logistic regression is appropriate—and when it fails—requires examining its underlying assumptions.
Assumption 1: Log-linear Relationship
The model assumes log-odds are linear functions of features:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = \text{linear in } \mathbf{x}$$
This is violated when:
- Feature effects are nonlinear (thresholds, saturation, U-shaped relationships)
- Important interactions between features are omitted from the linear predictor
Remedies:
- Engineer features: polynomial terms, interactions, splines or other basis expansions
- Switch to an inherently nonlinear model (trees, kernel methods, neural networks) if the linear form is too restrictive
Assumption 2: Independence of Irrelevant Alternatives (IIA)
This famous property (from econometrics) states that the ratio of probabilities for any two classes doesn't depend on other classes:
$$\frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = \frac{e^{z_k}}{e^{z_l}} = e^{z_k - z_l}$$
This ratio only involves $k$ and $l$—adding or removing class $m$ doesn't affect it.
Classic example: A commuter chooses between CAR (50%) and RED_BUS (50%). Now BLUE_BUS is added. IIA implies RED/BLUE ratio unchanged, so we get CAR: 33%, RED: 33%, BLUE: 33%. But rationally, the buses should split the original 50%: CAR: 50%, RED: 25%, BLUE: 25%. IIA fails when alternatives are similar.
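The bus example is easy to reproduce numerically; the sketch below uses hypothetical logits in which the two buses are identical to the model:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits giving CAR and RED_BUS equal 50/50 shares
z_two = np.array([1.0, 1.0])           # [CAR, RED_BUS]
print(softmax(z_two))                  # [0.5, 0.5]

# Add BLUE_BUS with the same logit as RED_BUS. Under the model, every
# pairwise ratio is preserved, so all three get 1/3 -- not the intuitive
# 50 / 25 / 25 split where the buses share the original bus demand.
z_three = np.array([1.0, 1.0, 1.0])    # [CAR, RED_BUS, BLUE_BUS]
print(softmax(z_three))                # [1/3, 1/3, 1/3]
```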
When IIA is Problematic:
- Some classes are close substitutes for one another (as in the red bus/blue bus example)
- Choices have a natural nested or hierarchical structure (first choose a category, then an option within it)
- Unobserved factors make errors correlated across similar alternatives
Remedies for IIA Violation:
- Nested logit models, which group similar alternatives
- Multinomial probit, which allows correlated errors across alternatives
- Mixed (random-coefficient) logit models
Assumption 3: No Complete Separation
Complete separation occurs when a hyperplane perfectly separates all instances of some classes. When this happens:
- The unregularized maximum likelihood estimate does not exist: coefficients can grow without bound while the likelihood keeps improving
- Predicted probabilities are pushed to exactly 0 or 1, and standard errors become unreliable
Detection:
- Coefficient magnitudes that keep growing across iterations
- Solver warnings about non-convergence despite a high iteration limit
- Perfect training accuracy combined with extreme predicted probabilities
Remedies:
- Add regularization (L2 or L1), which keeps coefficients finite
- Use penalized likelihood methods such as Firth's correction
- Collect more data or simplify the feature set

The sketch below illustrates how regularization strength controls coefficient growth on separable data.
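A minimal sketch using scikit-learn on a perfectly separable toy dataset; the parameter C (inverse regularization strength) controls how large the coefficient is allowed to grow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: x < 0 -> class 0, x > 0 -> class 1
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Weak regularization (large C): the coefficient grows very large as the fit
# pushes probabilities toward 0/1; stronger regularization keeps it modest.
for C in [1e6, 1.0, 0.1]:
    clf = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C={C:g}: coef={clf.coef_[0, 0]:.2f}, intercept={clf.intercept_[0]:.2f}")
```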
Assumption 4: Correctly Specified Linear Predictor
The model assumes we've included the 'right' features in the linear predictor. Misspecification leads to:
- Biased coefficient estimates, since omitted effects are absorbed by correlated included features
- Degraded predictive accuracy and poorly calibrated probabilities
Remedies:
- Use domain knowledge and diagnostic checks to identify missing features or transformations
- Compare against a flexible benchmark model; a large performance gap suggests misspecification
Practical Robustness:
Despite these assumptions, multinomial logistic regression is remarkably robust in practice:
- Mild violations of linearity degrade accuracy and calibration gracefully rather than catastrophically
- Regularization handles separation, collinearity, and high-dimensional feature sets
- With informative features, linear decision boundaries are often sufficient, and the model trains quickly and reproducibly
It remains the go-to method for multi-class classification before trying more complex models.
Implementing multinomial logistic regression correctly requires attention to several practical details. Let's examine production-quality patterns.
Data Preparation
Feature scaling: Unlike tree-based methods, logistic regression benefits from standardized features (mean 0, std 1). This helps optimization converge faster and makes coefficients comparable.
One-hot encoding: Categorical features must be one-hot encoded. Drop one level per feature to avoid multicollinearity (or rely on regularization).
Label encoding: Convert string labels to integers ${0, 1, \ldots, K-1}$. For loss computation, convert to one-hot format.
Train/validation split: Essential for hyperparameter tuning (regularization strength).
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix


def train_multinomial_logistic_regression(X, y, C=1.0, max_iter=1000):
    """
    Train multinomial logistic regression with best practices.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Labels (n_samples,), can be strings or integers
        C: Inverse regularization strength (smaller = more regularization)
        max_iter: Maximum iterations for solver

    Returns:
        model: Trained LogisticRegression model
        scaler: Fitted StandardScaler for features
        label_encoder: Fitted LabelEncoder for labels
    """
    # Encode labels to integers
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)
    n_classes = len(label_encoder.classes_)
    print(f"Classes: {label_encoder.classes_}")
    print(f"Number of classes: {n_classes}")

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
    )

    # Train model
    # multi_class='multinomial' uses softmax (vs 'ovr' for one-vs-rest)
    # solver='lbfgs' works well for multinomial
    model = LogisticRegression(
        C=C,
        multi_class='multinomial',
        solver='lbfgs',
        max_iter=max_iter,
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Evaluate
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"\nTraining accuracy: {train_acc:.4f}")
    print(f"Test accuracy: {test_acc:.4f}")

    # Detailed classification report
    y_pred = model.predict(X_test)
    print("\nClassification Report:")
    print(classification_report(
        y_test, y_pred, target_names=label_encoder.classes_
    ))

    return model, scaler, label_encoder


def analyze_coefficients(model, feature_names, label_encoder):
    """
    Analyze and interpret model coefficients.

    Args:
        model: Trained LogisticRegression model
        feature_names: List of feature names
        label_encoder: Fitted LabelEncoder
    """
    print("\n=== Coefficient Analysis ===\n")

    # model.coef_ has shape (n_classes, n_features)
    # model.intercept_ has shape (n_classes,)
    for k, class_name in enumerate(label_encoder.classes_):
        print(f"Class '{class_name}':")
        coef = model.coef_[k]
        intercept = model.intercept_[k]

        # Sort features by absolute coefficient magnitude
        sorted_idx = np.argsort(np.abs(coef))[::-1]

        print(f"  Intercept: {intercept:.4f}")
        print(f"  Top 5 features by importance:")
        for idx in sorted_idx[:5]:
            odds_ratio = np.exp(coef[idx])
            print(f"    {feature_names[idx]}: coef={coef[idx]:.4f}, "
                  f"odds_ratio={odds_ratio:.4f}")
        print()


def predict_with_probabilities(model, scaler, label_encoder, X_new):
    """
    Make predictions with probability distributions.

    Args:
        model: Trained model
        scaler: Fitted scaler
        label_encoder: Fitted label encoder
        X_new: New feature matrix

    Returns:
        predictions: Predicted class names
        probabilities: Probability for each class
    """
    X_scaled = scaler.transform(X_new)

    # Get probabilities (softmax output)
    probs = model.predict_proba(X_scaled)

    # Get predicted classes
    pred_indices = model.predict(X_scaled)
    predictions = label_encoder.inverse_transform(pred_indices)

    return predictions, probs


# Example usage
if __name__ == "__main__":
    # Generate synthetic data
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
        n_classes=4, n_clusters_per_class=1, random_state=42
    )

    # Convert numeric labels to strings for realistic demo
    label_map = {0: 'Class_A', 1: 'Class_B', 2: 'Class_C', 3: 'Class_D'}
    y_str = np.array([label_map[yi] for yi in y])
    feature_names = [f'feature_{i}' for i in range(X.shape[1])]

    # Train model
    model, scaler, le = train_multinomial_logistic_regression(X, y_str, C=1.0)

    # Analyze coefficients
    analyze_coefficients(model, feature_names, le)

    # Make predictions on new data
    X_new = np.random.randn(3, 10)
    preds, probs = predict_with_probabilities(model, scaler, le, X_new)
    print("=== Predictions on New Data ===")
    for i in range(len(preds)):
        print(f"Sample {i}: Predicted={preds[i]}")
        for j, class_name in enumerate(le.classes_):
            print(f"  P({class_name}) = {probs[i, j]:.4f}")
        print()
```

We have developed multinomial logistic regression as the complete framework for probabilistic multi-class classification. Let's consolidate the key insights:
- The model maps features to class probabilities via linear logits $z_k = \mathbf{w}_k^T\mathbf{x} + b_k$ passed through softmax
- Softmax's shift invariance makes the full parameterization non-identifiable; this is resolved with a reference class or with regularization
- Coefficients are interpretable as effects on log-odds (odds ratios after exponentiation)
- Decision boundaries are hyperplanes, partitioning feature space into convex regions
- Binary logistic regression is exactly the $K = 2$ special case
What's Next:
With the model structure established, we now turn to the crucial question: How do we train this model? The next page develops the cross-entropy loss function, deriving it from maximum likelihood principles and understanding why it's the natural choice for training softmax-based classifiers.
You now have a comprehensive understanding of multinomial logistic regression—its mathematical formulation, interpretation, geometry, and implementation. This model serves as the foundation for understanding classification output layers in modern deep learning.