With the softmax function in our toolkit, we can now construct the complete multinomial logistic regression model—a principled extension of binary logistic regression to any number of classes. This model is foundational: it underpins the output layers of virtually every classification neural network and remains a powerful baseline for multi-class problems.
Multinomial logistic regression is known by several names:
- Softmax regression (common in machine learning and deep learning)
- Multinomial logit model (statistics and econometrics)
- Maximum entropy (MaxEnt) classifier (natural language processing)
Despite this nomenclature diversity, all these terms describe the same model. In this page, we develop multinomial logistic regression completely—from model specification through parameterization choices, geometric interpretation, and practical considerations.
By the end of this page, you will understand: the complete probabilistic model specification; how to parameterize the model with reference class vs. full parameterization; the geometric interpretation of decision boundaries; the relationship between logits and log-odds; and practical implementation patterns for producing well-calibrated multi-class predictions.
The Setting
We have:
- A feature vector $\mathbf{x} \in \mathbb{R}^d$ for each example
- A label $y \in \{1, 2, \ldots, K\}$ taking one of $K$ discrete classes
- A training set of $n$ labeled pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$
Our goal is to model the conditional probability distribution $P(y | \mathbf{x})$ over all $K$ classes given features $\mathbf{x}$.
The Model
Multinomial logistic regression models the conditional probabilities as:
$$P(y = k | \mathbf{x}; \boldsymbol{\Theta}) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)} = \text{softmax}(\mathbf{z})_k$$
where the logit (or score) for class $k$ is a linear function of features:
$$z_k = \mathbf{w}_k^T \mathbf{x} + b_k = \sum_{i=1}^{d} w_{ki} x_i + b_k$$
Here:
- $\mathbf{w}_k \in \mathbb{R}^d$ is the weight vector for class $k$
- $b_k \in \mathbb{R}$ is the bias (intercept) for class $k$
- $\mathbf{z} = (z_1, \ldots, z_K)$ collects the logits for all $K$ classes
- $\boldsymbol{\Theta} = \{(\mathbf{w}_k, b_k)\}_{k=1}^K$ denotes the full parameter set
We often write the model compactly as: $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ where $W \in \mathbb{R}^{K \times d}$ (rows are weight vectors) and $\mathbf{b} \in \mathbb{R}^K$. Or, with augmented features $\tilde{\mathbf{x}} = [1, \mathbf{x}]$: $\mathbf{z} = \tilde{W}\tilde{\mathbf{x}}$ where $\tilde{W} \in \mathbb{R}^{K \times (d+1)}$ absorbs biases into the first column.
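To make the pieces concrete, here is a minimal NumPy sketch (with made-up toy numbers) that computes the logits $\mathbf{z} = W\mathbf{x} + \mathbf{b}$ and passes them through softmax:

```python
import numpy as np

# Toy setup (illustrative numbers): K = 3 classes, d = 2 features
W = np.array([[ 1.0, -0.5],   # w_1
              [ 0.2,  0.8],   # w_2
              [-1.0,  0.3]])  # w_3; shape (K, d)
b = np.array([0.1, 0.0, -0.2])   # biases, shape (K,)
x = np.array([0.5, 1.5])         # a single feature vector

# Logits: z = W x + b, one score per class
z = W @ x + b

# Softmax turns logits into a probability distribution over the K classes
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print("logits z:", z)
print("probabilities p:", p, "sum =", p.sum())  # probabilities sum to 1
```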
The Modeling Assumption
Multinomial logistic regression makes a specific assumption about how class probabilities depend on features: the log-odds between any two classes is a linear function of features.
For classes $k$ and $l$:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = \log \frac{e^{z_k}}{e^{z_l}} = z_k - z_l = (\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l)$$
This is the log-linear model assumption: log-odds are linear in features. The assumption is strong but often reasonable and leads to a model with many desirable properties:
- A convex negative log-likelihood, so training finds a global optimum
- Coefficients that are directly interpretable as effects on log-odds
- Linear (hyperplane) decision boundaries that are easy to analyze
- Probabilistic outputs that tend to be reasonably well calibrated
Formal Distributional Form
The model specifies that $y | \mathbf{x}$ follows a categorical distribution (generalization of Bernoulli to $K$ outcomes):
$$y | \mathbf{x}; \boldsymbol{\Theta} \sim \text{Categorical}(p_1, p_2, \ldots, p_K)$$
where $p_k = P(y=k|\mathbf{x}) = \text{softmax}(\mathbf{z})_k$.
Equivalently, writing $y$ as a one-hot vector $\mathbf{y} \in \{0,1\}^K$ with $y_k = 1$ iff $y = k$:
$$P(\mathbf{y}|\mathbf{x}; \boldsymbol{\Theta}) = \prod_{k=1}^{K} p_k^{y_k} = \prod_{k=1}^{K} \left(\frac{e^{z_k}}{\sum_j e^{z_j}}\right)^{y_k}$$
This product form is crucial for likelihood-based learning.
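As a quick illustration of why the one-hot product form is convenient, the following sketch (using arbitrary illustrative logits) shows that the product collapses to the probability of the true class, and its log to the familiar cross-entropy term:

```python
import numpy as np

# Illustrative logits for K = 4 classes and a one-hot label (hypothetical values)
z = np.array([2.0, -1.0, 0.5, 0.0])
y_onehot = np.array([0, 0, 1, 0])        # true class is k = 2 (0-indexed)

p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Product form: prod_k p_k^{y_k} selects exactly the true class's probability
likelihood = np.prod(p ** y_onehot)

# Its log collapses to log p_{true class}, the basis of the cross-entropy loss
log_likelihood = np.sum(y_onehot * np.log(p))

print(likelihood, p[2])               # identical
print(log_likelihood, np.log(p[2]))   # identical
```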
Due to the translation invariance of softmax, multinomial logistic regression is fundamentally overparameterized: adding a constant to all logits doesn't change the probability distribution. This creates both opportunities and complications.
The Identifiability Problem
Consider parameters $\boldsymbol{\Theta} = \{(\mathbf{w}_k, b_k)\}_{k=1}^K$ and shifted parameters $\boldsymbol{\Theta}' = \{(\mathbf{w}_k + \mathbf{c}, b_k + d)\}_{k=1}^K$ for any $\mathbf{c} \in \mathbb{R}^d$, $d \in \mathbb{R}$:
$$z'_k = z_k + \mathbf{c}^T\mathbf{x} + d$$
Since the shift is identical for all classes: $$\text{softmax}(\mathbf{z}') = \text{softmax}(\mathbf{z})$$
These infinitely many parameter settings produce identical predictions. The model is not identifiable without constraints.
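A short numerical check of this shift invariance, using randomly generated illustrative parameters, confirms that the original and shifted parameter sets yield identical probabilities:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
K, d = 3, 4
W, b = rng.normal(size=(K, d)), rng.normal(size=K)
x = rng.normal(size=d)

# Shift every class by the same c and d: z'_k = z_k + c^T x + d
c, shift = rng.normal(size=d), 2.7
W_shifted, b_shifted = W + c, b + shift

p  = softmax(W @ x + b)
p2 = softmax(W_shifted @ x + b_shifted)
print(np.allclose(p, p2))  # True: the two parameter settings are indistinguishable
```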
Resolution 1: Reference Class Parameterization
Fix one class (say class $K$) as the reference with $\mathbf{w}_K = \mathbf{0}$ and $b_K = 0$:
$$z_K = 0 \quad \text{(always)}$$ $$z_k = \mathbf{w}_k^T \mathbf{x} + b_k \quad \text{for } k = 1, \ldots, K-1$$
Probabilities under reference class parameterization:
$$P(y=k|\mathbf{x}) = \frac{e^{z_k}}{1 + \sum_{j=1}^{K-1} e^{z_j}} \quad \text{for } k = 1, \ldots, K-1$$
$$P(y=K|\mathbf{x}) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{z_j}}$$
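The following sketch (random illustrative parameters) shows how any full parameterization can be converted to the reference-class form by subtracting class $K$'s parameters from every class, without changing any probability:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
K, d = 4, 3
W, b = rng.normal(size=(K, d)), rng.normal(size=K)
x = rng.normal(size=d)

# Subtract the last class's parameters from every class:
# this sets w_K = 0 and b_K = 0 while leaving all probabilities unchanged
W_ref = W - W[-1]
b_ref = b - b[-1]

print(np.allclose(softmax(W @ x + b), softmax(W_ref @ x + b_ref)))  # True
print(W_ref[-1], b_ref[-1])  # all zeros: class K is now the reference
```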
Note the similarity to binary logistic regression: these are exactly the sigmoid-style expressions, with the reference class playing the role of the negative class (the $K=2$ case is worked out in detail below).
Advantages:
- Exactly identifiable: each parameter setting produces a distinct model
- Fewer parameters: $(K-1)(d+1)$ instead of $K(d+1)$
- Coefficients have a clean interpretation as log-odds versus the reference class
- Matches the convention used in statistics and econometrics software
Disadvantages:
- Classes are treated asymmetrically: the reference class is special
- Coefficient values (though not predictions) change when the reference is changed
- Less natural for neural-network implementations, which prefer symmetric output layers
The reference class choice doesn't affect predictions—only interpretation. Common choices: (1) Most frequent class for stable estimation, (2) Natural baseline (e.g., 'healthy' vs. disease classes), (3) Alphabetically first for reproducibility. Statistical packages often use the first or last level.
Resolution 2: Full Parameterization with Regularization
Alternatively, keep all $K$ parameter sets but add regularization that implicitly breaks symmetry:
$$\mathcal{L}_{\text{regularized}} = \mathcal{L}_{\text{NLL}} + \lambda \sum_{k=1}^{K} \|\mathbf{w}_k\|^2$$
The L2 penalty prefers solutions where parameters are small, effectively centering them around zero. While still technically overparameterized, regularization:
- Selects a unique solution (the minimum-norm representative) among the equivalent shifted parameter sets
- Keeps the optimization well conditioned and numerically stable
- Prevents coefficients from drifting along the unidentifiable direction during training

A small numerical sketch of this centering effect follows.
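Among the equivalent shifted weight matrices $W + \mathbf{c}$, the L2 penalty is smallest when the class weight vectors are centered around zero. A minimal sketch with random illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
K, d = 3, 5
W = rng.normal(size=(K, d))

def penalty(W):
    return np.sum(W ** 2)

# Every shift W + c yields identical probabilities, but the L2 penalty differs.
# It is minimized when the class weight vectors have mean zero across classes.
W_centered = W - W.mean(axis=0)

for c in [np.zeros(d), rng.normal(size=d), -W.mean(axis=0)]:
    print(f"shift norm {np.linalg.norm(c):.2f} -> penalty {penalty(W + c):.3f}")
print("centered penalty:", penalty(W_centered))  # the minimum over all shifts
```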
Neural Network Convention:
Deep learning frameworks typically use full parameterization:
- The output layer is a dense (linear) layer with $K$ units, one logit per class
- All $K$ weight vectors and biases are learned; no class is singled out as a reference
- Softmax is applied to the $K$ logits (usually fused into the loss for numerical stability)
This symmetric treatment is convenient for implementation and doesn't cause optimization issues thanks to regularization.
| Aspect | Reference Class | Full + Regularization |
|---|---|---|
| Number of parameters | $(K-1)(d+1)$ | $K(d+1)$ |
| Identifiable | Yes | With regularization |
| Class symmetry | No (reference is special) | Yes |
| Interpretation | $\mathbf{w}_k$ = log-odds vs. reference | $\mathbf{w}_k - \mathbf{w}_l$ = log-odds of $k$ vs. $l$ |
| Common usage | Statistics, econometrics | Machine learning, deep learning |
| Optimization | Standard MLE | Regularized MLE |
The log-linear structure of multinomial logistic regression enables rich interpretation of parameters. Understanding log-odds is essential for model diagnostics and domain insights.
Log-Odds Definition
The log-odds (or logit) of class $k$ versus class $l$ is:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = z_k - z_l = (\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l)$$
This is the natural logarithm of the odds ratio—how many times more likely class $k$ is than class $l$.
Interpreting Coefficients (Reference Class)
With reference class $K$ (where $\mathbf{w}_K = 0$, $b_K = 0$):
$$\log \frac{P(y=k|\mathbf{x})}{P(y=K|\mathbf{x})} = \mathbf{w}_k^T \mathbf{x} + b_k$$
For a unit increase in feature $x_i$ (holding others constant):
$$\log \frac{P(y=k|\mathbf{x} + \mathbf{e}_i)}{P(y=K|\mathbf{x} + \mathbf{e}_i)} - \log \frac{P(y=k|\mathbf{x})}{P(y=K|\mathbf{x})} = w_{ki}$$
Thus:
- $w_{ki} > 0$: increasing feature $x_i$ raises the log-odds of class $k$ relative to the reference
- $w_{ki} < 0$: increasing feature $x_i$ lowers those log-odds
- $|w_{ki}|$ measures the strength of the effect (per unit of $x_i$, holding other features fixed)
Odds Ratio Interpretation
Exponentiating gives the odds ratio:
$$\text{OR}_{ki} = e^{w_{ki}}$$
For a unit increase in feature $x_i$:
- The odds of class $k$ versus the reference class are multiplied by $\text{OR}_{ki} = e^{w_{ki}}$
- $\text{OR}_{ki} > 1$ means the odds increase; $\text{OR}_{ki} < 1$ means they decrease
- $\text{OR}_{ki} = 1$ (i.e., $w_{ki} = 0$) means the feature has no effect on this comparison
Example:
Suppose we're classifying customer churn into: Stayed (reference), Churned to Competitor A, Churned to Competitor B.
If the coefficient for 'years as customer' for 'Churned to A' is $w = -0.3$:
$$\text{OR} = e^{-0.3} \approx 0.74$$
Interpretation: For each additional year as a customer, the odds of churning to Competitor A (versus staying) multiply by 0.74—i.e., they decrease by 26%.
If the same coefficient for 'Churned to B' is $w = -0.1$:
$$\text{OR} = e^{-0.1} \approx 0.90$$
Tenure reduces churn to B by only 10% per year—less protective than against A.
Coefficient interpretations assume other features are held constant. With correlated features, this 'all else equal' scenario may not occur in practice. Also, in regularized models, coefficients are biased toward zero and quantitative interpretation should be done cautiously.
Comparing Non-Reference Classes
To compare classes $k$ and $l$ (neither is reference):
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = (\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l)$$
The effect of feature $i$ on log-odds of $k$ vs. $l$ is $w_{ki} - w_{li}$.
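A small sketch with hypothetical fitted parameters confirms that the log of the probability ratio between any two classes equals the difference of their linear scores:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical fitted parameters for K = 3 classes, d = 2 features
W = np.array([[0.8, -0.4], [0.1, 0.6], [-0.9, -0.2]])
b = np.array([0.0, 0.3, -0.3])
x = np.array([1.0, 2.0])

p = softmax(W @ x + b)

k, l = 0, 1  # compare two non-reference classes
log_odds_direct = np.log(p[k] / p[l])
log_odds_coef   = (W[k] - W[l]) @ x + (b[k] - b[l])
print(np.isclose(log_odds_direct, log_odds_coef))   # True

# Effect of feature i on the k-vs-l log-odds is w_ki - w_li
print("per-unit effect of feature 0:", W[k, 0] - W[l, 0])
```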
Statistical Inference:
To test whether classes $k$ and $l$ differ in their relationship with feature $i$:
- The null hypothesis is $H_0: w_{ki} = w_{li}$, i.e., the difference $w_{ki} - w_{li}$ is zero
- A Wald-type test uses the estimated difference and its standard error, which requires the covariance between the two coefficient estimates
This is more complex than testing single coefficients and often requires simultaneous inference adjustments (Bonferroni, etc.) when comparing multiple pairs.
The geometry of multinomial logistic regression reveals a beautiful structure: the feature space is partitioned into regions by linear decision boundaries (hyperplanes).
Pairwise Decision Boundaries
The boundary between classes $k$ and $l$ occurs where their probabilities are equal:
$$P(y=k|\mathbf{x}) = P(y=l|\mathbf{x})$$
This implies $z_k = z_l$, so:
$$(\mathbf{w}_k - \mathbf{w}_l)^T \mathbf{x} + (b_k - b_l) = 0$$
This is a hyperplane in $\mathbb{R}^d$:
- Normal vector: $\mathbf{w}_k - \mathbf{w}_l$
- Offset determined by: $b_k - b_l$
- On one side class $k$ is more probable than class $l$; on the other side, class $l$ is
For $K$ classes, there are $\binom{K}{2} = \frac{K(K-1)}{2}$ pairwise boundaries.
The Voronoi-like Partition
Even though there are $\binom{K}{2}$ pairwise boundaries, the actual decision regions are simpler. Each class $k$ 'owns' the region where it has the highest logit:
$$\text{Predict class } k \text{ where } z_k > z_j \text{ for all } j \neq k$$
This creates a partition of feature space into (at most) $K$ convex polytopes—regions bounded by intersecting hyperplanes.
Properties of the Decision Regions:
- Each region is convex (an intersection of half-spaces)
- Regions may be unbounded, and a class can receive an empty region if it is never the argmax
- Only a subset of the $\binom{K}{2}$ pairwise hyperplanes actually appears as boundaries between neighboring regions
Soft vs. Hard Boundaries:
The model outputs probability distributions, not hard predictions. Near a decision boundary:
- The competing classes have nearly equal probabilities, so the predicted distribution expresses genuine uncertainty
- Far from all boundaries, one class's probability approaches 1 and the prediction is confident
```python
import numpy as np
import matplotlib.pyplot as plt


def softmax(z):
    """Numerically stable softmax."""
    z = np.asarray(z)
    z_max = np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z - z_max)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)


def multinomial_logistic_prediction(X, W, b):
    """
    Compute class probabilities for multinomial logistic regression.

    Args:
        X: Feature matrix (n_samples, n_features)
        W: Weight matrix (n_classes, n_features)
        b: Bias vector (n_classes,)

    Returns:
        Probability matrix (n_samples, n_classes)
    """
    logits = X @ W.T + b  # (n_samples, n_classes)
    return softmax(logits)


def plot_decision_regions(W, b, class_names, ax=None, resolution=200):
    """
    Visualize decision regions for 2D features and K classes.

    Args:
        W: Weight matrix (K, 2)
        b: Bias vector (K,)
        class_names: List of class names
        ax: Matplotlib axis
        resolution: Grid resolution for plotting
    """
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(10, 8))

    # Create a grid
    x1_range = np.linspace(-3, 3, resolution)
    x2_range = np.linspace(-3, 3, resolution)
    X1, X2 = np.meshgrid(x1_range, x2_range)
    X_grid = np.column_stack([X1.ravel(), X2.ravel()])

    # Get predictions
    probs = multinomial_logistic_prediction(X_grid, W, b)
    predictions = np.argmax(probs, axis=1).reshape(X1.shape)

    # Also compute max probability (available for uncertainty shading; not plotted here)
    max_probs = np.max(probs, axis=1).reshape(X1.shape)

    # Plot decision regions with colors
    K = len(class_names)
    colors = plt.cm.Set2(np.linspace(0, 1, K))
    ax.contourf(X1, X2, predictions, levels=np.arange(-0.5, K, 1),
                colors=colors, alpha=0.4)

    # Plot decision boundaries (where probabilities are equal)
    for k in range(K):
        for l in range(k + 1, K):
            # Boundary: z_k = z_l  =>  (w_k - w_l)^T x + (b_k - b_l) = 0
            # For 2D: (w_k1 - w_l1)*x1 + (w_k2 - w_l2)*x2 + (b_k - b_l) = 0
            #      => x2 = -[(w_k1 - w_l1)*x1 + (b_k - b_l)] / (w_k2 - w_l2)
            w_diff = W[k] - W[l]
            b_diff = b[k] - b[l]
            if np.abs(w_diff[1]) > 1e-6:
                x2_boundary = -(w_diff[0] * x1_range + b_diff) / w_diff[1]
                mask = (x2_boundary >= -3) & (x2_boundary <= 3)
                ax.plot(x1_range[mask], x2_boundary[mask], 'k--',
                        linewidth=1.5, alpha=0.7)

    # Add legend
    for k in range(K):
        ax.scatter([], [], c=[colors[k]], s=100, label=class_names[k])
    ax.legend(loc='upper right')
    ax.set_xlim(-3, 3)
    ax.set_ylim(-3, 3)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_title('Multinomial Logistic Regression Decision Regions')
    return ax


# Example with 3 classes
np.random.seed(42)

# Weight matrix: each row is the weight vector for a class
W = np.array([
    [1.5, 0.5],    # Class 0 favors high x1
    [-1.0, 1.5],   # Class 1 favors high x2
    [0.0, -1.5],   # Class 2 favors low x2
])
b = np.array([0.0, 0.5, -0.5])
class_names = ['Class A', 'Class B', 'Class C']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Decision regions
plot_decision_regions(W, b, class_names, ax=axes[0])

# Plot 2: Probability landscape for Class B
x1_range = np.linspace(-3, 3, 200)
x2_range = np.linspace(-3, 3, 200)
X1, X2 = np.meshgrid(x1_range, x2_range)
X_grid = np.column_stack([X1.ravel(), X2.ravel()])
probs = multinomial_logistic_prediction(X_grid, W, b)
prob_B = probs[:, 1].reshape(X1.shape)

contour = axes[1].contourf(X1, X2, prob_B, levels=20, cmap='Blues')
plt.colorbar(contour, ax=axes[1], label='P(Class B)')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title('Probability of Class B')

plt.tight_layout()
plt.savefig('multinomial_decision_boundaries.png', dpi=150,
            bbox_inches='tight', facecolor='white')
plt.show()

print("Decision boundaries (hyperplanes):")
for k in range(len(class_names)):
    for l in range(k + 1, len(class_names)):
        w_diff = W[k] - W[l]
        b_diff = b[k] - b[l]
        print(f"  {class_names[k]} vs {class_names[l]}: "
              f"{w_diff[0]:.2f}*x1 + {w_diff[1]:.2f}*x2 + {b_diff:.2f} = 0")
```

Multinomial logistic regression, like binary logistic regression, creates only linear decision boundaries. For nonlinear boundaries, use feature engineering (polynomial features, basis expansions) or switch to nonlinear models (neural networks, kernel methods, tree-based models).
Binary logistic regression is not merely a special case of multinomial logistic regression—it is multinomial logistic regression with $K=2$. Let's verify this correspondence precisely.
Binary Setup Recap
In binary logistic regression with classes $\{0, 1\}$:
$$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x} + b)}}$$ $$P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x}) = \frac{1}{1 + e^{\mathbf{w}^T \mathbf{x} + b}}$$
Parameters: Single weight vector $\mathbf{w} \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$.
Multinomial with $K=2$
Using reference class parameterization with class 0 as reference:
- $z_0 = 0$ (i.e., $\mathbf{w}_0 = \mathbf{0}$, $b_0 = 0$)
- $z_1 = \mathbf{w}_1^T \mathbf{x} + b_1$
Then: $$P(y=1|\mathbf{x}) = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{e^{z_1}}{1 + e^{z_1}} = \frac{1}{1 + e^{-z_1}}$$
With $\mathbf{w}_1 = \mathbf{w}$ and $b_1 = b$, this is exactly the sigmoid formula.
Similarly: $$P(y=0|\mathbf{x}) = \frac{1}{1 + e^{z_1}} = \frac{1}{1 + e^{\mathbf{w}^T \mathbf{x} + b}}$$
The correspondence is exact.
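A quick numerical check (random illustrative parameters) confirms that the sigmoid and the $K=2$ softmax with a reference class agree exactly:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(3)
d = 5
w, b = rng.normal(size=d), 0.4
x = rng.normal(size=d)

# Binary logistic regression
p1_binary = sigmoid(w @ x + b)

# Multinomial with K = 2, class 0 as reference (z_0 = 0, z_1 = w^T x + b)
z = np.array([0.0, w @ x + b])
p = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

print(np.isclose(p1_binary, p[1]))       # True
print(np.isclose(1 - p1_binary, p[0]))   # True
```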
| Aspect | Binary ($K=2$) | Multinomial ($K > 2$) |
|---|---|---|
| Output activation | Sigmoid $\sigma(z)$ | Softmax$(\mathbf{z})$ |
| Output probabilities | $p, 1-p$ | $p_1, p_2, \ldots, p_K$ |
| Parameters (reference) | $(d+1)$ | $(K-1)(d+1)$ |
| Parameters (full, regularized) | $2(d+1)$ (redundant) | $K(d+1)$ |
| Loss function | Binary cross-entropy | Categorical cross-entropy |
| Decision boundary | Single hyperplane | $\binom{K}{2}$ hyperplanes |
| Library function | sigmoid + BCE | softmax + CE |
For binary classification, both formulations are equivalent, but sigmoid is more efficient (half the parameters). Use sigmoid + binary cross-entropy for $K=2$. For $K > 2$, you must use softmax + categorical cross-entropy. Some high-level libraries (e.g., scikit-learn's LogisticRegression) handle both cases automatically based on the number of classes.
Gradient Correspondence
The gradient structures also correspond. For binary logistic regression with $y \in \{0, 1\}$:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i) \mathbf{x}_i$$
For multinomial logistic regression with one-hot labels $\mathbf{y}_i$ (so $y_{ik} = 1$ when sample $i$ belongs to class $k$):
$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_{ik} - y_{ik}) \mathbf{x}_i$$
The form $$(\text{predicted probability} - \text{true label}) \times \text{features}$$ is universal. This elegant gradient structure arises from the canonical link function property of logistic regression within the generalized linear model framework.
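The sketch below (synthetic random data) computes this gradient in vectorized form and verifies one entry against a finite-difference approximation:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
n, d, K = 100, 3, 4
X = rng.normal(size=(n, d))
y = rng.integers(0, K, size=n)
Y = np.eye(K)[y]                 # one-hot labels, shape (n, K)

W = rng.normal(size=(K, d))
b = np.zeros(K)

P = softmax(X @ W.T + b)         # predicted probabilities, shape (n, K)

# Gradient of the mean NLL: (prediction - label) weighted by features
grad_W = (P - Y).T @ X / n       # shape (K, d), one row per dL/dw_k
grad_b = (P - Y).mean(axis=0)    # shape (K,)

# Sanity check one entry against a finite difference
def nll(W, b):
    P = softmax(X @ W.T + b)
    return -np.mean(np.log(P[np.arange(n), y]))

eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
print(np.isclose(grad_W[0, 0], (nll(W_pert, b) - nll(W, b)) / eps, atol=1e-4))  # True
```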
Understanding when multinomial logistic regression is appropriate—and when it fails—requires examining its underlying assumptions.
Assumption 1: Log-linear Relationship
The model assumes log-odds are linear functions of features:
$$\log \frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = \text{linear in } \mathbf{x}$$
This is violated when:
- Feature effects are nonlinear (thresholds, saturation, U-shaped relationships)
- Important interactions between features are omitted from the linear predictor
Remedies:
- Engineer features: polynomial terms, interactions, splines or other basis expansions
- Switch to an inherently nonlinear model (trees, kernel methods, neural networks) if the linear form is too restrictive
Assumption 2: Independence of Irrelevant Alternatives (IIA)
This famous property (from econometrics) states that the ratio of probabilities for any two classes doesn't depend on other classes:
$$\frac{P(y=k|\mathbf{x})}{P(y=l|\mathbf{x})} = \frac{e^{z_k}}{e^{z_l}} = e^{z_k - z_l}$$
This ratio only involves $k$ and $l$—adding or removing class $m$ doesn't affect it.
Classic example: A commuter chooses between CAR (50%) and RED_BUS (50%). Now BLUE_BUS is added. IIA implies RED/BLUE ratio unchanged, so we get CAR: 33%, RED: 33%, BLUE: 33%. But rationally, the buses should split the original 50%: CAR: 50%, RED: 25%, BLUE: 25%. IIA fails when alternatives are similar.
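The bus example is easy to reproduce numerically; the sketch below uses hypothetical logits in which the two buses are identical to the model:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits giving CAR and RED_BUS equal 50/50 shares
z_two = np.array([1.0, 1.0])           # [CAR, RED_BUS]
print(softmax(z_two))                  # [0.5, 0.5]

# Add BLUE_BUS with the same logit as RED_BUS. Under the model, every
# pairwise ratio is preserved, so all three get 1/3 -- not the intuitive
# 50 / 25 / 25 split where the buses share the original bus demand.
z_three = np.array([1.0, 1.0, 1.0])    # [CAR, RED_BUS, BLUE_BUS]
print(softmax(z_three))                # [1/3, 1/3, 1/3]
```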
When IIA is Problematic:
- Some classes are close substitutes for one another (as in the red bus/blue bus example)
- Choices have a natural nested or hierarchical structure (first choose a category, then an option within it)
- Unobserved factors make errors correlated across similar alternatives
Remedies for IIA Violation:
- Nested logit models, which group similar alternatives
- Multinomial probit, which allows correlated errors across alternatives
- Mixed (random-coefficient) logit models
Assumption 3: No Complete Separation
Complete separation occurs when a hyperplane perfectly separates all instances of some classes. When this happens:
- The unregularized maximum likelihood estimate does not exist: coefficients can grow without bound while the likelihood keeps improving
- Predicted probabilities are pushed to exactly 0 or 1, and standard errors become unreliable
Detection:
- Coefficient magnitudes that keep growing across iterations
- Solver warnings about non-convergence despite a high iteration limit
- Perfect training accuracy combined with extreme predicted probabilities
Remedies:
- Add regularization (L2 or L1), which keeps coefficients finite
- Use penalized likelihood methods such as Firth's correction
- Collect more data or simplify the feature set

The sketch below illustrates how regularization strength controls coefficient growth on separable data.
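A minimal sketch using scikit-learn on a perfectly separable toy dataset; the parameter C (inverse regularization strength) controls how large the coefficient is allowed to grow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: x < 0 -> class 0, x > 0 -> class 1
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Weak regularization (large C): the coefficient grows very large as the fit
# pushes probabilities toward 0/1; stronger regularization keeps it modest.
for C in [1e6, 1.0, 0.1]:
    clf = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    print(f"C={C:g}: coef={clf.coef_[0, 0]:.2f}, intercept={clf.intercept_[0]:.2f}")
```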
Assumption 4: Correctly Specified Linear Predictor
The model assumes we've included the 'right' features in the linear predictor. Misspecification leads to:
- Biased coefficient estimates, since omitted effects are absorbed by correlated included features
- Degraded predictive accuracy and poorly calibrated probabilities
Remedies:
- Use domain knowledge and diagnostic checks to identify missing features or transformations
- Compare against a flexible benchmark model; a large performance gap suggests misspecification
Practical Robustness:
Despite these assumptions, multinomial logistic regression is remarkably robust in practice:
- Mild violations of linearity degrade accuracy and calibration gracefully rather than catastrophically
- Regularization handles separation, collinearity, and high-dimensional feature sets
- With informative features, linear decision boundaries are often sufficient, and the model trains quickly and reproducibly
It remains the go-to method for multi-class classification before trying more complex models.
Implementing multinomial logistic regression correctly requires attention to several practical details. Let's examine production-quality patterns.
Data Preparation
Feature scaling: Unlike tree-based methods, logistic regression benefits from standardized features (mean 0, std 1). This helps optimization converge faster and makes coefficients comparable.
One-hot encoding: Categorical features must be one-hot encoded. Drop one level per feature to avoid multicollinearity (or rely on regularization).
Label encoding: Convert string labels to integers ${0, 1, \ldots, K-1}$. For loss computation, convert to one-hot format.
Train/validation split: Essential for hyperparameter tuning (regularization strength).
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix


def train_multinomial_logistic_regression(X, y, C=1.0, max_iter=1000):
    """
    Train multinomial logistic regression with best practices.

    Args:
        X: Feature matrix (n_samples, n_features)
        y: Labels (n_samples,), can be strings or integers
        C: Inverse regularization strength (smaller = more regularization)
        max_iter: Maximum iterations for solver

    Returns:
        model: Trained LogisticRegression model
        scaler: Fitted StandardScaler for features
        label_encoder: Fitted LabelEncoder for labels
    """
    # Encode labels to integers
    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)
    n_classes = len(label_encoder.classes_)
    print(f"Classes: {label_encoder.classes_}")
    print(f"Number of classes: {n_classes}")

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y_encoded, test_size=0.2, random_state=42, stratify=y_encoded
    )

    # Train model
    # multi_class='multinomial' uses softmax (vs 'ovr' for one-vs-rest)
    # solver='lbfgs' works well for multinomial
    model = LogisticRegression(
        C=C,
        multi_class='multinomial',
        solver='lbfgs',
        max_iter=max_iter,
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Evaluate
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    print(f"\nTraining accuracy: {train_acc:.4f}")
    print(f"Test accuracy: {test_acc:.4f}")

    # Detailed classification report
    y_pred = model.predict(X_test)
    print("\nClassification Report:")
    print(classification_report(
        y_test, y_pred, target_names=label_encoder.classes_
    ))

    return model, scaler, label_encoder


def analyze_coefficients(model, feature_names, label_encoder):
    """
    Analyze and interpret model coefficients.

    Args:
        model: Trained LogisticRegression model
        feature_names: List of feature names
        label_encoder: Fitted LabelEncoder
    """
    print("\n=== Coefficient Analysis ===\n")

    # model.coef_ has shape (n_classes, n_features)
    # model.intercept_ has shape (n_classes,)
    for k, class_name in enumerate(label_encoder.classes_):
        print(f"Class '{class_name}':")
        coef = model.coef_[k]
        intercept = model.intercept_[k]

        # Sort features by absolute coefficient magnitude
        sorted_idx = np.argsort(np.abs(coef))[::-1]

        print(f"  Intercept: {intercept:.4f}")
        print(f"  Top 5 features by importance:")
        for idx in sorted_idx[:5]:
            odds_ratio = np.exp(coef[idx])
            print(f"    {feature_names[idx]}: coef={coef[idx]:.4f}, "
                  f"odds_ratio={odds_ratio:.4f}")
        print()


def predict_with_probabilities(model, scaler, label_encoder, X_new):
    """
    Make predictions with probability distributions.

    Args:
        model: Trained model
        scaler: Fitted scaler
        label_encoder: Fitted label encoder
        X_new: New feature matrix

    Returns:
        predictions: Predicted class names
        probabilities: Probability for each class
    """
    X_scaled = scaler.transform(X_new)

    # Get probabilities (softmax output)
    probs = model.predict_proba(X_scaled)

    # Get predicted classes
    pred_indices = model.predict(X_scaled)
    predictions = label_encoder.inverse_transform(pred_indices)

    return predictions, probs


# Example usage
if __name__ == "__main__":
    # Generate synthetic data
    from sklearn.datasets import make_classification
    X, y = make_classification(
        n_samples=1000, n_features=10, n_informative=5, n_redundant=2,
        n_classes=4, n_clusters_per_class=1, random_state=42
    )

    # Convert numeric labels to strings for realistic demo
    label_map = {0: 'Class_A', 1: 'Class_B', 2: 'Class_C', 3: 'Class_D'}
    y_str = np.array([label_map[yi] for yi in y])
    feature_names = [f'feature_{i}' for i in range(X.shape[1])]

    # Train model
    model, scaler, le = train_multinomial_logistic_regression(X, y_str, C=1.0)

    # Analyze coefficients
    analyze_coefficients(model, feature_names, le)

    # Make predictions on new data
    X_new = np.random.randn(3, 10)
    preds, probs = predict_with_probabilities(model, scaler, le, X_new)
    print("=== Predictions on New Data ===")
    for i in range(len(preds)):
        print(f"Sample {i}: Predicted={preds[i]}")
        for j, class_name in enumerate(le.classes_):
            print(f"  P({class_name}) = {probs[i, j]:.4f}")
        print()
```

We have developed multinomial logistic regression as the complete framework for probabilistic multi-class classification. Let's consolidate the key insights:
- The model maps features to class probabilities via linear logits $z_k = \mathbf{w}_k^T\mathbf{x} + b_k$ passed through softmax
- Softmax's shift invariance makes the full parameterization non-identifiable; this is resolved with a reference class or with regularization
- Coefficients are interpretable as effects on log-odds (odds ratios after exponentiation)
- Decision boundaries are hyperplanes, partitioning feature space into convex regions
- Binary logistic regression is exactly the $K = 2$ special case
What's Next:
With the model structure established, we now turn to the crucial question: How do we train this model? The next page develops the cross-entropy loss function, deriving it from maximum likelihood principles and understanding why it's the natural choice for training softmax-based classifiers.
You now have a comprehensive understanding of multinomial logistic regression—its mathematical formulation, interpretation, geometry, and implementation. This model serves as the foundation for understanding classification output layers in modern deep learning.