The defining characteristic of Linear Discriminant Analysis—what separates it from Quadratic Discriminant Analysis and gives it its name—is the assumption that all classes share a common covariance matrix. This single assumption has profound consequences: it determines the geometry of decision boundaries, enables a powerful dimensionality reduction interpretation, and connects LDA to Fisher's classical work on discriminant functions.
In this page, we explore the shared covariance assumption from multiple angles: how to estimate the pooled covariance matrix, why it leads to linear boundaries, how LDA relates to Fisher's discriminant, and the geometric intuition behind class separation. Understanding these connections transforms LDA from a black-box algorithm into a deeply interpretable method with clear mathematical foundations.
By the end of this page, you will understand: how the pooled covariance matrix is estimated and what it represents, the mathematical derivation showing why equal covariances yield linear boundaries, Fisher's discriminant analysis as a dimensionality reduction technique, and the geometric interpretation of LDA projections.
When we assume all classes share the same covariance matrix $\Sigma$, we must estimate this single matrix from training data. The natural approach is to pool information from all classes, leveraging the assumption that they're all estimating the same underlying quantity.
The pooled sample covariance:
Given $K$ classes with $n_k$ samples in class $k$ (total $n = \sum_k n_k$), and class-specific sample covariance matrices $\hat{\Sigma}_k$, the pooled covariance is:
$$\hat{\Sigma}_{\text{pooled}} = \frac{\sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k}{\sum_{k=1}^{K}(n_k - 1)} = \frac{\sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k}{n - K}$$
Alternatively, computing directly from deviations:
$$\hat{\Sigma}_{\text{pooled}} = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Here, each observation is centered around its class mean, not the global mean. This is critical: we're measuring within-class variation, not total variation.
The denominator is $n - K$ rather than $n$ because we lose one degree of freedom per class when estimating class means. Each class contributes $n_k - 1$ degrees of freedom to the pooled estimate, summing to $n - K$. This ensures the pooled covariance is an unbiased estimator of $\Sigma$.
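The two formulas above are algebraically identical, which is easy to check numerically. Here is a small NumPy sketch on made-up two-class data (the means, covariances, and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: K = 2 classes drawn with the same true covariance (toy assumption)
K, p = 2, 3
X1 = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(p), size=40)
X2 = rng.multivariate_normal(mean=[2, 1, 0], cov=np.eye(p), size=60)
n = len(X1) + len(X2)

# Route 1: weighted average of class-specific covariances, weights n_k - 1
S1 = np.cov(X1, rowvar=False)   # np.cov divides by n_1 - 1
S2 = np.cov(X2, rowvar=False)
pooled_avg = ((len(X1) - 1) * S1 + (len(X2) - 1) * S2) / (n - K)

# Route 2: center each class at its own mean, divide by n - K
D1 = X1 - X1.mean(axis=0)
D2 = X2 - X2.mean(axis=0)
pooled_dev = (D1.T @ D1 + D2.T @ D2) / (n - K)

assert np.allclose(pooled_avg, pooled_dev)
```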
Intuition behind pooling:
To understand why pooling works, consider the signal-to-noise perspective:
If the assumption $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$ holds, each class-specific estimate is a noisy observation of the same $\Sigma$. Averaging (with appropriate weights) reduces noise while preserving the signal—the variance of $\hat{\Sigma}_{\text{pooled}}$ is lower than any individual $\hat{\Sigma}_k$.
The bias-variance tradeoff:
This is the fundamental tradeoff between LDA and QDA. LDA pools aggressively (low variance, potential bias), while QDA doesn't pool at all (no bias from misspecification, but higher variance in estimates).
| Approach | Formula | When It Helps | When It Hurts |
|---|---|---|---|
| Class-specific (QDA) | $\hat{\Sigma}_k$ per class | Large $n_k$, different covariances | Small $n_k$, high dimension |
| Pooled (LDA) | Weighted average of $\hat{\Sigma}_k$ | Similar covariances, moderate data | Very different covariances |
| Diagonal (Naive Bayes) | Diagonal of $\hat{\Sigma}_k$ | Independent features, limited data | Correlated features |
| Regularized | Shrinkage between pooled and identity | High dimension, limited data | Large $n$, clear structure |
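To illustrate the regularized row of the table: when there are more features than samples, the raw covariance estimate is singular, and shrinkage toward a scaled identity restores invertibility. The shrinkage form and the `alpha` value below are illustrative assumptions, not a fixed recipe:

```python
import numpy as np

rng = np.random.default_rng(8)
p, n = 10, 8                        # more features than samples
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)         # sample covariance: singular (rank <= n - 1)

# Shrink toward a scaled identity, as in regularized discriminant analysis;
# alpha = 0.3 is an illustrative value (tuned by cross-validation in practice)
alpha = 0.3
target = (np.trace(S) / p) * np.eye(p)
S_reg = (1 - alpha) * S + alpha * target

assert np.linalg.matrix_rank(S) < p            # raw estimate is not invertible
assert np.all(np.linalg.eigvalsh(S_reg) > 0)   # shrunk estimate is SPD
```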
To deeply understand LDA, we must decompose total data variation into two components: within-class scatter (variation around class means) and between-class scatter (variation of class means around the global mean). This decomposition is fundamental to Fisher's discriminant analysis.
Total scatter matrix:
$$S_T = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$ is the global mean.
Within-class scatter matrix:
$$S_W = \sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \mu_k)(x_i - \mu_k)^T = \sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k$$
This measures how spread out the data is within each class. Note that $\hat{\Sigma}_{\text{pooled}} = \frac{1}{n-K}S_W$.
Between-class scatter matrix:
$$S_B = \sum_{k=1}^{K}n_k(\mu_k - \bar{x})(\mu_k - \bar{x})^T$$
This measures how spread out the class means are from each other.
A beautiful identity holds: $S_T = S_W + S_B$. Total variation equals within-class variation plus between-class variation. This is the multivariate generalization of the ANOVA decomposition. LDA exploits this by finding directions that maximize between-class scatter relative to within-class scatter.
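The identity $S_T = S_W + S_B$, and the rank bound on $S_B$ discussed below, can both be checked numerically. This sketch uses arbitrary toy data with $K = 3$ classes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 3-class data in p = 4 dimensions (values are arbitrary)
X = rng.normal(size=(90, 4))
y = np.repeat([0, 1, 2], 30)

xbar = X.mean(axis=0)
S_T = (X - xbar).T @ (X - xbar)

S_W = np.zeros((4, 4))
S_B = np.zeros((4, 4))
for k in np.unique(y):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    S_W += (Xk - mu_k).T @ (Xk - mu_k)
    S_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar)

# ANOVA-style decomposition: total = within + between
assert np.allclose(S_T, S_W + S_B)
# S_B has rank at most K - 1 = 2
assert np.linalg.matrix_rank(S_B) <= 2
```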
Geometric interpretation:
Imagine the data points in feature space: each class forms a cloud of points centered at its mean $\mu_k$, and the clouds themselves are scattered around the global mean $\bar{x}$.

For effective classification, we want the class means far apart (large between-class scatter) and each cloud tightly concentrated around its own mean (small within-class scatter).
The ratio $S_B / S_W$ (in an appropriate sense, since these are matrices) quantifies discriminability. Fisher's LDA finds directions that maximize this ratio.
Properties of the scatter matrices: all three are symmetric and positive semi-definite; $S_W$ is typically invertible when $n - K \geq p$; and $S_B$ has rank at most $K - 1$, since the centered means satisfy the linear constraint $\sum_k n_k(\mu_k - \bar{x}) = 0$.
The rank of $S_B$ being at most $K-1$ has profound implications: for classification into $K$ classes, we only need at most $K-1$ dimensions. This is why LDA can reduce a $p$-dimensional problem to $(K-1)$-dimensional.
Let's rigorously derive the form of LDA decision boundaries for a $K$-class problem. This derivation illuminates exactly where the linearity comes from.
Starting point: Bayes classification
We classify $x$ to the class maximizing the posterior probability:
$$\hat{y} = \arg\max_k P(Y = k | X = x)$$
By Bayes' theorem:
$$P(Y = k | X = x) = \frac{P(X = x | Y = k) \cdot \pi_k}{\sum_j P(X = x | Y = j) \cdot \pi_j}$$
Since the denominator is constant across classes, maximizing the posterior is equivalent to maximizing the numerator:
$$\hat{y} = \arg\max_k P(X = x | Y = k) \cdot \pi_k$$
Taking logarithms (monotonic transformation, preserves argmax):
$$\hat{y} = \arg\max_k \left[\log P(X = x | Y = k) + \log \pi_k\right]$$
Substituting the Gaussian density:
For class $k$ with mean $\mu_k$ and covariance $\Sigma$ (shared across all classes):
$$\log P(X = x | Y = k) = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)$$
The first two terms ($-\frac{p}{2}\log(2\pi)$ and $-\frac{1}{2}\log|\Sigma|$) are constant across all classes because $\Sigma$ is shared. They don't affect which class wins the argmax.
So we only need to maximize:
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k) + \log \pi_k$$
This is the discriminant function for class $k$.
Notice that the log-determinant term $\log|\Sigma|$ cancels because it doesn't depend on $k$. If covariances were class-specific ($\Sigma_k$), this term would be $\log|\Sigma_k|$, varying by class and contributing to a quadratic boundary. The shared covariance is essential.
Expanding the quadratic form:
$$(x - \mu_k)^T\Sigma^{-1}(x - \mu_k) = x^T\Sigma^{-1}x - 2\mu_k^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k$$
The term $x^T\Sigma^{-1}x$ is the same for all classes (it doesn't depend on $k$). Dropping constants that don't affect the argmax:
$$\delta_k(x) = \mu_k^T\Sigma^{-1}x - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k$$
This can be written as:
$$\delta_k(x) = w_k^T x + b_k$$
where $w_k = \Sigma^{-1}\mu_k$ and $b_k = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k$.
This is a linear function of $x$—hence Linear Discriminant Analysis.
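A small sketch (with made-up means, priors, and a made-up shared covariance) confirming that dropping the class-independent quadratic term $x^T\Sigma^{-1}x$ never changes which class wins the argmax:

```python
import numpy as np

rng = np.random.default_rng(2)
p, K = 3, 3
Sigma = np.eye(p) + 0.3 * np.ones((p, p))   # made-up shared SPD covariance
Sigma_inv = np.linalg.inv(Sigma)
mus = rng.normal(size=(K, p))               # made-up class means
priors = np.array([0.5, 0.3, 0.2])

def delta_quadratic(x):
    # Log-posterior up to class-independent constants (keeps the quadratic term)
    return np.array([-0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(pi)
                     for mu, pi in zip(mus, priors)])

def delta_linear(x):
    # Linear discriminant: w_k^T x + b_k with w_k = Sigma^{-1} mu_k
    return np.array([mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
                     for mu, pi in zip(mus, priors)])

# Both discriminants pick the same class at every test point
for x in rng.normal(size=(200, p)):
    assert delta_quadratic(x).argmax() == delta_linear(x).argmax()
```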
The decision boundary between two classes:
The boundary between classes $k$ and $l$ is where $\delta_k(x) = \delta_l(x)$:
$$(w_k - w_l)^T x + (b_k - b_l) = 0$$
This is the equation of a hyperplane in $p$-dimensional space. The normal vector to this hyperplane is:
$$w_k - w_l = \Sigma^{-1}(\mu_k - \mu_l)$$
The direction separating classes $k$ and $l$ is determined by the difference of their means, transformed by $\Sigma^{-1}$. This transformation 'whitens' the feature space: directions of high variance are compressed, correcting for correlations.
For two classes specifically:
The classification rule simplifies to:
$$\text{Classify to class 1 if } (\mu_1 - \mu_2)^T\Sigma^{-1}x > \frac{1}{2}(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2) - \log\frac{\pi_1}{\pi_2}$$
The left side is a projection of $x$ onto the direction $\Sigma^{-1}(\mu_1 - \mu_2)$. Classification becomes a simple threshold on a one-dimensional projection.
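This equivalence between the one-dimensional threshold rule and the full discriminant comparison can be verified numerically; the parameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
Sigma = np.diag([1.0, 2.0, 0.5, 1.5])   # made-up shared covariance
Sinv = np.linalg.inv(Sigma)
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
pi1, pi2 = 0.6, 0.4

w = Sinv @ (mu1 - mu2)                  # 1-D projection direction
thresh = 0.5 * (mu1 @ Sinv @ mu1 - mu2 @ Sinv @ mu2) - np.log(pi1 / pi2)

def delta(x, mu, pi):
    return mu @ Sinv @ x - 0.5 * mu @ Sinv @ mu + np.log(pi)

# The threshold rule on the projection agrees with comparing discriminants
for x in rng.normal(size=(200, p)):
    assert (w @ x > thresh) == (delta(x, mu1, pi1) > delta(x, mu2, pi2))
```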
Ronald Fisher approached discriminant analysis from a different perspective: rather than starting with probabilistic assumptions, he asked a purely geometric question:
What linear projection maximizes the separation between classes relative to their spread?
Remarkably, under LDA's assumptions, Fisher's geometric solution coincides exactly with the Bayesian solution. This provides an alternative, illuminating view of LDA.
The Fisher criterion (two-class case):
Consider projecting data onto a line defined by direction $w$. For each point $x$, its projection is $z = w^T x$. Fisher's criterion is:
$$J(w) = \frac{\text{(between-class variance of projections)}}{\text{(within-class variance of projections)}}$$
Let $\tilde{\mu}_k = w^T\mu_k$ be the projected mean of class $k$, and $\tilde{s}_k^2$ be the within-class variance of projections for class $k$.
$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
We want to find the $w$ that maximizes this ratio.
Fisher's criterion balances two competing goals: (1) make the projected class means as far apart as possible (large numerator), and (2) make the projected points within each class as tight as possible (small denominator). A direction that separates means but also stretches within-class variance isn't useful—data clouds would overlap on the projection.
Solving the optimization:
Express $J(w)$ in terms of scatter matrices. The projected between-class scatter is:
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T(\mu_1 - \mu_2))^2 = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw = w^T S_B w$$
The projected within-class scatter is:
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
So the Fisher criterion becomes:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
This is a Rayleigh quotient. It's maximized when $w$ is the solution to the generalized eigenvalue problem:
$$S_B w = \lambda S_W w$$
or equivalently:
$$S_W^{-1} S_B w = \lambda w$$
For the two-class case, $S_B$ is rank-1 (it's $(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ up to a constant), so there's only one non-trivial eigenvector:
$$w^* = S_W^{-1}(\mu_1 - \mu_2)$$
Substituting the pooled covariance estimate $\hat{\Sigma} = \frac{1}{n-2}S_W$:
$$w^* \propto \hat{\Sigma}^{-1}(\mu_1 - \mu_2)$$
This is exactly the direction we derived from the Bayesian perspective! Fisher's geometric approach yields the same classifier as the probabilistic approach under Gaussian assumptions.
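A quick numerical check on synthetic two-class data that the Fisher direction $S_W^{-1}(\mu_1 - \mu_2)$ and the Bayes-derived direction $\hat{\Sigma}^{-1}(\mu_1 - \mu_2)$ are parallel (they differ only by the scalar $n - 2$):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 3
X1 = rng.multivariate_normal([0, 0, 0], np.eye(p), size=50)
X2 = rng.multivariate_normal([3, 1, -1], np.eye(p), size=50)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
D1, D2 = X1 - mu1, X2 - mu2
S_W = D1.T @ D1 + D2.T @ D2
Sigma_hat = S_W / (len(X1) + len(X2) - 2)   # pooled covariance estimate

w_fisher = np.linalg.solve(S_W, mu1 - mu2)
w_bayes = np.linalg.solve(Sigma_hat, mu1 - mu2)

# The two directions are parallel: cosine similarity is 1
cos = w_fisher @ w_bayes / (np.linalg.norm(w_fisher) * np.linalg.norm(w_bayes))
assert np.isclose(cos, 1.0)
```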
Multi-class Fisher discriminant:
For $K$ classes, we seek multiple discriminant directions. The optimization becomes:
$$\max_{W} \frac{\det(W^T S_B W)}{\det(W^T S_W W)}$$
where $W$ is a $p \times d$ matrix (finding $d$ directions).
The solution is given by the eigenvectors of $S_W^{-1}S_B$ corresponding to the largest eigenvalues. Since $S_B$ has rank at most $K-1$, there are at most $K-1$ non-zero eigenvalues, meaning there are at most $K-1$ meaningful discriminant directions.
This is LDA as dimensionality reduction: reducing $p$ features to at most $K-1$ discriminant components while preserving classification accuracy.
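A sketch of the generalized eigenvalue problem on synthetic three-class data, confirming that at most $K-1$ eigenvalues are non-zero:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
p, K, n_k = 6, 3, 40
means = rng.normal(scale=3, size=(K, p))   # made-up class means
X = np.vstack([rng.multivariate_normal(m, np.eye(p), size=n_k) for m in means])
y = np.repeat(np.arange(K), n_k)

xbar = X.mean(axis=0)
S_W = np.zeros((p, p))
S_B = np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    S_W += (Xk - mu_k).T @ (Xk - mu_k)
    S_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar)

# Generalized eigenproblem S_B w = lambda * S_W w (eigh handles the pair)
eigvals, eigvecs = eigh(S_B, S_W)

# At most K - 1 = 2 eigenvalues are (numerically) non-zero
assert np.sum(eigvals > 1e-6) <= K - 1
```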
The geometry of LDA provides powerful intuition for understanding its behavior. Let's visualize what LDA does in feature space.
The whitening transformation:
LDA can be understood as a two-step process: (1) whiten the data with the transformation $x \mapsto \Sigma^{-1/2}x$, so that the shared covariance becomes the identity; (2) classify each point to the nearest class mean in the whitened space, adjusting for priors.
After whitening, the classes become spherical (identity covariance), and the optimal classifier simply assigns to the nearest mean. The covariance structure is embedded in the transformation rather than explicitly in the distance metric.
In the original space, comparing Euclidean distances to class means would be wrong—it ignores the correlation structure. Whitening 'sphereizes' the distribution, making Euclidean distance equivalent to Mahalanobis distance. After whitening, the nearest-mean classifier is optimal under Gaussian assumptions.
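A small sketch verifying that squared Euclidean distance after whitening equals Mahalanobis distance in the original space; the covariance here is an arbitrary SPD matrix and the mean is made up:

```python
import numpy as np

rng = np.random.default_rng(6)
p = 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)   # an arbitrary SPD covariance (assumption)
mu = rng.normal(size=p)           # a made-up class mean

# Whitening matrix Sigma^(-1/2) via the eigendecomposition
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
Sigma_inv = np.linalg.inv(Sigma)

for x in rng.normal(size=(100, p)):
    # Mahalanobis distance in the original space ...
    d_mahal = (x - mu) @ Sigma_inv @ (x - mu)
    # ... equals squared Euclidean distance after whitening
    d_eucl = np.sum((W @ (x - mu)) ** 2)
    assert np.isclose(d_mahal, d_eucl)
```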
Visualizing decision regions:
For a two-class problem, the decision boundary is a single hyperplane: its orientation is set by the normal vector $\Sigma^{-1}(\mu_1 - \mu_2)$, and its position by the class means and priors.

When $\mu_1 \neq \mu_2$, the boundary is perpendicular to the line connecting the means in the whitened space (not necessarily in the original space, unless $\Sigma = I$).
Effect of priors:
Unequal priors shift the decision boundary away from the higher-prior class. Geometrically, the hyperplane moves toward the mean of the lower-prior class by an offset proportional to the log prior ratio, enlarging the decision region of the more common class.
Multi-class regions:
For $K$ classes, the decision regions are convex polyhedra (intersections of half-spaces). Each pair of classes contributes a hyperplane boundary, and each class's region is the intersection of all half-spaces where that class wins pairwise comparisons.
| Property | Description | Consequence |
|---|---|---|
| Linear boundaries | Hyperplanes separate classes | Efficient computation, simple interpretation |
| Convex regions | Each class region is convex | No isolated 'islands' of one class inside another |
| Prior-dependent | Priors shift boundaries | Can control false positive/negative tradeoff |
| Mahalanobis-based | Uses correlation-corrected distance | Handles correlated features appropriately |
| Projection-based | Reduces to $K-1$ dimensions | Effective for high-dimensional data |
In practice, we estimate LDA parameters from training data and plug them into the discriminant functions. The estimation procedure is straightforward:
Step 1: Estimate class priors
$$\hat{\pi}_k = \frac{n_k}{n}$$
Step 2: Estimate class means
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y_i = k}x_i$$
Step 3: Estimate pooled covariance
$$\hat{\Sigma} = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Step 4: Compute discriminant functions
For a new observation $x$:
$$\hat{\delta}_k(x) = \hat{\mu}_k^T\hat{\Sigma}^{-1}x - \frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k + \log \hat{\pi}_k$$
Step 5: Classify
$$\hat{y} = \arg\max_k \hat{\delta}_k(x)$$
```python
import numpy as np
from scipy.linalg import inv, eigh


class LinearDiscriminantAnalysis:
    """
    LDA implementation from first principles.
    Demonstrates the shared covariance assumption.
    """

    def __init__(self):
        self.classes_ = None      # Unique class labels
        self.means_ = None        # Class means
        self.priors_ = None       # Class priors
        self.covariance_ = None   # Pooled covariance
        self.coef_ = None         # Σ^(-1) @ μ_k for each class
        self.intercept_ = None    # Bias terms

    def fit(self, X, y):
        """Fit LDA model to training data."""
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # Step 1: Estimate class priors
        self.priors_ = np.array([np.mean(y == k) for k in self.classes_])

        # Step 2: Estimate class means
        self.means_ = np.array([X[y == k].mean(axis=0) for k in self.classes_])

        # Step 3: Estimate pooled covariance (labels need not be 0..K-1,
        # so index means_ by position, not by label)
        S_W = np.zeros((n_features, n_features))
        for idx, k in enumerate(self.classes_):
            X_k_centered = X[y == k] - self.means_[idx]
            S_W += X_k_centered.T @ X_k_centered
        self.covariance_ = S_W / (n_samples - n_classes)

        # Step 4: Compute discriminant coefficients
        Sigma_inv = inv(self.covariance_)
        self.coef_ = Sigma_inv @ self.means_.T  # p x K matrix

        # Bias terms: -0.5 * μ_k^T Σ^(-1) μ_k + log(π_k)
        self.intercept_ = np.array([
            -0.5 * self.means_[k] @ Sigma_inv @ self.means_[k]
            + np.log(self.priors_[k])
            for k in range(n_classes)
        ])
        return self

    def decision_function(self, X):
        """Discriminant scores δ_k(x) = x^T Σ^(-1) μ_k + intercept_k."""
        return X @ self.coef_ + self.intercept_

    def predict(self, X):
        """Predict class labels."""
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]

    def predict_proba(self, X):
        """Predict posterior probabilities via a softmax over the scores."""
        scores = self.decision_function(X)
        exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        return exp_scores / exp_scores.sum(axis=1, keepdims=True)

    def transform(self, X, n_components=None):
        """Project data onto discriminant directions (Fisher's view)."""
        if n_components is None:
            n_components = min(len(self.classes_) - 1, X.shape[1])

        # Between-class scatter around the overall mean
        overall_mean = X.mean(axis=0)
        S_B = np.zeros((X.shape[1], X.shape[1]))
        for k, n_k in enumerate(self.priors_ * len(X)):
            diff = (self.means_[k] - overall_mean).reshape(-1, 1)
            S_B += n_k * diff @ diff.T

        # Generalized eigenvalue problem S_B w = λ Σ w
        eigenvalues, eigenvectors = eigh(S_B, self.covariance_)

        # Sort by decreasing eigenvalue and keep the leading directions
        idx = np.argsort(eigenvalues)[::-1]
        self.scalings_ = eigenvectors[:, idx[:n_components]]
        return X @ self.scalings_
```

In practice, directly inverting the covariance matrix can be numerically unstable, especially with high-dimensional or collinear features. Production implementations use Cholesky decomposition, SVD, or regularization to handle near-singular covariances. The implementation above is pedagogical; use `scipy.linalg.solve` or sklearn's `LinearDiscriminantAnalysis` for robust implementations.
The shared covariance assumption creates connections between LDA and several other important methods:
LDA and Linear Regression:
For two-class problems, LDA is equivalent to least squares regression with specific coding: if the class targets are coded with two distinct values (classically $-n/n_1$ and $n/n_2$, but any pair works), the least squares coefficient vector is proportional to the LDA direction $\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1)$.
This connection explains why LDA is sometimes called 'linear regression in disguise' for binary classification.
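A numerical check of this equivalence on synthetic data, using a simple $\pm 1$ coding (the direction, not the scale, is what matches):

```python
import numpy as np

rng = np.random.default_rng(7)
p = 3
X1 = rng.multivariate_normal([0, 0, 0], np.eye(p), size=60)
X2 = rng.multivariate_normal([2, 1, 0], np.eye(p), size=40)
X = np.vstack([X1, X2])
t = np.r_[np.full(len(X1), -1.0), np.full(len(X2), 1.0)]  # +/-1 coding

# Least squares with an intercept column; keep only the slope vector
A = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(A, t, rcond=None)[0][1:]

# LDA direction from the pooled covariance
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
D1, D2 = X1 - mu1, X2 - mu2
Sigma_hat = (D1.T @ D1 + D2.T @ D2) / (len(X) - 2)
w_lda = np.linalg.solve(Sigma_hat, mu2 - mu1)

# The coefficient vectors are parallel: |cosine similarity| is 1
cos = beta @ w_lda / (np.linalg.norm(beta) * np.linalg.norm(w_lda))
assert np.isclose(abs(cos), 1.0)
```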
LDA and PCA:
Both are linear projection methods, but with different objectives: PCA finds directions of maximum total variance, while LDA finds directions that maximize between-class variance relative to within-class variance.
PCA is unsupervised (ignores labels), while LDA is supervised (uses labels to find discriminative directions). In some cases, the PCA directions are also discriminative, but this isn't guaranteed.
LDA and Gaussian Naive Bayes:
Both assume Gaussian class-conditionals, but Gaussian Naive Bayes restricts each class covariance to be diagonal (features conditionally independent), while LDA allows a full covariance matrix, shared across classes.
Gaussian NB is more restrictive (zero correlations) but requires fewer parameters. LDA models correlations but assumes they're the same across classes.
LDA and Logistic Regression:
Both produce linear boundaries for binary classification: logistic regression models the posterior $P(Y \mid X)$ directly (discriminative), while LDA models the class-conditional densities and priors (generative) and obtains the posterior via Bayes' theorem.
Asymptotically (infinite data), both converge to the same boundary when LDA assumptions hold. With finite data, LDA can be more efficient (uses more assumptions), while logistic regression is more robust to misspecification.
What's next:
The shared covariance assumption is restrictive. When classes truly have different covariances, LDA boundaries are suboptimal. The next page explores Quadratic Discriminant Analysis (QDA), which relaxes this assumption, allowing class-specific covariances and resulting in quadratic (curved) decision boundaries.
You now deeply understand the shared covariance assumption: how it's estimated, why it leads to linear boundaries, Fisher's discriminant interpretation, and the geometric intuition behind LDA projections. Next, we'll explore QDA with class-specific covariances.