The defining characteristic of Linear Discriminant Analysis—what separates it from Quadratic Discriminant Analysis and gives it its name—is the assumption that all classes share a common covariance matrix. This single assumption has profound consequences: it determines the geometry of decision boundaries, enables a powerful dimensionality reduction interpretation, and connects LDA to Fisher's classical work on discriminant functions.
In this page, we explore the shared covariance assumption from multiple angles: how to estimate the pooled covariance matrix, why it leads to linear boundaries, how LDA relates to Fisher's discriminant, and the geometric intuition behind class separation. Understanding these connections transforms LDA from a black-box algorithm into a deeply interpretable method with clear mathematical foundations.
By the end of this page, you will understand: how the pooled covariance matrix is estimated and what it represents, the mathematical derivation showing why equal covariances yield linear boundaries, Fisher's discriminant analysis as a dimensionality reduction technique, and the geometric interpretation of LDA projections.
When we assume all classes share the same covariance matrix $\Sigma$, we must estimate this single matrix from training data. The natural approach is to pool information from all classes, leveraging the assumption that they're all estimating the same underlying quantity.
The pooled sample covariance:
Given $K$ classes with $n_k$ samples in class $k$ (total $n = \sum_k n_k$), and class-specific sample covariance matrices $\hat{\Sigma}_k$, the pooled covariance is:
$$\hat{\Sigma}_{\text{pooled}} = \frac{\sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k}{\sum_{k=1}^{K}(n_k - 1)} = \frac{\sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k}{n - K}$$
Alternatively, computing directly from deviations:
$$\hat{\Sigma}_{\text{pooled}} = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Here, each observation is centered around its class mean, not the global mean. This is critical: we're measuring within-class variation, not total variation.
The denominator is $n - K$ rather than $n$ because we lose one degree of freedom per class when estimating class means. Each class contributes $n_k - 1$ degrees of freedom to the pooled estimate, summing to $n - K$. This ensures the pooled covariance is an unbiased estimator of $\Sigma$.
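The two formulas above are algebraically identical, which is easy to check numerically. Here is a small NumPy sketch on made-up two-class data (the means, covariances, and sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: K = 2 classes drawn with the same true covariance (toy assumption)
K, p = 2, 3
X1 = rng.multivariate_normal(mean=[0, 0, 0], cov=np.eye(p), size=40)
X2 = rng.multivariate_normal(mean=[2, 1, 0], cov=np.eye(p), size=60)
n = len(X1) + len(X2)

# Route 1: weighted average of class-specific covariances, weights n_k - 1
S1 = np.cov(X1, rowvar=False)   # np.cov divides by n_1 - 1
S2 = np.cov(X2, rowvar=False)
pooled_avg = ((len(X1) - 1) * S1 + (len(X2) - 1) * S2) / (n - K)

# Route 2: center each class at its own mean, divide by n - K
D1 = X1 - X1.mean(axis=0)
D2 = X2 - X2.mean(axis=0)
pooled_dev = (D1.T @ D1 + D2.T @ D2) / (n - K)

assert np.allclose(pooled_avg, pooled_dev)
```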
Intuition behind pooling:
To understand why pooling works, consider the signal-to-noise perspective:
If the assumption $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$ holds, each class-specific estimate is a noisy observation of the same $\Sigma$. Averaging (with appropriate weights) reduces noise while preserving the signal—the variance of $\hat{\Sigma}_{\text{pooled}}$ is lower than any individual $\hat{\Sigma}_k$.
The bias-variance tradeoff:
This is the fundamental tradeoff between LDA and QDA. LDA pools aggressively (low variance, potential bias), while QDA doesn't pool at all (no bias from misspecification, but higher variance in estimates).
| Approach | Formula | When It Helps | When It Hurts |
|---|---|---|---|
| Class-specific (QDA) | $\hat{\Sigma}_k$ per class | Large $n_k$, different covariances | Small $n_k$, high dimension |
| Pooled (LDA) | Weighted average of $\hat{\Sigma}_k$ | Similar covariances, moderate data | Very different covariances |
| Diagonal (Naive Bayes) | Diagonal of $\hat{\Sigma}_k$ | Independent features, limited data | Correlated features |
| Regularized | Shrinkage between pooled and identity | High dimension, limited data | Large $n$, clear structure |
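To illustrate the regularized row of the table: when there are more features than samples, the raw covariance estimate is singular, and shrinkage toward a scaled identity restores invertibility. The shrinkage form and the `alpha` value below are illustrative assumptions, not a fixed recipe:

```python
import numpy as np

rng = np.random.default_rng(8)
p, n = 10, 8                        # more features than samples
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)         # sample covariance: singular (rank <= n - 1)

# Shrink toward a scaled identity, as in regularized discriminant analysis;
# alpha = 0.3 is an illustrative value (tuned by cross-validation in practice)
alpha = 0.3
target = (np.trace(S) / p) * np.eye(p)
S_reg = (1 - alpha) * S + alpha * target

assert np.linalg.matrix_rank(S) < p            # raw estimate is not invertible
assert np.all(np.linalg.eigvalsh(S_reg) > 0)   # shrunk estimate is SPD
```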
To deeply understand LDA, we must decompose total data variation into two components: within-class scatter (variation around class means) and between-class scatter (variation of class means around the global mean). This decomposition is fundamental to Fisher's discriminant analysis.
Total scatter matrix:
$$S_T = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$$
where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$ is the global mean.
Within-class scatter matrix:
$$S_W = \sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \mu_k)(x_i - \mu_k)^T = \sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k$$
This measures how spread out the data is within each class. Note that $\hat{\Sigma}_{\text{pooled}} = \frac{1}{n-K}S_W$.
Between-class scatter matrix:
$$S_B = \sum_{k=1}^{K}n_k(\mu_k - \bar{x})(\mu_k - \bar{x})^T$$
This measures how spread out the class means are from each other.
A beautiful identity holds: $S_T = S_W + S_B$. Total variation equals within-class variation plus between-class variation. This is the multivariate generalization of the ANOVA decomposition. LDA exploits this by finding directions that maximize between-class scatter relative to within-class scatter.
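The identity $S_T = S_W + S_B$, and the rank bound on $S_B$ discussed below, can both be checked numerically. This sketch uses arbitrary toy data with $K = 3$ classes:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 3-class data in p = 4 dimensions (values are arbitrary)
X = rng.normal(size=(90, 4))
y = np.repeat([0, 1, 2], 30)

xbar = X.mean(axis=0)
S_T = (X - xbar).T @ (X - xbar)

S_W = np.zeros((4, 4))
S_B = np.zeros((4, 4))
for k in np.unique(y):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    S_W += (Xk - mu_k).T @ (Xk - mu_k)
    S_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar)

# ANOVA-style decomposition: total = within + between
assert np.allclose(S_T, S_W + S_B)
# S_B has rank at most K - 1 = 2
assert np.linalg.matrix_rank(S_B) <= 2
```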
Geometric interpretation:
Imagine the data points in feature space: each class forms a cloud of points centered at its mean $\mu_k$, and the clouds themselves are scattered around the global mean $\bar{x}$.

For effective classification, we want the class means far apart (large between-class scatter) and each cloud tightly concentrated around its own mean (small within-class scatter).
The ratio $S_B / S_W$ (in an appropriate sense, since these are matrices) quantifies discriminability. Fisher's LDA finds directions that maximize this ratio.
Properties of the scatter matrices: all three are symmetric and positive semi-definite; $S_W$ is typically invertible when $n - K \geq p$; and $S_B$ has rank at most $K - 1$, since the centered means satisfy the linear constraint $\sum_k n_k(\mu_k - \bar{x}) = 0$.
The rank of $S_B$ being at most $K-1$ has profound implications: for classification into $K$ classes, we only need at most $K-1$ dimensions. This is why LDA can reduce a $p$-dimensional problem to $(K-1)$-dimensional.
Let's rigorously derive the form of LDA decision boundaries for a $K$-class problem. This derivation illuminates exactly where the linearity comes from.
Starting point: Bayes classification
We classify $x$ to the class maximizing the posterior probability:
$$\hat{y} = \arg\max_k P(Y = k | X = x)$$
By Bayes' theorem:
$$P(Y = k | X = x) = \frac{P(X = x | Y = k) \cdot \pi_k}{\sum_j P(X = x | Y = j) \cdot \pi_j}$$
Since the denominator is constant across classes, maximizing the posterior is equivalent to maximizing the numerator:
$$\hat{y} = \arg\max_k P(X = x | Y = k) \cdot \pi_k$$
Taking logarithms (monotonic transformation, preserves argmax):
$$\hat{y} = \arg\max_k \left[\log P(X = x | Y = k) + \log \pi_k\right]$$
Substituting the Gaussian density:
For class $k$ with mean $\mu_k$ and covariance $\Sigma$ (shared across all classes):
$$\log P(X = x | Y = k) = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k)$$
The first two terms ($-\frac{p}{2}\log(2\pi)$ and $-\frac{1}{2}\log|\Sigma|$) are constant across all classes because $\Sigma$ is shared. They don't affect which class wins the argmax.
So we only need to maximize:
$$\delta_k(x) = -\frac{1}{2}(x - \mu_k)^T\Sigma^{-1}(x - \mu_k) + \log \pi_k$$
This is the discriminant function for class $k$.
Notice that the log-determinant term $\log|\Sigma|$ cancels because it doesn't depend on $k$. If covariances were class-specific ($\Sigma_k$), this term would be $\log|\Sigma_k|$, varying by class and contributing to a quadratic boundary. The shared covariance is essential.
Expanding the quadratic form:
$$(x - \mu_k)^T\Sigma^{-1}(x - \mu_k) = x^T\Sigma^{-1}x - 2\mu_k^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k$$
The term $x^T\Sigma^{-1}x$ is the same for all classes (it doesn't depend on $k$). Dropping constants that don't affect the argmax:
$$\delta_k(x) = \mu_k^T\Sigma^{-1}x - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k$$
This can be written as:
$$\delta_k(x) = w_k^T x + b_k$$
where $w_k = \Sigma^{-1}\mu_k$ and $b_k = -\frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log \pi_k$.
This is a linear function of $x$—hence Linear Discriminant Analysis.
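A small sketch (with made-up means, priors, and a made-up shared covariance) confirming that dropping the class-independent quadratic term $x^T\Sigma^{-1}x$ never changes which class wins the argmax:

```python
import numpy as np

rng = np.random.default_rng(2)
p, K = 3, 3
Sigma = np.eye(p) + 0.3 * np.ones((p, p))   # made-up shared SPD covariance
Sigma_inv = np.linalg.inv(Sigma)
mus = rng.normal(size=(K, p))               # made-up class means
priors = np.array([0.5, 0.3, 0.2])

def delta_quadratic(x):
    # Log-posterior up to class-independent constants (keeps the quadratic term)
    return np.array([-0.5 * (x - mu) @ Sigma_inv @ (x - mu) + np.log(pi)
                     for mu, pi in zip(mus, priors)])

def delta_linear(x):
    # Linear discriminant: w_k^T x + b_k with w_k = Sigma^{-1} mu_k
    return np.array([mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)
                     for mu, pi in zip(mus, priors)])

# Both discriminants pick the same class at every test point
for x in rng.normal(size=(200, p)):
    assert delta_quadratic(x).argmax() == delta_linear(x).argmax()
```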
The decision boundary between two classes:
The boundary between classes $k$ and $l$ is where $\delta_k(x) = \delta_l(x)$:
$$(w_k - w_l)^T x + (b_k - b_l) = 0$$
This is the equation of a hyperplane in $p$-dimensional space. The normal vector to this hyperplane is:
$$w_k - w_l = \Sigma^{-1}(\mu_k - \mu_l)$$
The direction separating classes $k$ and $l$ is determined by the difference of their means, transformed by $\Sigma^{-1}$. This transformation 'whitens' the feature space: directions of high variance are compressed, correcting for correlations.
For two classes specifically:
The classification rule simplifies to:
$$\text{Classify to class 1 if } (\mu_1 - \mu_2)^T\Sigma^{-1}x > \frac{1}{2}(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2) - \log\frac{\pi_1}{\pi_2}$$
The left side is a projection of $x$ onto the direction $\Sigma^{-1}(\mu_1 - \mu_2)$. Classification becomes a simple threshold on a one-dimensional projection.
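This equivalence between the one-dimensional threshold rule and the full discriminant comparison can be verified numerically; the parameters below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
Sigma = np.diag([1.0, 2.0, 0.5, 1.5])   # made-up shared covariance
Sinv = np.linalg.inv(Sigma)
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
pi1, pi2 = 0.6, 0.4

w = Sinv @ (mu1 - mu2)                  # 1-D projection direction
thresh = 0.5 * (mu1 @ Sinv @ mu1 - mu2 @ Sinv @ mu2) - np.log(pi1 / pi2)

def delta(x, mu, pi):
    return mu @ Sinv @ x - 0.5 * mu @ Sinv @ mu + np.log(pi)

# The threshold rule on the projection agrees with comparing discriminants
for x in rng.normal(size=(200, p)):
    assert (w @ x > thresh) == (delta(x, mu1, pi1) > delta(x, mu2, pi2))
```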
Ronald Fisher approached discriminant analysis from a different perspective: rather than starting with probabilistic assumptions, he asked a purely geometric question:
What linear projection maximizes the separation between classes relative to their spread?
Remarkably, under LDA's assumptions, Fisher's geometric solution coincides exactly with the Bayesian solution. This provides an alternative, illuminating view of LDA.
The Fisher criterion (two-class case):
Consider projecting data onto a line defined by direction $w$. For each point $x$, its projection is $z = w^T x$. Fisher's criterion is:
$$J(w) = \frac{\text{(between-class variance of projections)}}{\text{(within-class variance of projections)}}$$
Let $\tilde{\mu}_k = w^T\mu_k$ be the projected mean of class $k$, and $\tilde{s}_k^2$ be the within-class variance of projections for class $k$.
$$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
We want to find the $w$ that maximizes this ratio.
Fisher's criterion balances two competing goals: (1) make the projected class means as far apart as possible (large numerator), and (2) make the projected points within each class as tight as possible (small denominator). A direction that separates means but also stretches within-class variance isn't useful—data clouds would overlap on the projection.
Solving the optimization:
Express $J(w)$ in terms of scatter matrices. The projected between-class scatter is:
$$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T(\mu_1 - \mu_2))^2 = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw = w^T S_B w$$
The projected within-class scatter is:
$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
So the Fisher criterion becomes:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
This is a Rayleigh quotient. It's maximized when $w$ is the solution to the generalized eigenvalue problem:
$$S_B w = \lambda S_W w$$
or equivalently:
$$S_W^{-1} S_B w = \lambda w$$
For the two-class case, $S_B$ is rank-1 (it's $(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$ up to a constant), so there's only one non-trivial eigenvector:
$$w^* = S_W^{-1}(\mu_1 - \mu_2)$$
Substituting the pooled covariance estimate $\hat{\Sigma} = \frac{1}{n-2}S_W$:
$$w^* \propto \hat{\Sigma}^{-1}(\mu_1 - \mu_2)$$
This is exactly the direction we derived from the Bayesian perspective! Fisher's geometric approach yields the same classifier as the probabilistic approach under Gaussian assumptions.
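A quick numerical check on synthetic two-class data that the Fisher direction $S_W^{-1}(\mu_1 - \mu_2)$ and the Bayes-derived direction $\hat{\Sigma}^{-1}(\mu_1 - \mu_2)$ are parallel (they differ only by the scalar $n - 2$):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 3
X1 = rng.multivariate_normal([0, 0, 0], np.eye(p), size=50)
X2 = rng.multivariate_normal([3, 1, -1], np.eye(p), size=50)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
D1, D2 = X1 - mu1, X2 - mu2
S_W = D1.T @ D1 + D2.T @ D2
Sigma_hat = S_W / (len(X1) + len(X2) - 2)   # pooled covariance estimate

w_fisher = np.linalg.solve(S_W, mu1 - mu2)
w_bayes = np.linalg.solve(Sigma_hat, mu1 - mu2)

# The two directions are parallel: cosine similarity is 1
cos = w_fisher @ w_bayes / (np.linalg.norm(w_fisher) * np.linalg.norm(w_bayes))
assert np.isclose(cos, 1.0)
```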
Multi-class Fisher discriminant:
For $K$ classes, we seek multiple discriminant directions. The optimization becomes:
$$\max_{W} \frac{\det(W^T S_B W)}{\det(W^T S_W W)}$$
where $W$ is a $p \times d$ matrix (finding $d$ directions).
The solution is given by the eigenvectors of $S_W^{-1}S_B$ corresponding to the largest eigenvalues. Since $S_B$ has rank at most $K-1$, there are at most $K-1$ non-zero eigenvalues, meaning there are at most $K-1$ meaningful discriminant directions.
This is LDA as dimensionality reduction: reducing $p$ features to at most $K-1$ discriminant components while preserving classification accuracy.
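A sketch of the generalized eigenvalue problem on synthetic three-class data, confirming that at most $K-1$ eigenvalues are non-zero:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(5)
p, K, n_k = 6, 3, 40
means = rng.normal(scale=3, size=(K, p))   # made-up class means
X = np.vstack([rng.multivariate_normal(m, np.eye(p), size=n_k) for m in means])
y = np.repeat(np.arange(K), n_k)

xbar = X.mean(axis=0)
S_W = np.zeros((p, p))
S_B = np.zeros((p, p))
for k in range(K):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)
    S_W += (Xk - mu_k).T @ (Xk - mu_k)
    S_B += len(Xk) * np.outer(mu_k - xbar, mu_k - xbar)

# Generalized eigenproblem S_B w = lambda * S_W w (eigh handles the pair)
eigvals, eigvecs = eigh(S_B, S_W)

# At most K - 1 = 2 eigenvalues are (numerically) non-zero
assert np.sum(eigvals > 1e-6) <= K - 1
```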
The geometry of LDA provides powerful intuition for understanding its behavior. Let's visualize what LDA does in feature space.
The whitening transformation:
LDA can be understood as a two-step process: (1) whiten the data with the transformation $x \mapsto \Sigma^{-1/2}x$, so that the shared covariance becomes the identity; (2) classify each point to the nearest class mean in the whitened space, adjusting for priors.
After whitening, the classes become spherical (identity covariance), and the optimal classifier simply assigns to the nearest mean. The covariance structure is embedded in the transformation rather than explicitly in the distance metric.
In the original space, comparing Euclidean distances to class means would be wrong—it ignores the correlation structure. Whitening 'sphereizes' the distribution, making Euclidean distance equivalent to Mahalanobis distance. After whitening, the nearest-mean classifier is optimal under Gaussian assumptions.
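A small sketch verifying that squared Euclidean distance after whitening equals Mahalanobis distance in the original space; the covariance here is an arbitrary SPD matrix and the mean is made up:

```python
import numpy as np

rng = np.random.default_rng(6)
p = 3
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)   # an arbitrary SPD covariance (assumption)
mu = rng.normal(size=p)           # a made-up class mean

# Whitening matrix Sigma^(-1/2) via the eigendecomposition
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T
Sigma_inv = np.linalg.inv(Sigma)

for x in rng.normal(size=(100, p)):
    # Mahalanobis distance in the original space ...
    d_mahal = (x - mu) @ Sigma_inv @ (x - mu)
    # ... equals squared Euclidean distance after whitening
    d_eucl = np.sum((W @ (x - mu)) ** 2)
    assert np.isclose(d_mahal, d_eucl)
```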
Visualizing decision regions:
For a two-class problem, the decision boundary is a single hyperplane: its orientation is set by the normal vector $\Sigma^{-1}(\mu_1 - \mu_2)$, and its position by the class means and priors.

When $\mu_1 \neq \mu_2$, the boundary is perpendicular to the line connecting the means in the whitened space (not necessarily in the original space, unless $\Sigma = I$).
Effect of priors:
Unequal priors shift the decision boundary away from the higher-prior class. Geometrically, the hyperplane moves toward the mean of the lower-prior class by an offset proportional to the log prior ratio, enlarging the decision region of the more common class.
Multi-class regions:
For $K$ classes, the decision regions are convex polyhedra (intersections of half-spaces). Each pair of classes contributes a hyperplane boundary, and each class's region is the intersection of all half-spaces where that class wins pairwise comparisons.
| Property | Description | Consequence |
|---|---|---|
| Linear boundaries | Hyperplanes separate classes | Efficient computation, simple interpretation |
| Convex regions | Each class region is convex | No isolated 'islands' of one class inside another |
| Prior-dependent | Priors shift boundaries | Can control false positive/negative tradeoff |
| Mahalanobis-based | Uses correlation-corrected distance | Handles correlated features appropriately |
| Projection-based | Reduces to $K-1$ dimensions | Effective for high-dimensional data |
In practice, we estimate LDA parameters from training data and plug them into the discriminant functions. The estimation procedure is straightforward:
Step 1: Estimate class priors
$$\hat{\pi}_k = \frac{n_k}{n}$$
Step 2: Estimate class means
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y_i = k}x_i$$
Step 3: Estimate pooled covariance
$$\hat{\Sigma} = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Step 4: Compute discriminant functions
For a new observation $x$:
$$\hat{\delta}_k(x) = \hat{\mu}_k^T\hat{\Sigma}^{-1}x - \frac{1}{2}\hat{\mu}_k^T\hat{\Sigma}^{-1}\hat{\mu}_k + \log \hat{\pi}_k$$
Step 5: Classify
$$\hat{y} = \arg\max_k \hat{\delta}_k(x)$$
```python
import numpy as np
from scipy.linalg import inv, eigh


class LinearDiscriminantAnalysis:
    """
    LDA implementation from first principles.
    Demonstrates the shared covariance assumption.
    """

    def __init__(self):
        self.classes_ = None      # Unique class labels
        self.means_ = None        # Class means
        self.priors_ = None       # Class priors
        self.covariance_ = None   # Pooled covariance
        self.coef_ = None         # Σ^(-1) @ μ_k for each class
        self.intercept_ = None    # Bias terms

    def fit(self, X, y):
        """Fit LDA model to training data."""
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)
        n_classes = len(self.classes_)

        # Step 1: Estimate class priors
        self.priors_ = np.array([np.mean(y == k) for k in self.classes_])

        # Step 2: Estimate class means
        self.means_ = np.array([X[y == k].mean(axis=0) for k in self.classes_])

        # Step 3: Estimate pooled covariance (labels need not be 0..K-1,
        # so index means_ by position, not by label)
        S_W = np.zeros((n_features, n_features))
        for idx, k in enumerate(self.classes_):
            X_k_centered = X[y == k] - self.means_[idx]
            S_W += X_k_centered.T @ X_k_centered
        self.covariance_ = S_W / (n_samples - n_classes)

        # Step 4: Compute discriminant coefficients
        Sigma_inv = inv(self.covariance_)
        self.coef_ = Sigma_inv @ self.means_.T  # p x K matrix

        # Bias terms: -0.5 * μ_k^T Σ^(-1) μ_k + log(π_k)
        self.intercept_ = np.array([
            -0.5 * self.means_[k] @ Sigma_inv @ self.means_[k]
            + np.log(self.priors_[k])
            for k in range(n_classes)
        ])
        return self

    def decision_function(self, X):
        """Discriminant scores δ_k(x) = x^T Σ^(-1) μ_k + intercept_k."""
        return X @ self.coef_ + self.intercept_

    def predict(self, X):
        """Predict class labels."""
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]

    def predict_proba(self, X):
        """Predict posterior probabilities via a softmax over the scores."""
        scores = self.decision_function(X)
        exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        return exp_scores / exp_scores.sum(axis=1, keepdims=True)

    def transform(self, X, n_components=None):
        """Project data onto discriminant directions (Fisher's view)."""
        if n_components is None:
            n_components = min(len(self.classes_) - 1, X.shape[1])

        # Between-class scatter around the overall mean
        overall_mean = X.mean(axis=0)
        S_B = np.zeros((X.shape[1], X.shape[1]))
        for k, n_k in enumerate(self.priors_ * len(X)):
            diff = (self.means_[k] - overall_mean).reshape(-1, 1)
            S_B += n_k * diff @ diff.T

        # Generalized eigenvalue problem S_B w = λ Σ w
        eigenvalues, eigenvectors = eigh(S_B, self.covariance_)

        # Sort by decreasing eigenvalue and keep the leading directions
        idx = np.argsort(eigenvalues)[::-1]
        self.scalings_ = eigenvectors[:, idx[:n_components]]
        return X @ self.scalings_
```

In practice, directly inverting the covariance matrix can be numerically unstable, especially with high-dimensional or collinear features. Production implementations use Cholesky decomposition, SVD, or regularization to handle near-singular covariances. The implementation above is pedagogical; use `scipy.linalg.solve` or sklearn's `LinearDiscriminantAnalysis` for robust implementations.
The shared covariance assumption creates connections between LDA and several other important methods:
LDA and Linear Regression:
For two-class problems, LDA is equivalent to least squares regression with specific coding: if the class targets are coded with two distinct values (classically $-n/n_1$ and $n/n_2$, but any pair works), the least squares coefficient vector is proportional to the LDA direction $\hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1)$.
This connection explains why LDA is sometimes called 'linear regression in disguise' for binary classification.
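A numerical check of this equivalence on synthetic data, using a simple $\pm 1$ coding (the direction, not the scale, is what matches):

```python
import numpy as np

rng = np.random.default_rng(7)
p = 3
X1 = rng.multivariate_normal([0, 0, 0], np.eye(p), size=60)
X2 = rng.multivariate_normal([2, 1, 0], np.eye(p), size=40)
X = np.vstack([X1, X2])
t = np.r_[np.full(len(X1), -1.0), np.full(len(X2), 1.0)]  # +/-1 coding

# Least squares with an intercept column; keep only the slope vector
A = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.lstsq(A, t, rcond=None)[0][1:]

# LDA direction from the pooled covariance
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
D1, D2 = X1 - mu1, X2 - mu2
Sigma_hat = (D1.T @ D1 + D2.T @ D2) / (len(X) - 2)
w_lda = np.linalg.solve(Sigma_hat, mu2 - mu1)

# The coefficient vectors are parallel: |cosine similarity| is 1
cos = beta @ w_lda / (np.linalg.norm(beta) * np.linalg.norm(w_lda))
assert np.isclose(abs(cos), 1.0)
```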
LDA and PCA:
Both are linear projection methods, but with different objectives: PCA finds directions of maximum total variance, while LDA finds directions that maximize between-class variance relative to within-class variance.
PCA is unsupervised (ignores labels), while LDA is supervised (uses labels to find discriminative directions). In some cases, the PCA directions are also discriminative, but this isn't guaranteed.
LDA and Gaussian Naive Bayes:
Both assume Gaussian class-conditionals, but Gaussian Naive Bayes restricts each class covariance to be diagonal (features conditionally independent), while LDA allows a full covariance matrix, shared across classes.
Gaussian NB is more restrictive (zero correlations) but requires fewer parameters. LDA models correlations but assumes they're the same across classes.
LDA and Logistic Regression:
Both produce linear boundaries for binary classification: logistic regression models the posterior $P(Y \mid X)$ directly (discriminative), while LDA models the class-conditional densities and priors (generative) and obtains the posterior via Bayes' theorem.
Asymptotically (infinite data), both converge to the same boundary when LDA assumptions hold. With finite data, LDA can be more efficient (uses more assumptions), while logistic regression is more robust to misspecification.
What's next:
The shared covariance assumption is restrictive. When classes truly have different covariances, LDA boundaries are suboptimal. The next page explores Quadratic Discriminant Analysis (QDA), which relaxes this assumption, allowing class-specific covariances and resulting in quadratic (curved) decision boundaries.
You now deeply understand the shared covariance assumption: how it's estimated, why it leads to linear boundaries, Fisher's discriminant interpretation, and the geometric intuition behind LDA projections. Next, we'll explore QDA with class-specific covariances.