Linear Discriminant Analysis (LDA) stands as one of the most elegant and foundational algorithms in statistical pattern recognition. Developed by Ronald Fisher in 1936 for classifying iris species, LDA exemplifies how strong modeling assumptions can lead to simple, interpretable, and computationally efficient classifiers.
Unlike discriminative models such as logistic regression that directly model the decision boundary, LDA takes a generative approach: it models the underlying probability distribution of each class and uses Bayes' theorem to make predictions. This fundamental distinction carries profound implications for how LDA learns, generalizes, and behaves under different data conditions.
However, the power of LDA comes at a price—a set of strong assumptions that must hold (at least approximately) for the method to work well. Understanding these assumptions deeply is not merely academic; it is essential for knowing when LDA is appropriate, how to diagnose problems, and when to consider alternatives like QDA or more flexible methods.
By the end of this page, you will understand the three core assumptions of LDA: class-conditional Gaussian distributions, shared covariance matrices, and class prior probabilities. You will learn why these assumptions lead to linear decision boundaries, how to diagnose assumption violations, and when LDA remains robust despite violations.
Before diving into LDA's specific assumptions, we must understand the generative classification paradigm. In generative classification, we model two quantities for each class $k$: the class-conditional density $P(X = x | Y = k)$ and the prior probability $P(Y = k)$.
Once these components are estimated, Bayes' theorem gives us the posterior probability for classification:
$$P(Y = k | X = x) = \frac{P(X = x | Y = k) \cdot P(Y = k)}{\sum_{j=1}^{K} P(X = x | Y = j) \cdot P(Y = j)}$$
The classification rule assigns each observation to the class with the highest posterior:
$$\hat{y} = \arg\max_k P(Y = k | X = x)$$
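As a concrete illustration of this rule, here is a minimal sketch (NumPy/SciPy, with made-up two-class Gaussian parameters chosen only for illustration) that turns class-conditional densities and priors into posterior probabilities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class, two-feature example; means, covariances, and priors
# are invented for illustration only.
params = {
    0: {"mean": np.array([0.0, 0.0]), "cov": np.eye(2), "prior": 0.7},
    1: {"mean": np.array([2.0, 1.0]), "cov": np.eye(2), "prior": 0.3},
}

def posterior(x, params):
    """Apply Bayes' theorem: P(Y=k | x) is proportional to P(x | Y=k) * P(Y=k)."""
    joint = np.array([
        multivariate_normal.pdf(x, mean=p["mean"], cov=p["cov"]) * p["prior"]
        for p in params.values()
    ])
    return joint / joint.sum()   # normalize by the evidence P(X = x)

x_new = np.array([1.0, 0.5])
post = posterior(x_new, params)
print(post, "-> predicted class:", int(np.argmax(post)))
```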
This framework is powerful because once we have good estimates of the class-conditional densities, we can compute posterior probabilities that quantify classification uncertainty, adjust to new class priors without refitting the densities, and flag observations with low density under every class as potential outliers.
The generative approach requires modeling more structure than strictly necessary for classification. This can be beneficial (more efficient use of data when assumptions hold) or detrimental (bias when assumptions are violated). Andrew Ng and Michael Jordan's famous 2001 paper showed that generative models can outperform discriminative models with limited data, but discriminative models dominate asymptotically.
The modeling challenge:
The core difficulty is estimating the class-conditional densities $P(X | Y = k)$. In principle, we could use any density estimation technique. However, there are essentially three routes: fully nonparametric estimators such as kernel density estimation, which suffer from the curse of dimensionality; flexible parametric models such as Gaussian mixtures or normalizing flows, which require large samples and careful tuning; or a tightly constrained parametric form.
LDA takes the third approach: it assumes a very specific, constrained parametric form for the class-conditional densities. This constraint is simultaneously LDA's greatest strength (few parameters, easy estimation) and its greatest limitation (potential model misspecification).
The first and most fundamental assumption of LDA is that the features within each class follow a multivariate Gaussian (normal) distribution. Mathematically, for each class $k$:
$$X | Y = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$
where $\mu_k \in \mathbb{R}^p$ is the mean vector of class $k$, $\Sigma_k$ is its $p \times p$ covariance matrix, and $p$ is the number of features.
The multivariate Gaussian probability density function is:
$$P(X = x | Y = k) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right)$$
This density defines elliptical contours of equal probability in feature space, centered at $\mu_k$ and shaped by $\Sigma_k$.
The term $(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)$ is the squared Mahalanobis distance from $x$ to $\mu_k$, accounting for correlations. It measures 'statistical distance' in a way that Euclidean distance cannot—directions of high variance are compressed, creating an appropriate measure of typicality for the distribution.
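The sketch below (using an arbitrary, strongly correlated covariance matrix chosen for illustration) contrasts Euclidean and Mahalanobis distance from the mean:

```python
import numpy as np

# Illustrative covariance with strong correlation between two features.
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu, Sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ Sigma_inv @ d)

# Two points at the same Euclidean distance from the mean...
a = np.array([2.0, 0.95])   # lies along the high-variance direction
b = np.array([-0.95, 2.0])  # lies across it
for pt in (a, b):
    print(np.linalg.norm(pt - mu), mahalanobis_sq(pt, mu, Sigma_inv))
# ...have very different Mahalanobis distances: `a` is far more "typical"
# of this distribution than `b`.
```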
Why Gaussian?
The Gaussian distribution is the most commonly assumed form for several compelling reasons:
Theoretical justification (Central Limit Theorem): When features are aggregates of many small, independent effects, the CLT suggests Gaussian behavior. Heights, weights, test scores, and many natural measurements exhibit approximately Gaussian distributions.
Maximum entropy property: Among all distributions with a given mean and variance, the Gaussian has maximum entropy (disorder). If we only know the first and second moments, the Gaussian is the 'least committed' distribution we can assume.
Analytical tractability: Gaussians are closed under linear operations, marginalization, and conditioning. The math works out beautifully—log-likelihoods become quadratic, posteriors remain Gaussian, and decision boundaries take simple forms.
Parameter estimation: With $n_k$ samples from class $k$, we can efficiently estimate:
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y_i = k} x_i \quad \text{(sample mean)}$$
$$\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T \quad \text{(sample covariance)}$$
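In code, the per-class estimates are essentially one line each with NumPy; the sketch below assumes `X` is an $n \times p$ feature array and `y` a label vector (both simulated here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # simulated n x p feature matrix
y = rng.integers(0, 2, size=200)       # simulated class labels

means, covs = {}, {}
for k in np.unique(y):
    Xk = X[y == k]
    means[k] = Xk.mean(axis=0)              # sample mean, mu_hat_k
    covs[k] = np.cov(Xk, rowvar=False)      # sample covariance (divides by n_k - 1)

print(means[0], covs[0].shape)
```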
| Scenario | Gaussian Behavior? | Recommendation |
|---|---|---|
| Continuous measurements (height, weight) | Often approximately holds | LDA appropriate |
| Heavy-tailed data (financial returns) | Usually violated—extreme values | Consider robust methods |
| Bounded data (percentages, probabilities) | Violated at boundaries | Transform first or use different model |
| Multimodal classes (mixed subpopulations) | Fundamentally violated | Use mixture models or nonparametric |
| Categorical or count features | Inappropriate assumption | Use Naive Bayes with appropriate distributions |
| High-dimensional sparse data (text) | Often problematic | Consider discriminative methods |
Diagnosing Gaussianity:
Before applying LDA, it's prudent to assess whether the Gaussian assumption is reasonable:
Univariate normality tests: Apply Shapiro-Wilk or Anderson-Darling tests to each feature within each class. However, note that these tests are sensitive to sample size and cannot assess multivariate normality.
Q-Q plots: For each feature and class, compare empirical quantiles against Gaussian quantiles. Systematic departures (S-curves, heavy tails) indicate violations.
Multivariate normality: Use tests like Mardia's test (measures multivariate skewness and kurtosis) or Henze-Zirkler test. Visual inspection via chi-squared Q-Q plots of Mahalanobis distances can also help.
Density visualization: For low-dimensional data, kernel density estimates can reveal multimodality or asymmetry that contradicts the unimodal, symmetric Gaussian.
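A rough diagnostic routine along these lines, sketched with SciPy and assuming `X` is an $n \times p$ array and `y` holds the class labels, might look like this:

```python
import numpy as np
from scipy import stats

def gaussianity_checks(X, y, alpha=0.05):
    """Rough diagnostics for the class-conditional Gaussian assumption."""
    for k in np.unique(y):
        Xk = X[y == k]
        nk, p = Xk.shape

        # 1. Univariate Shapiro-Wilk per feature (sensitive to sample size).
        pvals = [stats.shapiro(Xk[:, j]).pvalue for j in range(p)]
        flagged = sum(pv < alpha for pv in pvals)
        print(f"class {k}: {flagged}/{p} features reject univariate normality")

        # 2. Squared Mahalanobis distances should look roughly chi-squared(p).
        mu = Xk.mean(axis=0)
        Sinv = np.linalg.pinv(np.cov(Xk, rowvar=False))
        d2 = np.einsum("ij,jk,ik->i", Xk - mu, Sinv, Xk - mu)
        # Compare an empirical quantile against the chi-squared reference
        # as a crude stand-in for a full chi-squared Q-Q plot.
        print(f"  95th pct of d^2: {np.quantile(d2, 0.95):.2f} "
              f"(chi2 reference: {stats.chi2.ppf(0.95, df=p):.2f})")
```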
Robustness: Despite being a parametric assumption, LDA often exhibits surprising robustness to mild Gaussian violations—especially if classes are well-separated. The shared covariance assumption (discussed next) matters more in practice.
The second core assumption of LDA—and the one that distinguishes it from QDA—is that all classes share the same covariance matrix:
$$\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K = \Sigma$$
This is called the homoscedasticity assumption (from Greek 'homos' meaning same and 'skedasis' meaning dispersion). Each class-conditional Gaussian has the same shape and orientation; only the means differ.
Geometric interpretation:
The equal covariance assumption means that the elliptical contours of constant density have the same shape, size, and orientation across all classes—they are simply translated to different locations (the class means). Imagine identical ellipsoids placed at different centers in feature space.
The shared covariance assumption is what makes decision boundaries linear. When covariances differ, boundaries become quadratic (curves, ellipses, hyperbolas). The linearity of LDA decision boundaries—its name and key property—flows directly from this assumption.
Mathematical consequence: From QDA to LDA
Let's see how equal covariances lead to linear boundaries. For a two-class problem, the decision boundary is where the posterior probabilities are equal:
$$P(Y = 1 | X) = P(Y = 2 | X)$$
Using Bayes' theorem and taking logarithms:
$$\log P(X | Y = 1) + \log P(Y = 1) = \log P(X | Y = 2) + \log P(Y = 2)$$
Substituting the Gaussian densities:
$$-\frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1) - \frac{1}{2}\log|\Sigma_1| + \log\pi_1$$ $$= -\frac{1}{2}(x - \mu_2)^T\Sigma_2^{-1}(x - \mu_2) - \frac{1}{2}\log|\Sigma_2| + \log\pi_2$$
With different covariances ($\Sigma_1 \neq \Sigma_2$), expanding the quadratic forms gives terms like $x^T\Sigma_1^{-1}x$ and $x^T\Sigma_2^{-1}x$ that don't cancel—leaving a quadratic function of $x$. This is QDA.
With equal covariances ($\Sigma_1 = \Sigma_2 = \Sigma$), these quadratic terms perfectly cancel:
$$-\frac{1}{2}x^T\Sigma^{-1}x + \mu_1^T\Sigma^{-1}x - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \log\pi_1$$ $$= -\frac{1}{2}x^T\Sigma^{-1}x + \mu_2^T\Sigma^{-1}x - \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \log\pi_2$$
The $x^T\Sigma^{-1}x$ terms cancel, leaving:
$$(\mu_1 - \mu_2)^T\Sigma^{-1}x = \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 - \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \log\frac{\pi_2}{\pi_1}$$
This is a linear function of $x$—hence Linear Discriminant Analysis.
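The boundary above translates directly into code. The sketch below (with made-up parameters) computes the weight vector and threshold from the equation just derived and applies the classification rule:

```python
import numpy as np

def lda_two_class_rule(mu1, mu2, Sigma, pi1, pi2):
    """Return (w, c) such that we predict class 1 when w @ x > c."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    c = 0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2) + np.log(pi2 / pi1)
    return w, c

# Made-up parameters for illustration.
mu1, mu2 = np.array([1.0, 2.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, c = lda_two_class_rule(mu1, mu2, Sigma, pi1=0.5, pi2=0.5)

x = np.array([0.2, 1.1])
print("class 1" if w @ x > c else "class 2")
```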
Estimating the pooled covariance matrix:
The standard pooled estimate of the shared covariance matrix (a bias-corrected version of the maximum likelihood estimate) combines data from all classes:
$$\hat{\Sigma} = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Equivalently:
$$\hat{\Sigma} = \frac{\sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k}{\sum_{k=1}^{K}(n_k - 1)}$$
This is a weighted average of the class-specific sample covariance matrices, where weights are $(n_k - 1)$ (degrees of freedom for each class). Larger classes contribute more to the estimate.
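A small sketch of the pooled estimator, again assuming `X` and `y` are a NumPy feature matrix and label vector:

```python
import numpy as np

def pooled_covariance(X, y):
    """Weighted average of class covariances with (n_k - 1) weights."""
    p = X.shape[1]
    num = np.zeros((p, p))
    dof = 0
    for k in np.unique(y):
        Xk = X[y == k]
        num += (len(Xk) - 1) * np.cov(Xk, rowvar=False)
        dof += len(Xk) - 1
    return num / dof   # equivalently, divide by n - K
```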
Testing the assumption:
Box's M-test formally tests the null hypothesis $H_0: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$. However, in practice the test is highly sensitive to departures from normality, and with large samples it rejects the null even for differences too small to matter, so its verdict should be interpreted with caution.
A practical approach is to compare the eigenvalues of class-specific covariance matrices. If the ratios of largest to smallest eigenvalues differ substantially across classes, the assumption is likely violated.
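For example, a quick heuristic check (a sketch, assuming the same `X`, `y` layout as above) compares the eigenvalue spread of each class covariance:

```python
import numpy as np

def covariance_spread(X, y):
    """Print the largest-to-smallest eigenvalue ratio of each class covariance."""
    for k in np.unique(y):
        Xk = X[y == k]
        eigvals = np.linalg.eigvalsh(np.cov(Xk, rowvar=False))
        print(f"class {k}: eigenvalue ratio = {eigvals.max() / eigvals.min():.1f}")
```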
The third component of LDA is the specification of class prior probabilities $\pi_k = P(Y = k)$. While not an 'assumption' in the same sense as Gaussianity, how we estimate or specify priors significantly affects model behavior, especially for imbalanced classification problems.
Estimation approaches:
Sample proportions (most common): $$\hat{\pi}_k = \frac{n_k}{n}$$ where $n_k$ is the number of training samples in class $k$.
Equal priors: $$\pi_k = \frac{1}{K} \quad \forall k$$ This ignores training set class frequencies, useful when the test distribution differs.
Domain-specified priors: Set based on domain knowledge of population frequencies (e.g., disease prevalence).
With imbalanced classes (e.g., 95% negative, 5% positive), sample proportions will push predictions toward the majority class. In fraud detection or disease screening, this can mean missing nearly all positive cases! Consider whether training proportions match deployment proportions, and adjust priors accordingly.
Effect of priors on decision boundaries:
Priors shift the decision boundary. For a two-class problem, the log-odds ratio includes the term $\log(\pi_1/\pi_2)$. This shifts the boundary toward the class with lower prior—requiring more 'evidence' (in the form of likelihood) to classify into the less common class.
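The one-dimensional sketch below (made-up means and a shared unit variance) makes the shift concrete: only the $\log(\pi_2/\pi_1)$ term moves the boundary.

```python
import numpy as np

# One-dimensional illustration: equal variances, means at +1 and -1,
# so with equal priors the boundary sits at x = 0.
mu1, mu2, sigma2 = 1.0, -1.0, 1.0

def boundary(pi1):
    """Solve w*x = c for the 1-D case; only the log-prior term moves it."""
    pi2 = 1.0 - pi1
    w = (mu1 - mu2) / sigma2
    c = 0.5 * (mu1**2 - mu2**2) / sigma2 + np.log(pi2 / pi1)
    return c / w

for pi1 in (0.5, 0.7, 0.95):
    print(f"pi1 = {pi1:.2f} -> boundary at x = {boundary(pi1):+.3f}")
# Raising pi1 moves the boundary toward the lower-prior class (class 2),
# enlarging the region assigned to class 1.
```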
Implicit assumption:
Using sample proportions as priors implicitly assumes that the training data were sampled from the same population the model will serve, that class frequencies will not shift between training and deployment, and that misclassification costs are roughly symmetric.
When any of these assumptions fail, adjusting priors becomes essential. In medical diagnosis, adjusting priors to reflect disease prevalence (rather than case-control study proportions) is critical for calibrated predictions.
| Prior Strategy | When to Use | Effect on Predictions |
|---|---|---|
| Sample proportions | Training reflects population; symmetric costs | Boundary at natural ratio |
| Equal priors | Training imbalanced but want unbiased classification | Boundary purely based on likelihood |
| Population priors | Training sample is biased (case-control studies) | Calibrated posterior probabilities |
| Cost-adjusted priors | Asymmetric misclassification costs | Boundary shifts toward costly-to-miss class |
Beyond the three core assumptions, LDA makes additional implicit assumptions that practitioners should be aware of. The most consequential in practice is that there is enough data to estimate the covariance matrix reliably:
In high-dimensional settings ($p$ large relative to $n$), estimating $\Sigma$ becomes increasingly difficult. With $p > n$, the sample covariance is singular. Even when $n > p$ but not by much, the inverse is numerically unstable. Regularized LDA (discussed later in this module) addresses this by shrinking the covariance estimate.
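In scikit-learn, shrinking the covariance estimate toward a scaled identity is available directly. The sketch below (on simulated data, labels generated only for illustration) compares plain and shrunken LDA when $p$ is close to $n$:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 80, 60                      # few samples relative to dimensionality
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

plain = LinearDiscriminantAnalysis()                                  # unregularized
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")  # Ledoit-Wolf shrinkage

for name, clf in [("plain", plain), ("shrinkage", shrunk)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:10s} CV accuracy: {acc:.3f}")
```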
Feature preprocessing implications:
Certain preprocessing steps affect how well LDA assumptions hold:
Standardization: Centering and scaling features doesn't change Gaussianity but can improve numerical stability.
Log/power transformations: Can induce approximate normality for right-skewed features (common in count data, durations, monetary values).
Box-Cox transformations: Data-driven power transformations toward normality; keep in mind that LDA's assumption concerns the class-conditional distributions, so normality should be checked within each class after transforming (see the sketch after this list).
Principal component analysis: Decorrelates features, but doesn't guarantee normality of the components.
Outlier removal/winsorization: Can substantially improve Gaussian fit but may lose information.
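As an example of the transformation route, the sketch below applies scikit-learn's `PowerTransformer` to simulated right-skewed data; in a real pipeline the transformer would be fit on training data only, and the class-conditional fit re-checked afterwards:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
# Right-skewed feature (e.g., a duration or monetary amount), simulated here.
x = rng.lognormal(mean=0.0, sigma=0.8, size=(500, 1))

pt = PowerTransformer(method="box-cox")   # Box-Cox requires strictly positive values
x_t = pt.fit_transform(x)

print("skewness before:", float(((x - x.mean()) ** 3).mean() / x.std() ** 3))
print("skewness after: ", float(((x_t - x_t.mean()) ** 3).mean() / x_t.std() ** 3))
```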
Understanding the consequences of assumption violations helps you decide when LDA is acceptable versus when alternatives are necessary.
Violation: Non-Gaussian distributions
Mild violations (slightly heavy tails, minor asymmetry): decision boundaries usually remain close to optimal, though posterior probabilities may be somewhat miscalibrated; transformations can help but are often unnecessary.
Severe violations (multimodality, very heavy tails, discrete data): the estimated boundaries can be badly misplaced and the posterior probabilities unreliable; a mixture-based, nonparametric, or discriminative method is usually the better choice.
Empirically, LDA is often more robust to non-Gaussianity than theory suggests. This is partly because LDA doesn't need the full distributional form to be correct—it needs the linear decision boundary to be approximately correct. Many non-Gaussian distributions still have reasonably linear Bayes boundaries.
Violation: Unequal covariances
This is typically more consequential than non-Gaussianity: when the covariances differ, the true Bayes boundary is quadratic, and no linear boundary can reproduce it.
When class covariances are substantially different, LDA systematically misclassifies observations in the regions where its linear boundary diverges from the quadratic one, and the pooled covariance estimate describes neither class well.
Diagnosis: compare the class-specific covariance matrices directly (eigenvalue spreads, Box's M-test as discussed above) and check whether QDA clearly outperforms LDA in cross-validation.
Solutions: use QDA when each class has enough samples to support its own covariance estimate, or regularized discriminant analysis, which interpolates between the pooled and class-specific covariances.
| Violation | Severity | Impact | Mitigation |
|---|---|---|---|
| Mild non-Gaussianity | Low | Slightly suboptimal boundaries | Often acceptable; transformations if needed |
| Severe non-Gaussianity | High | Wrong boundaries, miscalibrated probabilities | Use different method (SVM, RF, etc.) |
| Unequal covariances | Medium-High | Linear boundary where quadratic is needed | Use QDA or regularized DA |
| Multimodal classes | High | Single Gaussian can't represent structure | Mixture models or nonparametric |
| Outliers | Medium | Distorted mean/covariance estimates | Robust estimation or outlier removal |
| High dimensionality ($p \approx n$) | High | Singular or unstable covariance | Regularized LDA or dimensionality reduction |
Understanding how LDA relates to other classification methods illuminates its unique position:
LDA vs Logistic Regression:
Both produce linear decision boundaries, but from different directions: logistic regression models the posterior $P(Y = k | X)$ directly (discriminative), while LDA models $P(X | Y = k)$ and $P(Y = k)$ and derives the posterior via Bayes' theorem (generative).
Key differences: LDA exploits its distributional assumptions to use the data more efficiently when those assumptions hold; logistic regression makes no assumption about the distribution of $X$ and is therefore more robust to outliers and non-Gaussian features; and with perfectly separable classes, the logistic regression likelihood has no finite maximizer, whereas LDA remains stable.
LDA tends to outperform logistic regression when: (1) Sample size is small relative to dimensionality, (2) Classes are well-separated, (3) The Gaussian assumption approximately holds. Logistic regression wins when assumptions are violated or with very large datasets.
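A minimal comparison sketch using scikit-learn on simulated data is shown below; which method wins depends on the data and the random seed, so treat it as a template rather than a verdict:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small simulated dataset; with few samples and roughly Gaussian classes,
# LDA's extra assumptions can pay off.
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:7s} 5-fold CV accuracy: {acc:.3f}")
```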
LDA vs QDA:
Both are Gaussian generative classifiers: QDA estimates a separate covariance matrix for each class, giving quadratic boundaries at the cost of roughly $K \cdot p(p+1)/2$ covariance parameters, while LDA pools a single covariance matrix, giving linear boundaries with far fewer parameters and lower variance.
LDA is essentially a constrained QDA where the constraint is $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$.
LDA vs Naive Bayes (Gaussian):
Both make Gaussian assumptions: Gaussian Naive Bayes assumes the features are conditionally independent within each class (a diagonal covariance matrix), while LDA models a full covariance matrix shared across classes.
Naive Bayes makes a stronger assumption (zero correlations) but requires even fewer parameters: $O(Kp)$ vs LDA's $O(p^2)$.
LDA as dimensionality reduction:
LDA can also be viewed as a technique that finds the projection maximizing class separation. In this view, it's an alternative to PCA for supervised dimensionality reduction—projecting onto the directions that best discriminate classes rather than maximize variance.
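In scikit-learn this view corresponds to calling `transform` after fitting. The sketch below projects the Fisher iris data onto its two discriminant directions (for $K = 3$ classes, at most $K - 1 = 2$ components exist):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Fit LDA and project onto the two discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit(X, y).transform(X)

print(X_proj.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)   # share of between-class variance per direction
```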
What's next:
Now that we understand the assumptions that justify linear decision boundaries, the next page explores the shared covariance structure in greater detail—how it's estimated, why it enables dimensionality reduction, and how to interpret the resulting discriminant functions geometrically.
You now have a deep understanding of the three core assumptions underlying LDA: Gaussian class-conditional distributions, equal covariance matrices, and class prior probabilities. You understand why these assumptions lead to linear decision boundaries and what happens when they're violated. Next, we'll explore the shared covariance structure in depth.