Linear Discriminant Analysis (LDA) stands as one of the most elegant and foundational algorithms in statistical pattern recognition. Developed by Ronald Fisher in 1936 for classifying iris species, LDA exemplifies how strong modeling assumptions can lead to simple, interpretable, and computationally efficient classifiers.
Unlike discriminative models such as logistic regression that directly model the decision boundary, LDA takes a generative approach: it models the underlying probability distribution of each class and uses Bayes' theorem to make predictions. This fundamental distinction carries profound implications for how LDA learns, generalizes, and behaves under different data conditions.
However, the power of LDA comes at a price—a set of strong assumptions that must hold (at least approximately) for the method to work well. Understanding these assumptions deeply is not merely academic; it is essential for knowing when LDA is appropriate, how to diagnose problems, and when to consider alternatives like QDA or more flexible methods.
By the end of this page, you will understand the three core assumptions of LDA: class-conditional Gaussian distributions, shared covariance matrices, and class prior probabilities. You will learn why these assumptions lead to linear decision boundaries, how to diagnose assumption violations, and when LDA remains robust despite violations.
Before diving into LDA's specific assumptions, we must understand the generative classification paradigm. In generative classification, we model two quantities for each class $k$: the class-conditional density $P(X = x | Y = k)$ and the prior probability $P(Y = k)$.
Once these components are estimated, Bayes' theorem gives us the posterior probability for classification:
$$P(Y = k | X = x) = \frac{P(X = x | Y = k) \cdot P(Y = k)}{\sum_{j=1}^{K} P(X = x | Y = j) \cdot P(Y = j)}$$
The classification rule assigns each observation to the class with the highest posterior:
$$\hat{y} = \arg\max_k P(Y = k | X = x)$$
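As a concrete illustration of this rule, here is a minimal sketch (NumPy/SciPy, with made-up two-class Gaussian parameters chosen only for illustration) that turns class-conditional densities and priors into posterior probabilities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class, two-feature example; means, covariances, and priors
# are invented for illustration only.
params = {
    0: {"mean": np.array([0.0, 0.0]), "cov": np.eye(2), "prior": 0.7},
    1: {"mean": np.array([2.0, 1.0]), "cov": np.eye(2), "prior": 0.3},
}

def posterior(x, params):
    """Apply Bayes' theorem: P(Y=k | x) is proportional to P(x | Y=k) * P(Y=k)."""
    joint = np.array([
        multivariate_normal.pdf(x, mean=p["mean"], cov=p["cov"]) * p["prior"]
        for p in params.values()
    ])
    return joint / joint.sum()   # normalize by the evidence P(X = x)

x_new = np.array([1.0, 0.5])
post = posterior(x_new, params)
print(post, "-> predicted class:", int(np.argmax(post)))
```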
This framework is powerful because once we have good estimates of the class-conditional densities, we can compute posterior probabilities that quantify classification uncertainty, adjust to new class priors without refitting the densities, and flag observations with low density under every class as potential outliers.
The generative approach requires modeling more structure than strictly necessary for classification. This can be beneficial (more efficient use of data when assumptions hold) or detrimental (bias when assumptions are violated). Andrew Ng and Michael Jordan's famous 2001 paper showed that generative models can outperform discriminative models with limited data, but discriminative models dominate asymptotically.
The modeling challenge:
The core difficulty is estimating the class-conditional densities $P(X | Y = k)$. In principle, we could use any density estimation technique. However, there are essentially three routes: fully nonparametric estimators such as kernel density estimation, which suffer from the curse of dimensionality; flexible parametric models such as Gaussian mixtures or normalizing flows, which require large samples and careful tuning; or a tightly constrained parametric form.
LDA takes the third approach: it assumes a very specific, constrained parametric form for the class-conditional densities. This constraint is simultaneously LDA's greatest strength (few parameters, easy estimation) and its greatest limitation (potential model misspecification).
The first and most fundamental assumption of LDA is that the features within each class follow a multivariate Gaussian (normal) distribution. Mathematically, for each class $k$:
$$X | Y = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$
where $\mu_k \in \mathbb{R}^p$ is the mean vector of class $k$, $\Sigma_k$ is its $p \times p$ covariance matrix, and $p$ is the number of features.
The multivariate Gaussian probability density function is:
$$P(X = x | Y = k) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right)$$
This density defines elliptical contours of equal probability in feature space, centered at $\mu_k$ and shaped by $\Sigma_k$.
The term $(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)$ is the squared Mahalanobis distance from $x$ to $\mu_k$, accounting for correlations. It measures 'statistical distance' in a way that Euclidean distance cannot—directions of high variance are compressed, creating an appropriate measure of typicality for the distribution.
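The sketch below (using an arbitrary, strongly correlated covariance matrix chosen for illustration) contrasts Euclidean and Mahalanobis distance from the mean:

```python
import numpy as np

# Illustrative covariance with strong correlation between two features.
mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 1.9],
                  [1.9, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis_sq(x, mu, Sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ Sigma_inv @ d)

# Two points at the same Euclidean distance from the mean...
a = np.array([2.0, 0.95])   # lies along the high-variance direction
b = np.array([-0.95, 2.0])  # lies across it
for pt in (a, b):
    print(np.linalg.norm(pt - mu), mahalanobis_sq(pt, mu, Sigma_inv))
# ...have very different Mahalanobis distances: `a` is far more "typical"
# of this distribution than `b`.
```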
Why Gaussian?
The Gaussian distribution is the most commonly assumed form for several compelling reasons:
Theoretical justification (Central Limit Theorem): When features are aggregates of many small, independent effects, the CLT suggests Gaussian behavior. Heights, weights, test scores, and many natural measurements exhibit approximately Gaussian distributions.
Maximum entropy property: Among all distributions with a given mean and variance, the Gaussian has maximum entropy (disorder). If we only know the first and second moments, the Gaussian is the 'least committed' distribution we can assume.
Analytical tractability: Gaussians are closed under linear operations, marginalization, and conditioning. The math works out beautifully—log-likelihoods become quadratic, posteriors remain Gaussian, and decision boundaries take simple forms.
Parameter estimation: With $n_k$ samples from class $k$, we can efficiently estimate:
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y_i = k} x_i \quad \text{(sample mean)}$$
$$\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{i: y_i = k} (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T \quad \text{(sample covariance)}$$
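In code, the per-class estimates are essentially one line each with NumPy; the sketch below assumes `X` is an $n \times p$ feature array and `y` a label vector (both simulated here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # simulated n x p feature matrix
y = rng.integers(0, 2, size=200)       # simulated class labels

means, covs = {}, {}
for k in np.unique(y):
    Xk = X[y == k]
    means[k] = Xk.mean(axis=0)              # sample mean, mu_hat_k
    covs[k] = np.cov(Xk, rowvar=False)      # sample covariance (divides by n_k - 1)

print(means[0], covs[0].shape)
```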
| Scenario | Gaussian Behavior? | Recommendation |
|---|---|---|
| Continuous measurements (height, weight) | Often approximately holds | LDA appropriate |
| Heavy-tailed data (financial returns) | Usually violated—extreme values | Consider robust methods |
| Bounded data (percentages, probabilities) | Violated at boundaries | Transform first or use different model |
| Multimodal classes (mixed subpopulations) | Fundamentally violated | Use mixture models or nonparametric |
| Categorical or count features | Inappropriate assumption | Use Naive Bayes with appropriate distributions |
| High-dimensional sparse data (text) | Often problematic | Consider discriminative methods |
Diagnosing Gaussianity:
Before applying LDA, it's prudent to assess whether the Gaussian assumption is reasonable:
Univariate normality tests: Apply Shapiro-Wilk or Anderson-Darling tests to each feature within each class. However, note that these tests are sensitive to sample size and cannot assess multivariate normality.
Q-Q plots: For each feature and class, compare empirical quantiles against Gaussian quantiles. Systematic departures (S-curves, heavy tails) indicate violations.
Multivariate normality: Use tests like Mardia's test (measures multivariate skewness and kurtosis) or Henze-Zirkler test. Visual inspection via chi-squared Q-Q plots of Mahalanobis distances can also help.
Density visualization: For low-dimensional data, kernel density estimates can reveal multimodality or asymmetry that contradicts the unimodal, symmetric Gaussian.
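A rough diagnostic routine along these lines, sketched with SciPy and assuming `X` is an $n \times p$ array and `y` holds the class labels, might look like this:

```python
import numpy as np
from scipy import stats

def gaussianity_checks(X, y, alpha=0.05):
    """Rough diagnostics for the class-conditional Gaussian assumption."""
    for k in np.unique(y):
        Xk = X[y == k]
        nk, p = Xk.shape

        # 1. Univariate Shapiro-Wilk per feature (sensitive to sample size).
        pvals = [stats.shapiro(Xk[:, j]).pvalue for j in range(p)]
        flagged = sum(pv < alpha for pv in pvals)
        print(f"class {k}: {flagged}/{p} features reject univariate normality")

        # 2. Squared Mahalanobis distances should look roughly chi-squared(p).
        mu = Xk.mean(axis=0)
        Sinv = np.linalg.pinv(np.cov(Xk, rowvar=False))
        d2 = np.einsum("ij,jk,ik->i", Xk - mu, Sinv, Xk - mu)
        # Compare an empirical quantile against the chi-squared reference
        # as a crude stand-in for a full chi-squared Q-Q plot.
        print(f"  95th pct of d^2: {np.quantile(d2, 0.95):.2f} "
              f"(chi2 reference: {stats.chi2.ppf(0.95, df=p):.2f})")
```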
Robustness: Despite being a parametric assumption, LDA often exhibits surprising robustness to mild Gaussian violations—especially if classes are well-separated. The shared covariance assumption (discussed next) matters more in practice.
The second core assumption of LDA—and the one that distinguishes it from QDA—is that all classes share the same covariance matrix:
$$\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K = \Sigma$$
This is called the homoscedasticity assumption (from Greek 'homos' meaning same and 'skedasis' meaning dispersion). Each class-conditional Gaussian has the same shape and orientation; only the means differ.
Geometric interpretation:
The equal covariance assumption means that the elliptical contours of constant density have the same shape, size, and orientation across all classes—they are simply translated to different locations (the class means). Imagine identical ellipsoids placed at different centers in feature space.
The shared covariance assumption is what makes decision boundaries linear. When covariances differ, boundaries become quadratic (curves, ellipses, hyperbolas). The linearity of LDA decision boundaries—its name and key property—flows directly from this assumption.
Mathematical consequence: From QDA to LDA
Let's see how equal covariances lead to linear boundaries. For a two-class problem, the decision boundary is where the posterior probabilities are equal:
$$P(Y = 1 | X) = P(Y = 2 | X)$$
Using Bayes' theorem and taking logarithms:
$$\log P(X | Y = 1) + \log P(Y = 1) = \log P(X | Y = 2) + \log P(Y = 2)$$
Substituting the Gaussian densities:
$$-\frac{1}{2}(x - \mu_1)^T\Sigma_1^{-1}(x - \mu_1) - \frac{1}{2}\log|\Sigma_1| + \log\pi_1$$ $$= -\frac{1}{2}(x - \mu_2)^T\Sigma_2^{-1}(x - \mu_2) - \frac{1}{2}\log|\Sigma_2| + \log\pi_2$$
With different covariances ($\Sigma_1 \neq \Sigma_2$), expanding the quadratic forms gives terms like $x^T\Sigma_1^{-1}x$ and $x^T\Sigma_2^{-1}x$ that don't cancel—leaving a quadratic function of $x$. This is QDA.
With equal covariances ($\Sigma_1 = \Sigma_2 = \Sigma$), these quadratic terms perfectly cancel:
$$-\frac{1}{2}x^T\Sigma^{-1}x + \mu_1^T\Sigma^{-1}x - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \log\pi_1$$ $$= -\frac{1}{2}x^T\Sigma^{-1}x + \mu_2^T\Sigma^{-1}x - \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \log\pi_2$$
The $x^T\Sigma^{-1}x$ terms cancel, leaving:
$$(\mu_1 - \mu_2)^T\Sigma^{-1}x = \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 - \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 + \log\frac{\pi_2}{\pi_1}$$
This is a linear function of $x$—hence Linear Discriminant Analysis.
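The boundary above translates directly into code. The sketch below (with made-up parameters) computes the weight vector and threshold from the equation just derived and applies the classification rule:

```python
import numpy as np

def lda_two_class_rule(mu1, mu2, Sigma, pi1, pi2):
    """Return (w, c) such that we predict class 1 when w @ x > c."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    c = 0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2) + np.log(pi2 / pi1)
    return w, c

# Made-up parameters for illustration.
mu1, mu2 = np.array([1.0, 2.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
w, c = lda_two_class_rule(mu1, mu2, Sigma, pi1=0.5, pi2=0.5)

x = np.array([0.2, 1.1])
print("class 1" if w @ x > c else "class 2")
```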
Estimating the pooled covariance matrix:
The standard pooled estimate of the shared covariance matrix (a bias-corrected version of the maximum likelihood estimate) combines data from all classes:
$$\hat{\Sigma} = \frac{1}{n - K}\sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Equivalently:
$$\hat{\Sigma} = \frac{\sum_{k=1}^{K}(n_k - 1)\hat{\Sigma}_k}{\sum_{k=1}^{K}(n_k - 1)}$$
This is a weighted average of the class-specific sample covariance matrices, where weights are $(n_k - 1)$ (degrees of freedom for each class). Larger classes contribute more to the estimate.
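A small sketch of the pooled estimator, again assuming `X` and `y` are a NumPy feature matrix and label vector:

```python
import numpy as np

def pooled_covariance(X, y):
    """Weighted average of class covariances with (n_k - 1) weights."""
    p = X.shape[1]
    num = np.zeros((p, p))
    dof = 0
    for k in np.unique(y):
        Xk = X[y == k]
        num += (len(Xk) - 1) * np.cov(Xk, rowvar=False)
        dof += len(Xk) - 1
    return num / dof   # equivalently, divide by n - K
```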
Testing the assumption:
Box's M-test formally tests the null hypothesis $H_0: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$. However, in practice the test is highly sensitive to departures from normality, and with large samples it rejects the null even for differences too small to matter, so its verdict should be interpreted with caution.
A practical approach is to compare the eigenvalues of class-specific covariance matrices. If the ratios of largest to smallest eigenvalues differ substantially across classes, the assumption is likely violated.
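For example, a quick heuristic check (a sketch, assuming the same `X`, `y` layout as above) compares the eigenvalue spread of each class covariance:

```python
import numpy as np

def covariance_spread(X, y):
    """Print the largest-to-smallest eigenvalue ratio of each class covariance."""
    for k in np.unique(y):
        Xk = X[y == k]
        eigvals = np.linalg.eigvalsh(np.cov(Xk, rowvar=False))
        print(f"class {k}: eigenvalue ratio = {eigvals.max() / eigvals.min():.1f}")
```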
The third component of LDA is the specification of class prior probabilities $\pi_k = P(Y = k)$. While not an 'assumption' in the same sense as Gaussianity, how we estimate or specify priors significantly affects model behavior, especially for imbalanced classification problems.
Estimation approaches:
Sample proportions (most common): $$\hat{\pi}_k = \frac{n_k}{n}$$ where $n_k$ is the number of training samples in class $k$.
Equal priors: $$\pi_k = \frac{1}{K} \quad \forall k$$ This ignores training set class frequencies, useful when the test distribution differs.
Domain-specified priors: Set based on domain knowledge of population frequencies (e.g., disease prevalence).
With imbalanced classes (e.g., 95% negative, 5% positive), sample proportions will push predictions toward the majority class. In fraud detection or disease screening, this can mean missing nearly all positive cases! Consider whether training proportions match deployment proportions, and adjust priors accordingly.
Effect of priors on decision boundaries:
Priors shift the decision boundary. For a two-class problem, the log-odds ratio includes the term $\log(\pi_1/\pi_2)$. This shifts the boundary toward the class with lower prior—requiring more 'evidence' (in the form of likelihood) to classify into the less common class.
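The one-dimensional sketch below (made-up means and a shared unit variance) makes the shift concrete: only the $\log(\pi_2/\pi_1)$ term moves the boundary.

```python
import numpy as np

# One-dimensional illustration: equal variances, means at +1 and -1,
# so with equal priors the boundary sits at x = 0.
mu1, mu2, sigma2 = 1.0, -1.0, 1.0

def boundary(pi1):
    """Solve w*x = c for the 1-D case; only the log-prior term moves it."""
    pi2 = 1.0 - pi1
    w = (mu1 - mu2) / sigma2
    c = 0.5 * (mu1**2 - mu2**2) / sigma2 + np.log(pi2 / pi1)
    return c / w

for pi1 in (0.5, 0.7, 0.95):
    print(f"pi1 = {pi1:.2f} -> boundary at x = {boundary(pi1):+.3f}")
# Raising pi1 moves the boundary toward the lower-prior class (class 2),
# enlarging the region assigned to class 1.
```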
Implicit assumption:
Using sample proportions as priors implicitly assumes that the training data were sampled from the same population the model will serve, that class frequencies will not shift between training and deployment, and that misclassification costs are roughly symmetric.
When any of these assumptions fail, adjusting priors becomes essential. In medical diagnosis, adjusting priors to reflect disease prevalence (rather than case-control study proportions) is critical for calibrated predictions.
| Prior Strategy | When to Use | Effect on Predictions |
|---|---|---|
| Sample proportions | Training reflects population; symmetric costs | Boundary at natural ratio |
| Equal priors | Training imbalanced but want unbiased classification | Boundary purely based on likelihood |
| Population priors | Training sample is biased (case-control studies) | Calibrated posterior probabilities |
| Cost-adjusted priors | Asymmetric misclassification costs | Boundary shifts toward costly-to-miss class |
Beyond the three core assumptions, LDA makes additional implicit assumptions that practitioners should be aware of. The most consequential in practice is that there is enough data to estimate the covariance matrix reliably:
In high-dimensional settings ($p$ large relative to $n$), estimating $\Sigma$ becomes increasingly difficult. With $p > n$, the sample covariance is singular. Even when $n > p$ but not by much, the inverse is numerically unstable. Regularized LDA (discussed later in this module) addresses this by shrinking the covariance estimate.
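In scikit-learn, shrinking the covariance estimate toward a scaled identity is available directly. The sketch below (on simulated data, labels generated only for illustration) compares plain and shrunken LDA when $p$ is close to $n$:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 80, 60                      # few samples relative to dimensionality
X = rng.normal(size=(n, p))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

plain = LinearDiscriminantAnalysis()                                  # unregularized
shrunk = LinearDiscriminantAnalysis(solver="lsqr", shrinkage="auto")  # Ledoit-Wolf shrinkage

for name, clf in [("plain", plain), ("shrinkage", shrunk)]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:10s} CV accuracy: {acc:.3f}")
```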
Feature preprocessing implications:
Certain preprocessing steps affect how well LDA assumptions hold:
Standardization: Centering and scaling features doesn't change Gaussianity but can improve numerical stability.
Log/power transformations: Can induce approximate normality for right-skewed features (common in count data, durations, monetary values).
Box-Cox transformations: Data-driven power transformations toward normality; keep in mind that LDA's assumption concerns the class-conditional distributions, so normality should be checked within each class after transforming (see the sketch after this list).
Principal component analysis: Decorrelates features, but doesn't guarantee normality of the components.
Outlier removal/winsorization: Can substantially improve Gaussian fit but may lose information.
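As an example of the transformation route, the sketch below applies scikit-learn's `PowerTransformer` to simulated right-skewed data; in a real pipeline the transformer would be fit on training data only, and the class-conditional fit re-checked afterwards:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
# Right-skewed feature (e.g., a duration or monetary amount), simulated here.
x = rng.lognormal(mean=0.0, sigma=0.8, size=(500, 1))

pt = PowerTransformer(method="box-cox")   # Box-Cox requires strictly positive values
x_t = pt.fit_transform(x)

print("skewness before:", float(((x - x.mean()) ** 3).mean() / x.std() ** 3))
print("skewness after: ", float(((x_t - x_t.mean()) ** 3).mean() / x_t.std() ** 3))
```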
Understanding the consequences of assumption violations helps you decide when LDA is acceptable versus when alternatives are necessary.
Violation: Non-Gaussian distributions
Mild violations (slightly heavy tails, minor asymmetry): decision boundaries usually remain close to optimal, though posterior probabilities may be somewhat miscalibrated; transformations can help but are often unnecessary.
Severe violations (multimodality, very heavy tails, discrete data): the estimated boundaries can be badly misplaced and the posterior probabilities unreliable; a mixture-based, nonparametric, or discriminative method is usually the better choice.
Empirically, LDA is often more robust to non-Gaussianity than theory suggests. This is partly because LDA doesn't need the full distributional form to be correct—it needs the linear decision boundary to be approximately correct. Many non-Gaussian distributions still have reasonably linear Bayes boundaries.
Violation: Unequal covariances
This is typically more consequential than non-Gaussianity: when the covariances differ, the true Bayes boundary is quadratic, and no linear boundary can reproduce it.
When class covariances are substantially different, LDA systematically misclassifies observations in the regions where its linear boundary diverges from the quadratic one, and the pooled covariance estimate describes neither class well.
Diagnosis: compare the class-specific covariance matrices directly (eigenvalue spreads, Box's M-test as discussed above) and check whether QDA clearly outperforms LDA in cross-validation.
Solutions: use QDA when each class has enough samples to support its own covariance estimate, or regularized discriminant analysis, which interpolates between the pooled and class-specific covariances.
| Violation | Severity | Impact | Mitigation |
|---|---|---|---|
| Mild non-Gaussianity | Low | Slightly suboptimal boundaries | Often acceptable; transformations if needed |
| Severe non-Gaussianity | High | Wrong boundaries, miscalibrated probabilities | Use different method (SVM, RF, etc.) |
| Unequal covariances | Medium-High | Linear boundary where quadratic is needed | Use QDA or regularized DA |
| Multimodal classes | High | Single Gaussian can't represent structure | Mixture models or nonparametric |
| Outliers | Medium | Distorted mean/covariance estimates | Robust estimation or outlier removal |
| High dimensionality ($p \approx n$) | High | Singular or unstable covariance | Regularized LDA or dimensionality reduction |
Understanding how LDA relates to other classification methods illuminates its unique position:
LDA vs Logistic Regression:
Both produce linear decision boundaries, but from different directions: logistic regression models the posterior $P(Y = k | X)$ directly (discriminative), while LDA models $P(X | Y = k)$ and $P(Y = k)$ and derives the posterior via Bayes' theorem (generative).
Key differences: LDA exploits its distributional assumptions to use the data more efficiently when those assumptions hold; logistic regression makes no assumption about the distribution of $X$ and is therefore more robust to outliers and non-Gaussian features; and with perfectly separable classes, the logistic regression likelihood has no finite maximizer, whereas LDA remains stable.
LDA tends to outperform logistic regression when: (1) Sample size is small relative to dimensionality, (2) Classes are well-separated, (3) The Gaussian assumption approximately holds. Logistic regression wins when assumptions are violated or with very large datasets.
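A minimal comparison sketch using scikit-learn on simulated data is shown below; which method wins depends on the data and the random seed, so treat it as a template rather than a verdict:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small simulated dataset; with few samples and roughly Gaussian classes,
# LDA's extra assumptions can pay off.
X, y = make_classification(n_samples=100, n_features=10, n_informative=5,
                           random_state=0)

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    acc = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:7s} 5-fold CV accuracy: {acc:.3f}")
```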
LDA vs QDA:
Both are Gaussian generative classifiers: QDA estimates a separate covariance matrix for each class, giving quadratic boundaries at the cost of roughly $K \cdot p(p+1)/2$ covariance parameters, while LDA pools a single covariance matrix, giving linear boundaries with far fewer parameters and lower variance.
LDA is essentially a constrained QDA where the constraint is $\Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$.
LDA vs Naive Bayes (Gaussian):
Both make Gaussian assumptions: Gaussian Naive Bayes assumes the features are conditionally independent within each class (a diagonal covariance matrix), while LDA models a full covariance matrix shared across classes.
Naive Bayes makes a stronger assumption (zero correlations) but requires even fewer parameters: $O(Kp)$ vs LDA's $O(p^2)$.
LDA as dimensionality reduction:
LDA can also be viewed as a technique that finds the projection maximizing class separation. In this view, it's an alternative to PCA for supervised dimensionality reduction—projecting onto the directions that best discriminate classes rather than maximize variance.
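In scikit-learn this view corresponds to calling `transform` after fitting. The sketch below projects the Fisher iris data onto its two discriminant directions (for $K = 3$ classes, at most $K - 1 = 2$ components exist):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Fit LDA and project onto the two discriminant directions.
lda = LinearDiscriminantAnalysis(n_components=2)
X_proj = lda.fit(X, y).transform(X)

print(X_proj.shape)                    # (150, 2)
print(lda.explained_variance_ratio_)   # share of between-class variance per direction
```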
What's next:
Now that we understand the assumptions that justify linear decision boundaries, the next page explores the shared covariance structure in greater detail—how it's estimated, why it enables dimensionality reduction, and how to interpret the resulting discriminant functions geometrically.
You now have a deep understanding of the three core assumptions underlying LDA: Gaussian class-conditional distributions, equal covariance matrices, and class prior probabilities. You understand why these assumptions lead to linear decision boundaries and what happens when they're violated. Next, we'll explore the shared covariance structure in depth.