Linear Discriminant Analysis makes a strong assumption: all classes share the same covariance matrix. This assumption yields elegant linear decision boundaries but may be fundamentally wrong when classes truly differ in their internal structure—when one class is compact and another is diffuse, or when features are correlated differently across classes.
Quadratic Discriminant Analysis (QDA) relaxes this constraint, allowing each class to have its own covariance matrix $\Sigma_k$. The result is a richer model capable of capturing class-specific correlation structures and quadratic (curved) decision boundaries that can better separate classes with heterogeneous variances.
However, this flexibility comes at a cost: QDA requires estimating many more parameters, making it susceptible to overfitting when sample sizes are small relative to dimensionality. Understanding the LDA-QDA tradeoff is essential for choosing the right method in practice.
By the end of this page, you will understand: how the QDA generative model differs from LDA's, why class-specific covariances lead to quadratic boundaries, the geometry of QDA decision surfaces, parameter estimation for QDA, the bias-variance tradeoff between LDA and QDA, and when to prefer each method.
Like LDA, QDA is a generative classifier that models the joint distribution $P(X, Y)$ by specifying class priors and class-conditional densities. The key difference is in the covariance assumptions.
The QDA model:
For each class $k \in \{1, 2, \ldots, K\}$:
$$X | Y = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$
where:
- $\mu_k \in \mathbb{R}^p$ is the mean vector of class $k$
- $\Sigma_k$ is the $p \times p$ covariance matrix specific to class $k$
The probability density for class $k$ is:
$$P(X = x | Y = k) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right)$$
QDA relaxes only one of LDA's assumptions: the equal covariance constraint. It still assumes Gaussian class-conditional distributions—each class is a multivariate Gaussian, just with its own shape and orientation. The 'Q' in QDA refers to the quadratic form of the resulting discriminant functions, not a quadratic assumption about distributions.
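To make the generative model concrete, here is a minimal sketch that draws samples from a two-class QDA model. The priors, means, and covariances are illustrative choices of ours, not from any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: two classes with different covariances
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # compact, isotropic
        np.array([[4.0, 1.5], [1.5, 1.0]])]   # elongated, correlated

def sample_qda(n, rng):
    """Draw n points from the generative model: Y ~ priors, X | Y=k ~ N(mu_k, Sigma_k)."""
    y = rng.choice(len(priors), size=n, p=priors)
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in y])
    return X, y

X, y = sample_qda(500, rng)
print(X.shape, y.shape)  # (500, 2) (500,)
```

Each class is still Gaussian; only the covariance matrices differ, which is exactly the one assumption QDA relaxes.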
Parameter count comparison:
The number of parameters required by each method reveals the complexity tradeoff:
For $K$ classes and $p$ features:
| Model | Means | Covariances | Priors | Total |
|---|---|---|---|---|
| LDA | $Kp$ | $\frac{p(p+1)}{2}$ | $K-1$ | $Kp + \frac{p(p+1)}{2} + K - 1$ |
| QDA | $Kp$ | $K \cdot \frac{p(p+1)}{2}$ | $K-1$ | $Kp + K\frac{p(p+1)}{2} + K - 1$ |
Example: With $K = 3$ classes and $p = 10$ features:
- LDA: $30 + 55 + 2 = 87$ parameters
- QDA: $30 + 165 + 2 = 197$ parameters
QDA requires about $K$ times as many covariance parameters. For high-dimensional problems, this difference becomes substantial.
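The parameter counts in the table can be checked with a short helper (a sketch; the function name is ours):

```python
def n_params(K, p, shared_cov):
    """Free parameters: K*p means, covariance entries, K-1 priors."""
    cov = p * (p + 1) // 2   # entries of one symmetric p x p matrix
    if not shared_cov:
        cov *= K             # QDA: one covariance per class
    return K * p + cov + (K - 1)

# K = 3 classes, p = 10 features
print(n_params(3, 10, shared_cov=True))   # LDA: 30 + 55 + 2 = 87
print(n_params(3, 10, shared_cov=False))  # QDA: 30 + 165 + 2 = 197
```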
| Aspect | LDA | QDA |
|---|---|---|
| Class-conditional distribution | $\mathcal{N}(\mu_k, \Sigma)$ | $\mathcal{N}(\mu_k, \Sigma_k)$ |
| Covariance structure | Shared across all classes | Different for each class |
| Number of covariance parameters | $\frac{p(p+1)}{2}$ | $K \cdot \frac{p(p+1)}{2}$ |
| Decision boundary form | Linear (hyperplanes) | Quadratic (conics) |
| Flexibility | Low | High |
| Variance of estimates | Lower (pooling) | Higher (no pooling) |
Let's rigorously derive the form of QDA decision boundaries, showing exactly where the quadratic terms arise.
The classification objective:
We classify $x$ to the class maximizing the posterior:
$$\hat{y} = \arg\max_k P(Y = k | X = x)$$
Using Bayes' rule and taking logarithms:
$$\hat{y} = \arg\max_k \left[\log P(X = x | Y = k) + \log \pi_k\right]$$
Substituting the Gaussian density with class-specific covariance:
$$\log P(X = x | Y = k) = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)$$
The critical difference from LDA:
In LDA, the terms $-\frac{p}{2}\log(2\pi)$ and $-\frac{1}{2}\log|\Sigma|$ are constant across classes and can be dropped. In QDA, $\log|\Sigma_k|$ depends on $k$ and must be retained.
The QDA discriminant function:
$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k$$
Expanding the quadratic form:
$$(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) = x^T\Sigma_k^{-1}x - 2\mu_k^T\Sigma_k^{-1}x + \mu_k^T\Sigma_k^{-1}\mu_k$$
So:
$$\delta_k(x) = -\frac{1}{2}x^T\Sigma_k^{-1}x + \mu_k^T\Sigma_k^{-1}x - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k$$
This can be written in the form:
$$\delta_k(x) = x^T A_k x + b_k^T x + c_k$$
where:
- $A_k = -\frac{1}{2}\Sigma_k^{-1}$
- $b_k = \Sigma_k^{-1}\mu_k$
- $c_k = -\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k$
This is a quadratic function of $x$—hence Quadratic Discriminant Analysis.
In LDA, the quadratic terms $x^T\Sigma^{-1}x$ cancel across classes because $\Sigma$ is shared. In QDA, each class has $x^T\Sigma_k^{-1}x$—since $\Sigma_k^{-1}$ differs by class, these terms don't cancel and contribute to a quadratic boundary.
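A one-dimensional toy example (our own numbers) makes the surviving quadratic term visible. Take two classes with equal means and priors but different variances, $X|Y{=}1 \sim \mathcal{N}(0,1)$ and $X|Y{=}2 \sim \mathcal{N}(0,4)$. The discriminant difference reduces to $\delta_1(x) - \delta_2(x) = \log 2 - \frac{3}{8}x^2$, which is quadratic in $x$:

```python
import numpy as np

# delta_1(x) - delta_2(x) with mu_1 = mu_2 = 0, equal priors:
#   [-0.5*log(1) - x^2/2] - [-0.5*log(4) - x^2/8] = log(2) - (3/8) x^2
def score_diff(x):
    return np.log(2.0) - 0.375 * x**2

# Boundary where the scores are equal: x^2 = (8/3) log 2
boundary = np.sqrt(8.0 * np.log(2.0) / 3.0)
print(round(boundary, 4))  # 1.3596

assert score_diff(0.0) > 0  # near the shared mean, the low-variance class wins
assert score_diff(3.0) < 0  # in the tails, the high-variance class wins
```

Note that class 2's region is the two tails, a disconnected set: with a shared variance the quadratic terms would cancel and no linear boundary could produce this.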
The decision boundary between classes $k$ and $l$:
The boundary is where $\delta_k(x) = \delta_l(x)$:
$$x^T(A_k - A_l)x + (b_k - b_l)^Tx + (c_k - c_l) = 0$$
This is the equation of a quadric surface (a conic section in 2D). Depending on the eigenvalues of $(A_k - A_l)$, this surface can be:
- An ellipse or ellipsoid (all eigenvalues share the same sign)
- A hyperbola or hyperboloid (eigenvalues of mixed signs)
- A parabola, or degenerate forms such as lines and planes (some eigenvalues zero)
The specific shape depends on the relationship between $\Sigma_k$ and $\Sigma_l$.
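A small sketch (our own helper and example matrices) classifies the boundary family from the eigenvalue signs of $A_k - A_l = -\frac{1}{2}(\Sigma_k^{-1} - \Sigma_l^{-1})$. This determines only the conic family; the exact curve also depends on the linear and constant terms:

```python
import numpy as np

def boundary_type(Sigma1, Sigma2, tol=1e-10):
    """Classify the QDA boundary family from eigenvalue signs of A_1 - A_2."""
    A_diff = -0.5 * (np.linalg.inv(Sigma1) - np.linalg.inv(Sigma2))
    eig = np.linalg.eigvalsh(A_diff)
    pos, neg = np.sum(eig > tol), np.sum(eig < -tol)
    if pos == 0 and neg == 0:
        return "linear (covariances equal)"
    if pos == 0 or neg == 0:
        # All nonzero eigenvalues share one sign
        return "ellipse" if pos + neg == len(eig) else "parabola/degenerate"
    return "hyperbola"

print(boundary_type(np.eye(2), 4 * np.eye(2)))              # proportional -> ellipse
print(boundary_type(np.diag([1.0, 4.0]), np.diag([4.0, 1.0])))  # mixed signs -> hyperbola
print(boundary_type(np.eye(2), np.eye(2)))                  # equal -> linear
```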
Understanding the geometry of QDA boundaries provides intuition for when QDA is beneficial and how it differs from LDA.
Elliptical class contours:
Each class's Gaussian distribution has elliptical contours of equal probability density. In QDA, these ellipses can differ in:
- Size (overall variance)
- Shape (ratio of eigenvalues)
- Orientation (eigenvector directions)
The decision boundary between two classes occurs where the probability densities (weighted by priors) are equal—where two elliptical 'hills' have the same height.
Examples of boundary shapes:
Ellipse: One class has much larger variance than the other. The smaller-variance class is 'inside' a closed curve, the larger-variance class 'outside.'
Hyperbola: Classes have similar overall variance but different orientations. The boundary separates them with open curves.
Two lines (degenerate hyperbola): When covariances are nearly equal in some directions but different in others.
Near-linear: When covariance differences are small, QDA boundaries are close to LDA's linear boundaries.
Unlike LDA (where all class regions are convex), QDA class regions can be non-convex and even disconnected. A class can have multiple separate 'islands' in feature space—impossible with LDA's linear boundaries.
Visualizing the difference:
Consider a two-class problem in 2D: Class 1 has a compact, circular covariance, while Class 2 is strongly elongated along the x-axis, with the two means close together.
LDA would fit a straight line boundary. But the true optimal boundary curves: near the x-axis, Class 2's elongated ellipse dominates; away from the x-axis, Class 1's compact circle dominates. QDA captures this with a hyperbolic boundary.
The multi-class case:
With $K$ classes, there are $\binom{K}{2}$ pairwise boundaries, each potentially quadratic. The overall decision regions are intersections of quadratic constraints, yielding complex shapes. Unlike LDA's convex polyhedra, QDA regions can have curved edges and non-convex shapes.
| Covariance Relationship | Boundary Type | Geometric Interpretation |
|---|---|---|
| $\Sigma_1 = \lambda \Sigma_2$ (proportional) | Ellipse/Circle | One class surrounded by another |
| $\Sigma_1, \Sigma_2$ have same eigenvectors, different eigenvalues | Axis-aligned hyperbola | Classes separated along principal axes |
| $\Sigma_1, \Sigma_2$ have different eigenvectors | Rotated hyperbola | Oblique separation |
| $\Sigma_1 \approx \Sigma_2$ | Near-linear | Close to LDA boundary |
| One class has near-zero variance in some direction | Degenerate (lines) | Class creates a 'wall' |
QDA parameter estimation follows the maximum likelihood principle, estimating separate covariance matrices for each class.
Step 1: Estimate class priors
$$\hat{\pi}_k = \frac{n_k}{n}$$
Step 2: Estimate class means
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y_i = k} x_i$$
Step 3: Estimate class-specific covariances
Unlike LDA, we do not pool. Each class gets its own estimate:
$$\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Step 4: Compute discriminant functions
For a new observation $x$:
$$\hat{\delta}_k(x) = -\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(x - \hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x - \hat{\mu}_k) + \log\hat{\pi}_k$$
Step 5: Classify
$$\hat{y} = \arg\max_k \hat{\delta}_k(x)$$
For QDA, each class requires $n_k > p$ samples for $\hat{\Sigma}_k$ to be invertible. If any class has fewer samples than features, its covariance matrix is singular and QDA fails. This is more restrictive than LDA, which only needs the pooled covariance to be non-singular (requiring $n - K > p$).
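The $n_k > p$ requirement is easy to verify numerically: a sample covariance built from $n_k$ points has rank at most $n_k - 1$, so with $n_k \le p$ it is singular (illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_k, p = 5, 10                      # fewer samples than features
X_k = rng.normal(size=(n_k, p))
cov_k = np.cov(X_k, rowvar=False)   # p x p matrix, but rank <= n_k - 1

rank = np.linalg.matrix_rank(cov_k)
print(rank)  # at most n_k - 1 = 4: singular, so it cannot be inverted for QDA
```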
```python
import numpy as np
from scipy.linalg import inv, det

class QuadraticDiscriminantAnalysis:
    """
    QDA implementation from first principles.
    Demonstrates class-specific covariance estimation.
    """

    def __init__(self, reg_param=0.0):
        self.reg_param = reg_param       # Regularization for numerical stability
        self.classes_ = None
        self.means_ = None               # Class means
        self.priors_ = None              # Class priors
        self.covariances_ = None         # Class-specific covariances
        self.covariances_inv_ = None     # Inverses for efficiency
        self.log_dets_ = None            # Log determinants

    def fit(self, X, y):
        """Fit QDA model to training data."""
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)

        # Step 1: Estimate class priors
        class_counts = np.array([np.sum(y == k) for k in self.classes_])
        self.priors_ = class_counts / n_samples

        # Step 2: Estimate class means
        self.means_ = np.array([X[y == k].mean(axis=0) for k in self.classes_])

        # Step 3: Estimate class-specific covariances
        self.covariances_ = []
        self.covariances_inv_ = []
        self.log_dets_ = []
        for k_idx, k in enumerate(self.classes_):
            X_k = X[y == k]
            n_k = len(X_k)

            # Check for sufficient samples
            if n_k <= n_features:
                raise ValueError(
                    f"Class {k} has {n_k} samples but {n_features} features. "
                    f"QDA requires n_k > p for each class. Consider LDA or regularization."
                )

            # Estimate covariance
            X_k_centered = X_k - self.means_[k_idx]
            cov_k = (X_k_centered.T @ X_k_centered) / (n_k - 1)

            # Add regularization for numerical stability
            if self.reg_param > 0:
                cov_k = (1 - self.reg_param) * cov_k + \
                        self.reg_param * np.eye(n_features)

            self.covariances_.append(cov_k)
            self.covariances_inv_.append(inv(cov_k))
            self.log_dets_.append(np.log(det(cov_k)))

        return self

    def decision_function(self, X):
        """Compute QDA discriminant scores for each class."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        scores = np.zeros((n_samples, n_classes))

        for k in range(n_classes):
            # Center the data
            diff = X - self.means_[k]

            # Compute squared Mahalanobis distance
            mahal = np.sum(diff @ self.covariances_inv_[k] * diff, axis=1)

            # QDA discriminant function
            scores[:, k] = (
                -0.5 * self.log_dets_[k]
                - 0.5 * mahal
                + np.log(self.priors_[k])
            )

        return scores

    def predict(self, X):
        """Predict class labels."""
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]

    def predict_proba(self, X):
        """Predict posterior probabilities."""
        scores = self.decision_function(X)
        # Softmax to convert log-posteriors to probabilities
        exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```

The choice between LDA and QDA embodies a fundamental statistical tradeoff: bias versus variance. Understanding this tradeoff is critical for model selection.
LDA's tradeoff:
- Higher bias: if the true covariances differ across classes, the shared-$\Sigma$ assumption misspecifies the model, and the linear boundary cannot match the optimal quadratic one.
- Lower variance: pooling all classes into one covariance estimate keeps it stable even with modest sample sizes.

QDA's tradeoff:
- Lower bias: class-specific covariances can match heterogeneous class shapes and curved optimal boundaries.
- Higher variance: each $\hat{\Sigma}_k$ is estimated from only $n_k$ samples, so estimates are noisy unless every class is well represented.

The crossover point:
As sample size grows, QDA's variance penalty shrinks while LDA's bias (if covariances truly differ) persists, so QDA eventually overtakes LDA.

Generally:
- Small $n$ relative to $p$: LDA's lower variance usually wins.
- Large $n$ per class: QDA's lower bias usually wins.
- Truly equal covariances: LDA wins at any sample size.
Use LDA when: (1) Sample sizes are small relative to dimensions, (2) Classes appear to have similar spreads, (3) You want interpretability via Fisher's projections. Use QDA when: (1) Ample data per class ($n_k \gg p$), (2) Classes clearly have different covariance structures, (3) Flexibility is more important than parsimony.
Empirical guidelines:
A rough rule of thumb: QDA becomes preferable when each class has at least 5–10 times as many samples as features. With fewer samples, the covariance estimates are too noisy for QDA to benefit from its flexibility.
Cross-validation for selection:
Rather than relying on rules of thumb, cross-validation provides a principled way to choose:
1. Fit both LDA and QDA on each training fold.
2. Estimate out-of-sample performance (accuracy or log-likelihood) on the held-out folds.
3. Select the better-performing model, preferring LDA when the two are close.
This accounts for both the true covariance structure and the sample size available.
Effect of class imbalance:
With imbalanced classes, QDA's disadvantage is amplified: the minority class has very few samples for covariance estimation, making $\hat{\Sigma}_{\text{minority}}$ highly unstable. LDA's pooling helps stabilize estimation in this setting.
QDA has higher computational costs than LDA, both in training and prediction:
Training complexity: covariance estimation costs $O(np^2)$ for both methods, but inversion costs $O(p^3)$ for LDA's single pooled matrix versus $O(Kp^3)$ for QDA's $K$ matrices.
For large $K$, QDA training is $K$ times slower in the inversion step.
Prediction complexity: each LDA discriminant is linear, costing $O(p)$ per class and $O(Kp)$ per sample; each QDA discriminant requires a quadratic form, costing $O(p^2)$ per class and $O(Kp^2)$ per sample.
QDA prediction is $O(p)$ times slower per sample—significant for high-dimensional data.
Storage: LDA stores one $p \times p$ covariance (or its inverse), $O(p^2)$; QDA stores $K$ of them, $O(Kp^2)$.
In practice, we store Cholesky factors rather than explicit inverses. The Cholesky decomposition $\Sigma_k = L_k L_k^T$ allows efficient computation of both the Mahalanobis distance (via forward/back substitution) and the log-determinant ($2\sum_i \log L_{k,ii}$).
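A sketch of this trick with illustrative matrices, using SciPy's `cholesky` and `solve_triangular`:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # illustrative class covariance
mu = np.array([1.0, -1.0])                  # illustrative class mean
X = rng.normal(size=(4, 2))                 # points to score

# Lower-triangular factor: Sigma = L @ L.T
L = cholesky(Sigma, lower=True)

# log|Sigma| = 2 * sum(log(diag(L)))
log_det = 2.0 * np.sum(np.log(np.diag(L)))

# Squared Mahalanobis distances via forward substitution: solve L z = (x - mu)
Z = solve_triangular(L, (X - mu).T, lower=True)
mahal = np.sum(Z**2, axis=0)

# Check against the explicit inverse/determinant formulas
diff = X - mu
assert np.isclose(log_det, np.log(np.linalg.det(Sigma)))
assert np.allclose(mahal, np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1))
```

Beyond speed, this avoids forming explicit inverses, which is numerically safer when a $\Sigma_k$ is ill-conditioned.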
| Operation | LDA | QDA |
|---|---|---|
| Training (covariance) | $O(np^2)$ | $O(np^2)$ |
| Training (inversion) | $O(p^3)$ | $O(Kp^3)$ |
| Prediction (per sample) | $O(Kp)$ | $O(Kp^2)$ |
| Memory (covariance storage) | $O(p^2)$ | $O(Kp^2)$ |
Before committing to LDA or QDA, several diagnostics can guide the choice:
1. Compare class covariance matrices:
Compute $\hat{\Sigma}_k$ for each class and compare:
- Determinants (overall spread, or "volume")
- Traces (total variance)
- Condition numbers and eigenvalue ranges (eccentricity)
- Leading eigenvectors (orientation)
Large disparities in any of these suggest QDA may help.
2. Box's M-test:
Formally tests $H_0: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$. Rejection suggests QDA. However:
- The test is highly sensitive to departures from normality, so rejection may reflect non-Gaussian data rather than unequal covariances.
- With large samples it rejects for practically negligible differences; with small samples it has low power.
Use as one input, not a definitive answer.
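For reference, here is a compact sketch of the Box's M statistic with its chi-square approximation, using the textbook formulas (the function name and example data are ours; verify against a statistics reference before relying on it):

```python
import numpy as np
from scipy.stats import chi2

def box_m_test(X, y):
    """Box's M test of equal covariance matrices (chi-square approximation)."""
    classes = np.unique(y)
    K, p = len(classes), X.shape[1]
    n_k = np.array([np.sum(y == k) for k in classes])
    N = n_k.sum()
    S_k = [np.cov(X[y == k], rowvar=False) for k in classes]
    S_pooled = sum((nk - 1) * S for nk, S in zip(n_k, S_k)) / (N - K)

    # M = (N - K) log|S_pooled| - sum_k (n_k - 1) log|S_k|
    M = (N - K) * np.linalg.slogdet(S_pooled)[1] \
        - sum((nk - 1) * np.linalg.slogdet(S)[1] for nk, S in zip(n_k, S_k))

    # Small-sample correction factor (Box, 1949)
    c = (np.sum(1.0 / (n_k - 1)) - 1.0 / (N - K)) \
        * (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (K - 1))
    stat = (1 - c) * M
    df = p * (p + 1) * (K - 1) / 2
    return stat, chi2.sf(stat, df)

# Synthetic data: class 1 has twice the standard deviation of class 0
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 3)), 2.0 * rng.normal(size=(60, 3))])
y = np.repeat([0, 1], 60)
stat, pval = box_m_test(X, y)
print(pval < 0.05)  # the variance ratio of 4 per dimension should be detected
```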
```python
import numpy as np

def compare_covariances(X, y, classes=None):
    """
    Compare class covariance structures to guide LDA vs QDA choice.
    """
    if classes is None:
        classes = np.unique(y)

    results = {}
    for k in classes:
        X_k = X[y == k]
        n_k = len(X_k)
        cov_k = np.cov(X_k, rowvar=False)
        eigenvalues = np.linalg.eigvalsh(cov_k)
        results[k] = {
            'n_samples': n_k,
            'determinant': np.linalg.det(cov_k),
            'trace': np.trace(cov_k),
            'condition_number': eigenvalues.max() / eigenvalues.min(),
            'eigenvalue_range': (eigenvalues.min(), eigenvalues.max()),
        }

    # Summary comparison
    dets = [results[k]['determinant'] for k in classes]
    conds = [results[k]['condition_number'] for k in classes]
    det_ratio = max(dets) / min(dets) if min(dets) > 0 else float('inf')
    cond_ratio = max(conds) / min(conds)
    recommendation = ("Consider QDA" if det_ratio > 10 or cond_ratio > 5
                      else "LDA likely sufficient")

    return {
        'per_class': results,
        'det_ratio': det_ratio,
        'cond_ratio': cond_ratio,
        'recommendation': recommendation
    }
```

3. Cross-validation comparison:
The most reliable method: fit both LDA and QDA, compare cross-validated performance.
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()

lda_scores = cross_val_score(lda, X, y, cv=5, scoring='accuracy')
qda_scores = cross_val_score(qda, X, y, cv=5, scoring='accuracy')

print(f"LDA: {lda_scores.mean():.3f} ± {lda_scores.std():.3f}")
print(f"QDA: {qda_scores.mean():.3f} ± {qda_scores.std():.3f}")
```
If QDA significantly outperforms LDA, covariance heterogeneity is likely impacting results. If they're similar, prefer LDA for simplicity.
4. Examine boundary visualizations:
For low-dimensional problems (or after PCA reduction), visualize the fitted boundaries. If the LDA linear boundary seems to misalign with class separations, QDA may help.
What's next:
We've seen the extremes: LDA pools all covariances, QDA separates them entirely. But what if we want something in between? The next page explores decision boundaries—their geometric properties, how to interpret and visualize them, and the implications for classification at different points in feature space.
You now understand QDA's generative model with class-specific covariances, why this leads to quadratic decision boundaries, the geometric interpretation of these boundaries, and the bias-variance tradeoff between LDA and QDA. Next, we'll examine decision boundaries in depth.