Linear Discriminant Analysis makes a strong assumption: all classes share the same covariance matrix. This assumption yields elegant linear decision boundaries but may be fundamentally wrong when classes truly differ in their internal structure—when one class is compact and another is diffuse, or when features are correlated differently across classes.
Quadratic Discriminant Analysis (QDA) relaxes this constraint, allowing each class to have its own covariance matrix $\Sigma_k$. The result is a richer model capable of capturing class-specific correlation structures and quadratic (curved) decision boundaries that can better separate classes with heterogeneous variances.
However, this flexibility comes at a cost: QDA requires estimating many more parameters, making it susceptible to overfitting when sample sizes are small relative to dimensionality. Understanding the LDA-QDA tradeoff is essential for choosing the right method in practice.
By the end of this page, you will understand: how the QDA generative model differs from LDA's, why class-specific covariances lead to quadratic boundaries, the geometry of QDA decision surfaces, parameter estimation for QDA, the bias-variance tradeoff between LDA and QDA, and when to prefer each method.
Like LDA, QDA is a generative classifier that models the joint distribution $P(X, Y)$ by specifying class priors and class-conditional densities. The key difference is in the covariance assumptions.
The QDA model:
For each class $k \in \{1, 2, \ldots, K\}$:
$$X | Y = k \sim \mathcal{N}(\mu_k, \Sigma_k)$$
where:
- $\mu_k \in \mathbb{R}^p$ is the mean vector of class $k$
- $\Sigma_k$ is the $p \times p$ covariance matrix specific to class $k$
The probability density for class $k$ is:
$$P(X = x | Y = k) = \frac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)\right)$$
QDA relaxes only one of LDA's assumptions: the equal covariance constraint. It still assumes Gaussian class-conditional distributions—each class is a multivariate Gaussian, just with its own shape and orientation. The 'Q' in QDA refers to the quadratic form of the resulting discriminant functions, not a quadratic assumption about distributions.
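To make the generative model concrete, here is a minimal sketch that draws samples from a two-class QDA model. The priors, means, and covariances are illustrative choices of ours, not from any dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters: two classes with different covariances
priors = [0.5, 0.5]
means = [np.array([0.0, 0.0]), np.array([2.0, 0.0])]
covs = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # compact, isotropic
        np.array([[4.0, 1.5], [1.5, 1.0]])]   # elongated, correlated

def sample_qda(n, rng):
    """Draw n points from the generative model: Y ~ priors, X | Y=k ~ N(mu_k, Sigma_k)."""
    y = rng.choice(len(priors), size=n, p=priors)
    X = np.stack([rng.multivariate_normal(means[k], covs[k]) for k in y])
    return X, y

X, y = sample_qda(500, rng)
print(X.shape, y.shape)  # (500, 2) (500,)
```

Each class is still Gaussian; only the covariance matrices differ, which is exactly the one assumption QDA relaxes.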
Parameter count comparison:
The number of parameters required by each method reveals the complexity tradeoff:
For $K$ classes and $p$ features:
| Model | Means | Covariances | Priors | Total |
|---|---|---|---|---|
| LDA | $Kp$ | $\frac{p(p+1)}{2}$ | $K-1$ | $Kp + \frac{p(p+1)}{2} + K - 1$ |
| QDA | $Kp$ | $K \cdot \frac{p(p+1)}{2}$ | $K-1$ | $Kp + K\frac{p(p+1)}{2} + K - 1$ |
Example: With $K = 3$ classes and $p = 10$ features:
- LDA: $30 + 55 + 2 = 87$ parameters
- QDA: $30 + 165 + 2 = 197$ parameters
QDA requires about $K$ times as many covariance parameters. For high-dimensional problems, this difference becomes substantial.
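The parameter counts in the table can be checked with a short helper (a sketch; the function name is ours):

```python
def n_params(K, p, shared_cov):
    """Free parameters: K*p means, covariance entries, K-1 priors."""
    cov = p * (p + 1) // 2   # entries of one symmetric p x p matrix
    if not shared_cov:
        cov *= K             # QDA: one covariance per class
    return K * p + cov + (K - 1)

# K = 3 classes, p = 10 features
print(n_params(3, 10, shared_cov=True))   # LDA: 30 + 55 + 2 = 87
print(n_params(3, 10, shared_cov=False))  # QDA: 30 + 165 + 2 = 197
```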
| Aspect | LDA | QDA |
|---|---|---|
| Class-conditional distribution | $\mathcal{N}(\mu_k, \Sigma)$ | $\mathcal{N}(\mu_k, \Sigma_k)$ |
| Covariance structure | Shared across all classes | Different for each class |
| Number of covariance parameters | $\frac{p(p+1)}{2}$ | $K \cdot \frac{p(p+1)}{2}$ |
| Decision boundary form | Linear (hyperplanes) | Quadratic (conics) |
| Flexibility | Low | High |
| Variance of estimates | Lower (pooling) | Higher (no pooling) |
Let's rigorously derive the form of QDA decision boundaries, showing exactly where the quadratic terms arise.
The classification objective:
We classify $x$ to the class maximizing the posterior:
$$\hat{y} = \arg\max_k P(Y = k | X = x)$$
Using Bayes' rule and taking logarithms:
$$\hat{y} = \arg\max_k \left[\log P(X = x | Y = k) + \log \pi_k\right]$$
Substituting the Gaussian density with class-specific covariance:
$$\log P(X = x | Y = k) = -\frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k)$$
The critical difference from LDA:
In LDA, the terms $-\frac{p}{2}\log(2\pi)$ and $-\frac{1}{2}\log|\Sigma|$ are constant across classes and can be dropped. In QDA, $\log|\Sigma_k|$ depends on $k$ and must be retained.
The QDA discriminant function:
$$\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) + \log\pi_k$$
Expanding the quadratic form:
$$(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) = x^T\Sigma_k^{-1}x - 2\mu_k^T\Sigma_k^{-1}x + \mu_k^T\Sigma_k^{-1}\mu_k$$
So:
$$\delta_k(x) = -\frac{1}{2}x^T\Sigma_k^{-1}x + \mu_k^T\Sigma_k^{-1}x - \frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k$$
This can be written in the form:
$$\delta_k(x) = x^T A_k x + b_k^T x + c_k$$
where:
- $A_k = -\frac{1}{2}\Sigma_k^{-1}$
- $b_k = \Sigma_k^{-1}\mu_k$
- $c_k = -\frac{1}{2}\mu_k^T\Sigma_k^{-1}\mu_k - \frac{1}{2}\log|\Sigma_k| + \log\pi_k$
This is a quadratic function of $x$—hence Quadratic Discriminant Analysis.
In LDA, the quadratic terms $x^T\Sigma^{-1}x$ cancel across classes because $\Sigma$ is shared. In QDA, each class has $x^T\Sigma_k^{-1}x$—since $\Sigma_k^{-1}$ differs by class, these terms don't cancel and contribute to a quadratic boundary.
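A one-dimensional toy example (our own numbers) makes the surviving quadratic term visible. Take two classes with equal means and priors but different variances, $X|Y{=}1 \sim \mathcal{N}(0,1)$ and $X|Y{=}2 \sim \mathcal{N}(0,4)$. The discriminant difference reduces to $\delta_1(x) - \delta_2(x) = \log 2 - \frac{3}{8}x^2$, which is quadratic in $x$:

```python
import numpy as np

# delta_1(x) - delta_2(x) with mu_1 = mu_2 = 0, equal priors:
#   [-0.5*log(1) - x^2/2] - [-0.5*log(4) - x^2/8] = log(2) - (3/8) x^2
def score_diff(x):
    return np.log(2.0) - 0.375 * x**2

# Boundary where the scores are equal: x^2 = (8/3) log 2
boundary = np.sqrt(8.0 * np.log(2.0) / 3.0)
print(round(boundary, 4))  # 1.3596

assert score_diff(0.0) > 0  # near the shared mean, the low-variance class wins
assert score_diff(3.0) < 0  # in the tails, the high-variance class wins
```

Note that class 2's region is the two tails, a disconnected set: with a shared variance the quadratic terms would cancel and no linear boundary could produce this.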
The decision boundary between classes $k$ and $l$:
The boundary is where $\delta_k(x) = \delta_l(x)$:
$$x^T(A_k - A_l)x + (b_k - b_l)^Tx + (c_k - c_l) = 0$$
This is the equation of a quadric surface (a conic section in 2D). Depending on the eigenvalues of $(A_k - A_l)$, this surface can be:
- An ellipse or ellipsoid (all eigenvalues share the same sign)
- A hyperbola or hyperboloid (eigenvalues of mixed signs)
- A parabola, or degenerate forms such as lines and planes (some eigenvalues zero)
The specific shape depends on the relationship between $\Sigma_k$ and $\Sigma_l$.
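A small sketch (our own helper and example matrices) classifies the boundary family from the eigenvalue signs of $A_k - A_l = -\frac{1}{2}(\Sigma_k^{-1} - \Sigma_l^{-1})$. This determines only the conic family; the exact curve also depends on the linear and constant terms:

```python
import numpy as np

def boundary_type(Sigma1, Sigma2, tol=1e-10):
    """Classify the QDA boundary family from eigenvalue signs of A_1 - A_2."""
    A_diff = -0.5 * (np.linalg.inv(Sigma1) - np.linalg.inv(Sigma2))
    eig = np.linalg.eigvalsh(A_diff)
    pos, neg = np.sum(eig > tol), np.sum(eig < -tol)
    if pos == 0 and neg == 0:
        return "linear (covariances equal)"
    if pos == 0 or neg == 0:
        # All nonzero eigenvalues share one sign
        return "ellipse" if pos + neg == len(eig) else "parabola/degenerate"
    return "hyperbola"

print(boundary_type(np.eye(2), 4 * np.eye(2)))              # proportional -> ellipse
print(boundary_type(np.diag([1.0, 4.0]), np.diag([4.0, 1.0])))  # mixed signs -> hyperbola
print(boundary_type(np.eye(2), np.eye(2)))                  # equal -> linear
```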
Understanding the geometry of QDA boundaries provides intuition for when QDA is beneficial and how it differs from LDA.
Elliptical class contours:
Each class's Gaussian distribution has elliptical contours of equal probability density. In QDA, these ellipses can differ in:
- Size (overall variance)
- Shape (ratio of eigenvalues)
- Orientation (eigenvector directions)
The decision boundary between two classes occurs where the probability densities (weighted by priors) are equal—where two elliptical 'hills' have the same height.
Examples of boundary shapes:
Ellipse: One class has much larger variance than the other. The smaller-variance class is 'inside' a closed curve, the larger-variance class 'outside.'
Hyperbola: Classes have similar overall variance but different orientations. The boundary separates them with open curves.
Two lines (degenerate hyperbola): When covariances are nearly equal in some directions but different in others.
Near-linear: When covariance differences are small, QDA boundaries are close to LDA's linear boundaries.
Unlike LDA (where all class regions are convex), QDA class regions can be non-convex and even disconnected. A class can have multiple separate 'islands' in feature space—impossible with LDA's linear boundaries.
Visualizing the difference:
Consider a two-class problem in 2D: Class 1 has a compact, circular covariance, while Class 2 is strongly elongated along the x-axis, with the two means close together.
LDA would fit a straight line boundary. But the true optimal boundary curves: near the x-axis, Class 2's elongated ellipse dominates; away from the x-axis, Class 1's compact circle dominates. QDA captures this with a hyperbolic boundary.
The multi-class case:
With $K$ classes, there are $\binom{K}{2}$ pairwise boundaries, each potentially quadratic. The overall decision regions are intersections of quadratic constraints, yielding complex shapes. Unlike LDA's convex polyhedra, QDA regions can have curved edges and non-convex shapes.
| Covariance Relationship | Boundary Type | Geometric Interpretation |
|---|---|---|
| $\Sigma_1 = \lambda \Sigma_2$ (proportional) | Ellipse/Circle | One class surrounded by another |
| $\Sigma_1, \Sigma_2$ have same eigenvectors, different eigenvalues | Axis-aligned hyperbola | Classes separated along principal axes |
| $\Sigma_1, \Sigma_2$ have different eigenvectors | Rotated hyperbola | Oblique separation |
| $\Sigma_1 \approx \Sigma_2$ | Near-linear | Close to LDA boundary |
| One class has near-zero variance in some direction | Degenerate (lines) | Class creates a 'wall' |
QDA parameter estimation follows the maximum likelihood principle, estimating separate covariance matrices for each class.
Step 1: Estimate class priors
$$\hat{\pi}_k = \frac{n_k}{n}$$
Step 2: Estimate class means
$$\hat{\mu}_k = \frac{1}{n_k}\sum_{i: y_i = k} x_i$$
Step 3: Estimate class-specific covariances
Unlike LDA, we do not pool. Each class gets its own estimate:
$$\hat{\Sigma}_k = \frac{1}{n_k - 1}\sum_{i: y_i = k}(x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^T$$
Step 4: Compute discriminant functions
For a new observation $x$:
$$\hat{\delta}_k(x) = -\frac{1}{2}\log|\hat{\Sigma}_k| - \frac{1}{2}(x - \hat{\mu}_k)^T\hat{\Sigma}_k^{-1}(x - \hat{\mu}_k) + \log\hat{\pi}_k$$
Step 5: Classify
$$\hat{y} = \arg\max_k \hat{\delta}_k(x)$$
For QDA, each class requires $n_k > p$ samples for $\hat{\Sigma}_k$ to be invertible. If any class has fewer samples than features, its covariance matrix is singular and QDA fails. This is more restrictive than LDA, which only needs the pooled covariance to be non-singular (requiring $n - K > p$).
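The $n_k > p$ requirement is easy to verify numerically: a sample covariance built from $n_k$ points has rank at most $n_k - 1$, so with $n_k \le p$ it is singular (illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(0)
n_k, p = 5, 10                      # fewer samples than features
X_k = rng.normal(size=(n_k, p))
cov_k = np.cov(X_k, rowvar=False)   # p x p matrix, but rank <= n_k - 1

rank = np.linalg.matrix_rank(cov_k)
print(rank)  # at most n_k - 1 = 4: singular, so it cannot be inverted for QDA
```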
```python
import numpy as np
from scipy.linalg import inv, det

class QuadraticDiscriminantAnalysis:
    """
    QDA implementation from first principles.
    Demonstrates class-specific covariance estimation.
    """

    def __init__(self, reg_param=0.0):
        self.reg_param = reg_param       # Regularization for numerical stability
        self.classes_ = None
        self.means_ = None               # Class means
        self.priors_ = None              # Class priors
        self.covariances_ = None         # Class-specific covariances
        self.covariances_inv_ = None     # Inverses for efficiency
        self.log_dets_ = None            # Log determinants

    def fit(self, X, y):
        """Fit QDA model to training data."""
        n_samples, n_features = X.shape
        self.classes_ = np.unique(y)

        # Step 1: Estimate class priors
        class_counts = np.array([np.sum(y == k) for k in self.classes_])
        self.priors_ = class_counts / n_samples

        # Step 2: Estimate class means
        self.means_ = np.array([X[y == k].mean(axis=0) for k in self.classes_])

        # Step 3: Estimate class-specific covariances
        self.covariances_ = []
        self.covariances_inv_ = []
        self.log_dets_ = []
        for k_idx, k in enumerate(self.classes_):
            X_k = X[y == k]
            n_k = len(X_k)

            # Check for sufficient samples
            if n_k <= n_features:
                raise ValueError(
                    f"Class {k} has {n_k} samples but {n_features} features. "
                    f"QDA requires n_k > p for each class. Consider LDA or regularization."
                )

            # Estimate covariance
            X_k_centered = X_k - self.means_[k_idx]
            cov_k = (X_k_centered.T @ X_k_centered) / (n_k - 1)

            # Add regularization for numerical stability
            if self.reg_param > 0:
                cov_k = (1 - self.reg_param) * cov_k + \
                        self.reg_param * np.eye(n_features)

            self.covariances_.append(cov_k)
            self.covariances_inv_.append(inv(cov_k))
            self.log_dets_.append(np.log(det(cov_k)))

        return self

    def decision_function(self, X):
        """Compute QDA discriminant scores for each class."""
        n_samples = X.shape[0]
        n_classes = len(self.classes_)
        scores = np.zeros((n_samples, n_classes))

        for k in range(n_classes):
            # Center the data
            diff = X - self.means_[k]

            # Compute squared Mahalanobis distance
            mahal = np.sum(diff @ self.covariances_inv_[k] * diff, axis=1)

            # QDA discriminant function
            scores[:, k] = (
                -0.5 * self.log_dets_[k]
                - 0.5 * mahal
                + np.log(self.priors_[k])
            )

        return scores

    def predict(self, X):
        """Predict class labels."""
        scores = self.decision_function(X)
        return self.classes_[np.argmax(scores, axis=1)]

    def predict_proba(self, X):
        """Predict posterior probabilities."""
        scores = self.decision_function(X)
        # Softmax to convert log-posteriors to probabilities
        exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```

The choice between LDA and QDA embodies a fundamental statistical tradeoff: bias versus variance. Understanding this tradeoff is critical for model selection.
LDA's tradeoff:
- Higher bias: if the true covariances differ across classes, the shared-$\Sigma$ assumption misspecifies the model, and the linear boundary cannot match the optimal quadratic one.
- Lower variance: pooling all classes into one covariance estimate keeps it stable even with modest sample sizes.

QDA's tradeoff:
- Lower bias: class-specific covariances can match heterogeneous class shapes and curved optimal boundaries.
- Higher variance: each $\hat{\Sigma}_k$ is estimated from only $n_k$ samples, so estimates are noisy unless every class is well represented.

The crossover point:
As sample size grows, QDA's variance penalty shrinks while LDA's bias (if covariances truly differ) persists, so QDA eventually overtakes LDA.

Generally:
- Small $n$ relative to $p$: LDA's lower variance usually wins.
- Large $n$ per class: QDA's lower bias usually wins.
- Truly equal covariances: LDA wins at any sample size.
Use LDA when: (1) Sample sizes are small relative to dimensions, (2) Classes appear to have similar spreads, (3) You want interpretability via Fisher's projections. Use QDA when: (1) Ample data per class ($n_k \gg p$), (2) Classes clearly have different covariance structures, (3) Flexibility is more important than parsimony.
Empirical guidelines:
A rough rule of thumb: QDA becomes preferable when each class has at least 5–10 times as many samples as features. With fewer samples, the covariance estimates are too noisy for QDA to benefit from its flexibility.
Cross-validation for selection:
Rather than relying on rules of thumb, cross-validation provides a principled way to choose:
1. Fit both LDA and QDA on each training fold.
2. Estimate out-of-sample performance (accuracy or log-likelihood) on the held-out folds.
3. Select the better-performing model, preferring LDA when the two are close.
This accounts for both the true covariance structure and the sample size available.
Effect of class imbalance:
With imbalanced classes, QDA's disadvantage is amplified: the minority class has very few samples for covariance estimation, making $\hat{\Sigma}_{\text{minority}}$ highly unstable. LDA's pooling helps stabilize estimation in this setting.
QDA has higher computational costs than LDA, both in training and prediction:
Training complexity: covariance estimation costs $O(np^2)$ for both methods, but inversion costs $O(p^3)$ for LDA's single pooled matrix versus $O(Kp^3)$ for QDA's $K$ matrices.
For large $K$, QDA training is $K$ times slower in the inversion step.
Prediction complexity: each LDA discriminant is linear, costing $O(p)$ per class and $O(Kp)$ per sample; each QDA discriminant requires a quadratic form, costing $O(p^2)$ per class and $O(Kp^2)$ per sample.
QDA prediction is $O(p)$ times slower per sample—significant for high-dimensional data.
Storage: LDA stores one $p \times p$ covariance (or its inverse), $O(p^2)$; QDA stores $K$ of them, $O(Kp^2)$.
In practice, we store Cholesky factors rather than explicit inverses. The Cholesky decomposition $\Sigma_k = L_k L_k^T$ allows efficient computation of both the Mahalanobis distance (via forward/back substitution) and the log-determinant ($2\sum_i \log L_{k,ii}$).
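A sketch of this trick with illustrative matrices, using SciPy's `cholesky` and `solve_triangular`:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])  # illustrative class covariance
mu = np.array([1.0, -1.0])                  # illustrative class mean
X = rng.normal(size=(4, 2))                 # points to score

# Lower-triangular factor: Sigma = L @ L.T
L = cholesky(Sigma, lower=True)

# log|Sigma| = 2 * sum(log(diag(L)))
log_det = 2.0 * np.sum(np.log(np.diag(L)))

# Squared Mahalanobis distances via forward substitution: solve L z = (x - mu)
Z = solve_triangular(L, (X - mu).T, lower=True)
mahal = np.sum(Z**2, axis=0)

# Check against the explicit inverse/determinant formulas
diff = X - mu
assert np.isclose(log_det, np.log(np.linalg.det(Sigma)))
assert np.allclose(mahal, np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1))
```

Beyond speed, this avoids forming explicit inverses, which is numerically safer when a $\Sigma_k$ is ill-conditioned.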
| Operation | LDA | QDA |
|---|---|---|
| Training (covariance) | $O(np^2)$ | $O(np^2)$ |
| Training (inversion) | $O(p^3)$ | $O(Kp^3)$ |
| Prediction (per sample) | $O(Kp)$ | $O(Kp^2)$ |
| Memory (covariance storage) | $O(p^2)$ | $O(Kp^2)$ |
Before committing to LDA or QDA, several diagnostics can guide the choice:
1. Compare class covariance matrices:
Compute $\hat{\Sigma}_k$ for each class and compare:
- Determinants (overall spread, or "volume")
- Traces (total variance)
- Condition numbers and eigenvalue ranges (eccentricity)
- Leading eigenvectors (orientation)
Large disparities in any of these suggest QDA may help.
2. Box's M-test:
Formally tests $H_0: \Sigma_1 = \Sigma_2 = \cdots = \Sigma_K$. Rejection suggests QDA. However:
- The test is highly sensitive to departures from normality, so rejection may reflect non-Gaussian data rather than unequal covariances.
- With large samples it rejects for practically negligible differences; with small samples it has low power.
Use as one input, not a definitive answer.
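For reference, here is a compact sketch of the Box's M statistic with its chi-square approximation, using the textbook formulas (the function name and example data are ours; verify against a statistics reference before relying on it):

```python
import numpy as np
from scipy.stats import chi2

def box_m_test(X, y):
    """Box's M test of equal covariance matrices (chi-square approximation)."""
    classes = np.unique(y)
    K, p = len(classes), X.shape[1]
    n_k = np.array([np.sum(y == k) for k in classes])
    N = n_k.sum()
    S_k = [np.cov(X[y == k], rowvar=False) for k in classes]
    S_pooled = sum((nk - 1) * S for nk, S in zip(n_k, S_k)) / (N - K)

    # M = (N - K) log|S_pooled| - sum_k (n_k - 1) log|S_k|
    M = (N - K) * np.linalg.slogdet(S_pooled)[1] \
        - sum((nk - 1) * np.linalg.slogdet(S)[1] for nk, S in zip(n_k, S_k))

    # Small-sample correction factor (Box, 1949)
    c = (np.sum(1.0 / (n_k - 1)) - 1.0 / (N - K)) \
        * (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (K - 1))
    stat = (1 - c) * M
    df = p * (p + 1) * (K - 1) / 2
    return stat, chi2.sf(stat, df)

# Synthetic data: class 1 has twice the standard deviation of class 0
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(60, 3)), 2.0 * rng.normal(size=(60, 3))])
y = np.repeat([0, 1], 60)
stat, pval = box_m_test(X, y)
print(pval < 0.05)  # the variance ratio of 4 per dimension should be detected
```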
```python
import numpy as np

def compare_covariances(X, y, classes=None):
    """
    Compare class covariance structures to guide LDA vs QDA choice.
    """
    if classes is None:
        classes = np.unique(y)

    results = {}
    for k in classes:
        X_k = X[y == k]
        n_k = len(X_k)
        cov_k = np.cov(X_k, rowvar=False)
        eigenvalues = np.linalg.eigvalsh(cov_k)
        results[k] = {
            'n_samples': n_k,
            'determinant': np.linalg.det(cov_k),
            'trace': np.trace(cov_k),
            'condition_number': eigenvalues.max() / eigenvalues.min(),
            'eigenvalue_range': (eigenvalues.min(), eigenvalues.max()),
        }

    # Summary comparison
    dets = [results[k]['determinant'] for k in classes]
    conds = [results[k]['condition_number'] for k in classes]
    det_ratio = max(dets) / min(dets) if min(dets) > 0 else float('inf')
    cond_ratio = max(conds) / min(conds)
    recommendation = ("Consider QDA" if det_ratio > 10 or cond_ratio > 5
                      else "LDA likely sufficient")

    return {
        'per_class': results,
        'det_ratio': det_ratio,
        'cond_ratio': cond_ratio,
        'recommendation': recommendation
    }
```

3. Cross-validation comparison:
The most reliable method: fit both LDA and QDA, compare cross-validated performance.
```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

lda = LinearDiscriminantAnalysis()
qda = QuadraticDiscriminantAnalysis()

lda_scores = cross_val_score(lda, X, y, cv=5, scoring='accuracy')
qda_scores = cross_val_score(qda, X, y, cv=5, scoring='accuracy')

print(f"LDA: {lda_scores.mean():.3f} ± {lda_scores.std():.3f}")
print(f"QDA: {qda_scores.mean():.3f} ± {qda_scores.std():.3f}")
```
If QDA significantly outperforms LDA, covariance heterogeneity is likely impacting results. If they're similar, prefer LDA for simplicity.
4. Examine boundary visualizations:
For low-dimensional problems (or after PCA reduction), visualize the fitted boundaries. If the LDA linear boundary seems to misalign with class separations, QDA may help.
What's next:
We've seen the extremes: LDA pools all covariances, QDA separates them entirely. But what if we want something in between? The next page explores decision boundaries—their geometric properties, how to interpret and visualize them, and the implications for classification at different points in feature space.
You now understand QDA's generative model with class-specific covariances, why this leads to quadratic decision boundaries, the geometric interpretation of these boundaries, and the bias-variance tradeoff between LDA and QDA. Next, we'll examine decision boundaries in depth.