When variables are not independent, knowing one tells us something about the other. But how much does it tell us? And in what direction do they move together?
Covariance and correlation answer these questions with single numbers that summarize the linear relationship between two random variables. While they capture only linear associations (missing nonlinear dependencies), they are computationally simple, theoretically tractable, and foundational to many ML techniques.
From PCA to linear regression, from portfolio theory to feature selection—covariance and correlation appear everywhere. Understanding them deeply is essential for any machine learning practitioner.
By the end of this page, you will master covariance and correlation as measures of linear dependence, understand their properties and limitations, work with covariance matrices for multiple variables, and see their central role in machine learning algorithms.
Definition (Covariance):
For random variables $X$ and $Y$, the covariance is:
$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$
where $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$.
Interpretation:
$\text{Cov}(X, Y) > 0$: when $X$ is above its mean, $Y$ tends to be above its mean too (they move together)
$\text{Cov}(X, Y) < 0$: when $X$ is above its mean, $Y$ tends to be below its mean (they move in opposite directions)
$\text{Cov}(X, Y) = 0$: no linear tendency either way (nonlinear dependence is still possible)
Properties:
Symmetry: $\text{Cov}(X, Y) = \text{Cov}(Y, X)$
Variance is self-covariance: $\text{Cov}(X, X) = \text{Var}(X)$
Bilinearity: $\text{Cov}(aX + b, cY + d) = ac \, \text{Cov}(X, Y)$ and $\text{Cov}(X_1 + X_2, Y) = \text{Cov}(X_1, Y) + \text{Cov}(X_2, Y)$
Variance of sum: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
Independence implies zero covariance: If $X \perp Y$, then $\text{Cov}(X, Y) = 0$. (Converse is false!)
If X and Y are independent, their covariance is zero. But zero covariance does NOT imply independence! Classic example: X ~ N(0,1) and Y = X². Then Cov(X, Y) = E[X³] = 0, but Y is completely determined by X.
```python
import numpy as np

def covariance_examples():
    """Demonstrate covariance computation and properties."""
    np.random.seed(42)
    n = 10000

    # Example 1: Positively correlated
    x1 = np.random.normal(0, 1, n)
    y1 = 0.7 * x1 + 0.3 * np.random.normal(0, 1, n)
    cov1 = np.cov(x1, y1)[0, 1]
    print(f"Positive relationship: Cov(X, Y) = {cov1:.4f}")

    # Example 2: Negatively correlated
    y2 = -0.5 * x1 + 0.5 * np.random.normal(0, 1, n)
    cov2 = np.cov(x1, y2)[0, 1]
    print(f"Negative relationship: Cov(X, Y) = {cov2:.4f}")

    # Example 3: Independent (zero covariance)
    y3 = np.random.normal(0, 1, n)
    cov3 = np.cov(x1, y3)[0, 1]
    print(f"Independent: Cov(X, Y) = {cov3:.4f}")

    # Example 4: Zero covariance but dependent (Y = X²)
    y4 = x1 ** 2
    cov4 = np.cov(x1, y4)[0, 1]
    print(f"Y = X² (dependent but uncorrelated): Cov(X, Y) = {cov4:.4f}")

    # Verify variance of sum formula
    print("\nVariance of Sum Formula:")
    var_x = np.var(x1, ddof=1)
    var_y = np.var(y1, ddof=1)
    cov_xy = np.cov(x1, y1)[0, 1]
    var_sum = np.var(x1 + y1, ddof=1)
    computed = var_x + var_y + 2 * cov_xy
    print(f"Var(X+Y) = {var_sum:.4f}")
    print(f"Var(X) + Var(Y) + 2Cov(X,Y) = {computed:.4f}")

covariance_examples()
```

The Problem with Covariance:
Covariance depends on the scales of $X$ and $Y$. If we measure height in centimeters vs. meters, we get different covariances. This makes interpretation difficult.
Definition (Pearson Correlation Coefficient):
$$\rho_{X,Y} = \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}$$
Properties:
Bounded: $-1 \leq \rho \leq 1$ always
Scale invariant: $\text{Corr}(aX + b, cY + d) = \text{sign}(ac) \cdot \text{Corr}(X, Y)$ for $a, c \neq 0$ (demonstrated in the sketch after this list)
Extreme values: $\rho = +1$ if and only if $Y = aX + b$ with $a > 0$; $\rho = -1$ if and only if $Y = aX + b$ with $a < 0$ (a perfect linear relationship)
For bivariate Gaussian: $\rho$ completely characterizes the dependence structure
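The sketch below makes the scale properties concrete on synthetic height/weight data (the numbers are made up for illustration): converting height from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

np.random.seed(0)
n = 1000

# Synthetic data: height in meters, weight in kilograms (made-up relationship)
height_m = np.random.normal(1.7, 0.1, n)
weight_kg = 60 + 50 * (height_m - 1.7) + np.random.normal(0, 5, n)

# The same heights expressed in centimeters
height_cm = 100 * height_m

# Covariance depends on the units...
print("Cov (meters):     ", round(np.cov(height_m, weight_kg)[0, 1], 3))
print("Cov (centimeters):", round(np.cov(height_cm, weight_kg)[0, 1], 3))  # ~100x larger

# ...but correlation does not
print("Corr (meters):     ", round(np.corrcoef(height_m, weight_kg)[0, 1], 3))
print("Corr (centimeters):", round(np.corrcoef(height_cm, weight_kg)[0, 1], 3))  # identical
```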
| Correlation Range | Interpretation | Example |
|---|---|---|
| 0.9 to 1.0 | Very strong positive | Height vs. arm span |
| 0.7 to 0.9 | Strong positive | Study hours vs. grades |
| 0.4 to 0.7 | Moderate positive | Income vs. spending |
| 0.1 to 0.4 | Weak positive | Age vs. income (adult) |
| -0.1 to 0.1 | No linear relationship | Shoe size vs. IQ |
| -0.4 to -0.1 | Weak negative | Altitude vs. temperature |
| -0.7 to -0.4 | Moderate negative | Speed vs. travel time |
| -1.0 to -0.7 | Strong negative | Stock price vs. short seller profit |
A correlation of 0 means no LINEAR relationship—variables can still be perfectly dependent nonlinearly. When X ~ Uniform(-1,1) and Y = X², they're perfectly dependent but have ρ = 0. Always visualize your data!
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_correlations():
    """Show what different correlation values look like."""
    np.random.seed(42)
    n = 500

    correlations = [0.0, 0.3, 0.6, 0.9, -0.6, 0.0]
    titles = ['ρ = 0 (Independent)', 'ρ = 0.3 (Weak)', 'ρ = 0.6 (Moderate)',
              'ρ = 0.9 (Strong)', 'ρ = -0.6 (Negative)', 'ρ = 0 (Y = X²)']

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))

    for idx, (rho, title) in enumerate(zip(correlations, titles)):
        ax = axes.flatten()[idx]

        if idx < 5:
            # Generate bivariate normal with correlation rho
            x = np.random.normal(0, 1, n)
            y = rho * x + np.sqrt(1 - rho**2) * np.random.normal(0, 1, n)
        else:
            # Special case: Y = X² (zero correlation but dependent)
            x = np.random.uniform(-2, 2, n)
            y = x**2 + 0.2 * np.random.normal(0, 1, n)

        ax.scatter(x, y, alpha=0.5, s=20)
        ax.set_xlabel('X')
        ax.set_ylabel('Y')

        # Compute actual correlation
        actual_rho = np.corrcoef(x, y)[0, 1]
        ax.set_title(f'{title}\nActual ρ = {actual_rho:.2f}')
        ax.set_aspect('equal', 'box')

    plt.tight_layout()
    plt.savefig('correlation_examples.png', dpi=150)
    plt.show()

visualize_correlations()
```

For a random vector $\mathbf{X} = (X_1, X_2, \ldots, X_d)^T$, pairwise covariances are organized into the covariance matrix.
Definition:
$$\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$$
where $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$. Entry $(i, j)$ is:
$$\Sigma_{ij} = \text{Cov}(X_i, X_j)$$
Properties:
Symmetric: $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^T$
Positive semi-definite: $\mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v} \geq 0$ for all $\mathbf{v}$
Diagonal entries: $\Sigma_{ii} = \text{Var}(X_i)$
Linear transformation: If $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$, then: $$\text{Cov}(\mathbf{Y}) = \mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^T$$
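A quick numerical check of the linear-transformation property, with an arbitrary matrix $\mathbf{A}$ and vector $\mathbf{b}$ chosen for illustration:

```python
import numpy as np

np.random.seed(0)
n = 100_000

# Arbitrary 3-D covariance and transformation, chosen for illustration
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  2.0, 1.0]])
b = np.array([3.0, -2.0])

X = np.random.multivariate_normal(np.zeros(3), Sigma, n)
Y = X @ A.T + b  # rows are samples, so each y_i = A x_i + b

print("Sample Cov(Y):")
print(np.round(np.cov(Y.T), 3))
print("A Sigma A^T:")
print(np.round(A @ Sigma @ A.T, 3))  # should closely match the sample estimate
```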
Correlation Matrix:
The normalized version where each entry is a correlation:
$$R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}}\sqrt{\Sigma_{jj}}} = \text{Corr}(X_i, X_j)$$
Diagonal entries are all 1, and off-diagonal entries are between -1 and 1.
```python
import numpy as np

def covariance_matrix_demo():
    """Demonstrate covariance matrix properties."""
    np.random.seed(42)

    # Generate 4-dimensional data with known covariance structure
    n = 10000

    # True covariance matrix (positive definite)
    true_cov = np.array([
        [1.0, 0.8, 0.3, 0.0],
        [0.8, 1.0, 0.4, 0.1],
        [0.3, 0.4, 1.0, 0.6],
        [0.0, 0.1, 0.6, 1.0]
    ])

    # Generate data
    mean = np.zeros(4)
    data = np.random.multivariate_normal(mean, true_cov, n)

    # Compute sample covariance matrix
    sample_cov = np.cov(data.T)

    print("True Covariance Matrix:")
    print(np.round(true_cov, 3))
    print("\nSample Covariance Matrix:")
    print(np.round(sample_cov, 3))

    # Verify positive semi-definiteness via eigenvalues
    eigenvalues = np.linalg.eigvalsh(sample_cov)
    print(f"\nEigenvalues: {np.round(eigenvalues, 4)}")
    print(f"All eigenvalues >= 0? {all(eigenvalues >= -1e-10)}")

    # Compute correlation matrix
    std_devs = np.sqrt(np.diag(sample_cov))
    corr_matrix = sample_cov / np.outer(std_devs, std_devs)
    print("\nCorrelation Matrix:")
    print(np.round(corr_matrix, 3))

    # Using numpy's corrcoef
    print("\nVerify with np.corrcoef:")
    print(np.round(np.corrcoef(data.T), 3))

covariance_matrix_demo()
```

The covariance matrix has a beautiful geometric interpretation through its eigendecomposition.
Eigendecomposition of Covariance:
$$\boldsymbol{\Sigma} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$$
where:
$\mathbf{V}$ is an orthogonal matrix whose columns $\mathbf{v}_1, \ldots, \mathbf{v}_d$ are the eigenvectors of $\boldsymbol{\Sigma}$
$\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_d)$ holds the corresponding eigenvalues, all nonnegative because $\boldsymbol{\Sigma}$ is positive semi-definite
Geometric Meaning:
The eigenvectors give the principal axes of the data cloud (the axes of the ellipsoidal density contours), and each eigenvalue $\lambda_i$ is the variance along its axis. The total variance is $\lambda_1 + \cdots + \lambda_d = \text{tr}(\boldsymbol{\Sigma})$.
Connection to PCA:
Principal Component Analysis finds exactly these eigenvectors. The first principal component is the direction of maximum variance, the second is orthogonal with next-highest variance, etc.
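Here is a minimal sketch of that connection on synthetic 2-D data: the eigenvectors of the sample covariance matrix act as principal directions, the eigenvalues as the variances captured, and projecting onto the eigenvectors decorrelates the data.

```python
import numpy as np

np.random.seed(0)
n = 5000

# Correlated 2-D data (synthetic, for illustration)
Sigma = np.array([[3.0, 1.5],
                  [1.5, 2.0]])
X = np.random.multivariate_normal([0.0, 0.0], Sigma, n)

# Eigendecomposition of the sample covariance matrix
sample_cov = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(sample_cov)  # eigenvalues in ascending order

# Reverse to descending order: first column = first principal component
print("Principal directions (columns):")
print(np.round(eigvecs[:, ::-1], 3))
print("Variance captured by each component:", np.round(eigvals[::-1], 3))

# Projecting centered data onto the eigenvectors decorrelates it
Z = (X - X.mean(axis=0)) @ eigvecs[:, ::-1]
print("Covariance after projection (approximately diagonal):")
print(np.round(np.cov(Z.T), 3))
```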
Mahalanobis Distance:
Distance that accounts for covariance structure:
$$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Points with equal Mahalanobis distance form ellipsoids aligned with the covariance structure.
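A short sketch on synthetic correlated data: two points at the same Euclidean distance from the mean can have very different Mahalanobis distances, depending on whether they lie along or against the correlation direction.

```python
import numpy as np

# Strongly correlated 2-D Gaussian (synthetic)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """Mahalanobis distance from x to mu under the given covariance."""
    diff = x - mu
    return np.sqrt(diff @ Sigma_inv @ diff)

# Two points at the same Euclidean distance from the mean
p_along = np.array([1.0, 1.0])    # along the correlation direction
p_across = np.array([1.0, -1.0])  # against the correlation direction

print(np.linalg.norm(p_along), np.linalg.norm(p_across))  # equal Euclidean distances
print(round(mahalanobis(p_along, mu, Sigma_inv), 3))   # ~1.03: a typical point
print(round(mahalanobis(p_across, mu, Sigma_inv), 3))  # ~4.47: an unusual point
```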
For a 2D Gaussian with equal variances: ρ = 0 gives circular contours, and ρ → ±1 gives increasingly elongated ellipses along the diagonal. With unequal variances the contours are ellipses even when ρ = 0, and the tilt angle depends on both the correlation and the relative variances. This is why scatter plots of correlated data look like tilted ellipses.
Covariance and correlation are fundamental to many ML techniques:
1. Principal Component Analysis (PCA)
PCA finds the eigendecomposition of the covariance (or correlation) matrix. Principal components are eigenvectors; eigenvalues tell us how much variance each captures.
2. Linear Regression
The OLS coefficient in simple linear regression: $$\beta = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \rho \frac{\sigma_Y}{\sigma_X}$$
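A quick numerical check of this identity on synthetic data, comparing the covariance-based slope with an ordinary least-squares fit (via np.polyfit):

```python
import numpy as np

np.random.seed(0)
n = 10_000

# Synthetic regression data (true slope = 1.5, for illustration)
x = np.random.normal(0, 2, n)
y = 1.5 * x + np.random.normal(0, 1, n)

# Slope from the covariance formula
beta_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Equivalent form using correlation and standard deviations
rho = np.corrcoef(x, y)[0, 1]
beta_rho = rho * np.std(y, ddof=1) / np.std(x, ddof=1)

# Slope from an ordinary least-squares fit
beta_ols = np.polyfit(x, y, 1)[0]

print(beta_cov, beta_rho, beta_ols)  # all three should be close to 1.5
```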
3. Feature Selection
Highly correlated features are redundant. Correlation matrices help identify multicollinearity.
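A small sketch of this idea, using a hypothetical feature matrix in which one feature is nearly a copy of another: scan the correlation matrix for pairs above a chosen threshold.

```python
import numpy as np

np.random.seed(0)
n = 1000

# Hypothetical feature matrix: feature 2 is nearly a copy of feature 0
f0 = np.random.normal(0, 1, n)
f1 = np.random.normal(0, 1, n)
f2 = f0 + 0.05 * np.random.normal(0, 1, n)  # redundant with f0
X = np.column_stack([f0, f1, f2])

corr = np.corrcoef(X.T)
threshold = 0.95

# Report feature pairs whose absolute correlation exceeds the threshold
d = X.shape[1]
for i in range(d):
    for j in range(i + 1, d):
        if abs(corr[i, j]) > threshold:
            print(f"Features {i} and {j} are highly correlated: r = {corr[i, j]:.3f}")
```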
4. Gaussian Mixture Models
Each component has its own covariance matrix, determining cluster shape.
5. Portfolio Theory
Covariance between asset returns determines portfolio risk. Diversification works because assets aren't perfectly correlated.
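A minimal sketch with made-up volatilities: portfolio variance is $\mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}$, so the same weights produce much less risk when assets are uncorrelated than when they are perfectly correlated.

```python
import numpy as np

# Made-up annual volatilities for three assets
vols = np.array([0.20, 0.15, 0.10])
w = np.array([1/3, 1/3, 1/3])  # equal-weight portfolio

def portfolio_vol(corr):
    """Portfolio volatility given an asset correlation matrix."""
    Sigma = np.outer(vols, vols) * corr  # covariance from vols and correlations
    return np.sqrt(w @ Sigma @ w)

perfectly_correlated = np.ones((3, 3))
uncorrelated = np.eye(3)

print(round(portfolio_vol(perfectly_correlated), 4))  # 0.15: weighted average of vols
print(round(portfolio_vol(uncorrelated), 4))          # smaller: diversification benefit
```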
In practice, we estimate covariance from data.
Sample Covariance:
For samples $(x_1, y_1), \ldots, (x_n, y_n)$:
$$\widehat{\text{Cov}}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
The $n-1$ denominator (Bessel's correction) makes this an unbiased estimator.
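A small sketch verifying that the manual formula matches np.cov, which applies Bessel's correction by default:

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(0, 1, 50)
y = 2 * x + np.random.normal(0, 1, 50)
n = len(x)

# Manual sample covariance with the n-1 denominator
manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

print(manual)
print(np.cov(x, y)[0, 1])             # matches: np.cov divides by n-1 by default
print(np.cov(x, y, bias=True)[0, 1])  # biased version divides by n instead
```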
High-Dimensional Challenges:
For $d$ variables, the covariance matrix has $d(d+1)/2$ unique entries. When $n < d$, the sample covariance matrix is rank-deficient (not invertible).
Solutions: shrinkage estimators (e.g., Ledoit-Wolf), which blend the sample covariance with a simple structured target; regularization, such as adding a small multiple of the identity to the diagonal; and structural assumptions like diagonal, sparse, or low-rank (factor) covariance models.
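As one concrete option (assuming scikit-learn is available), here is a sketch of Ledoit-Wolf shrinkage on a case with fewer samples than dimensions:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

np.random.seed(0)
n, d = 30, 50  # fewer samples than dimensions
X = np.random.normal(0, 1, size=(n, d))

# Sample covariance is rank-deficient (rank <= n-1 < d), hence singular
sample_cov = np.cov(X.T)
print("Sample covariance rank:", np.linalg.matrix_rank(sample_cov))

# Ledoit-Wolf shrinks toward a scaled identity, restoring invertibility
lw = LedoitWolf().fit(X)
print("Shrinkage intensity:", round(lw.shrinkage_, 3))
print("Smallest eigenvalue of shrunk estimate:",
      round(np.linalg.eigvalsh(lw.covariance_).min(), 4))
```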
Robust Estimation:
Sample covariance is sensitive to outliers; a single extreme point can dominate the estimate. Robust alternatives include the Minimum Covariance Determinant (MCD) estimator and rank-based correlations such as Spearman's ρ and Kendall's τ, which depend only on the ordering of the data.
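For example (sketch using SciPy), a single extreme outlier can badly distort the Pearson correlation while Spearman's rank correlation moves only slightly:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

np.random.seed(0)
n = 200

# Clean, moderately correlated data
x = np.random.normal(0, 1, n)
y = 0.6 * x + 0.8 * np.random.normal(0, 1, n)

# Add one extreme outlier
x_out = np.append(x, 20.0)
y_out = np.append(y, -20.0)

print("Pearson (clean):   ", round(pearsonr(x, y)[0], 3))
print("Pearson (outlier): ", round(pearsonr(x_out, y_out)[0], 3))   # badly distorted
print("Spearman (clean):  ", round(spearmanr(x, y)[0], 3))
print("Spearman (outlier):", round(spearmanr(x_out, y_out)[0], 3))  # changes only slightly
```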
What's next:
Correlation tells us about the strength of linear association, but correlation zero doesn't mean independence. The next page covers independence rigorously—the strongest statement we can make about variables not affecting each other.
You now understand covariance and correlation as measures of linear dependence, their properties, the covariance matrix, and their central role in machine learning algorithms.