When variables are not independent, knowing one tells us something about the other. But how much does it tell us? And in what direction do they move together?
Covariance and correlation answer these questions with single numbers that summarize the linear relationship between two random variables. While they capture only linear associations (missing nonlinear dependencies), they are computationally simple, theoretically tractable, and foundational to many ML techniques.
From PCA to linear regression, from portfolio theory to feature selection—covariance and correlation appear everywhere. Understanding them deeply is essential for any machine learning practitioner.
By the end of this page, you will master covariance and correlation as measures of linear dependence, understand their properties and limitations, work with covariance matrices for multiple variables, and see their central role in machine learning algorithms.
Definition (Covariance):
For random variables $X$ and $Y$, the covariance is:
$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]$$
where $\mu_X = \mathbb{E}[X]$ and $\mu_Y = \mathbb{E}[Y]$.
Interpretation:
$\text{Cov}(X, Y) > 0$: when $X$ is above its mean, $Y$ tends to be above its mean too (they move together)
$\text{Cov}(X, Y) < 0$: when $X$ is above its mean, $Y$ tends to be below its mean (they move in opposite directions)
$\text{Cov}(X, Y) = 0$: no linear tendency either way (nonlinear dependence is still possible)
Properties:
Symmetry: $\text{Cov}(X, Y) = \text{Cov}(Y, X)$
Variance is self-covariance: $\text{Cov}(X, X) = \text{Var}(X)$
Bilinearity: $\text{Cov}(aX + b, cY + d) = ac \, \text{Cov}(X, Y)$ and $\text{Cov}(X_1 + X_2, Y) = \text{Cov}(X_1, Y) + \text{Cov}(X_2, Y)$
Variance of sum: $$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$$
Independence implies zero covariance: If $X \perp Y$, then $\text{Cov}(X, Y) = 0$. (Converse is false!)
If X and Y are independent, their covariance is zero. But zero covariance does NOT imply independence! Classic example: X ~ N(0,1) and Y = X². Then Cov(X, Y) = E[X³] = 0, but Y is completely determined by X.
```python
import numpy as np

def covariance_examples():
    """Demonstrate covariance computation and properties."""
    np.random.seed(42)
    n = 10000

    # Example 1: Positively correlated
    x1 = np.random.normal(0, 1, n)
    y1 = 0.7 * x1 + 0.3 * np.random.normal(0, 1, n)
    cov1 = np.cov(x1, y1)[0, 1]
    print(f"Positive relationship: Cov(X, Y) = {cov1:.4f}")

    # Example 2: Negatively correlated
    y2 = -0.5 * x1 + 0.5 * np.random.normal(0, 1, n)
    cov2 = np.cov(x1, y2)[0, 1]
    print(f"Negative relationship: Cov(X, Y) = {cov2:.4f}")

    # Example 3: Independent (zero covariance)
    y3 = np.random.normal(0, 1, n)
    cov3 = np.cov(x1, y3)[0, 1]
    print(f"Independent: Cov(X, Y) = {cov3:.4f}")

    # Example 4: Zero covariance but dependent (Y = X²)
    y4 = x1 ** 2
    cov4 = np.cov(x1, y4)[0, 1]
    print(f"Y = X² (dependent but uncorrelated): Cov(X, Y) = {cov4:.4f}")

    # Verify variance of sum formula
    print("\nVariance of Sum Formula:")
    var_x = np.var(x1, ddof=1)
    var_y = np.var(y1, ddof=1)
    cov_xy = np.cov(x1, y1)[0, 1]
    var_sum = np.var(x1 + y1, ddof=1)
    computed = var_x + var_y + 2 * cov_xy
    print(f"Var(X+Y) = {var_sum:.4f}")
    print(f"Var(X) + Var(Y) + 2Cov(X,Y) = {computed:.4f}")

covariance_examples()
```

The Problem with Covariance:
Covariance depends on the scales of $X$ and $Y$. If we measure height in centimeters vs. meters, we get different covariances. This makes interpretation difficult.
Definition (Pearson Correlation Coefficient):
$$\rho_{X,Y} = \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\mathbb{E}[(X - \mu_X)(Y - \mu_Y)]}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}$$
Properties:
Bounded: $-1 \leq \rho \leq 1$ always
Scale invariant: $\text{Corr}(aX + b, cY + d) = \text{sign}(ac) \cdot \text{Corr}(X, Y)$ for $a, c \neq 0$ (demonstrated in the sketch after this list)
Extreme values: $\rho = +1$ if and only if $Y = aX + b$ with $a > 0$; $\rho = -1$ if and only if $Y = aX + b$ with $a < 0$ (a perfect linear relationship)
For bivariate Gaussian: $\rho$ completely characterizes the dependence structure
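The sketch below makes the scale properties concrete on synthetic height/weight data (the numbers are made up for illustration): converting height from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import numpy as np

np.random.seed(0)
n = 1000

# Synthetic data: height in meters, weight in kilograms (made-up relationship)
height_m = np.random.normal(1.7, 0.1, n)
weight_kg = 60 + 50 * (height_m - 1.7) + np.random.normal(0, 5, n)

# The same heights expressed in centimeters
height_cm = 100 * height_m

# Covariance depends on the units...
print("Cov (meters):     ", round(np.cov(height_m, weight_kg)[0, 1], 3))
print("Cov (centimeters):", round(np.cov(height_cm, weight_kg)[0, 1], 3))  # ~100x larger

# ...but correlation does not
print("Corr (meters):     ", round(np.corrcoef(height_m, weight_kg)[0, 1], 3))
print("Corr (centimeters):", round(np.corrcoef(height_cm, weight_kg)[0, 1], 3))  # identical
```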
| Correlation Range | Interpretation | Example |
|---|---|---|
| 0.9 to 1.0 | Very strong positive | Height vs. arm span |
| 0.7 to 0.9 | Strong positive | Study hours vs. grades |
| 0.4 to 0.7 | Moderate positive | Income vs. spending |
| 0.1 to 0.4 | Weak positive | Age vs. income (adult) |
| -0.1 to 0.1 | No linear relationship | Shoe size vs. IQ |
| -0.4 to -0.1 | Weak negative | Altitude vs. temperature |
| -0.7 to -0.4 | Moderate negative | Speed vs. travel time |
| -1.0 to -0.7 | Strong negative | Stock price vs. short seller profit |
A correlation of 0 means no LINEAR relationship—variables can still be perfectly dependent nonlinearly. When X ~ Uniform(-1,1) and Y = X², they're perfectly dependent but have ρ = 0. Always visualize your data!
```python
import numpy as np
import matplotlib.pyplot as plt

def visualize_correlations():
    """Show what different correlation values look like."""
    np.random.seed(42)
    n = 500

    correlations = [0.0, 0.3, 0.6, 0.9, -0.6, 0.0]
    titles = ['ρ = 0 (Independent)', 'ρ = 0.3 (Weak)', 'ρ = 0.6 (Moderate)',
              'ρ = 0.9 (Strong)', 'ρ = -0.6 (Negative)', 'ρ = 0 (Y = X²)']

    fig, axes = plt.subplots(2, 3, figsize=(15, 10))

    for idx, (rho, title) in enumerate(zip(correlations, titles)):
        ax = axes.flatten()[idx]

        if idx < 5:
            # Generate bivariate normal with correlation rho
            x = np.random.normal(0, 1, n)
            y = rho * x + np.sqrt(1 - rho**2) * np.random.normal(0, 1, n)
        else:
            # Special case: Y = X² (zero correlation but dependent)
            x = np.random.uniform(-2, 2, n)
            y = x**2 + 0.2 * np.random.normal(0, 1, n)

        ax.scatter(x, y, alpha=0.5, s=20)
        ax.set_xlabel('X')
        ax.set_ylabel('Y')

        # Compute actual correlation
        actual_rho = np.corrcoef(x, y)[0, 1]
        ax.set_title(f'{title}\nActual ρ = {actual_rho:.2f}')
        ax.set_aspect('equal', 'box')

    plt.tight_layout()
    plt.savefig('correlation_examples.png', dpi=150)
    plt.show()

visualize_correlations()
```

For a random vector $\mathbf{X} = (X_1, X_2, \ldots, X_d)^T$, pairwise covariances are organized into the covariance matrix.
Definition:
$$\boldsymbol{\Sigma} = \text{Cov}(\mathbf{X}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T]$$
where $\boldsymbol{\mu} = \mathbb{E}[\mathbf{X}]$. Entry $(i, j)$ is:
$$\Sigma_{ij} = \text{Cov}(X_i, X_j)$$
Properties:
Symmetric: $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^T$
Positive semi-definite: $\mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v} \geq 0$ for all $\mathbf{v}$
Diagonal entries: $\Sigma_{ii} = \text{Var}(X_i)$
Linear transformation: If $\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}$, then: $$\text{Cov}(\mathbf{Y}) = \mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^T$$
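A quick numerical check of the linear-transformation property, with an arbitrary matrix $\mathbf{A}$ and vector $\mathbf{b}$ chosen for illustration:

```python
import numpy as np

np.random.seed(0)
n = 100_000

# Arbitrary 3-D covariance and transformation, chosen for illustration
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  2.0, 1.0]])
b = np.array([3.0, -2.0])

X = np.random.multivariate_normal(np.zeros(3), Sigma, n)
Y = X @ A.T + b  # rows are samples, so each y_i = A x_i + b

print("Sample Cov(Y):")
print(np.round(np.cov(Y.T), 3))
print("A Sigma A^T:")
print(np.round(A @ Sigma @ A.T, 3))  # should closely match the sample estimate
```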
Correlation Matrix:
The normalized version where each entry is a correlation:
$$R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}}\sqrt{\Sigma_{jj}}} = \text{Corr}(X_i, X_j)$$
Diagonal entries are all 1, and off-diagonal entries are between -1 and 1.
```python
import numpy as np

def covariance_matrix_demo():
    """Demonstrate covariance matrix properties."""
    np.random.seed(42)

    # Generate 4-dimensional data with known covariance structure
    n = 10000

    # True covariance matrix (positive definite)
    true_cov = np.array([
        [1.0, 0.8, 0.3, 0.0],
        [0.8, 1.0, 0.4, 0.1],
        [0.3, 0.4, 1.0, 0.6],
        [0.0, 0.1, 0.6, 1.0]
    ])

    # Generate data
    mean = np.zeros(4)
    data = np.random.multivariate_normal(mean, true_cov, n)

    # Compute sample covariance matrix
    sample_cov = np.cov(data.T)

    print("True Covariance Matrix:")
    print(np.round(true_cov, 3))
    print("\nSample Covariance Matrix:")
    print(np.round(sample_cov, 3))

    # Verify positive semi-definiteness via eigenvalues
    eigenvalues = np.linalg.eigvalsh(sample_cov)
    print(f"\nEigenvalues: {np.round(eigenvalues, 4)}")
    print(f"All eigenvalues >= 0? {all(eigenvalues >= -1e-10)}")

    # Compute correlation matrix
    std_devs = np.sqrt(np.diag(sample_cov))
    corr_matrix = sample_cov / np.outer(std_devs, std_devs)
    print("\nCorrelation Matrix:")
    print(np.round(corr_matrix, 3))

    # Using numpy's corrcoef
    print("\nVerify with np.corrcoef:")
    print(np.round(np.corrcoef(data.T), 3))

covariance_matrix_demo()
```

The covariance matrix has a beautiful geometric interpretation through its eigendecomposition.
Eigendecomposition of Covariance:
$$\boldsymbol{\Sigma} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$$
where:
$\mathbf{V}$ is an orthogonal matrix whose columns $\mathbf{v}_1, \ldots, \mathbf{v}_d$ are the eigenvectors of $\boldsymbol{\Sigma}$
$\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_d)$ holds the corresponding eigenvalues, all nonnegative because $\boldsymbol{\Sigma}$ is positive semi-definite
Geometric Meaning:
The eigenvectors give the principal axes of the data cloud (the axes of the ellipsoidal density contours), and each eigenvalue $\lambda_i$ is the variance along its axis. The total variance is $\lambda_1 + \cdots + \lambda_d = \text{tr}(\boldsymbol{\Sigma})$.
Connection to PCA:
Principal Component Analysis finds exactly these eigenvectors. The first principal component is the direction of maximum variance, the second is orthogonal with next-highest variance, etc.
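Here is a minimal sketch of that connection on synthetic 2-D data: the eigenvectors of the sample covariance matrix act as principal directions, the eigenvalues as the variances captured, and projecting onto the eigenvectors decorrelates the data.

```python
import numpy as np

np.random.seed(0)
n = 5000

# Correlated 2-D data (synthetic, for illustration)
Sigma = np.array([[3.0, 1.5],
                  [1.5, 2.0]])
X = np.random.multivariate_normal([0.0, 0.0], Sigma, n)

# Eigendecomposition of the sample covariance matrix
sample_cov = np.cov(X.T)
eigvals, eigvecs = np.linalg.eigh(sample_cov)  # eigenvalues in ascending order

# Reverse to descending order: first column = first principal component
print("Principal directions (columns):")
print(np.round(eigvecs[:, ::-1], 3))
print("Variance captured by each component:", np.round(eigvals[::-1], 3))

# Projecting centered data onto the eigenvectors decorrelates it
Z = (X - X.mean(axis=0)) @ eigvecs[:, ::-1]
print("Covariance after projection (approximately diagonal):")
print(np.round(np.cov(Z.T), 3))
```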
Mahalanobis Distance:
Distance that accounts for covariance structure:
$$d_M(\mathbf{x}, \boldsymbol{\mu}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
Points with equal Mahalanobis distance form ellipsoids aligned with the covariance structure.
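A short sketch on synthetic correlated data: two points at the same Euclidean distance from the mean can have very different Mahalanobis distances, depending on whether they lie along or against the correlation direction.

```python
import numpy as np

# Strongly correlated 2-D Gaussian (synthetic)
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """Mahalanobis distance from x to mu under the given covariance."""
    diff = x - mu
    return np.sqrt(diff @ Sigma_inv @ diff)

# Two points at the same Euclidean distance from the mean
p_along = np.array([1.0, 1.0])    # along the correlation direction
p_across = np.array([1.0, -1.0])  # against the correlation direction

print(np.linalg.norm(p_along), np.linalg.norm(p_across))  # equal Euclidean distances
print(round(mahalanobis(p_along, mu, Sigma_inv), 3))   # ~1.03: a typical point
print(round(mahalanobis(p_across, mu, Sigma_inv), 3))  # ~4.47: an unusual point
```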
For a 2D Gaussian with equal variances: ρ = 0 gives circular contours, and ρ → ±1 gives increasingly elongated ellipses along the diagonal. With unequal variances the contours are ellipses even when ρ = 0, and the tilt angle depends on both the correlation and the relative variances. This is why scatter plots of correlated data look like tilted ellipses.
Covariance and correlation are fundamental to many ML techniques:
1. Principal Component Analysis (PCA)
PCA finds the eigendecomposition of the covariance (or correlation) matrix. Principal components are eigenvectors; eigenvalues tell us how much variance each captures.
2. Linear Regression
The OLS coefficient in simple linear regression: $$\beta = \frac{\text{Cov}(X, Y)}{\text{Var}(X)} = \rho \frac{\sigma_Y}{\sigma_X}$$
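A quick numerical check of this identity on synthetic data, comparing the covariance-based slope with an ordinary least-squares fit (via np.polyfit):

```python
import numpy as np

np.random.seed(0)
n = 10_000

# Synthetic regression data (true slope = 1.5, for illustration)
x = np.random.normal(0, 2, n)
y = 1.5 * x + np.random.normal(0, 1, n)

# Slope from the covariance formula
beta_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Equivalent form using correlation and standard deviations
rho = np.corrcoef(x, y)[0, 1]
beta_rho = rho * np.std(y, ddof=1) / np.std(x, ddof=1)

# Slope from an ordinary least-squares fit
beta_ols = np.polyfit(x, y, 1)[0]

print(beta_cov, beta_rho, beta_ols)  # all three should be close to 1.5
```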
3. Feature Selection
Highly correlated features are redundant. Correlation matrices help identify multicollinearity.
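A small sketch of this idea, using a hypothetical feature matrix in which one feature is nearly a copy of another: scan the correlation matrix for pairs above a chosen threshold.

```python
import numpy as np

np.random.seed(0)
n = 1000

# Hypothetical feature matrix: feature 2 is nearly a copy of feature 0
f0 = np.random.normal(0, 1, n)
f1 = np.random.normal(0, 1, n)
f2 = f0 + 0.05 * np.random.normal(0, 1, n)  # redundant with f0
X = np.column_stack([f0, f1, f2])

corr = np.corrcoef(X.T)
threshold = 0.95

# Report feature pairs whose absolute correlation exceeds the threshold
d = X.shape[1]
for i in range(d):
    for j in range(i + 1, d):
        if abs(corr[i, j]) > threshold:
            print(f"Features {i} and {j} are highly correlated: r = {corr[i, j]:.3f}")
```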
4. Gaussian Mixture Models
Each component has its own covariance matrix, determining cluster shape.
5. Portfolio Theory
Covariance between asset returns determines portfolio risk. Diversification works because assets aren't perfectly correlated.
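A minimal sketch with made-up volatilities: portfolio variance is $\mathbf{w}^T \boldsymbol{\Sigma} \mathbf{w}$, so the same weights produce much less risk when assets are uncorrelated than when they are perfectly correlated.

```python
import numpy as np

# Made-up annual volatilities for three assets
vols = np.array([0.20, 0.15, 0.10])
w = np.array([1/3, 1/3, 1/3])  # equal-weight portfolio

def portfolio_vol(corr):
    """Portfolio volatility given an asset correlation matrix."""
    Sigma = np.outer(vols, vols) * corr  # covariance from vols and correlations
    return np.sqrt(w @ Sigma @ w)

perfectly_correlated = np.ones((3, 3))
uncorrelated = np.eye(3)

print(round(portfolio_vol(perfectly_correlated), 4))  # 0.15: weighted average of vols
print(round(portfolio_vol(uncorrelated), 4))          # smaller: diversification benefit
```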
In practice, we estimate covariance from data.
Sample Covariance:
For samples $(x_1, y_1), \ldots, (x_n, y_n)$:
$$\widehat{\text{Cov}}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
The $n-1$ denominator (Bessel's correction) makes this an unbiased estimator.
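A small sketch verifying that the manual formula matches np.cov, which applies Bessel's correction by default:

```python
import numpy as np

np.random.seed(0)
x = np.random.normal(0, 1, 50)
y = 2 * x + np.random.normal(0, 1, 50)
n = len(x)

# Manual sample covariance with the n-1 denominator
manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

print(manual)
print(np.cov(x, y)[0, 1])             # matches: np.cov divides by n-1 by default
print(np.cov(x, y, bias=True)[0, 1])  # biased version divides by n instead
```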
High-Dimensional Challenges:
For $d$ variables, the covariance matrix has $d(d+1)/2$ unique entries. When $n < d$, the sample covariance matrix is rank-deficient (not invertible).
Solutions: shrinkage estimators (e.g., Ledoit-Wolf), which blend the sample covariance with a simple structured target; regularization, such as adding a small multiple of the identity to the diagonal; and structural assumptions like diagonal, sparse, or low-rank (factor) covariance models.
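As one concrete option (assuming scikit-learn is available), here is a sketch of Ledoit-Wolf shrinkage on a case with fewer samples than dimensions:

```python
import numpy as np
from sklearn.covariance import LedoitWolf

np.random.seed(0)
n, d = 30, 50  # fewer samples than dimensions
X = np.random.normal(0, 1, size=(n, d))

# Sample covariance is rank-deficient (rank <= n-1 < d), hence singular
sample_cov = np.cov(X.T)
print("Sample covariance rank:", np.linalg.matrix_rank(sample_cov))

# Ledoit-Wolf shrinks toward a scaled identity, restoring invertibility
lw = LedoitWolf().fit(X)
print("Shrinkage intensity:", round(lw.shrinkage_, 3))
print("Smallest eigenvalue of shrunk estimate:",
      round(np.linalg.eigvalsh(lw.covariance_).min(), 4))
```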
Robust Estimation:
Sample covariance is sensitive to outliers; a single extreme point can dominate the estimate. Robust alternatives include the Minimum Covariance Determinant (MCD) estimator and rank-based correlations such as Spearman's ρ and Kendall's τ, which depend only on the ordering of the data.
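For example (sketch using SciPy), a single extreme outlier can badly distort the Pearson correlation while Spearman's rank correlation moves only slightly:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

np.random.seed(0)
n = 200

# Clean, moderately correlated data
x = np.random.normal(0, 1, n)
y = 0.6 * x + 0.8 * np.random.normal(0, 1, n)

# Add one extreme outlier
x_out = np.append(x, 20.0)
y_out = np.append(y, -20.0)

print("Pearson (clean):   ", round(pearsonr(x, y)[0], 3))
print("Pearson (outlier): ", round(pearsonr(x_out, y_out)[0], 3))   # badly distorted
print("Spearman (clean):  ", round(spearmanr(x, y)[0], 3))
print("Spearman (outlier):", round(spearmanr(x_out, y_out)[0], 3))  # changes only slightly
```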
What's next:
Correlation tells us about the strength of linear association, but correlation zero doesn't mean independence. The next page covers independence rigorously—the strongest statement we can make about variables not affecting each other.
You now understand covariance and correlation as measures of linear dependence, their properties, the covariance matrix, and their central role in machine learning algorithms.