In simple linear regression, we modeled the relationship between a single predictor variable and a response. The mathematics was elegant: a slope, an intercept, and clear geometric intuition. But real-world problems rarely involve a single predictor.
Consider predicting house prices. You don't just have square footage—you have the number of bedrooms, bathrooms, age of the house, lot size, proximity to schools, crime rates, and dozens of other variables. Each adds predictive power, but also adds mathematical complexity.
The key insight: When we have multiple predictors, we transition from scalar algebra to matrix algebra. This isn't merely a notational convenience—it's a fundamental shift that unlocks computational efficiency, theoretical elegance, and connections to geometry that would be impossible to express otherwise.
By the end of this page, you will understand how multiple linear regression is expressed in matrix form, why this formulation is essential for both theory and computation, and how to translate between the summation notation you may have seen and the compact matrix equations that power modern implementations.
Let's establish the model formally. In multiple linear regression, we assume that a response variable $y$ is a linear combination of $p$ predictor variables plus random noise:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$
where $y_i$ is the response for observation $i$, $x_{ij}$ is the value of predictor $j$ for observation $i$, $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_p$ are the regression coefficients, and $\epsilon_i$ is a random error term.
For $n$ observations, we have $n$ such equations, one for each data point. Writing them all out individually becomes unwieldy—this is where matrix notation becomes indispensable.
Predictors are also called features, independent variables, covariates, or regressors. Response is also called the dependent variable, target, or outcome. In machine learning, we typically use features/target; in statistics, predictors/response is more common. The mathematics is identical.
The summation form:
We can write the model more compactly using summation notation:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i \quad \text{for } i = 1, 2, \ldots, n$$
This captures all $n$ equations but still requires index manipulation. To derive estimators and prove properties, we need something even more compact. Enter matrix formulation.
The design matrix (also called the model matrix or data matrix) is the central object that organizes all predictor information. We construct it by stacking observations as rows and features as columns.
Definition: Given $n$ observations and $p$ predictor variables, the design matrix $\mathbf{X}$ is an $n \times (p+1)$ matrix:
$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$
Critical observation: The first column is all 1s. This is the intercept column, and it's essential for the matrix formulation to work correctly.
The column of 1s allows the intercept β₀ to be treated uniformly with other coefficients. When we multiply the design matrix by the coefficient vector, the 1s ensure β₀ is added to every prediction. Without this column, we'd need to handle the intercept separately, complicating all derivations.
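As a quick illustration (with made-up numbers), prepending a column of 1s lets a single matrix-vector product apply the intercept to every observation:

```python
import numpy as np

# Two observations, one predictor (hypothetical values)
X_raw = np.array([[2.0], [5.0]])
ones = np.ones((X_raw.shape[0], 1))
X = np.hstack([ones, X_raw])   # design matrix with intercept column

beta = np.array([3.0, 0.5])    # [β₀, β₁]
print(X @ beta)                # [3.0 + 0.5*2, 3.0 + 0.5*5] = [4.0, 5.5]
```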
Reading the design matrix:
Each row of $\mathbf{X}$ corresponds to one observation: a leading 1 followed by that observation's $p$ feature values. Each column corresponds to one variable: the intercept column of 1s, then one column per predictor across all $n$ observations. In this sense, the design matrix encodes the entire training dataset's feature information.
| Component | Symbol | Dimensions | Description |
|---|---|---|---|
| Design matrix | $\mathbf{X}$ | $n \times (p+1)$ | Predictor values with intercept column |
| Response vector | $\mathbf{y}$ | $n \times 1$ | Target values for all observations |
| Coefficient vector | $\boldsymbol{\beta}$ | $(p+1) \times 1$ | Intercept plus all regression coefficients |
| Error vector | $\boldsymbol{\epsilon}$ | $n \times 1$ | Random errors for all observations |
Now we can express the entire multiple regression model as a single matrix equation:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where:
$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$ is the $n \times 1$ response vector
$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}$ is the $(p+1) \times 1$ coefficient vector
$\boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$ is the $n \times 1$ error vector
This single equation replaces all $n$ scalar equations. Every derivation, proof, and algorithm flows from this compact representation.
Always verify dimensions: $\mathbf{X}$ is $n \times (p+1)$, $\boldsymbol{\beta}$ is $(p+1) \times 1$, so $\mathbf{X}\boldsymbol{\beta}$ is $n \times 1$. Adding $\boldsymbol{\epsilon}$ (also $n \times 1$) gives $\mathbf{y}$, which is $n \times 1$. ✓
Unpacking the multiplication:
To see why this works, consider the product $\mathbf{X}\boldsymbol{\beta}$. The $i$-th element is:
$$(\mathbf{X}\boldsymbol{\beta})_i = 1 \cdot \beta_0 + x_{i1} \cdot \beta_1 + x_{i2} \cdot \beta_2 + \cdots + x_{ip} \cdot \beta_p$$
which is exactly $\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$—the linear predictor for observation $i$. The matrix form is not an approximation or simplification; it's an exact encoding.
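A small numerical check of this claim, using randomly generated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # design matrix
beta = rng.normal(size=p + 1)

i = 2
matrix_form = (X @ beta)[i]
scalar_form = beta[0] + sum(beta[j] * X[i, j] for j in range(1, p + 1))
print(np.isclose(matrix_form, scalar_form))  # True: identical up to float rounding
```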
Once we have estimated coefficients $\hat{\boldsymbol{\beta}}$ (we'll derive these in the next page), the fitted values or predictions are:
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$
The residual vector captures the difference between observed and predicted values:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$
Important distinction:
| Vector | Symbol | Formula | Interpretation |
|---|---|---|---|
| Response | $\mathbf{y}$ | Observed data | What we measured |
| Fitted values | $\hat{\mathbf{y}}$ | $\mathbf{X}\hat{\boldsymbol{\beta}}$ | Model predictions |
| True errors | $\boldsymbol{\epsilon}$ | $\mathbf{y} - \mathbf{X}\boldsymbol{\beta}$ | Unobservable random noise |
| Residuals | $\mathbf{e}$ | $\mathbf{y} - \hat{\mathbf{y}}$ | Observable approximation to errors |
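To make the distinction concrete, here is a short sketch with simulated data. Since we have not derived the OLS solution yet, it uses NumPy's `np.linalg.lstsq` purely as a stand-in way to obtain some $\hat{\boldsymbol{\beta}}$ for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)   # true errors are unobservable

# Obtain some beta_hat for illustration (lstsq is a stand-in for the
# closed-form OLS solution derived on the next page)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat   # fitted values
e = y - y_hat          # residuals: observable proxy for the errors
print(e.shape)         # (50,) -- one residual per observation
```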
Predicting new data:
For a new observation with features $\mathbf{x}_{\text{new}} = [1, x_1, x_2, \ldots, x_p]^\top$, the predicted response is:
$$\hat{y}_{\text{new}} = \mathbf{x}_{\text{new}}^\top \hat{\boldsymbol{\beta}}$$
The same coefficient vector is used; only the feature values change. This is the core of the linear model's utility—parameters learned from training data apply directly to new predictions.
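A minimal sketch of prediction for a new observation, assuming a hypothetical coefficient vector `beta_hat` has already been fitted:

```python
import numpy as np

# Hypothetical fitted coefficients from training data
beta_hat = np.array([5.0, 2.0, -1.5, 0.8])   # [β₀, β₁, β₂, β₃]

# New observation: remember to prepend the 1 for the intercept
x_new = np.array([1.0, 0.4, -1.2, 2.0])
y_new_hat = x_new @ beta_hat
print(y_new_hat)   # 5.0 + 2.0*0.4 - 1.5*(-1.2) + 0.8*2.0 = 9.2
```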
The goal of Ordinary Least Squares (OLS) is to find the coefficient vector $\boldsymbol{\beta}$ that minimizes the sum of squared residuals. In scalar form, we minimize:
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2$$
In matrix form, this becomes remarkably elegant:
$$\text{SSE}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$$
The notation $\|\mathbf{v}\|^2 = \mathbf{v}^\top\mathbf{v}$ denotes the squared Euclidean (L2) norm. For $\mathbf{v} = [v_1, v_2, \ldots, v_n]^\top$, we have $\|\mathbf{v}\|^2 = v_1^2 + v_2^2 + \cdots + v_n^2$. The sum of squared residuals is literally the squared length of the residual vector.
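A quick numerical confirmation that these equivalent expressions for the squared norm agree:

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])
print(v @ v)                   # vᵀv = 9 + 16 + 144 = 169.0
print(np.sum(v ** 2))          # sum of squared components: same value
print(np.linalg.norm(v) ** 2)  # squared Euclidean length: same value (up to rounding)
```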
Expanding the objective:
Let's expand the matrix expression to connect with optimization:
$$\begin{aligned} \text{SSE}(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \\ &= \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \end{aligned}$$
Note that $\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}$ and $\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}$ are both scalars and are transposes of each other, hence equal. This quadratic form in $\boldsymbol{\beta}$ is the starting point for deriving the normal equations.
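The algebra can be sanity-checked numerically; the sketch below evaluates both the original and expanded forms on random (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

# (y - Xβ)ᵀ(y - Xβ) versus yᵀy - 2βᵀXᵀy + βᵀXᵀXβ
direct = (y - X @ beta) @ (y - X @ beta)
expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta
print(np.isclose(direct, expanded))   # True
```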
The matrix formulation isn't just elegant notation; it provides fundamental advantages across theory, computation, and intuition. Theoretically, estimators and their properties can be derived in a few lines of matrix algebra. Computationally, the model maps directly onto highly optimized linear algebra routines. Geometrically, least squares becomes a projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.
In production code, you almost never implement matrix regression from scratch. Libraries like scikit-learn, R's lm(), and statsmodels handle numerical stability, missing data, and edge cases. But understanding the matrix formulation is essential for debugging, interpreting results, and extending to custom models.
A matrix that appears repeatedly in regression theory is the Gram matrix $\mathbf{X}^\top\mathbf{X}$. Understanding its structure is crucial.
Definition: The Gram matrix is a $(p+1) \times (p+1)$ symmetric matrix:
$$\mathbf{X}^\top\mathbf{X} = \begin{bmatrix} \mathbf{x}_0^\top\mathbf{x}_0 & \mathbf{x}_0^\top\mathbf{x}_1 & \cdots & \mathbf{x}_0^\top\mathbf{x}_p \\ \mathbf{x}_1^\top\mathbf{x}_0 & \mathbf{x}_1^\top\mathbf{x}_1 & \cdots & \mathbf{x}_1^\top\mathbf{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_p^\top\mathbf{x}_0 & \mathbf{x}_p^\top\mathbf{x}_1 & \cdots & \mathbf{x}_p^\top\mathbf{x}_p \end{bmatrix}$$
where $\mathbf{x}_j$ denotes column $j$ of the design matrix (treating columns as vectors).
Properties of the Gram matrix:
Symmetric: $(\mathbf{X}^\top\mathbf{X})^\top = \mathbf{X}^\top\mathbf{X}$
Positive semi-definite: For any vector $\mathbf{v}$, $\mathbf{v}^\top(\mathbf{X}^\top\mathbf{X})\mathbf{v} = \|\mathbf{X}\mathbf{v}\|^2 \geq 0$
Captures inner products: Element $(j, k)$ is the inner product $\langle \mathbf{x}_j, \mathbf{x}_k \rangle = \sum_{i=1}^{n} x_{ij} x_{ik}$
Diagonal entries: $(\mathbf{X}^\top\mathbf{X})_{jj} = \|\mathbf{x}_j\|^2$ = sum of squares of column $j$
The Gram matrix is invertible if and only if the columns of X are linearly independent. This fails when: (1) there are more features than observations (p > n), (2) features are perfectly collinear, or (3) one feature is a constant multiple of another. These cases require regularization or dimensionality reduction.
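The following sketch (with simulated data) checks the symmetry and positive semi-definiteness properties, then shows how a perfectly collinear column makes the Gram matrix rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])   # intercept plus two predictors

G = X.T @ X
print(np.allclose(G, G.T))                      # True: symmetric
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))  # True: eigenvalues >= 0 (PSD)
print(np.linalg.matrix_rank(G))                 # 3: columns are linearly independent

# Add a perfectly collinear column: x3 = 2 * x1
X_bad = np.column_stack([X, 2 * x1])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))   # 3 < 4: Gram matrix is singular
```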
Before fitting regression models, features are often centered (mean-subtracted) or standardized (mean-subtracted and scaled by standard deviation). These transformations affect the matrix formulation in important ways.
Centering: $$\tilde{x}_{ij} = x_{ij} - \bar{x}_j \quad \text{where } \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$
Standardization (Z-score normalization): $$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \quad \text{where } s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$$
When standardizing features, do NOT standardize the intercept column (the column of 1s). Standardization should apply only to actual predictors. Most libraries handle this automatically, but when implementing from scratch, this is a common source of bugs.
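A brief sketch of standardizing the predictors while leaving the intercept column untouched (the data values are simulated):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X_raw = rng.normal(loc=10.0, scale=5.0, size=(n, p))

# Standardize the predictors only, THEN attach the intercept column
means = X_raw.mean(axis=0)
stds = X_raw.std(axis=0, ddof=1)       # ddof=1 matches the s_j formula above
Z = (X_raw - means) / stds
X = np.hstack([np.ones((n, 1)), Z])    # intercept column left as 1s

print(X[:, 0].std())                                # 0.0 -- the 1s column has no variation
print(Z.mean(axis=0).round(10))                     # ≈ 0 for every predictor
print(Z.std(axis=0, ddof=1).round(10))              # ≈ 1 for every predictor
```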
Let's see how the matrix formulation translates to code. We'll build the design matrix and express the least squares objective.
```python
import numpy as np


def create_design_matrix(X, add_intercept=True):
    """
    Create the design matrix from feature data.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix without intercept
    add_intercept : bool
        Whether to add a column of 1s for the intercept

    Returns:
    --------
    X_design : ndarray of shape (n_samples, n_features + 1)
        Design matrix with intercept column
    """
    n_samples = X.shape[0]
    if add_intercept:
        # Add column of 1s as the first column
        ones = np.ones((n_samples, 1))
        X_design = np.hstack([ones, X])
    else:
        X_design = X.copy()
    return X_design


def compute_sse(y, X, beta):
    """
    Compute Sum of Squared Errors in matrix form.

    SSE = (y - Xβ)ᵀ(y - Xβ) = ||y - Xβ||²

    Parameters:
    -----------
    y : ndarray of shape (n_samples,)
        Response vector
    X : ndarray of shape (n_samples, n_features)
        Design matrix (with intercept column)
    beta : ndarray of shape (n_features,)
        Coefficient vector

    Returns:
    --------
    sse : float
        Sum of squared errors
    """
    residuals = y - X @ beta       # Matrix-vector product
    sse = residuals.T @ residuals  # Squared norm
    return float(sse)


def predict(X_new, beta, add_intercept=True):
    """
    Predict response values for new data.

    ŷ = Xβ

    Parameters:
    -----------
    X_new : ndarray of shape (n_samples, n_features)
        New feature data
    beta : ndarray of shape (n_features + 1,)
        Fitted coefficients (including intercept)
    add_intercept : bool
        Whether to add intercept column

    Returns:
    --------
    predictions : ndarray of shape (n_samples,)
        Predicted values
    """
    X_design = create_design_matrix(X_new, add_intercept=add_intercept)
    return X_design @ beta


# Example usage
np.random.seed(42)

# Generate sample data: 100 observations, 3 features
n_samples, n_features = 100, 3
X_raw = np.random.randn(n_samples, n_features)

# True coefficients (including intercept)
beta_true = np.array([5.0, 2.0, -1.5, 0.8])  # [β₀, β₁, β₂, β₃]

# Create design matrix and generate response
X = create_design_matrix(X_raw)
y = X @ beta_true + np.random.randn(n_samples) * 0.5  # Add noise

print("Design matrix shape:", X.shape)
print("Response vector shape:", y.shape)
print("True SSE (with true β):", compute_sse(y, X, beta_true))

# Verify dimensions
print("\nDimensions verification:")
print(f"  X: {X.shape[0]} × {X.shape[1]} (n × (p+1))")
print(f"  y: {y.shape[0]} × 1")
print(f"  β: {beta_true.shape[0]} × 1")
```

The @ operator performs matrix multiplication in NumPy (Python 3.5+). It's equivalent to np.dot() or np.matmul() but more readable. Always use @ for matrix products in modern Python code.
We've established the matrix foundation for multiple linear regression. The key concepts: the design matrix $\mathbf{X}$ stacks observations as rows, with a leading column of 1s for the intercept; the model is written compactly as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$; fitted values and residuals are $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ and $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$; and the least squares objective is the squared norm $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$, whose expansion involves the Gram matrix $\mathbf{X}^\top\mathbf{X}$.
What's next:
With the matrix formulation in place, we're ready to derive the normal equations—the closed-form solution for the optimal coefficients $\hat{\boldsymbol{\beta}}$. The next page develops this derivation using matrix calculus and establishes the famous result $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$.
You now understand the matrix formulation of multiple linear regression—the foundation for everything that follows. The design matrix, coefficient vector, and matrix objective function are your essential tools for deriving solutions and proving properties.