In simple linear regression, we modeled the relationship between a single predictor variable and a response. The mathematics was elegant: a slope, an intercept, and clear geometric intuition. But real-world problems rarely involve a single predictor.
Consider predicting house prices. You don't just have square footage—you have the number of bedrooms, bathrooms, age of the house, lot size, proximity to schools, crime rates, and dozens of other variables. Each adds predictive power, but also adds mathematical complexity.
The key insight: When we have multiple predictors, we transition from scalar algebra to matrix algebra. This isn't merely a notational convenience—it's a fundamental shift that unlocks computational efficiency, theoretical elegance, and connections to geometry that would be impossible to express otherwise.
By the end of this page, you will understand how multiple linear regression is expressed in matrix form, why this formulation is essential for both theory and computation, and how to translate between the summation notation you may have seen and the compact matrix equations that power modern implementations.
Let's establish the model formally. In multiple linear regression, we assume that a response variable $y$ is a linear combination of $p$ predictor variables plus random noise:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$
where $y_i$ is the response for observation $i$, $x_{ij}$ is the value of predictor $j$ for observation $i$, $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_p$ are the regression coefficients, and $\epsilon_i$ is a random error term.
For $n$ observations, we have $n$ such equations, one for each data point. Writing them all out individually becomes unwieldy—this is where matrix notation becomes indispensable.
Predictors are also called features, independent variables, covariates, or regressors. Response is also called the dependent variable, target, or outcome. In machine learning, we typically use features/target; in statistics, predictors/response is more common. The mathematics is identical.
The summation form:
We can write the model more compactly using summation notation:
$$y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \epsilon_i \quad \text{for } i = 1, 2, \ldots, n$$
This captures all $n$ equations but still requires index manipulation. To derive estimators and prove properties, we need something even more compact. Enter matrix formulation.
The design matrix (also called the model matrix or data matrix) is the central object that organizes all predictor information. We construct it by stacking observations as rows and features as columns.
Definition: Given $n$ observations and $p$ predictor variables, the design matrix $\mathbf{X}$ is an $n \times (p+1)$ matrix:
$$\mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$
Critical observation: The first column is all 1s. This is the intercept column, and it's essential for the matrix formulation to work correctly.
The column of 1s allows the intercept β₀ to be treated uniformly with other coefficients. When we multiply the design matrix by the coefficient vector, the 1s ensure β₀ is added to every prediction. Without this column, we'd need to handle the intercept separately, complicating all derivations.
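As a quick illustration (with made-up numbers), prepending a column of 1s lets a single matrix-vector product apply the intercept to every observation:

```python
import numpy as np

# Two observations, one predictor (hypothetical values)
X_raw = np.array([[2.0], [5.0]])
ones = np.ones((X_raw.shape[0], 1))
X = np.hstack([ones, X_raw])   # design matrix with intercept column

beta = np.array([3.0, 0.5])    # [β₀, β₁]
print(X @ beta)                # [3.0 + 0.5*2, 3.0 + 0.5*5] = [4.0, 5.5]
```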
Reading the design matrix:
Each row of $\mathbf{X}$ corresponds to one observation: a leading 1 followed by that observation's $p$ feature values. Each column corresponds to one variable: the intercept column of 1s, then one column per predictor across all $n$ observations. In this sense, the design matrix encodes the entire training dataset's feature information.
| Component | Symbol | Dimensions | Description |
|---|---|---|---|
| Design matrix | $\mathbf{X}$ | $n \times (p+1)$ | Predictor values with intercept column |
| Response vector | $\mathbf{y}$ | $n \times 1$ | Target values for all observations |
| Coefficient vector | $\boldsymbol{\beta}$ | $(p+1) \times 1$ | Intercept plus all regression coefficients |
| Error vector | $\boldsymbol{\epsilon}$ | $n \times 1$ | Random errors for all observations |
Now we can express the entire multiple regression model as a single matrix equation:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$
where:
$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$ is the $n \times 1$ response vector
$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}$ is the $(p+1) \times 1$ coefficient vector
$\boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$ is the $n \times 1$ error vector
This single equation replaces all $n$ scalar equations. Every derivation, proof, and algorithm flows from this compact representation.
Always verify dimensions: $\mathbf{X}$ is $n \times (p+1)$, $\boldsymbol{\beta}$ is $(p+1) \times 1$, so $\mathbf{X}\boldsymbol{\beta}$ is $n \times 1$. Adding $\boldsymbol{\epsilon}$ (also $n \times 1$) gives $\mathbf{y}$, which is $n \times 1$. ✓
Unpacking the multiplication:
To see why this works, consider the product $\mathbf{X}\boldsymbol{\beta}$. The $i$-th element is:
$$(\mathbf{X}\boldsymbol{\beta})_i = 1 \cdot \beta_0 + x_{i1} \cdot \beta_1 + x_{i2} \cdot \beta_2 + \cdots + x_{ip} \cdot \beta_p$$
which is exactly $\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$—the linear predictor for observation $i$. The matrix form is not an approximation or simplification; it's an exact encoding.
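A small numerical check of this claim, using randomly generated (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # design matrix
beta = rng.normal(size=p + 1)

i = 2
matrix_form = (X @ beta)[i]
scalar_form = beta[0] + sum(beta[j] * X[i, j] for j in range(1, p + 1))
print(np.isclose(matrix_form, scalar_form))  # True: identical up to float rounding
```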
Once we have estimated coefficients $\hat{\boldsymbol{\beta}}$ (we'll derive these in the next page), the fitted values or predictions are:
$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$
The residual vector captures the difference between observed and predicted values:
$$\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$
Important distinction:
| Vector | Symbol | Formula | Interpretation |
|---|---|---|---|
| Response | $\mathbf{y}$ | Observed data | What we measured |
| Fitted values | $\hat{\mathbf{y}}$ | $\mathbf{X}\hat{\boldsymbol{\beta}}$ | Model predictions |
| True errors | $\boldsymbol{\epsilon}$ | $\mathbf{y} - \mathbf{X}\boldsymbol{\beta}$ | Unobservable random noise |
| Residuals | $\mathbf{e}$ | $\mathbf{y} - \hat{\mathbf{y}}$ | Observable approximation to errors |
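To make the distinction concrete, here is a short sketch with simulated data. Since we have not derived the OLS solution yet, it uses NumPy's `np.linalg.lstsq` purely as a stand-in way to obtain some $\hat{\boldsymbol{\beta}}$ for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)   # true errors are unobservable

# Obtain some beta_hat for illustration (lstsq is a stand-in for the
# closed-form OLS solution derived on the next page)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat   # fitted values
e = y - y_hat          # residuals: observable proxy for the errors
print(e.shape)         # (50,) -- one residual per observation
```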
Predicting new data:
For a new observation with features $\mathbf{x}_{\text{new}} = [1, x_1, x_2, \ldots, x_p]^\top$, the predicted response is:
$$\hat{y}_{\text{new}} = \mathbf{x}_{\text{new}}^\top \hat{\boldsymbol{\beta}}$$
The same coefficient vector is used; only the feature values change. This is the core of the linear model's utility—parameters learned from training data apply directly to new predictions.
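A minimal sketch of prediction for a new observation, assuming a hypothetical coefficient vector `beta_hat` has already been fitted:

```python
import numpy as np

# Hypothetical fitted coefficients from training data
beta_hat = np.array([5.0, 2.0, -1.5, 0.8])   # [β₀, β₁, β₂, β₃]

# New observation: remember to prepend the 1 for the intercept
x_new = np.array([1.0, 0.4, -1.2, 2.0])
y_new_hat = x_new @ beta_hat
print(y_new_hat)   # 5.0 + 2.0*0.4 - 1.5*(-1.2) + 0.8*2.0 = 9.2
```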
The goal of Ordinary Least Squares (OLS) is to find the coefficient vector $\boldsymbol{\beta}$ that minimizes the sum of squared residuals. In scalar form, we minimize:
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2$$
In matrix form, this becomes remarkably elegant:
$$\text{SSE}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$$
The notation $\|\mathbf{v}\|^2 = \mathbf{v}^\top\mathbf{v}$ denotes the squared Euclidean (L2) norm. For $\mathbf{v} = [v_1, v_2, \ldots, v_n]^\top$, we have $\|\mathbf{v}\|^2 = v_1^2 + v_2^2 + \cdots + v_n^2$. The sum of squared residuals is literally the squared length of the residual vector.
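A quick numerical confirmation that these equivalent expressions for the squared norm agree:

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])
print(v @ v)                   # vᵀv = 9 + 16 + 144 = 169.0
print(np.sum(v ** 2))          # sum of squared components: same value
print(np.linalg.norm(v) ** 2)  # squared Euclidean length: same value (up to rounding)
```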
Expanding the objective:
Let's expand the matrix expression to connect with optimization:
$$\begin{aligned} \text{SSE}(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \\ &= \mathbf{y}^\top\mathbf{y} - 2\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \end{aligned}$$
Note that $\mathbf{y}^\top\mathbf{X}\boldsymbol{\beta}$ and $\boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{y}$ are both scalars and are transposes of each other, hence equal. This quadratic form in $\boldsymbol{\beta}$ is the starting point for deriving the normal equations.
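The algebra can be sanity-checked numerically; the sketch below evaluates both the original and expanded forms on random (hypothetical) data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)

# (y - Xβ)ᵀ(y - Xβ) versus yᵀy - 2βᵀXᵀy + βᵀXᵀXβ
direct = (y - X @ beta) @ (y - X @ beta)
expanded = y @ y - 2 * beta @ (X.T @ y) + beta @ (X.T @ X) @ beta
print(np.isclose(direct, expanded))   # True
```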
The matrix formulation isn't just elegant notation; it provides fundamental advantages across theory, computation, and intuition. Theoretically, estimators and their properties can be derived in a few lines of matrix algebra. Computationally, the model maps directly onto highly optimized linear algebra routines. Geometrically, least squares becomes a projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$.
In production code, you almost never implement matrix regression from scratch. Libraries like scikit-learn, R's lm(), and statsmodels handle numerical stability, missing data, and edge cases. But understanding the matrix formulation is essential for debugging, interpreting results, and extending to custom models.
A matrix that appears repeatedly in regression theory is the Gram matrix $\mathbf{X}^\top\mathbf{X}$. Understanding its structure is crucial.
Definition: The Gram matrix is a $(p+1) \times (p+1)$ symmetric matrix:
$$\mathbf{X}^\top\mathbf{X} = \begin{bmatrix} \mathbf{x}_0^\top\mathbf{x}_0 & \mathbf{x}_0^\top\mathbf{x}_1 & \cdots & \mathbf{x}_0^\top\mathbf{x}_p \\ \mathbf{x}_1^\top\mathbf{x}_0 & \mathbf{x}_1^\top\mathbf{x}_1 & \cdots & \mathbf{x}_1^\top\mathbf{x}_p \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{x}_p^\top\mathbf{x}_0 & \mathbf{x}_p^\top\mathbf{x}_1 & \cdots & \mathbf{x}_p^\top\mathbf{x}_p \end{bmatrix}$$
where $\mathbf{x}_j$ denotes column $j$ of the design matrix (treating columns as vectors).
Properties of the Gram matrix:
Symmetric: $(\mathbf{X}^\top\mathbf{X})^\top = \mathbf{X}^\top\mathbf{X}$
Positive semi-definite: For any vector $\mathbf{v}$, $\mathbf{v}^\top(\mathbf{X}^\top\mathbf{X})\mathbf{v} = \|\mathbf{X}\mathbf{v}\|^2 \geq 0$
Captures inner products: Element $(j, k)$ is the inner product $\langle \mathbf{x}_j, \mathbf{x}_k \rangle = \sum_{i=1}^{n} x_{ij} x_{ik}$
Diagonal entries: $(\mathbf{X}^\top\mathbf{X})_{jj} = \|\mathbf{x}_j\|^2$ = sum of squares of column $j$
The Gram matrix is invertible if and only if the columns of X are linearly independent. This fails when: (1) there are more features than observations (p > n), (2) features are perfectly collinear, or (3) one feature is a constant multiple of another. These cases require regularization or dimensionality reduction.
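The following sketch (with simulated data) checks the symmetry and positive semi-definiteness properties, then shows how a perfectly collinear column makes the Gram matrix rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])   # intercept plus two predictors

G = X.T @ X
print(np.allclose(G, G.T))                      # True: symmetric
print(np.all(np.linalg.eigvalsh(G) >= -1e-10))  # True: eigenvalues >= 0 (PSD)
print(np.linalg.matrix_rank(G))                 # 3: columns are linearly independent

# Add a perfectly collinear column: x3 = 2 * x1
X_bad = np.column_stack([X, 2 * x1])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))   # 3 < 4: Gram matrix is singular
```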
Before fitting regression models, features are often centered (mean-subtracted) or standardized (mean-subtracted and scaled by standard deviation). These transformations affect the matrix formulation in important ways.
Centering: $$\tilde{x}_{ij} = x_{ij} - \bar{x}_j \quad \text{where } \bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$
Standardization (Z-score normalization): $$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j} \quad \text{where } s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$$
When standardizing features, do NOT standardize the intercept column (the column of 1s). Standardization should apply only to actual predictors. Most libraries handle this automatically, but when implementing from scratch, this is a common source of bugs.
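A brief sketch of standardizing the predictors while leaving the intercept column untouched (the data values are simulated):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X_raw = rng.normal(loc=10.0, scale=5.0, size=(n, p))

# Standardize the predictors only, THEN attach the intercept column
means = X_raw.mean(axis=0)
stds = X_raw.std(axis=0, ddof=1)       # ddof=1 matches the s_j formula above
Z = (X_raw - means) / stds
X = np.hstack([np.ones((n, 1)), Z])    # intercept column left as 1s

print(X[:, 0].std())                                # 0.0 -- the 1s column has no variation
print(Z.mean(axis=0).round(10))                     # ≈ 0 for every predictor
print(Z.std(axis=0, ddof=1).round(10))              # ≈ 1 for every predictor
```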
Let's see how the matrix formulation translates to code. We'll build the design matrix and express the least squares objective.
```python
import numpy as np


def create_design_matrix(X, add_intercept=True):
    """
    Create the design matrix from feature data.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix without intercept
    add_intercept : bool
        Whether to add a column of 1s for the intercept

    Returns:
    --------
    X_design : ndarray of shape (n_samples, n_features + 1)
        Design matrix with intercept column
    """
    n_samples = X.shape[0]
    if add_intercept:
        # Add column of 1s as the first column
        ones = np.ones((n_samples, 1))
        X_design = np.hstack([ones, X])
    else:
        X_design = X.copy()
    return X_design


def compute_sse(y, X, beta):
    """
    Compute Sum of Squared Errors in matrix form.

    SSE = (y - Xβ)ᵀ(y - Xβ) = ||y - Xβ||²

    Parameters:
    -----------
    y : ndarray of shape (n_samples,)
        Response vector
    X : ndarray of shape (n_samples, n_features)
        Design matrix (with intercept column)
    beta : ndarray of shape (n_features,)
        Coefficient vector

    Returns:
    --------
    sse : float
        Sum of squared errors
    """
    residuals = y - X @ beta       # Matrix-vector product
    sse = residuals.T @ residuals  # Squared norm
    return float(sse)


def predict(X_new, beta, add_intercept=True):
    """
    Predict response values for new data.

    ŷ = Xβ

    Parameters:
    -----------
    X_new : ndarray of shape (n_samples, n_features)
        New feature data
    beta : ndarray of shape (n_features + 1,)
        Fitted coefficients (including intercept)
    add_intercept : bool
        Whether to add intercept column

    Returns:
    --------
    predictions : ndarray of shape (n_samples,)
        Predicted values
    """
    X_design = create_design_matrix(X_new, add_intercept=add_intercept)
    return X_design @ beta


# Example usage
np.random.seed(42)

# Generate sample data: 100 observations, 3 features
n_samples, n_features = 100, 3
X_raw = np.random.randn(n_samples, n_features)

# True coefficients (including intercept)
beta_true = np.array([5.0, 2.0, -1.5, 0.8])  # [β₀, β₁, β₂, β₃]

# Create design matrix and generate response
X = create_design_matrix(X_raw)
y = X @ beta_true + np.random.randn(n_samples) * 0.5  # Add noise

print("Design matrix shape:", X.shape)
print("Response vector shape:", y.shape)
print("True SSE (with true β):", compute_sse(y, X, beta_true))

# Verify dimensions
print("\nDimensions verification:")
print(f"  X: {X.shape[0]} × {X.shape[1]} (n × (p+1))")
print(f"  y: {y.shape[0]} × 1")
print(f"  β: {beta_true.shape[0]} × 1")
```

The @ operator performs matrix multiplication in NumPy (Python 3.5+). It's equivalent to np.dot() or np.matmul() but more readable. Always use @ for matrix products in modern Python code.
We've established the matrix foundation for multiple linear regression. The key concepts: the design matrix $\mathbf{X}$ stacks observations as rows, with a leading column of 1s for the intercept; the model is written compactly as $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$; fitted values and residuals are $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ and $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$; and the least squares objective is the squared norm $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2$, whose expansion involves the Gram matrix $\mathbf{X}^\top\mathbf{X}$.
What's next:
With the matrix formulation in place, we're ready to derive the normal equations—the closed-form solution for the optimal coefficients $\hat{\boldsymbol{\beta}}$. The next page develops this derivation using matrix calculus and establishes the famous result $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$.
You now understand the matrix formulation of multiple linear regression—the foundation for everything that follows. The design matrix, coefficient vector, and matrix objective function are your essential tools for deriving solutions and proving properties.