Linear regression stands as one of the most fundamental and enduring techniques in all of machine learning and statistics. Despite its apparent simplicity, it forms the conceptual bedrock upon which much of modern predictive modeling is built. Understanding linear regression thoroughly—not just how to use it, but why it works and when it fails—is essential for any serious practitioner of machine learning.
This page introduces the simple linear regression model, the case where we have a single input variable (feature) and seek to predict a continuous output variable (target). While "simple" in name, this model encapsulates the core principles that extend to multiple regression, regularized methods, and even neural networks.
By the end of this page, you will understand the precise mathematical formulation of simple linear regression, the distinction between deterministic and stochastic components, the role of the error term, and how this model connects to the broader framework of supervised learning. You'll develop the conceptual foundation necessary for deriving the least squares solution in subsequent pages.
At its heart, regression addresses a fundamental question in data analysis: Given observations of two related variables, how can we predict one from the other?
Consider two concrete examples:

- Predicting a house's sale price from its square footage.
- Predicting a student's test score from the number of hours studied.
In each case, we observe pairs of values $(x_i, y_i)$ where $x$ is the independent variable (predictor, feature, explanatory variable) and $y$ is the dependent variable (response, target, outcome). Our goal is to discover a relationship that allows prediction of $y$ for new, unseen values of $x$.
Different fields use different terminology. Statisticians often say 'independent/dependent variables' or 'explanatory/response.' Machine learning practitioners prefer 'features/targets' or 'inputs/outputs.' Econometricians say 'regressors/regressand.' These all refer to the same mathematical objects. We'll use these terms interchangeably, favoring the most natural choice for each context.
Why Linear?
The simplest possible assumption about the relationship between $x$ and $y$ is that it's linear—that $y$ changes proportionally with $x$. This means we're looking for a straight line that best describes the relationship.
This assumption might seem restrictive, but it turns out to be remarkably powerful: linear models are easy to interpret (the slope has a direct meaning), they admit exact closed-form solutions, they often provide good local approximations to smooth nonlinear relationships, and they are the building blocks of more flexible methods. The table below situates simple linear regression within the general supervised learning framework:
| Component | Symbol | Description | Example |
|---|---|---|---|
| Input space | $\mathcal{X}$ | Set of possible input values | $\mathcal{X} = \mathbb{R}$ (house sizes) |
| Output space | $\mathcal{Y}$ | Set of possible output values | $\mathcal{Y} = \mathbb{R}$ (house prices) |
| Training data | $\{(x_i, y_i)\}_{i=1}^n$ | Observed input-output pairs | 100 house sales records |
| Hypothesis class | $\mathcal{H}$ | Set of candidate functions | All linear functions $f(x) = \beta_0 + \beta_1 x$ |
| Loss function | $L(y, \hat{y})$ | Measures prediction error | Squared error $(y - \hat{y})^2$ |
| Learning algorithm | — | Procedure to find best hypothesis | Least squares (OLS) |
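To make the framework concrete, here is a minimal Python sketch of how each component might look in code; the data values and the candidate coefficients are made up purely for illustration.

```python
# Training data: observed (x, y) pairs, e.g. house sizes (sqft) and prices (in $1000s)
x_train = [1400, 1600, 1700, 1875, 2350]
y_train = [245, 312, 279, 308, 405]

# Hypothesis class: all linear functions f(x) = beta_0 + beta_1 * x
def linear_hypothesis(x, beta_0, beta_1):
    return beta_0 + beta_1 * x

# Loss function: squared error between an observed y and a prediction y_hat
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

# Evaluate one (arbitrarily chosen) candidate hypothesis on the training data
predictions = [linear_hypothesis(x, 20.0, 0.15) for x in x_train]
losses = [squared_loss(y, y_hat) for y, y_hat in zip(y_train, predictions)]
print(losses)
```

The learning algorithm (the last row of the table) is the procedure that searches the hypothesis class for the coefficients with the smallest total loss; that procedure is the subject of the next page.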
The simple linear regression model posits that the true relationship between $x$ and $y$ takes the form:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
This equation is deceptively simple, but each component carries deep meaning: $y_i$ is the observed response for the $i$-th observation, $x_i$ is the corresponding predictor value, $\beta_0$ is the intercept (the expected value of $y$ when $x = 0$), $\beta_1$ is the slope (the expected change in $y$ for a one-unit increase in $x$), and $\varepsilon_i$ is the error term. Let's look at the error term more closely.
The error term εᵢ is not just statistical noise to be ignored—it's a fundamental part of the model that acknowledges reality: no two houses with the same square footage sell for exactly the same price. The error term captures measurement error, omitted variables, inherent randomness, and all other factors affecting y that aren't included in our model.
Deterministic vs. Stochastic Components
The model equation can be rewritten to highlight a crucial conceptual distinction:
$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{systematic component}} + \underbrace{\varepsilon_i}_{\text{random component}}$$
The systematic component $\beta_0 + \beta_1 x_i$ represents the deterministic part—the expected or average value of $y$ given $x$. If we knew $\beta_0$ and $\beta_1$ exactly, this would be our best prediction.
The random component $\varepsilon_i$ represents the stochastic part—the unpredictable deviation of any individual observation from the expected value. This is inherently uncertain and differs for each data point.
This decomposition is fundamental: we can model and estimate the systematic component, but we can never eliminate the random component—we can only characterize its statistical properties.
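As a quick illustration with made-up numbers: if $\beta_0 = 5$, $\beta_1 = 2$, and $x_i = 3$, the systematic component is $5 + 2 \cdot 3 = 11$; if the realized error happens to be $\varepsilon_i = 0.4$, the observed value is $y_i = 11 + 0.4 = 11.4$. A second observation with the same $x_i$ would share the systematic part of 11 but carry a different error, and hence a different $y_i$.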
A critical distinction in statistical modeling—one that sometimes confuses newcomers—is between population parameters and sample estimates. Understanding this distinction is essential for interpreting regression results correctly.
Population Parameters (True but Unknown)
The quantities $\beta_0$, $\beta_1$, and the distribution of $\varepsilon$ describe the true underlying relationship in the entire population. If we could somehow observe every possible $(x, y)$ pair in the universe (every house ever sold, every student ever tested), we would know these parameters exactly.
But we can't. We have only a finite sample of $n$ observations: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Sample Estimates (Computed from Data)
From our sample, we compute estimates of the population parameters, denoted with hats:
$$\hat{\beta}_0, \quad \hat{\beta}_1$$
These estimates are our best guesses based on available data. They will differ from the true parameters, and they would differ if we collected a different sample. This variability is the subject of statistical inference.
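As a sketch of this sampling variability, the simulation below repeatedly generates data from an assumed "true" line and refits it each time, using NumPy's polynomial least squares fit as a black box (the estimator itself is derived on the next page); all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_0_true, beta_1_true, sigma = 5.0, 2.0, 1.0   # assumed "population" values
n, n_samples = 100, 1000

slope_estimates = []
for _ in range(n_samples):
    x = rng.uniform(0, 10, n)
    y = beta_0_true + beta_1_true * x + rng.normal(0, sigma, n)
    beta_1_hat = np.polyfit(x, y, 1)[0]   # fitted slope for this particular sample
    slope_estimates.append(beta_1_hat)

# The estimates cluster around the true slope but differ from sample to sample
print(f"mean of slope estimates: {np.mean(slope_estimates):.3f}")
print(f"std  of slope estimates: {np.std(slope_estimates):.3f}")
```

The spread of the slope estimates around the true value is exactly the sampling variability that statistical inference quantifies.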
| Concept | Population (True) | Sample (Estimated) | Relationship |
|---|---|---|---|
| Intercept | $\beta_0$ | $\hat{\beta}_0$ | $\hat{\beta}_0$ estimates $\beta_0$ |
| Slope | $\beta_1$ | $\hat{\beta}_1$ | $\hat{\beta}_1$ estimates $\beta_1$ |
| Predicted value | $E[y|x] = \beta_0 + \beta_1 x$ | $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ | $\hat{y}$ estimates $E[y|x]$ |
| Error term | $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$ | $e_i = y_i - \hat{y}_i$ | $e_i$ (residual) estimates $\varepsilon_i$ |
| Error variance | $\sigma^2 = \text{Var}(\varepsilon)$ | $s^2 = \frac{1}{n-2}\sum e_i^2$ | $s^2$ estimates $\sigma^2$ |
The terms 'residual' and 'error' are often used interchangeably in casual discussion, but they have distinct technical meanings. The error εᵢ is the true deviation from the population regression line (unknown). The residual eᵢ is the observed deviation from the fitted line (computed from data). Residuals approximate errors but are not identical to them.
The Fitted Model
Once we estimate $\hat{\beta}_0$ and $\hat{\beta}_1$ from our sample, we obtain the fitted regression line:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
This is our predictive model. For any value $x$, we can compute a predicted value $\hat{y}$. The residuals are the differences between observed and predicted values:
$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$
Residuals are observable (we can compute them), while errors are not (they involve unknown true parameters). Much of regression diagnostics involves analyzing residuals as proxies for the unobservable errors.
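As a minimal sketch, assuming we already have estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ from some fitting procedure (the made-up values below stand in for them), the fitted values, residuals, and error-variance estimate from the table above are computed as:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.2, 8.9, 11.1, 13.2, 14.8])

# Hypothetical estimates standing in for the output of a fitting procedure
beta_0_hat, beta_1_hat = 5.1, 1.95

y_hat = beta_0_hat + beta_1_hat * x        # fitted values
e = y - y_hat                              # residuals (observable)
n = len(y)
s_squared = np.sum(e**2) / (n - 2)         # estimates the error variance sigma^2

print("fitted values:", y_hat)
print("residuals:    ", e)
print("s^2:          ", s_squared)
```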
A powerful way to understand the regression model is through the lens of conditional expectation. This perspective connects regression to fundamental concepts in probability theory and provides deeper insight into what the model actually represents.
What Does the Regression Function Represent?
The systematic component $\beta_0 + \beta_1 x$ equals the conditional expectation of $y$ given $x$:
$$E[y \mid x] = \beta_0 + \beta_1 x$$
This tells us: for any fixed value of $x$, the expected (average) value of $y$ is $\beta_0 + \beta_1 x$. The regression line traces out these conditional means as $x$ varies.
Think of it this way: if we had thousands of houses all with exactly 2,000 square feet, their prices would vary (some higher, some lower). The conditional expectation E[price | sqft = 2000] is the average price among those 2,000 sqft houses. The regression function tells us how this average changes as we consider different house sizes.
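The conditional-mean interpretation can be illustrated with a simulation sketch: generate data from an assumed linear model, bin the observations by $x$, and compare each bin's average $y$ with the linear conditional mean evaluated at the bin's midpoint. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_0, beta_1, sigma = 5.0, 2.0, 1.0          # assumed true values for the simulation
x = rng.uniform(0, 10, 5000)
y = beta_0 + beta_1 * x + rng.normal(0, sigma, x.size)

# Average y within slices of x and compare with the linear conditional mean
edges = np.linspace(0, 10, 11)
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (x >= lo) & (x < hi)
    mid = (lo + hi) / 2
    print(f"x in [{lo:.0f}, {hi:.0f}): mean y = {y[in_bin].mean():.2f}, "
          f"beta_0 + beta_1 * {mid:.1f} = {beta_0 + beta_1 * mid:.2f}")
```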
Deriving the Error Properties
The conditional expectation view implies key properties of the error term. Starting from:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Take the conditional expectation given $x$:
$$E[y \mid x] = E[\beta_0 + \beta_1 x + \varepsilon \mid x]$$

$$\beta_0 + \beta_1 x = \beta_0 + \beta_1 x + E[\varepsilon \mid x]$$
This requires:
$$E[\varepsilon \mid x] = 0$$
This is the zero conditional mean assumption: the expected value of the error, given $x$, is zero. This is one of the most important assumptions in regression analysis.
What Does Zero Conditional Mean Imply?
The condition $E[\varepsilon \mid x] = 0$ means:

- Knowing $x$ tells us nothing about the expected value of the error: at every value of $x$, the errors average out to zero rather than systematically pushing $y$ up or down.
- By the law of iterated expectations, $E[\varepsilon] = 0$ and $\text{Cov}(x, \varepsilon) = 0$: the error is uncorrelated with the predictor.
- All systematic dependence of $y$ on $x$ is captured by $\beta_0 + \beta_1 x$; no omitted factor that varies with $x$ is hiding in the error term.
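A small simulation sketch (with assumed distributions, not a formal diagnostic) illustrates the first point: when the assumption holds, the errors average to roughly zero within any slice of $x$; when the error secretly depends on $x$, the slice averages drift away from zero.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 10000)

eps_good = rng.normal(0, 1, x.size)                  # satisfies E[eps | x] = 0
eps_bad = rng.normal(0, 1, x.size) + 0.5 * (x - 5)   # mean depends on x: assumption violated

for lo in range(0, 10, 2):
    in_slice = (x >= lo) & (x < lo + 2)
    print(f"x in [{lo}, {lo + 2}): "
          f"mean error = {eps_good[in_slice].mean():+.3f}, "
          f"mean 'error' with omitted x-effect = {eps_bad[in_slice].mean():+.3f}")
```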
While simple linear regression involves only one predictor, it's instructive to see how the model can be written in matrix form—the notation that scales naturally to multiple regression.
Scalar Form (What We've Seen)
For each observation $i = 1, \ldots, n$: $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
System of Equations
Writing out all $n$ observations:
$$\begin{aligned} y_1 &= \beta_0 + \beta_1 x_1 + \varepsilon_1 \\ y_2 &= \beta_0 + \beta_1 x_2 + \varepsilon_2 \\ &\vdots \\ y_n &= \beta_0 + \beta_1 x_n + \varepsilon_n \end{aligned}$$
Matrix Form
Define:
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
The $\mathbf{X}$ matrix is called the design matrix. Notice the column of 1's—this corresponds to the intercept term $\beta_0$.
Now the entire system becomes:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
This compact notation is remarkably powerful. It expresses all $n$ equations in a single matrix equation, and as we'll see, it leads to elegant solutions and insights.
Matrix notation isn't just compact—it reveals structure. The coefficient vector β exists in a 2-dimensional parameter space. The predictions Xβ live in a subspace of n-dimensional observation space. This geometric view will be essential when we discuss least squares as projection.
| Object | Notation | Dimension | Interpretation |
|---|---|---|---|
| Response vector | $\mathbf{y}$ | $n \times 1$ | All observed outcomes |
| Design matrix | $\mathbf{X}$ | $n \times 2$ | Predictor values with intercept column |
| Parameter vector | $\boldsymbol{\beta}$ | $2 \times 1$ | Intercept and slope |
| Error vector | $\boldsymbol{\varepsilon}$ | $n \times 1$ | All unobserved errors |
| Fitted values | $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ | $n \times 1$ | All predictions |
| Residual vector | $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ | $n \times 1$ | All residuals |
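A minimal NumPy sketch, using arbitrary example values, of how these objects are assembled and how the single matrix equation reproduces all $n$ scalar equations at once:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])               # n = 4 predictor values (arbitrary)
beta = np.array([5.0, 2.0])                      # (beta_0, beta_1), assumed known here
eps = np.array([0.3, -0.1, 0.2, -0.4])           # illustrative error realizations

X = np.column_stack([np.ones_like(x), x])        # design matrix: column of 1's, then x
y = X @ beta + eps                               # y = X beta + eps: all n equations at once

print(X.shape, beta.shape, eps.shape, y.shape)   # (4, 2) (2,) (4,) (4,)
print(np.allclose(y, beta[0] + beta[1] * x + eps))  # matches the scalar form: True
```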
A useful mental model is to think of the regression equation as capturing how data is generated. This data generating process (DGP) perspective helps clarify assumptions and their consequences.
Thought Experiment: Simulating Data
Imagine you're Nature (or a simulator), and you want to generate regression data:

1. Choose the true parameter values: an intercept $\beta_0$, a slope $\beta_1$, and an error standard deviation $\sigma$.
2. Pick (or observe) the predictor values $x_1, \ldots, x_n$.
3. For each observation, compute the systematic part $\beta_0 + \beta_1 x_i$.
4. Draw a random error $\varepsilon_i$, for example from a normal distribution with mean 0 and standard deviation $\sigma$.
5. Record the observed response $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
The observer (statistician) sees only $(x_i, y_i)$ pairs. They don't know $\beta_0$, $\beta_1$, $\sigma^2$, or the realized $\varepsilon_i$ values. Their task is to recover these from the observed data.
```python
import numpy as np
import matplotlib.pyplot as plt

# True parameters (unknown to the observer)
beta_0_true = 5   # Intercept
beta_1_true = 2   # Slope
sigma_true = 1    # Error standard deviation

# Generate predictor values
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)

# The data generating process
deterministic = beta_0_true + beta_1_true * x   # E[y|x]
errors = np.random.normal(0, sigma_true, n)     # Random component
y = deterministic + errors                      # Observed response

# What the observer sees: (x, y) pairs
# What they want to recover: beta_0, beta_1, sigma

print(f"True parameters: β₀ = {beta_0_true}, β₁ = {beta_1_true}, σ = {sigma_true}")
print(f"Sample size: n = {n}")
print(f"First 5 observations:")
for i in range(5):
    print(f"  x = {x[i]:.2f}, E[y|x] = {deterministic[i]:.2f}, ε = {errors[i]:.2f}, y = {y[i]:.2f}")
```

In some formulations, the xᵢ values are treated as fixed (chosen by the experimenter), while in others they're treated as random (drawn from some distribution). For inference about β₀ and β₁ conditional on X, the distinction often doesn't matter. For unconditional inference or prediction, it can. We'll primarily use the fixed-X framework.
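Continuing the thought experiment, the observer can attempt to recover the unknown parameters from the $(x, y)$ pairs alone. The sketch below uses NumPy's polynomial least squares fit as a black box; the next page derives this estimator from first principles.

```python
# Assumes x, y, and n from the simulation above are still in scope
beta_1_hat, beta_0_hat = np.polyfit(x, y, 1)          # coefficients, highest degree first

residuals = y - (beta_0_hat + beta_1_hat * x)
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))   # s, the estimate of sigma

print(f"Estimates: β̂₀ = {beta_0_hat:.2f}, β̂₁ = {beta_1_hat:.2f}, σ̂ = {sigma_hat:.2f}")
# Close to, but not exactly equal to, the true values 5, 2, and 1
```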
Linear regression predates modern machine learning by over two centuries—Legendre published the first account of least squares in 1805. Yet it remains central to contemporary ML. Understanding this connection helps contextualize what we're learning.
Linear Regression as Supervised Learning
In the ML framework:

- The inputs $x$ are features and the outputs $y$ are targets.
- The hypothesis class is the set of all linear functions $f(x) = \beta_0 + \beta_1 x$.
- The loss function is the squared error $(y - \hat{y})^2$.
- The learning algorithm chooses the hypothesis that minimizes the average loss on the training data, a principle known as empirical risk minimization (ERM).
Ordinary least squares is precisely ERM with the linear hypothesis class and squared loss.
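As a rough sketch of what this means operationally, the code below evaluates the empirical risk (average squared loss) for a coarse grid of candidate lines and keeps the best one; the grid search is purely illustrative, since OLS finds the exact minimizer in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, x.size)   # simulated data with assumed parameters

def empirical_risk(beta_0, beta_1):
    """Average squared loss of the candidate line over the training data."""
    return np.mean((y - (beta_0 + beta_1 * x)) ** 2)

# Evaluate a coarse grid of candidate hypotheses and keep the minimizer
candidates = [(b0, b1) for b0 in np.linspace(0, 10, 41) for b1 in np.linspace(0, 4, 41)]
best_b0, best_b1 = min(candidates, key=lambda c: empirical_risk(*c))
print(f"Best candidate on the grid: β₀ ≈ {best_b0:.2f}, β₁ ≈ {best_b1:.2f}")
```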
Why Both Views Matter
The statistical and ML perspectives are complementary, not contradictory:

- The statistical view emphasizes inference: interpreting $\beta_0$ and $\beta_1$, quantifying uncertainty in the estimates, and testing hypotheses about the underlying data generating process.
- The ML view emphasizes prediction: generalization to unseen data, model selection, and regularization when accuracy matters more than interpretation.
A complete understanding of linear regression draws from both traditions. This module emphasizes the statistical foundations; subsequent modules on regularization and model selection emphasize the ML perspective.
Modern practice often integrates both views. We use statistical theory to understand estimator properties, then use ML methods (cross-validation, regularization) to build better predictive models. The linear regression framework is the common ground where these traditions meet.
We've established the mathematical foundation for simple linear regression. Let's consolidate the key concepts:

- The model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ decomposes each observation into a systematic component $\beta_0 + \beta_1 x_i$ and a random component $\varepsilon_i$.
- Population parameters ($\beta_0$, $\beta_1$, $\sigma^2$) are fixed but unknown; sample estimates ($\hat{\beta}_0$, $\hat{\beta}_1$, $s^2$) are computed from data and vary from sample to sample.
- Errors $\varepsilon_i$ are unobservable deviations from the true regression line; residuals $e_i$ are observable deviations from the fitted line.
- The regression function is the conditional expectation $E[y \mid x] = \beta_0 + \beta_1 x$, which relies on the zero conditional mean assumption $E[\varepsilon \mid x] = 0$.
- In matrix form, all $n$ equations collapse into $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.
- Viewed as supervised learning, OLS is empirical risk minimization with the linear hypothesis class and squared loss.
What's Next
With the model formulated, the natural question is: how do we estimate $\beta_0$ and $\beta_1$ from data? The next page develops the least squares principle—the most fundamental estimation method in regression. We'll derive the OLS estimators by minimizing the sum of squared residuals and explore why this criterion makes sense both geometrically and statistically.
You now understand the formal structure of the simple linear regression model. This foundation is essential for everything that follows—from least squares derivation to geometric interpretation to statistical inference. Next, we'll develop the method for finding the best-fitting line through our data.