Linear regression stands as one of the most fundamental and enduring techniques in all of machine learning and statistics. Despite its apparent simplicity, it forms the conceptual bedrock upon which much of modern predictive modeling is built. Understanding linear regression thoroughly—not just how to use it, but why it works and when it fails—is essential for any serious practitioner of machine learning.
This page introduces the simple linear regression model, the case where we have a single input variable (feature) and seek to predict a continuous output variable (target). While "simple" in name, this model encapsulates the core principles that extend to multiple regression, regularized methods, and even neural networks.
By the end of this page, you will understand the precise mathematical formulation of simple linear regression, the distinction between deterministic and stochastic components, the role of the error term, and how this model connects to the broader framework of supervised learning. You'll develop the conceptual foundation necessary for deriving the least squares solution in subsequent pages.
At its heart, regression addresses a fundamental question in data analysis: Given observations of two related variables, how can we predict one from the other?
Consider two concrete examples:

- Predicting a house's sale price from its square footage.
- Predicting a student's test score from the number of hours studied.
In each case, we observe pairs of values $(x_i, y_i)$ where $x$ is the independent variable (predictor, feature, explanatory variable) and $y$ is the dependent variable (response, target, outcome). Our goal is to discover a relationship that allows prediction of $y$ for new, unseen values of $x$.
Different fields use different terminology. Statisticians often say 'independent/dependent variables' or 'explanatory/response.' Machine learning practitioners prefer 'features/targets' or 'inputs/outputs.' Econometricians say 'regressors/regressand.' These all refer to the same mathematical objects. We'll use these terms interchangeably, favoring the most natural choice for each context.
Why Linear?
The simplest possible assumption about the relationship between $x$ and $y$ is that it's linear—that $y$ changes proportionally with $x$. This means we're looking for a straight line that best describes the relationship.
This assumption might seem restrictive, but it turns out to be remarkably powerful: linear models are easy to interpret (the slope has a direct meaning), they admit exact closed-form solutions, they often provide good local approximations to smooth nonlinear relationships, and they are the building blocks of more flexible methods. The table below situates simple linear regression within the general supervised learning framework:
| Component | Symbol | Description | Example |
|---|---|---|---|
| Input space | $\mathcal{X}$ | Set of possible input values | $\mathcal{X} = \mathbb{R}$ (house sizes) |
| Output space | $\mathcal{Y}$ | Set of possible output values | $\mathcal{Y} = \mathbb{R}$ (house prices) |
| Training data | $\{(x_i, y_i)\}_{i=1}^n$ | Observed input-output pairs | 100 house sales records |
| Hypothesis class | $\mathcal{H}$ | Set of candidate functions | All linear functions $f(x) = \beta_0 + \beta_1 x$ |
| Loss function | $L(y, \hat{y})$ | Measures prediction error | Squared error $(y - \hat{y})^2$ |
| Learning algorithm | — | Procedure to find best hypothesis | Least squares (OLS) |
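To make the framework concrete, here is a minimal Python sketch of how each component might look in code; the data values and the candidate coefficients are made up purely for illustration.

```python
# Training data: observed (x, y) pairs, e.g. house sizes (sqft) and prices (in $1000s)
x_train = [1400, 1600, 1700, 1875, 2350]
y_train = [245, 312, 279, 308, 405]

# Hypothesis class: all linear functions f(x) = beta_0 + beta_1 * x
def linear_hypothesis(x, beta_0, beta_1):
    return beta_0 + beta_1 * x

# Loss function: squared error between an observed y and a prediction y_hat
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

# Evaluate one (arbitrarily chosen) candidate hypothesis on the training data
predictions = [linear_hypothesis(x, 20.0, 0.15) for x in x_train]
losses = [squared_loss(y, y_hat) for y, y_hat in zip(y_train, predictions)]
print(losses)
```

The learning algorithm (the last row of the table) is the procedure that searches the hypothesis class for the coefficients with the smallest total loss; that procedure is the subject of the next page.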
The simple linear regression model posits that the true relationship between $x$ and $y$ takes the form:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
This equation is deceptively simple, but each component carries deep meaning: $y_i$ is the observed response for the $i$-th observation, $x_i$ is the corresponding predictor value, $\beta_0$ is the intercept (the expected value of $y$ when $x = 0$), $\beta_1$ is the slope (the expected change in $y$ for a one-unit increase in $x$), and $\varepsilon_i$ is the error term. Let's look at the error term more closely.
The error term εᵢ is not just statistical noise to be ignored—it's a fundamental part of the model that acknowledges reality: no two houses with the same square footage sell for exactly the same price. The error term captures measurement error, omitted variables, inherent randomness, and all other factors affecting y that aren't included in our model.
Deterministic vs. Stochastic Components
The model equation can be rewritten to highlight a crucial conceptual distinction:
$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{systematic component}} + \underbrace{\varepsilon_i}_{\text{random component}}$$
The systematic component $\beta_0 + \beta_1 x_i$ represents the deterministic part—the expected or average value of $y$ given $x$. If we knew $\beta_0$ and $\beta_1$ exactly, this would be our best prediction.
The random component $\varepsilon_i$ represents the stochastic part—the unpredictable deviation of any individual observation from the expected value. This is inherently uncertain and differs for each data point.
This decomposition is fundamental: we can model and estimate the systematic component, but we can never eliminate the random component—we can only characterize its statistical properties.
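As a quick illustration with made-up numbers: if $\beta_0 = 5$, $\beta_1 = 2$, and $x_i = 3$, the systematic component is $5 + 2 \cdot 3 = 11$; if the realized error happens to be $\varepsilon_i = 0.4$, the observed value is $y_i = 11 + 0.4 = 11.4$. A second observation with the same $x_i$ would share the systematic part of 11 but carry a different error, and hence a different $y_i$.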
A critical distinction in statistical modeling—one that sometimes confuses newcomers—is between population parameters and sample estimates. Understanding this distinction is essential for interpreting regression results correctly.
Population Parameters (True but Unknown)
The quantities $\beta_0$, $\beta_1$, and the distribution of $\varepsilon$ describe the true underlying relationship in the entire population. If we could somehow observe every possible $(x, y)$ pair in the universe (every house ever sold, every student ever tested), we would know these parameters exactly.
But we can't. We have only a finite sample of $n$ observations: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$.
Sample Estimates (Computed from Data)
From our sample, we compute estimates of the population parameters, denoted with hats:
$$\hat{\beta}_0, \quad \hat{\beta}_1$$
These estimates are our best guesses based on available data. They will differ from the true parameters, and they would differ if we collected a different sample. This variability is the subject of statistical inference.
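As a sketch of this sampling variability, the simulation below repeatedly generates data from an assumed "true" line and refits it each time, using NumPy's polynomial least squares fit as a black box (the estimator itself is derived on the next page); all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_0_true, beta_1_true, sigma = 5.0, 2.0, 1.0   # assumed "population" values
n, n_samples = 100, 1000

slope_estimates = []
for _ in range(n_samples):
    x = rng.uniform(0, 10, n)
    y = beta_0_true + beta_1_true * x + rng.normal(0, sigma, n)
    beta_1_hat = np.polyfit(x, y, 1)[0]   # fitted slope for this particular sample
    slope_estimates.append(beta_1_hat)

# The estimates cluster around the true slope but differ from sample to sample
print(f"mean of slope estimates: {np.mean(slope_estimates):.3f}")
print(f"std  of slope estimates: {np.std(slope_estimates):.3f}")
```

The spread of the slope estimates around the true value is exactly the sampling variability that statistical inference quantifies.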
| Concept | Population (True) | Sample (Estimated) | Relationship |
|---|---|---|---|
| Intercept | $\beta_0$ | $\hat{\beta}_0$ | $\hat{\beta}_0$ estimates $\beta_0$ |
| Slope | $\beta_1$ | $\hat{\beta}_1$ | $\hat{\beta}_1$ estimates $\beta_1$ |
| Predicted value | $E[y|x] = \beta_0 + \beta_1 x$ | $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ | $\hat{y}$ estimates $E[y|x]$ |
| Error term | $\varepsilon_i = y_i - (\beta_0 + \beta_1 x_i)$ | $e_i = y_i - \hat{y}_i$ | $e_i$ (residual) estimates $\varepsilon_i$ |
| Error variance | $\sigma^2 = \text{Var}(\varepsilon)$ | $s^2 = \frac{1}{n-2}\sum e_i^2$ | $s^2$ estimates $\sigma^2$ |
The terms 'residual' and 'error' are often used interchangeably in casual discussion, but they have distinct technical meanings. The error εᵢ is the true deviation from the population regression line (unknown). The residual eᵢ is the observed deviation from the fitted line (computed from data). Residuals approximate errors but are not identical to them.
The Fitted Model
Once we estimate $\hat{\beta}_0$ and $\hat{\beta}_1$ from our sample, we obtain the fitted regression line:
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
This is our predictive model. For any value $x$, we can compute a predicted value $\hat{y}$. The residuals are the differences between observed and predicted values:
$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$
Residuals are observable (we can compute them), while errors are not (they involve unknown true parameters). Much of regression diagnostics involves analyzing residuals as proxies for the unobservable errors.
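As a minimal sketch, assuming we already have estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ from some fitting procedure (the made-up values below stand in for them), the fitted values, residuals, and error-variance estimate from the table above are computed as:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.2, 8.9, 11.1, 13.2, 14.8])

# Hypothetical estimates standing in for the output of a fitting procedure
beta_0_hat, beta_1_hat = 5.1, 1.95

y_hat = beta_0_hat + beta_1_hat * x        # fitted values
e = y - y_hat                              # residuals (observable)
n = len(y)
s_squared = np.sum(e**2) / (n - 2)         # estimates the error variance sigma^2

print("fitted values:", y_hat)
print("residuals:    ", e)
print("s^2:          ", s_squared)
```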
A powerful way to understand the regression model is through the lens of conditional expectation. This perspective connects regression to fundamental concepts in probability theory and provides deeper insight into what the model actually represents.
What Does the Regression Function Represent?
The systematic component $\beta_0 + \beta_1 x$ equals the conditional expectation of $y$ given $x$:
$$E[y \mid x] = \beta_0 + \beta_1 x$$
This tells us: for any fixed value of $x$, the expected (average) value of $y$ is $\beta_0 + \beta_1 x$. The regression line traces out these conditional means as $x$ varies.
Think of it this way: if we had thousands of houses all with exactly 2,000 square feet, their prices would vary (some higher, some lower). The conditional expectation E[price | sqft = 2000] is the average price among those 2,000 sqft houses. The regression function tells us how this average changes as we consider different house sizes.
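The conditional-mean interpretation can be illustrated with a simulation sketch: generate data from an assumed linear model, bin the observations by $x$, and compare each bin's average $y$ with the linear conditional mean evaluated at the bin's midpoint. The parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
beta_0, beta_1, sigma = 5.0, 2.0, 1.0          # assumed true values for the simulation
x = rng.uniform(0, 10, 5000)
y = beta_0 + beta_1 * x + rng.normal(0, sigma, x.size)

# Average y within slices of x and compare with the linear conditional mean
edges = np.linspace(0, 10, 11)
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (x >= lo) & (x < hi)
    mid = (lo + hi) / 2
    print(f"x in [{lo:.0f}, {hi:.0f}): mean y = {y[in_bin].mean():.2f}, "
          f"beta_0 + beta_1 * {mid:.1f} = {beta_0 + beta_1 * mid:.2f}")
```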
Deriving the Error Properties
The conditional expectation view implies key properties of the error term. Starting from:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Take the conditional expectation given $x$:
$$E[y \mid x] = E[\beta_0 + \beta_1 x + \varepsilon \mid x]$$

$$\beta_0 + \beta_1 x = \beta_0 + \beta_1 x + E[\varepsilon \mid x]$$
This requires:
$$E[\varepsilon \mid x] = 0$$
This is the zero conditional mean assumption: the expected value of the error, given $x$, is zero. This is one of the most important assumptions in regression analysis.
What Does Zero Conditional Mean Imply?
The condition $E[\varepsilon \mid x] = 0$ means:

- Knowing $x$ tells us nothing about the expected value of the error: at every value of $x$, the errors average out to zero rather than systematically pushing $y$ up or down.
- By the law of iterated expectations, $E[\varepsilon] = 0$ and $\text{Cov}(x, \varepsilon) = 0$: the error is uncorrelated with the predictor.
- All systematic dependence of $y$ on $x$ is captured by $\beta_0 + \beta_1 x$; no omitted factor that varies with $x$ is hiding in the error term.
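A small simulation sketch (with assumed distributions, not a formal diagnostic) illustrates the first point: when the assumption holds, the errors average to roughly zero within any slice of $x$; when the error secretly depends on $x$, the slice averages drift away from zero.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 10000)

eps_good = rng.normal(0, 1, x.size)                  # satisfies E[eps | x] = 0
eps_bad = rng.normal(0, 1, x.size) + 0.5 * (x - 5)   # mean depends on x: assumption violated

for lo in range(0, 10, 2):
    in_slice = (x >= lo) & (x < lo + 2)
    print(f"x in [{lo}, {lo + 2}): "
          f"mean error = {eps_good[in_slice].mean():+.3f}, "
          f"mean 'error' with omitted x-effect = {eps_bad[in_slice].mean():+.3f}")
```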
While simple linear regression involves only one predictor, it's instructive to see how the model can be written in matrix form—the notation that scales naturally to multiple regression.
Scalar Form (What We've Seen)
For each observation $i = 1, \ldots, n$: $$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
System of Equations
Writing out all $n$ observations:
$$\begin{aligned} y_1 &= \beta_0 + \beta_1 x_1 + \varepsilon_1 \\ y_2 &= \beta_0 + \beta_1 x_2 + \varepsilon_2 \\ &\vdots \\ y_n &= \beta_0 + \beta_1 x_n + \varepsilon_n \end{aligned}$$
Matrix Form
Define:
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad \mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}, \quad \boldsymbol{\varepsilon} = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$$
The $\mathbf{X}$ matrix is called the design matrix. Notice the column of 1's—this corresponds to the intercept term $\beta_0$.
Now the entire system becomes:
$$\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
This compact notation is remarkably powerful. It expresses all $n$ equations in a single matrix equation, and as we'll see, it leads to elegant solutions and insights.
Matrix notation isn't just compact—it reveals structure. The coefficient vector β exists in a 2-dimensional parameter space. The predictions Xβ live in a subspace of n-dimensional observation space. This geometric view will be essential when we discuss least squares as projection.
| Object | Notation | Dimension | Interpretation |
|---|---|---|---|
| Response vector | $\mathbf{y}$ | $n \times 1$ | All observed outcomes |
| Design matrix | $\mathbf{X}$ | $n \times 2$ | Predictor values with intercept column |
| Parameter vector | $\boldsymbol{\beta}$ | $2 \times 1$ | Intercept and slope |
| Error vector | $\boldsymbol{\varepsilon}$ | $n \times 1$ | All unobserved errors |
| Fitted values | $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ | $n \times 1$ | All predictions |
| Residual vector | $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}$ | $n \times 1$ | All residuals |
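A minimal NumPy sketch, using arbitrary example values, of how these objects are assembled and how the single matrix equation reproduces all $n$ scalar equations at once:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])               # n = 4 predictor values (arbitrary)
beta = np.array([5.0, 2.0])                      # (beta_0, beta_1), assumed known here
eps = np.array([0.3, -0.1, 0.2, -0.4])           # illustrative error realizations

X = np.column_stack([np.ones_like(x), x])        # design matrix: column of 1's, then x
y = X @ beta + eps                               # y = X beta + eps: all n equations at once

print(X.shape, beta.shape, eps.shape, y.shape)   # (4, 2) (2,) (4,) (4,)
print(np.allclose(y, beta[0] + beta[1] * x + eps))  # matches the scalar form: True
```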
A useful mental model is to think of the regression equation as capturing how data is generated. This data generating process (DGP) perspective helps clarify assumptions and their consequences.
Thought Experiment: Simulating Data
Imagine you're Nature (or a simulator), and you want to generate regression data:

1. Choose the true parameter values: an intercept $\beta_0$, a slope $\beta_1$, and an error standard deviation $\sigma$.
2. Pick (or observe) the predictor values $x_1, \ldots, x_n$.
3. For each observation, compute the systematic part $\beta_0 + \beta_1 x_i$.
4. Draw a random error $\varepsilon_i$, for example from a normal distribution with mean 0 and standard deviation $\sigma$.
5. Record the observed response $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
The observer (statistician) sees only $(x_i, y_i)$ pairs. They don't know $\beta_0$, $\beta_1$, $\sigma^2$, or the realized $\varepsilon_i$ values. Their task is to recover these from the observed data.
```python
import numpy as np
import matplotlib.pyplot as plt

# True parameters (unknown to the observer)
beta_0_true = 5   # Intercept
beta_1_true = 2   # Slope
sigma_true = 1    # Error standard deviation

# Generate predictor values
np.random.seed(42)
n = 100
x = np.random.uniform(0, 10, n)

# The data generating process
deterministic = beta_0_true + beta_1_true * x   # E[y|x]
errors = np.random.normal(0, sigma_true, n)     # Random component
y = deterministic + errors                      # Observed response

# What the observer sees: (x, y) pairs
# What they want to recover: beta_0, beta_1, sigma

print(f"True parameters: β₀ = {beta_0_true}, β₁ = {beta_1_true}, σ = {sigma_true}")
print(f"Sample size: n = {n}")
print(f"First 5 observations:")
for i in range(5):
    print(f"  x = {x[i]:.2f}, E[y|x] = {deterministic[i]:.2f}, ε = {errors[i]:.2f}, y = {y[i]:.2f}")
```

In some formulations, the xᵢ values are treated as fixed (chosen by the experimenter), while in others they're treated as random (drawn from some distribution). For inference about β₀ and β₁ conditional on X, the distinction often doesn't matter. For unconditional inference or prediction, it can. We'll primarily use the fixed-X framework.
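Continuing the thought experiment, the observer can attempt to recover the unknown parameters from the $(x, y)$ pairs alone. The sketch below uses NumPy's polynomial least squares fit as a black box; the next page derives this estimator from first principles.

```python
# Assumes x, y, and n from the simulation above are still in scope
beta_1_hat, beta_0_hat = np.polyfit(x, y, 1)          # coefficients, highest degree first

residuals = y - (beta_0_hat + beta_1_hat * x)
sigma_hat = np.sqrt(np.sum(residuals**2) / (n - 2))   # s, the estimate of sigma

print(f"Estimates: β̂₀ = {beta_0_hat:.2f}, β̂₁ = {beta_1_hat:.2f}, σ̂ = {sigma_hat:.2f}")
# Close to, but not exactly equal to, the true values 5, 2, and 1
```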
Linear regression predates modern machine learning by over two centuries—Legendre published the first account of least squares in 1805. Yet it remains central to contemporary ML. Understanding this connection helps contextualize what we're learning.
Linear Regression as Supervised Learning
In the ML framework:

- The inputs $x$ are features and the outputs $y$ are targets.
- The hypothesis class is the set of all linear functions $f(x) = \beta_0 + \beta_1 x$.
- The loss function is the squared error $(y - \hat{y})^2$.
- The learning algorithm chooses the hypothesis that minimizes the average loss on the training data, a principle known as empirical risk minimization (ERM).
Ordinary least squares is precisely ERM with the linear hypothesis class and squared loss.
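As a rough sketch of what this means operationally, the code below evaluates the empirical risk (average squared loss) for a coarse grid of candidate lines and keeps the best one; the grid search is purely illustrative, since OLS finds the exact minimizer in closed form.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, x.size)   # simulated data with assumed parameters

def empirical_risk(beta_0, beta_1):
    """Average squared loss of the candidate line over the training data."""
    return np.mean((y - (beta_0 + beta_1 * x)) ** 2)

# Evaluate a coarse grid of candidate hypotheses and keep the minimizer
candidates = [(b0, b1) for b0 in np.linspace(0, 10, 41) for b1 in np.linspace(0, 4, 41)]
best_b0, best_b1 = min(candidates, key=lambda c: empirical_risk(*c))
print(f"Best candidate on the grid: β₀ ≈ {best_b0:.2f}, β₁ ≈ {best_b1:.2f}")
```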
Why Both Views Matter
The statistical and ML perspectives are complementary, not contradictory:

- The statistical view emphasizes inference: interpreting $\beta_0$ and $\beta_1$, quantifying uncertainty in the estimates, and testing hypotheses about the underlying data generating process.
- The ML view emphasizes prediction: generalization to unseen data, model selection, and regularization when accuracy matters more than interpretation.
A complete understanding of linear regression draws from both traditions. This module emphasizes the statistical foundations; subsequent modules on regularization and model selection emphasize the ML perspective.
Modern practice often integrates both views. We use statistical theory to understand estimator properties, then use ML methods (cross-validation, regularization) to build better predictive models. The linear regression framework is the common ground where these traditions meet.
We've established the mathematical foundation for simple linear regression. Let's consolidate the key concepts:

- The model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ decomposes each observation into a systematic component $\beta_0 + \beta_1 x_i$ and a random component $\varepsilon_i$.
- Population parameters ($\beta_0$, $\beta_1$, $\sigma^2$) are fixed but unknown; sample estimates ($\hat{\beta}_0$, $\hat{\beta}_1$, $s^2$) are computed from data and vary from sample to sample.
- Errors $\varepsilon_i$ are unobservable deviations from the true regression line; residuals $e_i$ are observable deviations from the fitted line.
- The regression function is the conditional expectation $E[y \mid x] = \beta_0 + \beta_1 x$, which relies on the zero conditional mean assumption $E[\varepsilon \mid x] = 0$.
- In matrix form, all $n$ equations collapse into $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.
- Viewed as supervised learning, OLS is empirical risk minimization with the linear hypothesis class and squared loss.
What's Next
With the model formulated, the natural question is: how do we estimate $\beta_0$ and $\beta_1$ from data? The next page develops the least squares principle—the most fundamental estimation method in regression. We'll derive the OLS estimators by minimizing the sum of squared residuals and explore why this criterion makes sense both geometrically and statistically.
You now understand the formal structure of the simple linear regression model. This foundation is essential for everything that follows—from least squares derivation to geometric interpretation to statistical inference. Next, we'll develop the method for finding the best-fitting line through our data.