Given our model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and observed data ${(x_i, y_i)}_{i=1}^n$, we face a fundamental question: How do we find the "best" values for $\beta_0$ and $\beta_1$?
The answer depends on defining what "best" means. The method of least squares—first published by Legendre in 1805 and independently developed by Gauss around the same time—provides an elegant and principled answer that has stood the test of two centuries.
This page derives the least squares solution from scratch, revealing not just the formulas but the reasoning behind them.
By the end of this page, you will be able to derive the OLS estimators β̂₀ and β̂₁ using calculus, understand why squared errors are used, manipulate the optimization problem algebraically, and connect the derivation to numerical computing considerations.
Our goal is to find values $\hat{\beta}_0$ and $\hat{\beta}_1$ that make our predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ as close as possible to the observed values $y_i$.
The residual for observation $i$ is the prediction error:
$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$
We want all residuals to be small. But how do we aggregate $n$ individual residuals into a single measure of overall fit?
The least squares method minimizes the residual sum of squares (RSS), also called the sum of squared errors (SSE):
$$\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n [y_i - (\beta_0 + \beta_1 x_i)]^2$$
This is a function of two unknowns ($\beta_0$ and $\beta_1$). Our task is to find the values that minimize it:
$$\hat{\beta}_0, \hat{\beta}_1 = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i]^2$$
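To make the objective concrete, here is a minimal sketch (the small dataset and the candidate coefficient pairs are arbitrary, chosen only for illustration) that evaluates the RSS for a few candidate lines; least squares picks the pair with the smallest value:

```python
import numpy as np

# Arbitrary illustrative dataset (not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

def rss(beta_0, beta_1):
    """Residual sum of squares for a candidate line."""
    return np.sum((y - (beta_0 + beta_1 * x)) ** 2)

# Different candidate lines give different RSS values;
# least squares chooses the pair that minimizes this quantity.
for b0, b1 in [(0.0, 1.0), (0.0, 2.0), (0.5, 1.9)]:
    print(f"β0={b0}, β1={b1}: RSS = {rss(b0, b1):.3f}")
```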
Beyond mathematical convenience (differentiability), squared errors have deeper justifications: (1) Under Gaussian error assumptions, least squares equals maximum likelihood estimation. (2) Among all linear unbiased estimators, OLS has minimum variance (Gauss-Markov theorem). (3) Squared loss corresponds to minimizing mean squared error in prediction. (4) RSS relates to variance decomposition and R-squared.
We have an unconstrained optimization problem in two variables. Define the objective function:
$$S(\beta_0, \beta_1) = \sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i]^2$$
This is a convex quadratic function of $(\beta_0, \beta_1)$. Because it's quadratic with positive leading coefficients (think: bowl-shaped surface opening upward), it has a unique global minimum that we can find by setting the first-order derivatives to zero.
Visualizing the Objective
Imagine a 3D surface where the horizontal plane has coordinates $(\beta_0, \beta_1)$ and the vertical axis represents $S(\beta_0, \beta_1)$. The surface is a paraboloid—a bowl shape. The minimum is at the bottom of the bowl.
First-Order Necessary Conditions
At the minimum, the gradient must be zero:
$$\nabla S = \begin{pmatrix} \frac{\partial S}{\partial \beta_0} \\ \frac{\partial S}{\partial \beta_1} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Let's compute each partial derivative.
Partial Derivative with Respect to $\beta_0$
Using the chain rule:
$$\frac{\partial S}{\partial \beta_0} = \sum_{i=1}^n 2[y_i - \beta_0 - \beta_1 x_i] \cdot (-1) = -2\sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i]$$
Setting equal to zero and dividing by $-2$:
$$\sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i] = 0$$
Expanding the sum:
$$\sum_{i=1}^n y_i - n\beta_0 - \beta_1 \sum_{i=1}^n x_i = 0$$
Dividing by $n$ and using the notation $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ (sample mean of $y$) and $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ (sample mean of $x$):
$$\bar{y} - \beta_0 - \beta_1 \bar{x} = 0$$
This gives us the first normal equation:
$$\boxed{\beta_0 = \bar{y} - \beta_1 \bar{x}}$$
This result has a beautiful geometric interpretation: the fitted regression line always passes through the point (x̄, ȳ)—the centroid (center of mass) of the data. Substituting x = x̄ into the fitted line ŷ = β̂₀ + β̂₁x gives ŷ = ȳ.
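A quick numerical check of the centroid property, using NumPy's `polyfit` as a stand-in for the estimators we finish deriving below (the synthetic data here is assumed purely for illustration):

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 + 1.5 * x + rng.normal(0, 1.0, 50)

# np.polyfit(deg=1) returns [slope, intercept] for the least squares line.
beta_1_hat, beta_0_hat = np.polyfit(x, y, deg=1)

# The fitted line evaluated at x̄ reproduces ȳ (up to rounding).
print(beta_0_hat + beta_1_hat * np.mean(x))
print(np.mean(y))
```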
Partial Derivative with Respect to $\beta_1$
$$\frac{\partial S}{\partial \beta_1} = \sum_{i=1}^n 2[y_i - \beta_0 - \beta_1 x_i] \cdot (-x_i) = -2\sum_{i=1}^n x_i[y_i - \beta_0 - \beta_1 x_i]$$
Setting equal to zero and dividing by $-2$:
$$\sum_{i=1}^n x_i[y_i - \beta_0 - \beta_1 x_i] = 0$$
Expanding:
$$\sum_{i=1}^n x_i y_i - \beta_0 \sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2 = 0$$
This is the second normal equation.
Solving for $\beta_1$
Substitute $\beta_0 = \bar{y} - \beta_1 \bar{x}$ into the second normal equation:
$$\sum_{i=1}^n x_i y_i - (\bar{y} - \beta_1 \bar{x}) \sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2 = 0$$
Note that $\sum_{i=1}^n x_i = n\bar{x}$, so:
$$\sum_{i=1}^n x_i y_i - \bar{y}(n\bar{x}) + \beta_1 \bar{x}(n\bar{x}) - \beta_1 \sum_{i=1}^n x_i^2 = 0$$
Rearranging:
$$\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = \beta_1 \left( \sum_{i=1}^n x_i^2 - n\bar{x}^2 \right)$$
Solving for $\beta_1$:
$$\boxed{\beta_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}}$$
The numerator and denominator have familiar statistical interpretations: the numerator is proportional to the sample covariance between x and y, and the denominator is proportional to the sample variance of x (the same proportionality factor appears in both and cancels). This connection will become clearer shortly.
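As a sketch of the formulas so far, the following computes $\hat{\beta}_1$ from the boxed raw-score expression and $\hat{\beta}_0$ from the first normal equation, on an arbitrary toy dataset:

```python
import numpy as np

# Arbitrary toy dataset, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# β1 from the raw-score expression just derived
beta_1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)

# β0 from the first normal equation
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)
```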
Let's express the OLS estimators in their most common forms. Using centered notation, define:
$$S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$ $$S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$$ $$S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2$$
It can be shown (by expanding) that:
$$\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = S_{xy}$$ $$\sum_{i=1}^n x_i^2 - n\bar{x}^2 = S_{xx}$$
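These identities are easy to confirm numerically; the sketch below does so on synthetic data (the particular values are assumptions for illustration only):

```python
import numpy as np

# Synthetic data with x deliberately far from zero.
rng = np.random.default_rng(1)
x = rng.normal(50.0, 5.0, 200)
y = 1.0 + 0.3 * x + rng.normal(0, 1.0, 200)
n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Raw-score expressions vs. centered sums
lhs_xy = np.sum(x * y) - n * x_bar * y_bar
rhs_xy = np.sum((x - x_bar) * (y - y_bar))   # S_xy

lhs_xx = np.sum(x ** 2) - n * x_bar ** 2
rhs_xx = np.sum((x - x_bar) ** 2)            # S_xx

print(np.isclose(lhs_xy, rhs_xy), np.isclose(lhs_xx, rhs_xx))  # True True
```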
The OLS Estimators
The slope estimator:
$$\boxed{\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
The intercept estimator:
$$\boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}}$$
These are the ordinary least squares (OLS) estimators for simple linear regression.
| Form | Expression | When Useful |
|---|---|---|
| Deviation form | $\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ | Conceptual clarity, numerical stability |
| Raw score form | $\frac{\sum_i x_i y_i - n \bar{x} \bar{y}}{\sum_i x_i^2 - n \bar{x}^2}$ | Streaming/incremental computation |
| Covariance form | $\frac{\text{Cov}(x, y)}{\text{Var}(x)} = \frac{s_{xy}}{s_x^2}$ | Statistical interpretation |
| Correlation form | $r_{xy} \cdot \frac{s_y}{s_x}$ | Standardized variables, R-squared connection |
The expression β̂₁ = Cov(x,y)/Var(x) reveals the slope as measuring how much y varies with x (covariance) relative to how much x varies by itself (variance). If x has high variance but low covariance with y, the slope is small. If Cov(x,y) is large relative to Var(x), the slope is steep.
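Here is a brief sketch checking that the covariance form and the deviation form give the same slope (synthetic data; the $n-1$ convention is used in both numerator and denominator, so the divisors cancel):

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 4.0 - 0.7 * x + rng.normal(0, 1.0, 100)

# Covariance form: sample covariance over sample variance (both with n-1)
cov_xy = np.cov(x, y)[0, 1]
var_x = np.var(x, ddof=1)
slope_cov_form = cov_xy / var_x

# Deviation form: Sxy / Sxx
slope_deviation_form = (np.sum((x - x.mean()) * (y - y.mean()))
                        / np.sum((x - x.mean()) ** 2))

print(slope_cov_form, slope_deviation_form)  # should agree
```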
The first-order conditions we derived are called the normal equations. Written together:
$$\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$ $$\sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
These can be rewritten using residuals $e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$:
$$\sum_{i=1}^n e_i = 0$$ $$\sum_{i=1}^n x_i e_i = 0$$
These equations express fundamental properties of OLS residuals.
Matrix Form of Normal Equations
In matrix notation with $\mathbf{X} = [\mathbf{1} | \mathbf{x}]$ (design matrix with intercept column), the normal equations become:
$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$
This says the residual vector $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is orthogonal to every column of $\mathbf{X}$: orthogonality to the intercept column $\mathbf{1}$ gives $\sum_i e_i = 0$, and orthogonality to the predictor column $\mathbf{x}$ gives $\sum_i x_i e_i = 0$.
Rearranging the normal equations:
$$\mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}}$$
If $\mathbf{X}^T\mathbf{X}$ is invertible:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
This is the famous closed-form solution for OLS in matrix form.
The term 'normal' here means perpendicular/orthogonal, not usual/typical. The normal equations express the condition that residuals are normal (perpendicular) to the column space of X. Don't confuse this with the normality assumption about error distributions—that's a separate concept.
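A minimal sketch of the matrix route, on assumed synthetic data: build the design matrix, solve the normal equations, and verify that the residual vector is orthogonal to the columns of $\mathbf{X}$. We solve $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$ directly rather than forming the explicit inverse:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])      # design matrix [1 | x]

# Solve the normal equations XᵀX β = Xᵀy.
# (Production code would typically use QR or lstsq instead.)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("β̂ =", beta_hat)                    # [intercept, slope]

# Orthogonality of residuals to the columns of X
e = y - X @ beta_hat
print("Xᵀe =", X.T @ e)                    # ≈ [0, 0]
```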
We found a critical point by setting the gradient to zero. How do we know it's a minimum and not a maximum or saddle point?
The Hessian Matrix
We need to check the second-order conditions using the Hessian matrix of second partial derivatives:
$$H = \begin{pmatrix} \frac{\partial^2 S}{\partial \beta_0^2} & \frac{\partial^2 S}{\partial \beta_0 \partial \beta_1} \\ \frac{\partial^2 S}{\partial \beta_1 \partial \beta_0} & \frac{\partial^2 S}{\partial \beta_1^2} \end{pmatrix}$$
Computing each term from $S = \sum_i [y_i - \beta_0 - \beta_1 x_i]^2$:
$$\frac{\partial^2 S}{\partial \beta_0^2} = 2n$$
$$\frac{\partial^2 S}{\partial \beta_1^2} = 2\sum_{i=1}^n x_i^2$$
$$\frac{\partial^2 S}{\partial \beta_0 \partial \beta_1} = 2\sum_{i=1}^n x_i = 2n\bar{x}$$
So the Hessian is:
$$H = 2\begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum_i x_i^2 \end{pmatrix}$$
Positive Definiteness Check
For a minimum, $H$ must be positive definite. We check:
First principal minor: $2n > 0$ ✓ (assuming $n > 0$)
Second principal minor (determinant):
$$\det(H/2) = n \sum_i x_i^2 - n^2 \bar{x}^2 = n \left( \sum_i x_i^2 - n\bar{x}^2 \right) = n \cdot S_{xx}$$
This is positive as long as $S_{xx} > 0$, which holds whenever not all $x_i$ are identical.
Conclusion: The critical point is a global minimum (since the quadratic is convex). OLS gives the unique solution that minimizes RSS.
The only pathological case is when Sₓₓ = 0, meaning all xᵢ values are identical. In this case, the data provides no variation in x, so we cannot estimate a slope—any line through (x̄, ȳ) fits equally well. The design matrix X becomes rank-deficient, and (XᵀX)⁻¹ doesn't exist.
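The second-order check is easy to reproduce numerically. The sketch below (on assumed synthetic data) builds the Hessian from the formulas above, confirms its eigenvalues are positive, and shows the degenerate constant-$x$ case:

```python
import numpy as np

# Synthetic predictor with variation in x.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
n = len(x)

# Hessian from the derivation: H = 2 [[n, n x̄], [n x̄, Σ x²]]
H = 2 * np.array([[n, n * x.mean()],
                  [n * x.mean(), np.sum(x ** 2)]])

print(np.linalg.eigvalsh(H))     # both eigenvalues positive => positive definite

# Degenerate case: constant x gives a singular (rank-1) Hessian
x_const = np.full(100, 3.0)
H_bad = 2 * np.array([[100, 100 * 3.0],
                      [100 * 3.0, np.sum(x_const ** 2)]])
print(np.linalg.eigvalsh(H_bad)) # one eigenvalue is (numerically) zero
```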
Let's implement the OLS formulas and verify our understanding. The implementation illustrates both the conceptual formula and practical considerations.
```python
import numpy as np


def simple_ols(x, y):
    """
    Compute OLS estimates for simple linear regression.

    This implementation follows the formulas we derived:
    - β₁ = Sxy / Sxx
    - β₀ = ȳ - β₁x̄

    Parameters:
        x: ndarray of shape (n,) - predictor variable
        y: ndarray of shape (n,) - response variable

    Returns:
        beta_0: float - intercept estimate
        beta_1: float - slope estimate
    """
    n = len(x)

    # Sample means
    x_bar = np.mean(x)
    y_bar = np.mean(y)

    # Centered sums (deviation form - more numerically stable)
    Sxy = np.sum((x - x_bar) * (y - y_bar))
    Sxx = np.sum((x - x_bar) ** 2)

    # Check for degenerate case
    if Sxx == 0:
        raise ValueError("All x values are identical; slope is undefined")

    # OLS estimates
    beta_1 = Sxy / Sxx
    beta_0 = y_bar - beta_1 * x_bar

    return beta_0, beta_1


# Example: Generate data and fit
np.random.seed(42)
n = 100

# True parameters
beta_0_true = 5.0
beta_1_true = 2.0
sigma = 1.0

# Generate data
x = np.random.uniform(0, 10, n)
epsilon = np.random.normal(0, sigma, n)
y = beta_0_true + beta_1_true * x + epsilon

# Fit OLS
beta_0_hat, beta_1_hat = simple_ols(x, y)

print("True parameters:")
print(f"  β₀ = {beta_0_true:.4f}")
print(f"  β₁ = {beta_1_true:.4f}")
print()
print("OLS estimates:")
print(f"  β̂₀ = {beta_0_hat:.4f}")
print(f"  β̂₁ = {beta_1_hat:.4f}")

# Verify residual properties
y_hat = beta_0_hat + beta_1_hat * x
residuals = y - y_hat

print()
print("Residual properties (should be ~0):")
print(f"  Sum of residuals: {np.sum(residuals):.2e}")
print(f"  Sum of x*residuals: {np.sum(x * residuals):.2e}")
print(f"  Mean of ŷ: {np.mean(y_hat):.4f}")
print(f"  Mean of y: {np.mean(y):.4f}")
```

Notice we used the centered (deviation) form for computing Sxy and Sxx rather than the raw score form. This is more numerically stable when x or y values are large, avoiding potential catastrophic cancellation in floating-point arithmetic. Production implementations use even more sophisticated methods (QR decomposition) for numerical reasons.
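As an additional sanity check, continuing with the `x`, `y`, `beta_0_hat`, and `beta_1_hat` defined in the block above, we can compare against NumPy's `polyfit`, which solves the same degree-one least squares problem:

```python
# Cross-check simple_ols against NumPy's built-in least squares fit.
slope_np, intercept_np = np.polyfit(x, y, deg=1)
print(f"np.polyfit:  β̂₀ = {intercept_np:.4f}, β̂₁ = {slope_np:.4f}")
print(f"simple_ols:  β̂₀ = {beta_0_hat:.4f}, β̂₁ = {beta_1_hat:.4f}")
```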
The OLS formulas have elegant connections to familiar summary statistics. Let's make these explicit.
Sample Variance and Covariance
Recall the sample variance and covariance:
$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{S_{xx}}{n-1}$$
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{S_{xy}}{n-1}$$
Therefore:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{(n-1)s_{xy}}{(n-1)s_x^2} = \frac{s_{xy}}{s_x^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}$$
Correlation Coefficient
The sample correlation coefficient is:
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx}} \sqrt{S_{yy}}}$$
This gives us another expression for the slope:
$$\hat{\beta}_1 = \frac{s_{xy}}{s_x^2} = \frac{r_{xy} s_x s_y}{s_x^2} = r_{xy} \frac{s_y}{s_x}$$
Interpretation: The slope equals the correlation times the ratio of standard deviations. If $x$ and $y$ are perfectly correlated ($r = 1$), and have equal standard deviations, then $\hat{\beta}_1 = 1$.
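A short sketch (synthetic data assumed) confirming the correlation form against the deviation form:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 2.0, 200)

# Correlation form: r · (s_y / s_x)
r = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)
slope_from_corr = r * s_y / s_x

# Deviation form: Sxy / Sxx
slope_deviation = (np.sum((x - x.mean()) * (y - y.mean()))
                   / np.sum((x - x.mean()) ** 2))

print(slope_from_corr, slope_deviation)  # should match
```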
| Relationship | Formula | Interpretation |
|---|---|---|
| Slope from covariance | $\hat{\beta}_1 = s_{xy} / s_x^2$ | Covariance per unit variance of x |
| Slope from correlation | $\hat{\beta}_1 = r_{xy} \cdot (s_y/s_x)$ | Correlation scaled by volatility ratio |
| Sign of slope | $\text{sign}(\hat{\beta}_1) = \text{sign}(r_{xy})$ | Slope direction matches correlation sign |
| Standardized slope | $\hat{\beta}_1 \cdot (s_x/s_y) = r_{xy}$ | Slope on standardized data = correlation |
If we standardize both x and y (subtract mean, divide by standard deviation), the OLS slope equals the correlation coefficient. This is why correlation is sometimes called 'standardized covariance' and why r² (squared correlation) plays a central role in regression diagnostics.
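A quick sketch of that standardization claim, on assumed synthetic data: fit the slope on z-scored variables and compare it to the correlation coefficient:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 2.0, 200)

# Standardize both variables (z-scores)
z_x = (x - x.mean()) / np.std(x, ddof=1)
z_y = (y - y.mean()) / np.std(y, ddof=1)

# OLS slope on standardized data (intercept is zero by construction)
slope_standardized = np.sum(z_x * z_y) / np.sum(z_x ** 2)

print(slope_standardized)
print(np.corrcoef(x, y)[0, 1])  # same value
```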
We've derived the ordinary least squares estimators from first principles. The key results: the slope $\hat{\beta}_1 = S_{xy}/S_{xx}$, the intercept $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, the normal equations forcing the residuals to satisfy $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$, and the guarantee (from the positive definite Hessian) that this critical point is the unique global minimizer of RSS whenever the $x_i$ are not all identical.
What's Next
We've established how to compute the OLS estimators. The next page provides the geometric interpretation—viewing regression as projection in high-dimensional space. This geometric view isn't just elegant; it provides deep insight into why OLS works, how to interpret R², and what happens when we add predictors in multiple regression.
You can now derive the OLS estimators from scratch using calculus, understand the normal equations, and verify residual properties. This derivation forms the mathematical backbone of linear regression. Next, we'll visualize what least squares actually does geometrically.