Given our model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and observed data ${(x_i, y_i)}_{i=1}^n$, we face a fundamental question: How do we find the "best" values for $\beta_0$ and $\beta_1$?
The answer depends on defining what "best" means. The method of least squares—first published by Legendre in 1805 and independently developed by Gauss around the same time—provides an elegant and principled answer that has stood the test of two centuries.
This page derives the least squares solution from scratch, revealing not just the formulas but the reasoning behind them.
By the end of this page, you will be able to derive the OLS estimators β̂₀ and β̂₁ using calculus, understand why squared errors are used, manipulate the optimization problem algebraically, and connect the derivation to numerical computing considerations.
Our goal is to find values $\hat{\beta}_0$ and $\hat{\beta}_1$ that make our predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ as close as possible to the observed values $y_i$.
The residual for observation $i$ is the prediction error:
$$e_i = y_i - \hat{y}_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$
We want all residuals to be small. But how do we aggregate $n$ individual residuals into a single measure of overall fit?
The least squares method minimizes the residual sum of squares (RSS), also called the sum of squared errors (SSE):
$$\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n [y_i - (\beta_0 + \beta_1 x_i)]^2$$
This is a function of two unknowns ($\beta_0$ and $\beta_1$). Our task is to find the values that minimize it:
$$\hat{\beta}_0, \hat{\beta}_1 = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i]^2$$
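To make the objective concrete, here is a minimal sketch (the small dataset and the candidate coefficient pairs are arbitrary, chosen only for illustration) that evaluates the RSS for a few candidate lines; least squares picks the pair with the smallest value:

```python
import numpy as np

# Arbitrary illustrative dataset (not from the text).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

def rss(beta_0, beta_1):
    """Residual sum of squares for a candidate line."""
    return np.sum((y - (beta_0 + beta_1 * x)) ** 2)

# Different candidate lines give different RSS values;
# least squares chooses the pair that minimizes this quantity.
for b0, b1 in [(0.0, 1.0), (0.0, 2.0), (0.5, 1.9)]:
    print(f"β0={b0}, β1={b1}: RSS = {rss(b0, b1):.3f}")
```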
Beyond mathematical convenience (differentiability), squared errors have deeper justifications: (1) Under Gaussian error assumptions, least squares equals maximum likelihood estimation. (2) Among all linear unbiased estimators, OLS has minimum variance (Gauss-Markov theorem). (3) Squared loss corresponds to minimizing mean squared error in prediction. (4) RSS relates to variance decomposition and R-squared.
We have an unconstrained optimization problem in two variables. Define the objective function:
$$S(\beta_0, \beta_1) = \sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i]^2$$
This is a convex quadratic function of $(\beta_0, \beta_1)$. Because it's quadratic with positive leading coefficients (think: bowl-shaped surface opening upward), it has a unique global minimum that we can find by setting the first-order derivatives to zero.
Visualizing the Objective
Imagine a 3D surface where the horizontal plane has coordinates $(\beta_0, \beta_1)$ and the vertical axis represents $S(\beta_0, \beta_1)$. The surface is a paraboloid—a bowl shape. The minimum is at the bottom of the bowl.
First-Order Necessary Conditions
At the minimum, the gradient must be zero:
$$\nabla S = \begin{pmatrix} \frac{\partial S}{\partial \beta_0} \\ \frac{\partial S}{\partial \beta_1} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Let's compute each partial derivative.
Partial Derivative with Respect to $\beta_0$
Using the chain rule:
$$\frac{\partial S}{\partial \beta_0} = \sum_{i=1}^n 2[y_i - \beta_0 - \beta_1 x_i] \cdot (-1) = -2\sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i]$$
Setting equal to zero and dividing by $-2$:
$$\sum_{i=1}^n [y_i - \beta_0 - \beta_1 x_i] = 0$$
Expanding the sum:
$$\sum_{i=1}^n y_i - n\beta_0 - \beta_1 \sum_{i=1}^n x_i = 0$$
Dividing by $n$ and using the notation $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ (sample mean of $y$) and $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ (sample mean of $x$):
$$\bar{y} - \beta_0 - \beta_1 \bar{x} = 0$$
This gives us the first normal equation:
$$\boxed{\beta_0 = \bar{y} - \beta_1 \bar{x}}$$
This result has a beautiful geometric interpretation: the fitted regression line always passes through the point (x̄, ȳ)—the centroid (center of mass) of the data. Substituting x = x̄ into the fitted line ŷ = β̂₀ + β̂₁x gives ŷ = ȳ.
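A quick numerical check of the centroid property, using NumPy's `polyfit` as a stand-in for the estimators we finish deriving below (the synthetic data here is assumed purely for illustration):

```python
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 + 1.5 * x + rng.normal(0, 1.0, 50)

# np.polyfit(deg=1) returns [slope, intercept] for the least squares line.
beta_1_hat, beta_0_hat = np.polyfit(x, y, deg=1)

# The fitted line evaluated at x̄ reproduces ȳ (up to rounding).
print(beta_0_hat + beta_1_hat * np.mean(x))
print(np.mean(y))
```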
Partial Derivative with Respect to $\beta_1$
$$\frac{\partial S}{\partial \beta_1} = \sum_{i=1}^n 2[y_i - \beta_0 - \beta_1 x_i] \cdot (-x_i) = -2\sum_{i=1}^n x_i[y_i - \beta_0 - \beta_1 x_i]$$
Setting equal to zero and dividing by $-2$:
$$\sum_{i=1}^n x_i[y_i - \beta_0 - \beta_1 x_i] = 0$$
Expanding:
$$\sum_{i=1}^n x_i y_i - \beta_0 \sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2 = 0$$
This is the second normal equation.
Solving for $\beta_1$
Substitute $\beta_0 = \bar{y} - \beta_1 \bar{x}$ into the second normal equation:
$$\sum_{i=1}^n x_i y_i - (\bar{y} - \beta_1 \bar{x}) \sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2 = 0$$
Note that $\sum_{i=1}^n x_i = n\bar{x}$, so:
$$\sum_{i=1}^n x_i y_i - \bar{y}(n\bar{x}) + \beta_1 \bar{x}(n\bar{x}) - \beta_1 \sum_{i=1}^n x_i^2 = 0$$
Rearranging:
$$\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = \beta_1 \left( \sum_{i=1}^n x_i^2 - n\bar{x}^2 \right)$$
Solving for $\beta_1$:
$$\boxed{\beta_1 = \frac{\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^n x_i^2 - n\bar{x}^2}}$$
The numerator and denominator have familiar statistical interpretations: the numerator is proportional to the sample covariance between x and y, and the denominator is proportional to the sample variance of x (the same proportionality factor appears in both and cancels). This connection will become clearer shortly.
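As a sketch of the formulas so far, the following computes $\hat{\beta}_1$ from the boxed raw-score expression and $\hat{\beta}_0$ from the first normal equation, on an arbitrary toy dataset:

```python
import numpy as np

# Arbitrary toy dataset, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# β1 from the raw-score expression just derived
beta_1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x ** 2) - n * x_bar ** 2)

# β0 from the first normal equation
beta_0 = y_bar - beta_1 * x_bar

print(beta_0, beta_1)
```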
Let's express the OLS estimators in their most common forms. Using centered notation, define:
$$S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$$ $$S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2$$ $$S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2$$
It can be shown (by expanding) that:
$$\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = S_{xy}$$ $$\sum_{i=1}^n x_i^2 - n\bar{x}^2 = S_{xx}$$
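These identities are easy to confirm numerically; the sketch below does so on synthetic data (the particular values are assumptions for illustration only):

```python
import numpy as np

# Synthetic data with x deliberately far from zero.
rng = np.random.default_rng(1)
x = rng.normal(50.0, 5.0, 200)
y = 1.0 + 0.3 * x + rng.normal(0, 1.0, 200)
n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# Raw-score expressions vs. centered sums
lhs_xy = np.sum(x * y) - n * x_bar * y_bar
rhs_xy = np.sum((x - x_bar) * (y - y_bar))   # S_xy

lhs_xx = np.sum(x ** 2) - n * x_bar ** 2
rhs_xx = np.sum((x - x_bar) ** 2)            # S_xx

print(np.isclose(lhs_xy, rhs_xy), np.isclose(lhs_xx, rhs_xx))  # True True
```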
The OLS Estimators
The slope estimator:
$$\boxed{\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$
The intercept estimator:
$$\boxed{\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}}$$
These are the ordinary least squares (OLS) estimators for simple linear regression.
| Form | Expression | When Useful |
|---|---|---|
| Deviation form | $\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ | Conceptual clarity, numerical stability |
| Raw score form | $\frac{\sum_i x_i y_i - n \bar{x} \bar{y}}{\sum_i x_i^2 - n \bar{x}^2}$ | Streaming/incremental computation |
| Covariance form | $\frac{\text{Cov}(x, y)}{\text{Var}(x)} = \frac{s_{xy}}{s_x^2}$ | Statistical interpretation |
| Correlation form | $r_{xy} \cdot \frac{s_y}{s_x}$ | Standardized variables, R-squared connection |
The expression β̂₁ = Cov(x,y)/Var(x) reveals the slope as measuring how much y varies with x (covariance) relative to how much x varies by itself (variance). If x has high variance but low covariance with y, the slope is small. If Cov(x,y) is large relative to Var(x), the slope is steep.
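Here is a brief sketch checking that the covariance form and the deviation form give the same slope (synthetic data; the $n-1$ convention is used in both numerator and denominator, so the divisors cancel):

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 100)
y = 4.0 - 0.7 * x + rng.normal(0, 1.0, 100)

# Covariance form: sample covariance over sample variance (both with n-1)
cov_xy = np.cov(x, y)[0, 1]
var_x = np.var(x, ddof=1)
slope_cov_form = cov_xy / var_x

# Deviation form: Sxy / Sxx
slope_deviation_form = (np.sum((x - x.mean()) * (y - y.mean()))
                        / np.sum((x - x.mean()) ** 2))

print(slope_cov_form, slope_deviation_form)  # should agree
```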
The first-order conditions we derived are called the normal equations. Written together:
$$\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$ $$\sum_{i=1}^n x_i(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$
These can be rewritten using residuals $e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$:
$$\sum_{i=1}^n e_i = 0$$ $$\sum_{i=1}^n x_i e_i = 0$$
These equations express fundamental properties of OLS residuals.
Matrix Form of Normal Equations
In matrix notation with $\mathbf{X} = [\mathbf{1} | \mathbf{x}]$ (design matrix with intercept column), the normal equations become:
$$\mathbf{X}^T (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$
This says the residual vector $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is orthogonal to every column of $\mathbf{X}$: orthogonality to the intercept column $\mathbf{1}$ gives $\sum_i e_i = 0$, and orthogonality to the predictor column $\mathbf{x}$ gives $\sum_i x_i e_i = 0$.
Rearranging the normal equations:
$$\mathbf{X}^T \mathbf{y} = \mathbf{X}^T \mathbf{X} \hat{\boldsymbol{\beta}}$$
If $\mathbf{X}^T\mathbf{X}$ is invertible:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
This is the famous closed-form solution for OLS in matrix form.
The term 'normal' here means perpendicular/orthogonal, not usual/typical. The normal equations express the condition that residuals are normal (perpendicular) to the column space of X. Don't confuse this with the normality assumption about error distributions—that's a separate concept.
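A minimal sketch of the matrix route, on assumed synthetic data: build the design matrix, solve the normal equations, and verify that the residual vector is orthogonal to the columns of $\mathbf{X}$. We solve $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$ directly rather than forming the explicit inverse:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(3)
n = 100
x = rng.uniform(0, 10, n)
y = 5.0 + 2.0 * x + rng.normal(0, 1.0, n)

X = np.column_stack([np.ones(n), x])      # design matrix [1 | x]

# Solve the normal equations XᵀX β = Xᵀy.
# (Production code would typically use QR or lstsq instead.)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("β̂ =", beta_hat)                    # [intercept, slope]

# Orthogonality of residuals to the columns of X
e = y - X @ beta_hat
print("Xᵀe =", X.T @ e)                    # ≈ [0, 0]
```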
We found a critical point by setting the gradient to zero. How do we know it's a minimum and not a maximum or saddle point?
The Hessian Matrix
We need to check the second-order conditions using the Hessian matrix of second partial derivatives:
$$H = \begin{pmatrix} \frac{\partial^2 S}{\partial \beta_0^2} & \frac{\partial^2 S}{\partial \beta_0 \partial \beta_1} \\ \frac{\partial^2 S}{\partial \beta_1 \partial \beta_0} & \frac{\partial^2 S}{\partial \beta_1^2} \end{pmatrix}$$
Computing each term from $S = \sum_i [y_i - \beta_0 - \beta_1 x_i]^2$:
$$\frac{\partial^2 S}{\partial \beta_0^2} = 2n$$
$$\frac{\partial^2 S}{\partial \beta_1^2} = 2\sum_{i=1}^n x_i^2$$
$$\frac{\partial^2 S}{\partial \beta_0 \partial \beta_1} = 2\sum_{i=1}^n x_i = 2n\bar{x}$$
So the Hessian is:
$$H = 2\begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum_i x_i^2 \end{pmatrix}$$
Positive Definiteness Check
For a minimum, $H$ must be positive definite. We check:
First principal minor: $2n > 0$ ✓ (assuming $n > 0$)
Second principal minor (determinant):
$$\det(H/2) = n \sum_i x_i^2 - n^2 \bar{x}^2 = n \left( \sum_i x_i^2 - n\bar{x}^2 \right) = n \cdot S_{xx}$$
This is positive as long as $S_{xx} > 0$, which holds whenever not all $x_i$ are identical.
Conclusion: The critical point is a global minimum (since the quadratic is convex). OLS gives the unique solution that minimizes RSS.
The only pathological case is when Sₓₓ = 0, meaning all xᵢ values are identical. In this case, the data provides no variation in x, so we cannot estimate a slope—any line through (x̄, ȳ) fits equally well. The design matrix X becomes rank-deficient, and (XᵀX)⁻¹ doesn't exist.
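The second-order check is easy to reproduce numerically. The sketch below (on assumed synthetic data) builds the Hessian from the formulas above, confirms its eigenvalues are positive, and shows the degenerate constant-$x$ case:

```python
import numpy as np

# Synthetic predictor with variation in x.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
n = len(x)

# Hessian from the derivation: H = 2 [[n, n x̄], [n x̄, Σ x²]]
H = 2 * np.array([[n, n * x.mean()],
                  [n * x.mean(), np.sum(x ** 2)]])

print(np.linalg.eigvalsh(H))     # both eigenvalues positive => positive definite

# Degenerate case: constant x gives a singular (rank-1) Hessian
x_const = np.full(100, 3.0)
H_bad = 2 * np.array([[100, 100 * 3.0],
                      [100 * 3.0, np.sum(x_const ** 2)]])
print(np.linalg.eigvalsh(H_bad)) # one eigenvalue is (numerically) zero
```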
Let's implement the OLS formulas and verify our understanding. The implementation illustrates both the conceptual formula and practical considerations.
```python
import numpy as np


def simple_ols(x, y):
    """
    Compute OLS estimates for simple linear regression.

    This implementation follows the formulas we derived:
    - β₁ = Sxy / Sxx
    - β₀ = ȳ - β₁x̄

    Parameters:
        x: ndarray of shape (n,) - predictor variable
        y: ndarray of shape (n,) - response variable

    Returns:
        beta_0: float - intercept estimate
        beta_1: float - slope estimate
    """
    n = len(x)

    # Sample means
    x_bar = np.mean(x)
    y_bar = np.mean(y)

    # Centered sums (deviation form - more numerically stable)
    Sxy = np.sum((x - x_bar) * (y - y_bar))
    Sxx = np.sum((x - x_bar) ** 2)

    # Check for degenerate case
    if Sxx == 0:
        raise ValueError("All x values are identical; slope is undefined")

    # OLS estimates
    beta_1 = Sxy / Sxx
    beta_0 = y_bar - beta_1 * x_bar

    return beta_0, beta_1


# Example: Generate data and fit
np.random.seed(42)
n = 100

# True parameters
beta_0_true = 5.0
beta_1_true = 2.0
sigma = 1.0

# Generate data
x = np.random.uniform(0, 10, n)
epsilon = np.random.normal(0, sigma, n)
y = beta_0_true + beta_1_true * x + epsilon

# Fit OLS
beta_0_hat, beta_1_hat = simple_ols(x, y)

print("True parameters:")
print(f"  β₀ = {beta_0_true:.4f}")
print(f"  β₁ = {beta_1_true:.4f}")
print()
print("OLS estimates:")
print(f"  β̂₀ = {beta_0_hat:.4f}")
print(f"  β̂₁ = {beta_1_hat:.4f}")

# Verify residual properties
y_hat = beta_0_hat + beta_1_hat * x
residuals = y - y_hat

print()
print("Residual properties (should be ~0):")
print(f"  Sum of residuals: {np.sum(residuals):.2e}")
print(f"  Sum of x*residuals: {np.sum(x * residuals):.2e}")
print(f"  Mean of ŷ: {np.mean(y_hat):.4f}")
print(f"  Mean of y: {np.mean(y):.4f}")
```

Notice we used the centered (deviation) form for computing Sxy and Sxx rather than the raw score form. This is more numerically stable when x or y values are large, avoiding potential catastrophic cancellation in floating-point arithmetic. Production implementations use even more sophisticated methods (QR decomposition) for numerical reasons.
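As an additional sanity check, continuing with the `x`, `y`, `beta_0_hat`, and `beta_1_hat` defined in the block above, we can compare against NumPy's `polyfit`, which solves the same degree-one least squares problem:

```python
# Cross-check simple_ols against NumPy's built-in least squares fit.
slope_np, intercept_np = np.polyfit(x, y, deg=1)
print(f"np.polyfit:  β̂₀ = {intercept_np:.4f}, β̂₁ = {slope_np:.4f}")
print(f"simple_ols:  β̂₀ = {beta_0_hat:.4f}, β̂₁ = {beta_1_hat:.4f}")
```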
The OLS formulas have elegant connections to familiar summary statistics. Let's make these explicit.
Sample Variance and Covariance
Recall the sample variance and covariance:
$$s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 = \frac{S_{xx}}{n-1}$$
$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \frac{S_{xy}}{n-1}$$
Therefore:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{(n-1)s_{xy}}{(n-1)s_x^2} = \frac{s_{xy}}{s_x^2} = \frac{\text{Cov}(x,y)}{\text{Var}(x)}$$
Correlation Coefficient
The sample correlation coefficient is:
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{S_{xy}}{\sqrt{S_{xx}} \sqrt{S_{yy}}}$$
This gives us another expression for the slope:
$$\hat{\beta}_1 = \frac{s_{xy}}{s_x^2} = \frac{r_{xy} s_x s_y}{s_x^2} = r_{xy} \frac{s_y}{s_x}$$
Interpretation: The slope equals the correlation times the ratio of standard deviations. If $x$ and $y$ are perfectly correlated ($r = 1$), and have equal standard deviations, then $\hat{\beta}_1 = 1$.
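A short sketch (synthetic data assumed) confirming the correlation form against the deviation form:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 2.0, 200)

# Correlation form: r · (s_y / s_x)
r = np.corrcoef(x, y)[0, 1]
s_x = np.std(x, ddof=1)
s_y = np.std(y, ddof=1)
slope_from_corr = r * s_y / s_x

# Deviation form: Sxy / Sxx
slope_deviation = (np.sum((x - x.mean()) * (y - y.mean()))
                   / np.sum((x - x.mean()) ** 2))

print(slope_from_corr, slope_deviation)  # should match
```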
| Relationship | Formula | Interpretation |
|---|---|---|
| Slope from covariance | $\hat{\beta}_1 = s_{xy} / s_x^2$ | Covariance per unit variance of x |
| Slope from correlation | $\hat{\beta}_1 = r_{xy} \cdot (s_y/s_x)$ | Correlation scaled by volatility ratio |
| Sign of slope | $\text{sign}(\hat{\beta}_1) = \text{sign}(r_{xy})$ | Slope direction matches correlation sign |
| Standardized slope | $\hat{\beta}_1 \cdot (s_x/s_y) = r_{xy}$ | Slope on standardized data = correlation |
If we standardize both x and y (subtract mean, divide by standard deviation), the OLS slope equals the correlation coefficient. This is why correlation is sometimes called 'standardized covariance' and why r² (squared correlation) plays a central role in regression diagnostics.
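A quick sketch of that standardization claim, on assumed synthetic data: fit the slope on z-scored variables and compare it to the correlation coefficient:

```python
import numpy as np

# Synthetic data for illustration.
rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 2.0, 200)

# Standardize both variables (z-scores)
z_x = (x - x.mean()) / np.std(x, ddof=1)
z_y = (y - y.mean()) / np.std(y, ddof=1)

# OLS slope on standardized data (intercept is zero by construction)
slope_standardized = np.sum(z_x * z_y) / np.sum(z_x ** 2)

print(slope_standardized)
print(np.corrcoef(x, y)[0, 1])  # same value
```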
We've derived the ordinary least squares estimators from first principles. The key results: the slope $\hat{\beta}_1 = S_{xy}/S_{xx}$, the intercept $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, the normal equations forcing the residuals to satisfy $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$, and the guarantee (from the positive definite Hessian) that this critical point is the unique global minimizer of RSS whenever the $x_i$ are not all identical.
What's Next
We've established how to compute the OLS estimators. The next page provides the geometric interpretation—viewing regression as projection in high-dimensional space. This geometric view isn't just elegant; it provides deep insight into why OLS works, how to interpret R², and what happens when we add predictors in multiple regression.
You can now derive the OLS estimators from scratch using calculus, understand the normal equations, and verify residual properties. This derivation forms the mathematical backbone of linear regression. Next, we'll visualize what least squares actually does geometrically.