Confidence intervals quantify uncertainty, but many practical questions require a yes-or-no decision: does this predictor have any effect on the response, or is the apparent effect just noise?
Hypothesis testing provides a formal framework for answering such questions. Given data, we decide whether the evidence is strong enough to reject a specific claim (the null hypothesis) in favor of an alternative.
In regression, the most common tests assess whether coefficients are zero—determining whether predictors have 'statistically significant' effects on the response.
By the end of this page, you will be able to:
- perform t-tests for individual coefficients,
- perform F-tests for joint hypotheses about multiple coefficients,
- apply the overall F-test for regression significance,
- explain the relationship between tests and confidence intervals,
- correctly interpret p-values and statistical significance, and
- avoid common logical fallacies in hypothesis testing.
The Basic Setup:
- Null hypothesis H₀: the default claim we seek evidence against (e.g., "this coefficient is zero").
- Alternative hypothesis H₁: the claim we accept if the evidence against H₀ is strong.
- Test statistic: a summary of the data whose distribution is known when H₀ is true.
- Significance level α: the Type I error rate we are willing to tolerate (commonly 0.05).
- Decision rule: reject H₀ when the test statistic is sufficiently extreme (equivalently, when the p-value is below α); otherwise fail to reject.
Types of Errors:
| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct (Power = 1 − β) |
| Fail to Reject H₀ | Correct | Type II Error (β) |
Hypothesis testing is designed to protect against false positives (Type I errors). We only reject H₀ when evidence is strong. 'Failing to reject H₀' is NOT the same as 'accepting H₀' or 'proving H₀ is true'—it means the evidence wasn't strong enough to rule out H₀.
In Regression Context:
The most common null hypotheses are:
- $H_0: \beta_j = 0$ — predictor $x_j$ has no effect on the response, holding the other predictors fixed (tested with a t-test),
- $H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$ — no predictor has any effect (the overall F-test),
- $H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ — a general linear restriction, such as $\beta_1 = \beta_2$ (tested with an F-test).
Testing H₀: βⱼ = β₀ⱼ (usually β₀ⱼ = 0):
Under normality of errors, the t-statistic:
$$t_j = \frac{\hat{\beta}_j - \beta_{0j}}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p} \quad \text{under } H_0$$
For the standard test of 'no effect' ($H_0: \beta_j = 0$):
$$t_j = \frac{\hat{\beta}_j}{\widehat{\text{SE}}(\hat{\beta}_j)}$$
Two-Sided Test (H₁: βⱼ ≠ 0):
$$\text{p-value} = 2 \cdot \Pr(t_{n-p} > |t_j|) = 2 \cdot [1 - F_{t_{n-p}}(|t_j|)]$$
Reject H₀ if p-value < α, equivalently if $|t_j| > t_{\alpha/2, n-p}$.
One-Sided Tests:
For $H_1: \beta_j > \beta_{0j}$ the p-value is $\Pr(t_{n-p} > t_j)$; for $H_1: \beta_j < \beta_{0j}$ it is $\Pr(t_{n-p} < t_j)$. The code below reports both one-sided p-values alongside the two-sided one.
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def ols_with_t_tests(X: np.ndarray, y: np.ndarray, null_values: np.ndarray = None):
    """
    Perform OLS regression with t-tests for each coefficient.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response (n,)
    null_values : hypothesized values under H₀ (default: all zeros)

    Returns:
    --------
    Dictionary with estimates, SEs, t-stats, p-values
    """
    n, p = X.shape
    df = n - p

    if null_values is None:
        null_values = np.zeros(p)

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Standard errors
    se = np.sqrt(s_sq * np.diag(XtX_inv))

    # t-statistics (testing β = null_values)
    t_stats = (beta_hat - null_values) / se

    # p-values (two-sided)
    p_values_two_sided = 2 * (1 - stats.t.cdf(np.abs(t_stats), df))

    # p-values (one-sided, β > 0)
    p_values_greater = 1 - stats.t.cdf(t_stats, df)

    # p-values (one-sided, β < 0)
    p_values_less = stats.t.cdf(t_stats, df)

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_stats': t_stats,
        'p_two_sided': p_values_two_sided,
        'p_greater': p_values_greater,
        'p_less': p_values_less,
        'df': df,
        's': np.sqrt(s_sq)
    }

# Example
np.random.seed(42)
n = 100

# Design matrix with intercept and three predictors
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),   # x1: effect
    np.random.randn(n),   # x2: no effect
    np.random.randn(n)    # x3: small effect
])

# True coefficients: β1 has effect, β2 has none, β3 has small effect
beta_true = np.array([1.0, 2.0, 0.0, 0.3])
y = X @ beta_true + np.random.randn(n)

results = ols_with_t_tests(X, y)

print("OLS Regression with t-Tests (H₀: βⱼ = 0)")
print("=" * 70)
print(f"{'Coef':12s} {'Estimate':>10s} {'SE':>10s} {'t':>10s} {'p-value':>12s} {'Signif':>8s}")
print("-" * 70)

names = ['Intercept', 'x1', 'x2', 'x3']
for j in range(4):
    signif = ("***" if results['p_two_sided'][j] < 0.001
              else "**" if results['p_two_sided'][j] < 0.01
              else "*" if results['p_two_sided'][j] < 0.05
              else "." if results['p_two_sided'][j] < 0.1
              else "")
    print(f"{names[j]:12s} {results['beta_hat'][j]:10.4f} {results['se'][j]:10.4f} "
          f"{results['t_stats'][j]:10.3f} {results['p_two_sided'][j]:12.4f} {signif:>8s}")

print("-" * 70)
print(f"Residual SE: {results['s']:.4f} on {results['df']} df")
print()
print("True values:", beta_true)
print("Signif. codes: *** <0.001, ** <0.01, * <0.05, . <0.1")
```
Statistical software typically reports t-statistics and p-values for the test H₀: βⱼ = 0. The 'significance stars' (*, **, ***) indicate different p-value thresholds. Remember: statistical significance ≠ practical importance. A very small but precisely estimated effect can be 'significant' but irrelevant.
When testing hypotheses about multiple coefficients simultaneously, individual t-tests are insufficient. We need the F-test.
The General Linear Hypothesis:
$$H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$$
Where:
- $\mathbf{R}$ is a $q \times p$ restriction matrix (each row encodes one linear restriction),
- $\mathbf{r}$ is a $q$-vector of hypothesized values, and
- $q$ is the number of restrictions.
Common Examples:
| Hypothesis | R | r |
|---|---|---|
| $H_0: \beta_1 = 0$ | $[0, 1, 0, \ldots, 0]$ | $[0]$ |
| $H_0: \beta_1 = \beta_2$ | $[0, 1, -1, 0, \ldots]$ | $[0]$ |
| $H_0: \beta_1 = \beta_2 = 0$ | $\begin{pmatrix} 0 & 1 & 0 & \cdots \\ 0 & 0 & 1 & \cdots \end{pmatrix}$ | $\begin{pmatrix} 0 \\ 0 \end{pmatrix}$ |
| $H_0: \beta_1 + \beta_2 = 1$ | $[0, 1, 1, 0, \ldots]$ | $[1]$ |
The F-Statistic:
$$F = \frac{(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})^\top [\mathbf{R}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{R}^\top]^{-1} (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})}{q \cdot s^2} \sim F_{q, n-p} \quad \text{under } H_0$$
Alternatively, using the restricted vs. unrestricted model approach:
$$F = \frac{(\text{RSS}_R - \text{RSS}_U) / q}{\text{RSS}_U / (n-p)} = \frac{(\text{RSS}_R - \text{RSS}_U) / q}{s^2}$$
Where:
- $\text{RSS}_R$ is the residual sum of squares of the restricted model (fit with $H_0$ imposed),
- $\text{RSS}_U$ is the residual sum of squares of the unrestricted (full) model,
- $q$ is the number of restrictions, and
- $s^2 = \text{RSS}_U / (n-p)$ is the residual variance from the full model.
Intuition:
The F-statistic compares how much worse the restricted model fits (RSS increase) relative to the residual variance. If the restrictions are valid (H₀ true), the RSS shouldn't increase much, so F is small. If restrictions are false, RSS increases substantially, F is large.
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def f_test_general(X: np.ndarray, y: np.ndarray, R: np.ndarray, r: np.ndarray):
    """
    Perform F-test for general linear hypothesis H₀: Rβ = r.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response (n,)
    R : restriction matrix (q x p)
    r : restriction values (q,)

    Returns:
    --------
    Dictionary with F-statistic, p-value, df
    """
    n, p = X.shape
    q = R.shape[0]
    df1, df2 = q, n - p

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df2

    # Departure from null
    departure = R @ beta_hat - r

    # Variance of Rβ̂ under null
    var_Rbeta = R @ XtX_inv @ R.T

    # F-statistic
    F_stat = (departure @ inv(var_Rbeta) @ departure) / (q * s_sq)

    # p-value
    p_value = 1 - stats.f.cdf(F_stat, df1, df2)

    return {
        'F_stat': F_stat,
        'p_value': p_value,
        'df1': df1,
        'df2': df2,
        'departure': departure,
        'beta_hat': beta_hat
    }

def f_test_nested(X_full: np.ndarray, X_reduced: np.ndarray, y: np.ndarray):
    """
    F-test comparing nested models via RSS.
    """
    n = len(y)
    p_full = X_full.shape[1]
    p_reduced = X_reduced.shape[1]
    q = p_full - p_reduced

    # Fit full model
    beta_full = inv(X_full.T @ X_full) @ X_full.T @ y
    RSS_full = np.sum((y - X_full @ beta_full)**2)

    # Fit reduced model
    beta_reduced = inv(X_reduced.T @ X_reduced) @ X_reduced.T @ y
    RSS_reduced = np.sum((y - X_reduced @ beta_reduced)**2)

    # F-statistic
    df1, df2 = q, n - p_full
    F_stat = ((RSS_reduced - RSS_full) / q) / (RSS_full / df2)
    p_value = 1 - stats.f.cdf(F_stat, df1, df2)

    return {
        'F_stat': F_stat,
        'p_value': p_value,
        'df1': df1,
        'df2': df2,
        'RSS_full': RSS_full,
        'RSS_reduced': RSS_reduced
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),
    np.random.randn(n),
    np.random.randn(n)
])
beta_true = np.array([1.0, 2.0, 0.0, 0.0])  # Only β₁ is non-zero
y = X @ beta_true + np.random.randn(n)

# Test 1: H₀: β₂ = β₃ = 0 (jointly)
print("Test 1: H₀: β₂ = β₃ = 0")
R = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1]])
r = np.array([0, 0])
result1 = f_test_general(X, y, R, r)
print(f"  F = {result1['F_stat']:.3f}, df = ({result1['df1']}, {result1['df2']}), p = {result1['p_value']:.4f}")
print(f"  Decision: {'Reject H₀' if result1['p_value'] < 0.05 else 'Fail to reject H₀'}")

# Test 2: H₀: β₁ = β₂
print("\nTest 2: H₀: β₁ = β₂")
R = np.array([[0, 1, -1, 0]])
r = np.array([0])
result2 = f_test_general(X, y, R, r)
print(f"  F = {result2['F_stat']:.3f}, df = ({result2['df1']}, {result2['df2']}), p = {result2['p_value']:.4f}")
print(f"  Decision: {'Reject H₀' if result2['p_value'] < 0.05 else 'Fail to reject H₀'}")

# Test 3: Overall F-test using nested models
print("\nTest 3: Overall F-test (all slopes = 0)")
X_reduced = X[:, 0:1]  # Intercept only
result3 = f_test_nested(X, X_reduced, y)
print(f"  F = {result3['F_stat']:.3f}, df = ({result3['df1']}, {result3['df2']}), p = {result3['p_value']:.4f}")
print(f"  RSS(full) = {result3['RSS_full']:.2f}, RSS(reduced) = {result3['RSS_reduced']:.2f}")
```
The overall F-test asks: Do any of the predictors explain variance in Y?
Hypotheses:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$$ $$H_1: \text{At least one } \beta_j \neq 0$$
(Note: The intercept $\beta_0$ is not tested—we allow a non-zero mean.)
The F-Statistic:
$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{RSS}_0 - \text{RSS}}{(p-1) \cdot s^2} = \frac{(\text{TSS} - \text{RSS})/(p-1)}{\text{RSS}/(n-p)} \sim F_{p-1,\, n-p}$$
Where:
- $\text{RSS}_0 = \text{TSS}$ is the residual sum of squares of the intercept-only model (the total sum of squares),
- $\text{RSS}$ is the residual sum of squares of the full model,
- $\text{MSR} = (\text{TSS} - \text{RSS})/(p-1)$ is the mean square due to regression, and
- $\text{MSE} = \text{RSS}/(n-p) = s^2$ is the mean squared error.
Connection to R²:
The F-statistic can be expressed in terms of $R^2$:
$$F = \frac{R^2 / (p-1)}{(1-R^2) / (n-p)}$$
This shows that F measures whether the fraction of variance explained ($R^2$) is 'large enough' given the number of predictors and sample size.
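To make the identity concrete, here is a minimal sketch (the simulated data and coefficient values are illustrative assumptions, not from this page's examples) that computes F both from the sums of squares and from $R^2$ and confirms the two expressions agree:

```python
import numpy as np
from numpy.linalg import inv

# Illustrative simulated data (assumed for this sketch only)
rng = np.random.default_rng(0)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.standard_normal(n)

# Fit OLS and compute the sums of squares
beta_hat = inv(X.T @ X) @ X.T @ y
RSS = np.sum((y - X @ beta_hat) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
R_sq = 1 - RSS / TSS

# F from sums of squares (MSR / MSE) ...
F_from_ss = ((TSS - RSS) / (p - 1)) / (RSS / (n - p))
# ... and F from R²; the two are algebraically identical
F_from_r2 = (R_sq / (p - 1)) / ((1 - R_sq) / (n - p))

print(F_from_ss, F_from_r2)  # equal up to floating-point rounding
```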
The ANOVA Table:
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | $p-1$ | ESS | MSR = ESS/(p-1) | MSR/MSE |
| Residual | $n-p$ | RSS | MSE = RSS/(n-p) | |
| Total | $n-1$ | TSS | | |
It's possible for the overall F-test to reject H₀ while no individual t-test is significant (rarely, due to multicollinearity), or for individual t-tests to be significant while F is not (if effects cancel out). The F-test assesses joint significance; t-tests assess marginal significance conditional on other variables.
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def anova_table(X: np.ndarray, y: np.ndarray):
    """
    Compute complete ANOVA table for regression.
    """
    n, p = X.shape

    # OLS
    beta_hat = inv(X.T @ X) @ X.T @ y
    y_hat = X @ beta_hat
    residuals = y - y_hat

    # Sums of squares
    y_bar = np.mean(y)
    TSS = np.sum((y - y_bar)**2)
    RSS = np.sum(residuals**2)
    ESS = TSS - RSS

    # Degrees of freedom
    df_regression = p - 1
    df_residual = n - p
    df_total = n - 1

    # Mean squares
    MSR = ESS / df_regression
    MSE = RSS / df_residual

    # F-statistic
    F_stat = MSR / MSE
    p_value = 1 - stats.f.cdf(F_stat, df_regression, df_residual)

    # R-squared
    R_sq = ESS / TSS
    R_sq_adj = 1 - (RSS / df_residual) / (TSS / df_total)

    return {
        'TSS': TSS, 'ESS': ESS, 'RSS': RSS,
        'df_reg': df_regression, 'df_res': df_residual, 'df_total': df_total,
        'MSR': MSR, 'MSE': MSE,
        'F_stat': F_stat, 'p_value': p_value,
        'R_sq': R_sq, 'R_sq_adj': R_sq_adj
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),
    np.random.randn(n)
])
beta_true = np.array([1.0, 1.5, -0.8])
y = X @ beta_true + np.random.randn(n)

result = anova_table(X, y)

print("ANOVA Table")
print("=" * 65)
print(f"{'Source':<15s} {'df':>8s} {'SS':>12s} {'MS':>12s} {'F':>10s} {'p-value':>10s}")
print("-" * 65)
print(f"{'Regression':<15s} {result['df_reg']:>8d} {result['ESS']:>12.2f} {result['MSR']:>12.4f} "
      f"{result['F_stat']:>10.3f} {result['p_value']:>10.4f}")
print(f"{'Residual':<15s} {result['df_res']:>8d} {result['RSS']:>12.2f} {result['MSE']:>12.4f}")
print(f"{'Total':<15s} {result['df_total']:>8d} {result['TSS']:>12.2f}")
print("-" * 65)
print(f"R² = {result['R_sq']:.4f}, Adjusted R² = {result['R_sq_adj']:.4f}")
print()
print(f"Overall F-test: F({result['df_reg']}, {result['df_res']}) = {result['F_stat']:.3f}, "
      f"p = {result['p_value']:.4f}")
decision = "Reject H₀" if result['p_value'] < 0.05 else "Fail to reject H₀"
print(f"Decision: {decision} - regression is {'significant' if result['p_value'] < 0.05 else 'not significant'}")
```
There is a profound connection between hypothesis tests and confidence intervals: they are dual procedures that convey the same information.
The Duality Relationship:
For testing $H_0: \beta_j = \beta_{0j}$ at significance level $\alpha$:
$$\text{Reject } H_0 \quad \iff \quad \beta_{0j} \notin \text{CI}_{1-\alpha}(\beta_j)$$
Equivalently:
$$\text{p-value} < \alpha \quad \iff \quad \beta_{0j} \notin [\hat{\beta}_j - t_{\alpha/2}\cdot\text{SE}(\hat{\beta}_j),\ \hat{\beta}_j + t_{\alpha/2}\cdot\text{SE}(\hat{\beta}_j)]$$
Implications:
- The $1-\alpha$ confidence interval is exactly the set of null values $\beta_{0j}$ that a level-$\alpha$ two-sided test would fail to reject.
- Reporting a confidence interval therefore conveys the outcome of every such test at once, not just the test of $\beta_j = 0$.
A p-value tells you only whether 0 is rejected at some threshold. A CI tells you the entire range of plausible values. If CI = [0.5, 15.0], you know not just that β ≠ 0, but that β is probably between 0.5 and 15.0. The CI tells a richer story.
Example of Duality:
Suppose $\hat{\beta}_1 = 2.5$ with SE = 1.0, and $n - p = 30$. Since $t_{0.025, 30} \approx 2.04$, the 95% CI is $2.5 \pm 2.04 \times 1.0 = [0.46, 4.54]$. For $H_0: \beta_1 = 0$, the t-statistic is $2.5/1.0 = 2.5 > 2.04$, so we reject at the 5% level; consistently, 0 lies outside the CI.
Now test $H_0: \beta_1 = 1$: the t-statistic is $(2.5 - 1)/1.0 = 1.5 < 2.04$, so we fail to reject; consistently, 1 lies inside the CI.
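A minimal sketch of the same arithmetic (the estimate, standard error, and degrees of freedom are taken from the example above):

```python
import scipy.stats as stats

# Numbers from the example above
beta_hat, se, df = 2.5, 1.0, 30
t_crit = stats.t.ppf(0.975, df)                        # ≈ 2.04
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)  # ≈ [0.46, 4.54]

for null_value in (0.0, 1.0):
    t_stat = (beta_hat - null_value) / se
    p_val = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    inside = ci[0] <= null_value <= ci[1]
    # Duality: p < 0.05 exactly when the null value lies outside the 95% CI
    print(f"H₀: β₁ = {null_value}: t = {t_stat:.2f}, p = {p_val:.4f}, "
          f"inside 95% CI: {inside}")
```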
For Joint Tests:
The duality extends to joint hypotheses: the level-$\alpha$ F-test of $H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ rejects exactly when $\mathbf{r}$ falls outside the $1-\alpha$ joint confidence region (an ellipsoid) for $\mathbf{R}\boldsymbol{\beta}$.
The p-value is perhaps the most misunderstood quantity in statistics. Let's be precise.
Definition:
The p-value is the probability of observing a test statistic as extreme or more extreme than the one computed, assuming H₀ is true.
$$\text{p-value} = \Pr(|T| \geq |t_{\text{obs}}| \mid H_0 \text{ true})$$
What the p-Value IS:
✅ A measure of compatibility between the data and H₀
✅ The probability of the data (or more extreme) given H₀
✅ A continuous measure—smaller = less compatible with H₀
What the p-Value is NOT:
❌ Probability that H₀ is true: P(H₀|data) — This is backwards! The p-value is P(data|H₀).
❌ Probability that H₁ is true: P(H₁|data) — Same error.
❌ Probability the result is due to chance — Vague and misleading.
❌ Effect size — A small p-value doesn't mean a large effect.
❌ Replication probability — A p = 0.04 doesn't mean a 96% chance of replicating.
The Correct Interpretation:
A p-value of 0.03 means:
"If the null hypothesis were true (β = 0), there would be a 3% probability of observing a test statistic at least as extreme as the one we computed."
This is NOT the same as:
"There is a 3% probability that the null hypothesis is true." ❌
Why This Matters:
Suppose you test 100 hypotheses, all of which are actually true (all null hypotheses are correct). Using α = 0.05, you expect to reject about 5 of them—false positives.
If p = 0.04 for a particular test, it tells you nothing about whether that null hypothesis is true. It only tells you the probability of seeing such extreme data if it were true.
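A small simulation sketch of this scenario (the sample size, seed, and simple-regression setup below are illustrative assumptions): fit many regressions in which the true slope is exactly zero and count how often the t-test rejects at α = 0.05.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
n, n_tests, alpha = 50, 100, 0.05
rejections = 0

for _ in range(n_tests):
    # Simple regression where the true slope is exactly 0, so H₀ is true
    x = rng.standard_normal(n)
    y = 1.0 + rng.standard_normal(n)          # y does not depend on x
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    s_sq = resid @ resid / (n - 2)
    se_slope = np.sqrt(s_sq * np.linalg.inv(X.T @ X)[1, 1])
    t_stat = beta[1] / se_slope
    p_val = 2 * (1 - stats.t.cdf(abs(t_stat), n - 2))
    rejections += p_val < alpha

# Expect roughly alpha * n_tests ≈ 5 false positives on average
print(f"Rejected {rejections} of {n_tests} true null hypotheses at α = {alpha}")
```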
The Base Rate Fallacy:
If you test a hypothesis that is very likely true a priori (high prior probability of H₀), a p = 0.04 might still leave H₀ more probable than not. Bayesian thinking is needed for P(H₀|data).
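A back-of-the-envelope sketch of the base rate fallacy (the prior probability and power below are hypothetical numbers chosen purely for illustration):

```python
# Hypothetical inputs: prior P(H₀) = 0.9, power P(reject | H₁) = 0.8, α = 0.05
prior_H0, power, alpha = 0.9, 0.8, 0.05

# Bayes' rule: probability that H₀ is true GIVEN that the test rejected it
p_reject = prior_H0 * alpha + (1 - prior_H0) * power
p_H0_given_reject = prior_H0 * alpha / p_reject

print(f"P(H₀ | rejected at α = 0.05) = {p_H0_given_reject:.2f}")  # ≈ 0.36, not 0.05
```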
A fundamental distinction that researchers often blur:
Statistical Significance:
The observed effect is unlikely under H₀: the p-value is below α, so we have evidence that the effect is non-zero.
Practical Significance (Effect Size):
The observed effect is large enough to matter in the real world. This depends on context, costs, and benefits.
The Key Insight:
Statistical significance depends on sample size. With enough data, even trivially small effects become statistically significant.
Example:
Suppose a drug reduces blood pressure by 0.5 mmHg on average:
| Sample Size | SE | t-statistic | p-value | Significant? |
|---|---|---|---|---|
| n = 100 | 2.0 | 0.25 | 0.80 | No |
| n = 10,000 | 0.2 | 2.5 | 0.01 | Yes |
| n = 1,000,000 | 0.02 | 25 | < 0.0001 | Extremely |
The effect size (0.5 mmHg) is the same—likely clinically irrelevant. But with enough data, we can detect it with near-certainty.
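A quick sketch reproducing the arithmetic in the table (the standard errors are taken from the table; using the large-sample normal approximation for the p-values is an assumption):

```python
import scipy.stats as stats

effect = 0.5  # mmHg reduction, the same in every row
for n, se in [(100, 2.0), (10_000, 0.2), (1_000_000, 0.02)]:
    t_stat = effect / se
    p_val = 2 * (1 - stats.norm.cdf(abs(t_stat)))  # normal approximation
    print(f"n = {n:>9,d}: t = {t_stat:5.2f}, p = {p_val:.4f}  (same 0.5 mmHg effect)")
```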
With n = 1,000,000, a coefficient of 0.0001 might be highly 'significant' (p < 0.001) yet completely meaningless practically. Always ask: Is the effect SIZE large enough to matter? Statistical significance tells you it's non-zero, not that it's important.
Reporting Best Practices:
- Report the coefficient estimate and its confidence interval, not just a p-value or significance stars.
- Report exact p-values (e.g., p = 0.03) rather than bare thresholds (p < 0.05).
- State the effect size in the units of the problem and say whether it is practically meaningful.
- Report the sample size and the model fitted, so readers can judge the precision behind each test.
Effect Size Measures:
For regression, common effect size measures include:
- the raw coefficient $\hat{\beta}_j$, interpreted in the units of the problem,
- standardized coefficients (the effect of a one-standard-deviation change in the predictor), and
- $R^2$, adjusted $R^2$, and partial $R^2$ for the share of variance explained overall or by an individual predictor.
Hypothesis testing is rife with logical errors. Here are the most common pitfalls:
- Treating 'fail to reject H₀' as proof that H₀ is true.
- Interpreting the p-value as the probability that H₀ is true.
- Confusing statistical significance with practical importance.
- Running many tests and reporting only the significant ones—with 100 true nulls and α = 0.05, about 5 'significant' results are expected by chance alone.
- Forgetting that a coefficient's significance is conditional on the other variables in the model.
This page has covered the complete theory and practice of hypothesis testing for regression coefficients. Here are the key insights:
- t-tests assess individual coefficients; F-tests assess joint hypotheses; the overall F-test asks whether any predictor explains variance in Y.
- Tests and confidence intervals are dual: a level-α test rejects exactly when the null value lies outside the 1 − α interval.
- A p-value is the probability of data at least as extreme as observed given H₀, never the probability that H₀ is true.
- Statistical significance is not practical significance: with a large enough sample, even trivial effects become 'significant'.
- Failing to reject H₀ is not the same as accepting it.
Module Complete:
With this page, we've completed Module 3 on the Statistical Properties of OLS. You now understand:
- the conditions under which OLS is the best linear unbiased estimator (the Gauss–Markov theorem),
- how to quantify the uncertainty in coefficient estimates with standard errors and confidence intervals, and
- how to test hypotheses about individual coefficients, joint linear restrictions, and the regression as a whole.
These tools form the foundation for rigorous inference in linear regression and extend to many more advanced methods.
Congratulations! You've mastered the statistical properties of OLS—from the theoretical guarantees of the Gauss-Markov theorem to the practical tools of confidence intervals and hypothesis tests. These concepts are fundamental to understanding any regression analysis and form the basis for more advanced topics in econometrics and machine learning.