Confidence intervals quantify uncertainty, but many practical questions require a yes-or-no decision: does this predictor have any effect on the response, or is the apparent effect just noise?
Hypothesis testing provides a formal framework for answering such questions. Given data, we decide whether the evidence is strong enough to reject a specific claim (the null hypothesis) in favor of an alternative.
In regression, the most common tests assess whether coefficients are zero—determining whether predictors have 'statistically significant' effects on the response.
By the end of this page, you will be able to:
- perform t-tests for individual coefficients,
- perform F-tests for joint hypotheses about multiple coefficients,
- apply the overall F-test for regression significance,
- explain the relationship between tests and confidence intervals,
- correctly interpret p-values and statistical significance, and
- avoid common logical fallacies in hypothesis testing.
The Basic Setup:
- Null hypothesis H₀: the default claim we seek evidence against (e.g., "this coefficient is zero").
- Alternative hypothesis H₁: the claim we accept if the evidence against H₀ is strong.
- Test statistic: a summary of the data whose distribution is known when H₀ is true.
- Significance level α: the Type I error rate we are willing to tolerate (commonly 0.05).
- Decision rule: reject H₀ when the test statistic is sufficiently extreme (equivalently, when the p-value is below α); otherwise fail to reject.
Types of Errors:
| | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error (α) | Correct (Power = 1 − β) |
| Fail to Reject H₀ | Correct | Type II Error (β) |
Hypothesis testing is designed to protect against false positives (Type I errors). We only reject H₀ when evidence is strong. 'Failing to reject H₀' is NOT the same as 'accepting H₀' or 'proving H₀ is true'—it means the evidence wasn't strong enough to rule out H₀.
In Regression Context:
The most common null hypotheses are:
- $H_0: \beta_j = 0$ — predictor $x_j$ has no effect on the response, holding the other predictors fixed (tested with a t-test),
- $H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$ — no predictor has any effect (the overall F-test),
- $H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ — a general linear restriction, such as $\beta_1 = \beta_2$ (tested with an F-test).
Testing H₀: βⱼ = β₀ⱼ (usually β₀ⱼ = 0):
Under normality of errors, the t-statistic:
$$t_j = \frac{\hat{\beta}_j - \beta_{0j}}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p} \quad \text{under } H_0$$
For the standard test of 'no effect' ($H_0: \beta_j = 0$):
$$t_j = \frac{\hat{\beta}_j}{\widehat{\text{SE}}(\hat{\beta}_j)}$$
Two-Sided Test (H₁: βⱼ ≠ 0):
$$\text{p-value} = 2 \cdot \Pr(t_{n-p} > |t_j|) = 2 \cdot [1 - F_{t_{n-p}}(|t_j|)]$$
Reject H₀ if p-value < α, equivalently if $|t_j| > t_{\alpha/2, n-p}$.
One-Sided Tests:
For $H_1: \beta_j > \beta_{0j}$ the p-value is $\Pr(t_{n-p} > t_j)$; for $H_1: \beta_j < \beta_{0j}$ it is $\Pr(t_{n-p} < t_j)$. The code below reports both one-sided p-values alongside the two-sided one.
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def ols_with_t_tests(X: np.ndarray, y: np.ndarray, null_values: np.ndarray = None):
    """
    Perform OLS regression with t-tests for each coefficient.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response (n,)
    null_values : hypothesized values under H₀ (default: all zeros)

    Returns:
    --------
    Dictionary with estimates, SEs, t-stats, p-values
    """
    n, p = X.shape
    df = n - p

    if null_values is None:
        null_values = np.zeros(p)

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Standard errors
    se = np.sqrt(s_sq * np.diag(XtX_inv))

    # t-statistics (testing β = null_values)
    t_stats = (beta_hat - null_values) / se

    # p-values (two-sided)
    p_values_two_sided = 2 * (1 - stats.t.cdf(np.abs(t_stats), df))

    # p-values (one-sided, β > 0)
    p_values_greater = 1 - stats.t.cdf(t_stats, df)

    # p-values (one-sided, β < 0)
    p_values_less = stats.t.cdf(t_stats, df)

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_stats': t_stats,
        'p_two_sided': p_values_two_sided,
        'p_greater': p_values_greater,
        'p_less': p_values_less,
        'df': df,
        's': np.sqrt(s_sq)
    }

# Example
np.random.seed(42)
n = 100

# Design matrix with intercept and three predictors
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),   # x1: effect
    np.random.randn(n),   # x2: no effect
    np.random.randn(n)    # x3: small effect
])

# True coefficients: β1 has effect, β2 has none, β3 has small effect
beta_true = np.array([1.0, 2.0, 0.0, 0.3])
y = X @ beta_true + np.random.randn(n)

results = ols_with_t_tests(X, y)

print("OLS Regression with t-Tests (H₀: βⱼ = 0)")
print("=" * 70)
print(f"{'Coef':12s} {'Estimate':>10s} {'SE':>10s} {'t':>10s} {'p-value':>12s} {'Signif':>8s}")
print("-" * 70)

names = ['Intercept', 'x1', 'x2', 'x3']
for j in range(4):
    signif = ("***" if results['p_two_sided'][j] < 0.001
              else "**" if results['p_two_sided'][j] < 0.01
              else "*" if results['p_two_sided'][j] < 0.05
              else "." if results['p_two_sided'][j] < 0.1
              else "")
    print(f"{names[j]:12s} {results['beta_hat'][j]:10.4f} {results['se'][j]:10.4f} "
          f"{results['t_stats'][j]:10.3f} {results['p_two_sided'][j]:12.4f} {signif:>8s}")

print("-" * 70)
print(f"Residual SE: {results['s']:.4f} on {results['df']} df")
print()
print("True values:", beta_true)
print("Signif. codes: *** <0.001, ** <0.01, * <0.05, . <0.1")
```
Statistical software typically reports t-statistics and p-values for the test H₀: βⱼ = 0. The 'significance stars' (*, **, ***) indicate different p-value thresholds. Remember: statistical significance ≠ practical importance. A very small but precisely estimated effect can be 'significant' but irrelevant.
When testing hypotheses about multiple coefficients simultaneously, individual t-tests are insufficient. We need the F-test.
The General Linear Hypothesis:
$$H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$$
Where:
- $\mathbf{R}$ is a $q \times p$ restriction matrix (each row encodes one linear restriction),
- $\mathbf{r}$ is a $q$-vector of hypothesized values, and
- $q$ is the number of restrictions.
Common Examples:
| Hypothesis | R | r |
|---|---|---|
| $H_0: \beta_1 = 0$ | $[0, 1, 0, \ldots, 0]$ | $[0]$ |
| $H_0: \beta_1 = \beta_2$ | $[0, 1, -1, 0, \ldots]$ | $[0]$ |
| $H_0: \beta_1 = \beta_2 = 0$ | $\begin{pmatrix} 0 & 1 & 0 & \cdots \\ 0 & 0 & 1 & \cdots \end{pmatrix}$ | $\begin{pmatrix} 0 \\ 0 \end{pmatrix}$ |
| $H_0: \beta_1 + \beta_2 = 1$ | $[0, 1, 1, 0, \ldots]$ | $[1]$ |
The F-Statistic:
$$F = \frac{(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})^\top [\mathbf{R}(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{R}^\top]^{-1} (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})}{q \cdot s^2} \sim F_{q, n-p} \quad \text{under } H_0$$
Alternatively, using the restricted vs. unrestricted model approach:
$$F = \frac{(\text{RSS}_R - \text{RSS}_U) / q}{\text{RSS}_U / (n-p)} = \frac{(\text{RSS}_R - \text{RSS}_U) / q}{s^2}$$
Where:
- $\text{RSS}_R$ is the residual sum of squares of the restricted model (fit with $H_0$ imposed),
- $\text{RSS}_U$ is the residual sum of squares of the unrestricted (full) model,
- $q$ is the number of restrictions, and
- $s^2 = \text{RSS}_U / (n-p)$ is the residual variance from the full model.
Intuition:
The F-statistic compares how much worse the restricted model fits (RSS increase) relative to the residual variance. If the restrictions are valid (H₀ true), the RSS shouldn't increase much, so F is small. If restrictions are false, RSS increases substantially, F is large.
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def f_test_general(X: np.ndarray, y: np.ndarray, R: np.ndarray, r: np.ndarray):
    """
    Perform F-test for general linear hypothesis H₀: Rβ = r.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response (n,)
    R : restriction matrix (q x p)
    r : restriction values (q,)

    Returns:
    --------
    Dictionary with F-statistic, p-value, df
    """
    n, p = X.shape
    q = R.shape[0]
    df1, df2 = q, n - p

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df2

    # Departure from null
    departure = R @ beta_hat - r

    # Variance of Rβ̂ under null
    var_Rbeta = R @ XtX_inv @ R.T

    # F-statistic
    F_stat = (departure @ inv(var_Rbeta) @ departure) / (q * s_sq)

    # p-value
    p_value = 1 - stats.f.cdf(F_stat, df1, df2)

    return {
        'F_stat': F_stat,
        'p_value': p_value,
        'df1': df1,
        'df2': df2,
        'departure': departure,
        'beta_hat': beta_hat
    }

def f_test_nested(X_full: np.ndarray, X_reduced: np.ndarray, y: np.ndarray):
    """
    F-test comparing nested models via RSS.
    """
    n = len(y)
    p_full = X_full.shape[1]
    p_reduced = X_reduced.shape[1]
    q = p_full - p_reduced

    # Fit full model
    beta_full = inv(X_full.T @ X_full) @ X_full.T @ y
    RSS_full = np.sum((y - X_full @ beta_full)**2)

    # Fit reduced model
    beta_reduced = inv(X_reduced.T @ X_reduced) @ X_reduced.T @ y
    RSS_reduced = np.sum((y - X_reduced @ beta_reduced)**2)

    # F-statistic
    df1, df2 = q, n - p_full
    F_stat = ((RSS_reduced - RSS_full) / q) / (RSS_full / df2)
    p_value = 1 - stats.f.cdf(F_stat, df1, df2)

    return {
        'F_stat': F_stat,
        'p_value': p_value,
        'df1': df1,
        'df2': df2,
        'RSS_full': RSS_full,
        'RSS_reduced': RSS_reduced
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),
    np.random.randn(n),
    np.random.randn(n)
])
beta_true = np.array([1.0, 2.0, 0.0, 0.0])  # Only β₁ is non-zero
y = X @ beta_true + np.random.randn(n)

# Test 1: H₀: β₂ = β₃ = 0 (jointly)
print("Test 1: H₀: β₂ = β₃ = 0")
R = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1]])
r = np.array([0, 0])
result1 = f_test_general(X, y, R, r)
print(f"  F = {result1['F_stat']:.3f}, df = ({result1['df1']}, {result1['df2']}), p = {result1['p_value']:.4f}")
print(f"  Decision: {'Reject H₀' if result1['p_value'] < 0.05 else 'Fail to reject H₀'}")

# Test 2: H₀: β₁ = β₂
print("\nTest 2: H₀: β₁ = β₂")
R = np.array([[0, 1, -1, 0]])
r = np.array([0])
result2 = f_test_general(X, y, R, r)
print(f"  F = {result2['F_stat']:.3f}, df = ({result2['df1']}, {result2['df2']}), p = {result2['p_value']:.4f}")
print(f"  Decision: {'Reject H₀' if result2['p_value'] < 0.05 else 'Fail to reject H₀'}")

# Test 3: Overall F-test using nested models
print("\nTest 3: Overall F-test (all slopes = 0)")
X_reduced = X[:, 0:1]  # Intercept only
result3 = f_test_nested(X, X_reduced, y)
print(f"  F = {result3['F_stat']:.3f}, df = ({result3['df1']}, {result3['df2']}), p = {result3['p_value']:.4f}")
print(f"  RSS(full) = {result3['RSS_full']:.2f}, RSS(reduced) = {result3['RSS_reduced']:.2f}")
```
The overall F-test asks: Do any of the predictors explain variance in Y?
Hypotheses:
$$H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0$$ $$H_1: \text{At least one } \beta_j \neq 0$$
(Note: The intercept $\beta_0$ is not tested—we allow a non-zero mean.)
The F-Statistic:
$$F = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{RSS}_0 - \text{RSS}}{(p-1) \cdot s^2} = \frac{(\text{TSS} - \text{RSS})/(p-1)}{\text{RSS}/(n-p)} \sim F_{p-1,\, n-p}$$
Where:
- $\text{RSS}_0 = \text{TSS}$ is the residual sum of squares of the intercept-only model (the total sum of squares),
- $\text{RSS}$ is the residual sum of squares of the full model,
- $\text{MSR} = (\text{TSS} - \text{RSS})/(p-1)$ is the mean square due to regression, and
- $\text{MSE} = \text{RSS}/(n-p) = s^2$ is the mean squared error.
Connection to R²:
The F-statistic can be expressed in terms of $R^2$:
$$F = \frac{R^2 / (p-1)}{(1-R^2) / (n-p)}$$
This shows that F measures whether the fraction of variance explained ($R^2$) is 'large enough' given the number of predictors and sample size.
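To make the identity concrete, here is a minimal sketch (the simulated data and coefficient values are illustrative assumptions, not from this page's examples) that computes F both from the sums of squares and from $R^2$ and confirms the two expressions agree:

```python
import numpy as np
from numpy.linalg import inv

# Illustrative simulated data (assumed for this sketch only)
rng = np.random.default_rng(0)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.standard_normal(n)

# Fit OLS and compute the sums of squares
beta_hat = inv(X.T @ X) @ X.T @ y
RSS = np.sum((y - X @ beta_hat) ** 2)
TSS = np.sum((y - y.mean()) ** 2)
R_sq = 1 - RSS / TSS

# F from sums of squares (MSR / MSE) ...
F_from_ss = ((TSS - RSS) / (p - 1)) / (RSS / (n - p))
# ... and F from R²; the two are algebraically identical
F_from_r2 = (R_sq / (p - 1)) / ((1 - R_sq) / (n - p))

print(F_from_ss, F_from_r2)  # equal up to floating-point rounding
```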
The ANOVA Table:
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | $p-1$ | ESS | MSR = ESS/(p-1) | MSR/MSE |
| Residual | $n-p$ | RSS | MSE = RSS/(n-p) | |
| Total | $n-1$ | TSS | | |
It's possible for the overall F-test to reject H₀ while no individual t-test is significant (rarely, due to multicollinearity), or for individual t-tests to be significant while F is not (if effects cancel out). The F-test assesses joint significance; t-tests assess marginal significance conditional on other variables.
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def anova_table(X: np.ndarray, y: np.ndarray):
    """
    Compute complete ANOVA table for regression.
    """
    n, p = X.shape

    # OLS
    beta_hat = inv(X.T @ X) @ X.T @ y
    y_hat = X @ beta_hat
    residuals = y - y_hat

    # Sums of squares
    y_bar = np.mean(y)
    TSS = np.sum((y - y_bar)**2)
    RSS = np.sum(residuals**2)
    ESS = TSS - RSS

    # Degrees of freedom
    df_regression = p - 1
    df_residual = n - p
    df_total = n - 1

    # Mean squares
    MSR = ESS / df_regression
    MSE = RSS / df_residual

    # F-statistic
    F_stat = MSR / MSE
    p_value = 1 - stats.f.cdf(F_stat, df_regression, df_residual)

    # R-squared
    R_sq = ESS / TSS
    R_sq_adj = 1 - (RSS / df_residual) / (TSS / df_total)

    return {
        'TSS': TSS, 'ESS': ESS, 'RSS': RSS,
        'df_reg': df_regression, 'df_res': df_residual, 'df_total': df_total,
        'MSR': MSR, 'MSE': MSE,
        'F_stat': F_stat, 'p_value': p_value,
        'R_sq': R_sq, 'R_sq_adj': R_sq_adj
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([
    np.ones(n),
    np.random.randn(n),
    np.random.randn(n)
])
beta_true = np.array([1.0, 1.5, -0.8])
y = X @ beta_true + np.random.randn(n)

result = anova_table(X, y)

print("ANOVA Table")
print("=" * 65)
print(f"{'Source':<15s} {'df':>8s} {'SS':>12s} {'MS':>12s} {'F':>10s} {'p-value':>10s}")
print("-" * 65)
print(f"{'Regression':<15s} {result['df_reg']:>8d} {result['ESS']:>12.2f} {result['MSR']:>12.4f} "
      f"{result['F_stat']:>10.3f} {result['p_value']:>10.4f}")
print(f"{'Residual':<15s} {result['df_res']:>8d} {result['RSS']:>12.2f} {result['MSE']:>12.4f}")
print(f"{'Total':<15s} {result['df_total']:>8d} {result['TSS']:>12.2f}")
print("-" * 65)
print(f"R² = {result['R_sq']:.4f}, Adjusted R² = {result['R_sq_adj']:.4f}")
print()
print(f"Overall F-test: F({result['df_reg']}, {result['df_res']}) = {result['F_stat']:.3f}, "
      f"p = {result['p_value']:.4f}")
decision = "Reject H₀" if result['p_value'] < 0.05 else "Fail to reject H₀"
print(f"Decision: {decision} - regression is {'significant' if result['p_value'] < 0.05 else 'not significant'}")
```
There is a profound connection between hypothesis tests and confidence intervals: they are dual procedures that convey the same information.
The Duality Relationship:
For testing $H_0: \beta_j = \beta_{0j}$ at significance level $\alpha$:
$$\text{Reject } H_0 \quad \iff \quad \beta_{0j} \notin \text{CI}_{1-\alpha}(\beta_j)$$
Equivalently:
$$\text{p-value} < \alpha \quad \iff \quad \beta_{0j} \notin [\hat{\beta}_j - t_{\alpha/2}\cdot\text{SE}(\hat{\beta}_j),\ \hat{\beta}_j + t_{\alpha/2}\cdot\text{SE}(\hat{\beta}_j)]$$
Implications:
- The $1-\alpha$ confidence interval is exactly the set of null values $\beta_{0j}$ that a level-$\alpha$ two-sided test would fail to reject.
- Reporting a confidence interval therefore conveys the outcome of every such test at once, not just the test of $\beta_j = 0$.
A p-value tells you only whether 0 is rejected at some threshold. A CI tells you the entire range of plausible values. If CI = [0.5, 15.0], you know not just that β ≠ 0, but that β is probably between 0.5 and 15.0. The CI tells a richer story.
Example of Duality:
Suppose $\hat{\beta}_1 = 2.5$ with SE = 1.0, and $n - p = 30$. Since $t_{0.025, 30} \approx 2.04$, the 95% CI is $2.5 \pm 2.04 \times 1.0 = [0.46, 4.54]$. For $H_0: \beta_1 = 0$, the t-statistic is $2.5/1.0 = 2.5 > 2.04$, so we reject at the 5% level; consistently, 0 lies outside the CI.
Now test $H_0: \beta_1 = 1$: the t-statistic is $(2.5 - 1)/1.0 = 1.5 < 2.04$, so we fail to reject; consistently, 1 lies inside the CI.
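A minimal sketch of the same arithmetic (the estimate, standard error, and degrees of freedom are taken from the example above):

```python
import scipy.stats as stats

# Numbers from the example above
beta_hat, se, df = 2.5, 1.0, 30
t_crit = stats.t.ppf(0.975, df)                        # ≈ 2.04
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)  # ≈ [0.46, 4.54]

for null_value in (0.0, 1.0):
    t_stat = (beta_hat - null_value) / se
    p_val = 2 * (1 - stats.t.cdf(abs(t_stat), df))
    inside = ci[0] <= null_value <= ci[1]
    # Duality: p < 0.05 exactly when the null value lies outside the 95% CI
    print(f"H₀: β₁ = {null_value}: t = {t_stat:.2f}, p = {p_val:.4f}, "
          f"inside 95% CI: {inside}")
```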
For Joint Tests:
The duality extends to joint hypotheses: the level-$\alpha$ F-test of $H_0: \mathbf{R}\boldsymbol{\beta} = \mathbf{r}$ rejects exactly when $\mathbf{r}$ falls outside the $1-\alpha$ joint confidence region (an ellipsoid) for $\mathbf{R}\boldsymbol{\beta}$.
The p-value is perhaps the most misunderstood quantity in statistics. Let's be precise.
Definition:
The p-value is the probability of observing a test statistic as extreme or more extreme than the one computed, assuming H₀ is true.
$$\text{p-value} = \Pr(|T| \geq |t_{\text{obs}}| \mid H_0 \text{ true})$$
What the p-Value IS:
✅ A measure of compatibility between the data and H₀
✅ The probability of the data (or more extreme) given H₀
✅ A continuous measure—smaller = less compatible with H₀
What the p-Value is NOT:
❌ Probability that H₀ is true: P(H₀|data) — This is backwards! The p-value is P(data|H₀).
❌ Probability that H₁ is true: P(H₁|data) — Same error.
❌ Probability the result is due to chance — Vague and misleading.
❌ Effect size — A small p-value doesn't mean a large effect.
❌ Replication probability — A p = 0.04 doesn't mean a 96% chance of replicating.
The Correct Interpretation:
A p-value of 0.03 means:
"If the null hypothesis were true (β = 0), there would be a 3% probability of observing a test statistic at least as extreme as the one we computed."
This is NOT the same as:
"There is a 3% probability that the null hypothesis is true." ❌
Why This Matters:
Suppose you test 100 hypotheses, all of which are actually true (all null hypotheses are correct). Using α = 0.05, you expect to reject about 5 of them—false positives.
If p = 0.04 for a particular test, it tells you nothing about whether that null hypothesis is true. It only tells you the probability of seeing such extreme data if it were true.
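A small simulation sketch of this scenario (the sample size, seed, and simple-regression setup below are illustrative assumptions): fit many regressions in which the true slope is exactly zero and count how often the t-test rejects at α = 0.05.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(1)
n, n_tests, alpha = 50, 100, 0.05
rejections = 0

for _ in range(n_tests):
    # Simple regression where the true slope is exactly 0, so H₀ is true
    x = rng.standard_normal(n)
    y = 1.0 + rng.standard_normal(n)          # y does not depend on x
    X = np.column_stack([np.ones(n), x])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    s_sq = resid @ resid / (n - 2)
    se_slope = np.sqrt(s_sq * np.linalg.inv(X.T @ X)[1, 1])
    t_stat = beta[1] / se_slope
    p_val = 2 * (1 - stats.t.cdf(abs(t_stat), n - 2))
    rejections += p_val < alpha

# Expect roughly alpha * n_tests ≈ 5 false positives on average
print(f"Rejected {rejections} of {n_tests} true null hypotheses at α = {alpha}")
```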
The Base Rate Fallacy:
If you test a hypothesis that is very likely true a priori (high prior probability of H₀), a p = 0.04 might still leave H₀ more probable than not. Bayesian thinking is needed for P(H₀|data).
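A back-of-the-envelope sketch of the base rate fallacy (the prior probability and power below are hypothetical numbers chosen purely for illustration):

```python
# Hypothetical inputs: prior P(H₀) = 0.9, power P(reject | H₁) = 0.8, α = 0.05
prior_H0, power, alpha = 0.9, 0.8, 0.05

# Bayes' rule: probability that H₀ is true GIVEN that the test rejected it
p_reject = prior_H0 * alpha + (1 - prior_H0) * power
p_H0_given_reject = prior_H0 * alpha / p_reject

print(f"P(H₀ | rejected at α = 0.05) = {p_H0_given_reject:.2f}")  # ≈ 0.36, not 0.05
```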
A fundamental distinction that researchers often blur:
Statistical Significance:
The observed effect is unlikely under H₀: the p-value is below α, so we have evidence that the effect is non-zero.
Practical Significance (Effect Size):
The observed effect is large enough to matter in the real world. This depends on context, costs, and benefits.
The Key Insight:
Statistical significance depends on sample size. With enough data, even trivially small effects become statistically significant.
Example:
Suppose a drug reduces blood pressure by 0.5 mmHg on average:
| Sample Size | SE | t-statistic | p-value | Significant? |
|---|---|---|---|---|
| n = 100 | 2.0 | 0.25 | 0.80 | No |
| n = 10,000 | 0.2 | 2.5 | 0.01 | Yes |
| n = 1,000,000 | 0.02 | 25 | < 0.0001 | Extremely |
The effect size (0.5 mmHg) is the same—likely clinically irrelevant. But with enough data, we can detect it with near-certainty.
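A quick sketch reproducing the arithmetic in the table (the standard errors are taken from the table; using the large-sample normal approximation for the p-values is an assumption):

```python
import scipy.stats as stats

effect = 0.5  # mmHg reduction, the same in every row
for n, se in [(100, 2.0), (10_000, 0.2), (1_000_000, 0.02)]:
    t_stat = effect / se
    p_val = 2 * (1 - stats.norm.cdf(abs(t_stat)))  # normal approximation
    print(f"n = {n:>9,d}: t = {t_stat:5.2f}, p = {p_val:.4f}  (same 0.5 mmHg effect)")
```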
With n = 1,000,000, a coefficient of 0.0001 might be highly 'significant' (p < 0.001) yet completely meaningless practically. Always ask: Is the effect SIZE large enough to matter? Statistical significance tells you it's non-zero, not that it's important.
Reporting Best Practices:
- Report the coefficient estimate and its confidence interval, not just a p-value or significance stars.
- Report exact p-values (e.g., p = 0.03) rather than bare thresholds (p < 0.05).
- State the effect size in the units of the problem and say whether it is practically meaningful.
- Report the sample size and the model fitted, so readers can judge the precision behind each test.
Effect Size Measures:
For regression, common effect size measures include:
- the raw coefficient $\hat{\beta}_j$, interpreted in the units of the problem,
- standardized coefficients (the effect of a one-standard-deviation change in the predictor), and
- $R^2$, adjusted $R^2$, and partial $R^2$ for the share of variance explained overall or by an individual predictor.
Hypothesis testing is rife with logical errors. Here are the most common pitfalls:
- Treating 'fail to reject H₀' as proof that H₀ is true.
- Interpreting the p-value as the probability that H₀ is true.
- Confusing statistical significance with practical importance.
- Running many tests and reporting only the significant ones—with 100 true nulls and α = 0.05, about 5 'significant' results are expected by chance alone.
- Forgetting that a coefficient's significance is conditional on the other variables in the model.
This page has covered the complete theory and practice of hypothesis testing for regression coefficients. Here are the key insights:
- t-tests assess individual coefficients; F-tests assess joint hypotheses; the overall F-test asks whether any predictor explains variance in Y.
- Tests and confidence intervals are dual: a level-α test rejects exactly when the null value lies outside the 1 − α interval.
- A p-value is the probability of data at least as extreme as observed given H₀, never the probability that H₀ is true.
- Statistical significance is not practical significance: with a large enough sample, even trivial effects become 'significant'.
- Failing to reject H₀ is not the same as accepting it.
Module Complete:
With this page, we've completed Module 3 on the Statistical Properties of OLS. You now understand:
- the conditions under which OLS is the best linear unbiased estimator (the Gauss–Markov theorem),
- how to quantify the uncertainty in coefficient estimates with standard errors and confidence intervals, and
- how to test hypotheses about individual coefficients, joint linear restrictions, and the regression as a whole.
These tools form the foundation for rigorous inference in linear regression and extend to many more advanced methods.
Congratulations! You've mastered the statistical properties of OLS—from the theoretical guarantees of the Gauss-Markov theorem to the practical tools of confidence intervals and hypothesis tests. These concepts are fundamental to understanding any regression analysis and form the basis for more advanced topics in econometrics and machine learning.