A point estimate like $\hat{\beta}_1 = 2.5$ provides our best single guess for the true coefficient. But how much trust should we place in this specific value? Could the true $\beta_1$ plausibly be 1.0? Or 5.0?
Confidence intervals transform point estimates into ranges that quantify our uncertainty. A 95% confidence interval might be $[1.8, 3.2]$, telling us that values in this range are 'consistent' with our data at the 95% confidence level.
This page develops the theory and practice of confidence intervals in regression, addressing both construction and—crucially—correct interpretation.
By the end of this page, you will be able to:
- construct confidence intervals for individual coefficients and linear combinations,
- understand why the t-distribution (not the normal) is used,
- correctly interpret what a confidence level means (and what it doesn't),
- recognize the challenge of simultaneous inference for multiple coefficients, and
- compute confidence regions for joint parameter vectors.
Confidence interval construction requires knowing the sampling distribution of $\hat{\boldsymbol{\beta}}$.
Under Gauss-Markov Assumptions + Normality:
We've established that, under the Gauss-Markov assumptions, $\hat{\boldsymbol{\beta}}$ is unbiased for $\boldsymbol{\beta}$ with $\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.
If we additionally assume $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, then:
$$\hat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})$$
This follows because $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is a linear transformation of the normal vector $\mathbf{y}$.
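A quick simulation makes this concrete (a minimal sketch; the design, coefficients, noise level, and replication count are arbitrary illustrative choices): holding $\mathbf{X}$ fixed and redrawing the errors many times, the empirical covariance of $\hat{\boldsymbol{\beta}}$ should match $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.

```python
import numpy as np
from numpy.linalg import inv

# Sketch: verify Var(beta_hat | X) = sigma^2 (X'X)^{-1} by simulation.
# Design, coefficients, and sigma are illustrative choices.
rng = np.random.default_rng(0)
n, sigma = 40, 0.5
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([1.0, 2.0])
XtX_inv = inv(X.T @ X)

# Refit on many fresh error draws, keeping X fixed
estimates = np.array([
    XtX_inv @ X.T @ (X @ beta + sigma * rng.standard_normal(n))
    for _ in range(20000)
])

print("Empirical covariance of beta_hat:\n", np.cov(estimates.T))
print("Theoretical sigma^2 (X'X)^{-1}:\n", sigma**2 * XtX_inv)
```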
For Each Individual Coefficient:
$$\hat{\beta}_j \mid \mathbf{X} \sim \mathcal{N}(\beta_j, \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj})$$
Standardizing:
$$\frac{\hat{\beta}_j - \beta_j}{\sigma \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}} \sim \mathcal{N}(0, 1)$$
The Problem:
We don't know $\sigma$! We must use the estimate $s$ instead. This introduces additional randomness, and the resulting statistic no longer follows a standard normal distribution.
Replacing the known σ with the estimated s changes the distribution. The resulting ratio follows a t-distribution, not a normal distribution. This is because s² is random and based on the same data used to compute β̂.
Derivation of the t-Statistic:
Under normality of errors:
$\hat{\beta}_j \sim \mathcal{N}(\beta_j, \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj})$
$(n-p)s^2/\sigma^2 \sim \chi^2_{n-p}$ (chi-squared with n-p degrees of freedom)
$\hat{\boldsymbol{\beta}}$ and $s^2$ are independent
By the definition of the t-distribution:
$$t = \frac{\mathcal{N}(0,1)}{\sqrt{\chi^2_\nu / \nu}} \sim t_\nu$$
Applying this:
$$t_j = \frac{\hat{\beta}_j - \beta_j}{s \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}} = \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p}$$
$$\boxed{t_j = \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p}}$$ This is the pivotal quantity for inference: it has a known distribution (t with n-p degrees of freedom) regardless of the unknown parameters β and σ².
Properties of the t-Distribution:
| Property | Description |
|---|---|
| Shape | Symmetric, bell-shaped, like normal but heavier tails |
| Center | Mean = 0 (for ν > 1) |
| Spread | Variance = ν/(ν-2) for ν > 2, heavier tails than normal |
| Degrees of freedom (ν = n-p) | Controls tail heaviness |
| As ν → ∞ | Converges to N(0,1) |
Why Heavier Tails?
The t-distribution has heavier tails because it accounts for uncertainty in estimating $\sigma$. When $n$ is small, our estimate $s$ of $\sigma$ is imprecise, so extreme values of the t-statistic are more likely than under the normal distribution.
As $n - p$ increases, the t-distribution's tails lighten, critical values shrink toward their normal counterparts, and the resulting confidence intervals narrow.
The (1-α)×100% Confidence Interval for βⱼ:
Since $t_j = \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p}$, we have:
$$\Pr\left(-t_{\alpha/2, n-p} \leq \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \leq t_{\alpha/2, n-p}\right) = 1 - \alpha$$
Rearranging for $\beta_j$:
$$\Pr\left(\hat{\beta}_j - t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j) \leq \beta_j \leq \hat{\beta}_j + t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)\right) = 1 - \alpha$$
The $(1-\alpha) \times 100\%$ confidence interval is:
$$\boxed{\text{CI}_{1-\alpha}(\beta_j) = \hat{\beta}_j \pm t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)}$$
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def compute_confidence_intervals(X: np.ndarray, y: np.ndarray, alpha: float = 0.05):
    """
    Compute confidence intervals for all regression coefficients.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response vector (n,)
    alpha : significance level (default 0.05 for 95% CI)

    Returns:
    --------
    Dictionary with estimates, SEs, CIs, and t-critical value
    """
    n, p = X.shape
    df = n - p

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residuals and variance estimate
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Standard errors
    se = np.sqrt(s_sq * np.diag(XtX_inv))

    # Critical value from t-distribution
    t_crit = stats.t.ppf(1 - alpha/2, df)

    # Confidence intervals
    ci_lower = beta_hat - t_crit * se
    ci_upper = beta_hat + t_crit * se
    ci_width = 2 * t_crit * se

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_critical': t_crit,
        'df': df,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'ci_width': ci_width,
        'alpha': alpha
    }

# Example
np.random.seed(42)
n = 30
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Compute CIs at different confidence levels
for alpha, conf in [(0.10, '90%'), (0.05, '95%'), (0.01, '99%')]:
    results = compute_confidence_intervals(X, y, alpha)
    print(f"\n{conf} Confidence Intervals (t-critical = {results['t_critical']:.3f}):")
    print("-" * 60)
    for j, name in enumerate(['Intercept', 'x1', 'x2']):
        print(f"  {name:10s}: {results['beta_hat'][j]:.3f} ± {results['t_critical'] * results['se'][j]:.3f}")
        print(f"              [{results['ci_lower'][j]:.3f}, {results['ci_upper'][j]:.3f}]")
        print(f"              Width: {results['ci_width'][j]:.3f}")
```

Critical Values:
| Confidence Level | α | α/2 | t-critical (df=∞) | z-critical |
|---|---|---|---|---|
| 90% | 0.10 | 0.05 | 1.645 | 1.645 |
| 95% | 0.05 | 0.025 | 1.960 | 1.960 |
| 99% | 0.01 | 0.005 | 2.576 | 2.576 |
For finite degrees of freedom, t-critical values are larger:
| df | t₀.₀₂₅ (for 95% CI) |
|---|---|
| 5 | 2.571 |
| 10 | 2.228 |
| 20 | 2.086 |
| 30 | 2.042 |
| 60 | 2.000 |
| ∞ | 1.960 |
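All of these critical values come straight from the quantile functions; a few lines with `scipy.stats` (a sketch) reproduce both tables:

```python
import scipy.stats as stats

# t critical values for 95% CIs at various degrees of freedom
for df in [5, 10, 20, 30, 60]:
    print(f"df = {df:3d}: t_0.025 = {stats.t.ppf(0.975, df):.3f}")

# Normal (df = infinity) critical values at several confidence levels
for conf, a in [('90%', 0.10), ('95%', 0.05), ('99%', 0.01)]:
    print(f"{conf}: z = {stats.norm.ppf(1 - a/2):.3f}")
```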
The interpretation of confidence intervals is frequently misunderstood—even by experienced practitioners. Let's be precise.
What a 95% Confidence Interval Means:
✅ Correct (Frequentist):
"If we repeated the sampling process many times and computed a 95% CI each time, approximately 95% of those intervals would contain the true parameter value."
The confidence level refers to the procedure, not to any individual interval.
✅ Also Correct:
"Before observing data, there is a 95% probability that the random interval $[\hat{\beta}j - t{\alpha/2}\cdot\text{SE}, \hat{\beta}j + t{\alpha/2}\cdot\text{SE}]$ will contain $\beta_j$."
This is the pre-data probability statement.
❌ 'There is a 95% probability that βⱼ is in the interval [1.5, 3.0].' The parameter is fixed (not random); it either is or isn't in the interval.
❌ 'We are 95% confident that βⱼ is between 1.5 and 3.0.' This implies personal belief/probability, which is Bayesian, not frequentist.
❌ '95% of the data falls in this interval.' The CI is about the parameter, not the data.
The Subtle Distinction:
Once we observe data and compute a specific interval like $[1.5, 3.0]$, nothing is random anymore: the fixed interval either contains the fixed $\beta_j$ or it doesn't. The "95%" describes the long-run success rate of the procedure that produced the interval, not a probability attached to this particular realization.
Practical Interpretation:
Despite the philosophical nuances, a 95% CI is often pragmatically interpreted as: "Values inside the CI are plausible given the data; values outside are not." This is an informal but useful heuristic.
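The procedural meaning of "95%" can also be checked by simulation; this sketch (design, coefficients, and noise level are illustrative choices) repeats the sampling process and counts how often the interval covers the true slope:

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

# Sketch: empirical coverage of the 95% CI for one slope.
rng = np.random.default_rng(1)
n, p = 30, 2
beta_true = np.array([1.0, 2.0])
n_reps, covered = 5000, 0
for _ in range(n_reps):
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s_sq = np.sum((y - X @ beta_hat)**2) / (n - p)
    se = np.sqrt(s_sq * XtX_inv[1, 1])
    half_width = stats.t.ppf(0.975, n - p) * se
    # Does this realized interval contain the true slope?
    covered += abs(beta_hat[1] - beta_true[1]) <= half_width

print(f"Empirical coverage: {covered / n_reps:.3f}  (nominal: 0.950)")
```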
Connection to Hypothesis Testing:
A 95% CI for $\beta_j$ contains exactly those values $\beta_j^0$ that would not be rejected by a two-sided test at the 5% significance level:
$$\beta_j^0 \in \text{CI}_{95\%} \iff \text{the p-value for } H_0{:}\ \beta_j = \beta_j^0 \text{ exceeds } 0.05$$
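This duality is easy to verify numerically. In the sketch below (simulated data; all choices illustrative), the two-sided p-value equals exactly 0.05 at the CI endpoints and exceeds it everywhere inside:

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

# Sketch: p-values at the 95% CI endpoints equal 0.05 exactly.
rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.standard_normal(n)

XtX_inv = inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s_sq = np.sum((y - X @ beta_hat)**2) / (n - p)
se1 = np.sqrt(s_sq * XtX_inv[1, 1])
t_crit = stats.t.ppf(0.975, n - p)
lo, hi = beta_hat[1] - t_crit * se1, beta_hat[1] + t_crit * se1

def p_value(beta0):
    """Two-sided p-value for H0: beta_1 = beta0."""
    t_stat = (beta_hat[1] - beta0) / se1
    return 2 * stats.t.sf(abs(t_stat), n - p)

print(f"95% CI for beta_1: [{lo:.3f}, {hi:.3f}]")
print(f"p-value at lower endpoint: {p_value(lo):.3f}")  # 0.050
print(f"p-value at upper endpoint: {p_value(hi):.3f}")  # 0.050
print(f"p-value at the estimate:   {p_value(beta_hat[1]):.3f}")  # 1.000
```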
Often we're interested in linear combinations of coefficients rather than individual ones.
Examples: the difference between two coefficients, $\beta_1 - \beta_2$ (do two predictors have equal effects?); the sum $\beta_1 + \beta_2$ (a combined effect); and the mean response at a specific predictor vector $\mathbf{x}_0$, namely $\mathbf{x}_0^\top \boldsymbol{\beta}$.
General Form:
For linear combination $\theta = \mathbf{c}^\top \boldsymbol{\beta}$ where $\mathbf{c} \in \mathbb{R}^p$:
Estimate: $\hat{\theta} = \mathbf{c}^\top \hat{\boldsymbol{\beta}}$
Variance: $\text{Var}(\hat{\theta}) = \mathbf{c}^\top \text{Var}(\hat{\boldsymbol{\beta}}) \mathbf{c} = \sigma^2 \mathbf{c}^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{c}$
Estimated SE: $\widehat{\text{SE}}(\hat{\theta}) = s \sqrt{\mathbf{c}^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{c}}$
Confidence interval:
$$\text{CI}_{1-\alpha}(\mathbf{c}^\top\boldsymbol{\beta}) = \mathbf{c}^\top \hat{\boldsymbol{\beta}} \pm t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\theta})$$
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def ci_for_linear_combination(X: np.ndarray, y: np.ndarray, c: np.ndarray,
                              alpha: float = 0.05):
    """
    Compute CI for a linear combination θ = c'β.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response (n,)
    c : coefficient vector (p,) defining the linear combination
    alpha : significance level

    Returns:
    --------
    Dictionary with estimate, SE, CI
    """
    n, p = X.shape
    df = n - p

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Linear combination estimate
    theta_hat = c @ beta_hat

    # Variance of linear combination
    var_theta = s_sq * (c @ XtX_inv @ c)
    se_theta = np.sqrt(var_theta)

    # CI
    t_crit = stats.t.ppf(1 - alpha/2, df)
    ci_lower = theta_hat - t_crit * se_theta
    ci_upper = theta_hat + t_crit * se_theta

    return {
        'theta_hat': theta_hat,
        'se': se_theta,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        't_critical': t_crit
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, 3.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# CI for individual coefficients
print("CIs for Individual Coefficients:")
for j, name in enumerate(['β₀', 'β₁', 'β₂']):
    c = np.zeros(3)
    c[j] = 1
    result = ci_for_linear_combination(X, y, c)
    print(f"  {name}: {result['theta_hat']:.3f} [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")

# CI for difference β₁ - β₂
print("\nCI for Difference (β₁ - β₂):")
c_diff = np.array([0, 1, -1])
result = ci_for_linear_combination(X, y, c_diff)
print(f"  True: {beta_true[1] - beta_true[2]:.3f}")
print(f"  Estimate: {result['theta_hat']:.3f}")
print(f"  CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")

# CI for sum β₁ + β₂
print("\nCI for Sum (β₁ + β₂):")
c_sum = np.array([0, 1, 1])
result = ci_for_linear_combination(X, y, c_sum)
print(f"  True: {beta_true[1] + beta_true[2]:.3f}")
print(f"  Estimate: {result['theta_hat']:.3f}")
print(f"  CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")

# CI for predicted mean at x = (1, 0.5, -0.5)
print("\nCI for Predicted Mean at x = (1, 0.5, -0.5):")
c_pred = np.array([1, 0.5, -0.5])
result = ci_for_linear_combination(X, y, c_pred)
true_mean = beta_true @ c_pred
print(f"  True mean: {true_mean:.3f}")
print(f"  Estimate: {result['theta_hat']:.3f}")
print(f"  CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")
```

When we're interested in multiple coefficients simultaneously, individual CIs are insufficient. The problem is the multiple comparisons issue.
The Problem with Individual CIs:
If we construct two independent 95% CIs, the probability that both simultaneously contain their true values is only $0.95^2 \approx 0.90$.
With p coefficients, the probability that all individual 95% CIs simultaneously contain their true values can be much less than 95%!
Joint Confidence Regions:
A $(1-\alpha) \times 100\%$ joint confidence region satisfies:
$$\Pr\left((\boldsymbol{\beta}_J - \hat{\boldsymbol{\beta}}_J)^\top \left[s^2 \mathbf{V}_J\right]^{-1} (\boldsymbol{\beta}_J - \hat{\boldsymbol{\beta}}_J) \leq q\, F_{q, n-p, \alpha}\right) = 1 - \alpha$$
where $\boldsymbol{\beta}_J$ is the $q$-dimensional subset of coefficients of interest, $\mathbf{V}_J = [(\mathbf{X}^\top\mathbf{X})^{-1}]_{JJ}$ is the corresponding $q \times q$ submatrix of $(\mathbf{X}^\top\mathbf{X})^{-1}$, and $F_{q, n-p, \alpha}$ is the upper-$\alpha$ critical value of the F-distribution with $q$ and $n-p$ degrees of freedom.
The joint confidence region is an ellipsoid in q-dimensional parameter space. For two coefficients, it's an ellipse. The ellipse's shape reflects the covariance between coefficient estimates: correlated estimates produce tilted ellipses; independent estimates produce axis-aligned ellipses.
For Two Coefficients (β₁, β₂):
The 95% joint confidence region is an ellipse defined by:
$$(\boldsymbol{\beta}_{12} - \hat{\boldsymbol{\beta}}_{12})^\top \left[s^2 \mathbf{V}_{12}\right]^{-1} (\boldsymbol{\beta}_{12} - \hat{\boldsymbol{\beta}}_{12}) \leq 2\, F_{2, n-p, 0.05}$$
where $\mathbf{V}_{12}$ is the $2 \times 2$ submatrix of $(\mathbf{X}^\top\mathbf{X})^{-1}$ corresponding to the two coefficients.
Comparison: Individual CIs vs Joint Region:
| Aspect | Individual CIs | Joint Confidence Region |
|---|---|---|
| Shape | Rectangle (crossed intervals) | Ellipse |
| Coverage | < 95% jointly | Exactly 95% |
| Size | Smaller (narrower margins) | Larger (accounts for multiplicity) |
| Use case | Marginal inference on each β | Simultaneous statements about all β |
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def joint_confidence_ellipse(X: np.ndarray, y: np.ndarray, indices: list,
                             alpha: float = 0.05, n_points: int = 100):
    """
    Compute the joint confidence ellipse for two coefficients.

    Parameters:
    -----------
    X : design matrix
    y : response
    indices : list of two coefficient indices
    alpha : significance level

    Returns:
    --------
    Dictionary with ellipse boundary points, subset estimates,
    covariance matrix, F critical value, and semi-axes
    """
    assert len(indices) == 2, "Only 2D ellipse supported"
    n, p = X.shape
    df = n - p
    q = len(indices)

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Subset estimates and covariance
    beta_hat_subset = beta_hat[indices]
    cov_subset = s_sq * XtX_inv[np.ix_(indices, indices)]

    # F critical value
    f_crit = stats.f.ppf(1 - alpha, q, df)

    # Ellipse: (β - β̂)' Σ^{-1} (β - β̂) = q * F_crit
    # Parameterize ellipse using eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(cov_subset)

    # Semi-axes lengths: sqrt(q * F_crit * eigenvalue)
    semi_axes = np.sqrt(q * f_crit * eigenvalues)

    # Generate ellipse points
    theta = np.linspace(0, 2*np.pi, n_points)
    unit_circle = np.column_stack([np.cos(theta), np.sin(theta)])

    # Transform: scale by semi-axes, rotate by eigenvectors, translate by β̂
    ellipse_points = unit_circle @ np.diag(semi_axes) @ eigenvectors.T + beta_hat_subset

    return {
        'ellipse_points': ellipse_points,
        'beta_hat': beta_hat_subset,
        'cov_matrix': cov_subset,
        'f_critical': f_crit,
        'semi_axes': semi_axes
    }

# Example
np.random.seed(42)
n = 50
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Joint CI for (β₁, β₂)
result = joint_confidence_ellipse(X, y, [1, 2])

print("Joint 95% Confidence Region for (β₁, β₂):")
print(f"  Estimates: β̂₁ = {result['beta_hat'][0]:.3f}, β̂₂ = {result['beta_hat'][1]:.3f}")
print(f"  True values: β₁ = {beta_true[1]:.3f}, β₂ = {beta_true[2]:.3f}")
print(f"  F-critical (2, {n-3}): {result['f_critical']:.3f}")
print(f"  Semi-axes: {result['semi_axes'][0]:.3f}, {result['semi_axes'][1]:.3f}")
print(f"  Covariance matrix:")
print(f"  {result['cov_matrix']}")
```

When constructing individual CIs for multiple coefficients, we can adjust the confidence level to achieve simultaneous coverage.
Bonferroni Correction:
To achieve simultaneous $(1-\alpha) \times 100\%$ coverage for $m$ coefficients, construct each individual CI at level $1 - \alpha/m$. By the union bound, the probability that at least one interval misses its parameter is then at most $m \cdot (\alpha/m) = \alpha$.
Bonferroni-adjusted CI:
$$\text{CI}_{\text{Bonf}}(\beta_j) = \hat{\beta}_j \pm t_{\alpha/(2m), n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)$$
Example: For 5 coefficients at simultaneous 95% level, each individual CI is constructed at the $1 - 0.05/5 = 99\%$ level, i.e. with critical value $t_{0.005, n-p}$ in place of $t_{0.025, n-p}$. A comparison of methods:
| Method | Individual CI Level | Pro | Con |
|---|---|---|---|
| No correction | 1-α | Powerful for each test | Inflated family-wise error |
| Bonferroni | 1-α/m | Simple, conservative | Can be very conservative |
| Šidák | 1-(1-α)^(1/m) | Slightly tighter than Bonferroni | Assumes independence |
| Scheffé | Uses F-distribution | Controls for all linear combos | Most conservative |
| Tukey HSD | Studentized range | Optimal for pairwise comparisons | Only for pairwise |
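To see what these corrections cost in interval width, this sketch (with arbitrary choices $m = 5$ and $n - p = 25$) compares per-interval critical values for the unadjusted, Bonferroni, and Šidák approaches:

```python
import scipy.stats as stats

# Sketch: per-interval critical values for simultaneous 95% coverage
m, alpha, df = 5, 0.05, 25  # m intervals, df = n - p (illustrative)

t_unadj = stats.t.ppf(1 - alpha/2, df)
t_bonf = stats.t.ppf(1 - alpha/(2*m), df)
alpha_sidak = 1 - (1 - alpha)**(1/m)
t_sidak = stats.t.ppf(1 - alpha_sidak/2, df)

print(f"Unadjusted: t = {t_unadj:.3f}")
print(f"Bonferroni: t = {t_bonf:.3f}  (per-CI level {1 - alpha/m:.3%})")
print(f"Šidák:      t = {t_sidak:.3f}  (per-CI level {1 - alpha_sidak:.3%})")
```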
Correct when: making simultaneous claims about multiple parameters, or in confirmatory analyses where each conclusion matters. Don't necessarily correct when: the analysis is exploratory, the parameters have very different substantive meanings, or you're reporting all individual p-values and letting readers judge.
The confidence intervals we've developed are exact under normality: the t-distribution is exactly correct, regardless of sample size.
Without Normality:
If errors are not Gaussian, the t-statistic doesn't follow an exact t-distribution. However, by the Central Limit Theorem:
$$\frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$
This asymptotic normality justifies using normal or t-based CIs even for non-Gaussian errors, provided $n$ is large enough.
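A coverage simulation illustrates the point; in this sketch (mean-zero exponential errors and the specific sample sizes are arbitrary choices), t-based 95% intervals are close to nominal even for skewed errors and improve as $n$ grows:

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

# Sketch: coverage of t-based 95% CIs under skewed (exponential) errors
rng = np.random.default_rng(3)

def coverage(n, n_reps=4000):
    hits = 0
    for _ in range(n_reps):
        X = np.column_stack([np.ones(n), rng.standard_normal(n)])
        eps = rng.exponential(1.0, n) - 1.0  # mean-zero, right-skewed errors
        y = X @ np.array([1.0, 2.0]) + eps
        XtX_inv = inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        s_sq = np.sum((y - X @ b)**2) / (n - 2)
        half = stats.t.ppf(0.975, n - 2) * np.sqrt(s_sq * XtX_inv[1, 1])
        hits += abs(b[1] - 2.0) <= half
    return hits / n_reps

for n in [10, 30, 100]:
    print(f"n = {n:3d}: empirical coverage = {coverage(n):.3f}")
```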
Regularity Conditions for Asymptotic Validity: informally, the errors should be independent with finite variance, $\mathbf{X}^\top\mathbf{X}/n$ should converge to a positive definite matrix, and no single observation should dominate the fit (maximum leverage shrinking toward zero).
Finite Sample vs. Asymptotic:
| Approach | Requirement | Coverage |
|---|---|---|
| Exact (t-based) | Normal errors | Exactly 1-α for any n |
| Asymptotic (z-based) | Finite variance, large n | Approximately 1-α, better as n→∞ |
| Bootstrap | Few distributional assumptions | Approximately 1-α, often robust |
There's no universal answer; it depends on several factors:
- How non-normal are the errors? (Heavier tails need larger $n$.)
- How many parameters? (More parameters need larger $n$.)
- What's the design structure? (Balanced designs converge faster.)

Rule of thumb: $n - p > 30$ is often adequate; $n - p > 100$ is usually safe for mild non-normality.
Bootstrap Confidence Intervals:
An alternative to parametric CIs is the bootstrap: resample the data (either whole cases or residuals reattached to fitted values), refit the regression on each resample, and take percentiles of the resulting coefficient estimates as interval endpoints.
Bootstrap CIs require few distributional assumptions, adapt to skewness and heavy tails, and are often robust when normality is doubtful, at the cost of extra computation. See the sketch below.
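Here is a minimal pairs (case-resampling) bootstrap percentile interval, as a sketch; the residual bootstrap is a common alternative, and $B = 2000$ is an arbitrary choice:

```python
import numpy as np
from numpy.linalg import inv

# Sketch: pairs bootstrap percentile CI for the slope
rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.standard_normal(n)

def ols_slope(Xb, yb):
    return (inv(Xb.T @ Xb) @ Xb.T @ yb)[1]

B = 2000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    boot_slopes[b] = ols_slope(X[idx], y[idx])

lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"Slope estimate: {ols_slope(X, y):.3f}")
print(f"95% bootstrap percentile CI: [{lo:.3f}, {hi:.3f}]")
```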
This page has developed the complete theory and practice of confidence intervals for regression coefficients. The key insights: the pivotal quantity follows a t-distribution because $\sigma$ must be estimated; the interval is $\hat{\beta}_j \pm t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)$; the confidence level describes the long-run behavior of the procedure, not any single interval; the same recipe handles any linear combination $\mathbf{c}^\top\boldsymbol{\beta}$; simultaneous inference requires joint regions or multiplicity corrections; and normality can be relaxed asymptotically or via the bootstrap.
What's Next:
Confidence intervals quantify uncertainty; hypothesis tests make formal decisions. The next page develops hypothesis testing for regression coefficients—t-tests for individual coefficients, F-tests for groups of coefficients, and the duality between tests and confidence intervals.
You now understand how to construct and correctly interpret confidence intervals for regression coefficients. This knowledge enables you to communicate uncertainty in your estimates, compare coefficients, and make informed decisions about statistical significance.