A point estimate like $\hat{\beta}_1 = 2.5$ provides our best single guess for the true coefficient. But how much trust should we place in this specific value? Could the true $\beta_1$ plausibly be 1.0? Or 5.0?
Confidence intervals transform point estimates into ranges that quantify our uncertainty. A 95% confidence interval might be $[1.8, 3.2]$, telling us that values in this range are 'consistent' with our data at the 95% confidence level.
This page develops the theory and practice of confidence intervals in regression, addressing both construction and—crucially—correct interpretation.
By the end of this page, you will be able to:
- construct confidence intervals for individual coefficients and linear combinations,
- understand why the t-distribution (not the normal) is used,
- correctly interpret what a confidence level means (and what it doesn't),
- recognize the challenge of simultaneous inference for multiple coefficients, and
- compute confidence regions for joint parameter vectors.
Confidence interval construction requires knowing the sampling distribution of $\hat{\boldsymbol{\beta}}$.
Under Gauss-Markov Assumptions + Normality:
We've established that, under the Gauss-Markov assumptions, $\hat{\boldsymbol{\beta}}$ is unbiased for $\boldsymbol{\beta}$ with $\text{Var}(\hat{\boldsymbol{\beta}} \mid \mathbf{X}) = \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.
If we additionally assume $\boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2\mathbf{I})$, then:
$$\hat{\boldsymbol{\beta}} | \mathbf{X} \sim \mathcal{N}(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1})$$
This follows because $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is a linear transformation of the normal vector $\mathbf{y}$.
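A quick simulation makes this concrete (a minimal sketch; the design, coefficients, noise level, and replication count are arbitrary illustrative choices): holding $\mathbf{X}$ fixed and redrawing the errors many times, the empirical covariance of $\hat{\boldsymbol{\beta}}$ should match $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$.

```python
import numpy as np
from numpy.linalg import inv

# Sketch: verify Var(beta_hat | X) = sigma^2 (X'X)^{-1} by simulation.
# Design, coefficients, and sigma are illustrative choices.
rng = np.random.default_rng(0)
n, sigma = 40, 0.5
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta = np.array([1.0, 2.0])
XtX_inv = inv(X.T @ X)

# Refit on many fresh error draws, keeping X fixed
estimates = np.array([
    XtX_inv @ X.T @ (X @ beta + sigma * rng.standard_normal(n))
    for _ in range(20000)
])

print("Empirical covariance of beta_hat:\n", np.cov(estimates.T))
print("Theoretical sigma^2 (X'X)^{-1}:\n", sigma**2 * XtX_inv)
```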
For Each Individual Coefficient:
$$\hat{\beta}_j \mid \mathbf{X} \sim \mathcal{N}(\beta_j, \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj})$$
Standardizing:
$$\frac{\hat{\beta}_j - \beta_j}{\sigma \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}} \sim \mathcal{N}(0, 1)$$
The Problem:
We don't know $\sigma$! We must use the estimate $s$ instead. This introduces additional randomness, and the resulting statistic no longer follows a standard normal distribution.
Replacing the known σ with the estimated s changes the distribution. The resulting ratio follows a t-distribution, not a normal distribution. This is because s² is random and based on the same data used to compute β̂.
Derivation of the t-Statistic:
Under normality of errors:
$\hat{\beta}_j \sim \mathcal{N}(\beta_j, \sigma^2[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj})$
$(n-p)s^2/\sigma^2 \sim \chi^2_{n-p}$ (chi-squared with n-p degrees of freedom)
$\hat{\boldsymbol{\beta}}$ and $s^2$ are independent
By the definition of the t-distribution:
$$t = \frac{\mathcal{N}(0,1)}{\sqrt{\chi^2_\nu / \nu}} \sim t_\nu$$
Applying this:
$$t_j = \frac{\hat{\beta}_j - \beta_j}{s \sqrt{[(\mathbf{X}^\top\mathbf{X})^{-1}]_{jj}}} = \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p}$$
$$\boxed{t_j = \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p}}$$ This is the pivotal quantity for inference: it has a known distribution (t with n-p degrees of freedom) regardless of the unknown parameters β and σ².
Properties of the t-Distribution:
| Property | Description |
|---|---|
| Shape | Symmetric, bell-shaped, like normal but heavier tails |
| Center | Mean = 0 (for ν > 1) |
| Spread | Variance = ν/(ν-2) for ν > 2, heavier tails than normal |
| Degrees of freedom (ν = n-p) | Controls tail heaviness |
| As ν → ∞ | Converges to N(0,1) |
Why Heavier Tails?
The t-distribution has heavier tails because it accounts for uncertainty in estimating $\sigma$. When $n$ is small, our estimate $s$ of $\sigma$ is imprecise, so extreme values of the t-statistic are more likely than under the normal distribution.
As $n - p$ increases, the t-distribution's tails lighten, critical values shrink toward their normal counterparts, and the resulting confidence intervals narrow.
The (1-α)×100% Confidence Interval for βⱼ:
Since $t_j = \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \sim t_{n-p}$, we have:
$$\Pr\left(-t_{\alpha/2, n-p} \leq \frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \leq t_{\alpha/2, n-p}\right) = 1 - \alpha$$
Rearranging for $\beta_j$:
$$\Pr\left(\hat{\beta}_j - t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j) \leq \beta_j \leq \hat{\beta}_j + t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)\right) = 1 - \alpha$$
The $(1-\alpha) \times 100\%$ confidence interval is:
$$\boxed{\text{CI}_{1-\alpha}(\beta_j) = \hat{\beta}_j \pm t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)}$$
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def compute_confidence_intervals(X: np.ndarray, y: np.ndarray, alpha: float = 0.05):
    """
    Compute confidence intervals for all regression coefficients.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response vector (n,)
    alpha : significance level (default 0.05 for 95% CI)

    Returns:
    --------
    Dictionary with estimates, SEs, CIs, and t-critical value
    """
    n, p = X.shape
    df = n - p

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residuals and variance estimate
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Standard errors
    se = np.sqrt(s_sq * np.diag(XtX_inv))

    # Critical value from t-distribution
    t_crit = stats.t.ppf(1 - alpha/2, df)

    # Confidence intervals
    ci_lower = beta_hat - t_crit * se
    ci_upper = beta_hat + t_crit * se
    ci_width = 2 * t_crit * se

    return {
        'beta_hat': beta_hat,
        'se': se,
        't_critical': t_crit,
        'df': df,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        'ci_width': ci_width,
        'alpha': alpha
    }

# Example
np.random.seed(42)
n = 30
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Compute CIs at different confidence levels
for alpha, conf in [(0.10, '90%'), (0.05, '95%'), (0.01, '99%')]:
    results = compute_confidence_intervals(X, y, alpha)
    print(f"\n{conf} Confidence Intervals (t-critical = {results['t_critical']:.3f}):")
    print("-" * 60)
    for j, name in enumerate(['Intercept', 'x1', 'x2']):
        print(f"  {name:10s}: {results['beta_hat'][j]:.3f} ± {results['t_critical'] * results['se'][j]:.3f}")
        print(f"              [{results['ci_lower'][j]:.3f}, {results['ci_upper'][j]:.3f}]")
        print(f"              Width: {results['ci_width'][j]:.3f}")
```

Critical Values:
| Confidence Level | α | α/2 | t-critical (df=∞) | z-critical |
|---|---|---|---|---|
| 90% | 0.10 | 0.05 | 1.645 | 1.645 |
| 95% | 0.05 | 0.025 | 1.960 | 1.960 |
| 99% | 0.01 | 0.005 | 2.576 | 2.576 |
For finite degrees of freedom, t-critical values are larger:
| df | t₀.₀₂₅ (for 95% CI) |
|---|---|
| 5 | 2.571 |
| 10 | 2.228 |
| 20 | 2.086 |
| 30 | 2.042 |
| 60 | 2.000 |
| ∞ | 1.960 |
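All of these critical values come straight from the quantile functions; a few lines with `scipy.stats` (a sketch) reproduce both tables:

```python
import scipy.stats as stats

# t critical values for 95% CIs at various degrees of freedom
for df in [5, 10, 20, 30, 60]:
    print(f"df = {df:3d}: t_0.025 = {stats.t.ppf(0.975, df):.3f}")

# Normal (df = infinity) critical values at several confidence levels
for conf, a in [('90%', 0.10), ('95%', 0.05), ('99%', 0.01)]:
    print(f"{conf}: z = {stats.norm.ppf(1 - a/2):.3f}")
```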
The interpretation of confidence intervals is frequently misunderstood—even by experienced practitioners. Let's be precise.
What a 95% Confidence Interval Means:
✅ Correct (Frequentist):
"If we repeated the sampling process many times and computed a 95% CI each time, approximately 95% of those intervals would contain the true parameter value."
The confidence level refers to the procedure, not to any individual interval.
✅ Also Correct:
"Before observing data, there is a 95% probability that the random interval $[\hat{\beta}j - t{\alpha/2}\cdot\text{SE}, \hat{\beta}j + t{\alpha/2}\cdot\text{SE}]$ will contain $\beta_j$."
This is the pre-data probability statement.
❌ 'There is a 95% probability that βⱼ is in the interval [1.5, 3.0].' The parameter is fixed (not random); it either is or isn't in the interval.
❌ 'We are 95% confident that βⱼ is between 1.5 and 3.0.' This implies personal belief/probability, which is Bayesian, not frequentist.
❌ '95% of the data falls in this interval.' The CI is about the parameter, not the data.
The Subtle Distinction:
Once we observe data and compute a specific interval like $[1.5, 3.0]$, nothing is random anymore: the fixed interval either contains the fixed $\beta_j$ or it doesn't. The "95%" describes the long-run success rate of the procedure that produced the interval, not a probability attached to this particular realization.
Practical Interpretation:
Despite the philosophical nuances, a 95% CI is often pragmatically interpreted as: "Values inside the CI are plausible given the data; values outside are not." This is an informal but useful heuristic.
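The procedural meaning of "95%" can also be checked by simulation; this sketch (design, coefficients, and noise level are illustrative choices) repeats the sampling process and counts how often the interval covers the true slope:

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

# Sketch: empirical coverage of the 95% CI for one slope.
rng = np.random.default_rng(1)
n, p = 30, 2
beta_true = np.array([1.0, 2.0])
n_reps, covered = 5000, 0
for _ in range(n_reps):
    X = np.column_stack([np.ones(n), rng.standard_normal(n)])
    y = X @ beta_true + 0.5 * rng.standard_normal(n)
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    s_sq = np.sum((y - X @ beta_hat)**2) / (n - p)
    se = np.sqrt(s_sq * XtX_inv[1, 1])
    half_width = stats.t.ppf(0.975, n - p) * se
    # Does this realized interval contain the true slope?
    covered += abs(beta_hat[1] - beta_true[1]) <= half_width

print(f"Empirical coverage: {covered / n_reps:.3f}  (nominal: 0.950)")
```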
Connection to Hypothesis Testing:
A 95% CI for $\beta_j$ contains exactly those values $\beta_j^0$ that would not be rejected by a two-sided test at the 5% significance level:
$$\beta_j^0 \in \text{CI}_{95\%} \iff \text{the p-value for } H_0{:}\ \beta_j = \beta_j^0 \text{ exceeds } 0.05$$
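This duality is easy to verify numerically. In the sketch below (simulated data; all choices illustrative), the two-sided p-value equals exactly 0.05 at the CI endpoints and exceeds it everywhere inside:

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

# Sketch: p-values at the 95% CI endpoints equal 0.05 exactly.
rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.standard_normal(n)

XtX_inv = inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
s_sq = np.sum((y - X @ beta_hat)**2) / (n - p)
se1 = np.sqrt(s_sq * XtX_inv[1, 1])
t_crit = stats.t.ppf(0.975, n - p)
lo, hi = beta_hat[1] - t_crit * se1, beta_hat[1] + t_crit * se1

def p_value(beta0):
    """Two-sided p-value for H0: beta_1 = beta0."""
    t_stat = (beta_hat[1] - beta0) / se1
    return 2 * stats.t.sf(abs(t_stat), n - p)

print(f"95% CI for beta_1: [{lo:.3f}, {hi:.3f}]")
print(f"p-value at lower endpoint: {p_value(lo):.3f}")  # 0.050
print(f"p-value at upper endpoint: {p_value(hi):.3f}")  # 0.050
print(f"p-value at the estimate:   {p_value(beta_hat[1]):.3f}")  # 1.000
```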
Often we're interested in linear combinations of coefficients rather than individual ones.
Examples: the difference between two coefficients, $\beta_1 - \beta_2$ (do two predictors have equal effects?); the sum $\beta_1 + \beta_2$ (a combined effect); and the mean response at a specific predictor vector $\mathbf{x}_0$, namely $\mathbf{x}_0^\top \boldsymbol{\beta}$.
General Form:
For linear combination $\theta = \mathbf{c}^\top \boldsymbol{\beta}$ where $\mathbf{c} \in \mathbb{R}^p$:
Estimate: $\hat{\theta} = \mathbf{c}^\top \hat{\boldsymbol{\beta}}$
Variance: $\text{Var}(\hat{\theta}) = \mathbf{c}^\top \text{Var}(\hat{\boldsymbol{\beta}}) \mathbf{c} = \sigma^2 \mathbf{c}^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{c}$
Estimated SE: $\widehat{\text{SE}}(\hat{\theta}) = s \sqrt{\mathbf{c}^\top (\mathbf{X}^\top\mathbf{X})^{-1} \mathbf{c}}$
Confidence interval:
$$\text{CI}_{1-\alpha}(\mathbf{c}^\top\boldsymbol{\beta}) = \mathbf{c}^\top \hat{\boldsymbol{\beta}} \pm t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\theta})$$
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def ci_for_linear_combination(X: np.ndarray, y: np.ndarray, c: np.ndarray,
                              alpha: float = 0.05):
    """
    Compute CI for a linear combination θ = c'β.

    Parameters:
    -----------
    X : design matrix (n x p)
    y : response (n,)
    c : coefficient vector (p,) defining the linear combination
    alpha : significance level

    Returns:
    --------
    Dictionary with estimate, SE, CI
    """
    n, p = X.shape
    df = n - p

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Linear combination estimate
    theta_hat = c @ beta_hat

    # Variance of linear combination
    var_theta = s_sq * (c @ XtX_inv @ c)
    se_theta = np.sqrt(var_theta)

    # CI
    t_crit = stats.t.ppf(1 - alpha/2, df)
    ci_lower = theta_hat - t_crit * se_theta
    ci_upper = theta_hat + t_crit * se_theta

    return {
        'theta_hat': theta_hat,
        'se': se_theta,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        't_critical': t_crit
    }

# Example
np.random.seed(42)
n = 100
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, 3.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# CI for individual coefficients
print("CIs for Individual Coefficients:")
for j, name in enumerate(['β₀', 'β₁', 'β₂']):
    c = np.zeros(3)
    c[j] = 1
    result = ci_for_linear_combination(X, y, c)
    print(f"  {name}: {result['theta_hat']:.3f} [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")

# CI for difference β₁ - β₂
print("\nCI for Difference (β₁ - β₂):")
c_diff = np.array([0, 1, -1])
result = ci_for_linear_combination(X, y, c_diff)
print(f"  True: {beta_true[1] - beta_true[2]:.3f}")
print(f"  Estimate: {result['theta_hat']:.3f}")
print(f"  CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")

# CI for sum β₁ + β₂
print("\nCI for Sum (β₁ + β₂):")
c_sum = np.array([0, 1, 1])
result = ci_for_linear_combination(X, y, c_sum)
print(f"  True: {beta_true[1] + beta_true[2]:.3f}")
print(f"  Estimate: {result['theta_hat']:.3f}")
print(f"  CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")

# CI for predicted mean at x = (1, 0.5, -0.5)
print("\nCI for Predicted Mean at x = (1, 0.5, -0.5):")
c_pred = np.array([1, 0.5, -0.5])
result = ci_for_linear_combination(X, y, c_pred)
true_mean = beta_true @ c_pred
print(f"  True mean: {true_mean:.3f}")
print(f"  Estimate: {result['theta_hat']:.3f}")
print(f"  CI: [{result['ci_lower']:.3f}, {result['ci_upper']:.3f}]")
```

When we're interested in multiple coefficients simultaneously, individual CIs are insufficient. The problem is the multiple comparisons issue.
The Problem with Individual CIs:
If we construct two independent 95% CIs, the probability that both simultaneously contain their true values is only $0.95^2 \approx 0.90$.
With p coefficients, the probability that all individual 95% CIs simultaneously contain their true values can be much less than 95%!
Joint Confidence Regions:
A $(1-\alpha) \times 100\%$ joint confidence region satisfies:
$$\Pr\left((\boldsymbol{\beta}_J - \hat{\boldsymbol{\beta}}_J)^\top \left[s^2 \mathbf{V}_J\right]^{-1} (\boldsymbol{\beta}_J - \hat{\boldsymbol{\beta}}_J) \leq q\, F_{q, n-p, \alpha}\right) = 1 - \alpha$$
where $\boldsymbol{\beta}_J$ is the $q$-dimensional subset of coefficients of interest, $\mathbf{V}_J = [(\mathbf{X}^\top\mathbf{X})^{-1}]_{JJ}$ is the corresponding $q \times q$ submatrix of $(\mathbf{X}^\top\mathbf{X})^{-1}$, and $F_{q, n-p, \alpha}$ is the upper-$\alpha$ critical value of the F-distribution with $q$ and $n-p$ degrees of freedom.
The joint confidence region is an ellipsoid in q-dimensional parameter space. For two coefficients, it's an ellipse. The ellipse's shape reflects the covariance between coefficient estimates: correlated estimates produce tilted ellipses; independent estimates produce axis-aligned ellipses.
For Two Coefficients (β₁, β₂):
The 95% joint confidence region is an ellipse defined by:
$$(\boldsymbol{\beta}_{12} - \hat{\boldsymbol{\beta}}_{12})^\top \left[s^2 \mathbf{V}_{12}\right]^{-1} (\boldsymbol{\beta}_{12} - \hat{\boldsymbol{\beta}}_{12}) \leq 2\, F_{2, n-p, 0.05}$$
where $\mathbf{V}_{12}$ is the $2 \times 2$ submatrix of $(\mathbf{X}^\top\mathbf{X})^{-1}$ corresponding to the two coefficients.
Comparison: Individual CIs vs Joint Region:
| Aspect | Individual CIs | Joint Confidence Region |
|---|---|---|
| Shape | Rectangle (crossed intervals) | Ellipse |
| Coverage | < 95% jointly | Exactly 95% |
| Size | Smaller (narrower margins) | Larger (accounts for multiplicity) |
| Use case | Marginal inference on each β | Simultaneous statements about all β |
```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

def joint_confidence_ellipse(X: np.ndarray, y: np.ndarray, indices: list,
                             alpha: float = 0.05, n_points: int = 100):
    """
    Compute the joint confidence ellipse for two coefficients.

    Parameters:
    -----------
    X : design matrix
    y : response
    indices : list of two coefficient indices
    alpha : significance level

    Returns:
    --------
    Dictionary with ellipse boundary points, subset estimates,
    covariance matrix, F critical value, and semi-axes
    """
    assert len(indices) == 2, "Only 2D ellipse supported"
    n, p = X.shape
    df = n - p
    q = len(indices)

    # OLS estimates
    XtX_inv = inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Residual variance
    residuals = y - X @ beta_hat
    s_sq = np.sum(residuals**2) / df

    # Subset estimates and covariance
    beta_hat_subset = beta_hat[indices]
    cov_subset = s_sq * XtX_inv[np.ix_(indices, indices)]

    # F critical value
    f_crit = stats.f.ppf(1 - alpha, q, df)

    # Ellipse: (β - β̂)' Σ^{-1} (β - β̂) = q * F_crit
    # Parameterize ellipse using eigendecomposition
    eigenvalues, eigenvectors = np.linalg.eigh(cov_subset)

    # Semi-axes lengths: sqrt(q * F_crit * eigenvalue)
    semi_axes = np.sqrt(q * f_crit * eigenvalues)

    # Generate ellipse points
    theta = np.linspace(0, 2*np.pi, n_points)
    unit_circle = np.column_stack([np.cos(theta), np.sin(theta)])

    # Transform: scale by semi-axes, rotate by eigenvectors, translate by β̂
    ellipse_points = unit_circle @ np.diag(semi_axes) @ eigenvectors.T + beta_hat_subset

    return {
        'ellipse_points': ellipse_points,
        'beta_hat': beta_hat_subset,
        'cov_matrix': cov_subset,
        'f_critical': f_crit,
        'semi_axes': semi_axes
    }

# Example
np.random.seed(42)
n = 50
X = np.column_stack([np.ones(n), np.random.randn(n), np.random.randn(n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + 0.5 * np.random.randn(n)

# Joint CI for (β₁, β₂)
result = joint_confidence_ellipse(X, y, [1, 2])

print("Joint 95% Confidence Region for (β₁, β₂):")
print(f"  Estimates: β̂₁ = {result['beta_hat'][0]:.3f}, β̂₂ = {result['beta_hat'][1]:.3f}")
print(f"  True values: β₁ = {beta_true[1]:.3f}, β₂ = {beta_true[2]:.3f}")
print(f"  F-critical (2, {n-3}): {result['f_critical']:.3f}")
print(f"  Semi-axes: {result['semi_axes'][0]:.3f}, {result['semi_axes'][1]:.3f}")
print(f"  Covariance matrix:")
print(f"  {result['cov_matrix']}")
```

When constructing individual CIs for multiple coefficients, we can adjust the confidence level to achieve simultaneous coverage.
Bonferroni Correction:
To achieve simultaneous $(1-\alpha) \times 100\%$ coverage for $m$ coefficients, construct each individual CI at level $1 - \alpha/m$. By the union bound, the probability that at least one interval misses its parameter is then at most $m \cdot (\alpha/m) = \alpha$.
Bonferroni-adjusted CI:
$$\text{CI}_{\text{Bonf}}(\beta_j) = \hat{\beta}_j \pm t_{\alpha/(2m), n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)$$
Example: For 5 coefficients at simultaneous 95% level, each individual CI is constructed at the $1 - 0.05/5 = 99\%$ level, i.e. with critical value $t_{0.005, n-p}$ in place of $t_{0.025, n-p}$. A comparison of methods:
| Method | Individual CI Level | Pro | Con |
|---|---|---|---|
| No correction | 1-α | Powerful for each test | Inflated family-wise error |
| Bonferroni | 1-α/m | Simple, conservative | Can be very conservative |
| Šidák | 1-(1-α)^(1/m) | Slightly tighter than Bonferroni | Assumes independence |
| Scheffé | Uses F-distribution | Controls for all linear combos | Most conservative |
| Tukey HSD | Studentized range | Optimal for pairwise comparisons | Only for pairwise |
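To see what these corrections cost in interval width, this sketch (with arbitrary choices $m = 5$ and $n - p = 25$) compares per-interval critical values for the unadjusted, Bonferroni, and Šidák approaches:

```python
import scipy.stats as stats

# Sketch: per-interval critical values for simultaneous 95% coverage
m, alpha, df = 5, 0.05, 25  # m intervals, df = n - p (illustrative)

t_unadj = stats.t.ppf(1 - alpha/2, df)
t_bonf = stats.t.ppf(1 - alpha/(2*m), df)
alpha_sidak = 1 - (1 - alpha)**(1/m)
t_sidak = stats.t.ppf(1 - alpha_sidak/2, df)

print(f"Unadjusted: t = {t_unadj:.3f}")
print(f"Bonferroni: t = {t_bonf:.3f}  (per-CI level {1 - alpha/m:.3%})")
print(f"Šidák:      t = {t_sidak:.3f}  (per-CI level {1 - alpha_sidak:.3%})")
```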
Correct when: making simultaneous claims about multiple parameters, or in confirmatory analyses where each conclusion matters. Don't necessarily correct when: the analysis is exploratory, the parameters have very different substantive meanings, or you're reporting all individual p-values and letting readers judge.
The confidence intervals we've developed are exact under normality: the t-distribution is exactly correct, regardless of sample size.
Without Normality:
If errors are not Gaussian, the t-statistic doesn't follow an exact t-distribution. However, by the Central Limit Theorem:
$$\frac{\hat{\beta}_j - \beta_j}{\widehat{\text{SE}}(\hat{\beta}_j)} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$
This asymptotic normality justifies using normal or t-based CIs even for non-Gaussian errors, provided $n$ is large enough.
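A coverage simulation illustrates the point; in this sketch (mean-zero exponential errors and the specific sample sizes are arbitrary choices), t-based 95% intervals are close to nominal even for skewed errors and improve as $n$ grows:

```python
import numpy as np
from numpy.linalg import inv
import scipy.stats as stats

# Sketch: coverage of t-based 95% CIs under skewed (exponential) errors
rng = np.random.default_rng(3)

def coverage(n, n_reps=4000):
    hits = 0
    for _ in range(n_reps):
        X = np.column_stack([np.ones(n), rng.standard_normal(n)])
        eps = rng.exponential(1.0, n) - 1.0  # mean-zero, right-skewed errors
        y = X @ np.array([1.0, 2.0]) + eps
        XtX_inv = inv(X.T @ X)
        b = XtX_inv @ X.T @ y
        s_sq = np.sum((y - X @ b)**2) / (n - 2)
        half = stats.t.ppf(0.975, n - 2) * np.sqrt(s_sq * XtX_inv[1, 1])
        hits += abs(b[1] - 2.0) <= half
    return hits / n_reps

for n in [10, 30, 100]:
    print(f"n = {n:3d}: empirical coverage = {coverage(n):.3f}")
```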
Regularity Conditions for Asymptotic Validity: informally, the errors should be independent with finite variance, $\mathbf{X}^\top\mathbf{X}/n$ should converge to a positive definite matrix, and no single observation should dominate the fit (maximum leverage shrinking toward zero).
Finite Sample vs. Asymptotic:
| Approach | Requirement | Coverage |
|---|---|---|
| Exact (t-based) | Normal errors | Exactly 1-α for any n |
| Asymptotic (z-based) | Finite variance, large n | Approximately 1-α, better as n→∞ |
| Bootstrap | Few distributional assumptions | Approximately 1-α, often robust |
There's no universal answer; it depends on several factors:
- How non-normal are the errors? (Heavier tails need larger $n$.)
- How many parameters? (More parameters need larger $n$.)
- What's the design structure? (Balanced designs converge faster.)

Rule of thumb: $n - p > 30$ is often adequate; $n - p > 100$ is usually safe for mild non-normality.
Bootstrap Confidence Intervals:
An alternative to parametric CIs is the bootstrap: resample the data (either whole cases or residuals reattached to fitted values), refit the regression on each resample, and take percentiles of the resulting coefficient estimates as interval endpoints.
Bootstrap CIs require few distributional assumptions, adapt to skewness and heavy tails, and are often robust when normality is doubtful, at the cost of extra computation. See the sketch below.
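Here is a minimal pairs (case-resampling) bootstrap percentile interval, as a sketch; the residual bootstrap is a common alternative, and $B = 2000$ is an arbitrary choice:

```python
import numpy as np
from numpy.linalg import inv

# Sketch: pairs bootstrap percentile CI for the slope
rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + 0.5 * rng.standard_normal(n)

def ols_slope(Xb, yb):
    return (inv(Xb.T @ Xb) @ Xb.T @ yb)[1]

B = 2000
boot_slopes = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    boot_slopes[b] = ols_slope(X[idx], y[idx])

lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"Slope estimate: {ols_slope(X, y):.3f}")
print(f"95% bootstrap percentile CI: [{lo:.3f}, {hi:.3f}]")
```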
This page has developed the complete theory and practice of confidence intervals for regression coefficients. The key insights: the pivotal quantity follows a t-distribution because $\sigma$ must be estimated; the interval is $\hat{\beta}_j \pm t_{\alpha/2, n-p} \cdot \widehat{\text{SE}}(\hat{\beta}_j)$; the confidence level describes the long-run behavior of the procedure, not any single interval; the same recipe handles any linear combination $\mathbf{c}^\top\boldsymbol{\beta}$; simultaneous inference requires joint regions or multiplicity corrections; and normality can be relaxed asymptotically or via the bootstrap.
What's Next:
Confidence intervals quantify uncertainty; hypothesis tests make formal decisions. The next page develops hypothesis testing for regression coefficients—t-tests for individual coefficients, F-tests for groups of coefficients, and the duality between tests and confidence intervals.
You now understand how to construct and correctly interpret confidence intervals for regression coefficients. This knowledge enables you to communicate uncertainty in your estimates, compare coefficients, and make informed decisions about statistical significance.