Throughout our treatment of linear regression, we've implicitly assumed that all observations are equally reliable: every squared residual (yᵢ - xᵢᵀβ)² contributes equally to the loss function.
But this assumption often fails in practice; the table below collects common situations where error variances differ systematically across observations.
Weighted Least Squares (WLS) generalizes OLS by allowing each observation to have a different weight in the objective function.
By the end of this page, you will: (1) understand when and why to use weighted least squares, (2) derive the WLS estimator and its properties, (3) implement WLS efficiently, and (4) connect WLS to generalized least squares and iteratively reweighted methods.
Recall the OLS (Gauss-Markov) assumptions: (1) the relationship is linear in the parameters, (2) the errors have zero mean, (3) the errors are homoscedastic (constant variance σ²) and uncorrelated, and (4) the design matrix has full column rank.
When assumption 3 (homoscedasticity) fails, we have heteroscedasticity:
$$\text{Var}(\varepsilon_i) = \sigma_i^2$$
where σᵢ² varies across observations.
Consequences of Ignoring Heteroscedasticity:
- β̂_OLS remains unbiased, but it is no longer efficient (no longer BLUE).
- The usual OLS standard errors are wrong, so confidence intervals and tests based on them are unreliable.
The Core Insight:
If observation i has variance σᵢ², it should receive weight wᵢ = 1/σᵢ². High-variance (unreliable) observations get low weight; low-variance (precise) observations get high weight.
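As a toy illustration of this principle (a hypothetical two-sensor example, not part of the datasets used later in this page): combining a precise and a noisy measurement of the same quantity with weights 1/σᵢ² gives a lower-variance estimate than a plain average.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 10.0                       # true quantity being measured
sigma = np.array([0.5, 3.0])    # sensor 1 is precise, sensor 2 is noisy

# Simulate many paired measurements
y = mu + sigma * rng.standard_normal((100_000, 2))

w = 1.0 / sigma**2              # inverse-variance weights
plain_mean = y.mean(axis=1)
weighted_mean = (y * w).sum(axis=1) / w.sum()

print(f"Var(plain average):    {plain_mean.var():.3f}")     # ≈ (0.25 + 9)/4 ≈ 2.31
print(f"Var(weighted average): {weighted_mean.var():.3f}")  # ≈ 1/(1/0.25 + 1/9) ≈ 0.24
```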
| Domain | Pattern | Cause |
|---|---|---|
| Financial data | Volatility clustering | Variance increases during market stress |
| Survey data | Group size effects | Aggregate means from different sample sizes |
| Sensor data | Calibration variation | Different sensors have different precision |
| Time series | Level-dependent variance | Larger values have proportionally larger errors |
| Cross-sectional | Scale effects | Large companies have larger absolute variations |
WLS handles cases where we know the relative variances. More generally, Generalized Least Squares (GLS) handles both heteroscedasticity AND correlated errors (e.g., time series autocorrelation). WLS is the special case of GLS with diagonal covariance matrix.
The WLS Objective:
Instead of minimizing the sum of squared residuals, WLS minimizes the weighted sum:
$$\mathcal{L}_{WLS}(\beta) = \sum_{i=1}^{n} w_i (y_i - x_i^\top \beta)^2 = (y - X\beta)^\top W (y - X\beta)$$
where W = diag(w₁, w₂, ..., wₙ) is the diagonal weight matrix.
Matrix Form:
Let W^(1/2) = diag(√w₁, ..., √wₙ). Then we can write:
$$\mathcal{L}_{WLS}(\beta) = \|W^{1/2}(y - X\beta)\|^2 = \|\tilde{y} - \tilde{X}\beta\|^2$$
where ỹ = W^(1/2)y and X̃ = W^(1/2)X are weighted versions of the data.
This is just OLS on transformed data!
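To see why the transformed problem satisfies the OLS assumptions, note that when wᵢ = 1/σᵢ² the rescaled errors all have unit variance:

$$\tilde{y}_i = \sqrt{w_i}\,y_i = \sqrt{w_i}\,x_i^\top\beta + \sqrt{w_i}\,\varepsilon_i, \qquad \text{Var}(\sqrt{w_i}\,\varepsilon_i) = w_i\,\sigma_i^2 = \frac{\sigma_i^2}{\sigma_i^2} = 1$$

W^(1/2) simply rescales each observation by the reciprocal of its error standard deviation, so homoscedastic OLS theory applies to (X̃, ỹ).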
Derivation of the WLS Solution:
Taking the gradient and setting to zero:
$$\nabla_\beta \mathcal{L}_{WLS} = -2X^\top W(y - X\beta) = 0$$
$$X^\top W X \beta = X^\top W y$$
The WLS estimator:
$$\boxed{\hat{\beta}_{WLS} = (X^\top W X)^{-1} X^\top W y}$$
These are the normal equations with W inserted. Compare with OLS: β̂_OLS = (XᵀX)⁻¹Xᵀy.
Equivalently via transformation:
$$\hat{\beta}_{WLS} = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top \tilde{y} = \hat{\beta}_{OLS}(\tilde{X}, \tilde{y})$$
To solve WLS, you can either: (1) modify your solver to handle weights, or (2) transform data by √wᵢ and use standard OLS. The second approach lets you reuse all your OLS machinery—QR decomposition, regularization, etc.—without modification.
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def weighted_least_squares(X, y, weights):
    """
    Solve weighted least squares via normal equations.

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix
    y : ndarray of shape (n,)
        Target vector
    weights : ndarray of shape (n,)
        Observation weights (must be positive)

    Returns:
    --------
    beta : ndarray of shape (p,)
        WLS coefficient estimates
    """
    n, p = X.shape

    # Construct weight matrix W (diagonal)
    W = np.diag(weights)

    # Normal equations: (X^T W X) beta = X^T W y
    XtW = X.T @ W       # (p, n)
    XtWX = XtW @ X      # (p, p)
    XtWy = XtW @ y      # (p,)

    # Solve via Cholesky (since XtWX is symmetric positive definite)
    try:
        L = cholesky(XtWX, lower=True)
        z = solve_triangular(L, XtWy, lower=True)
        beta = solve_triangular(L.T, z, lower=False)
    except np.linalg.LinAlgError:
        # Fall back to lstsq if singular
        beta = np.linalg.lstsq(XtWX, XtWy, rcond=None)[0]

    return beta


def wls_via_transformation(X, y, weights):
    """
    Solve WLS by transforming to standard OLS.

    Transform: X̃ = W^{1/2} X, ỹ = W^{1/2} y
    Then solve: X̃.T @ X̃ @ beta = X̃.T @ ỹ

    This approach lets us use standard OLS machinery (QR, SVD, etc.)
    """
    sqrt_w = np.sqrt(weights)

    # Transform data
    X_tilde = X * sqrt_w[:, np.newaxis]  # Multiply each row by sqrt(w_i)
    y_tilde = y * sqrt_w

    # Standard OLS on transformed data
    beta = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

    return beta


def demonstrate_wls():
    """
    Demonstrate WLS on heteroscedastic data.
    """
    np.random.seed(42)
    n, p = 200, 3
    X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
    beta_true = np.array([2.0, -1.0, 0.5])

    # Heteroscedastic errors: variance increases with X[:, 1]
    sigma = 0.5 + 2.0 * np.abs(X[:, 1])  # Standard deviation
    epsilon = sigma * np.random.randn(n)
    y = X @ beta_true + epsilon

    # True weights (inverse variance)
    true_weights = 1.0 / sigma**2

    # OLS (ignoring heteroscedasticity)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # WLS with true weights
    beta_wls = weighted_least_squares(X, y, true_weights)

    # WLS via transformation (should match)
    beta_wls2 = wls_via_transformation(X, y, true_weights)

    print("=== WLS vs OLS on Heteroscedastic Data ===")
    print(f"True β: {beta_true}")
    print(f"OLS β: {np.round(beta_ols, 4)}")
    print(f"WLS β: {np.round(beta_wls, 4)}")
    print(f"WLS (transform) β: {np.round(beta_wls2, 4)}")
    print()
    print(f"OLS error: {np.linalg.norm(beta_ols - beta_true):.4f}")
    print(f"WLS error: {np.linalg.norm(beta_wls - beta_true):.4f}")

    # Simulate many times to compare efficiency
    n_sim = 1000
    ols_estimates = []
    wls_estimates = []
    for _ in range(n_sim):
        epsilon = sigma * np.random.randn(n)
        y_sim = X @ beta_true + epsilon
        ols_estimates.append(np.linalg.lstsq(X, y_sim, rcond=None)[0])
        wls_estimates.append(weighted_least_squares(X, y_sim, true_weights))

    ols_estimates = np.array(ols_estimates)
    wls_estimates = np.array(wls_estimates)

    print(f"\nVariance of estimates (avg over {n_sim} simulations):")
    print(f"  OLS variance: {np.var(ols_estimates, axis=0).sum():.6f}")
    print(f"  WLS variance: {np.var(wls_estimates, axis=0).sum():.6f}")
    print(f"  Efficiency gain: {np.var(ols_estimates, axis=0).sum() / np.var(wls_estimates, axis=0).sum():.2f}x")


demonstrate_wls()
```

When Weights Are Correct (wᵢ = 1/σᵢ²):
If the weights are chosen to be the inverse variances of the errors, WLS has remarkable properties:
1. Unbiasedness: $$E[\hat{\beta}_{WLS}] = \beta$$
WLS remains unbiased (same as OLS).
2. BLUE (Best Linear Unbiased Estimator):
Under heteroscedasticity, WLS is BLUE—it achieves the minimum variance among all linear unbiased estimators. OLS is no longer BLUE under heteroscedasticity.
3. Variance of WLS Estimator: $$\text{Var}(\hat{\beta}_{WLS}) = (X^\top W X)^{-1}$$
when the weights equal inverse variances.
This is the Gauss-Markov theorem extended to the heteroscedastic case.
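Where this variance comes from: substituting y = Xβ + ε with Var(ε) = Σ = diag(σ₁², ..., σₙ²) into the WLS estimator gives

$$\hat{\beta}_{WLS} = \beta + (X^\top W X)^{-1} X^\top W \varepsilon \quad\Rightarrow\quad \text{Var}(\hat{\beta}_{WLS}) = (X^\top W X)^{-1} X^\top W \Sigma W X\,(X^\top W X)^{-1},$$

which collapses to (XᵀWX)⁻¹ exactly when W = Σ⁻¹, i.e., wᵢ = 1/σᵢ².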
When Weights Are Misspecified:
In practice, we rarely know the true σᵢ². What happens if our weights are wrong?
1. Still unbiased: WLS remains unbiased for any positive weights (assuming the linear model is correct).
2. No longer BLUE: You lose efficiency compared to using the correct weights.
3. Standard errors need correction: The naive variance formula (XᵀWX)⁻¹ is wrong. Use the sandwich estimator:
$$\text{Var}(\hat{\beta}_{WLS}) = (X^\top W X)^{-1} X^\top W \Sigma W X (X^\top W X)^{-1}$$
where Σ = diag(σ₁², ..., σₙ²) is the true error covariance.
This is often called the "robust" or "heteroscedasticity-consistent" covariance estimator.
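Here is a minimal sketch of that correction, assuming the common plug-in where the unknown σᵢ² are replaced by squared WLS residuals (the function name and this HC0-style choice are illustrative, not from the text above):

```python
import numpy as np

def wls_sandwich_cov(X, y, weights):
    """Sandwich (heteroscedasticity-consistent) covariance for a WLS fit.

    Uses squared residuals as a plug-in for the unknown error variances,
    in the spirit of White/HC0 estimators.
    """
    W = np.diag(weights)
    bread = np.linalg.inv(X.T @ W @ X)       # (XᵀWX)⁻¹
    beta = bread @ X.T @ W @ y               # WLS coefficients

    resid = y - X @ beta
    Sigma_hat = np.diag(resid**2)            # plug-in for diag(σ₁², ..., σₙ²)

    meat = X.T @ W @ Sigma_hat @ W @ X       # XᵀW Σ̂ W X
    cov = bread @ meat @ bread               # sandwich formula
    return beta, cov
```

If the weights happen to be correct (W ≈ Σ⁻¹), the sandwich reduces to the naive (XᵀWX)⁻¹ up to estimation noise, so using it costs little when the weights are right and protects you when they are wrong.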
To compute optimal weights, you need to know the error variances σᵢ². But to estimate σᵢ², you need the residuals. And to get good residuals, you need a good fit. This circular dependency is why WLS often requires iterative approaches or external variance estimates.
| Scenario | OLS | WLS (correct weights) | WLS (wrong weights) |
|---|---|---|---|
| Homoscedastic data | BLUE | Same as OLS (if W∝I) | Unbiased, inefficient |
| Heteroscedastic data | Unbiased but not BLUE | BLUE | Unbiased but not BLUE |
| Correct standard errors? | No (wrong variance) | Yes | No (need sandwich) |
Since true variances are unknown, we need practical strategies to estimate or choose weights:
Strategy 1: Squared Residual Regression
Fit OLS, then regress the (log) squared residuals on the predictors or fitted values to model how the variance changes; the fitted variances σ̂ᵢ² give weights ŵᵢ = 1/σ̂ᵢ² for a WLS re-fit (see `feasible_gls` below). This is called Feasible GLS (FGLS) when iterated.
Strategy 2: Group-Based Weights
If observations naturally fall into groups with different variances, estimate the residual variance within each group and weight every observation in group g by 1/σ̂_g² (see `group_based_weights` below).
Example: Panel data with different time periods having different volatility.
Strategy 3: Known Functional Form
Sometimes theory suggests a variance structure: count data often have variance proportional to the mean, an average of nᵢ raw measurements has variance σ²/nᵢ, and multiplicative errors give variance proportional to the squared level of the response. The sketch below illustrates the aggregated-means case.
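A small sketch of that aggregated-means case (the group sizes and parameter values below are hypothetical): when yᵢ is the mean of nᵢ raw measurements with a common noise level, Var(εᵢ) = σ²/nᵢ, so weighting by wᵢ ∝ nᵢ is optimal and no variance estimation is needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150
group_sizes = rng.integers(5, 200, size=n)        # n_i raw measurements behind each mean

X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 0.5])
sigma = 2.0                                       # noise level of a single raw measurement
y = X @ beta_true + rng.standard_normal(n) * sigma / np.sqrt(group_sizes)

weights = group_sizes.astype(float)               # w_i ∝ n_i (any positive rescaling works)
sqrt_w = np.sqrt(weights)
beta_wls = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)[0]
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("WLS:", np.round(beta_wls, 3), " OLS:", np.round(beta_ols, 3))
```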
```python
import numpy as np

def feasible_gls(X, y, max_iter=10, tol=1e-6):
    """
    Feasible Generalized Least Squares (FGLS).

    Iteratively estimates variance structure and re-fits WLS.

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix
    y : ndarray of shape (n,)
        Target vector
    max_iter : int
        Maximum iterations
    tol : float
        Convergence tolerance

    Returns:
    --------
    beta : ndarray of shape (p,)
        FGLS coefficient estimates
    weights : ndarray of shape (n,)
        Estimated optimal weights
    """
    n, p = X.shape

    # Step 1: Initial OLS fit
    beta = np.linalg.lstsq(X, y, rcond=None)[0]

    for iteration in range(max_iter):
        beta_old = beta.copy()

        # Step 2: Compute residuals
        residuals = y - X @ beta

        # Step 3: Model the variance
        # Simple approach: regress log(resid²) on log(|predictions|)
        # This assumes variance proportional to some power of E[y]
        log_resid_sq = np.log(residuals**2 + 1e-10)  # Avoid log(0)
        y_pred = X @ beta

        # Fit a simple model for variance: log(σ²) = α + β₁·log(|ŷ|)
        # This captures multiplicative heteroscedasticity
        design_var = np.column_stack([
            np.ones(n),
            np.log(np.abs(y_pred) + 1)  # Avoid log(0)
        ])
        gamma = np.linalg.lstsq(design_var, log_resid_sq, rcond=None)[0]

        # Predicted variances
        log_var_pred = design_var @ gamma
        var_pred = np.exp(log_var_pred)

        # Weights = 1/variance
        weights = 1.0 / np.maximum(var_pred, 1e-10)

        # Step 4: WLS with estimated weights
        sqrt_w = np.sqrt(weights)
        X_weighted = X * sqrt_w[:, np.newaxis]
        y_weighted = y * sqrt_w
        beta = np.linalg.lstsq(X_weighted, y_weighted, rcond=None)[0]

        # Check convergence
        if np.linalg.norm(beta - beta_old) < tol:
            print(f"FGLS converged in {iteration + 1} iterations")
            break

    return beta, weights


def group_based_weights(X, y, groups):
    """
    Estimate weights from group-specific variances.

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix
    y : ndarray of shape (n,)
        Target vector
    groups : ndarray of shape (n,)
        Group labels for each observation

    Returns:
    --------
    beta : ndarray of shape (p,)
        WLS estimates
    weights : ndarray of shape (n,)
        Group-based weights
    """
    unique_groups = np.unique(groups)

    # Initial OLS
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    residuals = y - X @ beta

    # Estimate variance per group
    weights = np.zeros(len(y))
    for g in unique_groups:
        mask = groups == g
        group_resid = residuals[mask]
        group_var = np.var(group_resid, ddof=1)
        weights[mask] = 1.0 / group_var

    # WLS with group weights
    sqrt_w = np.sqrt(weights)
    X_weighted = X * sqrt_w[:, np.newaxis]
    y_weighted = y * sqrt_w
    beta = np.linalg.lstsq(X_weighted, y_weighted, rcond=None)[0]

    return beta, weights


def demonstrate_weight_estimation():
    """
    Compare different weight estimation strategies.
    """
    np.random.seed(42)
    n, p = 300, 3
    X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
    beta_true = np.array([1.0, 2.0, -1.0])

    # Create heteroscedastic data: variance increases with |X @ beta|
    signal = X @ beta_true
    sigma = 0.5 + 0.5 * np.abs(signal)
    epsilon = sigma * np.random.randn(n)
    y = signal + epsilon

    # True optimal weights
    true_weights = 1.0 / sigma**2

    # Method 1: OLS (ignores heteroscedasticity)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # Method 2: WLS with true weights (oracle)
    sqrt_w = np.sqrt(true_weights)
    beta_oracle = np.linalg.lstsq(X * sqrt_w[:, np.newaxis], y * sqrt_w, rcond=None)[0]

    # Method 3: FGLS (estimated weights)
    beta_fgls, weights_fgls = feasible_gls(X, y)

    print("=== Weight Estimation Comparison ===")
    print(f"True β: {beta_true}")
    print(f"OLS:    {np.round(beta_ols, 4)} (error: {np.linalg.norm(beta_ols - beta_true):.4f})")
    print(f"Oracle: {np.round(beta_oracle, 4)} (error: {np.linalg.norm(beta_oracle - beta_true):.4f})")
    print(f"FGLS:   {np.round(beta_fgls, 4)} (error: {np.linalg.norm(beta_fgls - beta_true):.4f})")

    # Check weight quality
    weight_corr = np.corrcoef(true_weights, weights_fgls)[0, 1]
    print(f"\nCorrelation between true and estimated weights: {weight_corr:.3f}")


demonstrate_weight_estimation()
```

Iteratively Reweighted Least Squares (IRLS) is a powerful technique that recasts certain optimization problems as a sequence of weighted least squares problems. It's used extensively in robust regression (M-estimation) and in fitting generalized linear models.
The General Framework:
Many loss functions can be written as:
$$\mathcal{L}(\beta) = \sum_{i=1}^{n} \rho(y_i - x_i^\top \beta)$$
for some loss function ρ applied to the residuals rᵢ = yᵢ - xᵢᵀβ. Setting the gradient to zero gives the estimating equations Σᵢ ψ(rᵢ)xᵢ = 0 with ψ = ρ′. Defining the weight function
$$w(r) = \frac{\psi(r)}{r} = \frac{\rho'(r)}{r}$$
turns these into the weighted normal equations XᵀW(y - Xβ) = 0, so minimizing is equivalent to iteratively solving a WLS problem:
1. Compute the residuals rᵢ = yᵢ - xᵢᵀβ⁽ᵗ⁾ at the current estimate.
2. Update the weights wᵢ = w(rᵢ).
3. Solve the WLS problem with these weights to obtain β⁽ᵗ⁺¹⁾, and repeat until convergence.
Example: Huber Regression
The Huber loss is quadratic for small residuals and linear for large ones:
$$\rho_{Huber}(r) = \begin{cases} \frac{1}{2}r^2 & |r| \leq \delta \\ \delta|r| - \frac{1}{2}\delta^2 & |r| > \delta \end{cases}$$
The weight function becomes:
$$w(r) = \begin{cases} 1 & |r| \leq \delta \\ \delta / |r| & |r| > \delta \end{cases}$$
Large residuals (outliers) get downweighted, making the estimator robust. For example, with δ = 1.345 a scaled residual of 5 receives weight 1.345/5 ≈ 0.27, while residuals below the threshold keep full weight 1.
```python
import numpy as np

def huber_weights(residuals, delta=1.345):
    """
    Compute Huber weights for given residuals.

    Parameters:
    -----------
    residuals : ndarray
        Current residuals
    delta : float
        Threshold for switching from quadratic to linear loss.
        1.345 gives 95% efficiency at the normal distribution.
    """
    abs_r = np.abs(residuals)
    weights = np.ones_like(residuals)

    # For large residuals, downweight
    mask = abs_r > delta
    weights[mask] = delta / abs_r[mask]

    return weights


def irls_huber(X, y, delta=1.345, max_iter=50, tol=1e-6):
    """
    Huber's M-estimator via IRLS.

    Robust to outliers: downweights observations with large residuals.
    """
    n, p = X.shape

    # Initialize with OLS
    beta = np.linalg.lstsq(X, y, rcond=None)[0]

    for iteration in range(max_iter):
        beta_old = beta.copy()

        # Compute residuals
        residuals = y - X @ beta

        # Scale residuals by MAD (robust scale estimate)
        mad = np.median(np.abs(residuals - np.median(residuals)))
        scale = mad / 0.6745  # Consistent at normal distribution
        scaled_residuals = residuals / (scale + 1e-10)

        # Compute Huber weights
        weights = huber_weights(scaled_residuals, delta)

        # WLS step
        sqrt_w = np.sqrt(weights)
        X_w = X * sqrt_w[:, np.newaxis]
        y_w = y * sqrt_w
        beta = np.linalg.lstsq(X_w, y_w, rcond=None)[0]

        # Check convergence
        if np.linalg.norm(beta - beta_old) < tol:
            print(f"IRLS-Huber converged in {iteration + 1} iterations")
            break

    return beta, weights


def bisquare_weights(residuals, c=4.685):
    """
    Tukey's bisquare weights (more aggressive downweighting).

    c=4.685 gives 95% efficiency at the normal distribution.
    """
    abs_r = np.abs(residuals)
    weights = np.zeros_like(residuals)

    mask = abs_r <= c
    u = residuals[mask] / c
    weights[mask] = (1 - u**2)**2

    return weights


def irls_bisquare(X, y, c=4.685, max_iter=50, tol=1e-6):
    """
    Tukey's bisquare M-estimator via IRLS.

    More robust than Huber: completely ignores extreme outliers.
    """
    n, p = X.shape

    # Initialize with OLS
    beta = np.linalg.lstsq(X, y, rcond=None)[0]

    for iteration in range(max_iter):
        beta_old = beta.copy()

        residuals = y - X @ beta
        mad = np.median(np.abs(residuals - np.median(residuals)))
        scale = mad / 0.6745
        scaled_residuals = residuals / (scale + 1e-10)

        weights = bisquare_weights(scaled_residuals, c)

        # Avoid all-zero weights
        if weights.sum() < 0.1:
            print("Warning: too many outliers, reducing c")
            c *= 1.5
            weights = bisquare_weights(scaled_residuals, c)

        sqrt_w = np.sqrt(weights)
        X_w = X * sqrt_w[:, np.newaxis]
        y_w = y * sqrt_w
        beta = np.linalg.lstsq(X_w, y_w, rcond=None)[0]

        if np.linalg.norm(beta - beta_old) < tol:
            print(f"IRLS-Bisquare converged in {iteration + 1} iterations")
            break

    return beta, weights


def demonstrate_robust_regression():
    """
    Compare OLS, Huber, and Bisquare on contaminated data.
    """
    np.random.seed(42)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), np.random.randn(n, p-1)])
    beta_true = np.array([1.0, 2.0, -1.0])

    # Clean data
    y_clean = X @ beta_true + 0.5 * np.random.randn(n)

    # Add outliers (10% contamination)
    n_outliers = 10
    outlier_idx = np.random.choice(n, n_outliers, replace=False)
    y_contaminated = y_clean.copy()
    y_contaminated[outlier_idx] += np.random.choice([-1, 1], n_outliers) * 20

    print("=== Robust Regression Comparison ===")
    print(f"True β: {beta_true}")
    print(f"Data: {n} observations, {n_outliers} outliers (10% contamination)")
    print()

    # OLS (sensitive to outliers)
    beta_ols = np.linalg.lstsq(X, y_contaminated, rcond=None)[0]
    print(f"OLS:      {np.round(beta_ols, 3)} (error: {np.linalg.norm(beta_ols - beta_true):.3f})")

    # Huber (moderate robustness)
    beta_huber, w_huber = irls_huber(X, y_contaminated)
    print(f"Huber:    {np.round(beta_huber, 3)} (error: {np.linalg.norm(beta_huber - beta_true):.3f})")

    # Bisquare (high robustness)
    beta_bisq, w_bisq = irls_bisquare(X, y_contaminated)
    print(f"Bisquare: {np.round(beta_bisq, 3)} (error: {np.linalg.norm(beta_bisq - beta_true):.3f})")

    # Show which points were downweighted
    print(f"\nOutlier indices: {sorted(outlier_idx)}")
    print(f"Points with Huber weight < 0.5:    {np.where(w_huber < 0.5)[0]}")
    print(f"Points with Bisquare weight < 0.5: {np.where(w_bisq < 0.5)[0]}")


demonstrate_robust_regression()
```

Generalized Linear Models (logistic, Poisson regression) are fit using IRLS with weights derived from the variance function. The famous Fisher scoring algorithm is essentially IRLS. We'll cover this in detail in the GLM chapter.
Generalized Least Squares (GLS) extends WLS to handle correlated errors, not just heteroscedasticity.
Model: $$y = X\beta + \varepsilon, \quad \text{Var}(\varepsilon) = \sigma^2 \Omega$$
where Ω is a general (not necessarily diagonal) covariance structure.
Examples of correlated errors: autocorrelated residuals in time series, spatial correlation between nearby locations, and within-cluster correlation in grouped or panel data.
The GLS Estimator:
If we know Ω, the GLS estimator is:
$$\hat{\beta}_{GLS} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y$$
Notice this is WLS with W = Ω⁻¹, but now W is not diagonal.
Transformation Approach:
Since Ω is positive definite, we can factor Ω = LLᵀ (Cholesky). Premultiplying the model by L⁻¹ gives transformed data ỹ = L⁻¹y and X̃ = L⁻¹X whose errors ε̃ = L⁻¹ε are uncorrelated with constant variance, so ordinary OLS on (X̃, ỹ) yields exactly β̂_GLS.
This is called decorrelation or prewhitening.
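To check that the transformation really whitens the errors, compute the covariance of ε̃ = L⁻¹ε:

$$\text{Var}(L^{-1}\varepsilon) = L^{-1}\,\text{Var}(\varepsilon)\,L^{-\top} = \sigma^2 L^{-1}\Omega L^{-\top} = \sigma^2 L^{-1} L L^\top L^{-\top} = \sigma^2 I$$

so the transformed model ỹ = X̃β + ε̃ satisfies the standard OLS assumptions.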
```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gls(X, y, Omega):
    """
    Generalized Least Squares estimator.

    Parameters:
    -----------
    X : ndarray of shape (n, p)
        Design matrix
    y : ndarray of shape (n,)
        Target vector
    Omega : ndarray of shape (n, n)
        Error covariance matrix (positive definite)

    Returns:
    --------
    beta : ndarray of shape (p,)
        GLS coefficient estimates
    """
    # Method 1: Direct formula (requires inverting Omega)
    # beta = solve(X.T @ solve(Omega, X), X.T @ solve(Omega, y))

    # Method 2: Transform via Cholesky (more stable)
    # Omega = L @ L.T, so Omega^{-1} = L^{-T} @ L^{-1}
    L = cholesky(Omega, lower=True)

    # Transform: X_tilde = L^{-1} @ X, y_tilde = L^{-1} @ y
    X_tilde = solve_triangular(L, X, lower=True)
    y_tilde = solve_triangular(L, y, lower=True)

    # Now apply OLS to transformed data
    beta = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

    return beta


def ar1_covariance(n, rho):
    """
    Construct AR(1) correlation matrix.

    Corr(ε_i, ε_j) = rho^|i-j|
    """
    idx = np.arange(n)
    return rho ** np.abs(idx[:, np.newaxis] - idx[np.newaxis, :])


def demonstrate_gls():
    """
    Demonstrate GLS on time series data with autocorrelated errors.
    """
    np.random.seed(42)
    n, p = 100, 2
    rho = 0.8  # Strong autocorrelation

    # Design matrix (time trend)
    t = np.linspace(0, 1, n)
    X = np.column_stack([np.ones(n), t])
    beta_true = np.array([1.0, 2.0])

    # Generate AR(1) errors
    sigma_epsilon = 0.5
    epsilon = np.zeros(n)
    epsilon[0] = np.random.randn()
    for i in range(1, n):
        epsilon[i] = rho * epsilon[i-1] + np.sqrt(1 - rho**2) * np.random.randn()
    epsilon *= sigma_epsilon

    y = X @ beta_true + epsilon

    # True covariance structure
    Omega = ar1_covariance(n, rho)

    # OLS (ignores autocorrelation)
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

    # GLS (uses true covariance)
    beta_gls = gls(X, y, Omega)

    print("=== GLS vs OLS with Autocorrelated Errors ===")
    print(f"True β: {beta_true}")
    print(f"Autocorrelation (ρ): {rho}")
    print()
    print(f"OLS: {np.round(beta_ols, 4)} (error: {np.linalg.norm(beta_ols - beta_true):.4f})")
    print(f"GLS: {np.round(beta_gls, 4)} (error: {np.linalg.norm(beta_gls - beta_true):.4f})")

    # Simulate many times to compare efficiency
    n_sim = 500
    ols_estimates = []
    gls_estimates = []
    for _ in range(n_sim):
        # Generate new AR(1) errors
        epsilon = np.zeros(n)
        epsilon[0] = np.random.randn()
        for i in range(1, n):
            epsilon[i] = rho * epsilon[i-1] + np.sqrt(1 - rho**2) * np.random.randn()
        epsilon *= sigma_epsilon
        y_sim = X @ beta_true + epsilon

        ols_estimates.append(np.linalg.lstsq(X, y_sim, rcond=None)[0])
        gls_estimates.append(gls(X, y_sim, Omega))

    ols_estimates = np.array(ols_estimates)
    gls_estimates = np.array(gls_estimates)

    print(f"\nEstimator variance (over {n_sim} simulations):")
    print(f"  OLS variance: {np.var(ols_estimates, axis=0)}")
    print(f"  GLS variance: {np.var(gls_estimates, axis=0)}")
    print(f"  Efficiency gain: {np.var(ols_estimates, axis=0).sum() / np.var(gls_estimates, axis=0).sum():.2f}x")


demonstrate_gls()
```

In practice, Ω is unknown and must be estimated. Common approaches: (1) For time series, fit an AR/MA model to OLS residuals to estimate ρ. (2) For spatial data, use variogram fitting. (3) For clustered data, use a working correlation structure (GEE approach).
| Method | Error Structure | Weight Matrix | Use Case |
|---|---|---|---|
| OLS | iid, homoscedastic | W = I | Default when assumptions hold |
| WLS | Independent, heteroscedastic | W = diag(1/σᵢ²) | Known variance structure |
| FGLS | Independent, heteroscedastic | W = diag(1/σ̂ᵢ²) | Unknown variance structure |
| GLS | Correlated | W = Ω⁻¹ | Time series, spatial, clustered |
| IRLS | Varies | Iteratively computed | Robust regression, GLMs |
Module Complete!
You've now completed the Computational Considerations module, covering:
These computational skills transform theoretical knowledge into practical engineering capability—the difference between knowing linear regression and being able to deploy it reliably at scale.
Congratulations! You've mastered the computational foundations of linear regression. You now understand not just what linear regression computes, but how to compute it efficiently, stably, and correctly in real-world systems.