Everything we've covered so far assumes batch learning: you have all your data upfront, you fit the model once, and you're done. But real-world systems often don't work this way:
- data arrives continuously as a stream (sensors, markets, user interactions),
- memory may be too limited to store the full history,
- the underlying relationship may drift over time.
For these scenarios, we need online (incremental) algorithms that update the model efficiently as each new data point arrives—without reprocessing all historical data.
By the end of this page, you will: (1) understand recursive least squares (RLS) and its derivation, (2) master the Sherman-Morrison-Woodbury formulas for matrix inverse updates, (3) implement efficient O(p²) incremental updates, and (4) handle data removal (downdating) for sliding-window regression.
The Batch Setting (Review):
Given n observations (X, y) where X is n × p, the OLS solution is:
$$\beta = (X^\top X)^{-1} X^\top y$$
Computing this from scratch costs O(np²) for forming XᵀX and O(p³) for the solve—acceptable once, but prohibitive if repeated for every new data point.
The Online Setting:
Data arrives sequentially: (x₁, y₁), (x₂, y₂), ..., (xₜ, yₜ), ...
After receiving (xₜ, yₜ), we want to update our coefficient estimate βₜ₋₁ → βₜ in O(p²) time, independent of how many samples have already been seen.
The key insight: rather than re-computing (XᵀX)⁻¹ from scratch, we update the existing inverse when a new row is added to X.
Notation:
Let's establish notation for the incremental setting. After t observations, let Xₜ (t × p) and yₜ collect the data so far, and define:
$$A_t = X_t^\top X_t = \sum_{i=1}^{t} x_i x_i^\top, \qquad b_t = X_t^\top y_t = \sum_{i=1}^{t} x_i y_i$$
With Pₜ = Aₜ⁻¹, the OLS solution at time t is βₜ = Pₜ bₜ.
When a new observation (xₜ₊₁, yₜ₊₁) arrives:
$$A_{t+1} = A_t + x_{t+1} x_{t+1}^\top \quad \text{(rank-1 update)}$$ $$b_{t+1} = b_t + x_{t+1} y_{t+1}$$
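These running sums are easy to verify directly. The standalone sketch below (not part of the page's own code; variable names are illustrative) accumulates the rank-1 terms one sample at a time and checks that they reproduce the batch quantities XᵀX and Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

A = np.zeros((p, p))
b = np.zeros(p)
for t in range(n):
    A += np.outer(X[t], X[t])   # A_{t+1} = A_t + x xᵀ
    b += X[t] * y[t]            # b_{t+1} = b_t + x y

print(np.allclose(A, X.T @ X), np.allclose(b, X.T @ y))  # expect: True True
```

Each step touches only the new sample, which is exactly why the online setting never needs to revisit historical rows of X.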
Given Pₜ = Aₜ⁻¹ and the rank-1 update Aₜ₊₁ = Aₜ + xxᵀ, can we compute Pₜ₊₁ = Aₜ₊₁⁻¹ efficiently without inverting from scratch? The answer is yes—the Sherman-Morrison formula gives an O(p²) update.
The Sherman-Morrison formula is one of the most important results in computational linear algebra. It tells us how to update a matrix inverse when the matrix undergoes a rank-1 perturbation.
Theorem (Sherman-Morrison):
If A is invertible and A + uvᵀ is also invertible, then:
$$\boxed{(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1 + v^\top A^{-1}u}}$$
where u, v ∈ ℝᵖ are vectors.
Proof:
Let B = A + uvᵀ and suppose B⁻¹ = A⁻¹ + C for some correction matrix C.
We need B · B⁻¹ = I:
$$(A + uv^\top)(A^{-1} + C) = I$$ $$I + AC + uv^\top A^{-1} + uv^\top C = I$$ $$AC + uv^\top A^{-1} + uv^\top C = 0$$
Factor out u: $$AC = -u(v^\top A^{-1} + v^\top C)$$
Since the correction comes from a rank-1 perturbation, it should itself be rank-1, so try the ansatz C = -αA⁻¹uvᵀA⁻¹ for some scalar α. Substituting:
$$A(-\alpha A^{-1}uv^\top A^{-1}) = -u\big(v^\top A^{-1} + v^\top(-\alpha A^{-1}uv^\top A^{-1})\big)$$ $$-\alpha uv^\top A^{-1} = -u\big(v^\top A^{-1} - \alpha (v^\top A^{-1}u)\,v^\top A^{-1}\big)$$ $$-\alpha uv^\top A^{-1} = -uv^\top A^{-1}\big(1 - \alpha v^\top A^{-1}u\big)$$
Comparing coefficients: α = 1 - αvᵀA⁻¹u, so α(1 + vᵀA⁻¹u) = 1, giving α = 1/(1 + vᵀA⁻¹u).
Substituting back gives the Sherman-Morrison formula. ∎
Sherman-Morrison requires: (1) one matrix-vector product A⁻¹u: O(p²), (2) one dot product vᵀ(A⁻¹u): O(p), (3) one outer product and matrix subtraction: O(p²). Total: O(p²), regardless of n!
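The formula is easy to sanity-check numerically. The standalone sketch below (not from the page's code; names are illustrative) uses the symmetric case u = v = x, where the denominator 1 + xᵀA⁻¹x is guaranteed positive for a positive definite A, and compares the O(p²) correction against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6

# Random symmetric positive definite A, so A is invertible and
# 1 + xᵀA⁻¹x > 0 for any x
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)
x = rng.standard_normal(p)

A_inv = np.linalg.inv(A)

# Sherman-Morrison with u = v = x: one O(p²) rank-1 correction
Ax = A_inv @ x                                 # O(p²) matrix-vector product
denom = 1 + x @ Ax                             # O(p) dot product
B_inv_sm = A_inv - np.outer(Ax, Ax) / denom    # O(p²) outer product

# Direct O(p³) inverse for comparison
B_inv_direct = np.linalg.inv(A + np.outer(x, x))

print(np.allclose(B_inv_sm, B_inv_direct))  # expect: True
```

Note that because A is symmetric here, A⁻¹x xᵀA⁻¹ is simply the outer product of Ax with itself.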
For Least Squares:
When a new sample (x, y) arrives:
$$P_{t+1} = (A_t + xx^\top)^{-1} = P_t - \frac{P_t xx^\top P_t}{1 + x^\top P_t x}$$
This follows from Sherman-Morrison with u = v = x, since our rank-1 update is symmetric.
The Gain Vector:
Define the gain vector (or Kalman gain in filtering literature):
$$k_t = \frac{P_t x}{1 + x^\top P_t x}$$
Then the update becomes:
$$P_{t+1} = P_t - k_t x^\top P_t = (I - k_t x^\top) P_t$$
Recursive Least Squares combines the Sherman-Morrison inverse update with an elegant formula for updating the coefficient vector directly.
The RLS Algorithm:
Given current estimates Pₜ (inverse covariance) and βₜ (coefficients), when new observation (x, y) arrives:
1. Compute prediction error: $$e = y - x^\top \beta_t \quad \text{(innovation/residual)}$$
2. Compute gain vector: $$k = \frac{P_t x}{1 + x^\top P_t x}$$
3. Update coefficients: $$\beta_{t+1} = \beta_t + k \cdot e$$
4. Update inverse covariance: $$P_{t+1} = P_t - k x^\top P_t$$
Derivation of the Coefficient Update:
We need to show that βₜ₊₁ = βₜ + k·e gives the correct least squares solution.
The batch solution at time t+1: $$\beta_{t+1} = P_{t+1} b_{t+1} = P_{t+1}(b_t + xy)$$
Using Pₜ₊₁ = Pₜ - kxᵀPₜ: $$\beta_{t+1} = (P_t - kx^\top P_t)(b_t + xy)$$ $$= P_t b_t + P_t xy - kx^\top P_t b_t - kx^\top P_t xy$$ $$= \beta_t + P_t xy - k(x^\top \beta_t) - k(x^\top P_t x)y$$
Since k(1 + xᵀPₜx) = Pₜx by the definition of the gain, the second and fourth terms combine: $$P_t xy - k(x^\top P_t x)y = \big(P_t x - k\,x^\top P_t x\big)y = k y$$
After simplification (using the definition of k): $$\beta_{t+1} = \beta_t + k(y - x^\top \beta_t) = \beta_t + k \cdot e \quad \checkmark$$
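The derivation can be checked numerically at a single step (a standalone sketch, not the page's implementation): starting from the exact batch solution on t samples, one RLS step must reproduce the batch solution on t + 1 samples.

```python
import numpy as np

rng = np.random.default_rng(3)
t, p = 30, 4
X = rng.standard_normal((t + 1, p))
y = rng.standard_normal(t + 1)

# Batch quantities at time t (a tiny ridge keeps A_t safely invertible)
ridge = 1e-9
A_t = X[:t].T @ X[:t] + ridge * np.eye(p)
P_t = np.linalg.inv(A_t)
beta_t = P_t @ (X[:t].T @ y[:t])

# One RLS step with the new sample (x, y_new)
x, y_new = X[t], y[t]
k = P_t @ x / (1 + x @ P_t @ x)
beta_rls = beta_t + k * (y_new - x @ beta_t)

# Batch solution at time t+1
A_t1 = X.T @ X + ridge * np.eye(p)
beta_batch = np.linalg.inv(A_t1) @ (X.T @ y)

print(np.allclose(beta_rls, beta_batch))  # expect: True
```

The agreement is exact up to floating-point error, confirming that RLS is not an approximation: it is batch OLS computed recursively.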
```python
import numpy as np


class RecursiveLeastSquares:
    """
    Recursive Least Squares estimator for online linear regression.

    Maintains O(p²) state and performs O(p²) updates per sample.
    Mathematically equivalent to batch OLS at each step.
    """

    def __init__(self, p, regularization=1e-6):
        """
        Initialize RLS estimator.

        Parameters
        ----------
        p : int
            Number of features
        regularization : float
            Initial regularization (P₀ = (1/regularization) * I).
            Smaller values = faster initial learning, less stable.
        """
        self.p = p
        # Initialize P = (regularization * I)^(-1) = (1/regularization) * I
        # Large initial P = high initial uncertainty = fast adaptation
        self.P = np.eye(p) / regularization
        # Initialize coefficients to zero
        self.beta = np.zeros(p)
        # Track number of samples seen
        self.n_samples = 0

    def update(self, x, y):
        """
        Update the model with a single new observation.

        Parameters
        ----------
        x : ndarray of shape (p,)
            Feature vector
        y : float
            Target value

        Returns
        -------
        prediction_error : float
            Error before updating (innovation)
        """
        x = np.asarray(x).flatten()
        assert len(x) == self.p, f"Expected {self.p} features, got {len(x)}"

        # Step 1: Compute prediction error (before update)
        prediction = x @ self.beta
        error = y - prediction

        # Step 2: Compute gain vector via Sherman-Morrison
        Px = self.P @ x               # O(p²)
        denominator = 1 + x @ Px      # O(p)
        k = Px / denominator          # Gain vector

        # Step 3: Update coefficients
        self.beta = self.beta + k * error  # O(p)

        # Step 4: Update inverse covariance matrix
        # P = P - k * x^T * P = (I - k * x^T) * P
        self.P = self.P - np.outer(k, x @ self.P)  # O(p²)

        self.n_samples += 1
        return error

    def fit_batch(self, X, y):
        """
        Process a batch of samples online (sequentially).

        Parameters
        ----------
        X : ndarray of shape (n, p)
            Design matrix
        y : ndarray of shape (n,)
            Target vector

        Returns
        -------
        errors : list
            Prediction errors for each sample
        """
        errors = []
        for i in range(len(y)):
            error = self.update(X[i], y[i])
            errors.append(error)
        return errors

    def predict(self, X):
        """Make predictions."""
        return X @ self.beta

    @property
    def coefficients(self):
        return self.beta.copy()


def demonstrate_rls():
    """Demonstrate RLS and compare with batch OLS."""
    np.random.seed(42)

    # Generate data
    n, p = 1000, 5
    X = np.random.randn(n, p)
    beta_true = np.array([1, -2, 3, -4, 5])
    y = X @ beta_true + 0.5 * np.random.randn(n)

    # RLS online learning
    rls = RecursiveLeastSquares(p, regularization=1e-3)
    errors = rls.fit_batch(X, y)

    # Batch OLS for comparison
    beta_batch = np.linalg.lstsq(X, y, rcond=None)[0]

    print("=== RLS vs Batch OLS Comparison ===")
    print(f"True coefficients:  {beta_true}")
    print(f"RLS estimate:       {np.round(rls.beta, 4)}")
    print(f"Batch OLS estimate: {np.round(beta_batch, 4)}")
    print(f"RLS error:   {np.linalg.norm(rls.beta - beta_true):.6f}")
    print(f"Batch error: {np.linalg.norm(beta_batch - beta_true):.6f}")

    # Show convergence behavior
    print(f"RLS converged after {n} samples")
    print(f"Initial MSE (first 10):  {np.mean(np.array(errors[:10])**2):.4f}")
    print(f"Midpoint MSE (500-510):  {np.mean(np.array(errors[500:510])**2):.4f}")
    print(f"Final MSE (last 10):     {np.mean(np.array(errors[-10:])**2):.4f}")


demonstrate_rls()
```

RLS is a special case of the Kalman filter for a static system (constant true coefficients): the matrix P plays the role of the error covariance, and k is the Kalman gain. When coefficients are time-varying, full Kalman filtering adds process noise to track changes.
Sherman-Morrison handles rank-1 updates. What if we receive a batch of k new samples at once? Applying Sherman-Morrison k times costs O(kp²). We can do better.
Theorem (Woodbury Matrix Identity):
The Woodbury formula generalizes Sherman-Morrison to rank-k updates:
$$(A + UCV^\top)^{-1} = A^{-1} - A^{-1}U(C^{-1} + V^\top A^{-1}U)^{-1}V^\top A^{-1}$$
where:
- U ∈ ℝᵖˣᵏ and V ∈ ℝᵖˣᵏ are the update factors,
- C ∈ ℝᵏˣᵏ is invertible (often C = Iₖ),
- the inner inverse (C⁻¹ + VᵀA⁻¹U)⁻¹ is only k × k.
For Least Squares:
When a batch of k new samples X_new (k × p) and y_new (k,) arrives:
$$A_{new} = A + X_{new}^\top X_{new}$$
This is a rank-k update with U = V = X_newᵀ ∈ ℝᵖˣᵏ and C = Iₖ:
$$P_{new} = P - P X_{new}^\top (I_k + X_{new} P X_{new}^\top)^{-1} X_{new} P$$
Cost:
- forming PX_newᵀ and the final products: O(p²k),
- inverting (solving with) the k × k matrix: O(k³).

Total: O(p²k + k³).
When k << p, this is much better than O(p³) for re-inverting from scratch.
```python
import numpy as np


def woodbury_update(P, X_new, y_new, beta):
    """
    Batch update using the Woodbury formula.

    More efficient than repeated Sherman-Morrison when k > 1.

    Parameters
    ----------
    P : ndarray of shape (p, p)
        Current inverse covariance matrix
    X_new : ndarray of shape (k, p)
        New feature batch
    y_new : ndarray of shape (k,)
        New target batch
    beta : ndarray of shape (p,)
        Current coefficient estimate

    Returns
    -------
    P_new : ndarray of shape (p, p)
        Updated inverse covariance
    beta_new : ndarray of shape (p,)
        Updated coefficients
    """
    k, p = X_new.shape

    # Woodbury: P_new = P - P @ X^T @ (I + X @ P @ X^T)^{-1} @ X @ P
    PXt = P @ X_new.T          # (p, k)
    XPXt = X_new @ PXt         # (k, k)

    # Solve (I + X P X^T) @ M = I for M, equivalent to inverting
    IplusXPXt = np.eye(k) + XPXt
    M = np.linalg.solve(IplusXPXt, np.eye(k))  # (k, k)

    # Update P
    P_new = P - PXt @ M @ X_new @ P

    # Update beta. Direct route:
    #   b_new = b + X_new^T @ y_new;  beta_new = P_new @ b_new
    # More efficient: use the innovation form
    #   beta_new = beta + P_new @ X_new^T @ (y_new - X_new @ beta)
    innovation = y_new - X_new @ beta
    beta_new = beta + P_new @ X_new.T @ innovation

    return P_new, beta_new


def compare_update_methods():
    """Compare Sherman-Morrison (one at a time) vs Woodbury (batch)."""
    np.random.seed(42)
    p = 100
    n_existing = 500
    k_new = 50  # Batch size

    # Initial data
    X = np.random.randn(n_existing, p)
    beta_true = np.random.randn(p)
    y = X @ beta_true + 0.1 * np.random.randn(n_existing)

    # Compute initial solution
    XtX = X.T @ X + 1e-6 * np.eye(p)  # Small regularization
    Xty = X.T @ y
    P = np.linalg.inv(XtX)
    beta = P @ Xty

    # New batch
    X_new = np.random.randn(k_new, p)
    y_new = X_new @ beta_true + 0.1 * np.random.randn(k_new)

    # Method 1: Sherman-Morrison (sequential)
    import time
    P_sm = P.copy()
    beta_sm = beta.copy()
    start = time.time()
    for i in range(k_new):
        x = X_new[i]
        yi = y_new[i]
        Px = P_sm @ x
        k_gain = Px / (1 + x @ Px)
        error = yi - x @ beta_sm
        beta_sm = beta_sm + k_gain * error
        P_sm = P_sm - np.outer(k_gain, x @ P_sm)
    time_sm = time.time() - start

    # Method 2: Woodbury (batch)
    start = time.time()
    P_wb, beta_wb = woodbury_update(P, X_new, y_new, beta)
    time_wb = time.time() - start

    # Method 3: Recompute from scratch
    X_all = np.vstack([X, X_new])
    y_all = np.concatenate([y, y_new])
    start = time.time()
    beta_full = np.linalg.lstsq(X_all, y_all, rcond=None)[0]
    time_full = time.time() - start

    print("=== Update Method Comparison ===")
    print(f"Batch size k = {k_new}, features p = {p}")
    print("Timing:")
    print(f"  Sherman-Morrison (sequential): {time_sm*1000:.2f} ms")
    print(f"  Woodbury (batch):              {time_wb*1000:.2f} ms")
    print(f"  Full recompute:                {time_full*1000:.2f} ms")
    print("Accuracy (||β - β_full||):")
    print(f"  Sherman-Morrison: {np.linalg.norm(beta_sm - beta_full):.2e}")
    print(f"  Woodbury:         {np.linalg.norm(beta_wb - beta_full):.2e}")


compare_update_methods()
```

Sometimes we need to remove observations rather than add them. This is called downdating and arises in:
- sliding-window regression (drop the oldest observation as new ones arrive),
- leave-one-out cross-validation,
- removing observations later identified as outliers or corrupted.
The Downdating Formula:
To remove a sample (x, y), we reverse the Sherman-Morrison update:
$$A_{new} = A - xx^\top$$
$$P_{new} = P + \frac{Pxx^\top P}{1 - x^\top P x}$$
Note the sign changes: minus becomes plus, and the denominator uses 1 - xᵀPx instead of 1 + xᵀPx.
Downdating is numerically more treacherous than updating. If 1 - xᵀPx ≈ 0, the update explodes. This happens when x contributes most of the information about some direction—removing it makes that direction unidentifiable. Always check the denominator before downdating!
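As a quick sanity check (a standalone sketch, not part of the class below): updating with a sample and then downdating the same sample should return exactly the original P, and the denominator test catches the degenerate case before it explodes.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4

# Start from an invertible A built from a few samples plus a small ridge
X = rng.standard_normal((10, p))
A = X.T @ X + 1e-3 * np.eye(p)
P = np.linalg.inv(A)

x = rng.standard_normal(p)

# Update: P⁺ = P - Pxxᵀ P / (1 + xᵀPx)
Px = P @ x
P_up = P - np.outer(Px, Px) / (1 + x @ Px)

# Downdate the same sample: P = P⁺ + P⁺xxᵀP⁺ / (1 - xᵀP⁺x)
Px2 = P_up @ x
denom = 1 - x @ Px2
assert denom > 1e-10, "removing x would make the matrix singular"
P_back = P_up + np.outer(Px2, Px2) / denom

print(np.allclose(P_back, P))  # expect: True
```

Here the denominator is safe by construction: for P⁺ = (A + xxᵀ)⁻¹ with A positive definite, xᵀP⁺x = s/(1+s) < 1 where s = xᵀA⁻¹x, so 1 - xᵀP⁺x is always positive when removing a sample that was actually added.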
```python
import numpy as np
from collections import deque


class SlidingWindowRLS:
    """
    RLS with a sliding window: maintains the solution over the last
    window_size observations only.

    Uses Sherman-Morrison for both adding new points (updating)
    and removing old points (downdating).
    """

    def __init__(self, p, window_size, regularization=1e-6):
        """
        Parameters
        ----------
        p : int
            Number of features
        window_size : int
            Maximum number of samples to consider
        regularization : float
            Regularization strength
        """
        self.p = p
        self.window_size = window_size
        self.regularization = regularization

        # Initialize with regularization
        self.P = np.eye(p) / regularization
        self.beta = np.zeros(p)

        # Rolling sums for the normal equations
        self.XtX = regularization * np.eye(p)
        self.Xty = np.zeros(p)

        # Store recent observations for downdating
        self.history = deque(maxlen=window_size)

    def _update(self, x, y):
        """Add a sample (Sherman-Morrison update)."""
        Px = self.P @ x
        denom = 1 + x @ Px
        k = Px / denom

        error = y - x @ self.beta
        self.beta = self.beta + k * error
        self.P = self.P - np.outer(k, x @ self.P)

        # Also update running sums (for numerical stability checks)
        self.XtX += np.outer(x, x)
        self.Xty += x * y

    def _downdate(self, x, y):
        """Remove a sample (Sherman-Morrison downdate)."""
        Px = self.P @ x
        denom = 1 - x @ Px

        # Safety check: avoid division by near-zero
        if abs(denom) < 1e-10:
            raise ValueError("Downdating would make matrix singular! "
                             "Cannot remove this observation.")

        k = Px / denom

        # Reverse the update
        self.P = self.P + np.outer(k, x @ self.P)

        # Update running sums
        self.XtX -= np.outer(x, x)
        self.Xty -= x * y

        # Recompute beta from updated XtX, Xty
        # (more stable than an incremental update for downdating)
        self.beta = self.P @ self.Xty

    def update(self, x, y):
        """Add a new observation, removing the oldest if at capacity."""
        x = np.asarray(x).flatten()

        # If at capacity, remove the oldest observation first
        if len(self.history) == self.window_size:
            x_old, y_old = self.history[0]
            self._downdate(x_old, y_old)

        # Add the new observation
        self._update(x, y)
        self.history.append((x.copy(), y))

    def predict(self, X):
        return X @ self.beta


# Minimal RecursiveLeastSquares from earlier, repeated so this runs standalone
class RecursiveLeastSquares:

    def __init__(self, p, regularization=1e-6):
        self.p = p
        self.P = np.eye(p) / regularization
        self.beta = np.zeros(p)

    def update(self, x, y):
        x = np.asarray(x).flatten()
        Px = self.P @ x
        k = Px / (1 + x @ Px)
        error = y - x @ self.beta
        self.beta = self.beta + k * error
        self.P = self.P - np.outer(k, x @ self.P)
        return error


def demonstrate_sliding_window():
    """Demonstrate sliding window RLS on non-stationary data."""
    np.random.seed(42)
    p = 3
    window_size = 100
    n_total = 500

    # Time-varying coefficients (regime changes)
    def true_beta(t):
        if t < 150:
            return np.array([1, 0, 0])
        elif t < 300:
            return np.array([0, 1, 0])
        else:
            return np.array([0, 0, 1])

    # Generate data
    X = np.random.randn(n_total, p)
    y = np.array([X[t] @ true_beta(t) for t in range(n_total)])
    y += 0.1 * np.random.randn(n_total)  # Noise

    # Standard RLS (all data)
    rls_full = RecursiveLeastSquares(p)

    # Sliding window RLS
    rls_window = SlidingWindowRLS(p, window_size)

    # Track estimates
    beta_full_history = []
    beta_window_history = []

    for t in range(n_total):
        rls_full.update(X[t], y[t])
        rls_window.update(X[t], y[t])
        beta_full_history.append(rls_full.beta.copy())
        beta_window_history.append(rls_window.beta.copy())

    # Compare tracking at regime boundaries
    print("=== Sliding Window vs Full RLS ===")
    print("(Tracking time-varying coefficients)")
    print()
    for t in [100, 200, 350, 450]:
        true_b = true_beta(t)
        full_b = beta_full_history[t]
        window_b = beta_window_history[t]
        err_full = np.linalg.norm(full_b - true_b)
        err_window = np.linalg.norm(window_b - true_b)
        print(f"t={t}: True={true_b}")
        print(f"  Full RLS: {np.round(full_b, 2)}, error={err_full:.3f}")
        print(f"  Sliding:  {np.round(window_b, 2)}, error={err_window:.3f}")
        print()

    print("Observation: Sliding window RLS tracks regime changes;")
    print("Full RLS averages across all regimes (fails to track).")


demonstrate_sliding_window()
```

Sliding windows use a hard forgetting scheme: observations older than the window are completely forgotten. An alternative is exponential forgetting, where older observations are gradually down-weighted.
RLS with Forgetting Factor:
Introduce a forgetting factor λ ∈ (0, 1]. At each step:
$$A_{t+1} = \lambda A_t + x_{t+1} x_{t+1}^\top$$
Old observations decay exponentially: a sample from k steps ago has effective weight λᵏ.
Modified Update Equations:
$$k = \frac{P_t x}{\lambda + x^\top P_t x}$$ $$\beta_{t+1} = \beta_t + k(y - x^\top \beta_t)$$ $$P_{t+1} = \frac{1}{\lambda}(P_t - k x^\top P_t)$$
| λ Value | Effective Window | Behavior |
|---|---|---|
| 1.0 | Infinite (all data) | Standard RLS, no forgetting |
| 0.99 | ~100 samples | Slow adaptation |
| 0.95 | ~20 samples | Moderate adaptation |
| 0.90 | ~10 samples | Fast adaptation, more noise |
| 0.80 | ~5 samples | Very fast, high variance |
The Effective Window Size:
With forgetting factor λ, the effective window size (sum of weights on past observations) is approximately:
$$W_{eff} = \sum_{k=0}^{\infty} \lambda^k = \frac{1}{1-\lambda}$$
So λ = 0.99 gives an effective window of ~100 samples, λ = 0.95 gives ~20 samples, etc.
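The geometric-sum calculation behind the table is easy to verify with a tiny standalone check (the truncation point is an arbitrary large cutoff for the infinite sum):

```python
import numpy as np

effective = {}
for lam in [0.99, 0.95, 0.90]:
    # Weight of a sample k steps old is λᵏ; summing the geometric
    # series gives the effective window 1/(1-λ)
    k = np.arange(200_000)              # truncate the infinite sum far out
    effective[lam] = np.sum(lam ** k)
    print(f"λ={lam}: Σλᵏ ≈ {effective[lam]:.1f}, 1/(1-λ) = {1/(1-lam):.1f}")
```

Running this prints ~100, ~20, and ~10 for the three factors, matching the effective-window column in the table above.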
Trade-offs:
- λ close to 1: low-variance estimates, but slow adaptation to change.
- Smaller λ: fast adaptation, but noisier estimates (fewer effective samples).

The optimal λ depends on how quickly the underlying relationship changes.
```python
import numpy as np


class RLSWithForgetting:
    """
    RLS with exponential forgetting factor.

    Older observations are exponentially down-weighted, allowing
    the model to adapt to time-varying relationships.
    """

    def __init__(self, p, forgetting_factor=0.99, regularization=1e-6):
        """
        Parameters
        ----------
        p : int
            Number of features
        forgetting_factor : float
            λ ∈ (0, 1]. 1 = no forgetting; smaller = faster adaptation
        regularization : float
            Initial regularization
        """
        self.p = p
        self.lambda_ = forgetting_factor
        self.P = np.eye(p) / regularization
        self.beta = np.zeros(p)

    def update(self, x, y):
        """Update with a new observation."""
        x = np.asarray(x).flatten()

        # Modified gain with forgetting factor
        Px = self.P @ x
        denom = self.lambda_ + x @ Px
        k = Px / denom

        # Coefficient update
        error = y - x @ self.beta
        self.beta = self.beta + k * error

        # Covariance update with forgetting
        # P = (1/λ) * (P - k @ x^T @ P)
        self.P = (self.P - np.outer(k, x @ self.P)) / self.lambda_

        return error


def compare_forgetting_factors():
    """Compare RLS with different forgetting factors on drifting data."""
    np.random.seed(42)
    p = 2
    n = 500

    # Linearly drifting coefficients
    t = np.arange(n)
    beta_true = np.column_stack([
        1 + 0.005 * t,   # Slowly increasing
        2 - 0.003 * t,   # Slowly decreasing
    ])

    # Generate data
    X = np.random.randn(n, p)
    y = np.array([X[i] @ beta_true[i] for i in range(n)])
    y += 0.2 * np.random.randn(n)

    # Different forgetting factors
    lambdas = [1.0, 0.99, 0.95]
    models = {f"λ={l}": RLSWithForgetting(p, l) for l in lambdas}

    # Track estimates and errors
    tracking_errors = {name: [] for name in models}

    for i in range(n):
        for name, model in models.items():
            model.update(X[i], y[i])
            error = np.linalg.norm(model.beta - beta_true[i])
            tracking_errors[name].append(error)

    # Print results at key timepoints
    print("=== Tracking Drifting Coefficients ===")
    print(f"True β drifts from [1, 2] to "
          f"[{beta_true[-1, 0]:.2f}, {beta_true[-1, 1]:.2f}]")
    print()
    for tp in [100, 250, 400]:
        print(f"t={tp}: True β = {np.round(beta_true[tp], 2)}")
        for name in models:
            # Average tracking error in a window around this timepoint
            avg_error = np.mean(tracking_errors[name][max(0, tp - 10):tp + 10])
            print(f"  {name}: avg error = {avg_error:.4f}")
        print()

    # Summary
    print("Average tracking error over all time:")
    for name, errors in tracking_errors.items():
        print(f"  {name}: {np.mean(errors):.4f}")


compare_forgetting_factors()
```

| Method | Cost per Update | Stationarity Assumption | Use Case |
|---|---|---|---|
| Standard RLS | O(p²) | Constant coefficients | Stable environments |
| RLS + Forgetting | O(p²) | Slow drift | Gradual concept drift |
| Sliding Window RLS | O(p²) | Local stationarity | Regime changes |
| Woodbury Batch | O(p²k + k³) | Constant coefficients | Periodic batch arrivals |
What's Next:
The final page of this module covers Weighted Least Squares—regression when observations have different reliabilities or when we want to emphasize certain samples. We'll see how weights modify the normal equations and algorithms, and explore connections to heteroscedasticity correction.
You now understand how to update linear regression models incrementally as data arrives. These techniques are essential for streaming applications, real-time systems, and any scenario where recomputing from scratch is prohibitively expensive.