Everything we've covered so far assumes batch learning: you have all your data upfront, you fit the model once, and you're done. But real-world systems often don't work this way:
- data arrives continuously as a stream (sensors, markets, user interactions),
- memory may be too limited to store the full history,
- the underlying relationship may drift over time.
For these scenarios, we need online (incremental) algorithms that update the model efficiently as each new data point arrives—without reprocessing all historical data.
By the end of this page, you will: (1) understand recursive least squares (RLS) and its derivation, (2) master the Sherman-Morrison-Woodbury formulas for matrix inverse updates, (3) implement efficient O(p²) incremental updates, and (4) handle data removal (downdating) for sliding-window regression.
The Batch Setting (Review):
Given n observations (X, y) where X is n × p, the OLS solution is:
$$\beta = (X^\top X)^{-1} X^\top y$$
Computing this from scratch costs O(np²) for forming XᵀX and O(p³) for the solve—acceptable once, but prohibitive if repeated for every new data point.
The Online Setting:
Data arrives sequentially: (x₁, y₁), (x₂, y₂), ..., (xₜ, yₜ), ...
After receiving (xₜ, yₜ), we want to update our coefficient estimate βₜ₋₁ → βₜ in O(p²) time, independent of how many samples have already been seen.
The key insight: rather than re-computing (XᵀX)⁻¹ from scratch, we update the existing inverse when a new row is added to X.
Notation:
Let's establish notation for the incremental setting. After t observations, let Xₜ (t × p) and yₜ collect the data so far, and define:
$$A_t = X_t^\top X_t = \sum_{i=1}^{t} x_i x_i^\top, \qquad b_t = X_t^\top y_t = \sum_{i=1}^{t} x_i y_i$$
With Pₜ = Aₜ⁻¹, the OLS solution at time t is βₜ = Pₜ bₜ.
When a new observation (xₜ₊₁, yₜ₊₁) arrives:
$$A_{t+1} = A_t + x_{t+1} x_{t+1}^\top \quad \text{(rank-1 update)}$$ $$b_{t+1} = b_t + x_{t+1} y_{t+1}$$
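These running sums are easy to verify directly. The standalone sketch below (not part of the page's own code; variable names are illustrative) accumulates the rank-1 terms one sample at a time and checks that they reproduce the batch quantities XᵀX and Xᵀy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

A = np.zeros((p, p))
b = np.zeros(p)
for t in range(n):
    A += np.outer(X[t], X[t])   # A_{t+1} = A_t + x xᵀ
    b += X[t] * y[t]            # b_{t+1} = b_t + x y

print(np.allclose(A, X.T @ X), np.allclose(b, X.T @ y))  # expect: True True
```

Each step touches only the new sample, which is exactly why the online setting never needs to revisit historical rows of X.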
Given Pₜ = Aₜ⁻¹ and the rank-1 update Aₜ₊₁ = Aₜ + xxᵀ, can we compute Pₜ₊₁ = Aₜ₊₁⁻¹ efficiently without inverting from scratch? The answer is yes—the Sherman-Morrison formula gives an O(p²) update.
The Sherman-Morrison formula is one of the most important results in computational linear algebra. It tells us how to update a matrix inverse when the matrix undergoes a rank-1 perturbation.
Theorem (Sherman-Morrison):
If A is invertible and A + uvᵀ is also invertible, then:
$$\boxed{(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1 + v^\top A^{-1}u}}$$
where u, v ∈ ℝᵖ are vectors.
Proof:
Let B = A + uvᵀ and suppose B⁻¹ = A⁻¹ + C for some correction matrix C.
We need B · B⁻¹ = I:
$$(A + uv^\top)(A^{-1} + C) = I$$ $$I + AC + uv^\top A^{-1} + uv^\top C = I$$ $$AC + uv^\top A^{-1} + uv^\top C = 0$$
Factor out u: $$AC = -u(v^\top A^{-1} + v^\top C)$$
Since the correction comes from a rank-1 perturbation, it should itself be rank-1, so try the ansatz C = -αA⁻¹uvᵀA⁻¹ for some scalar α. Substituting:
$$A(-\alpha A^{-1}uv^\top A^{-1}) = -u\big(v^\top A^{-1} + v^\top(-\alpha A^{-1}uv^\top A^{-1})\big)$$ $$-\alpha uv^\top A^{-1} = -u\big(v^\top A^{-1} - \alpha (v^\top A^{-1}u)\,v^\top A^{-1}\big)$$ $$-\alpha uv^\top A^{-1} = -uv^\top A^{-1}\big(1 - \alpha v^\top A^{-1}u\big)$$
Comparing coefficients: α = 1 - αvᵀA⁻¹u, so α(1 + vᵀA⁻¹u) = 1, giving α = 1/(1 + vᵀA⁻¹u).
Substituting back gives the Sherman-Morrison formula. ∎
Sherman-Morrison requires: (1) one matrix-vector product A⁻¹u: O(p²), (2) one dot product vᵀ(A⁻¹u): O(p), (3) one outer product and matrix subtraction: O(p²). Total: O(p²), regardless of n!
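The formula is easy to sanity-check numerically. The standalone sketch below (not from the page's code; names are illustrative) uses the symmetric case u = v = x, where the denominator 1 + xᵀA⁻¹x is guaranteed positive for a positive definite A, and compares the O(p²) correction against a direct inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6

# Random symmetric positive definite A, so A is invertible and
# 1 + xᵀA⁻¹x > 0 for any x
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)
x = rng.standard_normal(p)

A_inv = np.linalg.inv(A)

# Sherman-Morrison with u = v = x: one O(p²) rank-1 correction
Ax = A_inv @ x                                 # O(p²) matrix-vector product
denom = 1 + x @ Ax                             # O(p) dot product
B_inv_sm = A_inv - np.outer(Ax, Ax) / denom    # O(p²) outer product

# Direct O(p³) inverse for comparison
B_inv_direct = np.linalg.inv(A + np.outer(x, x))

print(np.allclose(B_inv_sm, B_inv_direct))  # expect: True
```

Note that because A is symmetric here, A⁻¹x xᵀA⁻¹ is simply the outer product of Ax with itself.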
For Least Squares:
When a new sample (x, y) arrives:
$$P_{t+1} = (A_t + xx^\top)^{-1} = P_t - \frac{P_t xx^\top P_t}{1 + x^\top P_t x}$$
This follows from Sherman-Morrison with u = v = x, since our rank-1 update is symmetric.
The Gain Vector:
Define the gain vector (or Kalman gain in filtering literature):
$$k_t = \frac{P_t x}{1 + x^\top P_t x}$$
Then the update becomes:
$$P_{t+1} = P_t - k_t x^\top P_t = (I - k_t x^\top) P_t$$
Recursive Least Squares combines the Sherman-Morrison inverse update with an elegant formula for updating the coefficient vector directly.
The RLS Algorithm:
Given current estimates Pₜ (inverse covariance) and βₜ (coefficients), when new observation (x, y) arrives:
1. Compute prediction error: $$e = y - x^\top \beta_t \quad \text{(innovation/residual)}$$
2. Compute gain vector: $$k = \frac{P_t x}{1 + x^\top P_t x}$$
3. Update coefficients: $$\beta_{t+1} = \beta_t + k \cdot e$$
4. Update inverse covariance: $$P_{t+1} = P_t - k x^\top P_t$$
Derivation of the Coefficient Update:
We need to show that βₜ₊₁ = βₜ + k·e gives the correct least squares solution.
The batch solution at time t+1: $$\beta_{t+1} = P_{t+1} b_{t+1} = P_{t+1}(b_t + xy)$$
Using Pₜ₊₁ = Pₜ - kxᵀPₜ: $$\beta_{t+1} = (P_t - kx^\top P_t)(b_t + xy)$$ $$= P_t b_t + P_t xy - kx^\top P_t b_t - kx^\top P_t xy$$ $$= \beta_t + P_t xy - k(x^\top \beta_t) - k(x^\top P_t x)y$$
Since k(1 + xᵀPₜx) = Pₜx by the definition of the gain, the second and fourth terms combine: $$P_t xy - k(x^\top P_t x)y = \big(P_t x - k\,x^\top P_t x\big)y = k y$$
After simplification (using the definition of k): $$\beta_{t+1} = \beta_t + k(y - x^\top \beta_t) = \beta_t + k \cdot e \quad \checkmark$$
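The derivation can be checked numerically at a single step (a standalone sketch, not the page's implementation): starting from the exact batch solution on t samples, one RLS step must reproduce the batch solution on t + 1 samples.

```python
import numpy as np

rng = np.random.default_rng(3)
t, p = 30, 4
X = rng.standard_normal((t + 1, p))
y = rng.standard_normal(t + 1)

# Batch quantities at time t (a tiny ridge keeps A_t safely invertible)
ridge = 1e-9
A_t = X[:t].T @ X[:t] + ridge * np.eye(p)
P_t = np.linalg.inv(A_t)
beta_t = P_t @ (X[:t].T @ y[:t])

# One RLS step with the new sample (x, y_new)
x, y_new = X[t], y[t]
k = P_t @ x / (1 + x @ P_t @ x)
beta_rls = beta_t + k * (y_new - x @ beta_t)

# Batch solution at time t+1
A_t1 = X.T @ X + ridge * np.eye(p)
beta_batch = np.linalg.inv(A_t1) @ (X.T @ y)

print(np.allclose(beta_rls, beta_batch))  # expect: True
```

The agreement is exact up to floating-point error, confirming that RLS is not an approximation: it is batch OLS computed recursively.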
```python
import numpy as np


class RecursiveLeastSquares:
    """
    Recursive Least Squares estimator for online linear regression.

    Maintains O(p²) state and performs O(p²) updates per sample.
    Mathematically equivalent to batch OLS at each step.
    """

    def __init__(self, p, regularization=1e-6):
        """
        Initialize RLS estimator.

        Parameters
        ----------
        p : int
            Number of features
        regularization : float
            Initial regularization (P₀ = (1/regularization) * I).
            Smaller values = faster initial learning, less stable.
        """
        self.p = p
        # Initialize P = (regularization * I)^(-1) = (1/regularization) * I
        # Large initial P = high initial uncertainty = fast adaptation
        self.P = np.eye(p) / regularization
        # Initialize coefficients to zero
        self.beta = np.zeros(p)
        # Track number of samples seen
        self.n_samples = 0

    def update(self, x, y):
        """
        Update the model with a single new observation.

        Parameters
        ----------
        x : ndarray of shape (p,)
            Feature vector
        y : float
            Target value

        Returns
        -------
        prediction_error : float
            Error before updating (innovation)
        """
        x = np.asarray(x).flatten()
        assert len(x) == self.p, f"Expected {self.p} features, got {len(x)}"

        # Step 1: Compute prediction error (before update)
        prediction = x @ self.beta
        error = y - prediction

        # Step 2: Compute gain vector via Sherman-Morrison
        Px = self.P @ x               # O(p²)
        denominator = 1 + x @ Px      # O(p)
        k = Px / denominator          # Gain vector

        # Step 3: Update coefficients
        self.beta = self.beta + k * error  # O(p)

        # Step 4: Update inverse covariance matrix
        # P = P - k * x^T * P = (I - k * x^T) * P
        self.P = self.P - np.outer(k, x @ self.P)  # O(p²)

        self.n_samples += 1
        return error

    def fit_batch(self, X, y):
        """
        Process a batch of samples online (sequentially).

        Parameters
        ----------
        X : ndarray of shape (n, p)
            Design matrix
        y : ndarray of shape (n,)
            Target vector

        Returns
        -------
        errors : list
            Prediction errors for each sample
        """
        errors = []
        for i in range(len(y)):
            error = self.update(X[i], y[i])
            errors.append(error)
        return errors

    def predict(self, X):
        """Make predictions."""
        return X @ self.beta

    @property
    def coefficients(self):
        return self.beta.copy()


def demonstrate_rls():
    """Demonstrate RLS and compare with batch OLS."""
    np.random.seed(42)

    # Generate data
    n, p = 1000, 5
    X = np.random.randn(n, p)
    beta_true = np.array([1, -2, 3, -4, 5])
    y = X @ beta_true + 0.5 * np.random.randn(n)

    # RLS online learning
    rls = RecursiveLeastSquares(p, regularization=1e-3)
    errors = rls.fit_batch(X, y)

    # Batch OLS for comparison
    beta_batch = np.linalg.lstsq(X, y, rcond=None)[0]

    print("=== RLS vs Batch OLS Comparison ===")
    print(f"True coefficients:  {beta_true}")
    print(f"RLS estimate:       {np.round(rls.beta, 4)}")
    print(f"Batch OLS estimate: {np.round(beta_batch, 4)}")
    print(f"RLS error:   {np.linalg.norm(rls.beta - beta_true):.6f}")
    print(f"Batch error: {np.linalg.norm(beta_batch - beta_true):.6f}")

    # Show convergence behavior
    print(f"RLS converged after {n} samples")
    print(f"Initial MSE (first 10):  {np.mean(np.array(errors[:10])**2):.4f}")
    print(f"Midpoint MSE (500-510):  {np.mean(np.array(errors[500:510])**2):.4f}")
    print(f"Final MSE (last 10):     {np.mean(np.array(errors[-10:])**2):.4f}")


demonstrate_rls()
```

RLS is a special case of the Kalman filter for a static system (constant true coefficients): the matrix P plays the role of the error covariance, and k is the Kalman gain. When coefficients are time-varying, full Kalman filtering adds process noise to track changes.
Sherman-Morrison handles rank-1 updates. What if we receive a batch of k new samples at once? Applying Sherman-Morrison k times costs O(kp²). We can do better.
Theorem (Woodbury Matrix Identity):
The Woodbury formula generalizes Sherman-Morrison to rank-k updates:
$$(A + UCV^\top)^{-1} = A^{-1} - A^{-1}U(C^{-1} + V^\top A^{-1}U)^{-1}V^\top A^{-1}$$
where:
- U ∈ ℝᵖˣᵏ and V ∈ ℝᵖˣᵏ are the update factors,
- C ∈ ℝᵏˣᵏ is invertible (often C = Iₖ),
- the inner inverse (C⁻¹ + VᵀA⁻¹U)⁻¹ is only k × k.
For Least Squares:
When a batch of k new samples X_new (k × p) and y_new (k,) arrives:
$$A_{new} = A + X_{new}^\top X_{new}$$
This is a rank-k update with U = V = X_newᵀ ∈ ℝᵖˣᵏ and C = Iₖ:
$$P_{new} = P - P X_{new}^\top (I_k + X_{new} P X_{new}^\top)^{-1} X_{new} P$$
Cost:
- forming PX_newᵀ and the final products: O(p²k),
- inverting (solving with) the k × k matrix: O(k³).

Total: O(p²k + k³).
When k << p, this is much better than O(p³) for re-inverting from scratch.
```python
import numpy as np


def woodbury_update(P, X_new, y_new, beta):
    """
    Batch update using the Woodbury formula.

    More efficient than repeated Sherman-Morrison when k > 1.

    Parameters
    ----------
    P : ndarray of shape (p, p)
        Current inverse covariance matrix
    X_new : ndarray of shape (k, p)
        New feature batch
    y_new : ndarray of shape (k,)
        New target batch
    beta : ndarray of shape (p,)
        Current coefficient estimate

    Returns
    -------
    P_new : ndarray of shape (p, p)
        Updated inverse covariance
    beta_new : ndarray of shape (p,)
        Updated coefficients
    """
    k, p = X_new.shape

    # Woodbury: P_new = P - P @ X^T @ (I + X @ P @ X^T)^{-1} @ X @ P
    PXt = P @ X_new.T          # (p, k)
    XPXt = X_new @ PXt         # (k, k)

    # Solve (I + X P X^T) @ M = I for M, equivalent to inverting
    IplusXPXt = np.eye(k) + XPXt
    M = np.linalg.solve(IplusXPXt, np.eye(k))  # (k, k)

    # Update P
    P_new = P - PXt @ M @ X_new @ P

    # Update beta. Direct route:
    #   b_new = b + X_new^T @ y_new;  beta_new = P_new @ b_new
    # More efficient: use the innovation form
    #   beta_new = beta + P_new @ X_new^T @ (y_new - X_new @ beta)
    innovation = y_new - X_new @ beta
    beta_new = beta + P_new @ X_new.T @ innovation

    return P_new, beta_new


def compare_update_methods():
    """Compare Sherman-Morrison (one at a time) vs Woodbury (batch)."""
    np.random.seed(42)
    p = 100
    n_existing = 500
    k_new = 50  # Batch size

    # Initial data
    X = np.random.randn(n_existing, p)
    beta_true = np.random.randn(p)
    y = X @ beta_true + 0.1 * np.random.randn(n_existing)

    # Compute initial solution
    XtX = X.T @ X + 1e-6 * np.eye(p)  # Small regularization
    Xty = X.T @ y
    P = np.linalg.inv(XtX)
    beta = P @ Xty

    # New batch
    X_new = np.random.randn(k_new, p)
    y_new = X_new @ beta_true + 0.1 * np.random.randn(k_new)

    # Method 1: Sherman-Morrison (sequential)
    import time
    P_sm = P.copy()
    beta_sm = beta.copy()
    start = time.time()
    for i in range(k_new):
        x = X_new[i]
        yi = y_new[i]
        Px = P_sm @ x
        k_gain = Px / (1 + x @ Px)
        error = yi - x @ beta_sm
        beta_sm = beta_sm + k_gain * error
        P_sm = P_sm - np.outer(k_gain, x @ P_sm)
    time_sm = time.time() - start

    # Method 2: Woodbury (batch)
    start = time.time()
    P_wb, beta_wb = woodbury_update(P, X_new, y_new, beta)
    time_wb = time.time() - start

    # Method 3: Recompute from scratch
    X_all = np.vstack([X, X_new])
    y_all = np.concatenate([y, y_new])
    start = time.time()
    beta_full = np.linalg.lstsq(X_all, y_all, rcond=None)[0]
    time_full = time.time() - start

    print("=== Update Method Comparison ===")
    print(f"Batch size k = {k_new}, features p = {p}")
    print("Timing:")
    print(f"  Sherman-Morrison (sequential): {time_sm*1000:.2f} ms")
    print(f"  Woodbury (batch):              {time_wb*1000:.2f} ms")
    print(f"  Full recompute:                {time_full*1000:.2f} ms")
    print("Accuracy (||β - β_full||):")
    print(f"  Sherman-Morrison: {np.linalg.norm(beta_sm - beta_full):.2e}")
    print(f"  Woodbury:         {np.linalg.norm(beta_wb - beta_full):.2e}")


compare_update_methods()
```

Sometimes we need to remove observations rather than add them. This is called downdating and arises in:
- sliding-window regression (drop the oldest observation as new ones arrive),
- leave-one-out cross-validation,
- removing observations later identified as outliers or corrupted.
The Downdating Formula:
To remove a sample (x, y), we reverse the Sherman-Morrison update:
$$A_{new} = A - xx^\top$$
$$P_{new} = P + \frac{Pxx^\top P}{1 - x^\top P x}$$
Note the sign changes: minus becomes plus, and the denominator uses 1 - xᵀPx instead of 1 + xᵀPx.
Downdating is numerically more treacherous than updating. If 1 - xᵀPx ≈ 0, the update explodes. This happens when x contributes most of the information about some direction—removing it makes that direction unidentifiable. Always check the denominator before downdating!
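As a quick sanity check (a standalone sketch, not part of the class below): updating with a sample and then downdating the same sample should return exactly the original P, and the denominator test catches the degenerate case before it explodes.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4

# Start from an invertible A built from a few samples plus a small ridge
X = rng.standard_normal((10, p))
A = X.T @ X + 1e-3 * np.eye(p)
P = np.linalg.inv(A)

x = rng.standard_normal(p)

# Update: P⁺ = P - Pxxᵀ P / (1 + xᵀPx)
Px = P @ x
P_up = P - np.outer(Px, Px) / (1 + x @ Px)

# Downdate the same sample: P = P⁺ + P⁺xxᵀP⁺ / (1 - xᵀP⁺x)
Px2 = P_up @ x
denom = 1 - x @ Px2
assert denom > 1e-10, "removing x would make the matrix singular"
P_back = P_up + np.outer(Px2, Px2) / denom

print(np.allclose(P_back, P))  # expect: True
```

Here the denominator is safe by construction: for P⁺ = (A + xxᵀ)⁻¹ with A positive definite, xᵀP⁺x = s/(1+s) < 1 where s = xᵀA⁻¹x, so 1 - xᵀP⁺x is always positive when removing a sample that was actually added.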
```python
import numpy as np
from collections import deque


class SlidingWindowRLS:
    """
    RLS with a sliding window: maintains the solution over the last
    window_size observations only.

    Uses Sherman-Morrison for both adding new points (updating)
    and removing old points (downdating).
    """

    def __init__(self, p, window_size, regularization=1e-6):
        """
        Parameters
        ----------
        p : int
            Number of features
        window_size : int
            Maximum number of samples to consider
        regularization : float
            Regularization strength
        """
        self.p = p
        self.window_size = window_size
        self.regularization = regularization

        # Initialize with regularization
        self.P = np.eye(p) / regularization
        self.beta = np.zeros(p)

        # Rolling sums for the normal equations
        self.XtX = regularization * np.eye(p)
        self.Xty = np.zeros(p)

        # Store recent observations for downdating
        self.history = deque(maxlen=window_size)

    def _update(self, x, y):
        """Add a sample (Sherman-Morrison update)."""
        Px = self.P @ x
        denom = 1 + x @ Px
        k = Px / denom

        error = y - x @ self.beta
        self.beta = self.beta + k * error
        self.P = self.P - np.outer(k, x @ self.P)

        # Also update running sums (for numerical stability checks)
        self.XtX += np.outer(x, x)
        self.Xty += x * y

    def _downdate(self, x, y):
        """Remove a sample (Sherman-Morrison downdate)."""
        Px = self.P @ x
        denom = 1 - x @ Px

        # Safety check: avoid division by near-zero
        if abs(denom) < 1e-10:
            raise ValueError("Downdating would make matrix singular! "
                             "Cannot remove this observation.")

        k = Px / denom

        # Reverse the update
        self.P = self.P + np.outer(k, x @ self.P)

        # Update running sums
        self.XtX -= np.outer(x, x)
        self.Xty -= x * y

        # Recompute beta from updated XtX, Xty
        # (more stable than an incremental update for downdating)
        self.beta = self.P @ self.Xty

    def update(self, x, y):
        """Add a new observation, removing the oldest if at capacity."""
        x = np.asarray(x).flatten()

        # If at capacity, remove the oldest observation first
        if len(self.history) == self.window_size:
            x_old, y_old = self.history[0]
            self._downdate(x_old, y_old)

        # Add the new observation
        self._update(x, y)
        self.history.append((x.copy(), y))

    def predict(self, X):
        return X @ self.beta


# Minimal RecursiveLeastSquares from earlier, repeated so this runs standalone
class RecursiveLeastSquares:

    def __init__(self, p, regularization=1e-6):
        self.p = p
        self.P = np.eye(p) / regularization
        self.beta = np.zeros(p)

    def update(self, x, y):
        x = np.asarray(x).flatten()
        Px = self.P @ x
        k = Px / (1 + x @ Px)
        error = y - x @ self.beta
        self.beta = self.beta + k * error
        self.P = self.P - np.outer(k, x @ self.P)
        return error


def demonstrate_sliding_window():
    """Demonstrate sliding window RLS on non-stationary data."""
    np.random.seed(42)
    p = 3
    window_size = 100
    n_total = 500

    # Time-varying coefficients (regime changes)
    def true_beta(t):
        if t < 150:
            return np.array([1, 0, 0])
        elif t < 300:
            return np.array([0, 1, 0])
        else:
            return np.array([0, 0, 1])

    # Generate data
    X = np.random.randn(n_total, p)
    y = np.array([X[t] @ true_beta(t) for t in range(n_total)])
    y += 0.1 * np.random.randn(n_total)  # Noise

    # Standard RLS (all data)
    rls_full = RecursiveLeastSquares(p)

    # Sliding window RLS
    rls_window = SlidingWindowRLS(p, window_size)

    # Track estimates
    beta_full_history = []
    beta_window_history = []

    for t in range(n_total):
        rls_full.update(X[t], y[t])
        rls_window.update(X[t], y[t])
        beta_full_history.append(rls_full.beta.copy())
        beta_window_history.append(rls_window.beta.copy())

    # Compare tracking at regime boundaries
    print("=== Sliding Window vs Full RLS ===")
    print("(Tracking time-varying coefficients)")
    print()
    for t in [100, 200, 350, 450]:
        true_b = true_beta(t)
        full_b = beta_full_history[t]
        window_b = beta_window_history[t]
        err_full = np.linalg.norm(full_b - true_b)
        err_window = np.linalg.norm(window_b - true_b)
        print(f"t={t}: True={true_b}")
        print(f"  Full RLS: {np.round(full_b, 2)}, error={err_full:.3f}")
        print(f"  Sliding:  {np.round(window_b, 2)}, error={err_window:.3f}")
        print()

    print("Observation: Sliding window RLS tracks regime changes;")
    print("Full RLS averages across all regimes (fails to track).")


demonstrate_sliding_window()
```

Sliding windows use a hard forgetting scheme: observations older than the window are completely forgotten. An alternative is exponential forgetting, where older observations are gradually down-weighted.
RLS with Forgetting Factor:
Introduce a forgetting factor λ ∈ (0, 1]. At each step:
$$A_{t+1} = \lambda A_t + x_{t+1} x_{t+1}^\top$$
Old observations decay exponentially: a sample from k steps ago has effective weight λᵏ.
Modified Update Equations:
$$k = \frac{P_t x}{\lambda + x^\top P_t x}$$ $$\beta_{t+1} = \beta_t + k(y - x^\top \beta_t)$$ $$P_{t+1} = \frac{1}{\lambda}(P_t - k x^\top P_t)$$
| λ Value | Effective Window | Behavior |
|---|---|---|
| 1.0 | Infinite (all data) | Standard RLS, no forgetting |
| 0.99 | ~100 samples | Slow adaptation |
| 0.95 | ~20 samples | Moderate adaptation |
| 0.90 | ~10 samples | Fast adaptation, more noise |
| 0.80 | ~5 samples | Very fast, high variance |
The Effective Window Size:
With forgetting factor λ, the effective window size (sum of weights on past observations) is approximately:
$$W_{eff} = \sum_{k=0}^{\infty} \lambda^k = \frac{1}{1-\lambda}$$
So λ = 0.99 gives an effective window of ~100 samples, λ = 0.95 gives ~20 samples, etc.
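The geometric-sum calculation behind the table is easy to verify with a tiny standalone check (the truncation point is an arbitrary large cutoff for the infinite sum):

```python
import numpy as np

effective = {}
for lam in [0.99, 0.95, 0.90]:
    # Weight of a sample k steps old is λᵏ; summing the geometric
    # series gives the effective window 1/(1-λ)
    k = np.arange(200_000)              # truncate the infinite sum far out
    effective[lam] = np.sum(lam ** k)
    print(f"λ={lam}: Σλᵏ ≈ {effective[lam]:.1f}, 1/(1-λ) = {1/(1-lam):.1f}")
```

Running this prints ~100, ~20, and ~10 for the three factors, matching the effective-window column in the table above.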
Trade-offs:
- λ close to 1: low-variance estimates, but slow adaptation to change.
- Smaller λ: fast adaptation, but noisier estimates (fewer effective samples).

The optimal λ depends on how quickly the underlying relationship changes.
```python
import numpy as np


class RLSWithForgetting:
    """
    RLS with exponential forgetting factor.

    Older observations are exponentially down-weighted, allowing
    the model to adapt to time-varying relationships.
    """

    def __init__(self, p, forgetting_factor=0.99, regularization=1e-6):
        """
        Parameters
        ----------
        p : int
            Number of features
        forgetting_factor : float
            λ ∈ (0, 1]. 1 = no forgetting; smaller = faster adaptation
        regularization : float
            Initial regularization
        """
        self.p = p
        self.lambda_ = forgetting_factor
        self.P = np.eye(p) / regularization
        self.beta = np.zeros(p)

    def update(self, x, y):
        """Update with a new observation."""
        x = np.asarray(x).flatten()

        # Modified gain with forgetting factor
        Px = self.P @ x
        denom = self.lambda_ + x @ Px
        k = Px / denom

        # Coefficient update
        error = y - x @ self.beta
        self.beta = self.beta + k * error

        # Covariance update with forgetting
        # P = (1/λ) * (P - k @ x^T @ P)
        self.P = (self.P - np.outer(k, x @ self.P)) / self.lambda_

        return error


def compare_forgetting_factors():
    """Compare RLS with different forgetting factors on drifting data."""
    np.random.seed(42)
    p = 2
    n = 500

    # Linearly drifting coefficients
    t = np.arange(n)
    beta_true = np.column_stack([
        1 + 0.005 * t,   # Slowly increasing
        2 - 0.003 * t,   # Slowly decreasing
    ])

    # Generate data
    X = np.random.randn(n, p)
    y = np.array([X[i] @ beta_true[i] for i in range(n)])
    y += 0.2 * np.random.randn(n)

    # Different forgetting factors
    lambdas = [1.0, 0.99, 0.95]
    models = {f"λ={l}": RLSWithForgetting(p, l) for l in lambdas}

    # Track estimates and errors
    tracking_errors = {name: [] for name in models}

    for i in range(n):
        for name, model in models.items():
            model.update(X[i], y[i])
            error = np.linalg.norm(model.beta - beta_true[i])
            tracking_errors[name].append(error)

    # Print results at key timepoints
    print("=== Tracking Drifting Coefficients ===")
    print(f"True β drifts from [1, 2] to "
          f"[{beta_true[-1, 0]:.2f}, {beta_true[-1, 1]:.2f}]")
    print()
    for tp in [100, 250, 400]:
        print(f"t={tp}: True β = {np.round(beta_true[tp], 2)}")
        for name in models:
            # Average tracking error in a window around this timepoint
            avg_error = np.mean(tracking_errors[name][max(0, tp - 10):tp + 10])
            print(f"  {name}: avg error = {avg_error:.4f}")
        print()

    # Summary
    print("Average tracking error over all time:")
    for name, errors in tracking_errors.items():
        print(f"  {name}: {np.mean(errors):.4f}")


compare_forgetting_factors()
```

| Method | Cost per Update | Stationarity Assumption | Use Case |
|---|---|---|---|
| Standard RLS | O(p²) | Constant coefficients | Stable environments |
| RLS + Forgetting | O(p²) | Slow drift | Gradual concept drift |
| Sliding Window RLS | O(p²) | Local stationarity | Regime changes |
| Woodbury Batch | O(p²k + k³) | Constant coefficients | Periodic batch arrivals |
What's Next:
The final page of this module covers Weighted Least Squares—regression when observations have different reliabilities or when we want to emphasize certain samples. We'll see how weights modify the normal equations and algorithms, and explore connections to heteroscedasticity correction.
You now understand how to update linear regression models incrementally as data arrives. These techniques are essential for streaming applications, real-time systems, and any scenario where recomputing from scratch is prohibitively expensive.