We've established that Ridge regression, for some $\lambda > 0$, provides better MSE than OLS. But the theoretical optimal λ depends on unknown quantities—the true coefficients and noise variance. This creates a practical challenge: how do we select λ using only the observed data?
This is a fundamental problem in machine learning: selecting hyperparameters that control model complexity. The regularization parameter λ is not learned from data like the coefficients β; it must be set through a separate procedure.
By the end of this page, you will understand and implement multiple methods for selecting λ: K-fold cross-validation, Leave-One-Out cross-validation, Generalized Cross-Validation (GCV), information criteria (AIC, BIC), and the Empirical Bayes approach. You'll know when to use each and their tradeoffs.
A naive approach (don't do this):
One might think: "Minimize training RSS over λ." But this fails catastrophically:
$$\text{RSS}(\lambda) = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_\lambda\|_2^2$$
The training RSS is minimized at $\lambda = 0$ (OLS), because regularization only adds constraints. The OLS solution fits training data best, but (as we've seen) may generalize poorly.
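To see this concretely, here is a minimal sketch (synthetic data, so all values are illustrative) that computes the closed-form Ridge solution for several λ values and prints the training RSS, which only grows with λ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # Closed-form Ridge solution: (X^T X + lam I)^{-1} X^T y  (OLS when lam = 0)
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    rss = np.sum((y - X @ beta) ** 2)
    print(f"lambda = {lam:6.1f}   training RSS = {rss:10.3f}")

# Training RSS is smallest at lambda = 0 (OLS) and increases monotonically with lambda,
# even though test error typically does not.
```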
The fundamental problem:
Training error is a biased (optimistic) estimate of test error. A model that fits the training data well may simply have memorized noise in the sample, so low training error does not guarantee low error on new data.
Never select hyperparameters by minimizing training error. This leads to overfitting. The hyperparameter that minimizes training error (λ = 0 for Ridge) is usually the worst choice for generalization. We need methods that estimate test error or penalize complexity.
What we really want to minimize:
We want to choose λ to minimize expected test error—the error on future, unseen data:
$$\text{Test Error}(\lambda) = \mathbb{E}_{(\mathbf{x}_{\text{new}},\, y_{\text{new}})} \left[ (y_{\text{new}} - \mathbf{x}_{\text{new}}^T\hat{\boldsymbol{\beta}}_\lambda)^2 \right]$$
Since we don't have access to the test distribution, we must estimate this quantity from training data. The methods below provide various approaches to this estimation.
The idea:
Cross-validation estimates test error by systematically holding out portions of the training data and measuring prediction error on the held-out portions.
K-Fold Cross-Validation procedure:
$$\text{CV}_K(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathcal{F}_k|} \sum_{i \in \mathcal{F}_k} (y_i - \hat{y}_{i}^{(-k)}(\lambda))^2$$
where $\hat{y}_{i}^{(-k)}(\lambda)$ is the prediction for observation $i$ using the model trained without fold $k$.
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge


def ridge_cv_manual(X, y, lambdas, n_folds=5, random_state=42):
    """
    Perform K-fold cross-validation for Ridge regression to select
    the optimal lambda.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix (should be standardized)
    y : ndarray of shape (n_samples,)
        Target vector
    lambdas : array-like
        Candidate lambda values to evaluate
    n_folds : int
        Number of cross-validation folds
    random_state : int
        Random seed for reproducible fold splits

    Returns:
    --------
    best_lambda : float
        Lambda that minimizes CV error
    cv_scores : ndarray
        CV error for each lambda
    cv_std : ndarray
        Standard error of CV estimate for each lambda
    """
    n_lambdas = len(lambdas)
    cv_scores = np.zeros(n_lambdas)
    cv_std = np.zeros(n_lambdas)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)

    for i, lam in enumerate(lambdas):
        fold_scores = []

        for train_idx, val_idx in kf.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Fit Ridge on training fold
            model = Ridge(alpha=lam, fit_intercept=False)
            model.fit(X_train, y_train)

            # Predict on validation fold
            y_pred = model.predict(X_val)

            # Compute MSE on validation fold
            mse = np.mean((y_val - y_pred)**2)
            fold_scores.append(mse)

        cv_scores[i] = np.mean(fold_scores)
        cv_std[i] = np.std(fold_scores) / np.sqrt(n_folds)

    best_idx = np.argmin(cv_scores)
    best_lambda = lambdas[best_idx]

    return best_lambda, cv_scores, cv_std


def select_lambda_one_se_rule(lambdas, cv_scores, cv_std):
    """
    Select lambda using the one-standard-error rule:
    choose the largest (most regularized) lambda whose CV error is
    within one standard error of the minimum.

    This provides additional regularization as a hedge against
    overfitting during hyperparameter selection.
    """
    best_idx = np.argmin(cv_scores)
    threshold = cv_scores[best_idx] + cv_std[best_idx]

    # Largest lambda whose CV error is within one SE of the minimum
    # (order-independent, so lambdas may be sorted either way)
    eligible = [lam for lam, score in zip(lambdas, cv_scores) if score <= threshold]
    return max(eligible) if eligible else lambdas[best_idx]
```

Choosing K:
Common choices are K = 5 or K = 10. Smaller K means fewer model fits but a more pessimistic (upward-biased) error estimate, since each fold's model trains on less data; larger K reduces this bias at greater computational cost, and K = n gives Leave-One-Out CV (next section).
The one-standard-error rule:
Instead of selecting the λ that minimizes CV error, a more robust approach is the one-standard-error rule: choose the largest (most regularized) λ whose CV error is within one standard error of the minimum.
This provides additional regularization, hedging against overfitting in hyperparameter selection.
For Ridge regression, clever linear algebra can speed up CV. Since solutions for different λ values share the same SVD of X, compute the SVD once, then rapidly evaluate many λ values. For K-fold CV, update formulas exist that avoid refitting from scratch.
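As an illustration of this reuse, here is a short sketch (function name and synthetic data are my own) that computes the whole Ridge coefficient path from one SVD. With $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, the solution is $\hat{\boldsymbol{\beta}}_\lambda = \mathbf{V}\,\mathrm{diag}\!\big(\tfrac{s_j}{s_j^2+\lambda}\big)\,\mathbf{U}^T\mathbf{y}$, so each additional λ costs only a few vector operations:

```python
import numpy as np


def ridge_path_via_svd(X, y, lambdas):
    """Ridge coefficients for many lambdas from a single SVD of X.

    beta_hat(lam) = V diag(s / (s^2 + lam)) U^T y
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # computed once
    UTy = U.T @ y
    # One row of coefficients per lambda
    return np.array([Vt.T @ ((s / (s**2 + lam)) * UTy) for lam in lambdas])


# Quick check against the direct closed-form solution (synthetic data)
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 8))
y = rng.standard_normal(40)
lambdas = np.logspace(-3, 3, 7)

path = ridge_path_via_svd(X, y, lambdas)
direct = np.array([np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
                   for lam in lambdas])
print(np.allclose(path, direct))  # True
```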
The extreme case: K = n
Leave-One-Out Cross-Validation (LOOCV) uses each observation as its own validation fold:
$$\text{LOOCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_{i}^{(-i)}(\lambda))^2$$
where $\hat{y}_{i}^{(-i)}$ is the prediction for observation $i$ from a model trained on all other observations.
Properties of LOOCV: it is nearly unbiased for test error (each model trains on $n-1$ observations), but its estimates can have high variance, and the naive implementation requires $n$ model fits per λ.
The shortcut formula:
Remarkably, for linear models like Ridge regression, LOOCV can be computed without refitting n models. Using the hat matrix $\mathbf{H}_\lambda = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$:
$$\text{LOOCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i(\lambda)}{1 - h_{ii}(\lambda)} \right)^2$$
where $\hat{y}_i(\lambda)$ is the fitted value for observation $i$ from the model trained on all $n$ observations, and $h_{ii}(\lambda)$ is the $i$-th diagonal entry (leverage) of $\mathbf{H}_\lambda$.
This is the PRESS statistic (Predicted Residual Error Sum of Squares), normalized.
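As a sanity check on this identity, the following sketch (synthetic data; no intercept, matching the formula above) compares the shortcut to brute-force leave-one-out refitting for a single λ:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 5, 2.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Shortcut: residuals and leverages from the single full-data fit
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
residuals = y - H @ y
leverage = np.diag(H)
loocv_shortcut = np.mean((residuals / (1 - leverage)) ** 2)

# Brute force: refit n times, each time leaving one observation out
errors = []
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(p),
                             X[mask].T @ y[mask])
    errors.append((y[i] - X[i] @ beta_i) ** 2)
loocv_naive = np.mean(errors)

print(np.isclose(loocv_shortcut, loocv_naive))  # True
```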
```python
import numpy as np


def ridge_loocv_efficient(X, y, lambdas):
    """
    Compute LOOCV for Ridge regression efficiently using the
    hat matrix (PRESS) formula.

    The SVD of X is computed once (O(n p^2)) and shared across all
    lambdas; each lambda then costs only O(n p), versus refitting
    n models per lambda in the naive approach.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    lambdas : array-like

    Returns:
    --------
    loocv_scores : ndarray
        LOOCV error for each lambda
    """
    n, p = X.shape
    loocv_scores = []

    # Compute SVD once (reused for all lambdas)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    for lam in lambdas:
        # Shrinkage factors for singular values
        d = s**2 / (s**2 + lam)  # diagonal of H in the SVD basis

        # Hat matrix diagonal: H = U diag(s^2/(s^2+λ)) U^T,
        # so h_ii = sum_j U_ij^2 * s_j^2 / (s_j^2 + λ)
        leverage = np.sum((U ** 2) * d, axis=1)

        # Fitted values: y_hat = H @ y = U @ diag(d) @ U^T @ y
        UTy = U.T @ y
        y_hat = U @ (d * UTy)

        # Residuals
        residuals = y - y_hat

        # LOOCV = mean of (residual / (1 - leverage))^2
        loocv = np.mean((residuals / (1 - leverage)) ** 2)
        loocv_scores.append(loocv)

    return np.array(loocv_scores)


def select_lambda_loocv(X, y, lambdas):
    """
    Select optimal lambda using LOOCV.
    """
    scores = ridge_loocv_efficient(X, y, lambdas)
    best_idx = np.argmin(scores)
    return lambdas[best_idx], scores
```

LOOCV is preferred when: (1) sample size is small (maximize training data), (2) you want a deterministic result (no random fold splits), (3) the efficient formula applies (Ridge, linear regression). For large n, K-fold CV may be preferable due to lower variance.
Motivation:
Generalized Cross-Validation (GCV), introduced by Craven and Wahba (1979), is a rotation-invariant approximation to LOOCV that replaces individual leverages with their average.
The GCV formula:
$$\text{GCV}(\lambda) = \frac{\frac{1}{n} \|\mathbf{y} - \hat{\mathbf{y}}_\lambda\|_2^2}{\left(1 - \frac{\text{df}(\lambda)}{n}\right)^2}$$
where $\text{df}(\lambda) = \text{tr}(\mathbf{H}_\lambda) = \sum_{j=1}^{p} \frac{d_j}{d_j + \lambda}$ is the effective degrees of freedom (here $d_j$ denotes the $j$-th eigenvalue of $\mathbf{X}^T\mathbf{X}$, i.e. the squared singular value of $\mathbf{X}$).
Comparison to LOOCV:
In LOOCV, each observation has its own leverage $h_{ii}$. GCV replaces all leverages with the average leverage $\bar{h} = \text{df}(\lambda)/n$:
$$\text{GCV} \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - \bar{h}} \right)^2$$
This simplification makes GCV computationally trivial and often works well in practice.
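To make the approximation concrete, here is a small sketch (synthetic data, illustrative only) that computes both scores from the same fit; when the leverages $h_{ii}$ are roughly uniform, the two values are close:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 100, 10, 5.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

U, s, _ = np.linalg.svd(X, full_matrices=False)
d = s**2 / (s**2 + lam)               # shrinkage factors
leverage = np.sum(U**2 * d, axis=1)   # exact leverages h_ii
y_hat = U @ (d * (U.T @ y))
residuals = y - y_hat

loocv = np.mean((residuals / (1 - leverage)) ** 2)        # individual leverages
gcv = np.mean((residuals / (1 - leverage.mean())) ** 2)   # average leverage = df/n
print(f"LOOCV = {loocv:.4f}, GCV = {gcv:.4f}")
```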
Properties of GCV:
Rotation invariance: GCV is unchanged by orthogonal transformations of the data. This is desirable since the "best" λ shouldn't depend on coordinate choices.
Computational efficiency: Only requires computing RSS and trace of hat matrix—both easy for Ridge.
Asymptotic optimality: Under certain conditions, GCV-selected λ achieves optimal prediction risk asymptotically.
No tuning parameters: Unlike K-fold CV, no choice of K is needed.
The effective degrees of freedom:
The denominator involves $\text{df}(\lambda) = \text{tr}(\mathbf{H}_\lambda)$, which measures the "effective number of parameters" used by the model. As λ increases, df(λ) shrinks toward 0 (heavier shrinkage, a simpler effective model); as λ decreases toward 0, df(λ) approaches $p$, matching the OLS fit. A quick numeric check of this behavior is sketched below.
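Here is that check (synthetic $\mathbf{X}$, illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 8))                 # p = 8 features
s2 = np.linalg.svd(X, compute_uv=False) ** 2     # d_j = squared singular values

for lam in [1e-6, 1.0, 10.0, 100.0, 1e6]:
    df = np.sum(s2 / (s2 + lam))
    print(f"lambda = {lam:10.3g}   df = {df:5.2f}")

# df decreases monotonically from ~8 (the OLS value, p) toward 0 as lambda grows.
```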
```python
import numpy as np


def ridge_gcv(X, y, lambdas):
    """
    Compute the Generalized Cross-Validation score for Ridge regression.

    GCV(λ) = (RSS(λ) / n) / (1 - df(λ)/n)^2

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    lambdas : array-like

    Returns:
    --------
    gcv_scores : ndarray
        GCV score for each lambda
    df_values : ndarray
        Effective degrees of freedom for each lambda
    """
    n, p = X.shape

    # SVD for efficient computation
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y

    gcv_scores = []
    df_values = []

    for lam in lambdas:
        # Shrinkage factors
        shrink = s**2 / (s**2 + lam)

        # Effective degrees of freedom
        df = np.sum(shrink)
        df_values.append(df)

        # Fitted values in the SVD basis
        y_hat = U @ (shrink * UTy)

        # RSS
        rss = np.sum((y - y_hat)**2)

        # GCV score
        gcv = (rss / n) / (1 - df / n)**2
        gcv_scores.append(gcv)

    return np.array(gcv_scores), np.array(df_values)


def select_lambda_gcv(X, y, lambdas):
    """
    Select optimal lambda using GCV.
    """
    gcv_scores, df_values = ridge_gcv(X, y, lambdas)
    best_idx = np.argmin(gcv_scores)
    return lambdas[best_idx], gcv_scores, df_values
```

Comparing the three CV-based selectors:

| Method | Bias | Variance | Computation | Tuning |
|---|---|---|---|---|
| K-Fold CV | Moderate (depends on K) | Moderate | K model fits per λ | Choose K |
| LOOCV | Low (nearly unbiased) | High | Efficient formula available | None |
| GCV | Low (approximates LOOCV) | Lower than LOOCV | Very fast | None |
Information criteria provide an alternative to cross-validation by explicitly penalizing model complexity.
The general form:
$$\text{IC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + \text{Penalty}(\text{df}(\lambda))$$
The penalty increases with the effective degrees of freedom, discouraging overly complex models.
Akaike Information Criterion (AIC):
$$\text{AIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + 2 \cdot \text{df}(\lambda)$$
AIC aims to select the model that minimizes expected prediction error. It tends to select more complex models.
Bayesian Information Criterion (BIC):
$$\text{BIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + \log(n) \cdot \text{df}(\lambda)$$
BIC has a stronger penalty for complexity ($\log(n)$ vs 2 when $n > 7$). It aims to select the "true" model if it exists, preferring simpler models.
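For example, with $n = 1000$ the BIC penalty per effective degree of freedom is $\log(1000) \approx 6.9$, more than three times AIC's fixed penalty of 2, so BIC will generally select a larger λ (fewer effective parameters).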
Corrected AIC (AICc):
When sample size is small relative to parameters, AIC tends to overfit. The corrected version:
$$\text{AICc}(\lambda) = \text{AIC}(\lambda) + \frac{2 \cdot \text{df}(\lambda)(\text{df}(\lambda) + 1)}{n - \text{df}(\lambda) - 1}$$
AICc adds an extra penalty that becomes significant when df is comparable to n.
When to use which: AIC or AICc when the goal is predictive accuracy, AICc when df(λ) is not small relative to $n$, and BIC when you prefer a simpler, more heavily regularized model.
```python
import numpy as np


def ridge_information_criteria(X, y, lambdas):
    """
    Compute AIC, AICc, and BIC for Ridge regression.

    Uses effective degrees of freedom:
    df(λ) = tr(H_λ) = Σ d_j/(d_j + λ)

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    lambdas : array-like

    Returns:
    --------
    dict with 'aic', 'aicc', 'bic' arrays and 'df' array
    """
    n, p = X.shape

    # SVD
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y

    results = {'aic': [], 'aicc': [], 'bic': [], 'df': [], 'rss': []}

    for lam in lambdas:
        # Shrinkage factors
        shrink = s**2 / (s**2 + lam)

        # Effective degrees of freedom
        df = np.sum(shrink)

        # Fitted values and RSS
        y_hat = U @ (shrink * UTy)
        rss = np.sum((y - y_hat)**2)

        # Log-likelihood term (up to additive constants)
        ll_term = n * np.log(rss / n)

        # AIC
        aic = ll_term + 2 * df

        # AICc (corrected AIC)
        if n - df - 1 > 0:
            aicc = aic + 2 * df * (df + 1) / (n - df - 1)
        else:
            aicc = np.inf  # Model too complex for sample size

        # BIC
        bic = ll_term + np.log(n) * df

        results['aic'].append(aic)
        results['aicc'].append(aicc)
        results['bic'].append(bic)
        results['df'].append(df)
        results['rss'].append(rss)

    for key in results:
        results[key] = np.array(results[key])

    return results


def select_lambda_ic(X, y, lambdas, criterion='aic'):
    """
    Select optimal lambda using an information criterion.

    Parameters:
    -----------
    criterion : str
        One of 'aic', 'aicc', 'bic'
    """
    results = ridge_information_criteria(X, y, lambdas)
    scores = results[criterion]
    best_idx = np.argmin(scores)
    return lambdas[best_idx], results
```

For most prediction tasks, GCV or K-fold CV work well. Use BIC if you specifically want a sparser model (more regularization). AIC/AICc are theoretically motivated but often give similar results to GCV. When in doubt, use 5-fold or 10-fold CV as it's well-understood and widely accepted.
The Bayesian perspective revisited:
Recall that Ridge regression corresponds to MAP estimation with a Gaussian prior: $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$, where $\lambda = \sigma^2/\tau^2$.
Instead of choosing λ by cross-validation, we can estimate it by maximizing the marginal likelihood—the probability of the observed data after integrating out the unknown coefficients.
The marginal likelihood:
Integrating over the prior on $\boldsymbol{\beta}$:
$$p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2) = \int p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}, \sigma^2) p(\boldsymbol{\beta} | \tau^2) d\boldsymbol{\beta}$$
For the Gaussian model, this integral is tractable:
$$\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2 \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n + \tau^2 \mathbf{X}\mathbf{X}^T)$$
Log marginal likelihood:
$$\log p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{C}| - \frac{1}{2}\mathbf{y}^T\mathbf{C}^{-1}\mathbf{y}$$
where $\mathbf{C} = \sigma^2 \mathbf{I} + \tau^2 \mathbf{X}\mathbf{X}^T$.
Empirical Bayes estimation:
Maximize the (log) marginal likelihood over $\sigma^2$ and $\tau^2$ (or equivalently, over $\sigma^2$ and $\lambda = \sigma^2/\tau^2$):
$$(\hat{\sigma}^2, \hat{\tau}^2) = \arg\max_{\sigma^2, \tau^2} \log p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2)$$
Then use $\hat{\lambda} = \hat{\sigma}^2/\hat{\tau}^2$ as the regularization parameter.
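Here is a minimal sketch of this idea (the function name, starting values, and optimizer choices are my own; it assumes a standardized $\mathbf{X}$, centered $\mathbf{y}$, and no intercept). It maximizes the log marginal likelihood above over $(\sigma^2, \tau^2)$ on the log scale with scipy, using the SVD of $\mathbf{X}$ to evaluate $\log|\mathbf{C}|$ and $\mathbf{y}^T\mathbf{C}^{-1}\mathbf{y}$ cheaply:

```python
import numpy as np
from scipy.optimize import minimize


def ridge_empirical_bayes(X, y):
    """
    Estimate (sigma^2, tau^2) by maximizing the Gaussian marginal likelihood
    y ~ N(0, sigma^2 I + tau^2 X X^T), then return lambda = sigma^2 / tau^2.

    Assumes X is standardized, y is centered, and there is no intercept.
    """
    n, p = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y
    yTy = y @ y
    r = len(s)  # rank of the thin SVD

    def neg_log_marginal(params):
        # Optimize on the log scale so both variances stay positive
        sigma2, tau2 = np.exp(params)
        ev = sigma2 + tau2 * s**2  # eigenvalues of C within the column space of U
        logdet = np.sum(np.log(ev)) + (n - r) * np.log(sigma2)
        quad = np.sum(UTy**2 / ev) + (yTy - np.sum(UTy**2)) / sigma2
        return 0.5 * (n * np.log(2 * np.pi) + logdet + quad)

    # Crude starting values: split the response variance between noise and signal
    init = np.log([np.var(y) / 2.0, np.var(y) / (2.0 * p)])
    res = minimize(neg_log_marginal, init, method='L-BFGS-B')
    sigma2_hat, tau2_hat = np.exp(res.x)
    return sigma2_hat / tau2_hat, sigma2_hat, tau2_hat
```

The returned $\hat{\lambda} = \hat{\sigma}^2/\hat{\tau}^2$ can then be passed to the usual Ridge solver; comparing it against the CV/GCV choice is a useful sanity check.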
Connection to Restricted Maximum Likelihood (REML):
The marginal likelihood approach is related to REML estimation used in mixed-effects models. Both aim to estimate variance components (here, $\sigma^2$ and $\tau^2$) in a principled way.
Advantages of Empirical Bayes: it uses all of the data (no fold splitting or grid search), it estimates $\sigma^2$ and $\tau^2$ jointly, and it yields a full probabilistic interpretation of the fit.
Disadvantages: it relies on the Gaussian likelihood and prior being reasonable, it requires numerical optimization of a possibly multimodal objective, and it can be less robust to model misspecification than cross-validation.
Empirical Bayes λ selection is implemented in some statistical packages (e.g., lme4 in R for mixed models, which is closely related). For pure Ridge regression, CV/GCV are more commonly used due to their simplicity and robustness.
Having covered multiple λ selection methods, here's a practical workflow for applying Ridge regression:
1. Split off a held-out test set before any tuning.
2. Standardize the features (and center the response) using training-set statistics only.
3. Define a log-spaced grid of candidate λ values, e.g. np.logspace(-4, 4, 100).
4. Select λ by cross-validation (or GCV) on the training data.
5. Refit and evaluate the final model once on the test set.
```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


def complete_ridge_workflow(X, y, test_size=0.2, random_state=42):
    """
    Complete workflow for Ridge regression with λ selection.

    Parameters:
    -----------
    X : ndarray
        Feature matrix
    y : ndarray
        Target vector
    test_size : float
        Fraction of data for final testing

    Returns:
    --------
    dict with model, selected lambda, train/test scores
    """
    # 1. Split into train+validation and final test
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # 2. Standardize features (fit on train only!)
    scaler = StandardScaler()
    X_trainval_scaled = scaler.fit_transform(X_trainval)
    X_test_scaled = scaler.transform(X_test)

    # Center response using the training mean
    y_mean = y_trainval.mean()
    y_trainval_centered = y_trainval - y_mean

    # 3. Define λ grid (log-spaced)
    alphas = np.logspace(-4, 4, 100)

    # 4. Use RidgeCV with built-in cross-validation
    #    (with cv=None, RidgeCV instead uses an efficient leave-one-out/GCV scheme)
    ridge_cv = RidgeCV(
        alphas=alphas,
        scoring='neg_mean_squared_error',
        cv=5,                 # 5-fold CV
        fit_intercept=False   # Already centered/scaled
    )
    ridge_cv.fit(X_trainval_scaled, y_trainval_centered)

    best_alpha = ridge_cv.alpha_
    print(f"Selected λ = {best_alpha:.6f}")

    # 5. Final model is stored in ridge_cv; evaluate on the test set
    y_pred_test = ridge_cv.predict(X_test_scaled) + y_mean
    test_mse = np.mean((y_test - y_pred_test)**2)
    test_r2 = 1 - test_mse / np.var(y_test)

    print(f"Test MSE: {test_mse:.4f}")
    print(f"Test R²: {test_r2:.4f}")

    return {
        'model': ridge_cv,
        'scaler': scaler,
        'y_mean': y_mean,
        'best_alpha': best_alpha,
        'test_mse': test_mse,
        'test_r2': test_r2
    }
```

Avoid these common mistakes: (1) fitting the scaler on both train and test data (data leakage); (2) selecting λ using the test set (overfits hyperparameters); (3) forgetting to standardize features (Ridge depends heavily on scale); (4) using too narrow a λ grid (may miss the optimal region).
We've covered the major approaches to selecting the regularization parameter λ for Ridge regression: K-fold cross-validation (with the one-standard-error rule), Leave-One-Out CV and its efficient PRESS formula, Generalized Cross-Validation, information criteria (AIC, AICc, BIC), and Empirical Bayes via the marginal likelihood.
Module complete:
You now have a complete understanding of Ridge regression: the L2 penalty formulation, closed-form solution, shrinkage interpretation, bias-variance tradeoff, and λ selection methods. This foundation prepares you for Lasso (L1 regularization) in the next module, where sparsity-inducing properties provide automatic feature selection.
Congratulations! You've mastered Ridge regression from first principles. You understand the mathematical foundations, geometric intuition, statistical properties, and practical implementation. This knowledge forms the basis for understanding more complex regularization methods and is essential for effective machine learning practice.