We've established that Ridge regression, for some $\lambda > 0$, provides better MSE than OLS. But the theoretical optimal λ depends on unknown quantities—the true coefficients and noise variance. This creates a practical challenge: how do we select λ using only the observed data?
This is a fundamental problem in machine learning: selecting hyperparameters that control model complexity. The regularization parameter λ is not learned from data like the coefficients β; it must be set through a separate procedure.
By the end of this page, you will understand and implement multiple methods for selecting λ: K-fold cross-validation, Leave-One-Out cross-validation, Generalized Cross-Validation (GCV), information criteria (AIC, BIC), and the Empirical Bayes approach. You'll know when to use each and their tradeoffs.
A naive approach (don't do this):
One might think: "Minimize training RSS over λ." But this fails catastrophically:
$$\text{RSS}(\lambda) = \|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_\lambda\|_2^2$$
The training RSS is minimized at $\lambda = 0$ (OLS), because regularization only adds constraints. The OLS solution fits training data best, but (as we've seen) may generalize poorly.
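To see this concretely, here is a minimal sketch (synthetic data, so all values are illustrative) that computes the closed-form Ridge solution for several λ values and prints the training RSS, which only grows with λ:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # Closed-form Ridge solution: (X^T X + lam I)^{-1} X^T y  (OLS when lam = 0)
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    rss = np.sum((y - X @ beta) ** 2)
    print(f"lambda = {lam:6.1f}   training RSS = {rss:10.3f}")

# Training RSS is smallest at lambda = 0 (OLS) and increases monotonically with lambda,
# even though test error typically does not.
```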
The fundamental problem:
Training error is a biased (optimistic) estimate of test error. A model that fits the training data well may simply have memorized noise in the sample, so low training error does not guarantee low error on new data.
Never select hyperparameters by minimizing training error. This leads to overfitting. The hyperparameter that minimizes training error (λ = 0 for Ridge) is usually the worst choice for generalization. We need methods that estimate test error or penalize complexity.
What we really want to minimize:
We want to choose λ to minimize expected test error—the error on future, unseen data:
$$\text{Test Error}(\lambda) = \mathbb{E}_{(\mathbf{x}_{\text{new}},\, y_{\text{new}})} \left[ (y_{\text{new}} - \mathbf{x}_{\text{new}}^T\hat{\boldsymbol{\beta}}_\lambda)^2 \right]$$
Since we don't have access to the test distribution, we must estimate this quantity from training data. The methods below provide various approaches to this estimation.
The idea:
Cross-validation estimates test error by systematically holding out portions of the training data and measuring prediction error on the held-out portions.
K-Fold Cross-Validation procedure:
$$\text{CV}_K(\lambda) = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathcal{F}_k|} \sum_{i \in \mathcal{F}_k} (y_i - \hat{y}_{i}^{(-k)}(\lambda))^2$$
where $\hat{y}_{i}^{(-k)}(\lambda)$ is the prediction for observation $i$ using the model trained without fold $k$.
```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge


def ridge_cv_manual(X, y, lambdas, n_folds=5, random_state=42):
    """
    Perform K-fold cross-validation for Ridge regression to select
    the optimal lambda.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
        Feature matrix (should be standardized)
    y : ndarray of shape (n_samples,)
        Target vector
    lambdas : array-like
        Candidate lambda values to evaluate
    n_folds : int
        Number of cross-validation folds
    random_state : int
        Random seed for reproducible fold splits

    Returns:
    --------
    best_lambda : float
        Lambda that minimizes CV error
    cv_scores : ndarray
        CV error for each lambda
    cv_std : ndarray
        Standard error of CV estimate for each lambda
    """
    n_lambdas = len(lambdas)
    cv_scores = np.zeros(n_lambdas)
    cv_std = np.zeros(n_lambdas)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=random_state)

    for i, lam in enumerate(lambdas):
        fold_scores = []

        for train_idx, val_idx in kf.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train, y_val = y[train_idx], y[val_idx]

            # Fit Ridge on training fold
            model = Ridge(alpha=lam, fit_intercept=False)
            model.fit(X_train, y_train)

            # Predict on validation fold
            y_pred = model.predict(X_val)

            # Compute MSE on validation fold
            mse = np.mean((y_val - y_pred)**2)
            fold_scores.append(mse)

        cv_scores[i] = np.mean(fold_scores)
        cv_std[i] = np.std(fold_scores) / np.sqrt(n_folds)

    best_idx = np.argmin(cv_scores)
    best_lambda = lambdas[best_idx]

    return best_lambda, cv_scores, cv_std


def select_lambda_one_se_rule(lambdas, cv_scores, cv_std):
    """
    Select lambda using the one-standard-error rule:
    choose the largest (most regularized) lambda whose CV error is
    within one standard error of the minimum.

    This provides additional regularization as a hedge against
    overfitting during hyperparameter selection.
    """
    best_idx = np.argmin(cv_scores)
    threshold = cv_scores[best_idx] + cv_std[best_idx]

    # Largest lambda whose CV error is within one SE of the minimum
    # (order-independent, so lambdas may be sorted either way)
    eligible = [lam for lam, score in zip(lambdas, cv_scores) if score <= threshold]
    return max(eligible) if eligible else lambdas[best_idx]
```

Choosing K:
Common choices are K = 5 or K = 10. Smaller K means fewer model fits but a more pessimistic (upward-biased) error estimate, since each fold's model trains on less data; larger K reduces this bias at greater computational cost, and K = n gives Leave-One-Out CV (next section).
The one-standard-error rule:
Instead of selecting the λ that minimizes CV error, a more robust approach is the one-standard-error rule: choose the largest (most regularized) λ whose CV error is within one standard error of the minimum.
This provides additional regularization, hedging against overfitting in hyperparameter selection.
For Ridge regression, clever linear algebra can speed up CV. Since solutions for different λ values share the same SVD of X, compute the SVD once, then rapidly evaluate many λ values. For K-fold CV, update formulas exist that avoid refitting from scratch.
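As an illustration of this reuse, here is a short sketch (function name and synthetic data are my own) that computes the whole Ridge coefficient path from one SVD. With $\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$, the solution is $\hat{\boldsymbol{\beta}}_\lambda = \mathbf{V}\,\mathrm{diag}\!\big(\tfrac{s_j}{s_j^2+\lambda}\big)\,\mathbf{U}^T\mathbf{y}$, so each additional λ costs only a few vector operations:

```python
import numpy as np


def ridge_path_via_svd(X, y, lambdas):
    """Ridge coefficients for many lambdas from a single SVD of X.

    beta_hat(lam) = V diag(s / (s^2 + lam)) U^T y
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # computed once
    UTy = U.T @ y
    # One row of coefficients per lambda
    return np.array([Vt.T @ ((s / (s**2 + lam)) * UTy) for lam in lambdas])


# Quick check against the direct closed-form solution (synthetic data)
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 8))
y = rng.standard_normal(40)
lambdas = np.logspace(-3, 3, 7)

path = ridge_path_via_svd(X, y, lambdas)
direct = np.array([np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)
                   for lam in lambdas])
print(np.allclose(path, direct))  # True
```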
The extreme case: K = n
Leave-One-Out Cross-Validation (LOOCV) uses each observation as its own validation fold:
$$\text{LOOCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_{i}^{(-i)}(\lambda))^2$$
where $\hat{y}_{i}^{(-i)}$ is the prediction for observation $i$ from a model trained on all other observations.
Properties of LOOCV: it is nearly unbiased for test error (each model trains on $n-1$ observations), but its estimates can have high variance, and the naive implementation requires $n$ model fits per λ.
The shortcut formula:
Remarkably, for linear models like Ridge regression, LOOCV can be computed without refitting n models. Using the hat matrix $\mathbf{H}_\lambda = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T$:
$$\text{LOOCV}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i(\lambda)}{1 - h_{ii}(\lambda)} \right)^2$$
where $\hat{y}_i(\lambda)$ is the fitted value for observation $i$ from the model trained on all $n$ observations, and $h_{ii}(\lambda)$ is the $i$-th diagonal entry (leverage) of $\mathbf{H}_\lambda$.
This is the PRESS statistic (Predicted Residual Error Sum of Squares), normalized.
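As a sanity check on this identity, the following sketch (synthetic data; no intercept, matching the formula above) compares the shortcut to brute-force leave-one-out refitting for a single λ:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 5, 2.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Shortcut: residuals and leverages from the single full-data fit
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
residuals = y - H @ y
leverage = np.diag(H)
loocv_shortcut = np.mean((residuals / (1 - leverage)) ** 2)

# Brute force: refit n times, each time leaving one observation out
errors = []
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(p),
                             X[mask].T @ y[mask])
    errors.append((y[i] - X[i] @ beta_i) ** 2)
loocv_naive = np.mean(errors)

print(np.isclose(loocv_shortcut, loocv_naive))  # True
```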
```python
import numpy as np


def ridge_loocv_efficient(X, y, lambdas):
    """
    Compute LOOCV for Ridge regression efficiently using the
    hat matrix (PRESS) formula.

    The SVD of X is computed once (O(n p^2)) and shared across all
    lambdas; each lambda then costs only O(n p), versus refitting
    n models per lambda in the naive approach.

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    lambdas : array-like

    Returns:
    --------
    loocv_scores : ndarray
        LOOCV error for each lambda
    """
    n, p = X.shape
    loocv_scores = []

    # Compute SVD once (reused for all lambdas)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    for lam in lambdas:
        # Shrinkage factors for singular values
        d = s**2 / (s**2 + lam)  # diagonal of H in the SVD basis

        # Hat matrix diagonal: H = U diag(s^2/(s^2+λ)) U^T,
        # so h_ii = sum_j U_ij^2 * s_j^2 / (s_j^2 + λ)
        leverage = np.sum((U ** 2) * d, axis=1)

        # Fitted values: y_hat = H @ y = U @ diag(d) @ U^T @ y
        UTy = U.T @ y
        y_hat = U @ (d * UTy)

        # Residuals
        residuals = y - y_hat

        # LOOCV = mean of (residual / (1 - leverage))^2
        loocv = np.mean((residuals / (1 - leverage)) ** 2)
        loocv_scores.append(loocv)

    return np.array(loocv_scores)


def select_lambda_loocv(X, y, lambdas):
    """
    Select optimal lambda using LOOCV.
    """
    scores = ridge_loocv_efficient(X, y, lambdas)
    best_idx = np.argmin(scores)
    return lambdas[best_idx], scores
```

LOOCV is preferred when: (1) sample size is small (maximize training data), (2) you want a deterministic result (no random fold splits), (3) the efficient formula applies (Ridge, linear regression). For large n, K-fold CV may be preferable due to lower variance.
Motivation:
Generalized Cross-Validation (GCV), introduced by Craven and Wahba (1979), is a rotation-invariant approximation to LOOCV that replaces individual leverages with their average.
The GCV formula:
$$\text{GCV}(\lambda) = \frac{\frac{1}{n} \|\mathbf{y} - \hat{\mathbf{y}}_\lambda\|_2^2}{\left(1 - \frac{\text{df}(\lambda)}{n}\right)^2}$$
where $\text{df}(\lambda) = \text{tr}(\mathbf{H}_\lambda) = \sum_{j=1}^{p} \frac{d_j}{d_j + \lambda}$ is the effective degrees of freedom (here $d_j$ denotes the $j$-th eigenvalue of $\mathbf{X}^T\mathbf{X}$, i.e. the squared singular value of $\mathbf{X}$).
Comparison to LOOCV:
In LOOCV, each observation has its own leverage $h_{ii}$. GCV replaces all leverages with the average leverage $\bar{h} = \text{df}(\lambda)/n$:
$$\text{GCV} \approx \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat{y}_i}{1 - \bar{h}} \right)^2$$
This simplification makes GCV computationally trivial and often works well in practice.
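To make the approximation concrete, here is a small sketch (synthetic data, illustrative only) that computes both scores from the same fit; when the leverages $h_{ii}$ are roughly uniform, the two values are close:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 100, 10, 5.0
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

U, s, _ = np.linalg.svd(X, full_matrices=False)
d = s**2 / (s**2 + lam)               # shrinkage factors
leverage = np.sum(U**2 * d, axis=1)   # exact leverages h_ii
y_hat = U @ (d * (U.T @ y))
residuals = y - y_hat

loocv = np.mean((residuals / (1 - leverage)) ** 2)        # individual leverages
gcv = np.mean((residuals / (1 - leverage.mean())) ** 2)   # average leverage = df/n
print(f"LOOCV = {loocv:.4f}, GCV = {gcv:.4f}")
```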
Properties of GCV:
Rotation invariance: GCV is unchanged by orthogonal transformations of the data. This is desirable since the "best" λ shouldn't depend on coordinate choices.
Computational efficiency: Only requires computing RSS and trace of hat matrix—both easy for Ridge.
Asymptotic optimality: Under certain conditions, GCV-selected λ achieves optimal prediction risk asymptotically.
No tuning parameters: Unlike K-fold CV, no choice of K is needed.
The effective degrees of freedom:
The denominator involves $\text{df}(\lambda) = \text{tr}(\mathbf{H}_\lambda)$, which measures the "effective number of parameters" used by the model. As λ increases, df(λ) shrinks toward 0 (heavier shrinkage, a simpler effective model); as λ decreases toward 0, df(λ) approaches $p$, matching the OLS fit. A quick numeric check of this behavior is sketched below.
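Here is that check (synthetic $\mathbf{X}$, illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((60, 8))                 # p = 8 features
s2 = np.linalg.svd(X, compute_uv=False) ** 2     # d_j = squared singular values

for lam in [1e-6, 1.0, 10.0, 100.0, 1e6]:
    df = np.sum(s2 / (s2 + lam))
    print(f"lambda = {lam:10.3g}   df = {df:5.2f}")

# df decreases monotonically from ~8 (the OLS value, p) toward 0 as lambda grows.
```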
```python
import numpy as np


def ridge_gcv(X, y, lambdas):
    """
    Compute the Generalized Cross-Validation score for Ridge regression.

    GCV(λ) = (RSS(λ) / n) / (1 - df(λ)/n)^2

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    lambdas : array-like

    Returns:
    --------
    gcv_scores : ndarray
        GCV score for each lambda
    df_values : ndarray
        Effective degrees of freedom for each lambda
    """
    n, p = X.shape

    # SVD for efficient computation
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y

    gcv_scores = []
    df_values = []

    for lam in lambdas:
        # Shrinkage factors
        shrink = s**2 / (s**2 + lam)

        # Effective degrees of freedom
        df = np.sum(shrink)
        df_values.append(df)

        # Fitted values in the SVD basis
        y_hat = U @ (shrink * UTy)

        # RSS
        rss = np.sum((y - y_hat)**2)

        # GCV score
        gcv = (rss / n) / (1 - df / n)**2
        gcv_scores.append(gcv)

    return np.array(gcv_scores), np.array(df_values)


def select_lambda_gcv(X, y, lambdas):
    """
    Select optimal lambda using GCV.
    """
    gcv_scores, df_values = ridge_gcv(X, y, lambdas)
    best_idx = np.argmin(gcv_scores)
    return lambdas[best_idx], gcv_scores, df_values
```

Comparing the three CV-based selectors:

| Method | Bias | Variance | Computation | Tuning |
|---|---|---|---|---|
| K-Fold CV | Moderate (depends on K) | Moderate | K model fits per λ | Choose K |
| LOOCV | Low (nearly unbiased) | High | Efficient formula available | None |
| GCV | Low (approximates LOOCV) | Lower than LOOCV | Very fast | None |
Information criteria provide an alternative to cross-validation by explicitly penalizing model complexity.
The general form:
$$\text{IC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + \text{Penalty}(\text{df}(\lambda))$$
The penalty increases with the effective degrees of freedom, discouraging overly complex models.
Akaike Information Criterion (AIC):
$$\text{AIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + 2 \cdot \text{df}(\lambda)$$
AIC aims to select the model that minimizes expected prediction error. It tends to select more complex models.
Bayesian Information Criterion (BIC):
$$\text{BIC}(\lambda) = n \log(\text{RSS}(\lambda)/n) + \log(n) \cdot \text{df}(\lambda)$$
BIC has a stronger penalty for complexity ($\log(n)$ vs 2 when $n > 7$). It aims to select the "true" model if it exists, preferring simpler models.
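For example, with $n = 1000$ the BIC penalty per effective degree of freedom is $\log(1000) \approx 6.9$, more than three times AIC's fixed penalty of 2, so BIC will generally select a larger λ (fewer effective parameters).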
Corrected AIC (AICc):
When sample size is small relative to parameters, AIC tends to overfit. The corrected version:
$$\text{AICc}(\lambda) = \text{AIC}(\lambda) + \frac{2 \cdot \text{df}(\lambda)(\text{df}(\lambda) + 1)}{n - \text{df}(\lambda) - 1}$$
AICc adds an extra penalty that becomes significant when df is comparable to n.
When to use which: AIC or AICc when the goal is predictive accuracy, AICc when df(λ) is not small relative to $n$, and BIC when you prefer a simpler, more heavily regularized model.
```python
import numpy as np


def ridge_information_criteria(X, y, lambdas):
    """
    Compute AIC, AICc, and BIC for Ridge regression.

    Uses effective degrees of freedom:
    df(λ) = tr(H_λ) = Σ d_j/(d_j + λ)

    Parameters:
    -----------
    X : ndarray of shape (n_samples, n_features)
    y : ndarray of shape (n_samples,)
    lambdas : array-like

    Returns:
    --------
    dict with 'aic', 'aicc', 'bic' arrays and 'df' array
    """
    n, p = X.shape

    # SVD
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y

    results = {'aic': [], 'aicc': [], 'bic': [], 'df': [], 'rss': []}

    for lam in lambdas:
        # Shrinkage factors
        shrink = s**2 / (s**2 + lam)

        # Effective degrees of freedom
        df = np.sum(shrink)

        # Fitted values and RSS
        y_hat = U @ (shrink * UTy)
        rss = np.sum((y - y_hat)**2)

        # Log-likelihood term (up to additive constants)
        ll_term = n * np.log(rss / n)

        # AIC
        aic = ll_term + 2 * df

        # AICc (corrected AIC)
        if n - df - 1 > 0:
            aicc = aic + 2 * df * (df + 1) / (n - df - 1)
        else:
            aicc = np.inf  # Model too complex for sample size

        # BIC
        bic = ll_term + np.log(n) * df

        results['aic'].append(aic)
        results['aicc'].append(aicc)
        results['bic'].append(bic)
        results['df'].append(df)
        results['rss'].append(rss)

    for key in results:
        results[key] = np.array(results[key])

    return results


def select_lambda_ic(X, y, lambdas, criterion='aic'):
    """
    Select optimal lambda using an information criterion.

    Parameters:
    -----------
    criterion : str
        One of 'aic', 'aicc', 'bic'
    """
    results = ridge_information_criteria(X, y, lambdas)
    scores = results[criterion]
    best_idx = np.argmin(scores)
    return lambdas[best_idx], results
```

For most prediction tasks, GCV or K-fold CV work well. Use BIC if you specifically want a sparser model (more regularization). AIC/AICc are theoretically motivated but often give similar results to GCV. When in doubt, use 5-fold or 10-fold CV as it's well-understood and widely accepted.
The Bayesian perspective revisited:
Recall that Ridge regression corresponds to MAP estimation with a Gaussian prior: $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I})$, where $\lambda = \sigma^2/\tau^2$.
Instead of choosing λ by cross-validation, we can estimate it by maximizing the marginal likelihood—the probability of the observed data after integrating out the unknown coefficients.
The marginal likelihood:
Integrating over the prior on $\boldsymbol{\beta}$:
$$p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2) = \int p(\mathbf{y} | \mathbf{X}, \boldsymbol{\beta}, \sigma^2) p(\boldsymbol{\beta} | \tau^2) d\boldsymbol{\beta}$$
For the Gaussian model, this integral is tractable:
$$\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2 \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n + \tau^2 \mathbf{X}\mathbf{X}^T)$$
Log marginal likelihood:
$$\log p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|\mathbf{C}| - \frac{1}{2}\mathbf{y}^T\mathbf{C}^{-1}\mathbf{y}$$
where $\mathbf{C} = \sigma^2 \mathbf{I} + \tau^2 \mathbf{X}\mathbf{X}^T$.
Empirical Bayes estimation:
Maximize the (log) marginal likelihood over $\sigma^2$ and $\tau^2$ (or equivalently, over $\sigma^2$ and $\lambda = \sigma^2/\tau^2$):
$$(\hat{\sigma}^2, \hat{\tau}^2) = \arg\max_{\sigma^2, \tau^2} \log p(\mathbf{y} | \mathbf{X}, \sigma^2, \tau^2)$$
Then use $\hat{\lambda} = \hat{\sigma}^2/\hat{\tau}^2$ as the regularization parameter.
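Here is a minimal sketch of this idea (the function name, starting values, and optimizer choices are my own; it assumes a standardized $\mathbf{X}$, centered $\mathbf{y}$, and no intercept). It maximizes the log marginal likelihood above over $(\sigma^2, \tau^2)$ on the log scale with scipy, using the SVD of $\mathbf{X}$ to evaluate $\log|\mathbf{C}|$ and $\mathbf{y}^T\mathbf{C}^{-1}\mathbf{y}$ cheaply:

```python
import numpy as np
from scipy.optimize import minimize


def ridge_empirical_bayes(X, y):
    """
    Estimate (sigma^2, tau^2) by maximizing the Gaussian marginal likelihood
    y ~ N(0, sigma^2 I + tau^2 X X^T), then return lambda = sigma^2 / tau^2.

    Assumes X is standardized, y is centered, and there is no intercept.
    """
    n, p = X.shape
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    UTy = U.T @ y
    yTy = y @ y
    r = len(s)  # rank of the thin SVD

    def neg_log_marginal(params):
        # Optimize on the log scale so both variances stay positive
        sigma2, tau2 = np.exp(params)
        ev = sigma2 + tau2 * s**2  # eigenvalues of C within the column space of U
        logdet = np.sum(np.log(ev)) + (n - r) * np.log(sigma2)
        quad = np.sum(UTy**2 / ev) + (yTy - np.sum(UTy**2)) / sigma2
        return 0.5 * (n * np.log(2 * np.pi) + logdet + quad)

    # Crude starting values: split the response variance between noise and signal
    init = np.log([np.var(y) / 2.0, np.var(y) / (2.0 * p)])
    res = minimize(neg_log_marginal, init, method='L-BFGS-B')
    sigma2_hat, tau2_hat = np.exp(res.x)
    return sigma2_hat / tau2_hat, sigma2_hat, tau2_hat
```

The returned $\hat{\lambda} = \hat{\sigma}^2/\hat{\tau}^2$ can then be passed to the usual Ridge solver; comparing it against the CV/GCV choice is a useful sanity check.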
Connection to Restricted Maximum Likelihood (REML):
The marginal likelihood approach is related to REML estimation used in mixed-effects models. Both aim to estimate variance components (here, $\sigma^2$ and $\tau^2$) in a principled way.
Advantages of Empirical Bayes: it uses all of the data (no fold splitting or grid search), it estimates $\sigma^2$ and $\tau^2$ jointly, and it yields a full probabilistic interpretation of the fit.
Disadvantages: it relies on the Gaussian likelihood and prior being reasonable, it requires numerical optimization of a possibly multimodal objective, and it can be less robust to model misspecification than cross-validation.
Empirical Bayes λ selection is implemented in some statistical packages (e.g., lme4 in R for mixed models, which is closely related). For pure Ridge regression, CV/GCV are more commonly used due to their simplicity and robustness.
Having covered multiple λ selection methods, here's a practical workflow for applying Ridge regression:
1. Split off a held-out test set before any tuning.
2. Standardize the features (and center the response) using training-set statistics only.
3. Define a log-spaced grid of candidate λ values, e.g. np.logspace(-4, 4, 100).
4. Select λ by cross-validation (or GCV) on the training data.
5. Refit and evaluate the final model once on the test set.
```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


def complete_ridge_workflow(X, y, test_size=0.2, random_state=42):
    """
    Complete workflow for Ridge regression with λ selection.

    Parameters:
    -----------
    X : ndarray
        Feature matrix
    y : ndarray
        Target vector
    test_size : float
        Fraction of data for final testing

    Returns:
    --------
    dict with model, selected lambda, train/test scores
    """
    # 1. Split into train+validation and final test
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # 2. Standardize features (fit on train only!)
    scaler = StandardScaler()
    X_trainval_scaled = scaler.fit_transform(X_trainval)
    X_test_scaled = scaler.transform(X_test)

    # Center response using the training mean
    y_mean = y_trainval.mean()
    y_trainval_centered = y_trainval - y_mean

    # 3. Define λ grid (log-spaced)
    alphas = np.logspace(-4, 4, 100)

    # 4. Use RidgeCV with built-in cross-validation
    #    (with cv=None, RidgeCV instead uses an efficient leave-one-out/GCV scheme)
    ridge_cv = RidgeCV(
        alphas=alphas,
        scoring='neg_mean_squared_error',
        cv=5,                 # 5-fold CV
        fit_intercept=False   # Already centered/scaled
    )
    ridge_cv.fit(X_trainval_scaled, y_trainval_centered)

    best_alpha = ridge_cv.alpha_
    print(f"Selected λ = {best_alpha:.6f}")

    # 5. Final model is stored in ridge_cv; evaluate on the test set
    y_pred_test = ridge_cv.predict(X_test_scaled) + y_mean
    test_mse = np.mean((y_test - y_pred_test)**2)
    test_r2 = 1 - test_mse / np.var(y_test)

    print(f"Test MSE: {test_mse:.4f}")
    print(f"Test R²: {test_r2:.4f}")

    return {
        'model': ridge_cv,
        'scaler': scaler,
        'y_mean': y_mean,
        'best_alpha': best_alpha,
        'test_mse': test_mse,
        'test_r2': test_r2
    }
```

Avoid these common mistakes: (1) fitting the scaler on both train and test data (data leakage); (2) selecting λ using the test set (overfits hyperparameters); (3) forgetting to standardize features (Ridge depends heavily on scale); (4) using too narrow a λ grid (may miss the optimal region).
We've covered the major approaches to selecting the regularization parameter λ for Ridge regression: K-fold cross-validation (with the one-standard-error rule), Leave-One-Out CV and its efficient PRESS formula, Generalized Cross-Validation, information criteria (AIC, AICc, BIC), and Empirical Bayes via the marginal likelihood.
Module complete:
You now have a complete understanding of Ridge regression: the L2 penalty formulation, closed-form solution, shrinkage interpretation, bias-variance tradeoff, and λ selection methods. This foundation prepares you for Lasso (L1 regularization) in the next module, where sparsity-inducing properties provide automatic feature selection.
Congratulations! You've mastered Ridge regression from first principles. You understand the mathematical foundations, geometric intuition, statistical properties, and practical implementation. This knowledge forms the basis for understanding more complex regularization methods and is essential for effective machine learning practice.