Imagine you're a student preparing for an exam. You have access to last year's test, complete with answers. You study those specific questions until you can answer each one perfectly. On exam day, you feel confident—until you realize the new test has entirely different questions. Your perfect preparation becomes worthless.
This is precisely what happens when machine learning models overfit.
A model that achieves 99% accuracy on training data but only 60% on new data hasn't learned the underlying patterns—it has memorized the specific examples. Understanding this distinction, quantifying it, and learning to prevent it forms the very heart of practical machine learning.
By the end of this page, you will understand the precise mathematical distinction between training and test error, how to interpret learning curves, and why the generalization gap is the most critical metric for evaluating model quality. These concepts form the diagnostic foundation for all regularization techniques that follow.
Before we can understand overfitting, we need precise definitions. Machine learning fundamentally involves learning a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ from examples. Our goal is to find a hypothesis $h$ that approximates the true function $f$ (or the conditional distribution $P(Y|X)$).
The Training Set and Training Error
We observe a finite sample of data:
$$\mathcal{D}_{train} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$$
The training error (also called empirical error or in-sample error) measures how well our hypothesis $h$ performs on this observed data:
$$\text{Err}_{train} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$$
where $L(y, \hat{y})$ is a loss function (e.g., squared error $(y - \hat{y})^2$ for regression, or 0-1 loss $\mathbf{1}[y \neq \hat{y}]$ for classification).
For regression: Mean Squared Error (MSE) = $\frac{1}{n}\sum(y_i - \hat{y}_i)^2$. For classification: 0-1 loss, log loss (cross-entropy), or hinge loss. The choice of loss affects the optimal solution, but the training/test distinction remains fundamental regardless.
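As a quick illustration (the toy numbers below are made up for this sketch, not values from the text), here is how the same empirical-average formula looks under two of the losses just mentioned:

```python
import numpy as np

# Toy regression targets and predictions (illustrative values only)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.8, 3.3, 3.7])

# Training error under squared-error loss: (1/n) * sum (y - y_hat)^2
mse_train = np.mean((y_true - y_pred) ** 2)

# Toy classification labels and predictions
c_true = np.array([0, 1, 1, 0, 1])
c_pred = np.array([0, 1, 0, 0, 1])

# Training error under 0-1 loss: fraction of misclassified points
zero_one_train = np.mean(c_true != c_pred)

print(f"Training MSE:      {mse_train:.4f}")       # 0.0575
print(f"Training 0-1 loss: {zero_one_train:.2f}")  # 0.20
```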
The Test Set and Test Error
The test error (also called generalization error or out-of-sample error) measures performance on new, unseen data drawn from the same distribution:
$$\text{Err}_{test} = \mathbb{E}_{(x,y) \sim P}[L(y, h(x))]$$
In practice, we estimate this expectation using a held-out test set $\mathcal{D}_{test}$ that was never used during training:
$$\widehat{\text{Err}}_{test} = \frac{1}{m} \sum_{j=1}^{m} L(y_j^{test}, h(x_j^{test}))$$
The test set serves as a proxy for all possible future data. It must remain completely untouched during model development—using it for any decision (hyperparameter tuning, model selection) contaminates it and invalidates the error estimate.
| Aspect | Training Error | Test Error |
|---|---|---|
| Data source | Same data used to fit model | New, unseen data from same distribution |
| Optimization | Directly minimized during training | Never optimized—only evaluated |
| Bias direction | Optimistically biased (underestimates true error) | Unbiased estimate of generalization |
| Sample size | Usually larger (all training data) | Often smaller (held-out portion) |
| Ideal value | As low as possible without overfitting | As close to training error as possible |
The generalization gap is the difference between test error and training error:
$$\text{Generalization Gap} = \text{Err}_{test} - \text{Err}_{train}$$
This gap is perhaps the single most important diagnostic in machine learning. It tells us exactly how much our model has overfit to the training data.
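Here is a minimal sketch of computing all three quantities on a held-out split: training error, the estimated test error, and their difference. The synthetic linear data and the choice of LinearRegression are illustrative assumptions; any model and dataset would do.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data from a noisy linear relationship (illustrative only)
X = rng.uniform(-1, 1, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.2, size=200)

# Hold out a test set that is never used for fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Err_train: average loss on the data the model was fit to
err_train = mean_squared_error(y_train, model.predict(X_train))

# Estimated Err_test: average loss on the held-out sample
err_test = mean_squared_error(y_test, model.predict(X_test))

print(f"Training error:       {err_train:.4f}")
print(f"Estimated test error: {err_test:.4f}")
print(f"Gap (test - train):   {err_test - err_train:.4f}")
```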
Why the Gap Exists
The gap arises because the model sees the training data during optimization. Any noise, outliers, or idiosyncratic patterns in the training set get incorporated into the hypothesis. These patterns don't generalize to new data because they were specific to the training sample, not the underlying distribution.
Training error is always optimistically biased because the model was optimized on exactly that data. While individual test sets might occasionally yield lower error than training error due to random variation, in expectation, $\mathbb{E}[\text{Err}_{test}] \geq \mathbb{E}[\text{Err}_{train}]$. A model claiming lower test error than training error warrants skepticism.
Interpreting the Gap
The generalization gap reveals the model's behavior:
| Gap Size | Training Error | Interpretation | Diagnosis |
|---|---|---|---|
| Small | Low | Ideal: good fit, good generalization | Well-fitted model |
| Small | High | Underfitting: model too simple | Increase model complexity |
| Large | Low | Overfitting: memorized training data | Regularize or simplify |
| Large | High | Rare: usually indicates data issues | Check for data problems |
The sweet spot is a small gap with low training error—the model has learned the true patterns without memorizing noise.
Mathematical Decomposition
We can decompose the expected test error of our learning algorithm into components that illuminate the gap:
$$\mathbb{E}[\text{Err}_{test}] = \text{Irreducible Error} + \text{Bias}^2 + \text{Variance}$$
where:

- Irreducible Error: the noise inherent in the data itself; no model can remove it.
- Bias²: how far the average prediction (across many training sets) is from the true function, the signature of a model that is too simple.
- Variance: how much the fitted model fluctuates from one training set to another, the signature of a model that is too flexible.
The generalization gap is primarily driven by the variance term. High variance means the model's predictions change significantly with different training sets—a hallmark of overfitting.
High training error suggests high bias (underfitting). Large generalization gap with low training error suggests high variance (overfitting). This decomposition guides whether to increase model complexity or add regularization.
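To make the decomposition concrete, the sketch below estimates the bias² and variance terms empirically by refitting a polynomial on many freshly drawn training sets from the noisy sine setup used later on this page; the sample size, degree, and number of repeats are illustrative choices, not values prescribed by the text.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

n_train, noise_std, degree, n_repeats = 30, 0.3, 9, 200
x_eval = np.linspace(0.05, 0.95, 50).reshape(-1, 1)  # fixed evaluation points

preds = np.zeros((n_repeats, len(x_eval)))
for r in range(n_repeats):
    # Draw a fresh training set from the same distribution each time
    X = rng.uniform(0, 1, n_train).reshape(-1, 1)
    y = true_f(X).ravel() + rng.normal(0, noise_std, n_train)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    preds[r] = model.fit(X, y).predict(x_eval)

avg_pred = preds.mean(axis=0)
bias_sq = np.mean((avg_pred - true_f(x_eval).ravel()) ** 2)  # Bias^2 term
variance = np.mean(preds.var(axis=0))                        # Variance term
irreducible = noise_std ** 2                                 # Noise variance

print(f"Bias^2:      {bias_sq:.4f}")
print(f"Variance:    {variance:.4f}")
print(f"Irreducible: {irreducible:.4f}")
print(f"Expected test MSE (approx.): {bias_sq + variance + irreducible:.4f}")
```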
Learning curves are the primary diagnostic tool for understanding model behavior. They plot training and test error against either:

- the training set size (how much data the model sees), or
- the model complexity (how flexible the model is).
Each perspective reveals different aspects of the training/test relationship.
Learning Curves vs Training Set Size
As training data increases:

- Training error typically rises: with only a few points the model can fit them almost perfectly, but as more (noisy) examples arrive, fitting them all becomes harder.
- Test error typically falls: each additional example gives the model more information about the underlying distribution.

The two curves approach each other, and in the limit the gap closes:
$$\lim_{n \to \infty} (\text{Err}_{test} - \text{Err}_{train}) = 0$$
With infinite data, the training set would contain all possible patterns, and the distinction between training and test would vanish.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

def analyze_learning_curves(X, y, degrees=[1, 4, 15], alphas=[0.0, 0.0, 0.0]):
    """
    Plot learning curves for polynomial models of different complexity.

    Demonstrates the evolution of training and test error as:
    1. Training set size increases (x-axis)
    2. Model complexity increases (different curves)

    Key insight: Complex models (degree=15) show large gap that shrinks
    with more data, while simple models (degree=1) show persistent gap
    with high baseline error (underfitting).
    """
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    train_sizes = np.linspace(0.1, 1.0, 10)

    for ax, degree, alpha in zip(axes, degrees, alphas):
        # Create polynomial features
        poly = PolynomialFeatures(degree=degree)
        X_poly = poly.fit_transform(X)

        # Compute learning curves
        train_sizes_abs, train_scores, test_scores = learning_curve(
            Ridge(alpha=alpha), X_poly, y,
            train_sizes=train_sizes, cv=5,
            scoring='neg_mean_squared_error', n_jobs=-1
        )

        # Convert to positive error and compute statistics
        train_mean = -train_scores.mean(axis=1)
        train_std = train_scores.std(axis=1)
        test_mean = -test_scores.mean(axis=1)
        test_std = test_scores.std(axis=1)

        # Plot with confidence bands
        ax.fill_between(train_sizes_abs, train_mean - train_std,
                        train_mean + train_std, alpha=0.2, color='blue')
        ax.plot(train_sizes_abs, train_mean, 'b-', label='Training Error')
        ax.fill_between(train_sizes_abs, test_mean - test_std,
                        test_mean + test_std, alpha=0.2, color='red')
        ax.plot(train_sizes_abs, test_mean, 'r-', label='Test Error')

        # Annotate the gap
        gap = test_mean[-1] - train_mean[-1]
        ax.annotate(f'Gap: {gap:.3f}',
                    xy=(train_sizes_abs[-1], (test_mean[-1] + train_mean[-1]) / 2),
                    fontsize=10, ha='right')

        ax.set_xlabel('Training Set Size')
        ax.set_ylabel('Mean Squared Error')
        ax.set_title(f'Degree {degree} Polynomial')
        ax.legend(loc='upper right')
        ax.set_ylim(bottom=0)

    plt.tight_layout()
    return fig

# Key diagnostic interpretations:
#
# Degree 1 (Underfitting):
#   - Both curves converge to HIGH error
#   - Small gap, but poor absolute performance
#   - Solution: Increase complexity
#
# Degree 4 (Good fit):
#   - Both curves converge to LOW error
#   - Moderate gap that shrinks with data
#   - Solution: This is the target behavior
#
# Degree 15 (Overfitting):
#   - Training error near zero
#   - Test error high, gap very large
#   - Solution: Regularize or reduce complexity
```

Learning Curves vs Model Complexity
As model complexity increases (more parameters, higher-degree polynomials, deeper networks):

- Training error decreases, essentially monotonically: a more flexible model can always fit the training set at least as well as a simpler one.
- Test error typically decreases at first (as bias falls), reaches a minimum, and then rises again as the model begins to fit noise.
This U-shaped test error curve is characteristic of the bias-variance tradeoff. The optimal complexity minimizes test error, not training error.
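The U-shape can be traced directly with scikit-learn's validation_curve, sweeping polynomial degree as the complexity knob. This sketch assumes 1-D inputs X of shape (n, 1) and a pipeline of PolynomialFeatures plus LinearRegression; it is one possible setup, not the only one.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def complexity_curve(X, y, max_degree=12):
    """Training and cross-validated error as a function of polynomial degree."""
    estimator = Pipeline([
        ('poly', PolynomialFeatures()),
        ('reg', LinearRegression()),
    ])
    degrees = np.arange(1, max_degree + 1)

    # Sweep the 'degree' hyperparameter; higher degree = more complex model
    train_scores, val_scores = validation_curve(
        estimator, X, y,
        param_name='poly__degree', param_range=degrees,
        cv=5, scoring='neg_mean_squared_error'
    )

    train_err = -train_scores.mean(axis=1)  # falls as complexity grows
    val_err = -val_scores.mean(axis=1)      # U-shaped: falls, then rises

    best = degrees[np.argmin(val_err)]
    for d, tr, va in zip(degrees, train_err, val_err):
        marker = ' <-- lowest validation error' if d == best else ''
        print(f"degree {d:2d}: train MSE {tr:.4f}, CV MSE {va:.4f}{marker}")
    return degrees, train_err, val_err
```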
Recent research has revealed that in some settings (particularly deep learning), the test error curve can exhibit 'double descent'—after the classical U-shaped curve, error decreases again as models become massively overparameterized. This challenges classical intuitions but doesn't invalidate the importance of understanding the generalization gap.
Knowing the theory is essential, but recognizing overfitting in real-world scenarios requires pattern recognition across multiple symptoms.
A Concrete Example: Polynomial Regression
Consider fitting polynomials to noisy data generated from $y = \sin(2\pi x) + \epsilon$ where $\epsilon \sim N(0, 0.3)$:
| Degree | Training MSE | Test MSE | Gap | Status |
|---|---|---|---|---|
| 1 | 0.42 | 0.45 | 0.03 | Underfitting |
| 3 | 0.08 | 0.11 | 0.03 | Good fit |
| 5 | 0.05 | 0.09 | 0.04 | Slight overfit |
| 9 | 0.01 | 0.18 | 0.17 | Overfitting |
| 15 | 0.0001 | 0.89 | 0.89 | Severe overfitting |
The degree-15 polynomial achieves essentially zero training error—it passes exactly through every training point. But its test error is catastrophic because it has learned the noise, not the signal.
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def demonstrate_overfitting():
    """
    Concrete demonstration of overfitting with polynomial regression.

    The true function is sin(2πx), observed with Gaussian noise.
    We fit polynomials of increasing degree and observe the
    training/test error divergence.
    """
    np.random.seed(42)

    # Generate true function with noise
    n_train, n_test = 20, 100
    X_train = np.random.uniform(0, 1, n_train).reshape(-1, 1)
    X_test = np.random.uniform(0, 1, n_test).reshape(-1, 1)
    noise_std = 0.3
    y_train = np.sin(2 * np.pi * X_train).ravel() + np.random.normal(0, noise_std, n_train)
    y_test = np.sin(2 * np.pi * X_test).ravel() + np.random.normal(0, noise_std, n_test)

    results = []
    for degree in [1, 3, 5, 9, 15, 20]:
        # Transform features
        poly = PolynomialFeatures(degree=degree)
        X_train_poly = poly.fit_transform(X_train)
        X_test_poly = poly.transform(X_test)

        # Fit model
        model = LinearRegression()
        model.fit(X_train_poly, y_train)

        # Compute errors
        y_train_pred = model.predict(X_train_poly)
        y_test_pred = model.predict(X_test_poly)
        train_mse = mean_squared_error(y_train, y_train_pred)
        test_mse = mean_squared_error(y_test, y_test_pred)
        gap = test_mse - train_mse

        # Coefficient analysis (overfitting signature)
        coef_magnitude = np.sum(np.abs(model.coef_))

        results.append({
            'degree': degree,
            'n_params': degree + 1,
            'train_mse': train_mse,
            'test_mse': test_mse,
            'gap': gap,
            'coef_magnitude': coef_magnitude
        })

        print(f"Degree {degree:2d}: Train MSE = {train_mse:.4f}, "
              f"Test MSE = {test_mse:.4f}, Gap = {gap:.4f}, "
              f"|coefs| = {coef_magnitude:.1f}")

    return results

# Output:
# Degree  1: Train MSE = 0.4234, Test MSE = 0.4521, Gap = 0.0287, |coefs| = 2.3
# Degree  3: Train MSE = 0.0782, Test MSE = 0.1084, Gap = 0.0302, |coefs| = 15.2
# Degree  5: Train MSE = 0.0531, Test MSE = 0.0923, Gap = 0.0392, |coefs| = 89.7
# Degree  9: Train MSE = 0.0089, Test MSE = 0.1812, Gap = 0.1723, |coefs| = 8234.1
# Degree 15: Train MSE = 0.0001, Test MSE = 0.8892, Gap = 0.8891, |coefs| = 892341.2
# Degree 20: Train MSE = 0.0000, Test MSE = 5.2341, Gap = 5.2341, |coefs| = 9.2e+07

# Key observations:
# 1. Training MSE decreases monotonically
# 2. Test MSE has a minimum around degree 3-5
# 3. Gap increases dramatically with complexity
# 4. Coefficient magnitude explodes—a key overfitting signature
```

Correctly measuring the generalization gap requires disciplined data handling. Violations of proper protocol lead to data leakage—where information from the test set inappropriately influences training, yielding optimistic but invalid error estimates.
The Three-Way Split
For proper evaluation, data should be divided into three disjoint sets:
$$\mathcal{D} = \mathcal{D}_{train} \cup \mathcal{D}_{val} \cup \mathcal{D}_{test}$$
$$\mathcal{D}_{train} \cap \mathcal{D}_{val} = \mathcal{D}_{train} \cap \mathcal{D}_{test} = \mathcal{D}_{val} \cap \mathcal{D}_{test} = \emptyset$$
A common error is using the test set for model selection. If you try 50 hyperparameter configurations and pick the one with lowest test error, you've effectively trained on the test set. The reported test error will be optimistic. The validation set absorbs this selection bias, preserving the test set for final evaluation.
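One simple way to realize the three-way split in code is to call train_test_split twice; the 60/20/20 proportions below are a conventional choice for illustration, not a requirement.

```python
from sklearn.model_selection import train_test_split

def three_way_split(X, y, val_frac=0.2, test_frac=0.2, seed=42):
    """Split into disjoint train / validation / test sets."""
    # First carve off the test set; it is not touched again until final evaluation
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=seed
    )
    # Then split the remainder into train and validation
    # (rescale val_frac so it is a fraction of the *original* data)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_frac / (1 - test_frac), random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Usage: model selection uses (train, val); the test set is used exactly once.
```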
Cross-Validation for Small Datasets
When data is limited, k-fold cross-validation provides a more robust estimate:
$$\widehat{\text{Err}}_{CV} = \frac{1}{k} \sum_{i=1}^{k} \text{Err}^{(i)}$$

where $\text{Err}^{(i)}$ is the validation error on the $i$-th fold, measured after training on the remaining $k-1$ folds.
This uses all data for both training and validation (never simultaneously), yielding lower-variance estimates than a single train-test split.
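For clarity, here is a short manual version of the cross-validation formula using KFold directly (library helpers such as cross_val_score, used later on this page, wrap the same loop). The Ridge estimator is just a placeholder, and X, y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.base import clone

def kfold_cv_error(model, X, y, k=5, seed=0):
    """Average validation error over k folds: (1/k) * sum_i Err^(i)."""
    fold_errors = []
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(X):
        m = clone(model)                    # fresh copy so folds don't interact
        m.fit(X[train_idx], y[train_idx])   # train on k-1 folds
        pred = m.predict(X[val_idx])        # validate on the held-out fold
        fold_errors.append(mean_squared_error(y[val_idx], pred))
    return np.mean(fold_errors), fold_errors

# Example: err_cv, per_fold = kfold_cv_error(Ridge(alpha=1.0), X, y, k=5)
```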
| Protocol | When to Use | Advantages | Disadvantages |
|---|---|---|---|
| Single Train-Test | Large datasets (>100K) | Simple, fast, low variance | Wastes data, single estimate |
| Train-Val-Test | Standard practice, medium+ datasets | Protects test set integrity | Still wastes some data |
| 5-Fold CV | Small-medium datasets | Uses all data efficiently | 5x computational cost |
| 10-Fold CV | Standard for small datasets | Lower variance than 5-fold | 10x computational cost |
| Leave-One-Out | Very small datasets (<100) | Maximum data efficiency | High variance, expensive |
```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, KFold

def proper_evaluation_protocol(X, y, model_class, param_grid):
    """
    Demonstrates the correct three-way split protocol with
    cross-validation for hyperparameter tuning.

    Steps:
    1. Hold out final test set (never touch until the very end)
    2. Use cross-validation on train+val for hyperparameter selection
    3. Retrain best model on full train+val
    4. Report final performance on held-out test set
    """
    # Step 1: Create held-out test set
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"Data split: {len(X_train_val)} train+val, {len(X_test)} test")

    # Step 2: Cross-validation for model selection
    best_score = -np.inf
    best_params = None
    cv = KFold(n_splits=5, shuffle=True, random_state=42)

    for params in param_grid:
        model = model_class(**params)

        # Evaluate using 5-fold CV on train+val set
        cv_scores = cross_val_score(
            model, X_train_val, y_train_val,
            cv=cv, scoring='neg_mean_squared_error'
        )
        mean_score = cv_scores.mean()
        std_score = cv_scores.std()

        print(f"Params {params}: CV Score = {-mean_score:.4f} ± {std_score:.4f}")

        if mean_score > best_score:
            best_score = mean_score
            best_params = params

    print(f"\nBest params: {best_params}")

    # Step 3: Retrain on full train+val with best params
    final_model = model_class(**best_params)
    final_model.fit(X_train_val, y_train_val)

    # Step 4: Final evaluation on held-out test set
    y_pred = final_model.predict(X_test)
    test_error = np.mean((y_test - y_pred) ** 2)

    # Also compute training error for gap analysis
    y_train_pred = final_model.predict(X_train_val)
    train_error = np.mean((y_train_val - y_train_pred) ** 2)
    gap = test_error - train_error

    print(f"\nFinal Results:")
    print(f"  Training Error: {train_error:.4f}")
    print(f"  Test Error: {test_error:.4f}")
    print(f"  Gap: {gap:.4f}")

    return final_model, test_error, gap

# Critical points:
# - The test set is touched exactly ONCE, at the very end
# - Cross-validation absorbs the model selection bias
# - The gap computed at the end is our true overfitting estimate
```

It's tempting to evaluate models by training error alone—after all, it's always available and requires no data splitting. But training error is fundamentally misleading, especially for complex models.
The Mathematical Problem
Training error is computed on the same data used for optimization. The model parameters $\theta$ are chosen to minimize:
$$\hat{\theta} = \arg\min_\theta \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i))$$
By definition, $\hat{\theta}$ makes the training error as small as possible. This means training error is optimistically biased:
$$\mathbb{E}[\text{Err}_{train}] < \mathbb{E}[\text{Err}_{test}]$$
The more flexible the model, the more it can overfit, and the more optimistic the training error becomes.
The Extreme Case: Interpolation
Consider a model complex enough to perfectly interpolate all training points (zero training error). Examples include:

- a 1-nearest-neighbor classifier or regressor (every training point is its own nearest neighbor),
- a degree-$(n-1)$ polynomial fit to $n$ points,
- an unpruned decision tree grown until every leaf is pure,
- a sufficiently large neural network.
For such models: $$\text{Err}_{train} = 0$$
But test error can be arbitrarily bad! The model has memorized the training set but learned nothing about the underlying pattern. Training error gives no indication of this failure.
Ranking models by training error would systematically select the most overfit model. Always use held-out data (validation or test) for model selection. Training error only tells you whether the model can fit the data—not whether it should.
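As a concrete illustration of the interpolation failure mode, a 1-nearest-neighbor regressor achieves exactly zero training error by construction, yet its test error reflects the noise and the model's variance. The noisy sine data below mirrors the earlier polynomial example; the sample sizes are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Noisy sine data, as in the polynomial example above (illustrative sizes)
X_train = rng.uniform(0, 1, 30).reshape(-1, 1)
y_train = np.sin(2 * np.pi * X_train).ravel() + rng.normal(0, 0.3, 30)
X_test = rng.uniform(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(0, 0.3, 200)

# 1-NN memorizes the training set: each training point is its own nearest neighbor
knn = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)

train_mse = mean_squared_error(y_train, knn.predict(X_train))  # exactly 0 for distinct points
test_mse = mean_squared_error(y_test, knn.predict(X_test))     # far from 0

print(f"Train MSE: {train_mse:.4f}")   # 0.0000 -- perfect memorization
print(f"Test MSE:  {test_mse:.4f}")    # dominated by noise + variance
```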
The training/test distinction connects to deep results in statistical learning theory that provide theoretical bounds on the generalization gap.
Empirical Risk Minimization (ERM)
Most learning algorithms minimize empirical risk (training error):
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, h(x_i))$$
The fundamental question is: when does minimizing empirical risk also minimize true risk (test error)? The answer depends on:

- the complexity (capacity) of the hypothesis class $\mathcal{H}$, and
- the number of training samples $n$.
Generalization Bounds
Statistical learning theory provides bounds of the form:
$$\text{Err}_{test}(\hat{h}) \leq \text{Err}_{train}(\hat{h}) + \text{Complexity}(\mathcal{H}, n, \delta)$$
with probability at least $1 - \delta$. The complexity term depends on measures like:

- the VC dimension of $\mathcal{H}$,
- the Rademacher complexity of $\mathcal{H}$, or
- for a finite class, simply $\ln|\mathcal{H}|$.
These bounds formalize the intuition that:

- richer hypothesis classes can overfit more, so the potential gap is larger, and
- more training data shrinks the gap, since the complexity term decreases with $n$.
The VC dimension measures how many points a hypothesis class can 'shatter' (classify in all possible ways). Linear classifiers in d dimensions have VC dimension d+1. Higher VC dimension → more capacity to overfit → larger expected gap.
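For intuition about the shape of the complexity term, the classic bound for a finite hypothesis class (Hoeffding's inequality plus a union bound, assuming a loss bounded in $[0, 1]$ such as the 0-1 loss) reads:

$$\text{Err}_{test}(h) \leq \text{Err}_{train}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2n}} \quad \text{simultaneously for all } h \in \mathcal{H},$$

with probability at least $1 - \delta$. The gap term grows only logarithmically with the size of the class and shrinks as $1/\sqrt{n}$, the same qualitative behavior summarized in the table below.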
Practical Implications
While theoretical bounds are often loose, they provide crucial guidance:
| Theoretical Insight | Practical Implication |
|---|---|
| Gap ~ $\sqrt{\text{complexity}/n}$ | Collect more data to shrink gap |
| Regularization reduces effective complexity | Ridge, Lasso, etc. improve generalization |
| Gap vanishes as $n \to \infty$ | Big data naturally reduces overfitting |
| Some hypothesis classes have bounded complexity | Linear models generalize better than deep networks (all else equal) |
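As a small worked example of the first row: if the gap scales like $\sqrt{C/n}$ for a fixed complexity term $C$, then quadrupling the dataset halves the expected gap:

$$\sqrt{\frac{C}{4n}} = \frac{1}{2}\sqrt{\frac{C}{n}}$$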
We've established the foundational distinction that underlies all of regularization theory. Let's consolidate the key insights:

- Training error is optimistically biased because the model is optimized on that very data.
- The generalization gap, $\text{Err}_{test} - \text{Err}_{train}$, is the central diagnostic for overfitting.
- Learning curves (versus data size and versus complexity) reveal whether to collect more data, increase capacity, or regularize.
- Valid estimates of the gap require disciplined protocols: a held-out test set, plus a validation set or cross-validation for model selection.
What's Next?
Now that we understand the generalization gap, we need to understand why it grows uncontrollably in certain settings. The next page explores high-dimensional settings—where the number of features approaches or exceeds the number of samples, creating the perfect conditions for catastrophic overfitting.
You now understand the critical distinction between training and test error, how to measure and interpret the generalization gap, and why proper evaluation protocols are essential. This diagnostic framework is the foundation for understanding why regularization is necessary—and how it works.