In the previous page, we established that model capacity quantifies what a model can represent—the full richness of its hypothesis class. We also noted a profound puzzle: deep neural networks with billions of parameters (and correspondingly astronomical VC dimensions) generalize remarkably well, defying all classical predictions.
The resolution to this puzzle lies in a crucial distinction: what a model can represent is not the same as what it will represent after training.
This is the concept of effective capacity. While a deep network's architecture may theoretically be capable of memorizing any dataset, the combination of the optimization algorithm and its implicit biases, the choice of initialization and the dynamics of training, the statistical structure of the data, and the constraints of the architecture all conspire to dramatically restrict which solutions the network actually finds. The effective capacity—the subset of the hypothesis class the model realistically explores—is far smaller than the theoretical capacity.
By the end of this page, you will understand: (1) why the distinction between theoretical and effective capacity matters, (2) how optimization algorithms implicitly constrain effective capacity, (3) the role of initialization and training dynamics, and (4) how to reason about what your models actually learn versus what they could learn.
Let's make the gap between theoretical and effective capacity concrete with a thought experiment.
The Memorization Experiment (Zhang et al., 2017):
In a landmark paper, researchers at MIT and Google trained deep neural networks on CIFAR-10 with various label modifications: the true labels, labels corrupted at increasing rates, and completely random labels (they also tried shuffled and fully random pixels). In every case, the networks reached near-perfect training accuracy.
The same architecture, with identical theoretical capacity, exhibited completely different behaviors depending on the data. With true labels, it generalized. With random labels, it memorized.
What does this tell us?
The network's ability to memorize (demonstrated by 100% training accuracy on random labels) didn't translate to actually memorizing when the data had learnable structure. The training process somehow 'preferred' the generalizing solution.
```python
"""Recreating the Zhang et al. Memorization Experiment

Demonstrating that the same architecture can either generalize or memorize,
depending on whether the data has learnable structure."""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import numpy as np


class SimpleCNN(nn.Module):
    """A simple CNN with substantial capacity."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, 3, padding=1)
        self.conv3 = nn.Conv2d(128, 256, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(256 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 32 -> 16
        x = self.pool(F.relu(self.conv2(x)))  # 16 -> 8
        x = self.pool(F.relu(self.conv3(x)))  # 8 -> 4
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)


def train_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for X, y in loader:
        optimizer.zero_grad()
        pred = model(X)
        loss = criterion(pred, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * X.size(0)
        correct += (pred.argmax(1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total


def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in loader:
            pred = model(X)
            correct += (pred.argmax(1) == y).sum().item()
            total += X.size(0)
    return correct / total


# Simulate a smaller version of the experiment
np.random.seed(42)
torch.manual_seed(42)

n_samples = 5000
input_shape = (3, 32, 32)
n_classes = 10

# Create synthetic "structured" data (true labels based on simple patterns)
X_data = torch.randn(n_samples, *input_shape)
# True labels: based on the sign of each channel's mean
true_labels = ((X_data[:, 0].mean(dim=(1, 2)) > 0).long() * 5
               + (X_data[:, 1].mean(dim=(1, 2)) > 0).long() * 2
               + (X_data[:, 2].mean(dim=(1, 2)) > 0).long())
true_labels = true_labels % n_classes

# Random labels: completely uncorrelated with the input
random_labels = torch.randint(0, n_classes, (n_samples,))

# Split data
train_X, test_X = X_data[:4000], X_data[4000:]
train_true, test_true = true_labels[:4000], true_labels[4000:]
train_random, test_random = random_labels[:4000], random_labels[4000:]

# Create dataloaders
true_train = DataLoader(TensorDataset(train_X, train_true), batch_size=64, shuffle=True)
true_test = DataLoader(TensorDataset(test_X, test_true), batch_size=64)
random_train = DataLoader(TensorDataset(train_X, train_random), batch_size=64, shuffle=True)
random_test = DataLoader(TensorDataset(test_X, test_random), batch_size=64)

criterion = nn.CrossEntropyLoss()
checkpoints = [1, 10, 25, 50]  # epochs at which to report accuracy

print("Memorization vs. Generalization Experiment")
print("=" * 60)

for label_kind, train_loader, test_loader in [
    ("TRUE (structured)", true_train, true_test),
    ("RANDOM", random_train, random_test),
]:
    model = SimpleCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    print(f"Training on {label_kind} labels:")
    trained = 0
    for ckpt in checkpoints:
        while trained < ckpt:  # train up to the next checkpoint
            train_epoch(model, train_loader, optimizer, criterion)
            trained += 1
        train_acc = evaluate(model, train_loader)
        test_acc = evaluate(model, test_loader)
        print(f"  Epoch {ckpt:2d}: Train acc = {train_acc:.2%}, Test acc = {test_acc:.2%}")

print("\n" + "=" * 60)
print("KEY INSIGHT: Same architecture, vastly different behavior!")
print("With structure: Generalizes. Without: Memorizes.")
print("Effective capacity adapts to the data.")
```

Theoretical capacity is about what's possible.
Effective capacity is about what's likely. The training algorithm, combined with the structure of the data, creates a strong preference for certain solutions over others. Understanding this preference is the key to understanding deep learning generalization.
One of the most profound insights in modern deep learning theory is that the optimization algorithm itself acts as a regularizer. Even without explicit regularization (L2, dropout, etc.), stochastic gradient descent (SGD) has properties that favor simpler, more generalizing solutions.
When multiple solutions achieve zero training loss, SGD doesn't find just any of them—it finds solutions with specific properties:
For Linear Regression:
When the problem is underdetermined (more parameters than data points), gradient descent initialized at θ₀ = 0 converges to the minimum norm solution:
$$\theta^* = \arg\min_\theta ||\theta||_2 \quad \text{subject to } X\theta = y$$
This is exactly what explicit L2 regularization would give in the limit λ → 0. SGD implicitly prefers simpler (smaller norm) solutions!
For Deep Networks:
The implicit bias is more complex but still present. Depending on the architecture and loss, gradient descent has been shown to prefer, for example, max-margin solutions (for logistic-type losses on separable data) and low-rank solutions (for deep matrix factorization).
```python
"""Implicit Regularization of Gradient Descent

Demonstrating that GD finds minimum-norm solutions without explicit regularization."""

import numpy as np


def gradient_descent_linear(X, y, lr=0.01, n_iters=5000):
    """Solve linear regression via gradient descent.

    Returns the trajectory of weights."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)  # Initialize at zero (crucial!)
    trajectory = [theta.copy()]
    for _ in range(n_iters):
        # Gradient of MSE: (1/n) * X.T @ (X @ theta - y)
        pred = X @ theta
        grad = X.T @ (pred - y) / len(y)
        theta = theta - lr * grad
        trajectory.append(theta.copy())
    return np.array(trajectory)


def minimum_norm_solution(X, y):
    """Compute the minimum L2-norm solution using the pseudoinverse.

    theta* = X^+ @ y = X.T @ (X @ X.T)^(-1) @ y"""
    return X.T @ np.linalg.solve(X @ X.T, y)


def explicit_ridge_solution(X, y, lambda_reg):
    """Compute the ridge regression solution.

    theta* = (X.T @ X + lambda * I)^(-1) @ X.T @ y"""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_reg * np.eye(n_features), X.T @ y)


# Create an underdetermined system (more parameters than data)
np.random.seed(42)
n_samples, n_features = 10, 100  # 10 equations, 100 unknowns

X = np.random.randn(n_samples, n_features)
true_theta = np.random.randn(n_features)
y = X @ true_theta  # Noise-free (infinitely many exact solutions exist)

# Run gradient descent
trajectory = gradient_descent_linear(X, y, lr=0.1, n_iters=10000)
theta_gd = trajectory[-1]

# Compute the theoretical minimum-norm solution
theta_min_norm = minimum_norm_solution(X, y)

# Compare norms
print("Implicit Regularization in Underdetermined Linear Regression")
print("=" * 60)
print(f"Problem: {n_samples} samples, {n_features} features")
print("(Infinitely many solutions exist with zero training error)")
print()
print("Solution found by GD:")
print(f"  L2 norm: {np.linalg.norm(theta_gd):.4f}")
print(f"  Training error: {np.mean((X @ theta_gd - y)**2):.2e}")
print()
print("Minimum-norm solution (theoretical):")
print(f"  L2 norm: {np.linalg.norm(theta_min_norm):.4f}")
print(f"  Training error: {np.mean((X @ theta_min_norm - y)**2):.2e}")
print()
print("Distance between GD solution and min-norm solution:")
print(f"  ||theta_GD - theta_min_norm|| = {np.linalg.norm(theta_gd - theta_min_norm):.2e}")
print()

# Show equivalence to ridge regression as lambda -> 0
print("Comparison with explicit Ridge regression (lambda -> 0):")
for lam in [1.0, 0.1, 0.01, 0.001]:
    theta_ridge = explicit_ridge_solution(X, y, lam)
    dist = np.linalg.norm(theta_ridge - theta_min_norm)
    print(f"  lambda = {lam}: distance to min-norm = {dist:.4f}")

print()
print("=" * 60)
print("KEY INSIGHT: Gradient descent IMPLICITLY finds the minimum-norm")
print("solution, which is equivalent to infinitesimal L2 regularization.")
print("The optimization process itself regularizes!")
```

The implicit bias toward minimum-norm solutions isn't magic—it emerges from the geometry of gradient descent:
Gradient Flow in Continuous Time:
Consider the continuous-time limit of gradient descent (gradient flow): $$\frac{d\theta}{dt} = -\nabla L(\theta)$$
For linear regression with $L(\theta) = \frac{1}{2}||X\theta - y||^2$, the gradient is: $$\nabla L(\theta) = X^T(X\theta - y)$$
Note that $\nabla L(\theta)$ always lies in the row space of $X$ (i.e., $\text{span}(x_1, ..., x_n)$). This means that, starting from $\theta_0 = 0$, the iterate never leaves the row space of $X$; among all solutions with $X\theta = y$, the unique one lying in that row space is exactly the minimum-norm solution.
This geometric argument extends, with modifications, to nonlinear networks.
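The row-space argument is easy to verify numerically. The following is a minimal sketch (with illustrative dimensions, not from the original experiment): run gradient descent from zero on an underdetermined least-squares problem and check that the solution has no null-space component.

```python
import numpy as np

np.random.seed(0)
n, p = 5, 20  # underdetermined: 5 equations, 20 unknowns
X = np.random.randn(n, p)
y = np.random.randn(n)

# Gradient descent from zero initialization
theta = np.zeros(p)
for _ in range(20000):
    theta -= 0.05 * X.T @ (X @ theta - y) / n

# Orthogonal projector onto the row space of X
P = X.T @ np.linalg.solve(X @ X.T, X)
null_component = theta - P @ theta

print(f"Training residual:    {np.linalg.norm(X @ theta - y):.2e}")
print(f"Null-space component: {np.linalg.norm(null_component):.2e}")
# Both are ~0: every gradient step lies in the row space, so the
# converged interpolant is the minimum-norm solution.
```

Because the iterate can never acquire a null-space component, the result holds regardless of learning rate or iteration count; only the zero initialization is essential.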
The implicit regularization toward minimum-norm solutions depends critically on weight initialization. Initializing at zero (or near zero) is essential. Different initializations lead to different implicit biases. This is why initialization schemes like Xavier/He are so important—they set up the optimization trajectory for good solutions.
Beyond the implicit bias of gradient descent, the stochastic nature of SGD provides additional regularization benefits.
SGD computes gradients using random mini-batches rather than the full dataset. This introduces noise into the optimization:
$$\theta_{t+1} = \theta_t - \eta \cdot (\nabla L(\theta_t) + \xi_t)$$
Where $\xi_t$ is the gradient noise from mini-batch sampling. This noise has several effects:
1. Escaping Sharp Minima:
Sharp minima (with high curvature) correspond to solutions that are sensitive to small perturbations—a hallmark of overfitting. SGD noise helps escape these sharp minima in favor of flatter regions.
2. Implicit Exploration:
The random walk component of SGD allows exploration of the loss landscape, potentially finding better (more generalizing) solutions than pure gradient descent would.
3. Regularization Proportional to Learning Rate:
Higher learning rates produce more gradient noise. There's evidence that the ratio (learning rate / batch size) controls the regularization strength.
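The batch-size dependence of the gradient noise can be checked directly. This sketch (an illustrative linear-regression setup, not from the original) measures how far mini-batch gradients deviate from the full-batch gradient; the deviation shrinks roughly like $1/\sqrt{B}$:

```python
import numpy as np

np.random.seed(0)
n, p = 10000, 5
X = np.random.randn(n, p)
y = X @ np.random.randn(p) + np.random.randn(n)
theta = np.zeros(p)  # measure gradient noise at a fixed parameter value

full_grad = X.T @ (X @ theta - y) / n

noise_by_batch = {}
for B in [8, 32, 128, 512]:
    deviations = []
    for _ in range(500):
        idx = np.random.choice(n, B, replace=False)
        g = X[idx].T @ (X[idx] @ theta - y[idx]) / B
        deviations.append(np.linalg.norm(g - full_grad))
    noise_by_batch[B] = np.mean(deviations)
    print(f"Batch size {B:4d}: mean gradient-noise norm = {noise_by_batch[B]:.3f}")
# Quadrupling the batch size roughly halves the noise (a 1/sqrt(B) scaling),
# consistent with the learning-rate/batch-size ratio discussed above.
```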
The flatness of a minimum is captured by the eigenvalues of the Hessian at that point: large eigenvalues correspond to sharp, high-curvature directions, while small eigenvalues correspond to flat directions along which the loss barely changes.
The Flatness-Generalization Hypothesis:
Hochreiter & Schmidhuber (1997) proposed that flat minima generalize better because they can be specified with fewer bits of precision: by minimum description length arguments, solutions with shorter descriptions should generalize better.
While the exact relationship between flatness and generalization remains debated (it's sensitive to reparameterization), empirical evidence strongly supports that SGD tends to find minima that generalize well.
```python
"""Flat vs. Sharp Minima and Generalization

Demonstrating the concept of loss landscape flatness."""

import numpy as np


def loss_landscape_1d(x, sharpness=1.0, offset=0.0):
    """A parametric quadratic loss that can be sharp or flat.

    Args:
        x: parameter value
        sharpness: controls curvature (higher = sharper minimum)
        offset: shifts the minimum location
    Returns:
        loss value
    """
    return sharpness * (x - offset) ** 2


def visualize_minima_flatness():
    """Compare sharp vs. flat minima and their robustness to perturbation."""
    print("Flat vs. Sharp Minima: Robustness Analysis")
    print("=" * 60)

    # All three minima sit at x = 0 with loss = 0.
    # But what happens with small perturbations?
    perturbation = 0.5
    print(f"With perturbation Δx = {perturbation}:")
    print(f"  Sharp minimum (sharpness=4):   loss increases to {loss_landscape_1d(perturbation, 4.0):.2f}")
    print(f"  Medium minimum (sharpness=1):  loss increases to {loss_landscape_1d(perturbation, 1.0):.2f}")
    print(f"  Flat minimum (sharpness=0.25): loss increases to {loss_landscape_1d(perturbation, 0.25):.2f}")

    print("Implication for generalization:")
    print("  - Training finds θ* with low training loss")
    print("  - Test data is 'perturbed' from the training distribution")
    print("  - Flat minima: small perturbation → small loss increase")
    print("  - Sharp minima: small perturbation → large loss increase")
    print("  - Flat minima are more robust to distribution shift!")

    print("Hessian interpretation:")
    print("  - Sharpness = second derivative = Hessian eigenvalue")
    print("  - Sharp: large eigenvalue → high curvature")
    print("  - Flat: small eigenvalue → low curvature")
    print("  - SGD noise helps escape sharp minima → finds flat minima")


visualize_minima_flatness()


# Demonstrate SGD escape from sharp minima
def sgd_with_noise(start, gradient_fn, lr=0.1, noise_std=0.5, n_steps=2000):
    """SGD with explicit noise injection (simulating mini-batch variance)."""
    x = start
    trajectory = [x]
    for _ in range(n_steps):
        grad = gradient_fn(x)
        noise = np.random.normal(0, noise_std)
        x = x - lr * (grad + noise)
        trajectory.append(x)
    return np.array(trajectory)


def double_well_loss(x):
    """Loss with two equally deep minima: a sharp, narrow one near x = -1
    and a flat, wide one near x = +1, plus a weak quartic term that keeps
    trajectories confined."""
    return (0.25 * (-0.87 * np.exp(-8 * (x + 1) ** 2)
                    - np.exp(-0.5 * (x - 1) ** 2))
            + 0.01 * x ** 4)


def double_well_gradient(x):
    """Gradient of the double-well loss."""
    return (3.48 * (x + 1) * np.exp(-8 * (x + 1) ** 2)
            + 0.25 * (x - 1) * np.exp(-0.5 * (x - 1) ** 2)
            + 0.04 * x ** 3)


print("\n" + "=" * 60)
print("SGD Escape from Sharp Minima Demo")
print("=" * 60)
print(f"Loss at sharp minimum (x ≈ -1): {double_well_loss(-1.0):.3f}")
print(f"Loss at flat minimum  (x ≈ +1): {double_well_loss(1.0):.3f}")

# Start at the sharp minimum
x_start = -1.0

# Run multiple SGD trajectories with different noise levels
for noise in [0.0, 0.8, 1.5]:
    finals = []
    for seed in range(10):
        np.random.seed(seed)
        traj = sgd_with_noise(x_start, double_well_gradient,
                              noise_std=noise, n_steps=2000)
        finals.append(traj[-1])
    finals = np.array(finals)
    frac_flat = np.mean(finals > 0)
    print(f"Noise std = {noise:.1f}:")
    print(f"  Final position: {finals.mean():.2f} ± {finals.std():.2f}")
    print(f"  Runs ending in the flat basin (x > 0): {frac_flat:.0%}")

print("KEY INSIGHT: Higher noise helps SGD escape sharp minima")
print("and find flatter, more generalizing regions of the loss landscape.")
```

While the flatness-generalization connection is intuitively appealing and empirically observed, it's not a complete explanation. Dinh et al. (2017) showed that flatness is not invariant to reparameterization—you can make any minimum appear flat or sharp by rescaling. The full story involves more nuanced measures like PAC-Bayes bounds that account for the parameter geometry.
One of the simplest and most effective ways to control effective capacity is early stopping—halting training before the model reaches zero training error.
Think of training not as finding a static solution, but as a dynamic process where the model's effective complexity grows over time:
Early training (low effective capacity): the model captures only coarse, dominant patterns; training and validation error fall together.
Mid training (appropriate effective capacity): the model fits the main structure of the data; validation error reaches its minimum.
Late training (high effective capacity): the model begins fitting noise and idiosyncrasies; training error keeps falling while validation error rises.
For some model classes, early stopping is mathematically equivalent to explicit regularization:
Theorem (Ali, 1994; Raskutti et al., 2014): For linear regression, gradient descent with learning rate $\eta$ stopped at iteration $T$ is equivalent to ridge regression with regularization parameter $\lambda \approx 1/(\eta T)$.
This means the two hyperparameters trade off directly: training longer corresponds to a smaller effective $\lambda$ (weaker regularization), while stopping earlier corresponds to a larger effective $\lambda$ (stronger regularization).
The 'regularization strength' of early stopping is controlled by the number of iterations—providing a continuous dial on effective capacity.
```python
"""Early Stopping as Implicit Regularization

Demonstrating the equivalence between early stopping and explicit regularization."""

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures


def generate_polynomial_data(n_samples=100, noise_std=0.3, seed=42):
    """Generate data from a cubic polynomial with noise."""
    np.random.seed(seed)
    X = np.random.uniform(-1, 1, n_samples).reshape(-1, 1)
    y_true = 0.5 + X.flatten() - 0.8 * X.flatten()**2 + 0.3 * X.flatten()**3
    y = y_true + np.random.normal(0, noise_std, n_samples)
    return X, y, y_true


def gradient_descent_with_logging(X, y, X_val, y_val, lr=0.01,
                                  max_iters=10000, log_every=100):
    """Run gradient descent, logging training/validation error periodically."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)
    train_errors, val_errors, theta_norms = [], [], []
    for i in range(max_iters):
        train_pred = X @ theta
        val_pred = X_val @ theta
        if i % log_every == 0:
            train_errors.append(np.mean((train_pred - y)**2))
            val_errors.append(np.mean((val_pred - y_val)**2))
            theta_norms.append(np.linalg.norm(theta))
        # Gradient descent step
        grad = X.T @ (train_pred - y) / len(y)
        theta = theta - lr * grad
    return train_errors, val_errors, theta_norms


def ridge_regression(X, y, lambda_reg):
    """Compute the ridge regression solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_reg * np.eye(n_features), X.T @ y)


# Generate data
X, y, y_true = generate_polynomial_data(n_samples=50)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Create polynomial features (high capacity)
poly_degree = 15
poly = PolynomialFeatures(poly_degree, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_val_poly = poly.transform(X_val)

# Normalize features for stable GD
X_mean, X_std = X_train_poly.mean(0), X_train_poly.std(0) + 1e-8
X_train_norm = (X_train_poly - X_mean) / X_std
X_val_norm = (X_val_poly - X_mean) / X_std

# Run gradient descent with logging
print("Early Stopping Analysis")
print("=" * 60)

train_errs, val_errs, norms = gradient_descent_with_logging(
    X_train_norm, y_train, X_val_norm, y_val,
    lr=0.1, max_iters=10000, log_every=50)

iterations = np.arange(0, len(train_errs) * 50, 50)

# Find the optimal stopping point
best_val_idx = np.argmin(val_errs)
best_val_iter = iterations[best_val_idx]

print(f"Polynomial degree: {poly_degree} (high capacity)")
print(f"Training samples: {len(y_train)}")
print()
print("Training Progress:")
print(f"  Iteration    0: Train MSE = {train_errs[0]:.4f}, Val MSE = {val_errs[0]:.4f}")
print(f"  Iteration {best_val_iter:4d}: Train MSE = {train_errs[best_val_idx]:.4f}, "
      f"Val MSE = {val_errs[best_val_idx]:.4f}  *** OPTIMAL ***")
print(f"  Iteration {iterations[-1]:4d}: Train MSE = {train_errs[-1]:.4f}, "
      f"Val MSE = {val_errs[-1]:.4f}")

print("\n" + "=" * 60)
print("Equivalence to Ridge Regression")
print("=" * 60)

# Compare early stopping to ridge regression:
# at the optimal iteration, which ridge lambda gives similar results?
for lam in [10.0, 1.0, 0.1, 0.01, 0.001]:
    theta_ridge = ridge_regression(X_train_norm, y_train, lam)
    ridge_val_mse = np.mean((X_val_norm @ theta_ridge - y_val)**2)
    print(f"  Ridge λ={lam:6.3f}: Val MSE = {ridge_val_mse:.4f}, "
          f"||θ|| = {np.linalg.norm(theta_ridge):.4f}")

print(f"  Early stopping at iter {best_val_iter}: Val MSE = {val_errs[best_val_idx]:.4f}, "
      f"||θ|| = {norms[best_val_idx]:.4f}")
print("KEY INSIGHT: Early stopping achieves similar regularization to")
print("explicit Ridge with an appropriate λ. The number of iterations")
print("controls effective capacity just like explicit regularization.")
```

Early stopping is one of the most effective regularization techniques in practice. It's computationally free (you have to train anyway), automatically adapts to the problem, and requires only a validation set to monitor.
Always track validation loss during training and save checkpoints at the best validation performance.
Perhaps the most important insight about effective capacity is that it depends critically on the data distribution. The same model can have vastly different effective capacities on different datasets.
Real-world data is not random—it has structure:
1. Low Intrinsic Dimensionality:
High-dimensional data (e.g., images with millions of pixels) often lies on or near a low-dimensional manifold. The 'manifold hypothesis' suggests that natural images occupy a tiny fraction of all possible pixel configurations.
2. Hierarchical Structure:
Natural data often has hierarchical organization (edges → textures → parts → objects). Networks that match this structure require less effective capacity.
3. Statistical Regularities:
Real data has consistent patterns—the same edge detection principles apply everywhere in an image. Networks can exploit this through parameter sharing (convolutions).
When data has structure, less capacity is needed to model it. The effective capacity required is proportional to the intrinsic complexity of the data, not its raw dimensionality.
Several concepts try to capture how much 'capacity' data actually requires:
Intrinsic Dimension:
The minimum number of coordinates needed to represent the data without significant loss of information. Natural images, for instance, are believed to have intrinsic dimension far below their pixel count.
Compression-Based Measures:
How well can the data be compressed? The minimum description length provides a measure: highly structured data compresses well and therefore requires less effective capacity to model.
Learning Curve Analysis:
How quickly does test error decrease with more training data?
```python
"""Data Complexity and Effective Capacity

Demonstrating how data structure affects required model capacity."""

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve


def estimate_intrinsic_dimension(X, variance_threshold=0.95):
    """Estimate intrinsic dimensionality using PCA.

    Returns the number of components explaining variance_threshold
    of the variance."""
    pca = PCA()
    pca.fit(X)
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    n_components = np.argmax(cumulative_variance >= variance_threshold) + 1
    return n_components, pca.explained_variance_ratio_


def generate_data_scenarios():
    """Generate datasets with different intrinsic complexities."""
    np.random.seed(42)
    n_samples = 1000
    ambient_dim = 100
    scenarios = {}

    # Scenario 1: data lies on a 2D plane in 100D space (low intrinsic dim)
    true_dim = 2
    low_rank_basis = np.random.randn(true_dim, ambient_dim)
    low_rank_basis /= np.linalg.norm(low_rank_basis, axis=1, keepdims=True)
    Z_low = np.random.randn(n_samples, true_dim)
    X_low = Z_low @ low_rank_basis
    X_low += np.random.randn(n_samples, ambient_dim) * 0.1  # Small noise
    scenarios['Low Intrinsic (2D in 100D)'] = X_low

    # Scenario 2: data lies on a 20D subspace in 100D space (medium)
    true_dim = 20
    medium_rank_basis = np.random.randn(true_dim, ambient_dim)
    medium_rank_basis /= np.linalg.norm(medium_rank_basis, axis=1, keepdims=True)
    Z_med = np.random.randn(n_samples, true_dim)
    X_medium = Z_med @ medium_rank_basis
    X_medium += np.random.randn(n_samples, ambient_dim) * 0.1
    scenarios['Medium Intrinsic (20D in 100D)'] = X_medium

    # Scenario 3: random data in 100D (intrinsic dim = ambient dim)
    scenarios['High Intrinsic (Full 100D)'] = np.random.randn(n_samples, ambient_dim)

    return scenarios


print("Data Complexity Analysis")
print("=" * 70)

scenarios = generate_data_scenarios()

for name, X in scenarios.items():
    intrinsic_dim, variance_ratios = estimate_intrinsic_dimension(X)
    # Also measure how many components are needed for 80% and 99% variance
    cumvar = np.cumsum(variance_ratios)
    dim_80 = np.argmax(cumvar >= 0.80) + 1
    dim_99 = np.argmax(cumvar >= 0.99) + 1
    print(f"{name}")
    print(f"  Ambient dimension: {X.shape[1]}")
    print(f"  Dims for 80% variance: {dim_80}")
    print(f"  Dims for 95% variance: {intrinsic_dim}")
    print(f"  Dims for 99% variance: {dim_99}")
    print(f"  Compression ratio (95%): {X.shape[1] / intrinsic_dim:.1f}x")

print("\n" + "=" * 70)
print("Implications for Model Capacity")
print("=" * 70)
print("""\
- Low intrinsic dimension data can be modeled with lower effective capacity
- High intrinsic dimension requires more capacity (or more data)
- Real data (images, text) typically has much lower intrinsic dim than ambient
- This explains why massive neural nets can generalize on natural data:
  the effective problem complexity is much lower than it appears!""")

# Learning curve comparison
print("Learning Curves: How Data Complexity Affects Sample Efficiency")
print("-" * 70)

# Create classification tasks with different complexities
for name, X in scenarios.items():
    # Labels based on the top principal component (a simple boundary)
    pca = PCA(n_components=min(5, X.shape[1]))
    X_pca = pca.fit_transform(X)
    y = (X_pca[:, 0] > 0).astype(int)

    # Compute the learning curve
    train_sizes = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
    train_sizes_abs, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=train_sizes, cv=3, scoring='accuracy'
    )
    print(f"{name}:")
    print(f"  Samples: {train_sizes_abs}")
    print(f"  Val Acc: {np.mean(val_scores, axis=1).round(3)}")
```

The manifold hypothesis states that high-dimensional data (like images) lies on a low-dimensional manifold embedded in the high-dimensional space.
If true, this fundamentally changes the generalization picture: the effective complexity of the learning problem is determined by the manifold dimension, not the ambient dimension. A model needs only enough capacity to represent functions on this manifold.
Given that effective capacity is what matters for generalization, how can we measure or estimate it in practice?
A simple diagnostic: How well can your model fit random labels?
Procedure: replace your training labels with uniformly random ones, train with your normal recipe, and record the final training accuracy.
Interpretation: if the model reaches near-100% training accuracy on random labels, it has enough effective capacity to memorize your dataset; whatever generalization you observe with the real labels must then come from the data's structure and the training process, not from limited capacity.
The norm of the weights provides a proxy for effective capacity: small-norm solutions implement smoother, simpler functions, while large-norm solutions can implement sharper, more complex ones.
Tracking weight norms during training reveals how capacity evolves:
Epoch 1: ||W|| = 0.5 → Low capacity, captures simple patterns
Epoch 50: ||W|| = 5.0 → Medium capacity, captures main structure
Epoch 500: ||W|| = 50.0 → High capacity, may be memorizing
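The epoch numbers above are illustrative; the underlying trend is easy to reproduce. This minimal sketch (an overparameterized linear model, not a neural network) tracks $||\theta||$ while gradient descent fits the data from zero initialization; the norm grows monotonically as training proceeds:

```python
import numpy as np

np.random.seed(1)
n, p = 20, 100  # overparameterized: 20 samples, 100 features
X = np.random.randn(n, p)
y = np.random.randn(n)

theta = np.zeros(p)
lr = 0.01
norms = []
for step in range(20001):
    if step % 4000 == 0:
        norms.append(np.linalg.norm(theta))
        print(f"Step {step:5d}: ||theta|| = {norms[-1]:.3f}, "
              f"train MSE = {np.mean((X @ theta - y) ** 2):.2e}")
    theta -= lr * X.T @ (X @ theta - y) / n
# ||theta|| rises monotonically from 0 during training: more iterations
# mean a larger-norm solution, i.e., higher effective capacity.
```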
For generalized linear models, effective degrees of freedom (EDF) provides a rigorous measure:
$$\text{EDF} = \text{trace}(H)$$
where $H$ is the hat matrix ($\hat{y} = Hy$). EDF measures how many 'effective parameters' the model uses, accounting for regularization.
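For ridge regression the hat matrix has a closed form, $H = X(X^TX + \lambda I)^{-1}X^T$, so the EDF can be computed directly. A minimal sketch with illustrative data:

```python
import numpy as np

np.random.seed(0)
n, p = 50, 10
X = np.random.randn(n, p)

edfs = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    # Hat matrix for ridge: H = X (X^T X + lam I)^{-1} X^T
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    edfs.append(np.trace(H))
    print(f"lambda = {lam:6.1f}: effective degrees of freedom = {edfs[-1]:.2f}")
# lambda = 0 recovers EDF = p = 10 (ordinary least squares); increasing
# lambda smoothly shrinks the number of effective parameters toward 0.
```

The EDF thus provides a continuous dial between the full parameter count and zero, mirroring how regularization strength controls effective capacity.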
Information-theoretic approaches measure effective capacity through description length: the number of bits needed to encode the trained model, or the KL divergence between a prior over parameters and the learned posterior (the quantity that appears in PAC-Bayes bounds).
| Method | What It Measures | Advantages | Limitations |
|---|---|---|---|
| Random Label Test | Memorization ability | Simple, intuitive | Binary (can/can't memorize) |
| Weight Norm | Magnitude of solution | Easy to compute, continuous | Not invariant to scaling |
| Effective DoF | Trace of influence matrix | Theoretically grounded | Hard to compute for DNNs |
| Compression | Bits to describe model | Information-theoretic basis | Depends on encoding |
| PAC-Bayes Bound | KL from prior to posterior | Non-vacuous for DNNs | Requires choosing prior |
For day-to-day deep learning work, the most practical capacity diagnostics are: (1) Track the gap between training and validation loss—a growing gap signals excessive effective capacity. (2) Monitor weight norms—explosive growth often precedes overfitting. (3) Use the random label sanity check during development to understand your model's memorization ability.
Different architectures have different effective capacity characteristics, even with similar parameter counts.
Parameter Sharing → Reduced Effective Capacity:
Convolutions use the same weights at every spatial location. This reduces the parameter count dramatically, enforces translation equivariance, and rules out functions that treat different image locations in arbitrarily different ways.
This is a form of capacity control through architecture—constraining what can be learned to match known data structure.
Skip Connections → Modulated Effective Capacity:
Residual connections $y = F(x) + x$ affect effective capacity in subtle ways: each block can represent the identity mapping trivially, so added depth increases complexity only when the residual branches are actually used, and networks tend to learn small refinements on top of the identity.
Dynamic Computation → Data-Adaptive Capacity:
Self-attention computes its mixing weights dynamically from the input: which positions attend to which is decided per example rather than fixed in the parameters.
This makes transformer effective capacity particularly data-dependent.
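A minimal numpy sketch (illustrative shapes and random weights, not any particular transformer) makes this concrete: with the parameters held fixed, the attention pattern still changes when the input changes, unlike a fixed linear layer.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention; mixing weights are computed from X itself."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Row-wise softmax over tokens
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V, w

np.random.seed(0)
n_tokens, d = 4, 8
Wq, Wk, Wv = [np.random.randn(d, d) / np.sqrt(d) for _ in range(3)]

# Same parameters, two different inputs
_, weights_a = self_attention(np.random.randn(n_tokens, d), Wq, Wk, Wv)
_, weights_b = self_attention(np.random.randn(n_tokens, d), Wq, Wk, Wv)

print("Attention pattern, input A:\n", weights_a.round(2))
print("Attention pattern, input B:\n", weights_b.round(2))
# The effective 'connectivity' between tokens differs per input:
# the model allocates its capacity dynamically based on the data.
```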
BatchNorm/LayerNorm → Capacity Regularization:
Normalization layers constrain the space of possible functions, for example by making the output invariant to certain rescalings and shifts of the activations.
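As one concrete example of such a constraint, here is a small sketch using PyTorch's `nn.LayerNorm` at its default initialization (scale 1, bias 0): layer normalization makes the output invariant to shifting and positively rescaling its input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(16)  # default init: weight = 1, bias = 0
x = torch.randn(4, 16)

out1 = ln(x)
out2 = ln(3.0 * x + 5.0)  # rescale and shift every activation

print(torch.allclose(out1, out2, atol=1e-4))  # True: the affine change is normalized away
# Whole families of inputs that differ only by such rescalings map to the
# same output: an architectural constraint on the functions the network
# can express, and hence on its effective capacity.
```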
```python
"""Architecture Effects on Effective Capacity

Comparing how different architectures constrain what models actually learn."""

import torch
import torch.nn as nn
import numpy as np


def count_parameters(model):
    """Count trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def measure_weight_statistics(model):
    """Compute statistics about weight magnitudes."""
    total_norm = 0.0
    max_weight = 0.0
    weights = []
    for name, param in model.named_parameters():
        if 'weight' in name:
            w = param.data.cpu().numpy().flatten()
            weights.extend(w.tolist())
            total_norm += np.sum(w**2)
            max_weight = max(max_weight, np.max(np.abs(w)))
    weights = np.array(weights)
    return {
        'total_l2_norm': np.sqrt(total_norm),
        'max_weight': max_weight,
        'mean_abs': np.mean(np.abs(weights)),
        'sparsity': np.mean(np.abs(weights) < 0.01),  # Near-zero weights
    }


# Compare MLP vs CNN on image-like data
input_shape = (3, 32, 32)
n_classes = 10


# Architecture 1: Fully Connected MLP
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_classes)
        )

    def forward(self, x):
        return self.layers(self.flatten(x))


# Architecture 2: Convolutional Network (similar depth, far fewer params)
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8x8
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)


# Architecture 3: ResNet-style with skip connections
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        out = out + identity  # Skip connection
        return self.relu(out)


class ResNetStyle(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.block1 = ResBlock(64)
        self.pool = nn.MaxPool2d(2)  # 16x16
        self.block2 = ResBlock(64)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.block1(x)
        x = self.pool(x)
        x = self.block2(x)
        x = self.gap(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)


# Initialize models
mlp, cnn, resnet = MLP(), CNN(), ResNetStyle()

print("Architecture Comparison: Capacity Characteristics")
print("=" * 70)

for name, model in [("MLP (Fully Connected)", mlp),
                    ("CNN (Convolutional)", cnn),
                    ("ResNet-style", resnet)]:
    params = count_parameters(model)
    stats = measure_weight_statistics(model)
    print(f"{name}")
    print(f"  Parameters: {params:,}")
    print(f"  Initial weight L2 norm: {stats['total_l2_norm']:.2f}")
    # Architecture-specific capacity notes
    if "MLP" in name:
        print("  Capacity type: Global (every input connects to every hidden unit)")
        print("  Inductive bias: None (most flexible, least constrained)")
    elif "CNN" in name:
        print("  Capacity type: Local (weight sharing reduces effective parameters)")
        print("  Inductive bias: Translation equivariance (good for images)")
    else:
        print("  Capacity type: Residual (can learn identity, gradual complexity)")
        print("  Inductive bias: Skip connections enable stable deep networks")

print("\n" + "=" * 70)
print("KEY INSIGHT: Parameter count ≠ Effective Capacity")
print("-" * 70)
print("""\
- MLP has the most parameters but may need more data to generalize on images
- CNN has fewer parameters but exploits image structure → lower effective
  capacity needed for good generalization on images
- ResNet can go deeper without gradient issues → can modulate capacity
  gradually during training

The right architecture matches its inductive bias to the data structure,
reducing the effective capacity needed for good generalization.""")
```

The choice of architecture is itself a form of regularization. When you choose a CNN for images, you're encoding assumptions about translation invariance and local structure. These assumptions reduce effective capacity by ruling out functions that violate them—but if the assumptions match the data, this is beneficial for generalization.
We've explored the crucial concept of effective capacity—the distinction between what a model can theoretically represent and what it actually learns in practice. This concept is fundamental to understanding deep learning.
In the next page, we'll explore the double descent phenomenon—a remarkable discovery that shows the bias-variance tradeoff has a second 'descent' phase where more capacity actually improves generalization. This fundamentally revises our understanding of the capacity-generalization relationship.
You now understand the concept of effective capacity and why it resolves the puzzle of deep learning generalization. The combination of optimization dynamics, data structure, and architectural choices constrains what models actually learn—explaining why massive networks can generalize despite astronomical theoretical capacity.