In the previous page, we established that model capacity quantifies what a model can represent—the full richness of its hypothesis class. We also noted a profound puzzle: deep neural networks with billions of parameters (and correspondingly astronomical VC dimensions) generalize remarkably well, defying all classical predictions.
The resolution to this puzzle lies in a crucial distinction: what a model can represent is not the same as what it will represent after training.
This is the concept of effective capacity. While a deep network's architecture may theoretically be capable of memorizing any dataset, the combination of the optimization algorithm and its implicit biases, the choice of initialization and the dynamics of training, the statistical structure of the data, and the constraints of the architecture all conspire to dramatically restrict which solutions the network actually finds. The effective capacity—the subset of the hypothesis class the model realistically explores—is far smaller than the theoretical capacity.
By the end of this page, you will understand: (1) why the distinction between theoretical and effective capacity matters, (2) how optimization algorithms implicitly constrain effective capacity, (3) the role of initialization and training dynamics, and (4) how to reason about what your models actually learn versus what they could learn.
Let's make the gap between theoretical and effective capacity concrete with a thought experiment.
The Memorization Experiment (Zhang et al., 2017):
In a landmark paper, researchers at MIT and Google trained deep neural networks on CIFAR-10 with various label modifications: the true labels, labels corrupted at increasing rates, and completely random labels (they also tried shuffled and fully random pixels). In every case, the networks reached near-perfect training accuracy.
The same architecture, with identical theoretical capacity, exhibited completely different behaviors depending on the data. With true labels, it generalized. With random labels, it memorized.
What does this tell us?
The network's ability to memorize (demonstrated by 100% training accuracy on random labels) didn't translate to actually memorizing when the data had learnable structure. The training process somehow 'preferred' the generalizing solution.
```python
"""Recreating the Zhang et al. Memorization Experiment

Demonstrating that the same architecture can either generalize or memorize,
depending on whether the data has learnable structure."""

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import numpy as np


class SimpleCNN(nn.Module):
    """A simple CNN with substantial capacity."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, 3, padding=1)
        self.conv3 = nn.Conv2d(128, 256, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(256 * 4 * 4, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # 32 -> 16
        x = self.pool(F.relu(self.conv2(x)))  # 16 -> 8
        x = self.pool(F.relu(self.conv3(x)))  # 8 -> 4
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)


def train_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss, correct, total = 0, 0, 0
    for X, y in loader:
        optimizer.zero_grad()
        pred = model(X)
        loss = criterion(pred, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * X.size(0)
        correct += (pred.argmax(1) == y).sum().item()
        total += X.size(0)
    return total_loss / total, correct / total


def evaluate(model, loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in loader:
            pred = model(X)
            correct += (pred.argmax(1) == y).sum().item()
            total += X.size(0)
    return correct / total


# Simulate a smaller version of the experiment
np.random.seed(42)
torch.manual_seed(42)

n_samples = 5000
input_shape = (3, 32, 32)
n_classes = 10

# Create synthetic "structured" data (true labels based on simple patterns)
X_data = torch.randn(n_samples, *input_shape)
# True labels: based on the sign of each channel's mean
true_labels = ((X_data[:, 0].mean(dim=(1, 2)) > 0).long() * 5
               + (X_data[:, 1].mean(dim=(1, 2)) > 0).long() * 2
               + (X_data[:, 2].mean(dim=(1, 2)) > 0).long())
true_labels = true_labels % n_classes

# Random labels: completely uncorrelated with the input
random_labels = torch.randint(0, n_classes, (n_samples,))

# Split data
train_X, test_X = X_data[:4000], X_data[4000:]
train_true, test_true = true_labels[:4000], true_labels[4000:]
train_random, test_random = random_labels[:4000], random_labels[4000:]

# Create dataloaders
true_train = DataLoader(TensorDataset(train_X, train_true), batch_size=64, shuffle=True)
true_test = DataLoader(TensorDataset(test_X, test_true), batch_size=64)
random_train = DataLoader(TensorDataset(train_X, train_random), batch_size=64, shuffle=True)
random_test = DataLoader(TensorDataset(test_X, test_random), batch_size=64)

criterion = nn.CrossEntropyLoss()
checkpoints = [1, 10, 25, 50]  # epochs at which to report accuracy

print("Memorization vs. Generalization Experiment")
print("=" * 60)

for label_kind, train_loader, test_loader in [
    ("TRUE (structured)", true_train, true_test),
    ("RANDOM", random_train, random_test),
]:
    model = SimpleCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    print(f"Training on {label_kind} labels:")
    trained = 0
    for ckpt in checkpoints:
        while trained < ckpt:  # train up to the next checkpoint
            train_epoch(model, train_loader, optimizer, criterion)
            trained += 1
        train_acc = evaluate(model, train_loader)
        test_acc = evaluate(model, test_loader)
        print(f"  Epoch {ckpt:2d}: Train acc = {train_acc:.2%}, Test acc = {test_acc:.2%}")

print("\n" + "=" * 60)
print("KEY INSIGHT: Same architecture, vastly different behavior!")
print("With structure: Generalizes. Without: Memorizes.")
print("Effective capacity adapts to the data.")
```

Theoretical capacity is about what's possible.
Effective capacity is about what's likely. The training algorithm, combined with the structure of the data, creates a strong preference for certain solutions over others. Understanding this preference is the key to understanding deep learning generalization.
One of the most profound insights in modern deep learning theory is that the optimization algorithm itself acts as a regularizer. Even without explicit regularization (L2, dropout, etc.), stochastic gradient descent (SGD) has properties that favor simpler, more generalizing solutions.
When multiple solutions achieve zero training loss, SGD doesn't find just any of them—it finds solutions with specific properties:
For Linear Regression:
When the problem is underdetermined (more parameters than data points), gradient descent initialized at θ₀ = 0 converges to the minimum norm solution:
$$\theta^* = \arg\min_\theta ||\theta||_2 \quad \text{subject to } X\theta = y$$
This is exactly what explicit L2 regularization would give in the limit λ → 0. SGD implicitly prefers simpler (smaller norm) solutions!
For Deep Networks:
The implicit bias is more complex but still present. Depending on the architecture and loss, gradient descent has been shown to prefer, for example, max-margin solutions (for logistic-type losses on separable data) and low-rank solutions (for deep matrix factorization).
```python
"""Implicit Regularization of Gradient Descent

Demonstrating that GD finds minimum-norm solutions without explicit regularization."""

import numpy as np


def gradient_descent_linear(X, y, lr=0.01, n_iters=5000):
    """Solve linear regression via gradient descent.

    Returns the trajectory of weights."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)  # Initialize at zero (crucial!)
    trajectory = [theta.copy()]
    for _ in range(n_iters):
        # Gradient of MSE: (1/n) * X.T @ (X @ theta - y)
        pred = X @ theta
        grad = X.T @ (pred - y) / len(y)
        theta = theta - lr * grad
        trajectory.append(theta.copy())
    return np.array(trajectory)


def minimum_norm_solution(X, y):
    """Compute the minimum L2-norm solution using the pseudoinverse.

    theta* = X^+ @ y = X.T @ (X @ X.T)^(-1) @ y"""
    return X.T @ np.linalg.solve(X @ X.T, y)


def explicit_ridge_solution(X, y, lambda_reg):
    """Compute the ridge regression solution.

    theta* = (X.T @ X + lambda * I)^(-1) @ X.T @ y"""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_reg * np.eye(n_features), X.T @ y)


# Create an underdetermined system (more parameters than data)
np.random.seed(42)
n_samples, n_features = 10, 100  # 10 equations, 100 unknowns

X = np.random.randn(n_samples, n_features)
true_theta = np.random.randn(n_features)
y = X @ true_theta  # Noise-free (infinitely many exact solutions exist)

# Run gradient descent
trajectory = gradient_descent_linear(X, y, lr=0.1, n_iters=10000)
theta_gd = trajectory[-1]

# Compute the theoretical minimum-norm solution
theta_min_norm = minimum_norm_solution(X, y)

# Compare norms
print("Implicit Regularization in Underdetermined Linear Regression")
print("=" * 60)
print(f"Problem: {n_samples} samples, {n_features} features")
print("(Infinitely many solutions exist with zero training error)")
print()
print("Solution found by GD:")
print(f"  L2 norm: {np.linalg.norm(theta_gd):.4f}")
print(f"  Training error: {np.mean((X @ theta_gd - y)**2):.2e}")
print()
print("Minimum-norm solution (theoretical):")
print(f"  L2 norm: {np.linalg.norm(theta_min_norm):.4f}")
print(f"  Training error: {np.mean((X @ theta_min_norm - y)**2):.2e}")
print()
print("Distance between GD solution and min-norm solution:")
print(f"  ||theta_GD - theta_min_norm|| = {np.linalg.norm(theta_gd - theta_min_norm):.2e}")
print()

# Show equivalence to ridge regression as lambda -> 0
print("Comparison with explicit Ridge regression (lambda -> 0):")
for lam in [1.0, 0.1, 0.01, 0.001]:
    theta_ridge = explicit_ridge_solution(X, y, lam)
    dist = np.linalg.norm(theta_ridge - theta_min_norm)
    print(f"  lambda = {lam}: distance to min-norm = {dist:.4f}")

print()
print("=" * 60)
print("KEY INSIGHT: Gradient descent IMPLICITLY finds the minimum-norm")
print("solution, which is equivalent to infinitesimal L2 regularization.")
print("The optimization process itself regularizes!")
```

The implicit bias toward minimum-norm solutions isn't magic—it emerges from the geometry of gradient descent:
Gradient Flow in Continuous Time:
Consider the continuous-time limit of gradient descent (gradient flow): $$\frac{d\theta}{dt} = -\nabla L(\theta)$$
For linear regression with $L(\theta) = \frac{1}{2}||X\theta - y||^2$, the gradient is: $$\nabla L(\theta) = X^T(X\theta - y)$$
Note that $\nabla L(\theta)$ always lies in the row space of $X$ (i.e., $\text{span}(x_1, ..., x_n)$). This means that, starting from $\theta_0 = 0$, the iterate never leaves the row space of $X$; among all solutions with $X\theta = y$, the unique one lying in that row space is exactly the minimum-norm solution.
This geometric argument extends, with modifications, to nonlinear networks.
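The row-space argument is easy to verify numerically. The following is a minimal sketch (with illustrative dimensions, not from the original experiment): run gradient descent from zero on an underdetermined least-squares problem and check that the solution has no null-space component.

```python
import numpy as np

np.random.seed(0)
n, p = 5, 20  # underdetermined: 5 equations, 20 unknowns
X = np.random.randn(n, p)
y = np.random.randn(n)

# Gradient descent from zero initialization
theta = np.zeros(p)
for _ in range(20000):
    theta -= 0.05 * X.T @ (X @ theta - y) / n

# Orthogonal projector onto the row space of X
P = X.T @ np.linalg.solve(X @ X.T, X)
null_component = theta - P @ theta

print(f"Training residual:    {np.linalg.norm(X @ theta - y):.2e}")
print(f"Null-space component: {np.linalg.norm(null_component):.2e}")
# Both are ~0: every gradient step lies in the row space, so the
# converged interpolant is the minimum-norm solution.
```

Because the iterate can never acquire a null-space component, the result holds regardless of learning rate or iteration count; only the zero initialization is essential.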
The implicit regularization toward minimum-norm solutions depends critically on weight initialization. Initializing at zero (or near zero) is essential. Different initializations lead to different implicit biases. This is why initialization schemes like Xavier/He are so important—they set up the optimization trajectory for good solutions.
Beyond the implicit bias of gradient descent, the stochastic nature of SGD provides additional regularization benefits.
SGD computes gradients using random mini-batches rather than the full dataset. This introduces noise into the optimization:
$$\theta_{t+1} = \theta_t - \eta \cdot (\nabla L(\theta_t) + \xi_t)$$
Where $\xi_t$ is the gradient noise from mini-batch sampling. This noise has several effects:
1. Escaping Sharp Minima:
Sharp minima (with high curvature) correspond to solutions that are sensitive to small perturbations—a hallmark of overfitting. SGD noise helps escape these sharp minima in favor of flatter regions.
2. Implicit Exploration:
The random walk component of SGD allows exploration of the loss landscape, potentially finding better (more generalizing) solutions than pure gradient descent would.
3. Regularization Proportional to Learning Rate:
Higher learning rates produce more gradient noise. There's evidence that the ratio (learning rate / batch size) controls the regularization strength.
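The batch-size dependence of the gradient noise can be checked directly. This sketch (an illustrative linear-regression setup, not from the original) measures how far mini-batch gradients deviate from the full-batch gradient; the deviation shrinks roughly like $1/\sqrt{B}$:

```python
import numpy as np

np.random.seed(0)
n, p = 10000, 5
X = np.random.randn(n, p)
y = X @ np.random.randn(p) + np.random.randn(n)
theta = np.zeros(p)  # measure gradient noise at a fixed parameter value

full_grad = X.T @ (X @ theta - y) / n

noise_by_batch = {}
for B in [8, 32, 128, 512]:
    deviations = []
    for _ in range(500):
        idx = np.random.choice(n, B, replace=False)
        g = X[idx].T @ (X[idx] @ theta - y[idx]) / B
        deviations.append(np.linalg.norm(g - full_grad))
    noise_by_batch[B] = np.mean(deviations)
    print(f"Batch size {B:4d}: mean gradient-noise norm = {noise_by_batch[B]:.3f}")
# Quadrupling the batch size roughly halves the noise (a 1/sqrt(B) scaling),
# consistent with the learning-rate/batch-size ratio discussed above.
```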
The flatness of a minimum is captured by the eigenvalues of the Hessian at that point: large eigenvalues correspond to sharp, high-curvature directions, while small eigenvalues correspond to flat directions along which the loss barely changes.
The Flatness-Generalization Hypothesis:
Hochreiter & Schmidhuber (1997) proposed that flat minima generalize better because they can be specified with fewer bits of precision: by minimum description length arguments, solutions with shorter descriptions should generalize better.
While the exact relationship between flatness and generalization remains debated (it's sensitive to reparameterization), empirical evidence strongly supports that SGD tends to find minima that generalize well.
```python
"""Flat vs. Sharp Minima and Generalization

Demonstrating the concept of loss landscape flatness."""

import numpy as np


def loss_landscape_1d(x, sharpness=1.0, offset=0.0):
    """A parametric quadratic loss that can be sharp or flat.

    Args:
        x: parameter value
        sharpness: controls curvature (higher = sharper minimum)
        offset: shifts the minimum location
    Returns:
        loss value
    """
    return sharpness * (x - offset) ** 2


def visualize_minima_flatness():
    """Compare sharp vs. flat minima and their robustness to perturbation."""
    print("Flat vs. Sharp Minima: Robustness Analysis")
    print("=" * 60)

    # All three minima sit at x = 0 with loss = 0.
    # But what happens with small perturbations?
    perturbation = 0.5
    print(f"With perturbation Δx = {perturbation}:")
    print(f"  Sharp minimum (sharpness=4):   loss increases to {loss_landscape_1d(perturbation, 4.0):.2f}")
    print(f"  Medium minimum (sharpness=1):  loss increases to {loss_landscape_1d(perturbation, 1.0):.2f}")
    print(f"  Flat minimum (sharpness=0.25): loss increases to {loss_landscape_1d(perturbation, 0.25):.2f}")

    print("Implication for generalization:")
    print("  - Training finds θ* with low training loss")
    print("  - Test data is 'perturbed' from the training distribution")
    print("  - Flat minima: small perturbation → small loss increase")
    print("  - Sharp minima: small perturbation → large loss increase")
    print("  - Flat minima are more robust to distribution shift!")

    print("Hessian interpretation:")
    print("  - Sharpness = second derivative = Hessian eigenvalue")
    print("  - Sharp: large eigenvalue → high curvature")
    print("  - Flat: small eigenvalue → low curvature")
    print("  - SGD noise helps escape sharp minima → finds flat minima")


visualize_minima_flatness()


# Demonstrate SGD escape from sharp minima
def sgd_with_noise(start, gradient_fn, lr=0.1, noise_std=0.5, n_steps=2000):
    """SGD with explicit noise injection (simulating mini-batch variance)."""
    x = start
    trajectory = [x]
    for _ in range(n_steps):
        grad = gradient_fn(x)
        noise = np.random.normal(0, noise_std)
        x = x - lr * (grad + noise)
        trajectory.append(x)
    return np.array(trajectory)


def double_well_loss(x):
    """Loss with two equally deep minima: a sharp, narrow one near x = -1
    and a flat, wide one near x = +1, plus a weak quartic term that keeps
    trajectories confined."""
    return (0.25 * (-0.87 * np.exp(-8 * (x + 1) ** 2)
                    - np.exp(-0.5 * (x - 1) ** 2))
            + 0.01 * x ** 4)


def double_well_gradient(x):
    """Gradient of the double-well loss."""
    return (3.48 * (x + 1) * np.exp(-8 * (x + 1) ** 2)
            + 0.25 * (x - 1) * np.exp(-0.5 * (x - 1) ** 2)
            + 0.04 * x ** 3)


print("\n" + "=" * 60)
print("SGD Escape from Sharp Minima Demo")
print("=" * 60)
print(f"Loss at sharp minimum (x ≈ -1): {double_well_loss(-1.0):.3f}")
print(f"Loss at flat minimum  (x ≈ +1): {double_well_loss(1.0):.3f}")

# Start at the sharp minimum
x_start = -1.0

# Run multiple SGD trajectories with different noise levels
for noise in [0.0, 0.8, 1.5]:
    finals = []
    for seed in range(10):
        np.random.seed(seed)
        traj = sgd_with_noise(x_start, double_well_gradient,
                              noise_std=noise, n_steps=2000)
        finals.append(traj[-1])
    finals = np.array(finals)
    frac_flat = np.mean(finals > 0)
    print(f"Noise std = {noise:.1f}:")
    print(f"  Final position: {finals.mean():.2f} ± {finals.std():.2f}")
    print(f"  Runs ending in the flat basin (x > 0): {frac_flat:.0%}")

print("KEY INSIGHT: Higher noise helps SGD escape sharp minima")
print("and find flatter, more generalizing regions of the loss landscape.")
```

While the flatness-generalization connection is intuitively appealing and empirically observed, it's not a complete explanation. Dinh et al. (2017) showed that flatness is not invariant to reparameterization—you can make any minimum appear flat or sharp by rescaling. The full story involves more nuanced measures like PAC-Bayes bounds that account for the parameter geometry.
One of the simplest and most effective ways to control effective capacity is early stopping—halting training before the model reaches zero training error.
Think of training not as finding a static solution, but as a dynamic process where the model's effective complexity grows over time:
Early training (low effective capacity): the model captures only coarse, dominant patterns; training and validation error fall together.
Mid training (appropriate effective capacity): the model fits the main structure of the data; validation error reaches its minimum.
Late training (high effective capacity): the model begins fitting noise and idiosyncrasies; training error keeps falling while validation error rises.
For some model classes, early stopping is mathematically equivalent to explicit regularization:
Theorem (Ali, 1994; Raskutti et al., 2014): For linear regression, gradient descent with learning rate $\eta$ stopped at iteration $T$ is equivalent to ridge regression with regularization parameter $\lambda \approx 1/(\eta T)$.
This means the two hyperparameters trade off directly: training longer corresponds to a smaller effective $\lambda$ (weaker regularization), while stopping earlier corresponds to a larger effective $\lambda$ (stronger regularization).
The 'regularization strength' of early stopping is controlled by the number of iterations—providing a continuous dial on effective capacity.
```python
"""Early Stopping as Implicit Regularization

Demonstrating the equivalence between early stopping and explicit regularization."""

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures


def generate_polynomial_data(n_samples=100, noise_std=0.3, seed=42):
    """Generate data from a cubic polynomial with noise."""
    np.random.seed(seed)
    X = np.random.uniform(-1, 1, n_samples).reshape(-1, 1)
    y_true = 0.5 + X.flatten() - 0.8 * X.flatten()**2 + 0.3 * X.flatten()**3
    y = y_true + np.random.normal(0, noise_std, n_samples)
    return X, y, y_true


def gradient_descent_with_logging(X, y, X_val, y_val, lr=0.01,
                                  max_iters=10000, log_every=100):
    """Run gradient descent, logging training/validation error periodically."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)
    train_errors, val_errors, theta_norms = [], [], []
    for i in range(max_iters):
        train_pred = X @ theta
        val_pred = X_val @ theta
        if i % log_every == 0:
            train_errors.append(np.mean((train_pred - y)**2))
            val_errors.append(np.mean((val_pred - y_val)**2))
            theta_norms.append(np.linalg.norm(theta))
        # Gradient descent step
        grad = X.T @ (train_pred - y) / len(y)
        theta = theta - lr * grad
    return train_errors, val_errors, theta_norms


def ridge_regression(X, y, lambda_reg):
    """Compute the ridge regression solution."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_reg * np.eye(n_features), X.T @ y)


# Generate data
X, y, y_true = generate_polynomial_data(n_samples=50)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Create polynomial features (high capacity)
poly_degree = 15
poly = PolynomialFeatures(poly_degree, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_val_poly = poly.transform(X_val)

# Normalize features for stable GD
X_mean, X_std = X_train_poly.mean(0), X_train_poly.std(0) + 1e-8
X_train_norm = (X_train_poly - X_mean) / X_std
X_val_norm = (X_val_poly - X_mean) / X_std

# Run gradient descent with logging
print("Early Stopping Analysis")
print("=" * 60)

train_errs, val_errs, norms = gradient_descent_with_logging(
    X_train_norm, y_train, X_val_norm, y_val,
    lr=0.1, max_iters=10000, log_every=50)

iterations = np.arange(0, len(train_errs) * 50, 50)

# Find the optimal stopping point
best_val_idx = np.argmin(val_errs)
best_val_iter = iterations[best_val_idx]

print(f"Polynomial degree: {poly_degree} (high capacity)")
print(f"Training samples: {len(y_train)}")
print()
print("Training Progress:")
print(f"  Iteration    0: Train MSE = {train_errs[0]:.4f}, Val MSE = {val_errs[0]:.4f}")
print(f"  Iteration {best_val_iter:4d}: Train MSE = {train_errs[best_val_idx]:.4f}, "
      f"Val MSE = {val_errs[best_val_idx]:.4f}  *** OPTIMAL ***")
print(f"  Iteration {iterations[-1]:4d}: Train MSE = {train_errs[-1]:.4f}, "
      f"Val MSE = {val_errs[-1]:.4f}")

print("\n" + "=" * 60)
print("Equivalence to Ridge Regression")
print("=" * 60)

# Compare early stopping to ridge regression:
# at the optimal iteration, which ridge lambda gives similar results?
for lam in [10.0, 1.0, 0.1, 0.01, 0.001]:
    theta_ridge = ridge_regression(X_train_norm, y_train, lam)
    ridge_val_mse = np.mean((X_val_norm @ theta_ridge - y_val)**2)
    print(f"  Ridge λ={lam:6.3f}: Val MSE = {ridge_val_mse:.4f}, "
          f"||θ|| = {np.linalg.norm(theta_ridge):.4f}")

print(f"  Early stopping at iter {best_val_iter}: Val MSE = {val_errs[best_val_idx]:.4f}, "
      f"||θ|| = {norms[best_val_idx]:.4f}")
print("KEY INSIGHT: Early stopping achieves similar regularization to")
print("explicit Ridge with an appropriate λ. The number of iterations")
print("controls effective capacity just like explicit regularization.")
```

Early stopping is one of the most effective regularization techniques in practice. It's computationally free (you have to train anyway), automatically adapts to the problem, and requires only a validation set to monitor.
Always track validation loss during training and save checkpoints at the best validation performance.
Perhaps the most important insight about effective capacity is that it depends critically on the data distribution. The same model can have vastly different effective capacities on different datasets.
Real-world data is not random—it has structure:
1. Low Intrinsic Dimensionality:
High-dimensional data (e.g., images with millions of pixels) often lies on or near a low-dimensional manifold. The 'manifold hypothesis' suggests that natural images occupy a tiny fraction of all possible pixel configurations.
2. Hierarchical Structure:
Natural data often has hierarchical organization (edges → textures → parts → objects). Networks that match this structure require less effective capacity.
3. Statistical Regularities:
Real data has consistent patterns—the same edge detection principles apply everywhere in an image. Networks can exploit this through parameter sharing (convolutions).
When data has structure, less capacity is needed to model it. The effective capacity required is proportional to the intrinsic complexity of the data, not its raw dimensionality.
Several concepts try to capture how much 'capacity' data actually requires:
Intrinsic Dimension:
The minimum number of coordinates needed to represent the data without significant loss of information. Natural images, for instance, are believed to have intrinsic dimension far below their pixel count.
Compression-Based Measures:
How well can the data be compressed? The minimum description length provides a measure: highly structured data compresses well and therefore requires less effective capacity to model.
Learning Curve Analysis:
How quickly does test error decrease with more training data?
```python
"""Data Complexity and Effective Capacity

Demonstrating how data structure affects required model capacity."""

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve


def estimate_intrinsic_dimension(X, variance_threshold=0.95):
    """Estimate intrinsic dimensionality using PCA.

    Returns the number of components explaining variance_threshold
    of the variance."""
    pca = PCA()
    pca.fit(X)
    cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
    n_components = np.argmax(cumulative_variance >= variance_threshold) + 1
    return n_components, pca.explained_variance_ratio_


def generate_data_scenarios():
    """Generate datasets with different intrinsic complexities."""
    np.random.seed(42)
    n_samples = 1000
    ambient_dim = 100
    scenarios = {}

    # Scenario 1: data lies on a 2D plane in 100D space (low intrinsic dim)
    true_dim = 2
    low_rank_basis = np.random.randn(true_dim, ambient_dim)
    low_rank_basis /= np.linalg.norm(low_rank_basis, axis=1, keepdims=True)
    Z_low = np.random.randn(n_samples, true_dim)
    X_low = Z_low @ low_rank_basis
    X_low += np.random.randn(n_samples, ambient_dim) * 0.1  # Small noise
    scenarios['Low Intrinsic (2D in 100D)'] = X_low

    # Scenario 2: data lies on a 20D subspace in 100D space (medium)
    true_dim = 20
    medium_rank_basis = np.random.randn(true_dim, ambient_dim)
    medium_rank_basis /= np.linalg.norm(medium_rank_basis, axis=1, keepdims=True)
    Z_med = np.random.randn(n_samples, true_dim)
    X_medium = Z_med @ medium_rank_basis
    X_medium += np.random.randn(n_samples, ambient_dim) * 0.1
    scenarios['Medium Intrinsic (20D in 100D)'] = X_medium

    # Scenario 3: random data in 100D (intrinsic dim = ambient dim)
    scenarios['High Intrinsic (Full 100D)'] = np.random.randn(n_samples, ambient_dim)

    return scenarios


print("Data Complexity Analysis")
print("=" * 70)

scenarios = generate_data_scenarios()

for name, X in scenarios.items():
    intrinsic_dim, variance_ratios = estimate_intrinsic_dimension(X)
    # Also measure how many components are needed for 80% and 99% variance
    cumvar = np.cumsum(variance_ratios)
    dim_80 = np.argmax(cumvar >= 0.80) + 1
    dim_99 = np.argmax(cumvar >= 0.99) + 1
    print(f"{name}")
    print(f"  Ambient dimension: {X.shape[1]}")
    print(f"  Dims for 80% variance: {dim_80}")
    print(f"  Dims for 95% variance: {intrinsic_dim}")
    print(f"  Dims for 99% variance: {dim_99}")
    print(f"  Compression ratio (95%): {X.shape[1] / intrinsic_dim:.1f}x")

print("\n" + "=" * 70)
print("Implications for Model Capacity")
print("=" * 70)
print("""\
- Low intrinsic dimension data can be modeled with lower effective capacity
- High intrinsic dimension requires more capacity (or more data)
- Real data (images, text) typically has much lower intrinsic dim than ambient
- This explains why massive neural nets can generalize on natural data:
  the effective problem complexity is much lower than it appears!""")

# Learning curve comparison
print("Learning Curves: How Data Complexity Affects Sample Efficiency")
print("-" * 70)

# Create classification tasks with different complexities
for name, X in scenarios.items():
    # Labels based on the top principal component (a simple boundary)
    pca = PCA(n_components=min(5, X.shape[1]))
    X_pca = pca.fit_transform(X)
    y = (X_pca[:, 0] > 0).astype(int)

    # Compute the learning curve
    train_sizes = [0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
    train_sizes_abs, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=train_sizes, cv=3, scoring='accuracy'
    )
    print(f"{name}:")
    print(f"  Samples: {train_sizes_abs}")
    print(f"  Val Acc: {np.mean(val_scores, axis=1).round(3)}")
```

The manifold hypothesis states that high-dimensional data (like images) lies on a low-dimensional manifold embedded in the high-dimensional space.
If true, this fundamentally changes the generalization picture: the effective complexity of the learning problem is determined by the manifold dimension, not the ambient dimension. A model needs only enough capacity to represent functions on this manifold.
Given that effective capacity is what matters for generalization, how can we measure or estimate it in practice?
A simple diagnostic: How well can your model fit random labels?
Procedure: replace your training labels with uniformly random ones, train with your normal recipe, and record the final training accuracy.
Interpretation: if the model reaches near-100% training accuracy on random labels, it has enough effective capacity to memorize your dataset; whatever generalization you observe with the real labels must then come from the data's structure and the training process, not from limited capacity.
The norm of the weights provides a proxy for effective capacity: small-norm solutions implement smoother, simpler functions, while large-norm solutions can implement sharper, more complex ones.
Tracking weight norms during training reveals how capacity evolves:
Epoch 1: ||W|| = 0.5 → Low capacity, captures simple patterns
Epoch 50: ||W|| = 5.0 → Medium capacity, captures main structure
Epoch 500: ||W|| = 50.0 → High capacity, may be memorizing
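The epoch numbers above are illustrative; the underlying trend is easy to reproduce. This minimal sketch (an overparameterized linear model, not a neural network) tracks $||\theta||$ while gradient descent fits the data from zero initialization; the norm grows monotonically as training proceeds:

```python
import numpy as np

np.random.seed(1)
n, p = 20, 100  # overparameterized: 20 samples, 100 features
X = np.random.randn(n, p)
y = np.random.randn(n)

theta = np.zeros(p)
lr = 0.01
norms = []
for step in range(20001):
    if step % 4000 == 0:
        norms.append(np.linalg.norm(theta))
        print(f"Step {step:5d}: ||theta|| = {norms[-1]:.3f}, "
              f"train MSE = {np.mean((X @ theta - y) ** 2):.2e}")
    theta -= lr * X.T @ (X @ theta - y) / n
# ||theta|| rises monotonically from 0 during training: more iterations
# mean a larger-norm solution, i.e., higher effective capacity.
```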
For generalized linear models, effective degrees of freedom (EDF) provides a rigorous measure:
$$\text{EDF} = \text{trace}(H)$$
where $H$ is the hat matrix ($\hat{y} = Hy$). EDF measures how many 'effective parameters' the model uses, accounting for regularization.
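For ridge regression the hat matrix has a closed form, $H = X(X^TX + \lambda I)^{-1}X^T$, so the EDF can be computed directly. A minimal sketch with illustrative data:

```python
import numpy as np

np.random.seed(0)
n, p = 50, 10
X = np.random.randn(n, p)

edfs = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    # Hat matrix for ridge: H = X (X^T X + lam I)^{-1} X^T
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    edfs.append(np.trace(H))
    print(f"lambda = {lam:6.1f}: effective degrees of freedom = {edfs[-1]:.2f}")
# lambda = 0 recovers EDF = p = 10 (ordinary least squares); increasing
# lambda smoothly shrinks the number of effective parameters toward 0.
```

The EDF thus provides a continuous dial between the full parameter count and zero, mirroring how regularization strength controls effective capacity.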
Information-theoretic approaches measure effective capacity through description length: the number of bits needed to encode the trained model, or the KL divergence between a prior over parameters and the learned posterior (the quantity that appears in PAC-Bayes bounds).
| Method | What It Measures | Advantages | Limitations |
|---|---|---|---|
| Random Label Test | Memorization ability | Simple, intuitive | Binary (can/can't memorize) |
| Weight Norm | Magnitude of solution | Easy to compute, continuous | Not invariant to scaling |
| Effective DoF | Trace of influence matrix | Theoretically grounded | Hard to compute for DNNs |
| Compression | Bits to describe model | Information-theoretic basis | Depends on encoding |
| PAC-Bayes Bound | KL from prior to posterior | Non-vacuous for DNNs | Requires choosing prior |
For day-to-day deep learning work, the most practical capacity diagnostics are: (1) Track the gap between training and validation loss—a growing gap signals excessive effective capacity. (2) Monitor weight norms—explosive growth often precedes overfitting. (3) Use the random label sanity check during development to understand your model's memorization ability.
Different architectures have different effective capacity characteristics, even with similar parameter counts.
Parameter Sharing → Reduced Effective Capacity:
Convolutions use the same weights at every spatial location. This reduces the parameter count dramatically, enforces translation equivariance, and rules out functions that treat different image locations in arbitrarily different ways.
This is a form of capacity control through architecture—constraining what can be learned to match known data structure.
Skip Connections → Modulated Effective Capacity:
Residual connections $y = F(x) + x$ affect effective capacity in subtle ways: each block can represent the identity mapping trivially, so added depth increases complexity only when the residual branches are actually used, and networks tend to learn small refinements on top of the identity.
Dynamic Computation → Data-Adaptive Capacity:
Self-attention computes its mixing weights dynamically from the input: which positions attend to which is decided per example rather than fixed in the parameters.
This makes transformer effective capacity particularly data-dependent.
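A minimal numpy sketch (illustrative shapes and random weights, not any particular transformer) makes this concrete: with the parameters held fixed, the attention pattern still changes when the input changes, unlike a fixed linear layer.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention; mixing weights are computed from X itself."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    # Row-wise softmax over tokens
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V, w

np.random.seed(0)
n_tokens, d = 4, 8
Wq, Wk, Wv = [np.random.randn(d, d) / np.sqrt(d) for _ in range(3)]

# Same parameters, two different inputs
_, weights_a = self_attention(np.random.randn(n_tokens, d), Wq, Wk, Wv)
_, weights_b = self_attention(np.random.randn(n_tokens, d), Wq, Wk, Wv)

print("Attention pattern, input A:\n", weights_a.round(2))
print("Attention pattern, input B:\n", weights_b.round(2))
# The effective 'connectivity' between tokens differs per input:
# the model allocates its capacity dynamically based on the data.
```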
BatchNorm/LayerNorm → Capacity Regularization:
Normalization layers constrain the space of possible functions, for example by making the output invariant to certain rescalings and shifts of the activations.
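As one concrete example of such a constraint, here is a small sketch using PyTorch's `nn.LayerNorm` at its default initialization (scale 1, bias 0): layer normalization makes the output invariant to shifting and positively rescaling its input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(16)  # default init: weight = 1, bias = 0
x = torch.randn(4, 16)

out1 = ln(x)
out2 = ln(3.0 * x + 5.0)  # rescale and shift every activation

print(torch.allclose(out1, out2, atol=1e-4))  # True: the affine change is normalized away
# Whole families of inputs that differ only by such rescalings map to the
# same output: an architectural constraint on the functions the network
# can express, and hence on its effective capacity.
```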
```python
"""Architecture Effects on Effective Capacity

Comparing how different architectures constrain what models actually learn."""

import torch
import torch.nn as nn
import numpy as np


def count_parameters(model):
    """Count trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def measure_weight_statistics(model):
    """Compute statistics about weight magnitudes."""
    total_norm = 0.0
    max_weight = 0.0
    weights = []
    for name, param in model.named_parameters():
        if 'weight' in name:
            w = param.data.cpu().numpy().flatten()
            weights.extend(w.tolist())
            total_norm += np.sum(w**2)
            max_weight = max(max_weight, np.max(np.abs(w)))
    weights = np.array(weights)
    return {
        'total_l2_norm': np.sqrt(total_norm),
        'max_weight': max_weight,
        'mean_abs': np.mean(np.abs(weights)),
        'sparsity': np.mean(np.abs(weights) < 0.01),  # Near-zero weights
    }


# Compare MLP vs CNN on image-like data
input_shape = (3, 32, 32)
n_classes = 10


# Architecture 1: Fully Connected MLP
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.Sequential(
            nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_classes)
        )

    def forward(self, x):
        return self.layers(self.flatten(x))


# Architecture 2: Convolutional Network (similar depth, far fewer params)
class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 8x8
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):
        x = self.conv(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)


# Architecture 3: ResNet-style with skip connections
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        out = out + identity  # Skip connection
        return self.relu(out)


class ResNetStyle(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.block1 = ResBlock(64)
        self.pool = nn.MaxPool2d(2)  # 16x16
        self.block2 = ResBlock(64)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.block1(x)
        x = self.pool(x)
        x = self.block2(x)
        x = self.gap(x)
        x = x.view(x.size(0), -1)
        return self.fc(x)


# Initialize models
mlp, cnn, resnet = MLP(), CNN(), ResNetStyle()

print("Architecture Comparison: Capacity Characteristics")
print("=" * 70)

for name, model in [("MLP (Fully Connected)", mlp),
                    ("CNN (Convolutional)", cnn),
                    ("ResNet-style", resnet)]:
    params = count_parameters(model)
    stats = measure_weight_statistics(model)
    print(f"{name}")
    print(f"  Parameters: {params:,}")
    print(f"  Initial weight L2 norm: {stats['total_l2_norm']:.2f}")
    # Architecture-specific capacity notes
    if "MLP" in name:
        print("  Capacity type: Global (every input connects to every hidden unit)")
        print("  Inductive bias: None (most flexible, least constrained)")
    elif "CNN" in name:
        print("  Capacity type: Local (weight sharing reduces effective parameters)")
        print("  Inductive bias: Translation equivariance (good for images)")
    else:
        print("  Capacity type: Residual (can learn identity, gradual complexity)")
        print("  Inductive bias: Skip connections enable stable deep networks")

print("\n" + "=" * 70)
print("KEY INSIGHT: Parameter count ≠ Effective Capacity")
print("-" * 70)
print("""\
- MLP has the most parameters but may need more data to generalize on images
- CNN has fewer parameters but exploits image structure → lower effective
  capacity needed for good generalization on images
- ResNet can go deeper without gradient issues → can modulate capacity
  gradually during training

The right architecture matches its inductive bias to the data structure,
reducing the effective capacity needed for good generalization.""")
```

The choice of architecture is itself a form of regularization. When you choose a CNN for images, you're encoding assumptions about translation invariance and local structure. These assumptions reduce effective capacity by ruling out functions that violate them—but if the assumptions match the data, this is beneficial for generalization.
We've explored the crucial concept of effective capacity—the distinction between what a model can theoretically represent and what it actually learns in practice. This concept is fundamental to understanding deep learning.
In the next page, we'll explore the double descent phenomenon—a remarkable discovery that shows the bias-variance tradeoff has a second 'descent' phase where more capacity actually improves generalization. This fundamentally revises our understanding of the capacity-generalization relationship.
You now understand the concept of effective capacity and why it resolves the puzzle of deep learning generalization. The combination of optimization dynamics, data structure, and architectural choices constrains what models actually learn—explaining why massive networks can generalize despite astronomical theoretical capacity.