At the heart of machine learning lies a fundamental question: How does the complexity of a model relate to its ability to generalize? This question has occupied statisticians and computer scientists for decades, and its answer forms the theoretical foundation upon which all of modern deep learning rests.
When we train a neural network, we're essentially searching through a vast space of possible functions to find one that maps our inputs to our desired outputs. The model capacity—also called the hypothesis class complexity or model expressiveness—determines the size and shape of this function space. A model with high capacity can represent more complex functions; a model with low capacity is restricted to simpler ones.
Understanding capacity is not merely academic. It directly impacts every practical decision we make: how many layers to use, how many neurons per layer, whether to add regularization, and how much data we need. Without a principled understanding of capacity, we're navigating in the dark, relying entirely on trial-and-error experimentation.
By the end of this page, you will understand multiple rigorous definitions of model capacity (VC dimension, Rademacher complexity, and more), the relationship between capacity and generalization, and the profound implications these concepts have for designing deep learning systems.
Before diving into formal definitions, let's build intuition for what capacity means. Consider the problem of fitting a curve to a set of data points.
The Polynomial Fitting Example:
Suppose we have 10 data points in 2D space and want to fit a polynomial curve. We must choose the degree of the polynomial: a line (degree 1), a quadratic (degree 2), or something far more flexible, such as a degree-9 polynomial.
The degree of the polynomial is a form of capacity. Higher degree means higher capacity—the model can fit more complex patterns.
```python
import numpy as np

# Generate noisy data from a true quadratic function
np.random.seed(42)
X = np.linspace(0, 1, 10)
y_true = 0.5 + 2*X - 3*X**2  # True underlying function
y_noisy = y_true + np.random.normal(0, 0.1, 10)  # Add noise

# Fit polynomials of different degrees (capacities)
degrees = [1, 2, 9]
X_dense = np.linspace(0, 1, 100)

for degree in degrees:
    coeffs = np.polyfit(X, y_noisy, degree)
    y_fit = np.polyval(coeffs, X_dense)
    print(f"Degree {degree} polynomial:")
    print(f"  Number of parameters: {degree + 1}")
    print(f"  Training error (MSE): {np.mean((np.polyval(coeffs, X) - y_noisy)**2):.6f}")

# Key insight: the degree-9 polynomial has ZERO training error
# but will generalize poorly to new points
```

The Capacity-Generalization Tradeoff:
This example illustrates a fundamental tension:
Too little capacity (underfitting): The model cannot capture the true underlying pattern. The linear fit misses the curvature entirely.
Too much capacity (overfitting): The model fits the noise rather than the signal. The degree-9 polynomial passes through every noisy point, memorizing the training data rather than learning the underlying relationship.
Just right (good generalization): The quadratic model matches the true data-generating process. It captures the essential pattern without fitting the noise.
This is sometimes called the bias-variance tradeoff, which we'll explore in depth. For now, recognize that capacity is not simply 'more is better'—at least in classical statistical learning theory.
Classical theory suggests high-capacity models should overfit. Yet modern deep networks with billions of parameters often generalize well. This apparent contradiction—which we'll resolve in later pages—is one of the most important open questions in machine learning theory.
Computer scientists and statisticians have developed several rigorous ways to quantify model capacity. Each captures different aspects of what it means for a model to be 'complex' or 'expressive'.
We begin with the hypothesis class, denoted $\mathcal{H}$. This is the set of all functions our model can represent. For a neural network, $\mathcal{H}$ depends on the architecture (number of layers, neurons per layer, activation functions) and is implicitly defined by the parameter space.
For example, a linear model on $\mathbb{R}^d$ has the hypothesis class $\mathcal{H} = \{x \mapsto w^\top x + b : w \in \mathbb{R}^d, b \in \mathbb{R}\}$, while a fixed ReLU architecture defines the much richer class of all piecewise-linear functions reachable by some setting of its weights.
The 'capacity' of a model is essentially a measure of how 'large' or 'rich' this hypothesis class is.
A naive approach might count parameters: more parameters = higher capacity. But this is inadequate. A ReLU network with 1000 parameters can represent vastly more complex functions than a degree-999 polynomial (which also has ~1000 coefficients). The architecture matters, not just parameter count.
The Vapnik-Chervonenkis (VC) dimension is the most famous measure of capacity, developed in the 1970s. It quantifies capacity through the lens of shattering.
Definition (Shattering): A hypothesis class $\mathcal{H}$ shatters a set of points $\{x_1, \ldots, x_n\}$ if, for every possible binary labeling of these points, there exists some $h \in \mathcal{H}$ that achieves that labeling.
Definition (VC Dimension): The VC dimension of $\mathcal{H}$, denoted $\text{VC}(\mathcal{H})$, is the largest number $n$ such that there exists a set of $n$ points that $\mathcal{H}$ can shatter.
Intuitively, VC dimension measures the largest dataset the model can perfectly memorize for any arbitrary labeling.
"""VC Dimension Examples Understanding VC dimension through concrete examples.""" # Example 1: Linear classifiers in 2D (lines)# # Consider the hypothesis class of all lines in 2D:# H = {h(x) = sign(w1*x1 + w2*x2 + b) : w1, w2, b ∈ ℝ}## Can shatter 3 points? YES# - For any 3 non-collinear points, any binary labeling # (2^3 = 8 labelings) can be achieved by some line.## Can shatter 4 points? NO# - Consider 4 points in convex position (square vertices)# - Label diagonal points the same (XOR pattern)# - No line can separate this labeling## Therefore: VC(linear classifiers in 2D) = 3 # Example 2: Linear classifiers in d dimensions# # General result: VC(linear classifiers in ℝ^d) = d + 1## This makes intuitive sense:# - d=1 (threshold on a line): VC = 2# - d=2 (lines in plane): VC = 3# - d=10 (hyperplanes in ℝ^10): VC = 11 # Example 3: Neural Networks## For a ReLU network with W weights:# - Lower bound: VC ≥ Ω(W log W) for deep networks# - Upper bound: VC ≤ O(W^2) in general## Key insight: ReLU networks have much higher VC dimension# than their parameter count alone would suggest! def demonstrate_shattering_2d(): """ Demonstrate that 3 points can be shattered by lines in 2D. """ import numpy as np # Three non-collinear points points = np.array([[0, 0], [1, 0], [0.5, 1]]) # All 8 possible binary labelings labelings = [ [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1] ] # For each labeling, we claim a separating line exists # (Proof by construction or convexity argument) print("For 3 non-collinear points in 2D:") print(f" Number of possible labelings: {len(labelings)}") print(" All labelings achievable by some line: YES") print(" Therefore these 3 points are shattered") print(f"\n VC dimension of linear classifiers in 2D = 3") demonstrate_shattering_2d()VC Dimension of Common Models:
| Model | VC Dimension |
|---|---|
| Threshold on ℝ (single feature) | 2 |
| Linear classifiers in ℝ^d | d + 1 |
| Axis-aligned rectangles in ℝ^d | 2d |
| Decision stumps (single threshold per feature) | 2d |
| Degree-k polynomials in ℝ^d | $\binom{d+k}{k}$ |
| k-nearest neighbors | ∞ (can shatter any finite set) |
| Neural network with W weights (ReLU) | O(W² log W) upper bound |
Note that k-NN has infinite VC dimension yet often works well in practice. This hints that VC dimension, while theoretically important, doesn't tell the whole story about generalization.
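To make that concrete, here is a minimal sketch (using scikit-learn, which this page has not otherwise assumed) showing that a 1-nearest-neighbor classifier attains zero training error on any labeling of distinct training points, which is the informal meaning of infinite VC dimension.

```python
"""Minimal sketch: 1-nearest-neighbor fits ANY labeling of its training points,
the informal meaning of infinite VC dimension. Requires scikit-learn."""

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # 50 arbitrary (distinct) points in 2D

for trial in range(3):
    y_random = rng.integers(0, 2, size=50)  # a completely arbitrary labeling
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_random)
    print(f"Random labeling {trial}: training accuracy = {clf.score(X, y_random):.2f}")

# Every labeling is fit perfectly (accuracy 1.00): the training set is shattered.
```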
The real power of VC dimension emerges when we connect it to generalization. The VC generalization bound tells us how training error relates to test error as a function of capacity.
Theorem (VC Generalization Bound): For a hypothesis class $\mathcal{H}$ with finite VC dimension $d$, and a training set of size $n$, with probability at least $1 - \delta$:
$$R(h) \leq \hat{R}_n(h) + \sqrt{\frac{d \log(\frac{2en}{d}) + \log(\frac{4}{\delta})}{n}}$$
Where $R(h)$ is the true (test) risk, $\hat{R}_n(h)$ is the empirical (training) risk, $d$ is the VC dimension of $\mathcal{H}$, $n$ is the number of training samples, and $\delta$ is the allowed failure probability.
This is profound: it says that the gap between training and test error depends, up to logarithmic factors, on the ratio $d/n$ (VC dimension over sample size).
The VC bound tells us: to generalize well with a high-capacity model, you need proportionally more data. If VC dimension is d, you need roughly O(d) samples for the generalization gap to be small. This is the theoretical foundation for 'more complex models need more data.'
Let's unpack what this bound actually tells us:
1. Capacity Controls Generalization Gap
The term $\frac{d}{n}$ dominates. If you have a model with VC dimension 1000 and only 100 training samples, the bound is vacuous (greater than 1). You need $n \gg d$ for meaningful bounds.
2. Logarithmic Dependence on Confidence
The $\log(1/\delta)$ term means we pay only a small price for high-confidence bounds: going from 95% confidence to 99.9% confidence adds a modest additive term under the square root (the sketch after point 3 below checks this numerically).
3. Sample Complexity
Rearranging, to achieve generalization gap $\epsilon$ with confidence $1-\delta$, you need: $$n = O\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$$
This is the sample complexity of learning with hypothesis class $\mathcal{H}$.
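As a rough illustration of this formula, the sketch below plugs in numbers while ignoring the constants hidden in the $O(\cdot)$; the helper `samples_needed` is ours, and its outputs should be read as order-of-magnitude estimates only.

```python
"""Order-of-magnitude reading of the VC sample-complexity formula.
Constants hidden in the O(.) are ignored, so treat these as rough estimates."""

import numpy as np


def samples_needed(d: int, epsilon: float, delta: float = 0.05) -> int:
    """n ~ (d + log(1/delta)) / epsilon^2, ignoring constant factors."""
    return int(np.ceil((d + np.log(1 / delta)) / epsilon ** 2))


# More capacity => proportionally more data for the same gap epsilon
for d in [10, 1_000, 1_000_000]:
    print(f"VC dim {d:>9,}: ~{samples_needed(d, epsilon=0.05):>14,} samples for gap 0.05")

# Tightening the confidence level barely changes the requirement (log dependence)
for delta in [0.05, 0.001]:
    print(f"d=1,000, delta={delta:<5}: ~{samples_needed(1_000, 0.05, delta):,} samples")
```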
"""VC Generalization Bound Analysis Exploring how the bound behaves for different capacity/sample relationships.""" import numpy as npimport matplotlib.pyplot as plt def vc_generalization_bound(n: int, d: int, delta: float = 0.05) -> float: """ Compute the VC generalization bound. Args: n: Number of training samples d: VC dimension delta: Confidence parameter (bound holds with prob 1-delta) Returns: Upper bound on generalization gap """ if n < d: return float('inf') # Bound is vacuous # VC bound formula term1 = d * np.log(2 * np.e * n / d) term2 = np.log(4 / delta) return np.sqrt((term1 + term2) / n) # Analyze for a neural network with different numbers of parameters# Assume VC dimension ≈ O(W) for simplicity (actually O(W² log W)) print("VC Bound Analysis for Different Capacity/Sample Ratios")print("=" * 60) # Scenario 1: Small model, adequate datad1, n1 = 100, 10000bound1 = vc_generalization_bound(n1, d1)print(f"\nScenario 1: d={d1}, n={n1}")print(f" Ratio n/d = {n1/d1:.0f}")print(f" Generalization bound: {bound1:.4f}") # Scenario 2: Large model, same datad2, n2 = 10000, 10000bound2 = vc_generalization_bound(n2, d2)print(f"\nScenario 2: d={d2}, n={n2}")print(f" Ratio n/d = {n2/d2:.0f}")print(f" Generalization bound: {bound2:.4f}") # Scenario 3: Large model, proportionally more datad3, n3 = 10000, 1000000bound3 = vc_generalization_bound(n3, d3)print(f"\nScenario 3: d={d3}, n={n3}")print(f" Ratio n/d = {n3/d3:.0f}")print(f" Generalization bound: {bound3:.4f}") # Scenario 4: Modern deep learning regime (overparameterized)d4, n4 = 100000000, 1000000 # 100M params, 1M samplesbound4 = vc_generalization_bound(n4, d4)print(f"\nScenario 4: d={d4:,}, n={n4:,}")print(f" Ratio n/d = {n4/d4:.4f}")print(f" Generalization bound: {bound4:.4f}")print(" Note: Bound is vacuous! Yet deep nets generalize in practice.") print("\n" + "=" * 60)print("KEY INSIGHT: Classical VC bounds fail to explain deep learning!")print("This is the 'generalization puzzle' that motivates modern theory.")For modern deep networks with millions or billions of parameters, VC bounds are completely vacuous. A network with 100 million parameters trained on 1 million samples should—according to VC theory—overfit catastrophically. Yet they generalize remarkably well. Understanding why requires concepts we'll explore in later pages: effective capacity, implicit regularization, and the geometry of loss landscapes.
VC dimension is worst-case: it considers whether any arrangement of points can be shattered. But real data isn't arbitrary—it has structure. Rademacher complexity provides a more refined, data-dependent measure of capacity.
Definition (Empirical Rademacher Complexity): For a hypothesis class $\mathcal{H}$, a sample $S = \{x_1, \ldots, x_n\}$, and i.i.d. Rademacher random variables $\sigma_i \in \{-1, +1\}$ with equal probability:
$$\hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\right]$$
Intuition: how well can the best hypothesis in $\mathcal{H}$ correlate with random noise? If $\mathcal{H}$ has high capacity, some hypothesis will correlate well with any noise pattern. If capacity is low, the correlation will be weak.
Data-Dependence: Unlike VC dimension, Rademacher complexity depends on the actual data distribution. A model might have high VC dimension in theory but low Rademacher complexity on structured real-world data.
Tighter Bounds: The Rademacher generalization bound is:
$$R(h) \leq \hat{R}_n(h) + 2\mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2n}}$$
Crucially, $\mathcal{R}_n(\mathcal{H})$ can be much smaller than what VC bounds predict, especially when the data has low intrinsic complexity.
Connection to Margin: For classification with margins, Rademacher complexity decreases with larger margins. This connects capacity to the geometric structure of the learned classifier.
"""Rademacher Complexity Estimation Empirically estimating Rademacher complexity via Monte Carlo sampling.""" import numpy as npfrom typing import Callable, List def estimate_rademacher_complexity( hypothesis_class_outputs: np.ndarray, n_rademacher_samples: int = 1000) -> float: """ Estimate empirical Rademacher complexity. Args: hypothesis_class_outputs: Array of shape (n_hypotheses, n_samples) where entry [h, i] is the output of hypothesis h on sample i. n_rademacher_samples: Number of Monte Carlo samples for expectation. Returns: Estimated Rademacher complexity. """ n_hypotheses, n_samples = hypothesis_class_outputs.shape supremums = [] for _ in range(n_rademacher_samples): # Sample random Rademacher variables sigma = np.random.choice([-1, 1], size=n_samples) # Compute correlation for each hypothesis correlations = hypothesis_class_outputs @ sigma / n_samples # Take supremum over hypothesis class supremums.append(np.max(correlations)) return np.mean(supremums) # Example: Compare linear models vs. polynomial modelsnp.random.seed(42) # Generate structured data (not arbitrary)n_samples = 100X = np.linspace(-1, 1, n_samples).reshape(-1, 1) # Hypothesis class 1: Linear functions (low capacity)# Sample many random linear functionsn_linear = 1000linear_weights = np.random.randn(n_linear)linear_outputs = linear_weights.reshape(-1, 1) * X.reshape(1, -1) # Hypothesis class 2: Degree-10 polynomials (high capacity)n_poly = 1000poly_features = np.column_stack([X**i for i in range(11)]) # 11 featurespoly_weights = np.random.randn(n_poly, 11)poly_outputs = poly_weights @ poly_features.T # Normalize outputs to [-1, 1] for fair comparisonlinear_outputs = np.tanh(linear_outputs)poly_outputs = np.tanh(poly_outputs) # Estimate Rademacher complexityrc_linear = estimate_rademacher_complexity(linear_outputs)rc_poly = estimate_rademacher_complexity(poly_outputs) print("Rademacher Complexity Comparison")print("=" * 50)print(f"Linear functions: {rc_linear:.4f}")print(f"Degree-10 polynomials: {rc_poly:.4f}")print(f"\nRatio: {rc_poly / rc_linear:.2f}x higher for polynomials")print("\nInterpretation: Higher Rademacher complexity means")print("the model class can fit random labels better → more capacity")Zhang et al. (2017) famously showed that deep networks can fit randomly labeled data perfectly. This is exactly what high Rademacher complexity predicts: the hypothesis class is rich enough to correlate with any noise pattern. The mystery is why, despite this capability, networks trained on real labels generalize well.
Beyond VC dimension and Rademacher complexity, researchers have proposed numerous other ways to quantify model capacity. Each provides different insights.
A covering number $\mathcal{N}(\epsilon, \mathcal{H}, d)$ is the minimum number of balls of radius $\epsilon$ (in metric $d$) needed to cover the hypothesis class $\mathcal{H}$.
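The growth of the covering number as $\epsilon$ shrinks is itself a capacity measure: richer classes need more balls. As a rough illustration, the sketch below greedily builds an $\epsilon$-cover, in the sup metric over a fixed sample, for two finite families of functions; the helper `greedy_cover_size` is ours, not a standard library routine, and the greedy count only upper-bounds the true covering number.

```python
"""Empirical covering-number sketch: represent each hypothesis by its vector of
outputs on a fixed sample, then greedily build an epsilon-cover in the sup
(L-infinity) metric. The greedy count upper-bounds the true covering number."""

import numpy as np


def greedy_cover_size(outputs: np.ndarray, eps: float) -> int:
    """outputs: (n_hypotheses, n_samples) matrix of hypothesis outputs."""
    centers = []
    for f in outputs:
        # Is f already within eps (sup norm) of an existing center?
        if not any(np.max(np.abs(f - c)) <= eps for c in centers):
            centers.append(f)
    return len(centers)


rng = np.random.default_rng(0)
X = np.linspace(-1, 1, 50)

# Bounded-slope linear functions vs. random degree-10 polynomials (squashed)
linear_outputs = np.array([w * X for w in rng.uniform(-1, 1, 500)])
poly_outputs = np.tanh(rng.normal(size=(500, 11)) @ np.array([X**i for i in range(11)]))

for eps in [0.5, 0.25, 0.1]:
    print(f"eps={eps:>4}: linear cover ~{greedy_cover_size(linear_outputs, eps):>3}, "
          f"degree-10 cover ~{greedy_cover_size(poly_outputs, eps):>3}")
# Richer classes need more balls to cover at the same radius.
```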
An extension of VC dimension for real-valued functions that accounts for margin:
Definition: The fat-shattering dimension at margin $\gamma$, denoted $\text{fat}_\gamma(\mathcal{H})$, is the size of the largest set $\{x_1, \ldots, x_n\}$ for which there exist 'witness' values $r_1, \ldots, r_n$ such that for every labeling $(b_1, \ldots, b_n) \in \{-1, +1\}^n$ there is some $h \in \mathcal{H}$ with $b_i (h(x_i) - r_i) \geq \gamma$ for all $i$.
This captures that classifiers with larger margins have lower effective capacity.
The Minimum Description Length (MDL) principle provides an information-theoretic view of capacity: the cost of a model is the number of bits needed to describe it, and the best model is the one that minimizes the total description length, the bits to encode the model plus the bits to encode the data given the model. A rich hypothesis class needs many bits just to single out one of its members.
This perspective is compelling because it connects directly to compression: a model that generalizes well should compress the data, not memorize it.
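As a toy illustration of the two-part-code idea, the sketch below compares description lengths for a setup like the earlier polynomial example (here with 30 points so the comparison is less noisy), under a deliberately crude coding scheme: a fixed 32 bits per coefficient plus an idealized Gaussian code for the residuals. The helper `description_length` is ours; real MDL analyses use far more careful codes.

```python
"""Toy two-part MDL comparison for polynomial fitting. Coding scheme is crude:
32 bits per coefficient for the model, plus an idealized Gaussian code for the
residuals (length ~ n/2 * log2 of the residual variance, up to constants)."""

import numpy as np

np.random.seed(42)
X = np.linspace(0, 1, 30)
y = 0.5 + 2 * X - 3 * X**2 + np.random.normal(0, 0.1, 30)


def description_length(degree: int, bits_per_param: int = 32) -> float:
    coeffs = np.polyfit(X, y, degree)
    residual_var = max(np.mean((np.polyval(coeffs, X) - y) ** 2), 1e-12)
    model_bits = bits_per_param * (degree + 1)         # cost of stating the model
    data_bits = 0.5 * len(X) * np.log2(2 * np.pi * np.e * residual_var)  # cost of residuals
    return model_bits + data_bits


for degree in [1, 2, 3, 5, 9]:
    print(f"degree {degree:>2}: total description length ~ {description_length(degree):8.1f} bits")
# The quadratic (the true model order) should come out near the minimum:
# fit improves slowly beyond degree 2, while model cost keeps growing.
```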
Probably Approximately Correct (PAC)-Bayes bounds provide some of the tightest generalization bounds currently known:
$$R(\rho) \leq \hat{R}(\rho) + \sqrt{\frac{\text{KL}(\rho \| \pi) + \ln(n/\delta)}{2n}}$$
Where $\rho$ is the posterior distribution over hypotheses chosen after seeing the data, $\pi$ is a prior distribution fixed before training, $\text{KL}(\rho \| \pi)$ is the Kullback-Leibler divergence between them, and $R(\rho)$, $\hat{R}(\rho)$ are the true and empirical risks averaged over $\rho$.
The key insight: if the learned model stays 'close' to the prior (in KL sense), generalization is guaranteed. This formalizes the intuition that simpler models (closer to prior) generalize better.
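A minimal sketch of how such a bound can be evaluated, assuming both prior and posterior are isotropic Gaussians over the weights (a common choice in the PAC-Bayes literature, e.g. a prior centered at initialization and a posterior centered at the trained weights); the helpers `kl_gaussians` and `pac_bayes_bound` are ours, and the empirical risk value is made up purely for illustration.

```python
"""PAC-Bayes bound under isotropic-Gaussian prior and posterior over weights.
All numbers here are illustrative, not measurements from a real network."""

import numpy as np


def kl_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2 I) || N(mu_p, sigma_p^2 I) ), closed form."""
    k = mu_q.size
    return (np.sum((mu_q - mu_p) ** 2) / (2 * sigma_p ** 2)
            + 0.5 * k * (sigma_q ** 2 / sigma_p ** 2 - 1)
            + k * np.log(sigma_p / sigma_q))


def pac_bayes_bound(emp_risk, kl, n, delta=0.05):
    return emp_risk + np.sqrt((kl + np.log(n / delta)) / (2 * n))


rng = np.random.default_rng(0)
k = 10_000                                    # number of weights (toy scale)
w_init = rng.normal(0, 0.1, k)                # "prior" mean: weights at initialization
w_trained = w_init + rng.normal(0, 0.01, k)   # "posterior" mean: weights after training

kl = kl_gaussians(w_trained, 0.05, w_init, 0.1)
bound = pac_bayes_bound(emp_risk=0.02, kl=kl, n=50_000)
print(f"KL(posterior || prior) = {kl:,.1f} nats")
print(f"PAC-Bayes risk bound   = {bound:.3f}")
# Small weight movement away from the prior => small KL => a non-vacuous bound.
```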
| Measure | Type | Advantages | Limitations |
|---|---|---|---|
| VC Dimension | Worst-case, combinatorial | Conceptually simple, well-studied | Too pessimistic for deep learning |
| Rademacher Complexity | Data-dependent, average-case | Tighter bounds, captures data structure | Hard to compute exactly |
| Covering Numbers | Geometric | Works for continuous hypothesis classes | Depends on choice of metric |
| Fat-Shattering | Margin-aware | Captures benefit of large margins | Still worst-case |
| MDL / Bits | Information-theoretic | Intuitive, connects to compression | Depends on encoding scheme |
| PAC-Bayes | Probabilistic, algorithmic | Tightest known bounds, non-vacuous for DNNs | Requires choosing prior |
PAC-Bayes bounds are currently the most promising approach for understanding deep learning generalization. They can give non-vacuous bounds (bounds less than 1) for networks with millions of parameters—something VC theory cannot achieve. The key is that they account for the specific weights found by training, not just the architecture's capacity.
How do these theoretical concepts apply specifically to neural networks? Let's examine what determines the capacity of deep learning models.
Depth (Number of Layers): Adding layers increases capacity very efficiently; for ReLU networks, the number of linear regions the function can carve out can grow exponentially with depth. The code below compares depth against width directly.

Width (Neurons per Layer): Widening layers also adds capacity, but more gradually; the gain from extra units in a single layer is closer to polynomial.

Connection Patterns: Architectural constraints matter too. Convolutions (weight sharing and locality) restrict the hypothesis class relative to a fully connected layer of the same size, while skip connections change which functions are easy to represent and optimize.
"""Neural Network Capacity Analysis Examining how architectural choices affect representational power.""" import torchimport torch.nn as nnimport numpy as np def count_parameters(model: nn.Module) -> int: """Count trainable parameters.""" return sum(p.numel() for p in model.parameters() if p.requires_grad) def count_linear_regions(model: nn.Module, input_dim: int, n_samples: int = 10000) -> int: """ Estimate the number of linear regions for a ReLU network. For ReLU networks, the function is piecewise linear. More linear regions = higher capacity to represent complex functions. """ model.eval() # Sample random inputs X = torch.randn(n_samples, input_dim) # Get activation patterns for each sample patterns = [] def hook_fn(name): def hook(module, inp, out): # Binary pattern: which neurons are active (> 0) pattern = (out > 0).float() patterns.append(pattern.flatten().cpu().numpy()) return hook # Register hooks on all ReLU activations hooks = [] for name, module in model.named_modules(): if isinstance(module, nn.ReLU): hooks.append(module.register_forward_hook(hook_fn(name))) # Forward pass with torch.no_grad(): _ = model(X) # Remove hooks for hook in hooks: hook.remove() if not patterns: return 1 # Concatenate all activation patterns all_patterns = np.concatenate(patterns, axis=1) if len(patterns) > 1 else patterns[0] # Count unique activation patterns unique_patterns = np.unique(all_patterns, axis=0) return len(unique_patterns) # Compare different architecturesinput_dim = 10hidden_dims = [64, 64] # 2 hidden layers # Architecture 1: Shallow and wideshallow_wide = nn.Sequential( nn.Linear(input_dim, 256), nn.ReLU(), nn.Linear(256, 1)) # Architecture 2: Deep and narrowdeep_narrow = nn.Sequential( nn.Linear(input_dim, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 1)) # Architecture 3: Balancedbalanced = nn.Sequential( nn.Linear(input_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1)) models = { "Shallow-Wide (1 hidden, 256 units)": shallow_wide, "Deep-Narrow (4 hidden, 32 units)": deep_narrow, "Balanced (2 hidden, 64 units)": balanced,} print("Neural Network Capacity Comparison")print("=" * 60) for name, model in models.items(): params = count_parameters(model) regions = count_linear_regions(model, input_dim) print(f"\n{name}") print(f" Parameters: {params:,}") print(f" Estimated linear regions: {regions:,}") print(f" Regions per parameter: {regions/params:.2f}") print("\n" + "=" * 60)print("KEY INSIGHT: Depth enables exponentially more linear regions")print("with similar parameter counts. Depth provides more 'capacity per parameter'.")The Universal Approximation Theorem states that a single-hidden-layer neural network with sufficient width can approximate any continuous function on a compact domain to arbitrary precision.
Key Results:
Cybenko (1989): Networks with sigmoidal activations are universal approximators.
Hornik (1991): Universality holds for any non-constant, bounded, continuous activation function.
Modern Extensions: ReLU networks, residual networks, and transformers all have universal approximation properties.
What This Means for Capacity:
Universality is about existence, not efficiency or learnability.
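The sketch below makes the existence-versus-learnability point concrete: single-hidden-layer ReLU networks of increasing width are trained to fit $\sin(2\pi x)$ on $[0, 1]$, and the achievable error shrinks with width. Nothing here is guaranteed by the theorem itself, which only says that good weights exist; whether gradient descent finds them is a separate question.

```python
"""Sketch: a single hidden ReLU layer approximating sin(2*pi*x) on [0, 1].
Wider networks reach lower training error, consistent with universal
approximation -- though the theorem only guarantees good weights exist."""

import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.linspace(0, 1, 512).unsqueeze(1)
y = torch.sin(2 * torch.pi * X)

for width in [2, 8, 32, 128]:
    net = nn.Sequential(nn.Linear(1, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(3000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(X), y)
        loss.backward()
        opt.step()
    print(f"width {width:>4}: final MSE = {loss.item():.5f}")
```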
Universal approximation is both a blessing and a curse. The blessing: we know our architecture can represent the solution. The curse: the same architecture can represent infinitely many wrong solutions too. The capacity to represent everything includes the capacity to memorize noise.
The bias-variance tradeoff is the classical framework for understanding how capacity affects generalization. While deep learning challenges some aspects of this framework (as we'll see in later pages), understanding it remains essential.
For regression with mean-squared error, the expected prediction error at any point $x$ can be decomposed:
$$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(\mathbb{E}[\hat{f}(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$
Where $f(x)$ is the true function, $\hat{f}(x)$ is the model's prediction (a random quantity through its dependence on the sampled training set), and $\sigma^2$ is the variance of the irreducible noise in $y$.
Interpretation:
Low Capacity (e.g., linear model for nonlinear data): High bias, low variance. The model systematically misses the true pattern, but its fit changes little from one training set to the next.

High Capacity (e.g., degree-20 polynomial for 10 data points): Low bias, high variance. The model can express the true pattern, but its fit swings wildly with every new noisy sample.

Optimal Capacity: The complexity that minimizes the sum of bias² and variance; pushing either term toward zero inflates the other.
"""Bias-Variance Tradeoff Simulation Empirically demonstrating the tradeoff through repeated experiments.""" import numpy as npfrom sklearn.preprocessing import PolynomialFeaturesfrom sklearn.linear_model import LinearRegressionfrom sklearn.pipeline import make_pipeline def true_function(x): """The true underlying function we're trying to learn.""" return np.sin(2 * np.pi * x) def generate_data(n_samples=30, noise_std=0.3, seed=None): """Generate noisy samples from the true function.""" if seed is not None: np.random.seed(seed) X = np.random.uniform(0, 1, n_samples) y = true_function(X) + np.random.normal(0, noise_std, n_samples) return X.reshape(-1, 1), y def bias_variance_experiment(degree, n_trials=100, n_train=30, n_test=100): """ Estimate bias and variance for polynomial regression of given degree. """ # Fixed test points X_test = np.linspace(0, 1, n_test).reshape(-1, 1) y_true = true_function(X_test.flatten()) # Collect predictions from many trials all_predictions = np.zeros((n_trials, n_test)) for trial in range(n_trials): X_train, y_train = generate_data(n_train, seed=trial) model = make_pipeline( PolynomialFeatures(degree), LinearRegression() ) model.fit(X_train, y_train) all_predictions[trial] = model.predict(X_test) # Compute bias and variance mean_prediction = np.mean(all_predictions, axis=0) # Pointwise squared bias bias_squared = (mean_prediction - y_true) ** 2 # Pointwise variance variance = np.var(all_predictions, axis=0) # Average over test points avg_bias_squared = np.mean(bias_squared) avg_variance = np.mean(variance) return avg_bias_squared, avg_variance # Run experiment for different polynomial degreesdegrees = [1, 2, 3, 4, 5, 7, 10, 15, 20] print("Bias-Variance Tradeoff: Polynomial Regression")print("=" * 60)print(f"{'Degree':<10} {'Bias²':<15} {'Variance':<15} {'Total':<15}")print("-" * 60) results = []for degree in degrees: bias_sq, var = bias_variance_experiment(degree) total = bias_sq + var results.append((degree, bias_sq, var, total)) print(f"{degree:<10} {bias_sq:<15.4f} {var:<15.4f} {total:<15.4f}") # Find optimalbest = min(results, key=lambda x: x[3])print("-" * 60)print(f"\nOptimal degree: {best[0]} (Total error: {best[3]:.4f})")print("\nObservations:")print(" - Low degrees: High bias, low variance (underfitting)")print(" - High degrees: Low bias, high variance (overfitting)")print(" - Middle degrees: Best balance")The traditional prescription: Choose model complexity using held-out validation. Start simple, increase complexity until validation error starts increasing. This 'Goldilocks' approach—not too simple, not too complex—has been the standard for decades. We'll see how deep learning upends this prescription in later pages.
We have covered substantial ground on the foundational concept of model capacity. Let's consolidate the essential takeaways: capacity measures the richness of a model's hypothesis class; VC dimension and Rademacher complexity make that notion precise in worst-case and data-dependent ways, respectively; generalization bounds tie the train-test gap to the ratio of capacity to sample size; and classical, architecture-only measures become vacuous in the overparameterized regime of modern deep networks.
In the next page, we'll explore effective capacity—the idea that what matters is not what a model can represent, but what it actually represents after training. This distinction is crucial for understanding why deep learning defies classical theory.
Core Questions to Reflect On:
If the VC dimension of a neural network is O(W²), where W is the parameter count, why don't billion-parameter models overfit catastrophically?
How does the training algorithm (SGD) interact with model capacity?
What role does data structure play in determining effective capacity?
These questions drive modern deep learning theory. With the foundation from this page, you're ready to explore the answers.
You now understand the theoretical foundations of model capacity, including VC dimension, Rademacher complexity, and the bias-variance tradeoff. You've also seen why classical theory fails to explain deep learning, setting the stage for more refined concepts like effective capacity and implicit regularization.