For decades, the machine learning community focused on explicit regularization—adding penalty terms like L2 weight decay, dropout, or data augmentation to prevent overfitting. These techniques have clear mathematical formulations and are well understood. However, a profound discovery has reshaped our understanding: the optimization algorithm itself acts as a powerful implicit regularizer.
Stochastic Gradient Descent (SGD), the workhorse of deep learning, doesn't merely find a solution—it exhibits a systematic bias toward certain types of solutions. This bias, invisible in the loss function but present in the optimization dynamics, fundamentally shapes the generalization properties of trained neural networks.
By the end of this page, you will understand how SGD's noise and iterative updates implicitly regularize neural networks, why SGD finds flat minima that generalize better, the mathematical foundations of implicit bias in optimization, and how these insights explain deep learning's remarkable generalization despite massive overparameterization.
The overparameterization paradox:
Modern neural networks routinely have millions or billions of parameters—far more than training examples. Classical statistical learning theory predicts catastrophic overfitting: with enough parameters, a model can memorize any training set perfectly, achieving zero training loss while learning nothing generalizable.
Yet deep networks generalize remarkably well. They achieve low training loss and low test loss, defying classical predictions. The resolution to this paradox lies in implicit regularization: SGD doesn't explore all possible zero-loss solutions equally—it's biased toward solutions with special properties that enable generalization.
To understand implicit bias, we must first appreciate the solution space landscape of overparameterized models.
The underdetermined system:
Consider a neural network with parameters θ ∈ ℝᵈ trained on n examples. When d >> n (far more parameters than training points), the optimization problem becomes underdetermined—there exist infinitely many parameter configurations that achieve zero training loss.
Mathematically, let L(θ) be the training loss. The set of global minima forms a manifold:
$$\mathcal{M} = \{\theta : L(\theta) = 0\}$$
This manifold has dimension at least d - n (and often much higher due to the non-convex nature of deep networks). Every point on this manifold has identical training performance, yet points can have vastly different generalization properties.
With infinitely many zero-loss solutions, the optimization algorithm must implicitly select one. The question becomes: which solution does SGD prefer, and why does that preference lead to good generalization?
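To make the underdetermination concrete, the following sketch constructs two different zero-loss solutions of a random underdetermined linear system by moving along the null space of the data matrix; the dimensions and seed are illustrative choices.

```python
import numpy as np

# A minimal sketch: with more parameters (d) than examples (n), many
# parameter vectors fit the training data exactly, yet differ wildly.
np.random.seed(0)
n, d = 20, 100
X = np.random.randn(n, d)
y = np.random.randn(n)

# One interpolating solution: the minimum-norm (pseudoinverse) solution.
theta_min = np.linalg.pinv(X) @ y

# Another interpolating solution: add any vector from the null space of X.
# Such directions change the parameters without changing the predictions.
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]            # orthogonal to all rows of X (since d > n)
theta_other = theta_min + 50.0 * null_direction

print("residual (min-norm):", np.linalg.norm(X @ theta_min - y))    # ~0
print("residual (other):   ", np.linalg.norm(X @ theta_other - y))  # ~0
print("norm (min-norm):    ", np.linalg.norm(theta_min))
print("norm (other):       ", np.linalg.norm(theta_other))          # much larger
```

Both solutions interpolate the training data perfectly; only their structure differs.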
Not all minima are equal:
Consider two solutions θ₁ and θ₂ that both achieve zero training loss:
| Property | Solution θ₁ | Solution θ₂ |
|---|---|---|
| Training Loss | 0 | 0 |
| Weight Magnitudes | Large | Small |
| Decision Boundary | Complex, jagged | Smooth |
| Sensitivity to Perturbations | High | Low |
| Test Loss | High | Low |
Solution θ₂ generalizes better not because of lower training loss (both are zero), but because of structural properties of the solution itself. These properties—weight magnitude, decision boundary smoothness, robustness—correlate with generalization.
The key insight:
SGD exhibits an implicit bias toward solutions like θ₂. Without any explicit regularization term, SGD's optimization dynamics naturally gravitate toward solutions with properties that happen to generalize well. Understanding this mechanism is central to modern deep learning theory.
The cleanest illustration of implicit bias appears in linear models, where we can prove precisely which solution gradient descent selects.
The linear regression setting:
Consider linear regression with design matrix X ∈ ℝⁿˣᵈ (n samples, d features) and targets y ∈ ℝⁿ. We seek θ ∈ ℝᵈ minimizing:
$$L(\theta) = \frac{1}{2}\|X\theta - y\|^2$$
When d > n and X has full row rank, the system Xθ = y is underdetermined with infinitely many solutions forming an affine subspace.
Gradient descent initialized at θ₀ = 0 converges to the minimum L2 norm solution: the unique solution that minimizes ||θ||² among all solutions satisfying Xθ = y. This is the Moore-Penrose pseudoinverse solution θ* = X†y.
Mathematical proof sketch:
Gradient descent update: θₜ₊₁ = θₜ - η∇L(θₜ) = θₜ - ηXᵀ(Xθₜ - y)
Starting from θ₀ = 0, every iterate θₜ lies in the row space of X:
$$\theta_t \in \text{rowspace}(X) = \text{span}\{x_1, \ldots, x_n\}$$
This is because each gradient update adds a linear combination of data points (rows of X). The gradient ∇L(θ) = Xᵀ(Xθ - y) is always in the row space of X.
Consequence: Since θₜ always lies in rowspace(X), the limit θ* must also lie in rowspace(X). Among all solutions to Xθ = y, the unique solution in rowspace(X) is precisely the minimum norm solution.
$$\theta^* = X^\dagger y = X^T(XX^T)^{-1}y$$
This is not coincidence—it's a fundamental property of gradient descent on quadratic objectives.
```python
import numpy as np

def verify_minimum_norm_bias(n=50, d=200, eta=0.001, num_iters=5000):
    """
    Demonstrate that gradient descent converges to the minimum norm solution.

    Args:
        n: Number of training examples
        d: Number of features (d > n for overparameterized)
        eta: Learning rate (kept below 2 / λ_max(XᵀX) so GD converges)
        num_iters: Number of gradient descent iterations
    """
    np.random.seed(42)

    # Generate random data (underdetermined system)
    X = np.random.randn(n, d)
    true_theta = np.random.randn(d)
    y = X @ true_theta  # noiseless targets

    # Gradient descent from zero initialization
    theta_gd = np.zeros(d)
    for t in range(num_iters):
        residual = X @ theta_gd - y
        gradient = X.T @ residual
        theta_gd = theta_gd - eta * gradient

    # Compute minimum norm solution analytically (Moore-Penrose pseudoinverse)
    theta_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

    # Compare solutions
    print("Solution Comparison:")
    print(f"  ||θ_GD||²       = {np.linalg.norm(theta_gd)**2:.6f}")
    print(f"  ||θ_min_norm||² = {np.linalg.norm(theta_min_norm)**2:.6f}")
    print(f"  ||θ_true||²     = {np.linalg.norm(true_theta)**2:.6f}")
    print(f"\n  ||θ_GD - θ_min_norm|| = {np.linalg.norm(theta_gd - theta_min_norm):.8f}")
    print(f"  Training loss of GD solution: {0.5 * np.linalg.norm(X @ theta_gd - y)**2:.10f}")

    # Both solve the system, but with different norms
    print(f"\n  GD finds minimum norm solution: {np.allclose(theta_gd, theta_min_norm)}")

verify_minimum_norm_bias()
```

Why minimum norm matters for generalization:
The minimum norm solution has special properties that correlate with generalization:
Smoothness: Minimum norm solutions avoid unnecessarily large weights, which typically correspond to high-frequency, complex functions
Stability: Smaller weights mean the function is less sensitive to input perturbations (Lipschitz constant is bounded by weight norm)
Occam's Razor: Among all interpolating solutions, minimum norm represents a form of simplicity preference
Connection to Ridge Regression: The minimum norm solution is the limit of ridge regression (L2 regularized) as the regularization strength λ → 0
$$\theta_{\text{ridge}}^{(\lambda)} = (X^TX + \lambda I)^{-1}X^Ty \xrightarrow{\lambda \to 0} X^\dagger y = \theta_{\text{min-norm}}$$
This reveals a deep connection: gradient descent achieves L2 regularization effects without any explicit L2 penalty.
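A quick numerical check of this limit, using illustrative dimensions and regularization values:

```python
import numpy as np

# Sketch: the ridge solution approaches the minimum-norm (pseudoinverse)
# solution as the regularization strength λ shrinks toward zero.
np.random.seed(1)
n, d = 30, 120
X = np.random.randn(n, d)
y = np.random.randn(n)

theta_min_norm = np.linalg.pinv(X) @ y

for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    gap = np.linalg.norm(theta_ridge - theta_min_norm)
    print(f"λ = {lam:8.0e}   ||θ_ridge - θ_min_norm|| = {gap:.6f}")
```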
While the linear case provides clean mathematical analysis, deep nonlinear networks exhibit richer and more complex implicit biases. The overparameterization and nonlinearity create intricate interactions that researchers are still actively investigating.
Beyond minimum norm:
In deep networks, implicit bias manifests through several interrelated phenomena:
Function space bias: Networks are biased toward certain functions (smooth, low-frequency) rather than just toward certain parameter configurations
Layer-wise effects: Different layers may exhibit different implicit biases; early layers often learn more general features
Architecture dependence: The specific network architecture (depth, width, skip connections) affects the implicit bias
Initialization sensitivity: The starting point influences which basin of attraction the optimization falls into
For infinitely wide networks, the Neural Tangent Kernel (NTK) theory shows that gradient descent is equivalent to kernel regression with a specific kernel determined by the architecture. In this regime, implicit bias becomes explicit: the network is biased toward functions in the RKHS (Reproducing Kernel Hilbert Space) of the NTK with minimum norm.
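To make the kernel-regime statement concrete, here is a small sketch of minimum-RKHS-norm kernel interpolation; a generic RBF kernel stands in for the architecture-specific NTK, and the data, bandwidth, and point counts are illustrative assumptions.

```python
import numpy as np

# Sketch: in the kernel (NTK) regime, training reduces to kernel regression,
# and the learned function is the minimum-RKHS-norm interpolant
# f(x) = k(x, X) K^{-1} y. An RBF kernel is used here as a stand-in for the
# architecture-specific NTK.
np.random.seed(2)
X_train = np.linspace(-3, 3, 10).reshape(-1, 1)
y_train = np.sin(X_train[:, 0]) + 0.1 * np.random.randn(10)

def rbf_kernel(A, B, bandwidth=0.75):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth**2))

K = rbf_kernel(X_train, X_train)
alpha = np.linalg.solve(K, y_train)          # coefficients of the interpolant

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
predictions = rbf_kernel(X_test, X_train) @ alpha
print("Max interpolation residual:", float(np.abs(K @ alpha - y_train).max()))
print("Test predictions:", np.round(predictions, 3))
print("RKHS norm of interpolant √(αᵀKα):", round(float(np.sqrt(alpha @ K @ alpha)), 3))
```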
Matrix factorization insights:
Consider the simple problem of matrix completion, where we factorize a matrix M ≈ UV^T with U ∈ ℝᵐˣʳ and V ∈ ℝⁿˣʳ. This is analogous to a two-layer linear network.
Remarkably, gradient descent on the factorized parameterization (U, V) exhibits implicit nuclear norm regularization—it finds low-rank solutions even without explicitly penalizing rank.
The gradient updates: $$U_{t+1} = U_t - \eta \nabla_U L = U_t - \eta (U_tV_t^T - M)V_t$$ $$V_{t+1} = V_t - \eta \nabla_V L = V_t - \eta (U_tV_t^T - M)^TU_t$$
When initialized at small scale, this converges to a solution with minimum nuclear norm, not minimum Frobenius norm of (U, V).
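The sketch below illustrates this bias on a small, partially observed low-rank matrix; the matrix size, observation rate, initialization scale, learning rate, and iteration count are illustrative assumptions, and the outcome is sensitive to the small-initialization scale.

```python
import numpy as np

# Sketch: gradient descent on an overparameterized factorization UVᵀ,
# started from a small random initialization, tends to find a low-rank
# (small nuclear norm) completion of a partially observed matrix.
np.random.seed(0)
m = n = 20
true_rank = 2
M = np.random.randn(m, true_rank) @ np.random.randn(true_rank, n)  # rank-2 target
mask = np.random.rand(m, n) < 0.6                                  # observe ~60% of entries

r = 20                       # full overparameterization: no explicit rank constraint
scale = 1e-3                 # small initialization is important for this bias
U = scale * np.random.randn(m, r)
V = scale * np.random.randn(n, r)

lr = 0.01
for _ in range(30_000):
    residual = mask * (U @ V.T - M)      # error on observed entries only
    U, V = U - lr * residual @ V, V - lr * residual.T @ U

recovered = U @ V.T
sv = np.linalg.svd(recovered, compute_uv=False)
print("Top 5 singular values of U Vᵀ:", np.round(sv[:5], 3))
print("Nuclear norm of U Vᵀ:", round(sv.sum(), 3))
print("Nuclear norm of M:   ", round(np.linalg.svd(M, compute_uv=False).sum(), 3))
print("Relative error on unobserved entries:",
      round(np.linalg.norm((~mask) * (recovered - M)) / np.linalg.norm((~mask) * M), 4))
```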
| Model Class | Implicit Bias | Regularization Equivalent |
|---|---|---|
| Linear regression (1 layer) | Minimum L2 norm | Ridge regression (λ → 0) |
| Matrix factorization (2 layers, linear) | Minimum nuclear norm (low rank) | Nuclear norm regularization |
| Diagonal linear networks | Sparsity (minimum L1 norm) | Lasso regression |
| Deep linear networks | Complex function of depth | Depth-dependent regularization |
| ReLU networks (classification) | Maximum margin | SVM-like margin maximization |
Maximum margin bias in classification:
For classification with separable data, deep homogeneous networks (with ReLU, leaky ReLU, or linear activations) trained with gradient descent exhibit a remarkable property: they converge in direction to the maximum margin classifier.
Specifically, as training progresses and the loss approaches zero:
$$\frac{\theta_t}{|\theta_t|} \rightarrow \frac{\theta_{\text{max-margin}}}{|\theta_{\text{max-margin}}|}$$
The normalized parameters converge to the direction of maximum margin, meaning the network implicitly maximizes the separation between classes—exactly what SVMs do explicitly.
This provides a theoretical foundation for why deep networks generalize well even when perfectly interpolating training data: they don't just find any interpolating solution, but one with maximum geometric margin.
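As a small sanity check of this claim, the sketch below runs plain gradient descent on the logistic loss for a tiny separable dataset whose max-margin direction (with no bias term) is (1, 1)/√2; the dataset, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Sketch: on linearly separable data, gradient descent on the logistic loss
# drives ||w|| to infinity while w/||w|| approaches the max-margin direction.
X = np.array([[1.0, 1.0], [3.0, 1.0], [-1.0, -1.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(2)
lr = 0.1
for t in range(200_000):
    margins = y * (X @ w)
    # Gradient of sum_i log(1 + exp(-margins_i)) with respect to w
    grad = -(X * (y * (1.0 / (1.0 + np.exp(margins))))[:, None]).sum(axis=0)
    w -= lr * grad

w_dir = w / np.linalg.norm(w)
max_margin_dir = np.array([1.0, 1.0]) / np.sqrt(2.0)
print("||w|| after training:", round(np.linalg.norm(w), 2))   # keeps growing with t
print("w / ||w||:           ", np.round(w_dir, 4))
print("max-margin direction:", np.round(max_margin_dir, 4))
print("cosine similarity:   ", round(float(w_dir @ max_margin_dir), 5))
# Note: convergence of the direction is only logarithmic in t, so the match
# is approximate at any finite number of steps.
```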
So far we've discussed gradient descent, but Stochastic Gradient Descent introduces an additional crucial element: noise from mini-batch sampling. This stochasticity doesn't merely speed up computation—it fundamentally alters the optimization dynamics and provides its own implicit regularization.
Gradient noise structure:
At each step, SGD uses a mini-batch B to estimate the gradient:
$$g_t = \frac{1}{|B|}\sum_{i \in B} \nabla_\theta \ell(\theta_t; x_i, y_i)$$
This estimate has noise with structure: $$g_t = \nabla L(\theta_t) + \epsilon_t$$
where εₜ is approximately zero-mean with covariance depending on the loss landscape and current parameters. This noise is anisotropic—it's larger in directions with high gradient variance across samples.
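A short sketch of what this anisotropy looks like for a linear least-squares model, with illustrative dimensions, feature scales, and seed:

```python
import numpy as np

# Sketch: estimate the covariance of the per-sample gradient noise for a
# linear least-squares model and show that it is anisotropic (its eigenvalues
# span orders of magnitude).
np.random.seed(3)
n, d = 2000, 10
X = np.random.randn(n, d) * np.linspace(0.2, 3.0, d)   # features with unequal scales
theta_true = np.random.randn(d)
y = X @ theta_true + 0.5 * np.random.randn(n)

theta = np.zeros(d)                                     # current (untrained) parameters
per_sample_grads = X * (X @ theta - y)[:, None]         # g_i = x_i (x_iᵀθ - y_i)

full_grad = per_sample_grads.mean(axis=0)
noise = per_sample_grads - full_grad                    # ε_i = g_i - ∇L(θ)
cov = noise.T @ noise / n                               # per-sample noise covariance

eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
print("Noise covariance eigenvalues (largest to smallest):")
print(np.round(eigvals, 3))
print("Anisotropy (λ_max / λ_min):", round(eigvals[0] / eigvals[-1], 1))
# For a mini-batch of size B, the noise covariance is approximately cov / B.
```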
One interpretation views SGD as approximate Bayesian inference. The noise induces exploration, and the stationary distribution of SGD approximates a posterior distribution over parameters. Under this view, the implicit regularization of SGD corresponds to a prior preference for flat regions of the loss landscape.
Noise scale and regularization strength:
The effective noise scale in SGD is controlled by several factors:
$$\text{Effective Noise} \propto \frac{\eta}{|B|} \cdot \text{Var}[\nabla \ell]$$
where η is the learning rate, |B| is the mini-batch size, and Var[∇ℓ] is the variance of the per-sample gradients at the current parameters.
Key insight: The ratio η/|B| governs the noise-to-signal ratio. Larger learning rates or smaller batches increase noise, strengthening implicit regularization.
This explains several empirical observations: small-batch training often generalizes better than large-batch training at a fixed learning rate, and scaling the learning rate together with the batch size (the linear scaling rule discussed below) roughly preserves both the optimization dynamics and the generalization behavior.
```python
import numpy as np

def analyze_sgd_noise_impact(n_samples=1000, d=100,
                             batch_sizes=[8, 32, 128, 512],
                             learning_rates=[0.1, 0.01, 0.001]):
    """
    Analyze how batch size and learning rate affect SGD noise scale.

    The effective noise scale is proportional to (learning_rate / batch_size).
    This ratio controls the implicit regularization strength.
    """
    # Generate synthetic gradient variances (simulating a real problem)
    np.random.seed(42)

    # Per-sample gradient magnitudes (vary across samples)
    sample_gradient_variance = np.random.exponential(scale=1.0, size=n_samples)
    full_gradient_variance = np.var(sample_gradient_variance)

    print("SGD Noise Scale Analysis")
    print("=" * 60)
    print(f"Per-sample gradient variance: {full_gradient_variance:.4f}")
    print()

    results = []
    for lr in learning_rates:
        print(f"Learning Rate: {lr}")
        print("-" * 40)
        for bs in batch_sizes:
            # Mini-batch gradient variance scales as 1/batch_size
            minibatch_variance = full_gradient_variance / bs

            # Effective noise scale
            noise_scale = lr * np.sqrt(minibatch_variance)
            effective_ratio = lr / bs

            print(f"  Batch Size {bs:4d}: Noise Scale = {noise_scale:.6f}, "
                  f"η/B = {effective_ratio:.6f}")

            results.append({
                'lr': lr,
                'batch_size': bs,
                'noise_scale': noise_scale,
                'ratio': effective_ratio
            })
        print()

    # Key insight
    print("=" * 60)
    print("Key Insight: Maintaining η/B ratio maintains similar dynamics")
    print("  - lr=0.1, batch=32  → η/B = 0.003125")
    print("  - lr=0.4, batch=128 → η/B = 0.003125 (Linear scaling rule)")

    return results

# Run analysis
analyze_sgd_noise_impact()
```

A key mechanism through which SGD achieves implicit regularization is its bias toward flat minima—regions of parameter space where the loss is low and changes slowly with parameter perturbations.
Sharp vs. Flat Minima:
Consider two minima with identical training loss:
Sharp minimum: Loss increases rapidly as parameters deviate from the optimum. Small perturbations lead to large loss increases.
Flat minimum: Loss remains low over a large region. The solution is tolerant to parameter perturbations.
The curvature of the loss landscape is captured by the Hessian matrix H = ∇²L(θ). Sharp minima have large eigenvalues (high curvature), while flat minima have small eigenvalues (low curvature).
The intuition is rooted in perturbation analysis. Training and test data are drawn from the same distribution but are different samples. A solution that is robust to small data perturbations (flat minimum) will likely perform similarly on new data. Sharp minima, finely tuned to training data, may not transfer to test data.
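The following sketch quantifies that intuition for two quadratic minima with identical loss but different curvature; the Hessians and perturbation scale are illustrative choices.

```python
import numpy as np

# Sketch: at two minima with identical loss, the same parameter perturbation
# hurts the sharp minimum (large Hessian eigenvalues) far more than the flat one.
np.random.seed(7)

H_flat = np.diag([0.1, 0.2, 0.1])      # small eigenvalues -> flat minimum
H_sharp = np.diag([20.0, 35.0, 50.0])  # large eigenvalues -> sharp minimum

def expected_loss_increase(H, sigma=0.05, num_samples=10_000):
    """Average increase of a quadratic loss ½ δᵀHδ under δ ~ N(0, σ²I)."""
    d = H.shape[0]
    deltas = sigma * np.random.randn(num_samples, d)
    return 0.5 * np.mean(np.einsum('ni,ij,nj->n', deltas, H, deltas))

print("Expected loss increase under parameter perturbation:")
print(f"  flat  minimum: {expected_loss_increase(H_flat):.6f}")
print(f"  sharp minimum: {expected_loss_increase(H_sharp):.6f}")
# Analytically the expected increase is (σ²/2)·trace(H), so it scales
# directly with the curvature of the minimum.
```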
PAC-Bayes perspective:
The connection between flatness and generalization can be formalized through PAC-Bayes bounds. These bounds relate generalization error to the "sharpness" of the loss landscape:
$$L_{\text{test}}(\theta) \leq \hat{L}_{\text{train}}(\theta) + \mathcal{O}\left(\sqrt{\frac{\text{KL}(Q \,\|\, P) + \log(n/\delta)}{n}}\right)$$
where Q is a posterior distribution over parameters (centered at θ with spread related to flatness) and P is a prior. Flat minima permit Q to be spread more widely while maintaining low expected loss, reducing the KL divergence term.
Minimum description length view:
Flat minima correspond to solutions that can be specified with fewer bits of precision. Since the loss doesn't change much in a neighborhood, we only need to specify parameters up to a coarse precision. This connects to Occam's Razor: simpler (lower description length) solutions generalize better.
SGD's preference for flat minima:
Why does SGD prefer flat minima? Several mechanisms contribute:
Escape dynamics: SGD's noise allows it to escape sharp minima. The higher the curvature, the more unstable the point is under SGD dynamics. Sharp minima act as repellers.
Anisotropic noise: SGD noise is larger in directions of high gradient variance, which often correlates with high curvature. This preferentially destabilizes sharp directions.
Continuous-time analysis: In the limit of small learning rates, SGD can be approximated as a stochastic differential equation (SDE):
$$d\theta = -\nabla L(\theta)dt + \sqrt{\eta \Sigma(\theta)}dW_t$$
where Σ(θ) is the gradient covariance. The stationary distribution of this SDE concentrates in flat regions.
```python
import numpy as np

def create_loss_landscape_1d():
    """
    Create a 1D loss landscape with both sharp and flat minima
    to illustrate SGD preference.
    """
    def loss_function(x):
        # Flat minimum at x = -2
        flat_component = 0.1 * (x + 2)**2
        # Sharp minimum at x = 2 (same loss value at minimum)
        sharp_component = 2 * (x - 2)**2

        # Blend to create landscape with both minima
        if isinstance(x, np.ndarray):
            result = np.where(x < 0, flat_component, sharp_component)
            # Add barrier between minima
            result += 0.5 * np.exp(-0.5 * x**2)
            return result
        else:
            if x < 0:
                return flat_component + 0.5 * np.exp(-0.5 * x**2)
            else:
                return sharp_component + 0.5 * np.exp(-0.5 * x**2)

    def loss_gradient(x):
        """Numerical gradient for simplicity"""
        eps = 1e-7
        return (loss_function(x + eps) - loss_function(x - eps)) / (2 * eps)

    return loss_function, loss_gradient

def simulate_sgd_on_landscape(loss_fn, grad_fn, x_init=0.0, lr=0.1,
                              noise_scale=0.3, n_steps=500):
    """
    Simulate SGD with noise on the loss landscape.
    """
    x = x_init
    trajectory = [x]

    for _ in range(n_steps):
        grad = grad_fn(x)
        noise = np.random.randn() * noise_scale
        x = x - lr * grad + noise
        x = np.clip(x, -5, 5)  # Keep in bounds
        trajectory.append(x)

    return np.array(trajectory)

def analyze_convergence_preference(n_runs=100):
    """
    Analyze which minimum SGD converges to across multiple runs.
    """
    loss_fn, grad_fn = create_loss_landscape_1d()

    flat_count = 0
    sharp_count = 0

    for _ in range(n_runs):
        trajectory = simulate_sgd_on_landscape(
            loss_fn, grad_fn, x_init=0.0, lr=0.05,
            noise_scale=0.2, n_steps=1000
        )
        final_x = trajectory[-1]
        if final_x < 0:
            flat_count += 1
        else:
            sharp_count += 1

    print("SGD Convergence Analysis (100 runs)")
    print("=" * 40)
    print(f"Converged to FLAT minimum:  {flat_count:3d}%")
    print(f"Converged to SHARP minimum: {sharp_count:3d}%")
    print()
    print("=> SGD preferentially converges to flat minima!")

# Run analysis
analyze_convergence_preference()
```

The learning rate's role in implicit regularization goes beyond simple convergence considerations. It fundamentally affects the geometry of solutions found by SGD.
Large learning rates as regularizers:
Counter to naive intuition, larger learning rates (up to a point) often lead to better generalization. The mechanism:
Larger noise amplitude: Learning rate scales the gradient noise, increasing the effective temperature of the SGD random walk
Sharper minima become unstable: As learning rate increases, solutions with high curvature become dynamically unstable—SGD cannot stay there
Effective sharpness filter: Only minima flat enough to be stable at the given learning rate can be reached
Recent research has identified the 'Edge of Stability' phenomenon: during training, the largest eigenvalue of the Hessian (sharpness) rises until it reaches 2/η (where η is the learning rate), then oscillates around this value. SGD self-organizes to the boundary of stability, naturally producing solutions whose sharpness is inversely proportional to the learning rate.
Mathematical formalization:
For a minimum to be stable under GD with learning rate η, the updates must not diverge. For a quadratic approximation near a minimum:
$$\theta_{t+1} - \theta^* = (I - \eta H)(\theta_t - \theta^*)$$
Stability requires all eigenvalues of (I - ηH) to have magnitude ≤ 1, which means:
$$\lambda_{\max}(H) < \frac{2}{\eta}$$
Larger learning rates → smaller maximum allowable curvature → flatter minima.
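A minimal numerical check of this threshold on a one-dimensional quadratic, with illustrative curvatures and learning rate:

```python
# Sketch: gradient descent on a 1D quadratic with curvature λ (the Hessian)
# converges only when λ < 2/η; sharper minima are dynamically unreachable.
eta = 0.1                       # learning rate -> stability threshold 2/η = 20
for curvature in [5.0, 15.0, 19.0, 21.0, 40.0]:
    theta = 1.0                 # start away from the minimum at 0
    for _ in range(100):
        theta = theta - eta * curvature * theta   # GD step on L(θ) = ½λθ²
    status = "converges" if abs(theta) < 1e-3 else "oscillates/diverges"
    print(f"curvature λ = {curvature:5.1f}  (2/η = {2/eta:.0f})  "
          f"|θ| after 100 steps = {abs(theta):10.3e}  -> {status}")
```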
The catapult dynamics:
When SGD encounters a region sharper than 2/η, the dynamics 'catapult' away from that region rather than converging. This is a feature, not a bug—SGD is automatically filtering out sharp solutions.
This explains why the learning rate acts as a filter on which minima are reachable:
| Learning Rate | Reachable Minima | Generalization | Training Stability |
|---|---|---|---|
| Very Small (0.0001) | Sharp and flat | Often worse (sharp minima) | Very stable |
| Medium (0.01) | Moderately flat | Good balance | Stable |
| Large (0.1) | Only flat minima | Often better | May oscillate |
| Very Large (1.0+) | Very flat or none | Unstable | Divergence risk |
Understanding SGD's implicit bias transforms how we approach neural network training. Rather than treating optimization as a black box, we can deliberately leverage these effects.
Hyperparameter selection through the lens of implicit regularization:
Every hyperparameter choice affects the implicit regularization strength:
| Hyperparameter | Higher Value | Effect on Implicit Regularization |
|---|---|---|
| Learning rate | Higher | Stronger (more noise, sharpness filter) |
| Batch size | Larger | Weaker (less noise) |
| Weight decay | Higher | (Explicit, but interacts with implicit) |
| Momentum | Higher | Complex (reduces noise along trajectory) |
| Dropout | Higher | (Explicit, but affects optimization path) |
When increasing batch size by a factor of k, increase the learning rate by the same factor k to maintain similar optimization dynamics. This preserves the η/B ratio that governs SGD noise. Facebook validated this empirically on ImageNet: training with batch size 8192 and learning rate 3.2 (32 × 0.1) closely matches training with batch size 256 at learning rate 0.1.
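A minimal sketch of this rule as a helper function; the function name and reference values are illustrative, not taken from any particular library.

```python
def scaled_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate linearly with batch size to keep η/B fixed."""
    return base_lr * (new_batch_size / base_batch_size)

for bs in [256, 1024, 8192]:
    lr = scaled_learning_rate(base_lr=0.1, base_batch_size=256, new_batch_size=bs)
    print(f"batch size {bs:5d} -> learning rate {lr:.2f}  (η/B = {lr / bs:.6f})")
```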
When to rely on implicit vs. explicit regularization:
Implicit regularization from SGD is 'free'—it comes from the optimization algorithm you're already using. However, it has limitations:
Strengths of implicit regularization:
It costs nothing extra: no additional penalty terms, hyperparameters, or computation beyond the SGD training you are already running, and it adapts automatically to the model and data through the optimization dynamics.
Limitations:
Its strength is coupled to the learning rate and batch size, so it cannot be tuned independently of optimization speed, it is harder to control precisely than an explicit penalty, and it may not be sufficient on its own for small or noisy datasets.
Practical guidance:
Start with SGD and reasonable hyperparameters; let implicit regularization do its work
Add explicit regularization (weight decay, dropout) if validation performance plateaus
Tune the ratio η/B before adding new forms of regularization
Use early stopping as a complementary implicit regularizer (covered in next page)
Optimizer choice matters:
Different optimizers exhibit different implicit biases:
SGD: Strong implicit regularization through noise; biased toward minimum norm / max margin
SGD + Momentum: Reduces noise variance, potentially weakening regularization; adds directional smoothing
Adam: Adaptive learning rates per parameter; can have different implicit biases; generally considered to regularize less than SGD
AdamW: Decouples weight decay, recovering some SGD-like properties
Research consistently shows SGD often generalizes better than Adam on image classification tasks, despite Adam converging faster. This is attributed to SGD's stronger implicit regularization. For this reason, many practitioners use Adam for faster initial progress, then switch to SGD for final fine-tuning.
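As a hedged sketch of that last practice, here is a minimal PyTorch-style example of switching from Adam to SGD partway through training; the toy model, random data, epoch split, and hyperparameters are placeholder assumptions rather than a recommended recipe.

```python
import torch
import torch.nn as nn

# Minimal sketch (assumed setup): a toy model and random data stand in for a
# real training pipeline.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
X = torch.randn(512, 20)
y = torch.randint(0, 2, (512,))

def train_epochs(optimizer, num_epochs):
    for _ in range(num_epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), y)   # full-batch step, purely illustrative
        loss.backward()
        optimizer.step()
    return loss.item()

# Phase 1: Adam for fast initial progress.
adam = torch.optim.Adam(model.parameters(), lr=1e-3)
print("loss after Adam phase:", round(train_epochs(adam, 50), 4))

# Phase 2: switch to SGD with momentum for its stronger implicit regularization.
sgd = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
print("loss after SGD phase: ", round(train_epochs(sgd, 50), 4))
```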
We've explored how Stochastic Gradient Descent, the foundational algorithm of deep learning, acts as an implicit regularizer. This phenomenon is central to understanding why deep learning works despite theoretical predictions of overfitting.
You now understand the foundational concept of SGD's implicit bias—how the optimization algorithm itself acts as a regularizer. Next, we'll explore how architectural choices provide another form of implicit regularization, shaping the function classes neural networks can express and prefer.