Every machine learning model defines a loss landscape—a high-dimensional surface where the height at each point represents the model's error for a particular parameter configuration. Training is a journey across this landscape, seeking valleys (low loss) while avoiding getting stuck on plateaus or trapped in suboptimal basins.
Understanding optimization landscapes transforms how we think about training. Why does SGD work on non-convex neural networks? Why do wider networks train more easily? Why does learning rate matter so much? The answers lie in landscape geometry—the shapes, curvatures, and topological features of loss surfaces that determine whether optimization succeeds or fails.
By the end of this page, you will understand the geometry of convex vs non-convex loss surfaces, characterize critical points (minima, maxima, saddle points), analyze how landscape properties affect optimization dynamics, interpret recent theoretical results on neural network landscapes, and apply this understanding to diagnose and improve training.
Loss landscapes exist in extremely high-dimensional spaces—a neural network with 100 million parameters has a 100-million-dimensional loss surface. Direct visualization is impossible, but we can gain insight through dimensionality reduction techniques.
1D and 2D Linear Subspace Projections:
Choose two directions d₁, d₂ in parameter space (random directions, eigenvectors of the Hessian, or the training trajectory). Plot:
f(α, β) = L(θ* + α·d₁ + β·d₂)
for a grid of (α, β) values. This gives a 2D slice through the high-dimensional landscape.
Filter normalization:
For neural networks, raw random directions produce misleading visualizations because network scale varies across layers. Filter-normalized directions scale each layer's direction by the layer's weight norm, providing comparable perturbations across the network.
```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from typing import Callable, List, Tuple


def visualize_loss_landscape_2d(
    loss_fn: Callable[[np.ndarray], float],
    center: np.ndarray,
    direction1: np.ndarray,
    direction2: np.ndarray,
    range_x: Tuple[float, float] = (-1, 1),
    range_y: Tuple[float, float] = (-1, 1),
    resolution: int = 50
) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Create 2D landscape visualization by projecting onto two directions.

    Parameters:
    -----------
    loss_fn : callable
        Loss function L(θ) -> scalar
    center : np.ndarray
        Center point θ* (typically a trained solution)
    direction1 : np.ndarray
        First direction for x-axis
    direction2 : np.ndarray
        Second direction for y-axis
    range_x, range_y : tuple
        Ranges for α and β coefficients
    resolution : int
        Grid resolution

    Returns:
    --------
    X, Y, Z : np.ndarray
        Meshgrid and loss values for plotting
    """
    alphas = np.linspace(range_x[0], range_x[1], resolution)
    betas = np.linspace(range_y[0], range_y[1], resolution)
    X, Y = np.meshgrid(alphas, betas)
    Z = np.zeros_like(X)

    for i in range(resolution):
        for j in range(resolution):
            theta = center + X[i, j] * direction1 + Y[i, j] * direction2
            Z[i, j] = loss_fn(theta)

    return X, Y, Z


def random_normalized_direction(shape: tuple, seed: int = None) -> np.ndarray:
    """Generate random unit direction vector."""
    if seed is not None:
        np.random.seed(seed)
    direction = np.random.randn(*shape)
    return direction / np.linalg.norm(direction)


def filter_normalize_direction(
    direction: List[np.ndarray],
    weights: List[np.ndarray]
) -> List[np.ndarray]:
    """
    Filter normalization for neural network directions.

    Scales each layer's direction by the magnitude of that layer's weights.
    This ensures perturbations have comparable effect across layers.

    d_i <- d_i * ||W_i|| / ||d_i||
    """
    normalized = []
    for d, w in zip(direction, weights):
        d_norm = np.linalg.norm(d)
        w_norm = np.linalg.norm(w)
        if d_norm > 1e-10:
            normalized.append(d * (w_norm / d_norm))
        else:
            normalized.append(d)
    return normalized


# Example: Visualize a simple 2D function
def example_landscape():
    """Visualize various loss landscape characteristics."""
    # Convex quadratic
    def convex_loss(x):
        return 0.5 * (x[0]**2 + 10*x[1]**2)

    # Non-convex with multiple minima
    def multimodal_loss(x):
        return (x[0]**2 + x[1]**2) * (1 + 0.5*np.sin(5*x[0])*np.sin(5*x[1]))

    # Saddle point
    def saddle_loss(x):
        return x[0]**2 - x[1]**2

    # Rosenbrock (banana valley)
    def rosenbrock_loss(x):
        return (1 - x[0])**2 + 100*(x[1] - x[0]**2)**2

    fig, axes = plt.subplots(2, 2, figsize=(12, 10),
                             subplot_kw={'projection': '3d'})

    losses = [
        (convex_loss, "Convex", (-2, 2)),
        (multimodal_loss, "Non-convex Multi-modal", (-2, 2)),
        (saddle_loss, "Saddle Point", (-2, 2)),
        (rosenbrock_loss, "Rosenbrock Valley", (-2, 2))
    ]

    for ax, (loss, title, rng) in zip(axes.flat, losses):
        X, Y, Z = visualize_loss_landscape_2d(
            loss,
            center=np.array([0.0, 0.0]),
            direction1=np.array([1.0, 0.0]),
            direction2=np.array([0.0, 1.0]),
            range_x=rng, range_y=rng,
            resolution=100
        )
        ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8)
        ax.set_title(title)

    plt.tight_layout()
    return fig


if __name__ == "__main__":
    fig = example_landscape()
    plt.show()
```

Sharp, narrow minima appear as deep wells. Flat minima appear as broad valleys. Saddle points show curvature in opposite directions. But remember: these are 2D projections of extremely high-dimensional spaces. A minimum might appear sharp in one projection but flat in another.
A critical point (or stationary point) is where the gradient vanishes: ∇f(x*) = 0. Critical points are classified by the Hessian matrix H(x*) = ∇²f(x*).
Classification via Hessian eigenvalues:
For a critical point x* with Hessian H:

- All eigenvalues positive: x* is a local minimum.
- All eigenvalues negative: x* is a local maximum.
- Both positive and negative eigenvalues: x* is a saddle point.
- Some eigenvalues (numerically) zero: the point is degenerate, and second-order information alone cannot classify it.
Index of a saddle point:
The index is the number of negative eigenvalues—the number of directions with negative curvature (directions of descent). A saddle of index k has k 'downward' directions.
```python
import numpy as np
from typing import Tuple, Optional


def classify_critical_point(
    hessian: np.ndarray,
    tol: float = 1e-8
) -> Tuple[str, dict]:
    """
    Classify a critical point based on Hessian eigenvalues.

    Parameters:
    -----------
    hessian : np.ndarray
        Hessian matrix at the critical point (n, n)
    tol : float
        Tolerance for considering eigenvalue as zero

    Returns:
    --------
    classification : str
        'local_minimum', 'local_maximum', 'saddle', or 'degenerate'
    info : dict
        Eigenvalue statistics
    """
    eigenvalues = np.linalg.eigvalsh(hessian)

    n_positive = np.sum(eigenvalues > tol)
    n_negative = np.sum(eigenvalues < -tol)
    n_zero = np.sum(np.abs(eigenvalues) <= tol)

    info = {
        'eigenvalues': eigenvalues,
        'n_positive': n_positive,
        'n_negative': n_negative,
        'n_zero': n_zero,
        'min_eigenvalue': eigenvalues.min(),
        'max_eigenvalue': eigenvalues.max(),
        'condition_number': eigenvalues.max() / (eigenvalues.min() + 1e-10)
    }

    n = len(eigenvalues)
    if n_zero > 0:
        classification = 'degenerate'
    elif n_positive == n:
        classification = 'local_minimum'
    elif n_negative == n:
        classification = 'local_maximum'
    else:
        classification = 'saddle'
        info['saddle_index'] = n_negative

    return classification, info


def escape_saddle_direction(
    hessian: np.ndarray,
    tol: float = 1e-8
) -> Optional[np.ndarray]:
    """
    Find escape direction from saddle point.

    Returns the eigenvector corresponding to the most negative eigenvalue.
    Moving in this direction (either way) decreases the function.
    """
    eigenvalues, eigenvectors = np.linalg.eigh(hessian)
    min_idx = np.argmin(eigenvalues)
    min_eigval = eigenvalues[min_idx]

    if min_eigval < -tol:
        return eigenvectors[:, min_idx]
    else:
        return None  # Not a saddle point


def hessian_spectral_analysis(hessian: np.ndarray) -> dict:
    """
    Comprehensive spectral analysis of Hessian.

    Useful for understanding curvature structure.
    """
    eigenvalues = np.sort(np.linalg.eigvalsh(hessian))

    return {
        'min': eigenvalues[0],
        'max': eigenvalues[-1],
        'mean': np.mean(eigenvalues),
        'median': np.median(eigenvalues),
        'std': np.std(eigenvalues),
        'condition_number': eigenvalues[-1] / (abs(eigenvalues[0]) + 1e-10),
        'fraction_negative': np.mean(eigenvalues < 0),
        'trace': np.sum(eigenvalues),  # = Laplacian
        'det_sign': np.sign(np.prod(np.sign(eigenvalues)))
    }


# Example
if __name__ == "__main__":
    # Saddle point example: f(x,y) = x² - y²
    # Hessian at origin: [[2, 0], [0, -2]]
    H_saddle = np.array([[2.0, 0.0], [0.0, -2.0]])

    cls, info = classify_critical_point(H_saddle)
    print(f"Classification: {cls}")
    print(f"Saddle index: {info.get('saddle_index', 'N/A')}")

    escape_dir = escape_saddle_direction(H_saddle)
    print(f"Escape direction: {escape_dir}")

    # Neural network-like Hessian (many near-zero eigenvalues)
    np.random.seed(42)
    n = 100
    H_random = np.random.randn(n, n)
    H_random = (H_random + H_random.T) / 2  # Symmetrize

    # Make it mostly flat with a few sharp directions
    eigenvalues, eigenvectors = np.linalg.eigh(H_random)
    eigenvalues_modified = np.concatenate([
        np.random.randn(10) * 10,    # 10 directions with significant curvature
        np.random.randn(90) * 0.01   # 90 nearly flat directions
    ])
    H_nn_like = eigenvectors @ np.diag(eigenvalues_modified) @ eigenvectors.T

    spectral = hessian_spectral_analysis(H_nn_like)
    print(f"\nNeural-network-like Hessian:")
    print(f"  Condition number: {spectral['condition_number']:.1f}")
    print(f"  Fraction negative: {spectral['fraction_negative']:.1%}")
```

Saddle points dominate in high dimensions:
A remarkable result: in high-dimensional spaces, saddle points vastly outnumber local minima. Consider a random function with n parameters. At a critical point, each Hessian eigenvalue is independently likely to be positive or negative. The probability of all n being positive (local minimum) is exponentially small: ~2⁻ⁿ.
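A small sketch of this counting argument, under the simplifying model in the text (an illustrative experiment, not part of the original page): sample random symmetric matrices as stand-in Hessians and measure how often every eigenvalue is positive.

```python
import numpy as np

def fraction_local_minima(n: int, n_trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random symmetric 'Hessians' whose eigenvalues are all positive."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_trials):
        A = rng.standard_normal((n, n))
        H = (A + A.T) / 2  # random symmetric matrix as a toy Hessian
        if np.linalg.eigvalsh(H).min() > 0:
            count += 1
    return count / n_trials

if __name__ == "__main__":
    for n in [1, 2, 4, 8]:
        print(f"n={n}: fraction classified as local minima ≈ {fraction_local_minima(n):.4f}")
    # The fraction collapses toward zero as n grows: in this toy model,
    # almost every critical point of a high-dimensional function is a saddle.
```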
For neural networks with millions of parameters, almost all critical points are saddle points. The few local minima that exist tend to have similar loss values (the 'loss surface is surprisingly flat' phenomenon).
Implications:

- Getting trapped in a genuinely bad local minimum is rarely the practical obstacle in high dimensions.
- The real challenges are escaping the abundant saddle points and crossing near-flat plateaus quickly.
A 'strict saddle' has at least one negative Hessian eigenvalue. Recent theory shows that gradient descent with random initialization almost surely escapes strict saddles. However, this can take exponentially long! Adding noise (SGD) or using second-order information dramatically accelerates escape.
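To see why noise helps, here is a toy sketch (not from the original text): start plain gradient descent exactly on the saddle of f(x, y) = x² − y², where the gradient vanishes, and compare it with a noise-injected version.

```python
import numpy as np

def grad(p: np.ndarray) -> np.ndarray:
    """Gradient of the saddle function f(x, y) = x^2 - y^2."""
    return np.array([2 * p[0], -2 * p[1]])

def descend(start: np.ndarray, lr: float = 0.1, noise: float = 0.0,
            steps: int = 50, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    p = start.astype(float).copy()
    for _ in range(steps):
        g = grad(p)
        if noise > 0:
            g = g + noise * rng.standard_normal(2)  # SGD-style gradient noise
        p -= lr * g
    return p

if __name__ == "__main__":
    start = np.array([0.0, 0.0])                     # exactly at the saddle point
    print("plain GD:", descend(start))               # stays stuck: the gradient is zero
    print("noisy GD:", descend(start, noise=0.01))   # drifts away along ±y, so the loss decreases
```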
The distinction between convex and non-convex landscapes fundamentally shapes what guarantees we can make about optimization.
Convex landscapes: every local minimum is a global minimum, and gradient descent with an appropriate step size provably converges to it.
Non-convex landscapes: minima, maxima, and saddle points can all coexist, so gradient methods are only guaranteed to reach a critical point, and the starting point matters. The table below summarizes the contrast; a small numerical sketch of the initialization effect follows it.
| Property | Convex | Non-Convex |
|---|---|---|
| Local minima | Unique global | Multiple possible |
| Saddle points | None (for strict convex) | Abundant in high-D |
| Gradient descent | Converges to optimum | Converges to critical point |
| Initialization | Doesn't matter | Critical for success |
| Convergence rate | Well-characterized | Problem-dependent |
| Global guarantees | Yes | Only for special cases |
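A minimal numerical sketch of the initialization row (an illustrative toy setup, not from the original text): run gradient descent from several starting points on a convex quadratic and on a 1D non-convex function with multiple minima.

```python
import numpy as np

def gd(grad_fn, x0: float, lr: float = 0.01, steps: int = 3000) -> float:
    """Plain 1D gradient descent; returns the final iterate."""
    x = float(x0)
    for _ in range(steps):
        x -= lr * grad_fn(x)
    return x

# Convex: f(x) = x^2 has a unique minimum at 0.
convex_grad = lambda x: 2 * x

# Non-convex: f(x) = x^2 + 2 sin(5x) has several local minima.
nonconvex_loss = lambda x: x**2 + 2 * np.sin(5 * x)
nonconvex_grad = lambda x: 2 * x + 10 * np.cos(5 * x)

if __name__ == "__main__":
    starts = [-2.0, 0.5, 2.0]
    print("convex:    ", [round(gd(convex_grad, s), 4) for s in starts])
    # Every start converges to the same global minimum.
    finals = [gd(nonconvex_grad, s) for s in starts]
    print("non-convex:", [round(x, 4) for x in finals])
    print("losses:    ", [round(nonconvex_loss(x), 4) for x in finals])
    # Different starts land in different local minima with different loss values.
```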
Neural network landscapes are special:
Despite being highly non-convex, neural network loss surfaces have surprising structure: most local minima reached in practice have loss close to the global optimum, minima found from different initializations are often connected by low-loss paths (mode connectivity, revisited below), and overparameterization produces wide basins rather than isolated sharp pits.
This structure explains why SGD, a simple first-order method, successfully trains massive neural networks—the landscape is non-convex but 'friendly' to gradient-based optimization.
The landscape perspective illuminates the lottery ticket hypothesis: randomly initialized networks contain subnetworks that, when trained alone, match full network performance. These 'winning tickets' correspond to favorable regions of the initial landscape that gradient descent can reach.
One of the most intriguing connections in deep learning is between the geometry of minima and generalization performance. The hypothesis: flat minima generalize better than sharp minima.
Intuition:
Consider the loss surface around a minimum. If the minimum is sharp, small parameter perturbations cause large increases in loss; if it is flat, the loss stays low throughout a wide neighborhood around the solution.
During training, we find parameters θ that minimize training loss. But test data comes from a slightly different distribution. Flat minima are more likely to generalize because they're less sensitive to this distributional shift.
Measuring sharpness:
Several metrics have been proposed: the worst-case loss increase within an ε-ball around the solution (empirical sharpness), the average loss increase under random perturbations, and Hessian-based quantities such as the largest eigenvalue, trace, or Frobenius norm. The code below sketches each of these.
```python
import numpy as np
from typing import Callable, Tuple


def compute_sharpness_metrics(
    loss_fn: Callable[[np.ndarray], float],
    grad_fn: Callable[[np.ndarray], np.ndarray],
    hess_fn: Callable[[np.ndarray], np.ndarray],
    theta: np.ndarray,
    epsilon: float = 0.01,
    n_directions: int = 100
) -> dict:
    """
    Compute various sharpness metrics at a point.

    Parameters:
    -----------
    loss_fn : callable
        Loss function L(θ)
    grad_fn : callable
        Gradient function ∇L(θ)
    hess_fn : callable
        Hessian function ∇²L(θ) (optional, can be expensive)
    theta : np.ndarray
        Point to analyze
    epsilon : float
        Perturbation radius for empirical sharpness
    n_directions : int
        Number of random directions for Monte Carlo estimates
    """
    n = len(theta)
    base_loss = loss_fn(theta)

    # 1. Empirical sharpness (max loss in ε-ball)
    max_perturbed_loss = base_loss
    for _ in range(n_directions):
        direction = np.random.randn(n)
        direction = direction / np.linalg.norm(direction) * epsilon
        perturbed_loss = loss_fn(theta + direction)
        max_perturbed_loss = max(max_perturbed_loss, perturbed_loss)

    empirical_sharpness = max_perturbed_loss - base_loss

    # 2. Average sensitivity (expected loss increase)
    avg_increase = 0.0
    for _ in range(n_directions):
        direction = np.random.randn(n)
        direction = direction / np.linalg.norm(direction) * epsilon
        avg_increase += loss_fn(theta + direction) - base_loss
    avg_increase /= n_directions

    # 3. Hessian-based metrics (if available)
    hessian_metrics = {}
    if hess_fn is not None:
        try:
            H = hess_fn(theta)
            eigenvalues = np.linalg.eigvalsh(H)
            hessian_metrics = {
                'hessian_max_eigval': eigenvalues.max(),
                'hessian_trace': np.sum(eigenvalues),
                'hessian_frobenius': np.sqrt(np.sum(eigenvalues**2)),
                'n_negative_eigenvalues': np.sum(eigenvalues < 0),
            }
        except np.linalg.LinAlgError:
            pass

    return {
        'base_loss': base_loss,
        'empirical_sharpness': empirical_sharpness,
        'avg_perturbation_sensitivity': avg_increase,
        **hessian_metrics
    }


def sam_sharpness(
    loss_fn: Callable[[np.ndarray], float],
    grad_fn: Callable[[np.ndarray], np.ndarray],
    theta: np.ndarray,
    rho: float = 0.05
) -> Tuple[float, np.ndarray]:
    """
    Sharpness-Aware Minimization (SAM) sharpness and perturbed gradient.

    SAM minimizes: max_{||δ||≤ρ} L(θ + δ)

    The adversarial perturbation is: δ* = ρ * ∇L(θ) / ||∇L(θ)||
    """
    grad = grad_fn(theta)
    grad_norm = np.linalg.norm(grad)

    if grad_norm == 0:
        return loss_fn(theta), grad

    # Adversarial perturbation (first-order approximation to worst case)
    delta = rho * grad / grad_norm

    # Sharpness = loss at perturbed point
    sharpness = loss_fn(theta + delta)

    # SAM gradient: gradient at perturbed point
    sam_grad = grad_fn(theta + delta)

    return sharpness, sam_grad


def compare_minima_sharpness(
    loss_fn: Callable,
    grad_fn: Callable,
    minima: list,
    labels: list,
    epsilon: float = 0.05
):
    """
    Compare sharpness of different local minima.
    """
    print("Minimum Sharpness Comparison")
    print("=" * 60)

    for theta, label in zip(minima, labels):
        metrics = compute_sharpness_metrics(
            loss_fn, grad_fn, None, theta, epsilon
        )
        print(f"\n{label}:")
        print(f"  Base loss: {metrics['base_loss']:.6f}")
        print(f"  Empirical sharpness (ε={epsilon}): {metrics['empirical_sharpness']:.6f}")
        print(f"  Avg perturbation effect: {metrics['avg_perturbation_sensitivity']:.6f}")
```

Sharpness-Aware Minimization (SAM):
SAM explicitly optimizes for flat minima by replacing the standard loss with:
L_SAM(θ) = max_{‖δ‖≤ρ} L(θ + δ)
This finds parameters θ where the worst-case loss in a ρ-neighborhood is minimized. Empirically, SAM improves generalization across many tasks.
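A minimal sketch of the two-part SAM update (illustrative; the toy ill-conditioned quadratic and all parameter values are assumptions, not from the original text): ascend to the approximate worst point in a ρ-ball, then descend using the gradient evaluated there.

```python
import numpy as np
from typing import Callable, Tuple

def sam_step(loss_fn: Callable[[np.ndarray], float],
             grad_fn: Callable[[np.ndarray], np.ndarray],
             theta: np.ndarray,
             lr: float = 0.1,
             rho: float = 0.05) -> Tuple[np.ndarray, np.ndarray]:
    """One SAM update. Returns (new theta, gradient used for the update)."""
    g = grad_fn(theta)
    g_norm = np.linalg.norm(g) + 1e-12
    theta_adv = theta + rho * g / g_norm   # ascend to the approximate worst point in the ρ-ball
    g_sam = grad_fn(theta_adv)             # gradient evaluated at the perturbed point
    return theta - lr * g_sam, g_sam

if __name__ == "__main__":
    # Ill-conditioned quadratic: flat along x, sharp along y.
    loss = lambda th: 0.5 * (th[0]**2 + 100.0 * th[1]**2)
    grad = lambda th: np.array([th[0], 100.0 * th[1]])

    theta = np.array([1.0, 0.1])
    new_theta, g_sam = sam_step(loss, grad, theta, lr=0.005, rho=0.05)
    print("plain gradient:", grad(theta))        # [1., 10.]
    print("SAM gradient:  ", np.round(g_sam, 3))
    # The SAM gradient is roughly 50% larger along the sharp y-direction,
    # so the update pushes harder against high-curvature (sharp) directions.
```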
Caveats and controversies: sharpness is not invariant to reparameterization; rescaling the weights of a ReLU network can make a flat minimum look arbitrarily sharp without changing the function the network computes. Different sharpness measures also disagree with one another, and how strongly they correlate with test error remains debated.
Despite controversies, the practical success of SAM and related methods suggests sharpness captures something meaningful about generalization.
Large batch training tends to find sharper minima than small batch training, which may explain why large batches sometimes generalize worse. The noise from small batches helps escape sharp minima, biasing optimization toward flatter regions.
Theoretical understanding of neural network landscapes has advanced significantly, providing insights into why deep learning works despite non-convexity.
Overparameterization and global minima:
For networks with more parameters than training samples (n_params >> n_samples), recent theory shows:

- gradient descent can drive the training loss to (near) zero despite non-convexity,
- under suitable assumptions, spurious local minima with high loss are rare or absent,
- the global minima form a high-dimensional connected set rather than isolated points.

A toy illustration of this regime follows.
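A tiny sketch of the overparameterized regime (an assumed toy setup, not from the original text): linear regression with more parameters than samples, where gradient descent drives the training loss to essentially zero and different initializations reach different interpolating solutions.

```python
import numpy as np

def train(X: np.ndarray, y: np.ndarray, w0: np.ndarray,
          lr: float = 0.01, steps: int = 5000) -> np.ndarray:
    """Gradient descent on the mean squared error 0.5 * ||Xw - y||^2 / n."""
    w = w0.copy()
    n = len(y)
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_samples, n_params = 20, 100           # far more parameters than data points
    X = rng.standard_normal((n_samples, n_params))
    y = rng.standard_normal(n_samples)

    for run in (1, 2):
        w = train(X, y, w0=0.1 * rng.standard_normal(n_params))
        loss = 0.5 * np.mean((X @ w - y) ** 2)
        print(f"run {run}: train loss = {loss:.2e}, ||w|| = {np.linalg.norm(w):.3f}")
    # Both runs drive the training loss to ~0, but stop at different points
    # on the (n_params - n_samples)-dimensional set of interpolating solutions.
```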
The role of depth:
| Architecture | Landscape Property | Optimization Implication |
|---|---|---|
| Narrow network | Many local minima | May get stuck, initialization sensitive |
| Wide network | Connected minima, few bad minima | Easier optimization, multiple solutions |
| Deep network | Long narrow valleys, vanishing gradients | Needs careful initialization, normalization |
| ResNet (skip connections) | Smoother landscape | Faster training, better gradient flow |
| Batch normalized | Reduced curvature sensitivity | Larger learning rates possible |
```python
import numpy as np
from typing import Callable, List, Tuple


def linear_path_loss(
    loss_fn: Callable[[np.ndarray], float],
    theta1: np.ndarray,
    theta2: np.ndarray,
    n_points: int = 50
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Evaluate loss along linear path between two solutions.

    Used to check mode connectivity.
    """
    alphas = np.linspace(0, 1, n_points)
    losses = []

    for alpha in alphas:
        theta = (1 - alpha) * theta1 + alpha * theta2
        losses.append(loss_fn(theta))

    return alphas, np.array(losses)


def bezier_path_loss(
    loss_fn: Callable[[np.ndarray], float],
    theta1: np.ndarray,
    theta2: np.ndarray,
    bend: np.ndarray,
    n_points: int = 50
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Evaluate loss along quadratic Bezier curve.

    B(t) = (1-t)²P₀ + 2(1-t)t·P₁ + t²P₂

    The 'bend' parameter allows finding non-linear low-loss paths.
    """
    t = np.linspace(0, 1, n_points)
    losses = []

    for ti in t:
        theta = (1 - ti)**2 * theta1 + 2*(1-ti)*ti * bend + ti**2 * theta2
        losses.append(loss_fn(theta))

    return t, np.array(losses)


def find_low_loss_path(
    loss_fn: Callable,
    grad_fn: Callable,
    theta1: np.ndarray,
    theta2: np.ndarray,
    n_bends: int = 1,
    lr: float = 0.01,
    n_iters: int = 1000
) -> Tuple[List[np.ndarray], List[float]]:
    """
    Find a low-loss path between two minima.

    Optimizes intermediate 'bend' points to minimize max loss along path.
    This tests the mode connectivity hypothesis.
    """
    # Initialize bend points along straight line
    bends = []
    for i in range(n_bends):
        alpha = (i + 1) / (n_bends + 1)
        bends.append((1 - alpha) * theta1 + alpha * theta2)

    def path_max_loss(bends_flat):
        bends_list = np.split(bends_flat, n_bends)
        points = [theta1] + bends_list + [theta2]
        max_loss = 0
        for i in range(len(points) - 1):
            # Sample points on segment
            for alpha in np.linspace(0, 1, 10):
                theta = (1 - alpha) * points[i] + alpha * points[i+1]
                max_loss = max(max_loss, loss_fn(theta))
        return max_loss

    # Simple gradient descent on bend points
    bends_flat = np.concatenate(bends)
    history = []

    for _ in range(n_iters):
        # Numerical gradient
        eps = 1e-5
        grad = np.zeros_like(bends_flat)
        f0 = path_max_loss(bends_flat)

        for i in range(len(bends_flat)):
            bends_flat[i] += eps
            grad[i] = (path_max_loss(bends_flat) - f0) / eps
            bends_flat[i] -= eps

        bends_flat -= lr * grad
        history.append(f0)

        if len(history) > 10 and abs(history[-1] - history[-10]) < 1e-6:
            break

    final_bends = list(np.split(bends_flat, n_bends))
    return final_bends, history


def mode_connectivity_test(
    loss_fn: Callable,
    theta1: np.ndarray,
    theta2: np.ndarray,
    threshold: float = 1.1
) -> bool:
    """
    Test if two minima are connected by a low-loss path.

    Returns True if path loss never exceeds threshold * max(L(θ₁), L(θ₂)).
    """
    loss1 = loss_fn(theta1)
    loss2 = loss_fn(theta2)
    base = max(loss1, loss2)

    _, path_losses = linear_path_loss(loss_fn, theta1, theta2)

    if path_losses.max() <= threshold * base:
        return True  # Linear path is good enough

    # Try with bend points
    bends, _ = find_low_loss_path(
        loss_fn, None, theta1, theta2, n_bends=2
    )

    # Evaluate final path
    all_points = [theta1] + bends + [theta2]
    max_loss = 0
    for i in range(len(all_points) - 1):
        for alpha in np.linspace(0, 1, 20):
            theta = (1 - alpha) * all_points[i] + alpha * all_points[i+1]
            max_loss = max(max_loss, loss_fn(theta))

    return max_loss <= threshold * base
```

Different local minima found by training from different initializations are often 'mode connected'—there exist paths between them along which loss stays low. This is evidence that the landscape is more benign than simple non-convex analysis suggests.
Understanding landscapes translates to concrete training strategies:
1. Initialization:
Proper initialization places parameters in a 'good' region of the landscape: Xavier/Glorot and He schemes scale weights so that activations and gradients keep roughly constant variance across layers, avoiding the plateaus and cliffs that come from poorly scaled starting points. A small sketch follows.
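A small sketch of why the scaling matters (illustrative; the sqrt(2/n) rule is the standard He choice for ReLU, and the 'tiny' alternative is an assumed bad baseline): push a random input through a deep ReLU stack and watch the activation scale.

```python
import numpy as np

def forward_scale(n: int = 256, depth: int = 30, scale: str = "he", seed: int = 0) -> float:
    """Return the std of activations after `depth` random ReLU layers."""
    rng = np.random.default_rng(seed)
    std = {"he": np.sqrt(2.0 / n),    # He: approximately preserves variance under ReLU
           "tiny": 1.0 / n}[scale]    # too small: activations collapse layer by layer
    x = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std
        x = np.maximum(W @ x, 0.0)    # ReLU
    return x.std()

if __name__ == "__main__":
    print("He init  :", forward_scale(scale="he"))    # stays O(1) across 30 layers
    print("Tiny init:", forward_scale(scale="tiny"))  # shrinks toward zero (flat plateau)
```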
2. Learning rate:
Learning rate interacts with landscape curvature: on a quadratic bowl with largest Hessian eigenvalue λ_max, gradient descent is stable only for learning rates below 2/λ_max. Rates that are too large diverge along the sharpest direction, while rates that are too small crawl along flat directions, which is why warmup and decay schedules help as local curvature changes during training. The sketch below shows the threshold.
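The stability threshold can be seen directly on a quadratic (a minimal sketch; λ_max here is just the largest coefficient of the toy quadratic): gradient descent contracts for η < 2/λ_max and diverges above it.

```python
import numpy as np

def gd_on_quadratic(lr: float, lam_max: float = 10.0, steps: int = 100) -> float:
    """Run GD on f(x, y) = 0.5 * (x^2 + lam_max * y^2); return final |y|.

    Along the sharp direction the update is y <- (1 - lr * lam_max) * y,
    which contracts only when lr < 2 / lam_max.
    """
    x, y = 1.0, 1.0
    for _ in range(steps):
        x -= lr * x
        y -= lr * lam_max * y
    return abs(y)

if __name__ == "__main__":
    lam_max = 10.0
    for lr in [0.05, 0.19, 0.21]:   # the stability threshold is 2 / lam_max = 0.2
        print(f"lr={lr:.2f}: |y| after 100 steps = {gd_on_quadratic(lr, lam_max):.3e}")
```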
3. Batch size: small batches inject gradient noise that helps escape saddle points and sharp minima (see the callout above), while large batches give more accurate gradients and permit larger learning rates but tend toward sharper solutions. A rough sketch of the noise scaling follows; the table after it maps common training problems to their landscape interpretation.
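A rough illustration of the batch-size effect (a toy linear-regression setup assumed for illustration; the 1/√B scaling is the standard central-limit estimate): estimate gradients from mini-batches of different sizes and measure their deviation from the full-batch gradient.

```python
import numpy as np

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 10_000, 20
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.5 * rng.standard_normal(n)
    w = np.zeros(d)

    # Per-example gradients of 0.5 * (x·w - y)^2 at the current w.
    per_example = (X @ w - y)[:, None] * X
    full_grad = per_example.mean(axis=0)

    for batch in [8, 64, 512, 4096]:
        noise = []
        for _ in range(200):
            idx = rng.choice(n, size=batch, replace=False)
            noise.append(np.linalg.norm(per_example[idx].mean(axis=0) - full_grad))
        print(f"batch {batch:>4}: mean gradient-noise norm = {np.mean(noise):.3f}")
    # The noise shrinks as the batch grows (roughly like 1/sqrt(batch) for small
    # batches): small batches explore, large batches commit to the local descent direction.
```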
| Problem | Landscape Interpretation | Solution |
|---|---|---|
| Vanishing gradients | Near-flat region (plateau) | Skip connections, normalization, different activation |
| Exploding gradients | Extremely steep region | Gradient clipping, layer normalization |
| Poor generalization | Sharp minimum | SAM, weight decay, data augmentation |
| Slow convergence | Ill-conditioned curvature | Adam/momentum, learning rate scheduling |
| Stuck at plateau | Saddle point or flat region | More noise, second-order info, different LR |
| Inconsistent training | Chaotic landscape | Ensemble, regularization, simpler model |
4. Regularization and landscape shape: weight decay adds a quadratic bowl (λ/2)·‖θ‖² to the loss, which shifts every Hessian eigenvalue up by λ; this improves conditioning, removes exactly flat directions, and discourages the large-norm solutions associated with sharp minima. The sketch below shows the effect on the spectrum.
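One way to see this effect (a sketch; the toy spectrum and the λ value are arbitrary illustrations): build an ill-conditioned positive semi-definite "Hessian" with nearly flat directions and compare its spectrum with and without the λ·I shift that weight decay contributes.

```python
import numpy as np

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Toy ill-conditioned PSD Hessian with many nearly flat directions.
    eigs = np.concatenate([np.full(5, 100.0), np.full(45, 1.0), np.full(50, 1e-4)])
    Q, _ = np.linalg.qr(rng.standard_normal((100, 100)))
    H = Q @ np.diag(eigs) @ Q.T

    lam = 0.1                        # weight-decay strength (arbitrary, for illustration)
    H_reg = H + lam * np.eye(100)    # Hessian of L(θ) + (λ/2)·||θ||²

    for name, M in [("no decay", H), ("with decay", H_reg)]:
        e = np.linalg.eigvalsh(M)
        print(f"{name:>10}: min eig = {e.min():.4f}, condition number = {e.max() / e.min():.0f}")
    # Weight decay raises every eigenvalue by λ, shrinking the condition number
    # and eliminating the exactly flat directions.
```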
5. Architecture effects: skip connections and normalization layers reshape the landscape itself (see the table above); residual connections keep gradients flowing through deep networks, and batch/layer normalization reduces sensitivity to curvature, allowing larger learning rates.
When training fails, think about the landscape. Is the gradient too small (plateau)? Too large (explosive region)? Oscillating (sharp curvature)? Tools like gradient histograms, loss curves, and learning rate sensitivity tests help diagnose landscape issues.
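A minimal monitoring sketch along these lines (the class, thresholds, and messages are illustrative assumptions, not standard values or a known library API): track recent gradient norms during training and flag plateau-like or explosive behavior.

```python
import numpy as np
from collections import deque

class GradientMonitor:
    """Track recent gradient norms and flag landscape-related warning signs."""

    def __init__(self, window: int = 100,
                 plateau_tol: float = 1e-6, explode_tol: float = 1e3):
        self.norms = deque(maxlen=window)
        self.plateau_tol = plateau_tol    # illustrative threshold
        self.explode_tol = explode_tol    # illustrative threshold

    def update(self, grad: np.ndarray) -> str:
        self.norms.append(float(np.linalg.norm(grad)))
        recent = np.mean(self.norms)
        if recent < self.plateau_tol:
            return "plateau/saddle suspected: gradient norm is nearly zero"
        if recent > self.explode_tol:
            return "explosive region: consider gradient clipping or a smaller LR"
        if len(self.norms) == self.norms.maxlen and np.std(self.norms) > 5 * recent:
            return "oscillation: step size may be too large for the local curvature"
        return "ok"

if __name__ == "__main__":
    monitor = GradientMonitor()
    rng = np.random.default_rng(0)
    for step in range(300):
        # Simulated gradients that shrink sharply after step 200.
        fake_grad = rng.standard_normal(10) * (1e-7 if step >= 200 else 1.0)
        status = monitor.update(fake_grad)
    print(status)  # reports the plateau once recent gradient norms collapse
```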
The loss landscape perspective provides deep insight into why optimization succeeds or fails. Understanding geometry helps diagnose problems and design better training procedures.
Module Complete:
This concludes the Advanced Optimization module. You've mastered Newton's method and quasi-Newton approximations (BFGS, L-BFGS), coordinate descent and its variants, proximal methods for non-smooth optimization, and the geometry of optimization landscapes. These tools form the complete picture of how machine learning models are trained—from simple convex problems to complex neural networks.
You now understand optimization landscapes: how to visualize them, classify critical points, interpret neural network loss surfaces, and use this knowledge to improve training. Combined with the algorithmic tools from previous pages, you have a comprehensive foundation in advanced optimization for machine learning.