L2 and L1 regularization apply "soft" constraints—they penalize large weights but don't prevent them. A sufficiently strong gradient signal can still push weights to extreme values, particularly early in training when gradients may be large and varied.
Max-norm constraints take a different approach: they impose hard limits on weight magnitudes. After each gradient update, weights are projected back onto a constraint set if they exceed the specified bound. This guarantees that weight norms never exceed a maximum value $c$, regardless of gradient magnitudes.
This technique is particularly valuable for:
- Networks trained with dropout, especially at aggressive dropout rates
- Training with large learning rates, where a single update can move weights far
- Any setting that requires a guaranteed bound on weight magnitudes
This page covers max-norm constraints comprehensively: the mathematical formulation, projection operations, relationship to constrained optimization, implementation with different norm types, interaction with dropout, and practical usage guidelines.
Max-norm regularization formulates training as a constrained optimization problem:
$$\min_{\boldsymbol{\theta}} \mathcal{L}_{\text{data}}(\boldsymbol{\theta}) \quad \text{subject to} \quad \|\mathbf{w}_j\|_2 \leq c \quad \forall j$$
where $\mathbf{w}_j$ is the weight vector of the $j$-th unit (e.g., incoming weights to a neuron), and $c$ is the maximum allowed norm.
Types of Max-Norm:
- Incoming weight norm: Constrain the L2 norm of the weights entering each neuron: $$\|\mathbf{W}_{:,j}\|_2 \leq c$$
- Outgoing weight norm: Constrain the weights leaving each neuron: $$\|\mathbf{W}_{i,:}\|_2 \leq c$$
- Full matrix norm: Constrain the Frobenius or spectral norm of the weight matrix (covered separately in spectral normalization)
Constraining per-unit rather than the full weight matrix provides finer control. Each neuron's capacity is bounded independently, preventing any single neuron from dominating while allowing the network as a whole to have high capacity.
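To make this concrete, here is a tiny sketch with made-up numbers: the same 3x2 weight matrix projected under a per-row bound and under a single whole-matrix (Frobenius) bound of the same size $c = 2$. The per-unit constraint touches only the oversized row, while the shared matrix budget shrinks every row because of that one outlier.

```python
import torch

# Made-up 3x2 weight matrix: one oversized row, two small ones
W = torch.tensor([[6.0, 8.0],    # row norm 10.0
                  [0.3, 0.4],    # row norm 0.5
                  [0.0, 0.5]])   # row norm 0.5
c = 2.0

# Per-unit projection: only the first row is rescaled, down to norm exactly c
row_norms = W.norm(p=2, dim=1, keepdim=True)
W_per_unit = W * torch.clamp(c / row_norms, max=1.0)
print(W_per_unit.norm(dim=1))    # tensor([2.0000, 0.5000, 0.5000])

# Whole-matrix (Frobenius) projection with the same bound: every row shrinks,
# and the small rows lose most of their magnitude because of the one outlier
fro_norm = W.norm(p='fro')
W_full = W * min(1.0, (c / fro_norm).item())
print(W_full.norm(dim=1))        # approximately tensor([1.9950, 0.0998, 0.0998])
```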
Max-norm is enforced by projecting weights onto the constraint set after each gradient update.
Projection onto L2 Ball:
For a weight vector $\mathbf{w}$ with constraint $\|\mathbf{w}\|_2 \leq c$:
$$\mathbf{w}_{\text{proj}} = \begin{cases} \mathbf{w} & \text{if } \|\mathbf{w}\|_2 \leq c \\ c \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|_2} & \text{otherwise} \end{cases}$$
This is called rescaling or clipping: if the norm exceeds $c$, we scale the vector down to have norm exactly $c$, preserving its direction.
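For a concrete instance with made-up values: if $c = 1$ and $\mathbf{w} = (3, 4)$, then $\|\mathbf{w}\|_2 = 5 > c$, so the projection returns $\mathbf{w}/5 = (0.6, 0.8)$, which has norm exactly 1 and points in the same direction as $\mathbf{w}$.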
Algorithm:
```python
import numpy as np
import torch
import torch.nn as nn

def project_to_max_norm(weights, max_norm):
    """
    Project weight vectors onto the L2 ball of radius max_norm.

    Args:
        weights: Weight matrix of shape (fan_out, fan_in).
            Each row is the incoming weights to one neuron.
        max_norm: Maximum L2 norm per neuron.

    Returns:
        Projected weights.
    """
    # Compute the norm of each row (incoming weights to each neuron)
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    # Scaling factor: min(1, max_norm / norm)
    # This equals 1 if norm <= max_norm, else max_norm / norm
    scale = np.clip(max_norm / (norms + 1e-8), 0, 1)
    return weights * scale

def apply_max_norm_constraint(model, max_norm):
    """
    Apply the max-norm constraint to all Linear layers in a model.
    Should be called AFTER optimizer.step().
    """
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear):
                # Weight shape: (out_features, in_features)
                # Each row = incoming weights to one output neuron
                norms = module.weight.norm(p=2, dim=1, keepdim=True)
                scale = torch.clamp(max_norm / norms, max=1.0)
                module.weight.mul_(scale)

# Example usage in a training loop
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
max_norm = 3.0  # Typical values: 1-5

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # Apply the max-norm constraint AFTER the update
    apply_max_norm_constraint(model, max_norm)
```

Max-norm constraints and L2 regularization are related through constrained optimization theory.
KKT Conditions:
For the constrained problem with an L2-ball constraint, the Karush-Kuhn-Tucker (KKT) conditions show that at optimality there exists a Lagrange multiplier $\mu \geq 0$ such that:
$$\nabla \mathcal{L}_{\text{data}} + \mu \mathbf{w} = 0$$
This is exactly the stationarity condition of the L2-regularized loss with $\lambda = \mu$. When the constraint is active ($\|\mathbf{w}\|_2 = c$), $\mu > 0$; when it is inactive, $\mu = 0$ and the weights feel no penalty at all.
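One way to see this, as a quick sketch: write the per-unit constraint in squared form (with a factor of $\tfrac{1}{2}$ absorbed into the multiplier) and form the Lagrangian

$$\mathcal{L}(\mathbf{w}, \mu) = \mathcal{L}_{\text{data}}(\mathbf{w}) + \frac{\mu}{2}\left(\|\mathbf{w}\|_2^2 - c^2\right), \qquad \mu \geq 0.$$

Setting $\nabla_{\mathbf{w}} \mathcal{L} = 0$ recovers the stationarity condition above, and complementary slackness, $\mu \left(\|\mathbf{w}\|_2^2 - c^2\right) = 0$, is what forces $\mu = 0$ whenever the constraint is inactive.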
Key Difference:
Max-norm is more selective: it only acts on weight vectors that actually exceed the bound and leaves everything else untouched, whereas L2 shrinks every weight on every step. The table below summarizes the differences, and a small numeric sketch follows it.
| Aspect | L2 Regularization | Max-Norm Constraint |
|---|---|---|
| Type | Soft penalty | Hard constraint |
| Weights > bound | Penalized but allowed | Impossible (projected out) |
| Weights < bound | Still penalized (toward zero) | No penalty |
| Effect | Shrinks all weights | Only clips outliers |
| Hyperparameter | λ (penalty strength) | c (max norm) |
| Stability | Can still explode if λ small | Guaranteed bounded |
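To make the table concrete, here is a minimal sketch with made-up values ($c = 2$, two illustrative weight vectors, decoupled weight decay with $\lambda = 0.1$ and learning rate 1): weight decay shrinks both vectors regardless of their size, while the max-norm projection leaves the small one untouched and clips only the large one.

```python
import torch

c = 2.0
w_small = torch.tensor([0.3, 0.4])   # norm 0.5, well inside the bound
w_large = torch.tensor([3.0, 4.0])   # norm 5.0, outside the bound

# One decoupled weight-decay step, w <- w * (1 - lr * lambda), with lr=1, lambda=0.1:
# both vectors shrink, and the large one is still outside the bound
lam = 0.1
print(w_small * (1 - lam))   # tensor([0.2700, 0.3600])
print(w_large * (1 - lam))   # tensor([2.7000, 3.6000])

# Max-norm projection: the small vector is unchanged, the large one is
# rescaled to norm exactly c while keeping its direction
def project(w, c):
    norm = w.norm(p=2)
    return w if norm <= c else w * (c / norm)

print(project(w_small, c))   # tensor([0.3000, 0.4000])
print(project(w_large, c))   # tensor([1.2000, 1.6000])
```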
Max-norm constraints are particularly effective when combined with dropout. This combination was emphasized in the original dropout paper (Srivastava et al., 2014).
Why They Work Well Together:
Dropout injects noise into every gradient estimate and is typically trained with large learning rates and high momentum; without a bound, occasional large updates can push individual weights to extreme values. Max-norm caps every unit's weight vector after each step, so training can explore aggressively without any weights blowing up.
Recommended Usage:
When using dropout rates above 0.3, add a max-norm constraint (typical $c$ of 3-4; see the guideline table at the end of this page) and re-apply it after every optimizer step, as in the example below.
```python
import torch
import torch.nn as nn

class DropoutWithMaxNorm(nn.Module):
    """
    Network combining dropout with max-norm constraints.
    Recommended when using aggressive dropout rates.
    """
    def __init__(self, input_dim, hidden_dim, output_dim,
                 dropout_rate=0.5, max_norm=4.0):
        super().__init__()
        self.max_norm = max_norm
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout_rate),
            nn.Linear(hidden_dim, output_dim)
        )

    def forward(self, x):
        return self.layers(x)

    def apply_max_norm(self):
        """Apply the max-norm constraint to all Linear layers."""
        with torch.no_grad():
            for module in self.modules():
                if isinstance(module, nn.Linear):
                    norms = module.weight.norm(p=2, dim=1, keepdim=True)
                    scale = torch.clamp(self.max_norm / norms, max=1.0)
                    module.weight.mul_(scale)

# Training loop
model = DropoutWithMaxNorm(784, 1024, 10, dropout_rate=0.5, max_norm=4.0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
        # Apply max-norm after each update
        model.apply_max_norm()
```

The original dropout paper found that max-norm ($c = 4$) combined with dropout ($p = 0.5$) and high momentum (0.9-0.99) allowed training with very large learning rates, achieving faster convergence and better results than either technique alone.
Max-norm constraints can be applied in various ways beyond simple per-neuron L2 norms.
```python
import torch
import torch.nn as nn

def apply_column_max_norm(weight, max_norm):
    """Constrain incoming weights to each neuron (columns of W^T)."""
    # For a Linear layer the weight shape is (out, in):
    # each ROW is the incoming weights to one output neuron
    norms = weight.norm(p=2, dim=1, keepdim=True)
    scale = torch.clamp(max_norm / norms, max=1.0)
    return weight * scale

def apply_row_max_norm(weight, max_norm):
    """Constrain outgoing weights from each neuron (columns of W)."""
    # Each COLUMN is the outgoing weights from one input neuron
    norms = weight.norm(p=2, dim=0, keepdim=True)
    scale = torch.clamp(max_norm / norms, max=1.0)
    return weight * scale

def apply_frobenius_max_norm(weight, max_norm):
    """Constrain the Frobenius norm of the entire matrix."""
    norm = weight.norm(p='fro')
    if norm > max_norm:
        return weight * (max_norm / norm)
    return weight

def apply_layerwise_max_norm(model, max_norms):
    """
    Apply a different max-norm to each layer.

    Args:
        model: Neural network.
        max_norms: Dict mapping layer index to max_norm value,
            e.g., {0: 3.0, 1: 4.0, 2: 5.0}
    """
    with torch.no_grad():
        for i, module in enumerate(model.modules()):
            if isinstance(module, nn.Linear) and i in max_norms:
                norms = module.weight.norm(p=2, dim=1, keepdim=True)
                scale = torch.clamp(max_norms[i] / norms, max=1.0)
                module.weight.mul_(scale)
```

Practical guidelines for choosing the bound $c$:

| Scenario | Max-norm $c$ | Notes |
|---|---|---|
| Standard MLP | 3-5 | Original dropout paper recommendation |
| High dropout (0.5+) | 3-4 | Tighter constraint for stability |
| Low dropout (0.1-0.3) | 4-5 | Looser constraint acceptable |
| Large learning rate | 2-3 | Tighter to prevent explosion |
| Fine-tuning | 5-10 | Allow larger deviations from init |
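When in doubt, it can help to log how many units actually sit on the constraint boundary during training. A minimal diagnostic sketch, assuming a model built from `nn.Linear` layers as in the examples above (the function name and the `tol` threshold are just illustrative choices):

```python
import torch
import torch.nn as nn

def max_norm_stats(model, max_norm, tol=1e-3):
    """Per-layer weight-norm statistics and the fraction of units whose
    incoming-weight norm sits at (or within tol of) the max-norm bound."""
    stats = {}
    with torch.no_grad():
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                norms = module.weight.norm(p=2, dim=1)
                stats[name] = {
                    "max_observed_norm": norms.max().item(),
                    "mean_norm": norms.mean().item(),
                    # Fraction of neurons clipped by the last projection
                    "frac_at_bound": (norms >= max_norm - tol).float().mean().item(),
                }
    return stats
```

If `frac_at_bound` stays close to 1.0 for every layer, $c$ is probably too tight; if it stays at 0.0 throughout training, the constraint is never active and could be loosened or dropped.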
You now understand max-norm constraints as hard bounds on weight magnitudes. While less common in modern architectures with built-in normalization, max-norm remains valuable for MLPs with dropout and situations requiring guaranteed weight bounds. Next, we explore spectral normalization—a more sophisticated technique that constrains the spectral norm (largest singular value) of weight matrices.