The Muon optimizer is an advanced optimization algorithm designed for training deep neural networks that combines classical momentum-based updates with a sophisticated Newton-Schulz matrix iteration for preconditioning the update direction. Unlike standard optimizers like Adam or SGD with momentum, Muon leverages matrix-level transformations to accelerate convergence by adapting the effective curvature of the loss landscape.
The Muon optimizer operates in two main phases:
The first phase is a standard momentum update on the gradient:
$$B_{\text{new}} = \mu \cdot B + \nabla_{\theta}$$
where $B$ is the momentum buffer, $\mu$ is the momentum coefficient, and $\nabla_{\theta}$ is the current gradient of the loss with respect to the parameters. Note that the gradient is accumulated into the buffer at full strength, as the worked examples below show.
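Concretely, the momentum update is a purely elementwise operation on two equally shaped matrices. A minimal NumPy sketch (variable names are illustrative, not prescribed by the problem):

```python
import numpy as np

mu = 0.9
B = np.zeros((2, 2))    # momentum buffer, initially zero
grad = np.ones((2, 2))  # current gradient

# Accumulate the gradient into the momentum buffer at full strength.
B_new = mu * B + grad

print(B_new)  # with B = 0, B_new is exactly the gradient
```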
The updated momentum buffer is preconditioned using the Newton-Schulz iteration, which approximates the matrix polar decomposition. This process orthogonalizes the update direction while preserving important directional information.
The Newton-Schulz iteration (order 5) starts from $X_0 = B_{\text{new}}$ and repeats, for $k = 0, \dots, \text{ns\_steps} - 1$:

$$X_{k+1} = a X_k + \left(b\,(X_k X_k^{\top}) + c\,(X_k X_k^{\top})^2\right) X_k$$

with coefficients $(a, b, c) = (3.4445, -4.7750, 2.0315)$.
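The iteration above can be transcribed directly. In the sketch below, the quintic coefficients are the values used by the standard Muon reference implementation, and no per-step spectral normalization is applied; both are assumptions, though the first worked example below is numerically consistent with them:

```python
import numpy as np

# Quintic Newton-Schulz coefficients (standard Muon values; assumed here).
A_C, B_C, C_C = 3.4445, -4.7750, 2.0315

def newton_schulz(X: np.ndarray, steps: int) -> np.ndarray:
    """Apply `steps` order-5 Newton-Schulz iterations to a 2D matrix."""
    wide = X.shape[0] < X.shape[1]
    if wide:                   # wide matrices are transposed first...
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = A_C * X + (B_C * A + C_C * (A @ A)) @ X
    return X.T if wide else X  # ...and transposed back at the end
```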
Finally, the parameters are updated using the preconditioned direction:
$$\theta_{\text{new}} = \theta - \eta \cdot \text{scale} \cdot \text{preconditioned_direction}$$
Implement a function that performs a single Muon optimizer step on a 2D parameter matrix. Your implementation should:

1. Apply the momentum update to obtain B_new from B and grad.
2. Precondition B_new with ns_steps Newton-Schulz iterations, transposing wide matrices (more columns than rows) before iterating and transposing back afterwards.
3. Update the parameters using the learning rate eta, the scale factor, and the preconditioned direction.
4. Return both theta_new and B_new.
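Putting the phases together, one possible end-to-end sketch follows. The aspect-ratio scale rule and the Newton-Schulz coefficients are assumptions carried over from the standard Muon implementation (the scale equals 1 for the square inputs among the examples below):

```python
import numpy as np

# Quintic Newton-Schulz coefficients (standard Muon values; assumed).
NS_A, NS_B, NS_C = 3.4445, -4.7750, 2.0315

def muon_step(theta, B, grad, eta, mu, ns_steps):
    """One Muon step on a 2D parameter matrix. Returns (theta_new, B_new)."""
    theta, B, grad = (np.asarray(m, dtype=float) for m in (theta, B, grad))
    # 1. Momentum update: accumulate the gradient into the buffer.
    B_new = mu * B + grad
    # 2. Newton-Schulz preconditioning; wide matrices are transposed first.
    wide = B_new.shape[0] < B_new.shape[1]
    X = B_new.T if wide else B_new
    for _ in range(ns_steps):
        A = X @ X.T
        X = NS_A * X + (NS_B * A + NS_C * (A @ A)) @ X
    direction = X.T if wide else X
    # 3. Parameter update. The aspect-ratio scale is an assumption taken
    #    from the reference Muon optimizer (it equals 1 for square matrices).
    scale = max(1.0, theta.shape[0] / theta.shape[1]) ** 0.5
    theta_new = theta - eta * scale * direction
    return theta_new, B_new
```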
Example 1:

theta = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.0, 0.0], [0.0, 0.0]]
grad = [[1.0, 1.0], [1.0, 1.0]]
eta = 0.1
mu = 0.9
ns_steps = 2

Expected output:

{"theta_new": [[-4403962.0231, -4403963.0231], [-4403963.0231, -4403962.0231]], "B_new": [[1.0, 1.0], [1.0, 1.0]]}

Explanation: Starting with an identity parameter matrix and zero momentum, the gradient is accumulated into the momentum buffer (B_new = grad, since B was zero). The Newton-Schulz iteration with 2 steps preconditions this update, and the parameters are updated accordingly. The large magnitude of theta_new reflects the amplification effect of the Newton-Schulz preconditioning on the uniform gradient matrix.
Example 2:

theta = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
B = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
grad = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.2, 0.1]]
eta = 0.01
mu = 0.9
ns_steps = 3

Expected output:

{"theta_new": [[-9.788e+59, -1.161e+60, -9.788e+59], [-1.161e+60, -1.377e+60, -1.161e+60], [-9.788e+59, -1.161e+60, -9.788e+59]], "B_new": [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.2, 0.1]]}

Explanation: A 3×3 symmetric parameter matrix with a small gradient. The momentum buffer accumulates the gradient values directly (since B starts at zero). With 3 Newton-Schulz iterations, the preconditioning significantly amplifies the update magnitude. The symmetry of the input gradient is preserved in the output update.
Example 3:

theta = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
grad = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
eta = 0.05
mu = 0.95
ns_steps = 2

Expected output:

{"theta_new": [[-462332534.4221, -462332533.4221, -462332532.4221], [-462332531.4221, -462332530.4221, -462332529.4221]], "B_new": [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]}

Explanation: This example demonstrates the optimizer handling a non-square (wide) 2×3 matrix. The algorithm transposes wide matrices before applying the Newton-Schulz iteration, then transposes back. The uniform gradient produces a structured update, and the relative differences between entries of theta_new preserve the original structure of theta.
Constraints