The Muon optimizer is an advanced optimization algorithm designed for training deep neural networks that combines classical momentum-based updates with a sophisticated Newton-Schulz matrix iteration for preconditioning the update direction. Unlike standard optimizers like Adam or SGD with momentum, Muon leverages matrix-level transformations to accelerate convergence by adapting the effective curvature of the loss landscape.
The Muon optimizer operates in two main phases:
The first phase is a standard momentum update on the gradient:
$$B_{\text{new}} = \mu \cdot B + \nabla_{\theta}$$
where $B$ is the momentum buffer, $\mu$ is the momentum coefficient, and $\nabla_{\theta}$ is the current gradient of the loss with respect to the parameters. Note that the gradient is accumulated into the buffer at full strength, as the worked examples below show.
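Concretely, the momentum update is a purely elementwise operation on two equally shaped matrices. A minimal NumPy sketch (variable names are illustrative, not prescribed by the problem):

```python
import numpy as np

mu = 0.9
B = np.zeros((2, 2))    # momentum buffer, initially zero
grad = np.ones((2, 2))  # current gradient

# Accumulate the gradient into the momentum buffer at full strength.
B_new = mu * B + grad

print(B_new)  # with B = 0, B_new is exactly the gradient
```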
The updated momentum buffer is preconditioned using the Newton-Schulz iteration, which approximates the matrix polar decomposition. This process orthogonalizes the update direction while preserving important directional information.
The Newton-Schulz iteration (order 5) starts from $X_0 = B_{\text{new}}$ and repeats, for $k = 0, \dots, \text{ns\_steps} - 1$:

$$X_{k+1} = a X_k + \left(b\,(X_k X_k^{\top}) + c\,(X_k X_k^{\top})^2\right) X_k$$

with coefficients $(a, b, c) = (3.4445, -4.7750, 2.0315)$.
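The iteration above can be transcribed directly. In the sketch below, the quintic coefficients are the values used by the standard Muon reference implementation, and no per-step spectral normalization is applied; both are assumptions, though the first worked example below is numerically consistent with them:

```python
import numpy as np

# Quintic Newton-Schulz coefficients (standard Muon values; assumed here).
A_C, B_C, C_C = 3.4445, -4.7750, 2.0315

def newton_schulz(X: np.ndarray, steps: int) -> np.ndarray:
    """Apply `steps` order-5 Newton-Schulz iterations to a 2D matrix."""
    wide = X.shape[0] < X.shape[1]
    if wide:                   # wide matrices are transposed first...
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = A_C * X + (B_C * A + C_C * (A @ A)) @ X
    return X.T if wide else X  # ...and transposed back at the end
```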
Finally, the parameters are updated using the preconditioned direction:
$$\theta_{\text{new}} = \theta - \eta \cdot \text{scale} \cdot \text{preconditioned_direction}$$
Implement a function that performs a single Muon optimizer step on a 2D parameter matrix. Your implementation should:

1. Apply the momentum update to obtain B_new from B and grad.
2. Precondition B_new with ns_steps Newton-Schulz iterations, transposing wide matrices (more columns than rows) before iterating and transposing back afterwards.
3. Update the parameters using the learning rate eta, the scale factor, and the preconditioned direction.
4. Return both theta_new and B_new.
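Putting the phases together, one possible end-to-end sketch follows. The aspect-ratio scale rule and the Newton-Schulz coefficients are assumptions carried over from the standard Muon implementation (the scale equals 1 for the square inputs among the examples below):

```python
import numpy as np

# Quintic Newton-Schulz coefficients (standard Muon values; assumed).
NS_A, NS_B, NS_C = 3.4445, -4.7750, 2.0315

def muon_step(theta, B, grad, eta, mu, ns_steps):
    """One Muon step on a 2D parameter matrix. Returns (theta_new, B_new)."""
    theta, B, grad = (np.asarray(m, dtype=float) for m in (theta, B, grad))
    # 1. Momentum update: accumulate the gradient into the buffer.
    B_new = mu * B + grad
    # 2. Newton-Schulz preconditioning; wide matrices are transposed first.
    wide = B_new.shape[0] < B_new.shape[1]
    X = B_new.T if wide else B_new
    for _ in range(ns_steps):
        A = X @ X.T
        X = NS_A * X + (NS_B * A + NS_C * (A @ A)) @ X
    direction = X.T if wide else X
    # 3. Parameter update. The aspect-ratio scale is an assumption taken
    #    from the reference Muon optimizer (it equals 1 for square matrices).
    scale = max(1.0, theta.shape[0] / theta.shape[1]) ** 0.5
    theta_new = theta - eta * scale * direction
    return theta_new, B_new
```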
Example 1:

theta = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.0, 0.0], [0.0, 0.0]]
grad = [[1.0, 1.0], [1.0, 1.0]]
eta = 0.1
mu = 0.9
ns_steps = 2

Expected output:

{"theta_new": [[-4403962.0231, -4403963.0231], [-4403963.0231, -4403962.0231]], "B_new": [[1.0, 1.0], [1.0, 1.0]]}

Explanation: Starting with an identity parameter matrix and zero momentum, the gradient is accumulated into the momentum buffer (B_new = grad, since B was zero). The Newton-Schulz iteration with 2 steps preconditions this update, and the parameters are updated accordingly. The large magnitude of theta_new reflects the amplification effect of the Newton-Schulz preconditioning on the uniform gradient matrix.
Example 2:

theta = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
B = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
grad = [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.2, 0.1]]
eta = 0.01
mu = 0.9
ns_steps = 3

Expected output:

{"theta_new": [[-9.788e+59, -1.161e+60, -9.788e+59], [-1.161e+60, -1.377e+60, -1.161e+60], [-9.788e+59, -1.161e+60, -9.788e+59]], "B_new": [[0.1, 0.2, 0.1], [0.2, 0.1, 0.2], [0.1, 0.2, 0.1]]}

Explanation: A 3×3 symmetric parameter matrix with a small gradient. The momentum buffer accumulates the gradient values directly (since B starts at zero). With 3 Newton-Schulz iterations, the preconditioning significantly amplifies the update magnitude. The symmetry of the input gradient is preserved in the output update.
Example 3:

theta = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
grad = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
eta = 0.05
mu = 0.95
ns_steps = 2

Expected output:

{"theta_new": [[-462332534.4221, -462332533.4221, -462332532.4221], [-462332531.4221, -462332530.4221, -462332529.4221]], "B_new": [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]}

Explanation: This example demonstrates the optimizer handling a non-square (wide) 2×3 matrix. The algorithm transposes wide matrices before applying the Newton-Schulz iteration, then transposes back. The uniform gradient produces a structured update, and the relative differences between entries of theta_new preserve the original structure of theta.
Constraints