From SGD to Muon: An Incremental Tutorial

This tutorial builds up to the Muon optimizer one idea at a time. Each section adds exactly one concept on top of the previous one, so by the time we reach Muon, every design decision should feel inevitable rather than mysterious.

The path we'll take:

1. Gradient descent, and weights as matrices
2. Momentum
3. Preconditioning (why Adam is a "diagonal" preconditioner)
4. Full-matrix AdaGrad: the ideal we can't afford
5. Shampoo: Kronecker-factored preconditioning
6. A surprising simplification: instantaneous Shampoo = orthogonalization
7. Why orthogonalizing the update is a good idea on its own
8. Muon: momentum + cheap orthogonalization via Newton–Schulz
9. Practical details, code, and what Muon doesn't handle

Notation: a weight matrix is $W \in \mathbb{R}^{m \times n}$ (e.g., a linear layer mapping $n$ inputs to $m$ outputs), its gradient is $G = \nabla_W \mathcal{L}$, also $m \times n$. Learning rate is $\eta$.

Step 1: Gradient descent, and taking matrices seriously

Plain SGD updates every parameter the same way:

$$W \leftarrow W - \eta G$$

Notice…
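
To make the plain SGD update from Step 1 concrete, here is a minimal NumPy sketch. The function name `sgd_step`, the toy single-example least-squares loss, and the sizes `m = 3`, `n = 4` are illustrative assumptions, not part of the tutorial's own code. The point is only that the gradient has the same $m \times n$ shape as the weight matrix and that every entry is updated by the same rule.

```python
import numpy as np


def sgd_step(W, G, lr):
    """Plain SGD: W <- W - lr * G, applied identically to every entry."""
    return W - lr * G


# Hypothetical toy problem (an assumption for illustration):
# loss(W) = 0.5 * ||W x - y||^2 for one input x and one target y.
rng = np.random.default_rng(0)
m, n = 3, 4                      # W maps n inputs to m outputs
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = rng.standard_normal(m)


def loss(W):
    r = W @ x - y                # residual, shape (m,)
    return 0.5 * float(r @ r)


def grad(W):
    # dL/dW = (W x - y) x^T, which has the same m x n shape as W
    return np.outer(W @ x - y, x)


G = grad(W)
W_new = sgd_step(W, G, lr=0.01)

print("gradient shape:", G.shape)
print("loss decreased:", loss(W_new) < loss(W))
```

Note that `sgd_step` never uses the fact that `W` is a matrix: it would behave identically if `W` and `G` were flattened into vectors.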