A tutorial that builds up everything you need — linear attention, state-space models, Mamba, the delta rule — and then assembles them into Gated DeltaNet (Yang, Kautz & Hatamizadeh, 2024; ICLR 2025). Part 0 — The problem everyone is trying to solveA standard Transformer's attention has two costs that hurt at long context: Training cost scales quadratically: attending every token to every other token is O(L2) in sequence length L. Inference cost grows linearly in memory: the KV cache stores keys and values for every past token, so generating token t requires O(t) memory and O(t) compute per step. The dream is a model that behaves like an RNN at inference time — a fixed-size state that gets updated once per token, giving O(1) memory and compute per generated token — while still being trainable in parallel like a Transformer. Linear attention, state-space models (Mamba), and DeltaNet are all members of this family. Gated DeltaNet is, in a precise sense, the merger of Mamba2 and DeltaNet,…
No comments yet. Log in to reply on the Fediverse. Comments will appear here.