
MiniMax Sparse Attention: Per-Group Block Selection for Cheap Million-Token Inference

Long-context LLMs keep promising the same thing: feed more tokens into the prompt and let the model reason over them. The bottleneck is rarely the window itself. It is the cost of attending over that window once agentic workflows, repo-scale code reasoning, and persistent memory push the context into the hundreds of thousands or millions of tokens. Softmax attention is quadratic in sequence length, so million-token context is not just a modeling challenge but also an inference-cost and deployment challenge.

MiniMax Sparse Attention (MSA) is the attention design behind MiniMax M3, and it tackles this problem with blockwise sparsity built on top of Grouped Query Attention. The idea is to keep exact softmax attention but run it over a small, query-dependent subset of the key-value history instead of the whole thing. A lightweight Index Branch decides which blocks matter, and the expensive Main Branch performs exact…
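To make the two-stage idea concrete, here is a minimal NumPy sketch of one decode step for a single GQA group: a cheap scoring pass picks a few key-value blocks, then exact softmax attention runs only over the tokens in those blocks. This is an illustration under stated assumptions, not the MSA implementation. The function name `block_sparse_gqa_decode`, the mean-pooled block keys used for scoring, and the choice to sum scores across the heads in a group are all hypothetical stand-ins for whatever the Index Branch actually computes.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def block_sparse_gqa_decode(q, K, V, block_size, top_k):
    """One decode step for a single GQA group (hypothetical sketch).

    q: (heads_in_group, d) query heads that share one KV head
    K, V: (T, d) key/value history; T is assumed to be a multiple of block_size
    top_k: number of blocks to keep (assumed >= 1)
    """
    T, d = K.shape
    n_blocks = T // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Index step (assumption): summarize each block by its mean key and
    # score it against every query head in the group, then sum over heads
    # so the whole group selects the same blocks.
    block_keys = Kb.mean(axis=1)                 # (n_blocks, d)
    scores = (q @ block_keys.T).sum(axis=0)      # (n_blocks,)
    k = min(top_k, n_blocks)
    chosen = np.sort(np.argpartition(scores, -k)[-k:])

    # Main step: exact softmax attention, restricted to the chosen blocks.
    Ks = Kb[chosen].reshape(-1, d)               # (k * block_size, d)
    Vs = Vb[chosen].reshape(-1, d)
    attn = softmax(q @ Ks.T / np.sqrt(d), axis=-1)
    return attn @ Vs, chosen


rng = np.random.default_rng(0)
q = rng.standard_normal((4, 16))    # 4 query heads in the group
K = rng.standard_normal((64, 16))   # 64 cached tokens
V = rng.standard_normal((64, 16))
out, chosen = block_sparse_gqa_decode(q, K, V, block_size=8, top_k=2)
print(out.shape, chosen)
```

In this sketch the main attention step touches `top_k * block_size` tokens instead of all `T`, which is where the savings would come from. When `top_k` covers every block, the result matches dense softmax attention, which is a useful sanity check for any selection scheme of this shape.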
