A brief history of KV cache compression developments

1 hour ago · Tech · 0 comments

While much attention is focused on improving models, there have been radical improvements in the efficiency of KV cache compression. I was curious to figure out just how big the improvements are, and why I think they matter so much.

The headline figure: the memory needed to store one token of context has fallen by roughly 100x since 2017. Over the same period, the memory on a top-of-the-range datacentre GPU has gone from 16GB to 288GB, an 18x improvement. The memory wall in AI has mostly been solved with maths, not silicon.

What is a KV cache?

When you use an LLM, either via the web on ChatGPT et al. or agentically via Claude Code, your "context" is stored in a KV cache. This is, or at least has been, an incredibly memory-intensive process, which has led to hard limits on session length. Put simply, the longer your "conversation" with the LLM grows, the more KV cache you need. A more efficient KV cache allows you to input more stuff (conversations, code, reference documents, images) in the same…
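To make the "longer conversation, more KV cache" point concrete, here is a minimal sketch of the usual back-of-the-envelope sizing for an uncompressed cache in a standard transformer: one key vector and one value vector per layer, per KV head, per token. The function names and the model configuration below are illustrative assumptions, not figures from this post or from any particular model.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of uncompressed KV cache needed to hold ONE token of context.

    The leading factor of 2 is for the key and the value stored per layer.
    bytes_per_value=2 assumes 16-bit storage (fp16/bf16).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(n_tokens, **config):
    """Total cache size: grows linearly with the number of tokens in context."""
    return n_tokens * kv_cache_bytes_per_token(**config)


# Hypothetical configuration, chosen only for round numbers.
config = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes_per_token(**config)
total = kv_cache_bytes(8192, **config)

print(f"{per_token / 1024:.0f} KiB per token")          # 512 KiB per token
print(f"{total / 1024**3:.0f} GiB for 8192 tokens")     # 4 GiB for 8192 tokens
```

Under these assumed numbers, every extra token costs half a mebibyte, so doubling the context doubles the memory. Shrinking any factor in the product (fewer KV heads, fewer bytes per value, and so on) shrinks the per-token cost proportionally.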
