Expert-aware quantisation: near-Q4 quality at near-Q2 size?


1 hour ago · 8 min read · 1520 words · Tech · 0 comments

While researching and writing my last article on the history of KV cache compression, it occurred to me that while there has been so much implemented research on KV cache efficiency, quantisation of the actual model weights is still pretty blunt. This makes sense - at large scale, with many tens of thousands of GPUs, the weights themselves aren't a huge efficiency bottleneck for the most part, and KV cache starts dominating memory usage. But for us lowly serfs who don't have access to a warehouse full of HBM, it is a problem. The key constraint for local models is (mostly) just loading the weights into something fast enough.

Profiling

I spend a lot of time profiling applications to improve their performance, and a couple of months ago I built a tool to do the same for MoE models. This got me thinking. What if instead of just quantising the entire model to a certain level - the blunt hammer I mentioned - we instead profile the model first and then quantise the "cold" experts selectively, for…
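To make the "profile first, then quantise the cold experts" idea concrete, here is a rough sketch of what the planning step might look like. It is not the tool mentioned above. The function names, the 50% cold cut-off and the activation counts are all made up for illustration, and the 4-bit and 2-bit widths are simply borrowed from the Q4/Q2 framing in the title. It also assumes every expert is the same size and ignores the per-block overhead that real quantisation formats carry.

```python
from dataclasses import dataclass


@dataclass
class ExpertPlan:
    expert_id: int
    activations: int  # how often the router picked this expert while profiling
    bits: int         # bit width we plan to quantise this expert to


def plan_expert_bits(activation_counts, cold_fraction=0.5, hot_bits=4, cold_bits=2):
    """Give the least-activated experts cold_bits and the rest hot_bits.

    All parameters are hypothetical knobs for this sketch, not settings
    from any real quantisation tool.
    """
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")

    n_cold = int(len(activation_counts) * cold_fraction)
    # Rank expert ids from least to most activated; ties are broken by id.
    order = sorted(range(len(activation_counts)),
                   key=lambda i: (activation_counts[i], i))
    cold = set(order[:n_cold])

    return [
        ExpertPlan(i, count, cold_bits if i in cold else hot_bits)
        for i, count in enumerate(activation_counts)
    ]


def average_bits(plan):
    """Mean bits per weight, assuming all experts have the same size."""
    return sum(p.bits for p in plan) / len(plan)


# Made-up profiling counts for one MoE layer with eight experts.
counts = [900, 15, 40, 700, 5, 650, 20, 10]
plan = plan_expert_bits(counts)
print([p.bits for p in plan])  # → [4, 2, 4, 4, 2, 4, 2, 2]
print(average_bits(plan))      # → 3.0
```

A fixed fraction is the simplest possible cut-off. A threshold on the share of total activations would be an equally reasonable choice, and which one works better is an empirical question.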
