Distributing LLM inference in DwarfStar


1 hour ago · Tech · 0 comments

High-end NVIDIA cards, and the servers and power needed to run them, cost a lot of money, especially if you want enough VRAM to run massive models. The alternative, so far, has been Apple hardware, or the DGX Spark which, although severely limited by its memory bandwidth, still runs LLM prompt processing (prefill) fast enough. The Mac Studio provided up to 512GB of unified memory: a solution with modest memory bandwidth (though much better than the Spark's) and compute, at a price that was, given the current situation, relatively fair. For instance, with DwarfStar the Mac Studio M3 Ultra 512GB can run DeepSeek v4 PRO at 150 t/s prefill and ~10-13 t/s decoding: not great, but usable for certain use cases. Even 2-bit quantized, DeepSeek v4 PRO delivers like Flash at the same quantization. I would not consider it a trivial thing to run a frontier model at home with ~12k in total spending. One could expect this to get better and better, but the…
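To get a feel for what those throughput figures mean in practice, here is a minimal back-of-the-envelope sketch. It assumes that a request's wall-clock time is simply prefill time plus decode time, and that both rates stay constant regardless of context length, which is a simplification. The request size (3000 prompt tokens, 500 output tokens) and the `request_seconds` helper are hypothetical, chosen only for illustration; the 150 t/s and 10-13 t/s figures are the ones quoted above.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimate wall-clock seconds as prefill time plus decode time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Throughput quoted in the post for the Mac Studio M3 Ultra 512GB.
PREFILL_TPS = 150.0
DECODE_TPS_LOW, DECODE_TPS_HIGH = 10.0, 13.0

# Hypothetical request size (an assumption, not a figure from the post).
prompt_tokens, output_tokens = 3000, 500

slow = request_seconds(prompt_tokens, output_tokens, PREFILL_TPS, DECODE_TPS_LOW)
fast = request_seconds(prompt_tokens, output_tokens, PREFILL_TPS, DECODE_TPS_HIGH)
print(f"estimated latency: {fast:.0f}s to {slow:.0f}s")
```

Under these assumptions the prompt takes about 20 seconds to process and the whole request finishes in roughly a minute, with decoding dominating the total.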
