Last week I wrote about the hardware side of running AI locally: why memory bandwidth matters more than raw compute, which machines are worth building, and where the market is heading. If you missed it, start there, as this post builds directly on top of it.

In the quest to become AI independent, your hardware sets the ceiling, but what decides how close you actually get to it is software.

Two machines with identical GPUs, identical VRAM, and identical bandwidth, one running naive inference and the other an optimised stack, can differ by 3-5x in tokens per second. That can be the difference between running at 5 tok/s and reaching the 20-30 tok/s you need for something usable. What's more, some techniques, and implementations optimised jointly across software, model, and hardware, may allow you to fit large models like DeepSeekV4-Flash on a MacBook with at least 96GB of RAM. Same model, same hardware, different choices in the software layer (I feel we have a lot to learn from the video encoding and…