Every few weeks someone downloads the GGUF build and the MLX build of the same model, runs both, screenshots the tokens-per-second counter, and posts it as proof that one format wins. The replies split down the middle. Half the thread says MLX is obviously faster, the other half says the test was rigged. They are both right, which is the problem.

The number on the screen is real and it is also not the number you actually wait for. And the format you should pick was never really about that number anyway.

I have gone through this decision enough times now, on my own machine and for clients standing up local inference, that I want to write down the part nobody puts in the comparison tables: GGUF versus MLX is a five-question decision, and only one of those questions is about speed.

What you are actually choosing between

GGUF is the file format from the llama.cpp project. One file holds the quantized weights, the tokenizer, the chat template, and the metadata, and any runtime that can…