Georgi Gerganov created llama.cpp on March 10, 2023, ten days after Meta released LLaMA, and showed that a 7-billion-parameter language model could run inference on a MacBook CPU, no GPU required. At first, nobody had weights worth running. Then the weights existed but could not fit on consumer hardware. Then the models fit but served too slowly. Then serving was fast but only handled text. Now the bottleneck is making local models actually reliable as agents. Time moves fast in AI, and that is how it felt to me, so here is a month-by-month overview of what actually happened with local LLMs from 2023 to 2026, and a look at what could be next.

2023

January

January 2023 was still a pre-open-weights month for local LLMs, but it delivered one of the year's foundational ideas: model compression was emerging as the path to personal hardware. SparseGPT appeared on arXiv on Jan. 2 and showed that GPT-family models as large as OPT-175B and BLOOM-176B could be pruned to roughly 50…
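To make the pruning idea concrete, here is a toy sketch of the simplest form of weight pruning: zeroing out the smallest-magnitude fraction of a weight matrix. This is only an illustration of the general concept; SparseGPT itself uses a far more sophisticated one-shot, second-order method, and the function below is a hypothetical example, not SparseGPT's algorithm.

```python
def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude `sparsity` fraction of weights.

    Toy magnitude pruning on a matrix given as a list of lists.
    SparseGPT's actual method is one-shot and second-order; this
    only shows what "50% sparsity" means mechanically.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)
    # Everything at or below the k-th smallest magnitude is dropped.
    threshold = flat[k - 1] if k > 0 else float("-inf")
    return [[w if abs(w) > threshold else 0.0 for w in row]
            for row in weights]

# Half the entries (the two smallest in magnitude) become zero:
magnitude_prune([[0.1, -0.9], [0.4, -0.05]], sparsity=0.5)
# → [[0.0, -0.9], [0.4, 0.0]]
```

The appeal for local inference is that a sparse matrix needs less storage and, with the right kernels, less compute, which is exactly the lever that made frontier-scale weights plausible on personal hardware.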