How to fit Qwen 3.6 35B A3B into 16GB of VRAM, & run it with Llama.cpp on an RTX 3080

2 hours ago · Tech · 0 comments

I have been informed by the internet that using Ollama is passé. Though it’s easy to set up and use, and popular with rank beginners like The Autodidacts, power users balk at the fact that:

- Performance is worse than llama.cpp et al.
- It’s a VC-backed startup that’s rapidly selling out on the local-first promise by touting paid cloud offerings.
- It relies on llama.cpp for the heavy lifting, without giving credit where credit is due.

Long ago, I thought: okay, I should just switch to llama.cpp. I’ve used whisper.cpp, how hard can it be?

Let me just say: there’s a reason Ollama is so popular. Every time I tried to switch to llama.cpp, either a) it wouldn’t compile, or b) I got cryptic CUDA memory allocation errors. (Even when I went to run the commands to test for this article, it was broken, because of an interrupted brew upgrade.)

But now I am past that. After fruitless Googling, I threw command line arguments at it until it worked, and now this post is fruit for you to Google.

Step…
