2 hours ago · Tech

Building out a small fleet of coding and research agents for LazerBunny means I regularly have requests to LLMs in flight, served from various machines on my local network. Most of the time this is not a big deal, but things get a little more interesting when multiple agents compete for the limited local resources. I spent some time this week looking into potential solutions. Some I ruled out straight away:

- Buy more hardware. This has to wait till we hear "the big pop".
- Use online hosted models.
- Anything that does 400 other things I do not want.

Returning to my earlier point: with only one agent per GPU, things are really easy. But what if I want two or three coding agents running? My usual workflow is to shoot off a bunch of questions or tasks to plan and make code suggestions, and then start integrating the solutions when I come back to the office. My agents usually do not directly edit the code, except for very small, localised changes. I prefer to take the code and manually fit…
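One simple shape such a solution could take, sketched here purely as an illustration and not as LazerBunny's actual setup: put a semaphore in front of each local model host so that at most a fixed number of agent requests hit one GPU at a time, while the rest queue up instead of fighting over VRAM. All names below (`run_agents`, `ask_llm`, `MAX_CONCURRENT`) are hypothetical.

```python
import asyncio

# Hedged sketch, assuming one local GPU host that should serve at most
# MAX_CONCURRENT requests at a time. Real code would call the model
# endpoint here; this version simulates inference with a short sleep.
MAX_CONCURRENT = 1

async def run_agents(n_agents: int) -> int:
    gpu_slot = asyncio.Semaphore(MAX_CONCURRENT)
    in_flight = 0
    peak = 0  # highest number of requests seen running at once

    async def ask_llm(prompt: str) -> str:
        nonlocal in_flight, peak
        async with gpu_slot:           # wait until a GPU slot is free
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for model inference
            in_flight -= 1
        return f"answer to: {prompt}"

    # Several agents fire questions concurrently; the semaphore
    # serializes them onto the single GPU.
    await asyncio.gather(*(ask_llm(f"task {i}") for i in range(n_agents)))
    return peak

print(asyncio.run(run_agents(3)))  # prints 1: requests never overlapped
```

The same idea generalises to one semaphore per host, so a lightweight router can pick whichever machine has a free slot.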
