A colleague told me a near-miss horror story. As Google began deprecating Gemini 2.0, we moved to Gemini 2.5 Pro. But reasoning is enabled by default and cannot be turned off. For our specific problem statement, reasoning was not required. Token costs increased 10x and speeds were 3-4x slower. We moved the client to Gemini 2.5 Flash Lite, which has reasoning turned off by default and offers much lower latency. Because we track compute costs closely, we managed this without a major financial impact. But model updates clearly require careful testing on the cost and latency front as well, not just output quality. AI used to keep getting cheaper. But now it’s more a “convergence”. Each line traces a model family. The X-axis is its intelligence (LMArena ELO score) and the Y-axis is the input cost ($ per million tokens, log scale). Time flows roughly left to right as models improve. Three patterns emerge and each seems strategic. Gemini Flash rockets upward (cheap -> expensive, moving…
No comments yet. Log in to reply on the Fediverse. Comments will appear here.