17 hours ago · 14 min read · 2868 words · Tech · 0 comments

One night my AI agent Hermes, which I run 24/7 on my own hardware, spent 47 turns trying to “fix” a script. Every turn it ran the same broken command, got the same error, apologized, and tried again. I sat there watching the token counter climb like a Tatkal queue at 10 AM, and I genuinely could not tell what had gone wrong. Was the prompt bad? Was it missing context? Was a tool broken? Was the loop just never going to stop?

I am an AI Observability Architect, and I was staring at my own agent, unable to name which layer had failed. That night is what this post is about. A production AI agent is not one skill; it is five, and when something breaks at 2 AM you need to know exactly which one to blame.

Everyone talks about prompt engineering. Fewer people talk about context engineering. Almost nobody talks about harness, loop, or evaluation engineering, even though they’re the difference between a demo that works once and a system that survives a night alone on my server. An AI…
