3 hours ago · Tech · 0 comments

You don’t need a framework, a SaaS dashboard, or a dependency to test an AI agent. You need a way to run it, a way to grade it, and a loop around both. Here we build an eval harness in a single Bun file, start to finish, every line explained. By the end you’ll have one evals.ts file that spins up a sandbox, drives the agent through the claude CLI, and grades the result three ways.

What we’re building

An eval is a test for software that isn’t deterministic. A unit test asks “does 2 + 2 return 4?”, but an AI agent gives you a different paragraph every time you ask, so there’s no single value to assert against. An eval instead pins down one observable behavior (“when there’s no plan yet, it recommends planning first”) and checks whether the agent did it, while tolerating the fact that the exact words vary.

People reach for hosted platforms for this. You don’t have to. Every eval harness, underneath the dashboard, is the same three moves:

Run the agent. Give it a prompt in a controlled…
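To make the run, grade, loop shape concrete before any real agent is involved, here is a minimal sketch. Every name in it (Agent, Grader, EvalCase, stubAgent, recommendsPlanning, runEvals) is hypothetical and chosen for illustration; it is not the evals.ts this article builds. The stub stands in for the real agent, so nothing here touches the claude CLI or a sandbox.

```typescript
// A sketch only: all names here are hypothetical, not the article's evals.ts.

// Anything that takes a prompt and eventually returns text counts as an agent.
type Agent = (prompt: string) => Promise<string>;

// A grader checks one observable behavior and returns pass or fail.
type Grader = (output: string) => boolean;

interface EvalCase {
  name: string;
  prompt: string;
  grade: Grader;
}

interface EvalResult {
  name: string;
  passed: boolean;
  output: string;
}

// Stand-in agent (an assumption for illustration): its wording changes with
// the prompt, the way a real agent's wording changes between runs.
const stubAgent: Agent = async (prompt) => {
  const openings = ["I'd suggest", "My recommendation is that", "Before anything else,"];
  const opening = openings[prompt.length % openings.length];
  return `${opening} we write a plan before touching any code.`;
};

// Grade the behavior, not the exact words: any mention of "plan" or
// "planning" passes, whatever sentence surrounds it.
const recommendsPlanning: Grader = (output) => /\bplan(ning)?\b/i.test(output);

// The loop around both: run each case through the agent, then grade it.
async function runEvals(agent: Agent, cases: EvalCase[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const output = await agent(c.prompt);
    results.push({ name: c.name, passed: c.grade(output), output });
  }
  return results;
}

const cases: EvalCase[] = [
  {
    name: "recommends planning when no plan exists",
    prompt: "There is no plan yet. What should we do first?",
    grade: recommendsPlanning,
  },
];

runEvals(stubAgent, cases).then((results) => {
  for (const r of results) {
    console.log(`${r.passed ? "PASS" : "FAIL"} ${r.name}`);
  }
});
```

Because an agent is just a function from prompt to text, swapping the stub for something that shells out to a real agent would not change the loop or the graders.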
