If you’ve read this far, you’ve noticed that every paper I’ve discussed has a number next to it. 85.9 percent on HumanEval. 12.5 percent on SWE-bench. 25 percent on TravelPlanner. These numbers do a lot of work in the multi-agent literature, and they also do a surprising amount of harm. This post is about the benchmarks themselves. What they measure. What they don’t. And why ChatDev and MetaGPT can report contradictory results on each other without either one being obviously wrong.

Getting Up to Speed on MAS
- Part 1. The Landscape
- Part 2. The Vocabulary
- Part 3. Wave 1: Can Agents Coordinate At All?
- Part 4. Wave 2: Why It Breaks
- Part 5. Debate, State, and Coordination
- Part 6. Verification Patterns
- Part 7. Benchmarks and What They Miss (you are here)
- Part 8. Open Questions (publishes May 1)

The Landscape

Here’s every benchmark that’s come up in the series so far, plus a few that haven’t.

| Benchmark | Domain | What It Tests | Scale | Multi-Agent? | Notable Results |
| --- | --- | --- | --- | --- | --- |
| HumanEval | Code generation | Write a correct… | | | |