
I’ve been frustrated by constant code regressions in piclaw for the past few weeks. Something was off: even after bumping the test suite to the point where it catches most mechanical errors, gpt-5.5 kept making unrelated edits to code that should have been left alone, and I was tired of babysitting it. The pattern was always the same: it would follow a strict spec and then “improve” three other things nobody asked for. And since I’m running piclaw myself, know exactly what the agent does, and can trace every request and its context, I know this isn’t a harness bug.

So I spent last night investigating. I gave both gpt-5.3-codex and gpt-5.5 the exact same prompt, from clean sessions: audit this codebase thoroughly for code smells and logic errors and fix them. Two identical worktrees, two models, same system prompt, same tooling. Reset both, run, compare results (roughly the loop sketched below).

I did this five times, and gpt-5.3-codex produced more complete fixes, caught more subtle issues, and generated more…
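For anyone who wants to reproduce the setup, here’s a minimal Python sketch of that loop. The `piclaw run --model --prompt` invocation is a hypothetical stand-in for the real agent command (I’m not spelling out the actual CLI here); the git worktree plumbing is the part that matters, since it guarantees both models start from an identical tree:

```python
import subprocess
from pathlib import Path

REPO = Path(".").resolve()
PROMPT = "audit this codebase thoroughly for code smells and logic errors and fix them"
MODELS = ["gpt-5.3-codex", "gpt-5.5"]
RUNS = 5

def sh(*args, cwd=REPO):
    # Fail loudly so a broken run can't silently pollute the comparison.
    subprocess.run(args, cwd=cwd, check=True)

(REPO / "results").mkdir(exist_ok=True)

for run in range(1, RUNS + 1):
    for model in MODELS:
        tree = REPO.parent / f"audit-{model}-run{run}"
        # Identical starting point for both models: a fresh detached worktree at HEAD.
        sh("git", "worktree", "add", "--detach", str(tree), "HEAD")
        # Hypothetical placeholder: substitute your real agent invocation here.
        sh("piclaw", "run", "--model", model, "--prompt", PROMPT, cwd=tree)
        # Stage everything so newly created files show up in the diff, then capture it.
        sh("git", "add", "-A", cwd=tree)
        diff = subprocess.run(["git", "diff", "--cached"], cwd=tree,
                              check=True, capture_output=True, text=True).stdout
        (REPO / "results" / f"{model}-run{run}.diff").write_text(diff)
        sh("git", "worktree", "remove", "--force", str(tree))
```

Saving each run as a plain diff keeps the comparison honest: you grade patches side by side after the fact, instead of eyeballing live sessions.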
