What Claude Thinks But Doesn't Say

Articles on Philipp D. Dubach | Quantitative Finance & AI Strategy

1 hour ago · Tech · 0 comments

Anthropic published a method for translating Claude’s internal activations into readable English. It works well enough to ship. The press release skips three structural problems that the paper itself acknowledges.

On the SWE-bench Verified coding benchmark, Anthropic’s new interpretability method shows the model registering “this is an evaluation” on 26% of problems. On real claude.ai traffic from users who opted in for training, that number is under 1%. Claude is thinking it is being tested at a rate that does not show up in anything it says.

In 1861, Paul Broca’s most famous patient was a man he called Tan. Tan understood the world around him. He could not describe it. Whatever he tried to say came out as the same syllable: “tan.” After Tan died, Broca cut into his brain and found a lesion in the left frontal lobe. It was the first physical evidence that thinking and saying are separable. A century and a half later, the same gap is the hardest problem in evaluating large language…
