
FIPO: Teaching LLMs Which Thoughts Actually Matter

Paper · Notion writeup · Code

Most reasoning models today rely on outcome-based RL: generate a solution, check whether the final answer is correct, and reinforce the whole trajectory. As reasoning chains grow longer, this learning signal collapses: important steps and irrelevant tokens receive the same credit, and performance plateaus. FIPO addresses this directly by introducing token-level credit assignment based on future impact, turning a sparse, outcome-only signal into a dense, structured one.

The authors train Qwen2.5-32B-Base on top of the DAPO recipe inside VeRL and evaluate on AIME 2024. They report Pass@1 rising from 50.0% (DAPO) to a peak of 58.0%, converging around 56.0%. Response lengths roughly double during training, from around 4k tokens to more than 10k.

FIPO

FIPO measures, for each token, how much the policy's behaviour on the rest of the rollout has shifted since the last update. For each position in a response, the authors compute…
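The writeup is cut off before the exact formula, so here is only a minimal sketch of the general idea: score each token by how much the policy has shifted on the *suffix* of the rollout since the last update, using the change in per-token log-probabilities as a stand-in for "behaviour shift". The function name and the suffix-sum proxy are my assumptions for illustration, not FIPO's actual definition.

```python
def future_impact_scores(logp_old, logp_new):
    """Illustrative proxy for token-level future-impact credit.

    logp_old, logp_new: per-token log-probs of the same sampled rollout
    under the policy before and after the last update. NOTE: this proxy
    is an assumption; the paper's actual impact measure may differ.
    """
    # per-token magnitude of the policy shift on the sampled rollout
    shift = [abs(b - a) for a, b in zip(logp_old, logp_new)]
    T = len(shift)
    impact = [0.0] * T
    running = 0.0
    # walk backwards: impact at position t = total shift over tokens after t,
    # so tokens whose suffix changed a lot get more credit
    for t in range(T - 1, -1, -1):
        impact[t] = running
        running += shift[t]
    return impact

# Toy rollout of 4 tokens: only later tokens moved, so early
# positions accumulate the largest future-impact scores.
scores = future_impact_scores([0.0, 0.0, 0.0, 0.0], [0.0, 1.0, -1.0, 2.0])
# scores == [4.0, 3.0, 2.0, 0.0]
```

Scores like these could then be normalized into per-token weights on the outcome advantage, which is what makes the signal dense rather than trajectory-level.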
