A new study published on arXiv reveals that reasoning models like DeepSeek-R1 and GPT-OSS often know their final answers far earlier than their chain-of-thought output suggests. The research, titled "Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought," demonstrates that models engage in performative reasoning, continuing to generate tokens after they've already formed strong internal confidence in their answer.
Internal Probes Detect Answers Earlier Than Output Reveals
Researchers Siddharth Boppana, Jack Merullo, and colleagues compared three methods for detecting when models know their answers: activation probing (reading internal model states), early forced answering (stopping generation mid-chain), and CoT monitoring (analyzing the output text). Their experiments on DeepSeek-R1 671B and GPT-OSS 120B revealed significant gaps between internal knowledge and external expression.
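The activation-probing approach can be illustrated with a toy linear probe. The sketch below is entirely synthetic — fabricated "activations," a hand-rolled logistic regression, and a simulated chain of thought — and assumes only the general setup the article describes: a linear probe over hidden states that becomes confident in the answer before the output sequence ends.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for activation probing: hidden states are 8-dim vectors, and
# the two answer options are separated along one "answer direction".
# All data here is synthetic; the paper probes real model activations.
dim, n = 8, 200
direction = np.zeros(dim)
direction[0] = 1.0
labels = np.repeat([1, 0], n // 2)
acts = rng.normal(size=(n, dim)) + np.outer(2 * labels - 1, direction) * 2

# Fit a logistic-regression probe by plain gradient descent.
w, b = np.zeros(dim), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(acts @ w + b)))
    w -= 0.5 * acts.T @ (p - labels) / n
    b -= 0.5 * (p - labels).mean()

# Simulated chain of thought: activations drift toward the answer over
# 20 token positions, as if the model "makes up its mind" mid-sequence.
T = 20
traj = np.stack([2 * (t / T) * direction + rng.normal(scale=0.1, size=dim)
                 for t in range(T)])
conf = 1 / (1 + np.exp(-(traj @ w + b)))

# Earliest position where the probe is near-certain of the final answer.
earliest = int(np.argmax(conf > 0.9))
print(f"probe near-certain at token {earliest} of {T}")
```

The point of the toy is the shape of the result: probe confidence crosses the near-certainty threshold well before the sequence finishes, which is the gap the paper measures against text-based CoT monitors.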
On easy recall-based MMLU questions, the model's final answer was decodable from internal activations far earlier in the chain-of-thought than any monitor could detect from the output text. For difficult multi-hop GPQA-Diamond questions, however, the researchers found evidence of genuine reasoning rather than theater: inflection points such as backtracking occurred almost exclusively when probes showed large belief shifts.
Token Reduction of Up to 80% on Simple Questions
The practical implications are substantial. By using probe-guided early exit strategies, the researchers achieved:
- Up to 80% token reduction on MMLU questions while maintaining accuracy
- 30% token reduction on challenging GPQA-Diamond questions
- Similar accuracy to full chain-of-thought generation
- Task-dependent performance: more theater on easy questions, more genuine reasoning on hard ones
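The early-exit strategy behind those numbers is conceptually a simple control loop: generate, read the probe, and stop once it is near-certain. The sketch below uses hypothetical `step`/`probe` interfaces and a toy confidence ramp; it is not the paper's implementation, only the control-flow idea.

```python
def generate_with_early_exit(step, probe, threshold=0.95, max_steps=1024):
    """Run a token-generation loop, stopping once an activation probe is
    near-certain of the final answer. `step` yields (token, hidden_state);
    `probe` maps a hidden state to (answer, confidence). Both interfaces
    are illustrative stand-ins, not the paper's actual code."""
    tokens, answer = [], None
    for token, hidden in step:
        if len(tokens) >= max_steps:
            break
        tokens.append(token)
        answer, confidence = probe(hidden)
        if confidence >= threshold:
            return answer, tokens  # early exit: skip the remaining "theater"
    return answer, tokens

# Toy demo: probe confidence in answer "B" ramps up over 100 steps.
fake_stream = ((f"tok{t}", t / 100) for t in range(100))
fake_probe = lambda h: ("B", h)  # "hidden state" is just a confidence scalar
answer, tokens = generate_with_early_exit(fake_stream, fake_probe, threshold=0.75)
print(answer, len(tokens))  # stops after 76 of 100 steps
```

In a real system the savings depend entirely on where the probe becomes confident, which is why the reported reductions differ so sharply between MMLU (80%) and GPQA-Diamond (30%).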
The researchers define performativity as the gap between the point at which a CoT monitor can detect the answer from the output text and the point at which activation probes show near-certain internal confidence. When a probe reads out the correct answer early in the sequence, the reasoning steps that follow are considered performative: the model keeps generating even though its internal confidence is never verbalized in the output.
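That definition can be written down directly. The function below is an illustrative formalization of the gap with made-up token indices; the paper's exact metric may be normalized or measured differently.

```python
def performativity(monitor_detect_step: int, probe_certain_step: int,
                   total_steps: int) -> float:
    """Fraction of the chain of thought generated after the activation probe
    was already near-certain but before a text monitor could detect the
    answer. Illustrative formalization, not the paper's exact metric."""
    gap = max(monitor_detect_step - probe_certain_step, 0)
    return gap / total_steps

# Example: probe near-certain at token 40, monitor detects at token 380 of 400.
print(performativity(380, 40, 400))  # 0.85 -> most of the CoT is "theater"
```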
Implications for Inference Efficiency
The findings suggest that current reasoning models waste significant computational resources on "theater" after reaching internal conclusions. The research positions activation probing as an efficient tool for detecting performative reasoning and enabling adaptive computation. This could lead to substantially reduced inference costs for reasoning models, particularly on easier questions where the theater effect is most pronounced.
The study distinguishes between genuinely useful reasoning processes and performative output, offering a path toward more efficient deployment of reasoning models. By detecting when models have reached internal confidence, systems could potentially stop generation early without sacrificing accuracy, addressing one of the key cost challenges facing reasoning model deployment.
Key Takeaways
- Reasoning models often know their final answers far earlier than their chain-of-thought output reveals, engaging in performative "theater"
- Activation probes decode the final answer far earlier in the sequence than monitors analyzing the output text, especially on easy MMLU questions
- Probe-guided early exit reduces tokens by 80% on simple questions and 30% on difficult questions while maintaining accuracy
- The theater effect is task-dependent: more prevalent on easy recall questions, with genuine reasoning occurring on complex multi-hop problems
- Inflection points like backtracking occur almost exclusively when internal probes show genuine belief shifts, suggesting these behaviors track real uncertainty