A new study published on arXiv reveals that large language models struggle dramatically with multi-step procedural execution, even as they achieve strong scores on standard reasoning benchmarks. The research, titled "When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models," tested 14 models across 55 datasets and found that average first-answer accuracy falls from 61% on 5-step procedures to just 20% on 95-step procedures, roughly a threefold decline.
LLM Accuracy Plummets as Procedure Length Increases
Researchers Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, and Mayank Singh created a controlled diagnostic benchmark where models receive step-by-step arithmetic algorithms and two numeric inputs, then must return the final computed value by following the procedure exactly. While the arithmetic operations themselves are simple, complexity increases through algorithm length and look-back dependencies over intermediate variables.
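To make the setup concrete, here is a minimal sketch of what such a task could look like: a chain of simple arithmetic steps over two inputs, where later steps may look back at any earlier intermediate variable, and the gold answer comes from executing the procedure deterministically. This is an illustrative reconstruction under assumptions, not the authors' actual generator; the function names, operation set, and variable scheme are all hypothetical.

```python
import random

# Hypothetical operation set; the paper only says the per-step arithmetic is simple.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def make_procedure(n_steps: int, seed: int = 0):
    """Generate an n-step arithmetic procedure with look-back dependencies."""
    rng = random.Random(seed)
    steps = []
    for i in range(1, n_steps + 1):
        op = rng.choice(list(OPS))
        # Look-back: operands can be the inputs or *any* earlier intermediate
        # variable, which is what makes long procedures hard to track.
        left, right = rng.sample(["x", "y"] + [f"v{j}" for j in range(1, i)], 2)
        steps.append((f"v{i}", op, left, right))
    return steps

def execute(steps, x: int, y: int) -> int:
    """Deterministically run the procedure to obtain the gold answer."""
    env = {"x": x, "y": y}
    for name, op, left, right in steps:
        env[name] = OPS[op](env[left], env[right])
    return env[steps[-1][0]]

proc = make_procedure(n_steps=5, seed=42)
gold = execute(proc, x=7, y=3)
# The model is shown the steps written out in the prompt and must return
# `gold` by following them exactly; grading compares its first answer.
```

Because the answer is fully determined by the procedure, any deviation in a model's output is attributable to unfaithful execution rather than task ambiguity, which is what makes the benchmark diagnostic.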
The results reveal systematic failures across multiple dimensions (see the detection sketch after this list):
- Missing answers where models produce no output
- Premature answers where execution stops before completing all steps
- Self-correction attempts after initial errors
- Under-executed traces that skip required steps
- Hallucinated extra steps that add non-existent operations
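Failure modes like these lend themselves to mechanical detection. The sketch below is an assumption about how a graded trace might be represented, not the paper's actual grading scheme; the function name, trace format, and check priority are all hypothetical, and it classifies a run by comparing the step indices a model actually executed against the expected count:

```python
def classify_trace(executed_steps: list[int], final_answer: str | None,
                   expected_steps: int) -> str:
    """Map one model run onto the failure modes listed above.

    `executed_steps` holds the 1-based indices of procedure steps the
    model's trace performed, in order; the check priority here is a
    simplification for illustration.
    """
    if final_answer is None:
        return "missing_answer"            # no output produced at all
    if any(i > expected_steps for i in executed_steps):
        return "hallucinated_steps"        # added non-existent operations
    if len(set(executed_steps)) < len(executed_steps):
        return "self_correction"           # re-ran steps after an initial error
    if executed_steps and executed_steps[-1] < expected_steps:
        return "premature_answer"          # answered before reaching the last step
    if len(executed_steps) < expected_steps:
        return "under_executed"            # reached the end but skipped steps
    return "completed"

# Ran steps 1-3 of a 5-step procedure, then answered early:
print(classify_trace([1, 2, 3], "42", expected_steps=5))   # premature_answer
# Jumped from step 2 straight to the final step:
print(classify_trace([1, 2, 5], "42", expected_steps=5))   # under_executed
```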
Benchmark Scores Mask Fundamental Execution Weaknesses
The researchers emphasize a critical distinction: "Large language models often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution."
This gap between apparent reasoning and faithful execution has serious implications for real-world applications requiring precise, step-by-step execution, including scientific computing, financial calculations, and medical protocols. High benchmark scores on reasoning tasks may not reflect genuine procedural competency, raising questions about the reliability of LLMs in production environments where exact instruction following is critical.
Key Takeaways
- LLM accuracy on procedural tasks drops from 61% to 20% as procedures extend from 5 to 95 steps, a roughly threefold decline
- Models exhibit multiple failure modes including missing answers, premature termination, skipped steps, and hallucinated operations
- Strong reasoning benchmark scores do not guarantee faithful execution of multi-step procedures
- The findings have critical implications for deploying LLMs in domains that require precise instruction following, such as healthcare, finance, and scientific computing
- Testing across 14 models and 55 datasets demonstrates that this is a systemic limitation rather than a model-specific issue