A new study published on arXiv reveals that large language models struggle dramatically with multi-step procedural execution, even as they achieve strong scores on standard reasoning benchmarks. The research, titled "When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models," tested 14 models across 55 datasets and found that average first-answer accuracy falls from 61% on 5-step procedures to just 20% on 95-step procedures, roughly a threefold decline.
LLM Accuracy Plummets as Procedure Length Increases
Researchers Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, and Mayank Singh created a controlled diagnostic benchmark where models receive step-by-step arithmetic algorithms and two numeric inputs, then must return the final computed value by following the procedure exactly. While the arithmetic operations themselves are simple, complexity increases through algorithm length and look-back dependencies over intermediate variables.
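To make the setup concrete, here is a minimal sketch of what such a task could look like: a chain of simple arithmetic steps over two inputs, where later steps may look back at any earlier intermediate variable, and the gold answer comes from executing the procedure deterministically. This is an illustrative reconstruction under assumptions, not the authors' actual generator; the function names, operation set, and variable scheme are all hypothetical.

```python
import random

# Hypothetical operation set; the paper only says the per-step arithmetic is simple.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def make_procedure(n_steps: int, seed: int = 0):
    """Generate an n-step arithmetic procedure with look-back dependencies."""
    rng = random.Random(seed)
    steps = []
    for i in range(1, n_steps + 1):
        op = rng.choice(list(OPS))
        # Look-back: operands can be the inputs or *any* earlier intermediate
        # variable, which is what makes long procedures hard to track.
        left, right = rng.sample(["x", "y"] + [f"v{j}" for j in range(1, i)], 2)
        steps.append((f"v{i}", op, left, right))
    return steps

def execute(steps, x: int, y: int) -> int:
    """Deterministically run the procedure to obtain the gold answer."""
    env = {"x": x, "y": y}
    for name, op, left, right in steps:
        env[name] = OPS[op](env[left], env[right])
    return env[steps[-1][0]]

proc = make_procedure(n_steps=5, seed=42)
gold = execute(proc, x=7, y=3)
# The model is shown the steps written out in the prompt and must return
# `gold` by following them exactly; grading compares its first answer.
```

Because the answer is fully determined by the procedure, any deviation in a model's output is attributable to unfaithful execution rather than task ambiguity, which is what makes the benchmark diagnostic.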
The results reveal systematic failures across multiple dimensions (see the detection sketch after this list):
- Missing answers where models produce no output
- Premature answers where execution stops before completing all steps
- Self-correction attempts after initial errors
- Under-executed traces that skip required steps
- Hallucinated extra steps that add non-existent operations
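Failure modes like these lend themselves to mechanical detection. The sketch below is an assumption about how a graded trace might be represented, not the paper's actual grading scheme; the function name, trace format, and check priority are all hypothetical, and it classifies a run by comparing the step indices a model actually executed against the expected count:

```python
def classify_trace(executed_steps: list[int], final_answer: str | None,
                   expected_steps: int) -> str:
    """Map one model run onto the failure modes listed above.

    `executed_steps` holds the 1-based indices of procedure steps the
    model's trace performed, in order; the check priority here is a
    simplification for illustration.
    """
    if final_answer is None:
        return "missing_answer"            # no output produced at all
    if any(i > expected_steps for i in executed_steps):
        return "hallucinated_steps"        # added non-existent operations
    if len(set(executed_steps)) < len(executed_steps):
        return "self_correction"           # re-ran steps after an initial error
    if executed_steps and executed_steps[-1] < expected_steps:
        return "premature_answer"          # answered before reaching the last step
    if len(executed_steps) < expected_steps:
        return "under_executed"            # reached the end but skipped steps
    return "completed"

# Ran steps 1-3 of a 5-step procedure, then answered early:
print(classify_trace([1, 2, 3], "42", expected_steps=5))   # premature_answer
# Jumped from step 2 straight to the final step:
print(classify_trace([1, 2, 5], "42", expected_steps=5))   # under_executed
```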
Benchmark Scores Mask Fundamental Execution Weaknesses
The researchers emphasize a critical distinction: "Large language models often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful instruction execution."
This gap between apparent reasoning and faithful execution has serious implications for real-world applications requiring precise, step-by-step execution, including scientific computing, financial calculations, and medical protocols. High benchmark scores on reasoning tasks may not reflect genuine procedural competency, raising questions about the reliability of LLMs in production environments where exact instruction following is critical.
Key Takeaways
- LLM accuracy on procedural tasks drops from 61% to 20% as procedures extend from 5 to 95 steps, a roughly threefold decline
- Models exhibit multiple failure modes including missing answers, premature termination, skipped steps, and hallucinated operations
- Strong reasoning benchmark scores do not guarantee faithful execution of multi-step procedures
- The findings have critical implications for deploying LLMs in domains that require precise instruction following, such as healthcare, finance, and scientific computing
- Testing across 14 models and 55 datasets demonstrates that this is a systemic limitation rather than a model-specific issue