Language models can successfully apply learned problem-solving strategies to novel spatial layouts but consistently fail when problems require longer planning horizons, according to new research published on arXiv on April 16, 2026. The study reveals that current AI architectures face fundamental limitations in sequential reasoning that cannot be overcome through better training or inference techniques alone.
Researchers Yao Tong, Jiayuan Ye, Anastasia Borovykh, and Reza Shokri designed a controlled synthetic environment based on shortest-path planning to cleanly separate two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems.
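The paper's exact task format is not reproduced here, but the general setup can be sketched as a grid-world shortest-path generator (a hypothetical illustration; the function names and grid encoding are assumptions, not the authors' code). Varying the map seed probes spatial transfer, while varying the grid size probes length scaling:

```python
from collections import deque
import random

def random_grid(n, wall_prob=0.2, seed=0):
    """Generate a random n x n grid; True marks a wall cell."""
    rng = random.Random(seed)
    return [[rng.random() < wall_prob for _ in range(n)] for _ in range(n)]

def shortest_path(grid, start, goal):
    """Breadth-first search; returns the cell sequence from start
    (exclusive) to goal (inclusive), or None if unreachable."""
    n = len(grid)
    prev = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell != start:
                path.append(cell)
                cell = prev[cell]
            return list(reversed(path))
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < n and 0 <= nc < n and not grid[nr][nc] \
                    and (nr, nc) not in prev:
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None

# Axis 1 (spatial transfer): new seed -> unseen map, similar path length.
# Axis 2 (length scaling): larger n -> longer optimal paths on familiar maps.
```

In a setup like this, the ground-truth BFS solutions supply training targets, and the two axes of generalization can be varied independently of each other.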
Spatial Transfer Succeeds, Length Scaling Fails
The research produced a striking divergence in capabilities. Models exhibited strong spatial transfer—they successfully applied learned shortest-path strategies to novel spatial layouts they hadn't encountered during training. This suggests the models learn generalizable spatial reasoning patterns rather than memorizing specific configurations.
However, models consistently failed under length scaling due to what the researchers term "recursive instability." When problems required longer planning horizons, errors accumulated across extended solution sequences, causing systematic failure. This limitation persisted regardless of the training or inference approach used.
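The intuition behind this kind of error accumulation can be shown with a back-of-the-envelope model (an illustration of the general phenomenon, not the paper's analysis): if each step of a solution succeeds independently with probability p, whole-sequence success decays geometrically with the horizon length.

```python
def sequence_success_rate(per_step_accuracy, horizon):
    """Probability that an entire solution is correct when each of
    `horizon` steps must succeed independently."""
    return per_step_accuracy ** horizon

# Even a highly accurate per-step policy degrades over long horizons.
for h in (10, 50, 200):
    print(h, sequence_success_rate(0.99, h))
```

Under this toy model, no fixed per-step accuracy below 1.0 survives arbitrarily long horizons, which matches the qualitative picture of errors compounding across extended solution sequences.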
Pipeline Stages Influence Systematic Problem-Solving Differently
The study also examined how distinct stages of the learning pipeline shape systematic problem-solving, finding that data coverage, reinforcement learning, and inference-time scaling each play a fundamentally different role.
According to the paper's abstract: "Data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures."
Architectural Limitations Beyond Training Fixes
The findings suggest that systematic generalization failures aren't simply fixable through better training techniques or more compute at inference time. Instead, some limitations appear inherent to current architectures, preventing models from reliably solving increasingly complex versions of the same problem class even after they've mastered simpler variants.
This has significant implications for deploying language models in planning tasks, code generation (where longer programs require extended reasoning chains), and complex multi-step problem-solving. The inability of reinforcement learning and inference-time compute to address length-scaling failures suggests that architectural changes may be necessary to overcome these limitations.
Implications for Sequential Reasoning
The research provides empirical evidence that current language model architectures face fundamental constraints in sequential reasoning. While they can transfer learned patterns to new spatial configurations—demonstrating genuine generalization in one dimension—they cannot scale those same patterns to longer reasoning chains without accumulating catastrophic errors.
This asymmetry between spatial transfer success and length scaling failure reveals important boundaries in current AI capabilities, suggesting that different types of generalization may require different architectural approaches.
Key Takeaways
- LLMs successfully transfer shortest-path strategies to novel spatial layouts, demonstrating strong spatial generalization
- Models consistently fail when scaling to longer planning horizons due to recursive instability and error accumulation
- Data coverage sets fundamental capability limits that reinforcement learning cannot expand
- Inference-time scaling enhances performance but cannot overcome length-scaling failures
- The findings suggest architectural changes may be needed to enable reliable sequential reasoning at scale