Researchers have conducted the first systematic evaluation of vision-language models (VLMs) for spatial reasoning over robot motion, reaching 75% accuracy after fine-tuning on motion preference tasks. The study, published on arXiv on March 13, 2026, tested four state-of-the-art VLMs across four querying methods. Qwen2.5-VL achieved 71.4% zero-shot accuracy, outperforming GPT-4o, though the authors note that significant improvement is still needed before production deployment.
VLMs Show Promise for Robot Motion Planning Integration
The research addresses a critical gap in robot planning: while VLMs have demonstrated natural language and spatial reasoning capabilities, their ability to handle spatial reasoning for motion preferences remained unclear. Motion preferences include object-proximity requirements (desired distances from objects, topological properties) and path-style preferences (motion characteristics and trajectories).
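To make the two preference types concrete, here is a minimal sketch of how such preferences might be encoded and rendered as natural-language queries for a VLM. All class names, fields, and wording below are our own illustration, not structures from the paper:

```python
from dataclasses import dataclass

@dataclass
class ProximityPreference:
    """Object-proximity preference: keep a desired distance from an object."""
    obj: str
    min_distance_m: float

@dataclass
class PathStylePreference:
    """Path-style preference: a qualitative property of the trajectory."""
    style: str  # e.g. "smooth", "direct", "a wide arc"

def to_query(pref) -> str:
    """Render a preference as a yes/no question a VLM could answer
    about a candidate trajectory shown in an image."""
    if isinstance(pref, ProximityPreference):
        return (f"Does the robot's path stay at least "
                f"{pref.min_distance_m} m away from the {pref.obj}?")
    return f"Does the robot's path look {pref.style}?"

print(to_query(ProximityPreference("vase", 0.5)))
# -> Does the robot's path stay at least 0.5 m away from the vase?
```

A planning pipeline could pair each such question with a rendered trajectory image and treat the VLM's answer as a constraint check.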
The researchers evaluated each of the four VLMs with four different querying methods, assessing how reliably the models can enforce user preferences and constraints on robot motion across both preference types.
Qwen2.5-VL Outperforms GPT-4o on Motion Reasoning Tasks
With the highest-performing querying method, Qwen2.5-VL achieved 71.4% accuracy in zero-shot evaluation, surpassing GPT-4o. Fine-tuning a smaller model on motion preference tasks improved accuracy to 75%. The study also analyzed the trade-off between accuracy and computational cost, measured in token count, offering practical guidance for deployment decisions.
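The accuracy-versus-token-cost trade-off amounts to a selection problem: pick the cheapest querying method that still meets a target accuracy. The sketch below illustrates this; the method names and figures are hypothetical placeholders, not the paper's measurements:

```python
# Hypothetical querying methods with illustrative accuracy and token costs.
methods = {
    "single_image":    {"accuracy": 0.62, "tokens": 400},
    "multi_view":      {"accuracy": 0.69, "tokens": 1100},
    "annotated_image": {"accuracy": 0.714, "tokens": 800},
}

def cheapest_meeting(target: float):
    """Return the lowest-token method whose accuracy meets the target,
    or None if no method qualifies."""
    ok = {k: v for k, v in methods.items() if v["accuracy"] >= target}
    return min(ok, key=lambda k: ok[k]["tokens"]) if ok else None

print(cheapest_meeting(0.65))  # -> annotated_image
```

With real measurements in place of the placeholders, this kind of filter lets a deployment budget tokens per query while holding a minimum accuracy bar.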
The evaluation revealed which types of motion preferences work better across different VLM architectures. This analysis helps identify specific capabilities and limitations, guiding future development of VLMs for robotics applications. The study represents the first systematic evaluation specifically targeting motion preference enforcement in robot planning pipelines.
Results Indicate Path Toward Production Deployment
While the 75% accuracy demonstrates promising baseline performance, the researchers note significant room for improvement remains before production deployment. The study establishes clear benchmarks for measuring progress and identifies specific areas requiring advancement.
This research contributes to the growing field of Vision-Language-Action (VLA) models for robotics, which includes recent developments like VLM2 for 3D-aware spatial reasoning, VLM-PoseManip for 6D pose estimation in dexterous manipulation, and Xiaomi Robotics-0's February 2026 release for interpreting vague instructions and spatial relationships. The systematic evaluation framework provides a foundation for measuring improvements as VLM capabilities advance.
Key Takeaways
- Qwen2.5-VL achieves 71.4% zero-shot accuracy on robot motion spatial reasoning, outperforming GPT-4o
- Fine-tuning a smaller model on motion preference tasks improves accuracy to 75%
- Study evaluates four state-of-the-art VLMs across four querying methods for object-proximity and path-style preferences
- Research represents first systematic evaluation of VLMs specifically for motion preference enforcement in robot planning
- Results show promise for VLM integration with robot motion planning but require improvement before production deployment