Researchers have developed GEAR-VLA, a vision-language-action (VLA) framework that learns unified geometry-aware action representations for robotic manipulation. By explicitly incorporating 3D geometry understanding, the system improves by 35 percentage points on LIBERO unseen-view scenarios, addressing a critical weakness in current VLA models, which learn view-dependent 2D patterns.
Current VLA Models Struggle With Real-World Deployment Scenarios
Vision-language-action models perform well on benchmarks but fail in real-world deployment when encountering unseen objects, background shifts, or different robot embodiments. Yuan Zhang and colleagues identified the root cause as a lack of unified geometry-aware manipulation representation, leaving VLAs vulnerable to trajectory supervision issues, misaligned 3D features, and embodiment differences.
GEAR-VLA Integrates 3D Geometry Through Three Key Components
The framework introduces coarse-to-fine action learning that combines multi-source embodied pretraining with latent action tokens connecting semantic understanding to continuous action generation. The semantic-aligned 3D integration component aligns a trainable 3D spatial backbone with VLA representations while freezing the original vision-language pathway to preserve learned knowledge.
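The freeze-and-train split described above can be illustrated with a toy optimizer step. This is a minimal sketch, not GEAR-VLA's implementation: the `Param` class, the parameter names `vl_pathway` and `spatial_backbone_3d`, and the plain SGD update are all hypothetical, chosen only to show that frozen parameters keep their pretrained values while trainable ones are updated.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """A toy scalar parameter with a flag marking whether it may be updated."""
    value: float
    trainable: bool


def sgd_step(params, grads, lr=0.25):
    """Apply one SGD update, skipping every parameter marked as frozen."""
    updated = {}
    for name, p in params.items():
        if p.trainable:
            updated[name] = Param(p.value - lr * grads[name], True)
        else:
            # Frozen: keep the pretrained value untouched.
            updated[name] = Param(p.value, False)
    return updated


# Hypothetical parameter groups (names are illustrative, not from the paper).
params = {
    "vl_pathway": Param(value=1.0, trainable=False),          # frozen vision-language weights
    "spatial_backbone_3d": Param(value=1.0, trainable=True),  # trainable 3D backbone
}
# Both groups receive a gradient, but only the trainable one should move.
grads = {"vl_pathway": 2.0, "spatial_backbone_3d": 2.0}

new_params = sgd_step(params, grads)
print(new_params["vl_pathway"].value)           # unchanged by the update
print(new_params["spatial_backbone_3d"].value)  # moved against the gradient
```

In a real deep-learning framework the same effect would typically come from excluding the frozen weights from the optimizer or disabling their gradients; the toy version only makes the bookkeeping visible.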
Embodiment canonicalization separates robot-specific states from robot-invariant actions, confining morphology differences to low-level interfaces and enabling representation sharing across different robots without full retraining.
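One way to picture embodiment canonicalization is a shared, robot-invariant action type with thin per-robot adapters. The sketch below is an assumption-laden illustration, not the authors' design: `CanonicalAction`, `EmbodimentAdapter`, the robots `robot_a` and `robot_b`, and their gripper widths and axis conventions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CanonicalAction:
    """Robot-invariant action: end-effector translation delta in metres
    plus a normalized gripper opening in [0, 1]."""
    delta_xyz: Tuple[float, float, float]
    gripper: float


@dataclass(frozen=True)
class EmbodimentAdapter:
    """Low-level interface that holds everything robot-specific (hypothetical)."""
    name: str
    gripper_max_width_m: float        # physical stroke of this robot's gripper
    axis_signs: Tuple[int, int, int]  # sign flips between canonical and robot frame

    def to_robot_command(self, action: CanonicalAction) -> Dict[str, object]:
        """Map a canonical action to this robot's command format."""
        dx, dy, dz = (s * d for s, d in zip(self.axis_signs, action.delta_xyz))
        return {
            "robot": self.name,
            "ee_delta_m": (dx, dy, dz),
            "gripper_width_m": action.gripper * self.gripper_max_width_m,
        }


# The shared policy would emit one canonical action for every robot.
action = CanonicalAction(delta_xyz=(0.01, 0.0, -0.02), gripper=0.5)

# Two made-up embodiments that differ only in their low-level interface.
robot_a = EmbodimentAdapter("robot_a", gripper_max_width_m=0.08, axis_signs=(1, 1, 1))
robot_b = EmbodimentAdapter("robot_b", gripper_max_width_m=0.10, axis_signs=(-1, 1, 1))

cmd_a = robot_a.to_robot_command(action)
cmd_b = robot_b.to_robot_command(action)
print(cmd_a)
print(cmd_b)
```

The point of the split is that supporting a new robot in this sketch means writing a new adapter, while the code that produces `CanonicalAction` stays unchanged.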
Real-World Testing Demonstrates 90% Success Rate on Universal Grasping Tasks
GEAR-VLA achieved state-of-the-art results across real-world evaluations and simulation benchmarks:
- AgileX robot: 85.9% success rate on real-world tasks
- LDT-01 embodiment (unseen during pretraining): 81.0% success rate
- Universal grasping: 90.1% success across 6,360 trials with 212 unseen objects
- LIBERO unseen-view scenarios: 35 percentage point improvement over baseline VLA models
- CALVIN unseen-view scenarios: improvement of more than 11 percentage points
The system demonstrated strong zero-shot performance on LIBERO-Plus and top results on RoboTwin 2.0, showing that explicit 3D geometry understanding enables generalization to new viewpoints and embodiments.
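The reported figures can be sanity-checked with a little arithmetic. The sketch below uses only the trial count, object count, and success rate quoted above; the even split of trials across objects is an assumption, and the 50% baseline in the percentage-point example is hypothetical, since the baseline scores are not given here.

```python
# Figures quoted in the results above.
trials = 6360
unseen_objects = 212
reported_success_rate = 0.901

# Average trials per object (assumes trials were spread evenly across objects).
trials_per_object = trials / unseen_objects

# Approximate number of successful grasps implied by the rounded rate.
approx_successes = round(reported_success_rate * trials)


def percentage_point_gain(baseline_pct: float, new_pct: float) -> float:
    """Absolute difference between two success rates, in percentage points."""
    return new_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, new_pct: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (new_pct - baseline_pct) / baseline_pct


# Hypothetical baseline of 50% rising to 85%: a 35 percentage point gain
# is not the same thing as a 35% relative gain.
pp_gain = percentage_point_gain(50.0, 85.0)
rel_gain = relative_gain_pct(50.0, 85.0)

print(trials_per_object)  # 30.0
print(approx_successes)   # 5730
print(pp_gain, rel_gain)  # 35.0 70.0
```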
Key Takeaways
- GEAR-VLA addresses VLA model limitations by incorporating explicit 3D geometry understanding rather than relying on view-dependent 2D patterns
- The framework achieves a 35 percentage point improvement on LIBERO unseen-view scenarios and gains of more than 11 percentage points on CALVIN unseen views
- Embodiment canonicalization enables transfer across different robot morphologies, with 81.0% success on the LDT-01 embodiment, which was unseen during pretraining
- Real-world testing demonstrated a 90.1% success rate across 6,360 universal grasping trials with 212 unseen objects
- The semantic-aligned 3D integration preserves vision-language knowledge while adding geometry awareness through a trainable spatial backbone