GEAR-VLA Achieves 35-Point Improvement on Unseen Viewpoints With Geometry-Aware Action Representations

Researchers have developed GEAR-VLA, a vision-language-action (VLA) framework that learns unified geometry-aware action representations for robotic manipulation. By explicitly incorporating 3D geometry understanding, the system improves results on LIBERO unseen-view scenarios by 35 percentage points over baseline VLA models, addressing a critical weakness of current VLAs, which learn view-dependent 2D patterns.

Current VLA Models Struggle With Real-World Deployment Scenarios

Vision-language-action models perform well on benchmarks but fail in real-world deployment when they encounter unseen objects, background shifts, or different robot embodiments. Yuan Zhang and colleagues identified the root cause as the lack of a unified geometry-aware manipulation representation, which leaves VLAs vulnerable to trajectory supervision issues, misaligned 3D features, and embodiment differences.

GEAR-VLA Integrates 3D Geometry Through Three Key Components

The first component, coarse-to-fine action learning, combines multi-source embodied pretraining with latent action tokens that connect semantic understanding to continuous action generation. The second, semantic-aligned 3D integration, aligns a trainable 3D spatial backbone with the VLA representations while keeping the original vision-language pathway frozen to preserve its learned knowledge.
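The idea of training a 3D branch against a frozen vision-language pathway can be illustrated with a toy example. The sketch below is not GEAR-VLA's implementation: the parameter names (vl_frozen, geo_trainable), the linear projections, and the squared-error alignment loss are all assumptions chosen to keep the example small. It only shows the general pattern in which the frozen pathway supplies a target and gradient updates touch the trainable branch alone.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def alignment_loss(pred: Vector, target: Vector) -> float:
    """Squared-error alignment loss (an assumption, not the paper's loss)."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))


def train_step(params: Dict[str, Matrix], geo_feat: Vector,
               vl_input: Vector, lr: float = 0.1) -> float:
    """One gradient step that updates only the trainable 3D projection."""
    # Frozen pathway: used in the forward pass, never modified.
    target = matvec(params["vl_frozen"], vl_input)
    pred = matvec(params["geo_trainable"], geo_feat)
    loss = alignment_loss(pred, target)
    # d(loss)/dW[i][j] = (pred[i] - target[i]) * geo_feat[j]
    err = [p - t for p, t in zip(pred, target)]
    w = params["geo_trainable"]
    for i, e in enumerate(err):
        for j, g in enumerate(geo_feat):
            w[i][j] -= lr * e * g
    return loss


# Hypothetical toy values, chosen only for illustration.
params = {
    "vl_frozen": [[1.0, 0.0], [0.0, 1.0]],
    "geo_trainable": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
}
geo_feat = [1.0, 0.5, -0.5]   # stand-in for a 3D backbone feature
vl_input = [0.2, -0.4]        # stand-in for a vision-language feature

losses = [train_step(params, geo_feat, vl_input) for _ in range(20)]
print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

In this toy setup the alignment loss shrinks across steps while the vl_frozen matrix keeps its initial values, which mirrors the stated goal of adding geometry awareness without overwriting existing vision-language knowledge.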

The third component, embodiment canonicalization, separates robot-specific states from robot-invariant actions. Morphology differences are confined to low-level interfaces, so representations can be shared across different robots without full retraining.
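One way to picture this separation is a shared action type plus a thin per-robot adapter. The following is a minimal sketch under stated assumptions, not the GEAR-VLA design: the CanonicalAction and EmbodimentInterface classes, their fields, and the two robots (arm_a, arm_b) with their limits are hypothetical and do not describe AgileX or LDT-01 hardware.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CanonicalAction:
    """Robot-invariant action (hypothetical format)."""
    delta_xyz: Tuple[float, float, float]  # end-effector translation, metres
    gripper_closure: float                 # 0.0 = fully open, 1.0 = fully closed


@dataclass(frozen=True)
class EmbodimentInterface:
    """Low-level interface holding robot-specific constants (hypothetical)."""
    name: str
    max_gripper_width_m: float
    max_step_m: float  # per-step translation limit on each axis

    def to_command(self, action: CanonicalAction) -> Dict[str, object]:
        """Map a canonical action to this robot's command space."""
        # Clip each translation component to the robot's step limit.
        clipped = tuple(
            max(-self.max_step_m, min(self.max_step_m, d))
            for d in action.delta_xyz
        )
        # Convert normalized closure into an absolute gripper width.
        closure = max(0.0, min(1.0, action.gripper_closure))
        width = (1.0 - closure) * self.max_gripper_width_m
        return {"robot": self.name, "delta_xyz": clipped, "gripper_width_m": width}


# Two made-up embodiments that share one policy output.
arm_a = EmbodimentInterface("arm_a", max_gripper_width_m=0.08, max_step_m=0.02)
arm_b = EmbodimentInterface("arm_b", max_gripper_width_m=0.10, max_step_m=0.05)

action = CanonicalAction(delta_xyz=(0.01, -0.05, 0.0), gripper_closure=0.25)
cmd_a = arm_a.to_command(action)
cmd_b = arm_b.to_command(action)
print(cmd_a)
print(cmd_b)
```

The same CanonicalAction yields different low-level commands for each arm, so supporting a new robot in this sketch means writing a new adapter rather than changing the action representation.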

Real-World Testing Demonstrates 90% Success Rate on Universal Grasping Tasks

GEAR-VLA achieved state-of-the-art performance across real-world tests and simulation benchmarks:

AgileX robot: 85.9% success rate on real-world tasks
LDT-01 embodiment (unseen during pretraining): 81.0% success rate
Universal grasping: 90.1% success across 6,360 trials with 212 unseen objects
LIBERO unseen-view scenarios: 35 percentage point improvement over baseline VLA models
CALVIN unseen-view scenarios: 11+ percentage point improvement

The system demonstrated strong zero-shot performance on LIBERO-Plus and top results on RoboTwin 2.0, showing that explicit 3D geometry understanding enables generalization to new viewpoints and embodiments.
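A quick arithmetic check helps put the universal grasping figures in context and clarifies what a percentage point gain means. The sketch below uses only the trial count, object count, and success rate reported above. The even split of trials across objects is an assumption, and the 40% and 75% values in the second half are illustrative placeholders, not reported results.

```python
TRIALS = 6360
UNSEEN_OBJECTS = 212
REPORTED_SUCCESS_RATE = 0.901  # 90.1%, rounded as reported

# Assumes trials were distributed evenly across objects (not stated in the source).
trials_per_object = TRIALS / UNSEEN_OBJECTS

# The reported rate is rounded, so this count is approximate.
approx_successes = round(REPORTED_SUCCESS_RATE * TRIALS)


def percentage_point_gain(baseline_pct: float, new_pct: float) -> float:
    """Absolute difference between two percentages."""
    return new_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, new_pct: float) -> float:
    """Improvement expressed relative to the baseline, in percent."""
    return 100.0 * (new_pct - baseline_pct) / baseline_pct


# Placeholder numbers for illustration only: a 40% baseline rising to 75%.
pp_gain = percentage_point_gain(40.0, 75.0)
rel_gain = relative_gain_pct(40.0, 75.0)

print(f"trials per object (if evenly split): {trials_per_object:.0f}")
print(f"approximate successful grasps: {approx_successes}")
print(f"placeholder example: +{pp_gain:.0f} points = +{rel_gain:.1f}% relative")
```

Under the even-split assumption, each unseen object was tested 30 times, and a 90.1% rate corresponds to roughly 5,730 successful grasps. The placeholder example shows why a 35 percentage point gain is a larger relative change when the baseline is low.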

Key Takeaways

GEAR-VLA addresses VLA model limitations by incorporating explicit 3D geometry understanding rather than relying on view-dependent 2D patterns
The framework achieves a 35 percentage point improvement on LIBERO unseen-view scenarios and gains of more than 11 percentage points on CALVIN unseen views
Embodiment canonicalization enables transfer across different robot morphologies, with 81.0% success on the LDT-01 embodiment, which was unseen during pretraining
Real-world testing demonstrated 90.1% success rate across 6,360 universal grasping trials with 212 unseen objects
The semantic-aligned 3D integration preserves vision-language knowledge while adding geometry awareness through a trainable spatial backbone
