TempoVLA: Speed-Controllable Robot Policies Accelerate Through Low-Risk Phases

Researchers have developed TempoVLA, a Vision-Language-Action model that enables robots to dynamically adjust their movement speed based on task risk. Described in a paper posted to arXiv on June 4, 2026, the system addresses a fundamental limitation in current robot manipulation policies: they cannot slow down for high-risk contact operations or speed up through safer transit phases.

Current VLA Models Operate at Fixed Speeds

Existing Vision-Language-Action models inherit a single fixed speed from their training demonstrations, making them inefficient at tasks that alternate between low-risk movement and high-precision operations. Previous attempts to accelerate VLAs only shifted the policy from one fixed speed to another, leaving controlled deceleration largely unexplored. TempoVLA builds on the observation that the magnitude of each predicted action already governs how fast the robot moves, which creates a direct pathway to controllable execution.
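The link between action magnitude and speed can be illustrated with a small sketch. It assumes, purely for illustration, that each action is an end-effector displacement executed at a fixed control rate; the function name, units, and numbers are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def effective_speed(delta_xyz: Sequence[float], control_hz: float) -> float:
    """Speed implied by one displacement action at a fixed control rate.

    Assumption: the controller executes exactly one action per tick,
    so speed = displacement magnitude * ticks per second.
    """
    magnitude = math.sqrt(sum(x * x for x in delta_xyz))
    return magnitude * control_hz


# Doubling every action's magnitude doubles the speed at the same control rate.
base = effective_speed([0.003, 0.004, 0.0], control_hz=10.0)
fast = effective_speed([0.006, 0.008, 0.0], control_hz=10.0)
```

Under this assumption, a policy that outputs larger or smaller displacements per step moves faster or slower without any change to the control loop itself.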

Variable-Speed Trajectory Augmentation Enables Flexible Training

The system employs a two-part approach combining data augmentation with model conditioning. Variable-Speed Trajectory Augmentation (VSTA) re-times training demonstrations to any target speed by merging or splitting actions while preserving motion semantics. Experimental results show VSTA reaches requested speeds with negligible motion error. On the model side, TempoVLA feeds the desired speed to the policy as an explicit input, enabling the generation of appropriate action magnitudes for different execution contexts.
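A minimal sketch of the re-timing idea follows. The article does not describe the paper's exact merging and splitting rules, so this version makes its own assumptions: actions are per-step displacements, the path between steps is linear, and non-displacement channels such as gripper commands are ignored. The function name `retime_deltas` is hypothetical.

```python
from typing import List, Sequence

Action = List[float]


def retime_deltas(deltas: Sequence[Sequence[float]], speed: float) -> List[Action]:
    """Re-time per-step displacement actions to `speed` times the original.

    speed > 1 merges actions (fewer, larger steps); speed < 1 splits them
    (more, smaller steps). Total displacement is preserved.
    Illustrative only; not the paper's implementation.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not deltas:
        return []
    n, dim = len(deltas), len(deltas[0])

    # Cumulative path: waypoints[i] is the offset reached after i original steps.
    waypoints = [[0.0] * dim]
    for d in deltas:
        waypoints.append([p + x for p, x in zip(waypoints[-1], d)])

    def pose_at(t: float) -> Action:
        # Linear interpolation between neighbouring waypoints, 0 <= t <= n.
        i = min(int(t), n - 1)
        frac = t - i
        return [a + frac * (b - a) for a, b in zip(waypoints[i], waypoints[i + 1])]

    out: List[Action] = []
    prev = waypoints[0]
    k = 1
    while True:
        t = min(k * speed, float(n))  # clamp the final step to the path end
        cur = pose_at(t)
        out.append([c - p for c, p in zip(cur, prev)])
        prev = cur
        if t >= n:
            break
        k += 1
    return out


demo = [[1.0, 0.0]] * 4
merged = retime_deltas(demo, 2.0)  # 2 actions, each twice as large
split = retime_deltas(demo, 0.5)   # 8 actions, each half as large
```

When the speed does not divide the trajectory evenly, the last action in this sketch is shorter than the rest so that the end point still matches the original demonstration.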

Dynamic Speed Control Responds to Real-Time Risk Assessment

By integrating with large multimodal models, TempoVLA achieves dynamic speed control that adjusts in real time based on visual risk assessment. Vision-language models evaluate camera images to identify high-risk scenarios requiring precision, such as grasping fragile objects or operating near humans, and command appropriate speed adjustments. In warehouse environments, this enables robots to move quickly through open spaces while automatically decelerating when approaching obstacles or performing delicate manipulations.
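One way to picture the risk-to-speed step is the sketch below. It assumes the multimodal model returns a risk score between 0 and 1 for each camera frame; the `SpeedScheduler` class, the speed range, and the ramp limit are all hypothetical choices made for illustration, not details from the paper.

```python
from dataclasses import dataclass


@dataclass
class SpeedScheduler:
    """Maps a risk score in [0, 1] to a speed multiplier for the policy.

    Assumed design: decelerate immediately when risk rises, but ramp
    back up gradually so a single low-risk frame cannot cause a jump.
    """
    min_speed: float = 0.5   # multiplier used at maximum risk
    max_speed: float = 2.0   # multiplier used at zero risk
    max_ramp: float = 0.25   # largest allowed increase per update
    current: float = 1.0     # start at the demonstration speed

    def update(self, risk: float) -> float:
        risk = min(max(risk, 0.0), 1.0)  # clamp out-of-range scores
        target = self.max_speed - risk * (self.max_speed - self.min_speed)
        if target < self.current:
            self.current = target  # slow down without delay
        else:
            self.current = min(target, self.current + self.max_ramp)
        return self.current


scheduler = SpeedScheduler()
open_space = scheduler.update(0.0)     # ramps up from 1.0 toward 2.0
fragile_grasp = scheduler.update(1.0)  # drops straight to the minimum
after_grasp = scheduler.update(0.0)    # begins ramping up again
```

The returned multiplier would then be passed to the policy as its speed input. The asymmetric ramp is a conservative choice in this sketch: errors in the risk estimate cost time rather than safety.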

Experiments Demonstrate Performance Gains in Both Directions

Simulation and real-world experiments confirm that TempoVLA achieves flexible speed control in both directions, acceleration and deceleration. The Variable-Speed Trajectory Augmentation component also improves baseline performance at the standard 1× speed through better data utilization. These results position TempoVLA as a step forward for Vision-Language-Action models, which unify robot control by combining vision, language understanding, and action generation in a single system.

Key Takeaways

TempoVLA enables robot policies to dynamically adjust movement speed based on task risk, accelerating through low-risk phases and decelerating during high-precision operations
Variable-Speed Trajectory Augmentation (VSTA) re-times training demonstrations to any target speed while preserving motion semantics with negligible error
The system integrates with large multimodal models for real-time risk assessment and speed commands based on visual input
Experiments in simulation and real-world tasks confirm flexible speed control in both directions while improving baseline performance through better data utilization
TempoVLA addresses a fundamental limitation in Vision-Language-Action models, which previously operated at single fixed speeds inherited from training data
