RoboCasa365 Introduces 365 Tasks Across 2,500 Kitchen Environments for Robot Training

Researchers from UT Austin released RoboCasa365 on March 4, 2026. The large-scale simulation framework features 365 everyday household tasks across 2,500 diverse kitchen environments. The benchmark includes over 2,200 hours of demonstration data (600 hours from human demonstrations and 1,600 hours synthetically generated) and is designed to support systematic evaluation of generalist robot policies.

Scale Represents Major Expansion for Household Robotics Research

RoboCasa365 expands the original RoboCasa benchmark from 100 tasks to 365 and from 100 scenes to 2,500 kitchen environments. The framework includes 50 new layouts based on real-world homes, paired with 50 additional styles, giving far more scene variety for training household robots than the original benchmark offered. Version 1.0 became openly available on GitHub on February 18, 2026, with support for single-arm mobile platforms, humanoid robots, and quadruped robots with arms.
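One way to read the scene count, assuming every layout can be combined with every style, is as a simple cross product: 50 layouts times 50 styles gives 2,500 environments. The Python sketch below enumerates those pairs. The identifiers (`layout_00`, `style_00`) are hypothetical placeholders, not names from the RoboCasa365 codebase.

```python
from itertools import product

# Hypothetical identifiers; the real framework may name layouts and styles differently.
layouts = [f"layout_{i:02d}" for i in range(50)]
styles = [f"style_{j:02d}" for j in range(50)]

# Assumption: every layout can be rendered in every style.
scenes = list(product(layouts, styles))

print(len(scenes))  # 2500
```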

Task Design Targets Foundational Skills and Complex Activities

The benchmark organizes tasks into two categories: 65 atomic tasks for training foundational skills and composite tasks requiring skill sequencing. The ten foundational skills include pick and place, opening/closing doors and drawers, twisting knobs, turning levers, pressing buttons, insertion, navigation, sliding racks, and opening/closing lids. Composite tasks involve semantically meaningful activities like restocking kitchen supplies and brewing coffee.
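The atomic/composite split can be pictured as a small data structure. The Python sketch below is illustrative only: the `Skill` and `Task` names, the rule that a composite task is one with more than one skill, and the skill sequence assigned to the coffee task are all assumptions made for this example, not definitions taken from the paper or the RoboCasa365 code.

```python
from dataclasses import dataclass
from enum import Enum


class Skill(Enum):
    # The ten foundational skills listed in the article.
    PICK_AND_PLACE = "pick and place"
    OPEN_CLOSE_DOOR = "open/close door"
    OPEN_CLOSE_DRAWER = "open/close drawer"
    TWIST_KNOB = "twist knob"
    TURN_LEVER = "turn lever"
    PRESS_BUTTON = "press button"
    INSERTION = "insertion"
    NAVIGATION = "navigation"
    SLIDE_RACK = "slide rack"
    OPEN_CLOSE_LID = "open/close lid"


@dataclass(frozen=True)
class Task:
    name: str
    skills: tuple  # ordered sequence of Skill values

    @property
    def is_composite(self) -> bool:
        # Assumption: a task that sequences more than one skill is composite.
        return len(self.skills) > 1


# Hypothetical tasks; the skill sequence for the coffee task is a guess.
pick = Task("pick_and_place_mug", (Skill.PICK_AND_PLACE,))
brew = Task(
    "brew_coffee",
    (Skill.NAVIGATION, Skill.PICK_AND_PLACE, Skill.PRESS_BUTTON),
)

print(len(Skill), pick.is_composite, brew.is_composite)  # 10 False True
```

Representing a composite task as an ordered tuple of skills keeps the sequencing explicit, which is the property the article attributes to composite tasks.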

Experiments Analyze Impact of Diversity on Generalization

The research team ran extensive experiments on how task diversity, dataset scale, and environment variation affect robot generalization. According to the paper by Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu, the results provide insights into which of these factors most strongly influence generalist robot capabilities. The researchers designed RoboCasa365 specifically to support multi-task learning, robot foundation model training, and lifelong learning research.

Addresses Critical Gap in Reproducible Benchmarks

The researchers note that the field lacks reproducible, large-scale benchmarks for systematic evaluation of household robotics. RoboCasa365 represents "one of the most diverse and large-scale resources for studying generalist policies," according to the paper. The framework is openly available through GitHub and a dedicated website, with the aim of accelerating research in embodied AI and household robotics.

Key Takeaways

RoboCasa365 includes 365 tasks across 2,500 kitchen environments, up from 100 tasks and 100 scenes in the original version
The framework provides over 2,200 hours of demonstration data, combining 600 hours of human demonstrations with 1,600 hours of synthetic data
Tasks cover ten foundational skills and range from atomic operations to complex composite activities like restocking supplies and brewing coffee
The benchmark supports multiple robot forms including single-arm mobile platforms, humanoid robots, and quadruped robots with arms
Experiments analyze how task diversity, dataset scale, and environment variation affect the performance and generalization of generalist robot policies
