Researchers have introduced ActionParty, the first video diffusion model capable of controlling up to seven agents simultaneously within the same scene. The arXiv paper (arXiv:2604.02330), published April 2, 2026, by Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, and Aliaksandr Siarohin, addresses a fundamental limitation of existing video generation models, which struggle with multi-agent control.
ActionParty Solves Action Binding Through Subject State Tokens
Existing video diffusion models face a critical problem known as action binding: given multiple commands such as "triangle moves right, square moves up," they frequently swap actions between subjects or ignore some commands entirely. ActionParty introduces subject state tokens, latent variables that persistently capture the state of each subject throughout the scene. By jointly modeling these state tokens and video latents with a spatial biasing mechanism, the system disentangles global frame rendering from the action-controlled updates of each individual subject.
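The paper does not spell out the exact architecture here, but the idea of spatially biased joint modeling can be sketched in miniature. The toy below is a hypothetical stand-in: each subject gets one state token, and a bias added to cross-attention logits makes each subject dominate "its" region of the latent frame, so its action updates land on the right pixels. All sizes, the random bias map, and the residual-update scheme are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(0)

num_subjects, d = 3, 8      # hypothetical sizes: 3 subject tokens, dim 8
H, W = 4, 4                 # latent frame grid

state = rng.normal(size=(num_subjects, d))   # one persistent token per subject
latents = rng.normal(size=(H, W, d))         # video latents for one frame

# Hypothetical spatial bias: each subject is softly tied to a region.
# A random logit map stands in for whatever bias the model would learn.
bias = rng.normal(size=(num_subjects, H, W))

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention between latents and subject tokens, with the spatial
# bias added to the logits so each subject mainly influences its region.
logits = np.einsum('hwd,sd->shw', latents, state) / np.sqrt(d)
attn = softmax(logits + bias, axis=0)          # (S, H, W); sums to 1 over S

# Frame rendering: each latent cell receives a bias-weighted mix of states.
update = np.einsum('shw,sd->hwd', attn, state)
latents = latents + update                     # residual update of the frame

# State update: each token reads back from the region it attended to,
# so it can persistently track its subject across frames.
state = state + np.einsum('shw,hwd->sd', attn, latents) / (H * W)
```

Because the attention weights sum to one over subjects at every spatial location, the two update directions stay disentangled: the frame is a per-pixel mixture of subject states, while each state token only integrates evidence from its own region.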
Benchmark Results Show Multi-Agent Control Across 46 Environments
The researchers evaluated ActionParty on the Melting Pot benchmark, demonstrating control of up to seven players simultaneously across 46 diverse game environments. The results show significant improvements in both action-following accuracy and identity consistency compared to existing approaches. ActionParty operates as a generative game engine, rendering frames and updating state based on actions in a single forward pass, enabling true multi-agent interactive environments.
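The "generative game engine" loop can be made concrete with a deliberately tiny stand-in. The class below is not ActionParty (which is a diffusion model); it only illustrates the interface the paragraph describes: a single `step` call takes one action per player, updates every player's state, and renders the next frame. The class name, grid-world dynamics, and action set are all invented for illustration.

```python
import numpy as np

class ToyGameEngine:
    """Hypothetical sketch of the generative-game-engine interface:
    one forward pass maps (state, per-player actions) -> (frame, new state).
    A real model would render pixels; this toy moves labeled dots on a grid."""

    MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

    def __init__(self, num_players, size=8):
        self.size = size
        # Toy internal state: one (row, col) position per player.
        self.pos = [(i, i) for i in range(num_players)]

    def step(self, actions):
        # "Single forward pass": update every player's state from its
        # action, then render one frame from the updated state.
        for i, action in enumerate(actions):
            dr, dc = self.MOVES[action]
            r, c = self.pos[i]
            self.pos[i] = ((r + dr) % self.size, (c + dc) % self.size)
        frame = np.zeros((self.size, self.size), dtype=int)
        for i, (r, c) in enumerate(self.pos):
            frame[r, c] = i + 1   # each player keeps a persistent identity
        return frame

# Drive seven players at once, mirroring the benchmark's player count.
engine = ToyGameEngine(num_players=7)
frame = engine.step(['right'] * 7)
```

The point of the sketch is the control flow, not the dynamics: interaction happens by calling `step` repeatedly, and action binding in this setting means player `i`'s action must only ever move the dot labeled `i + 1`.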
Technical Achievement Opens Path to Procedurally Generated Multiplayer Games
Unlike previous video world models restricted to single-agent scenarios, ActionParty's architecture enables multiple characters to be independently controlled within the same generated video. This represents a step toward procedurally generated multiplayer games where the game engine itself is a neural network. The system maintains consistent agent identities while processing complex multi-agent interactions, addressing one of the key challenges in generative video for interactive applications.
Key Takeaways
- ActionParty is the first video world model capable of controlling up to seven players simultaneously in the same scene
- The system introduces subject state tokens to solve the action binding problem that plagued previous video diffusion models
- Researchers evaluated the model across 46 diverse game environments using the Melting Pot benchmark
- The architecture functions as a generative game engine, rendering frames and updating state in one forward pass
- This breakthrough enables true multi-agent interactive environments and moves toward procedurally generated multiplayer games