Thinking Machines Lab published research on May 11, 2026 introducing "interaction models"—AI systems designed to process audio, video, and text simultaneously in continuous 200-millisecond chunks rather than sequential turn-based exchanges. The new architecture challenges traditional approaches that "bolt on interactivity with a harness," instead building real-time multimodal communication as a fundamental design principle.
Time-Aligned Architecture Replaces Traditional Turn-Based Systems
Interaction models differ fundamentally from conventional AI systems by processing communication in a continuous temporal loop. Traditional models experience conversation sequentially, waiting for complete user input before generating a response. Interaction models instead split simultaneous input and output streams into 200ms segments, enabling full-duplex interaction: the model can listen and speak at the same time.
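The micro-turn idea can be pictured as a loop in which each step consumes one 200ms slice of input and emits one slice of output, rather than waiting for a finished utterance. The sketch below is purely illustrative; the class and method names are hypothetical, not from the research:

```python
CHUNK_MS = 200  # micro-turn length described in the article

class InteractionModel:
    """Hypothetical sketch: input and output advance together in 200 ms
    micro-turns, so the model can emit output while input still arrives."""

    def step(self, input_chunk: str, t_ms: int) -> str:
        # In a real system this would be one forward pass over the
        # time-aligned audio/video/text tokens for this 200 ms window.
        return f"[{t_ms}ms] out-chunk given '{input_chunk}'"

model = InteractionModel()
incoming = ["hel", "lo ", "the", "re!"]   # simulated 200 ms input slices
outgoing = [model.step(chunk, i * CHUNK_MS) for i, chunk in enumerate(incoming)]
print(outgoing[0])  # output exists at t=0, before the utterance has finished
```

The contrast with a turn-based system is that output here is produced per time slice, not per completed user turn.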
The researchers note that current systems typically "stitch components together to emulate interruptions, multimodality, or concurrency," treating interactivity as an afterthought. Their approach reimagines the architecture to "perceive and respond in the same continuous loop" natively.
Core Capabilities Enable Natural Human-AI Collaboration
The interaction model architecture enables several capabilities that turn-based systems struggle to replicate:
- Seamless dialog management without separate orchestration components
- Verbal and visual interjections triggered by contextual understanding, not speech completion
- Simultaneous speech enabling scenarios like live translation where both parties speak concurrently
- Time-awareness for tracking elapsed duration within conversations
- Concurrent tool use woven into ongoing dialogue rather than discrete request-response cycles
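The second capability above, interjections triggered by content rather than by end-of-speech, can be sketched as a trigger that fires mid-utterance. The logic below is entirely hypothetical; the article does not describe how interjection decisions are made:

```python
def should_interject(partial_transcript: str, silence_ms: int) -> bool:
    """Turn-based systems interject only after the speaker stops talking
    (silence detection); a contextual trigger can also fire mid-utterance."""
    # Hypothetical cue phrases standing in for learned contextual signals
    misunderstanding_cues = ("no wait", "i mean the other", "that's not what")
    contextual = any(cue in partial_transcript.lower() for cue in misunderstanding_cues)
    end_of_speech = silence_ms > 700  # classic turn-taking heuristic
    return contextual or end_of_speech

# Fires mid-sentence on content, not on silence:
print(should_interject("No wait, I meant Tuesday", silence_ms=0))  # True
print(should_interject("so as I was saying", silence_ms=0))        # False
```

In an interaction model this decision would be implicit in the model's per-chunk outputs rather than a separate rule, but the sketch shows the behavioral difference from silence-gated turn taking.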
A background model handles sustained reasoning asynchronously while the interaction model maintains continuous real-time presence, allowing the system to think deeply while remaining responsive.
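One way to picture this division of labor is a fast loop that never blocks on a slow reasoner. The asyncio sketch below is a hypothetical illustration; the article does not specify the coordination mechanism between the two models:

```python
import asyncio

async def background_reasoner(question: str) -> str:
    """Stands in for the slower background model doing sustained reasoning."""
    await asyncio.sleep(0.5)  # simulated deep-thinking latency
    return f"considered answer to '{question}'"

async def interaction_loop(steps: int) -> list[str]:
    """Stands in for the interaction model's continuous micro-turn loop."""
    pending = asyncio.create_task(background_reasoner("hard question"))
    spoken = []
    for t in range(steps):
        if pending.done():
            spoken.append(pending.result())  # weave the result into dialogue
            break
        spoken.append(f"[{t * 200}ms] responsive filler chunk")  # stay present
        await asyncio.sleep(0.2)  # one micro-turn of real time
    return spoken

chunks = asyncio.run(interaction_loop(steps=6))
print(chunks)
```

The loop keeps emitting output every micro-turn and only incorporates the background result once it is ready, which is the responsiveness-plus-depth behavior the article describes.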
Technical Implementation Prioritizes Low-Latency Processing
The system employs several technical innovations to achieve real-time performance:
- Time-aligned micro-turns: Splitting communication streams into 200ms segments for concurrent processing
- Encoder-free early fusion: Minimizing preprocessing overhead by fusing modalities early in the pipeline
- Streaming session optimization: Reducing latency through persistent inference sessions
- Trainer-sampler alignment: Keeping training and sampling code paths consistent, improving stability and simplifying debugging during development
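Encoder-free early fusion can be illustrated as interleaving tokens from every modality into a single time-ordered sequence that one model consumes directly, instead of passing each stream through its own encoder and fusing late. The token formats below are invented for illustration:

```python
# Each stream yields (timestamp_ms, token) pairs; early fusion merges them
# into one time-aligned sequence, rather than encoding each modality
# separately and combining the results downstream.
audio = [(0, "AUD_0"), (200, "AUD_1"), (400, "AUD_2")]
video = [(0, "VID_0"), (200, "VID_1"), (400, "VID_2")]
text  = [(200, "TXT_hi")]

def early_fuse(*streams):
    """Merge modality streams by timestamp into one token sequence.

    sorted() is stable, so tokens sharing a timestamp keep the
    order in which their streams were passed in.
    """
    merged = sorted((pair for s in streams for pair in s), key=lambda p: p[0])
    return [token for _, token in merged]

sequence = early_fuse(audio, video, text)
print(sequence)
# ['AUD_0', 'VID_0', 'AUD_1', 'VID_1', 'TXT_hi', 'AUD_2', 'VID_2']
```

The resulting sequence is what "time-aligned" means operationally: the model sees what was heard, shown, and typed in the same 200ms window as adjacent tokens.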
Their TML-Interaction-Small model contains 276 billion parameters, of which 12 billion are active per forward pass. The team reports state-of-the-art results on interaction-quality benchmarks while maintaining competitive intelligence scores, a combination it describes as previously unachieved in the field.
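The gap between total and active parameters implies sparse activation: only a small fraction of the weights participates in any one forward pass, which is what makes a model of this size plausible at 200ms latencies. The arithmetic is below; the sparse-routing interpretation is an inference, not stated in the article:

```python
total_params = 276e9   # reported total parameter count
active_params = 12e9   # reported active parameters per forward pass
ratio = active_params / total_params
print(f"{ratio:.1%} of parameters active per forward pass")  # 4.3%
```

Roughly 4.3% of the model computes on each micro-turn, while the full 276B parameters remain available for routing.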
Architectural Shift Suggests New Category of AI Applications
The research team argues that "interactivity should scale alongside intelligence," suggesting fundamental architectural changes rather than bolted-on components. This approach could transform human-AI collaboration by enabling natural, continuous dialogue instead of rigid prompt-response cycles.
Potential applications include real-time translation with simultaneous speech, collaborative problem-solving where AI maintains continuous presence, and interactive learning environments that respond fluidly to verbal and visual cues.
Key Takeaways
- Thinking Machines Lab introduced interaction models that process audio, video, and text in continuous 200ms chunks rather than sequential turns
- The architecture employs time-aligned micro-turns and encoder-free early fusion to minimize latency in real-time multimodal communication
- TML-Interaction-Small (276B parameters, 12B active) achieves state-of-the-art interaction quality while maintaining competitive intelligence benchmarks
- System enables simultaneous speech, contextual interjections, and concurrent tool use woven into ongoing conversation
- The approach represents a fundamental shift from bolted-on interactivity to native real-time communication as a core architectural principle