Parlor, an open-source project created by developer fikrikarim, enables natural voice and vision conversations with an AI that runs entirely on the user's hardware, eliminating cloud dependencies and their associated costs. The system pairs Google's Gemma 4 E2B model with Kokoro text-to-speech to deliver sub-3-second response times on Apple Silicon.
Three-Component Architecture Processes Audio and Video Locally
The architecture distributes processing across browser, server, and output layers. The browser captures microphone audio and camera frames and performs voice activity detection locally with Silero VAD. A FastAPI WebSocket backend processes the inputs, generates responses, and streams synthesized audio back to the browser for playback.
Two core models do the work: Gemma 4 E2B handles speech and vision understanding via LiteRT-LM on the GPU, while Kokoro 82M generates text-to-speech through MLX on Mac or ONNX on Linux. The complete system needs roughly 3GB of RAM, and the Gemma 4 E2B model occupies about 2.6GB of storage.
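The MLX-on-Mac versus ONNX-on-Linux split implies a simple platform check at startup. A hedged sketch of how such a dispatch could look (the function name is illustrative, not Parlor's API):

```python
import platform

def pick_tts_backend() -> str:
    """Choose the Kokoro TTS runtime for the current platform,
    mirroring the split described in the article."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"   # Apple Silicon: Metal-accelerated MLX runtime
    return "onnx"      # Linux and other platforms: ONNX Runtime
```

Selecting the backend once at startup keeps the rest of the pipeline agnostic to which runtime actually produces the audio.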
M3 Pro Performance Achieves Near-Real-Time Response
Benchmark testing on Apple M3 Pro hardware shows end-to-end latency of 2.5-3.0 seconds, with decode speeds reaching 83 tokens per second on the GPU. The system supports hands-free operation through voice activity detection, barge-in to interrupt the assistant mid-response, and sentence-level TTS streaming for faster perceived responsiveness.
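Sentence-level TTS streaming means synthesis starts as soon as the first sentence of the LLM's reply is complete, rather than waiting for the full response. A minimal sketch of the idea, assuming a token iterator from the model (the sentence-boundary regex is deliberately naive and would mis-split abbreviations):

```python
import re

def stream_sentences(token_iter):
    """Accumulate streamed LLM tokens and yield each sentence as soon
    as it is complete, so TTS can begin before the reply finishes."""
    buf = ""
    for tok in token_iter:
        buf += tok
        # flush every complete sentence found so far
        while (m := re.search(r"[.!?](\s|$)", buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # trailing fragment without end punctuation
```

Feeding each yielded sentence straight into TTS is what shrinks perceived latency: the user hears audio after one sentence's worth of decoding instead of the whole reply's.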
Multilingual support and camera integration for visual context extend functionality beyond basic voice assistants. System requirements include Python 3.12 or higher and either an Apple Silicon Mac or a Linux machine with a GPU.
Privacy-First Design Eliminates Cloud Dependencies
Unlike commercial voice assistants that process data through cloud services, Parlor runs entirely on-device with no API keys required and no data leaving the local machine. This architecture provides privacy guarantees while maintaining functionality comparable to cloud-based alternatives.
The GitHub repository has accumulated 1,137 stars across 36 commits since its creation on April 5, 2026. The project carries an Apache 2.0 license and is marked as a "research preview" with acknowledged rough edges in its current implementation.
Key Takeaways
- Parlor combines Gemma 4 E2B and Kokoro 82M models to deliver multimodal voice AI running entirely on local hardware
- Performance on M3 Pro achieves 2.5-3.0 second end-to-end latency with 83 tokens per second decode speed
- Privacy-first architecture requires no API keys or cloud services, keeping all data on-device
- System supports hands-free operation, barge-in interruption, and real-time camera integration for visual context
- Minimum requirements include 3GB RAM, 2.6GB storage, and Apple Silicon Mac or Linux GPU hardware