Parlor demonstrates that capable multimodal AI conversations can run entirely on consumer hardware without cloud infrastructure, achieving 2.5-3.0 second end-to-end latency on an Apple M3 Pro. Created by developer Fikri Karim, the open-source project addresses both privacy and sustainability concerns by processing speech and vision locally while enabling natural voice interactions.
Technical Architecture Coordinates Multiple AI Components
Parlor follows a streamlined pipeline: Browser → WebSocket (PCM audio + JPEG frames) → FastAPI Server → Model Processing → WebSocket (audio output) → Browser Playback. The system integrates Gemma 3n E2B via LiteRT-LM for speech and vision understanding, Kokoro for text-to-speech (MLX on Mac, ONNX on Linux), and Silero VAD running in the browser for hands-free voice activity detection.
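The pipeline's first hop is the browser sending interleaved audio and camera data over one WebSocket. As a minimal sketch of how a server might demultiplex such a stream, the code below assumes a hypothetical 1-byte tag framing (the tags and function names are illustrative assumptions, not Parlor's documented wire format):

```python
import struct

# Hypothetical framing (an assumption, not Parlor's actual protocol):
# each binary WebSocket message starts with a 1-byte type tag.
TAG_AUDIO, TAG_FRAME = 0x01, 0x02  # 16-bit PCM chunk / JPEG camera frame

def pack_message(tag: int, payload: bytes) -> bytes:
    """Prefix a payload with its 1-byte type tag before sending."""
    return struct.pack("B", tag) + payload

def unpack_message(message: bytes) -> tuple[int, bytes]:
    """Split an incoming binary message into (tag, payload)."""
    return message[0], bytes(message[1:])

def demux(messages):
    """Route tagged messages into a PCM buffer and a list of JPEG frames."""
    pcm, frames = bytearray(), []
    for msg in messages:
        tag, payload = unpack_message(msg)
        if tag == TAG_AUDIO:
            pcm.extend(payload)     # accumulate speech for the model
        elif tag == TAG_FRAME:
            frames.append(payload)  # keep camera frames for vision input
    return bytes(pcm), frames
```

Keeping both modalities on a single socket preserves their relative ordering, which matters when a spoken question refers to what the camera currently sees.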
Performance Metrics Show Sub-3-Second Response Times
On an Apple M3 Pro, Parlor reports the following measurements:
- Speech + vision processing: 1.8-2.2 seconds
- Response generation (approximately 25 tokens): 0.3 seconds
- Text-to-speech (1-3 sentences): 0.3-0.7 seconds
- Total end-to-end latency: 2.5-3.0 seconds
- Decode speed: 83 tokens per second on GPU
Advanced Features Enable Natural Conversations
Parlor includes barge-in capability allowing users to interrupt mid-response by speaking, sentence-level streaming that begins audio playback before full response generation completes, and multilingual support enabling users to switch to native languages. All processing happens locally with zero cloud dependency, requiring approximately 3GB of free RAM.
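Sentence-level streaming can be sketched as an incremental splitter over the model's token stream: each completed sentence is handed to TTS while generation continues. The boundary regex and function below are illustrative assumptions, not Parlor's actual code:

```python
import re

# Treat ". ", "! ", "? " (punctuation followed by whitespace) as boundaries.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def stream_sentences(tokens):
    """Yield each complete sentence as soon as it appears in a token
    stream, so TTS playback can start before generation finishes."""
    buf = ""
    for tok in tokens:
        buf += tok
        parts = _BOUNDARY.split(buf)
        for sentence in parts[:-1]:  # everything but the tail is complete
            yield sentence
        buf = parts[-1]              # keep the unfinished remainder
    if buf.strip():
        yield buf.strip()            # flush whatever is left at the end
```

Feeding `["Hello", " world.", " How", " are you?"]` through this generator emits "Hello world." while the model is still producing the second sentence, which is what lets audio begin well before the full response exists.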
Privacy-Preserving Architecture Addresses Sustainability Concerns
Karim emphasizes that Parlor addresses sustainability concerns associated with free cloud services while enabling privacy-preserving local AI interactions. The ability to run multimodal conversations entirely on-device represents a significant shift from the cloud-dependent paradigm that dominates current AI applications.
Community Reception Highlights Demand for On-Device AI
The project reached 186 Hacker News points within hours of release, with comments focusing on practical implications of on-device AI and comparisons to cloud services. One developer noted: "This is exactly what I've been waiting for - real conversational AI that doesn't send my data anywhere."
Sophisticated Engineering Minimizes Latency
Parlor demonstrates sophisticated coordination of multiple AI components (speech recognition, vision understanding, language generation, and text-to-speech) with minimal latency while maintaining quality. The sentence-level streaming approach represents a practical optimization that improves user experience without sacrificing accuracy.
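The barge-in behavior can be sketched as playback that polls a VAD-driven flag between audio chunks and abandons the rest of the response the moment the user speaks. All names here are hypothetical illustrations, not Parlor's implementation:

```python
import asyncio

async def speak_with_barge_in(audio_chunks, user_is_speaking, play):
    """Play TTS audio chunk by chunk, aborting as soon as the
    VAD-driven user_is_speaking() predicate turns true."""
    for chunk in audio_chunks:
        if user_is_speaking():  # barge-in: the user interrupted
            return False        # drop the remainder of the response
        await play(chunk)
    return True                 # full response delivered

# Demo: the "user" starts talking after two chunks have played.
played = []

async def _demo():
    async def play(chunk):
        played.append(chunk)
    return await speak_with_barge_in(
        ["chunk1", "chunk2", "chunk3", "chunk4"],
        user_is_speaking=lambda: len(played) >= 2,
        play=play,
    )

finished = asyncio.run(_demo())
```

Polling per chunk keeps the interruption latency bounded by one chunk's playback duration, which is why short TTS chunks make barge-in feel immediate.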
Open-Source Availability Enables Further Development
Full source code is available on GitHub at https://github.com/fikrikarim/parlor with Apache-style licensing. The project has garnered 517 stars as of April 6, 2026, indicating strong community interest in privacy-preserving multimodal AI.
Key Takeaways
- Parlor achieves 2.5-3.0 second end-to-end latency for multimodal AI conversations on an Apple M3 Pro
- The system runs completely locally, using Gemma 3n E2B for speech and vision understanding and Kokoro for text-to-speech synthesis
- Advanced features include barge-in capability, sentence-level streaming, and multilingual support
- The project addresses privacy and sustainability concerns by eliminating cloud dependency
- Open-source code is available on GitHub with 517 stars and Apache-style licensing