RunAnywhere, a Y Combinator Winter 2026 startup founded by Sanchit and Shubham, launched MetalRT on March 10, 2026: an AI inference engine the company claims is the fastest available for Apple Silicon. The engine runs Qwen3-0.6B at 658 tokens per second on an M4 Max chip, ahead of MLX's 552 tokens per second and llama.cpp's 295. MetalRT is the first engine to natively handle LLM, speech-to-text, and text-to-speech workloads on Apple Silicon through a unified architecture.
Record-Breaking Performance Across Multiple Modalities
MetalRT's benchmarks show significant advantages across all three AI modalities. Time-to-first-token is just 6.6 milliseconds for language model inference. For speech-to-text, the engine processes 70 seconds of audio in 101 milliseconds, a 714x real-time speedup and 4.6x faster than mlx-whisper. Text-to-speech synthesis completes in 178 milliseconds, 2.8x faster than mlx-audio.
The technical approach relies on custom Metal compute shaders with all memory pre-allocated at initialization, resulting in zero memory allocations during inference. This unified engine architecture eliminates the overhead typically associated with chaining multiple models together.
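MetalRT's source and shader design are not public, so the zero-allocation claim can only be illustrated, not reproduced. Below is a minimal sketch of the pattern the founders describe, with NumPy arrays standing in for Metal buffers and all names hypothetical: every buffer the forward pass touches is created once at initialization, and the hot path only writes into those buffers.

```python
import numpy as np

class PreallocatedRunner:
    """Sketch of the zero-allocation inference pattern: every buffer the
    forward pass needs is created once, up front, and reused on each call."""

    def __init__(self, hidden: int, vocab: int):
        rng = np.random.default_rng(0)
        # Weights and scratch buffers allocated exactly once at init time.
        self.w = rng.standard_normal((hidden, vocab)).astype(np.float32)
        self.scratch = np.empty(vocab, dtype=np.float32)  # logits buffer
        self.probs = np.empty(vocab, dtype=np.float32)    # softmax buffer

    def step(self, x: np.ndarray) -> int:
        # All ops write into pre-allocated buffers via `out=`, so the hot
        # path performs no heap allocations of its own.
        np.dot(x, self.w, out=self.scratch)
        np.subtract(self.scratch, self.scratch.max(), out=self.scratch)
        np.exp(self.scratch, out=self.probs)
        self.probs /= self.probs.sum()
        return int(self.probs.argmax())

runner = PreallocatedRunner(hidden=8, vocab=16)
token = runner.step(np.ones(8, dtype=np.float32))
print(token)
```

In the real engine the same idea applies at the Metal level: command buffers, weight buffers, and activation scratch space are set up before the first request, so no allocation or driver-side setup sits on the token-generation path.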
Open-Source Voice Pipeline with 38 macOS Actions
Alongside MetalRT, RunAnywhere released RCLI, an MIT-licensed open-source voice pipeline supporting 38 native macOS actions. The system includes local RAG (Retrieval-Augmented Generation) with 4-millisecond retrieval times over 5,000 chunks and supports 20 hot-swappable models. The software requires M3 or newer chips with Metal 3.1 support.
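RCLI's retrieval internals aren't described, but at 5,000 chunks no index structure is needed: a brute-force scan over pre-normalized embeddings is a few million FLOPs and fits easily in a millisecond-scale budget. A sketch under that assumption (corpus size and embedding dimension are illustrative, not RCLI's actual figures):

```python
import numpy as np

# Hypothetical corpus: 5,000 chunks with 384-dim embeddings.
rng = np.random.default_rng(1)
chunks = rng.standard_normal((5000, 384)).astype(np.float32)
chunks /= np.linalg.norm(chunks, axis=1, keepdims=True)  # normalize once, at index time

def retrieve(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Brute-force cosine retrieval: one matrix-vector product plus a
    partial sort over all chunk scores."""
    q = query / np.linalg.norm(query)
    scores = chunks @ q                        # cosine similarity vs. every chunk
    top = np.argpartition(scores, -k)[-k:]     # unordered top-k indices
    return top[np.argsort(scores[top])[::-1]]  # sorted by score, descending

hits = retrieve(rng.standard_normal(384).astype(np.float32))
print(hits.shape)  # (5,)
```

Pre-normalizing the corpus once means each query costs a single GEMV plus an `argpartition`, which is how low-single-digit-millisecond retrieval at this scale is plausible without an approximate-nearest-neighbor index.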
The founders emphasize that voice applications present the hardest test case for inference speed. "Voice is hardest test: chaining STT+LLM+TTS sequentially. If each adds 200ms, you're at 600ms before user hears a word," they explained. MetalRT's architecture addresses this latency problem by optimizing each step in the pipeline.
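The arithmetic behind the quote is simple but worth making explicit: in a naive sequential pipeline, time-to-first-audio is the sum of all stage latencies. A tiny sketch using the founders' illustrative 200 ms figures (not measured numbers):

```python
# Illustrative per-stage latencies from the quote, in milliseconds.
stages = {"STT": 200, "LLM": 200, "TTS": 200}

# Sequential chaining: the user hears nothing until every stage finishes,
# so the latencies add up directly.
time_to_first_audio_ms = sum(stages.values())
print(time_to_first_audio_ms)  # 600
```

Because the stages add linearly, shaving each one, as MetalRT's per-stage benchmarks aim to do, reduces perceived latency stage by stage rather than requiring any single breakthrough.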
Community Response and Technical Validation
The launch post on Hacker News received 176 points and generated 82 comments, with developers expressing interest in the unified approach to multimodal inference. The project represents a significant optimization effort specifically for Apple's Metal API, taking advantage of hardware capabilities that cross-platform frameworks may not fully exploit.
Key Takeaways
- MetalRT achieves 658 tokens/second on M4 Max for Qwen3-0.6B, outperforming MLX (552 tok/s) and llama.cpp (295 tok/s)
- First unified engine to natively handle LLM, speech-to-text, and text-to-speech on Apple Silicon
- Speech-to-text processes 70 seconds of audio in 101ms (714x real-time), 4.6x faster than mlx-whisper
- RCLI open-source voice pipeline supports 38 macOS actions with 4ms RAG retrieval over 5K chunks
- Requires M3+ chips with Metal 3.1, uses custom Metal shaders with zero-allocation inference design