Salvatore Sanfilippo (antirez), creator of Redis, released ds4.c on May 6, 2026—a native inference engine built exclusively for DeepSeek V4 Flash. The project gained 577 GitHub stars within two days and reached the Hacker News front page with 279 points and 84 comments. Notably, antirez openly declares the software was "developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging."
Deliberately Narrow Philosophy: One Model, Validated Performance
Unlike generic GGUF runners or multi-model frameworks, ds4 takes a deliberately focused approach. The README states: "The local inference landscape contains many excellent projects, but new models are released continuously, and the attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation, long-context tests, and enough agent integration to know if it really works."
The engine validates its logits against DeepSeek's official implementation and runs long-context tests, targeting production readiness rather than demo-grade benchmarks.
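To make the validation idea concrete, here is a minimal C sketch of what comparing an engine's logits against official reference vectors could look like. The function name, calling convention, and tolerance are illustrative assumptions, not ds4's actual test harness.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical check: compare the engine's logits for a prompt against
 * reference logits exported from the official DeepSeek implementation.
 * The tolerance and reporting style are illustrative only. */
static int logits_match(const float *engine, const float *reference,
                        size_t vocab_size, float tolerance) {
    float max_abs_diff = 0.0f;
    for (size_t i = 0; i < vocab_size; i++) {
        float d = fabsf(engine[i] - reference[i]);
        if (d > max_abs_diff) max_abs_diff = d;
    }
    printf("max |engine - reference| = %g\n", max_abs_diff);
    return max_abs_diff <= tolerance;  /* 1 = pass, 0 = fail */
}
```

A harness like this, run over a fixed set of prompts, catches kernel-level regressions that throughput benchmarks alone would miss.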
Why DeepSeek V4 Flash Deserves Dedicated Infrastructure
Antirez outlines eight technical reasons justifying the single-model focus. The 284B-parameter MoE architecture runs faster than dense models because only a fraction of its parameters is active per token. In thinking mode, with maximum thinking disabled, it produces thinking sections roughly one-fifth the length of other models', making thinking-enabled inference practical where with other models it is "practically impossible." The 1-million-token context window and a highly compressed KV cache enable long-context inference on consumer hardware, with on-disk KV cache persistence.
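On-disk KV cache persistence deserves a concrete illustration: the expensive prefill of a very long prompt can be serialized once and reloaded in later sessions. The C sketch below is a generic outline under assumed structures; the header fields, magic value, and function name are hypothetical and not taken from ds4.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative on-disk KV cache persistence: write the cache bytes with
 * a small header recording how many tokens were prefilled, so a later
 * run can skip re-prefilling them. The file format is hypothetical. */
typedef struct {
    uint32_t magic;     /* file identifier */
    uint32_t n_tokens;  /* tokens already prefilled into the cache */
    uint64_t n_bytes;   /* size of the serialized cache payload */
} kv_header;

static int kv_cache_save(const char *path, const void *cache,
                         uint64_t n_bytes, uint32_t n_tokens) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    kv_header h = { 0x4B564331u, n_tokens, n_bytes };
    int ok = fwrite(&h, sizeof h, 1, f) == 1 &&
             fwrite(cache, 1, n_bytes, f) == n_bytes;
    fclose(f);
    return ok ? 0 : -1;
}
```

A heavily compressed cache is what makes persisting sessions approaching the 1M-token window plausible on consumer disks.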
DeepSeek V4 Flash also tolerates 2-bit quantization well when it is applied selectively: only the routed MoE experts are quantized (up/gate projections at IQ2_XXS, down projections at Q2_K), while shared experts, the remaining projections, and routing stay at full precision to retain quality. Antirez also expects DeepSeek to release updated V4 Flash versions, making sustained investment in the model worthwhile.
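A selective scheme like this usually reduces to a per-tensor quantization policy keyed on tensor names. The C sketch below mirrors the split described above; the name patterns (loosely modeled on llama.cpp-style GGUF naming) and the enum values are assumptions, not ds4's actual code.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative per-tensor policy: 2-bit quantize only routed-expert
 * weights; everything else stays at full precision. The tensor-name
 * patterns here are hypothetical. */
typedef enum { QUANT_FULL, QUANT_IQ2_XXS, QUANT_Q2_K } quant_type;

static quant_type pick_quant(const char *name) {
    bool routed = strstr(name, "_exps") != NULL &&  /* routed experts   */
                  strstr(name, "shexp") == NULL;    /* not shared experts */
    if (!routed)
        return QUANT_FULL;      /* shared experts, projections, router */
    if (strstr(name, "ffn_up") || strstr(name, "ffn_gate"))
        return QUANT_IQ2_XXS;   /* up/gate expert projections */
    if (strstr(name, "ffn_down"))
        return QUANT_Q2_K;      /* down expert projections */
    return QUANT_FULL;
}
```

The appeal of this approach is that quality-critical tensors never see 2-bit precision, while the bulk of the parameter count, which lives in the routed experts, gets the aggressive compression.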
AI-Assisted Development Transparency
The README explicitly states: "This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you." This transparency from a prominent open-source figure represents a notable shift in how AI-assisted development is discussed publicly.
Technical Implementation and Performance
ds4 is currently Metal-only (CUDA may be added later), with no CPU path due to a macOS kernel bug affecting its virtual-memory implementation. The q2-quantized model requires 128GB of RAM, while q4 requires at least 256GB. Custom GGUFs are available from antirez's Hugging Face repository.
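A back-of-envelope calculation makes those RAM figures plausible. The average bits-per-weight values below are assumptions for illustration (the stated quant types put routed experts around 2.1-2.6 bpw, with full-precision tensors pulling the average up), and the estimate covers weights only, ignoring KV cache and runtime overhead.

```c
#include <stdio.h>

/* Rough weight-memory estimate for a 284B-parameter model at assumed
 * average bits-per-weight. These bpw values are illustrative guesses. */
int main(void) {
    const double n_params = 284e9;
    const double avg_bpw[] = { 2.5, 3.0, 4.5 };  /* q2-ish .. q4-ish */
    for (int i = 0; i < 3; i++) {
        double gib = n_params * avg_bpw[i] / 8.0 / (1024.0 * 1024.0 * 1024.0);
        printf("%.1f bpw -> ~%.0f GiB of weights\n", avg_bpw[i], gib);
    }
    return 0;
}
```

Under these assumptions, q2 weights land around 83-99 GiB, comfortably under 128GB, while a q4-style average approaches 150 GiB and pushes total memory past what a 128GB machine can hold.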
Benchmarks on the q2 GGUF with Metal, a 32,768-token context, greedy decoding, and 256-token generation:

| Machine | Prefill, short prompt (t/s) | Generation, short prompt (t/s) | Prefill, 11,709-token prompt (t/s) | Generation, 11,709-token prompt (t/s) |
| --- | --- | --- | --- | --- |
| MacBook Pro M3 Max (128GB) | 58.52 | 26.68 | 250.11 | 21.47 |
| Mac Studio M3 Ultra (512GB) | 84.43 | 36.86 | 468.03 | 27.39 |
The project builds on llama.cpp and GGML by Georgi Gerganov, adapting kernels, GGUF layouts, and CPU logic under MIT license.
Key Takeaways
- Redis creator antirez released ds4, a DeepSeek V4 Flash inference engine, on May 6, 2026, gaining 577 stars in two days and 279 points on Hacker News
- The project was openly developed with "strong assistance from GPT 5.5" with humans leading ideas, testing, and debugging—a notable transparency statement from a prominent open-source developer
- ds4 focuses exclusively on one model rather than generic inference, providing official-vector validation and long-context testing for production readiness
- MacBook Pro M3 Max (128GB RAM) achieves 26.68 tokens/second generation and 250.11 t/s prefill on 11,709 token prompts; Mac Studio M3 Ultra hits 468.03 t/s prefill
- Custom 2-bit quantization quantizes only routed MoE experts while leaving shared experts and routing at full precision, enabling the 284B parameter model to run on 128GB consumer hardware