Salvatore Sanfilippo (antirez), creator of Redis, released ds4.c on May 6, 2026—a native inference engine built exclusively for DeepSeek V4 Flash. The project gained 577 GitHub stars within two days and reached the Hacker News front page with 279 points and 84 comments. Notably, antirez openly declares the software was "developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging."
Deliberately Narrow Philosophy: One Model, Validated Performance
Unlike generic GGUF runners or multi-model frameworks, ds4 takes a deliberately focused approach. The README states: "The local inference landscape contains many excellent projects, but new models are released continuously, and the attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation, long-context tests, and enough agent integration to know if it really works."
The engine validates its logits against DeepSeek's official implementation and runs long-context tests, targeting production readiness rather than demo-grade benchmarks.
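To make the validation idea concrete, here is a minimal C sketch of what comparing an engine's logits against official reference vectors could look like. The function name, calling convention, and tolerance are illustrative assumptions, not ds4's actual test harness.

```c
#include <math.h>
#include <stdio.h>

/* Hypothetical check: compare the engine's logits for a prompt against
 * reference logits exported from the official DeepSeek implementation.
 * The tolerance and reporting style are illustrative only. */
static int logits_match(const float *engine, const float *reference,
                        size_t vocab_size, float tolerance) {
    float max_abs_diff = 0.0f;
    for (size_t i = 0; i < vocab_size; i++) {
        float d = fabsf(engine[i] - reference[i]);
        if (d > max_abs_diff) max_abs_diff = d;
    }
    printf("max |engine - reference| = %g\n", max_abs_diff);
    return max_abs_diff <= tolerance;  /* 1 = pass, 0 = fail */
}
```

A harness like this, run over a fixed set of prompts, catches kernel-level regressions that throughput benchmarks alone would miss.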
Why DeepSeek V4 Flash Deserves Dedicated Infrastructure
Antirez outlines eight technical reasons justifying the single-model focus. The 284B-parameter MoE architecture runs faster than dense models because only a fraction of its parameters is active per token. In thinking mode, with maximum thinking disabled, it produces thinking sections roughly one-fifth the length of other models', making thinking-enabled inference practical where with other models it is "practically impossible." The 1-million-token context window and a highly compressed KV cache enable long-context inference on consumer hardware, with on-disk KV cache persistence.
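On-disk KV cache persistence deserves a concrete illustration: the expensive prefill of a very long prompt can be serialized once and reloaded in later sessions. The C sketch below is a generic outline under assumed structures; the header fields, magic value, and function name are hypothetical and not taken from ds4.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative on-disk KV cache persistence: write the cache bytes with
 * a small header recording how many tokens were prefilled, so a later
 * run can skip re-prefilling them. The file format is hypothetical. */
typedef struct {
    uint32_t magic;     /* file identifier */
    uint32_t n_tokens;  /* tokens already prefilled into the cache */
    uint64_t n_bytes;   /* size of the serialized cache payload */
} kv_header;

static int kv_cache_save(const char *path, const void *cache,
                         uint64_t n_bytes, uint32_t n_tokens) {
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    kv_header h = { 0x4B564331u, n_tokens, n_bytes };
    int ok = fwrite(&h, sizeof h, 1, f) == 1 &&
             fwrite(cache, 1, n_bytes, f) == n_bytes;
    fclose(f);
    return ok ? 0 : -1;
}
```

A heavily compressed cache is what makes persisting sessions approaching the 1M-token window plausible on consumer disks.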
DeepSeek V4 Flash also tolerates 2-bit quantization well when it is applied selectively: only the routed MoE experts are quantized (up/gate projections at IQ2_XXS, down projections at Q2_K), while shared experts, the remaining projections, and routing stay at full precision to retain quality. Antirez also expects DeepSeek to release updated V4 Flash versions, making sustained investment in the model worthwhile.
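A selective scheme like this usually reduces to a per-tensor quantization policy keyed on tensor names. The C sketch below mirrors the split described above; the name patterns (loosely modeled on llama.cpp-style GGUF naming) and the enum values are assumptions, not ds4's actual code.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative per-tensor policy: 2-bit quantize only routed-expert
 * weights; everything else stays at full precision. The tensor-name
 * patterns here are hypothetical. */
typedef enum { QUANT_FULL, QUANT_IQ2_XXS, QUANT_Q2_K } quant_type;

static quant_type pick_quant(const char *name) {
    bool routed = strstr(name, "_exps") != NULL &&  /* routed experts   */
                  strstr(name, "shexp") == NULL;    /* not shared experts */
    if (!routed)
        return QUANT_FULL;      /* shared experts, projections, router */
    if (strstr(name, "ffn_up") || strstr(name, "ffn_gate"))
        return QUANT_IQ2_XXS;   /* up/gate expert projections */
    if (strstr(name, "ffn_down"))
        return QUANT_Q2_K;      /* down expert projections */
    return QUANT_FULL;
}
```

The appeal of this approach is that quality-critical tensors never see 2-bit precision, while the bulk of the parameter count, which lives in the routed experts, gets the aggressive compression.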
AI-Assisted Development Transparency
The README explicitly states: "This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you." This transparency from a prominent open-source figure represents a notable shift in how AI-assisted development is discussed publicly.
Technical Implementation and Performance
ds4 is currently Metal-only (CUDA may be added later), with no CPU path due to a macOS kernel bug affecting its virtual-memory implementation. The q2-quantized model requires 128GB of RAM, while q4 requires at least 256GB. Custom GGUFs are available from antirez's Hugging Face repository.
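A back-of-envelope calculation makes those RAM figures plausible. The average bits-per-weight values below are assumptions for illustration (the stated quant types put routed experts around 2.1-2.6 bpw, with full-precision tensors pulling the average up), and the estimate covers weights only, ignoring KV cache and runtime overhead.

```c
#include <stdio.h>

/* Rough weight-memory estimate for a 284B-parameter model at assumed
 * average bits-per-weight. These bpw values are illustrative guesses. */
int main(void) {
    const double n_params = 284e9;
    const double avg_bpw[] = { 2.5, 3.0, 4.5 };  /* q2-ish .. q4-ish */
    for (int i = 0; i < 3; i++) {
        double gib = n_params * avg_bpw[i] / 8.0 / (1024.0 * 1024.0 * 1024.0);
        printf("%.1f bpw -> ~%.0f GiB of weights\n", avg_bpw[i], gib);
    }
    return 0;
}
```

Under these assumptions, q2 weights land around 83-99 GiB, comfortably under 128GB, while a q4-style average approaches 150 GiB and pushes total memory past what a 128GB machine can hold.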
Benchmarks on the q2 GGUF with Metal, a 32,768-token context, greedy decoding, and 256-token generation:

| Machine | Prefill, short prompt (t/s) | Generation, short prompt (t/s) | Prefill, 11,709-token prompt (t/s) | Generation, 11,709-token prompt (t/s) |
| --- | --- | --- | --- | --- |
| MacBook Pro M3 Max (128GB) | 58.52 | 26.68 | 250.11 | 21.47 |
| Mac Studio M3 Ultra (512GB) | 84.43 | 36.86 | 468.03 | 27.39 |
The project builds on llama.cpp and GGML by Georgi Gerganov, adapting kernels, GGUF layouts, and CPU logic under MIT license.
Key Takeaways
- Redis creator antirez released ds4, a DeepSeek V4 Flash inference engine, on May 6, 2026, gaining 577 stars in two days and 279 points on Hacker News
- The project was openly developed with "strong assistance from GPT 5.5" with humans leading ideas, testing, and debugging—a notable transparency statement from a prominent open-source developer
- ds4 focuses exclusively on one model rather than generic inference, providing official-vector validation and long-context testing for production readiness
- MacBook Pro M3 Max (128GB RAM) achieves 26.68 tokens/second generation and 250.11 t/s prefill on 11,709 token prompts; Mac Studio M3 Ultra hits 468.03 t/s prefill
- Custom 2-bit quantization quantizes only routed MoE experts while leaving shared experts and routing at full precision, enabling the 284B parameter model to run on 128GB consumer hardware