Henry Ndubuaku from Cactus announced Needle on Hacker News on May 12, 2026, introducing a 26-million-parameter function-calling model that achieves 6000 tokens per second prefill and 1200 tokens per second decode on consumer devices. The model challenges the assumption that tool calling requires large models by using a Simple Attention Networks architecture that eliminates Multi-Layer Perceptrons (MLPs) entirely.
Simple Attention Networks Architecture Eliminates MLPs
Needle's core architectural innovation is building the entire model using only attention and gating mechanisms—no MLPs anywhere. The team explains: "Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale."
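To make the idea concrete, here is a toy sketch of a block built only from attention and gating, with no FFN sublayer. This is an illustration of the general pattern, not Needle's actual code: the weight names, single-head attention, and sigmoid gate are all assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over a single sequence.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_attention_block(x, Wq, Wk, Wv, Wg):
    # One attention-plus-gating block: the attention output is modulated by a
    # sigmoid gate and added back via a residual connection. Crucially, there
    # is no MLP/FFN sublayer after attention -- that is the entire block.
    attn = attention(x @ Wq, x @ Wk, x @ Wv)
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))  # elementwise sigmoid gate
    return x + gate * attn

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((5, d))            # 5 tokens, d-dim embeddings
weights = [0.1 * rng.standard_normal((d, d)) for _ in range(4)]
y = gated_attention_block(x, *weights)
print(y.shape)  # (5, 8): same shape as the input, ready to stack
```

Stacking such blocks yields a model whose parameters live entirely in attention and gating projections, which is where the "FFN parameters are wasted at this scale" argument points.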
Despite having only 26 million parameters, Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling benchmarks. However, those models retain advantages in conversational settings due to greater scope and capacity.
Training on 200B Tokens Across 16 TPU v6e
The model was pretrained on 200 billion tokens across 16 TPU v6e chips over 27 hours, then post-trained on 2 billion tokens of synthesized function-calling data for 45 minutes. The dataset was synthesized via Gemini and covers 15 tool categories including timers, messaging, navigation, and smart home controls.
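The stated numbers imply a pretraining throughput that is easy to sanity-check with back-of-the-envelope arithmetic (these derived figures are our calculation, not a claim from the announcement):

```python
tokens = 200e9          # 200 billion pretraining tokens
hours = 27              # reported wall-clock time
chips = 16              # TPU v6e chips

total_tps = tokens / (hours * 3600)   # aggregate tokens per second
per_chip_tps = total_tps / chips      # per-chip tokens per second

print(f"{total_tps:,.0f} tokens/s total, {per_chip_tps:,.0f} tokens/s per chip")
# roughly 2.06 million tokens/s aggregate, ~129k tokens/s per chip
```

At 26M parameters this throughput is plausible; it also underlines how cheap the post-training stage (2B tokens in 45 minutes) is by comparison.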
Broader Implications for RAG and Tool Use
The research team found that the "no FFN" architecture generalizes beyond function calling to any task where the model has access to external structured knowledge. The model doesn't need to memorize facts in FFN weights if the facts are provided in the input, making the approach applicable to RAG and tool use scenarios.
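The "retrieval-and-assembly" framing can be illustrated with a toy example: a query vector attends over embeddings of the available tool names, and the attention weights effectively select the right tool from the input rather than from memorized weights. The tool names and embeddings below are invented for the illustration; nothing here reflects Needle's real vocabulary or representations.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64
tool_names = ["set_timer", "send_message", "navigate", "toggle_lights"]
tool_embs = rng.standard_normal((len(tool_names), d))  # stand-in tool-name embeddings

# A query "about" navigation: near that tool's embedding, plus small noise.
query = tool_embs[2] + 0.1 * rng.standard_normal(d)

# Cross-attention from the query to the tool list: scaled dot-product scores,
# softmax to weights, argmax to pick the tool. The knowledge (which tools
# exist) lives in the input, not in FFN weights.
scores = tool_embs @ query / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
picked = tool_names[int(weights.argmax())]
print(picked)  # navigate
```

The same mechanism applies to RAG: when retrieved passages are placed in the context, cross-attention can match and copy from them directly, reducing the need for parametric fact storage.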
MIT Licensed and Available on HuggingFace
Needle is part of the broader Cactus project, an inference engine built from scratch for mobile devices, wearables, and custom hardware. Everything is MIT licensed, with model weights available on HuggingFace at Cactus-Compute/needle. Target use cases include budget phones, smartwatches, and AR glasses requiring on-device function calling without cloud dependency.
Key Takeaways
- Needle is a 26-million-parameter function-calling model achieving 6000 tok/s prefill and 1200 tok/s decode on consumer devices
- The model uses Simple Attention Networks architecture with zero MLPs, relying entirely on attention and gating mechanisms
- Pretrained on 200 billion tokens across 16 TPU v6e chips over 27 hours, then post-trained on 2 billion function-calling tokens
- Needle beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function calling benchmarks
- MIT licensed with weights available on HuggingFace, targeting budget phones, smartwatches, and AR glasses for on-device AI