Cactus Compute has released Needle, a 26-million-parameter language model designed specifically for function calling on resource-constrained devices including phones, watches, and smart glasses. The open-source model outperforms significantly larger models on single-shot function-calling benchmarks while reaching 6,000 tokens per second prefill on mobile hardware.
Architecture Eliminates Feed-Forward Networks for Efficiency
Needle uses a novel Simple Attention Networks architecture that challenges conventional transformer design. The 12-layer encoder combines self-attention with grouped-query attention and rotary position embeddings, while omitting feed-forward networks entirely. The 8-layer decoder pairs masked self-attention with cross-attention through gated residual connections. The model uses a 512-dimensional hidden state, 8 attention heads, 4 key-value heads, and an 8,192-token BPE vocabulary with tied embeddings to minimize parameter count.
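Those hyperparameters can be sanity-checked against the 26M figure with a short back-of-the-envelope sketch. The configuration class and parameter estimate below are illustrative only (field names are hypothetical and norms, gates, and biases are ignored), but the attention-plus-embeddings budget lands right at the advertised size:

```python
from dataclasses import dataclass

@dataclass
class NeedleConfig:
    """Hypothetical config mirroring the numbers reported for Needle."""
    encoder_layers: int = 12   # self-attention only, no feed-forward blocks
    decoder_layers: int = 8    # masked self-attention + cross-attention
    hidden_dim: int = 512
    num_heads: int = 8
    num_kv_heads: int = 4      # grouped-query attention
    vocab_size: int = 8192     # BPE vocabulary, tied input/output embeddings

def approx_params(c: NeedleConfig) -> int:
    """Back-of-the-envelope count ignoring norms, gates, and biases."""
    head_dim = c.hidden_dim // c.num_heads
    kv_dim = c.num_kv_heads * head_dim
    # One attention block: Q and output projections are d x d, K and V are d x kv_dim.
    attn = 2 * c.hidden_dim * c.hidden_dim + 2 * c.hidden_dim * kv_dim
    embeddings = c.vocab_size * c.hidden_dim   # shared with the output head (tied)
    encoder = c.encoder_layers * attn          # self-attention only
    decoder = c.decoder_layers * 2 * attn      # masked self-attention + cross-attention
    return embeddings + encoder + decoder

print(f"{approx_params(NeedleConfig()) / 1e6:.1f}M")  # ~26.2M, matching the reported size
```

With no feed-forward blocks, the attention projections and the tied embedding table account for essentially the entire parameter budget.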
The team's core insight is that function calling is fundamentally a retrieval-and-assembly task: matching a query to a tool name and extracting argument values demands pattern matching rather than deep reasoning. Cross-attention serves as the ideal primitive for this, making FFN parameters wasteful at this scale. According to the creators, this "no FFN" finding extends beyond function calling to any task involving external structured knowledge like RAG or tool use.
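That design choice is easiest to see in layer form: a Needle-style decoder layer is just two attention blocks wired together with gated residual connections and no MLP afterwards. The PyTorch sketch below is a reconstruction from the written description, not the released implementation; it uses standard multi-head attention for brevity (the actual model uses grouped-query attention and rotary embeddings), and the sigmoid gating is an assumption.

```python
import torch
import torch.nn as nn

class NoFFNDecoderLayer(nn.Module):
    """Illustrative decoder layer: masked self-attention plus cross-attention,
    gated residual connections, and deliberately no feed-forward network."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Learned per-channel gates on each residual branch (one possible scheme).
        self.gate1 = nn.Parameter(torch.zeros(dim))
        self.gate2 = nn.Parameter(torch.zeros(dim))

    def forward(self, x, memory, causal_mask=None):
        # Masked self-attention over the partially generated function call.
        q = self.norm1(x)
        h, _ = self.self_attn(q, q, q, attn_mask=causal_mask)
        x = x + torch.sigmoid(self.gate1) * h
        # Cross-attention pulls tool names and argument values out of the
        # encoded prompt: the retrieval-and-assembly step described above.
        h, _ = self.cross_attn(self.norm2(x), memory, memory)
        x = x + torch.sigmoid(self.gate2) * h
        return x  # no feed-forward block follows

layer = NoFFNDecoderLayer()
out = layer(torch.randn(1, 16, 512), torch.randn(1, 64, 512))  # -> shape (1, 16, 512)
```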
Training on 200 Billion Tokens in Under 28 Hours
The model was pretrained on 200 billion tokens across 16 TPU v6e chips in 27 hours, then post-trained on 2 billion synthetic function-calling tokens in 45 minutes. The team synthesized the function-calling dataset with Gemini, covering 15 tool categories including timers, messaging, navigation, and smart home controls.
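Taken at face value, those numbers imply an aggregate pretraining throughput in the low millions of tokens per second across the pod; the quick check below uses only the figures reported above:

```python
# Rough throughput implied by the reported pretraining figures (not measured values).
pretrain_tokens = 200e9
chips = 16
hours = 27

aggregate_tps = pretrain_tokens / (hours * 3600)  # ~2.06M tokens/s across all chips
per_chip_tps = aggregate_tps / chips              # ~129K tokens/s per TPU v6e chip
print(f"{aggregate_tps / 1e6:.2f}M tokens/s total, {per_chip_tps / 1e3:.0f}K tokens/s per chip")
```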
Needle delivers inference speeds of 6,000 tokens per second for prefill and 1,200 tokens per second for decode when running through Cactus's on-device inference framework. In single-shot function-calling benchmarks, it outperforms FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M, models with roughly 10 to 23 times as many parameters. While these larger models offer broader capabilities in conversational contexts, Needle's specialized architecture proves more efficient for its specific use case.
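At those rates, a single tool call completes well within interactive latency budgets. The worked example below is illustrative only: the prompt and output lengths are assumptions chosen for a typical tool-calling request, not benchmark figures.

```python
# Hypothetical end-to-end latency for one function call at the reported speeds.
prefill_tps, decode_tps = 6_000, 1_200  # reported prefill / decode rates
prompt_tokens = 500   # assumed: system prompt + tool schemas + user query
output_tokens = 30    # assumed: one structured function call

latency_s = prompt_tokens / prefill_tps + output_tokens / decode_tps
print(f"~{latency_s * 1000:.0f} ms per call")  # ~108 ms
```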
Open Source Release Targets Edge AI Applications
The model is available under MIT license with weights on HuggingFace at Cactus-Compute/needle and code on GitHub at cactus-compute/needle. The team includes Henry Ndubuaku, Jakub Mroz, and Karen Mosoyan.
As Ndubuaku explained in the Hacker News announcement: "We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it."
The release addresses a gap in edge AI capabilities, enabling sophisticated function-calling features on devices where power and memory constraints previously made such features impractical.
Key Takeaways
- Needle is a 26M-parameter model that runs function calling at 6,000 tokens/second prefill on mobile devices
- The architecture eliminates feed-forward networks entirely, relying on attention mechanisms and tied embeddings to keep the model roughly 10x to 23x smaller than the models it is benchmarked against
- Trained on 200 billion pretraining tokens and 2 billion synthetic function-calling tokens across 16 TPU v6e chips
- Outperforms larger models including FunctionGemma-270M, Qwen-0.6B, Granite-350M, and LFM2.5-350M on single-shot function-calling tasks
- Released open source under MIT license with weights on HuggingFace and code on GitHub