Needle: A 26M Parameter Model That Handles Tool Calling

Key Takeaways

- Needle is a 26 million parameter model distilled from Gemini 3.1, designed for single-shot function calling
- The model runs at 6,000 tokens per second prefill and 1,200 tokens per second decode on Cactus infrastructure
- Weights and training data are fully open-source, with local finetuning supported on Mac and PC
What Needle Does
Cactus Compute has released Needle, a 26 million parameter model built specifically for function calling on resource-constrained devices. The company distilled Google's Gemini 3.1 into what they call a "Simple Attention Network" that can run on phones, watches, and smart glasses.
The model is designed for a narrow task: converting natural language queries into structured tool calls. Ask it "What's the weather in San Francisco?" with a weather tool definition, and it returns the correct JSON function call. That's it. No conversation, no reasoning, no general knowledge.
Architecture and Training
Needle uses an encoder-decoder architecture with 12 encoder layers and 8 decoder layers. The model dimension is 512, with 8 attention heads sharing 4 key-value heads (grouped-query attention) and a BPE vocabulary of 8,192 tokens.
The encoder processes the text query. The decoder handles tool definitions and generates the function call output. Cross-attention connects the two. The design skips feed-forward networks entirely, relying only on attention and gated residual connections.
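The grouped-query attention described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not Needle's actual code: the weight initialization, the sigmoid gate, and the layer internals are assumptions; only the dimensions (d_model=512, 8 query heads, 4 KV heads) come from the article.

```python
import numpy as np

# Hypothetical sketch of one attention layer as described: 8 query heads
# share 4 key-value heads (2 query heads per KV head), no feed-forward
# block, and a gated residual connection. Weights are random stand-ins.
d_model, n_heads, n_kv_heads, seq = 512, 8, 4, 16
head_dim = d_model // n_heads  # 64

rng = np.random.default_rng(0)
x = rng.standard_normal((seq, d_model))

Wq = rng.standard_normal((d_model, n_heads * head_dim)) * 0.02
Wk = rng.standard_normal((d_model, n_kv_heads * head_dim)) * 0.02
Wv = rng.standard_normal((d_model, n_kv_heads * head_dim)) * 0.02

q = (x @ Wq).reshape(seq, n_heads, head_dim)
k = (x @ Wk).reshape(seq, n_kv_heads, head_dim)
v = (x @ Wv).reshape(seq, n_kv_heads, head_dim)

# Each group of n_heads // n_kv_heads query heads reuses one KV head,
# shrinking the KV projections (and any KV cache) by that factor.
group = n_heads // n_kv_heads  # 2
k = np.repeat(k, group, axis=1)  # (seq, 8, 64)
v = np.repeat(v, group, axis=1)

scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
attn = np.einsum("hqk,khd->qhd", weights, v).reshape(seq, d_model)

# Gated residual in place of a feed-forward block (sigmoid gate assumed)
gate = 1 / (1 + np.exp(-attn))
out = x + gate * attn
print(out.shape)  # (16, 512)
```

Halving the KV heads is what makes grouped-query attention attractive on small devices: the key/value projections and cache shrink without reducing the number of query heads.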
Pretraining took 27 hours on 16 TPU v6e chips, covering 200 billion tokens. Post-training on the function calling dataset used 2 billion tokens and finished in 45 minutes.
How It Compares
Cactus claims Needle beats several larger models on single-shot function calling: FunctionGemma at 270M parameters, Qwen at 600M, Granite at 350M, and LFM2.5 at 350M. All of these are 10x to 23x larger than Needle.
The company is upfront about limitations. Those larger models "have more scope/capacity and excel in conversational settings." Needle does one thing. If you need multi-turn conversation, general Q&A, or anything beyond structured tool calls, look elsewhere.
Small models can also be inconsistent. The team recommends testing with your specific tools and finetuning as needed.
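One lightweight way to do that testing is an acceptance check that parses the model's output and verifies it names the right tool with the right arguments. The harness below is a sketch: the raw string stands in for the output of Needle's generate call, and in practice you would loop over your real tool definitions and track the pass rate.

```python
import json

def check_tool_call(raw_output: str, expected_name: str, expected_args: dict) -> bool:
    """Parse a model's JSON output and verify it calls the expected tool
    with the expected arguments. Returns False on malformed JSON."""
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return any(
        c.get("name") == expected_name and c.get("arguments") == expected_args
        for c in calls
    )

# Stand-in for a model response; replace with the real generate(...) output.
raw = '[{"name":"get_weather","arguments":{"location":"San Francisco"}}]'
print(check_tool_call(raw, "get_weather", {"location": "San Francisco"}))  # True
```

Running each query several times and counting failures gives a quick consistency estimate before deciding whether finetuning is needed.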
Getting Started
The quickstart is straightforward. Clone the repository, run setup, and launch the playground UI. The web interface at localhost:7860 lets you test custom tools and finetune with your own data.
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
For Python integration, the API is minimal. Load the checkpoint, initialize the model and tokenizer, then call generate with your query and tool definitions.
from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer
params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()
result = generate(
model, params, tokenizer,
query="What's the weather in San Francisco?",
tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
stream=False
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
Finetuning on Your Own Data
The playground UI handles the full finetuning workflow: data generation via Gemini, training, evaluation, and bundling the result. For command-line users, pass a JSONL file to the finetune command.
Weights download automatically. The CLI supports single inference, full training runs, pretraining on synthetic data, and checkpoint evaluation.
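The article doesn't spell out the JSONL schema, but a plausible shape pairs each query with its tool definitions and the expected call. The field names below are assumptions; check the repository's data format before training on a file like this.

```python
import json

# Hypothetical finetuning records: one JSON object per line, each pairing a
# natural-language query with the tool call the model should emit.
# Field names ("query", "tools", "target") are guesses, not the documented schema.
examples = [
    {
        "query": "What's the weather in San Francisco?",
        "tools": [{"name": "get_weather", "parameters": {"location": "string"}}],
        "target": [{"name": "get_weather", "arguments": {"location": "San Francisco"}}],
    },
    {
        "query": "Set a timer for 10 minutes",
        "tools": [{"name": "set_timer", "parameters": {"minutes": "number"}}],
        "target": [{"name": "set_timer", "arguments": {"minutes": 10}}],
    },
]

with open("finetune.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

print(sum(1 for _ in open("finetune.jsonl")))  # 2
```

A file in this shape would then be passed to the finetune command as described above.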
Why This Matters for Edge AI
Most AI assistants route function calls through cloud APIs. Every request hits a server. That adds latency, requires connectivity, and raises privacy concerns for sensitive queries.
A 26M parameter model can run entirely on-device. Consumer phones have more than enough compute. Even wearables could handle inference at this scale.
The trade-off is capability. Needle won't hold a conversation or answer general questions. It's a specialist. For personal AI assistants that need to control smart home devices, query calendars, or trigger app actions, that specialization might be enough.
Frequently Asked Questions
How big is the Needle model?
Needle has 26 million parameters, making it 10x to 23x smaller than comparable function-calling models like FunctionGemma (270M) or Qwen (600M).
Can I run Needle on my local machine?
Yes. The model can be finetuned locally on Mac or PC. Weights download automatically when you run the setup.
What is Needle designed to do?
Needle handles single-shot function calling. It converts natural language queries into structured JSON tool calls. It does not support multi-turn conversation or general Q&A.
Is Needle open source?
Yes. The weights are available on Hugging Face under Cactus-Compute/needle, and the dataset generation code is also open.
How fast does Needle run?
On Cactus infrastructure, Needle achieves 6,000 tokens per second for prefill and 1,200 tokens per second for decode.
Source: Hacker News: Best
Huma Shazia
Senior AI & Tech Writer