
Needle: A 26M Parameter Model That Handles Tool Calling

Huma Shazia · 13 May 2026 at 5:38 am · 4 min read

Key Takeaways

Source: Hacker News: Best
  • Needle is a 26 million parameter model distilled from Gemini 3.1, designed for single-shot function calling
  • The model runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on Cactus infrastructure
  • Weights and training data are fully open-source, with local finetuning supported on Mac and PC

What Needle Does

Cactus Compute has released Needle, a 26 million parameter model built specifically for function calling on resource-constrained devices. The company distilled Google's Gemini 3.1 into what they call a "Simple Attention Network" that can run on phones, watches, and smart glasses.

The model is designed for a narrow task: converting natural language queries into structured tool calls. Ask it "What's the weather in San Francisco?" with a weather tool definition, and it returns the correct JSON function call. That's it. No conversation, no reasoning, no general knowledge.

6,000 tokens/sec
Needle's prefill speed on Cactus infrastructure, with 1,200 tokens/sec decode speed

Architecture and Training

Needle uses an encoder-decoder architecture with 12 encoder layers and 8 decoder layers. The model has a dimension of 512, uses 8 attention heads with 4 key-value heads (grouped query attention), and a BPE vocabulary of 8,192 tokens.
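For concreteness, those reported hyperparameters can be collected into a small config object. This is an illustrative sketch only; the field names are assumptions, not Needle's actual API:

```python
from dataclasses import dataclass

# Illustrative config holding the hyperparameters reported above.
# Field names are hypothetical, not taken from the Needle codebase.
@dataclass
class NeedleConfig:
    encoder_layers: int = 12
    decoder_layers: int = 8
    d_model: int = 512
    n_heads: int = 8        # attention (query) heads
    n_kv_heads: int = 4     # grouped-query attention
    vocab_size: int = 8192  # BPE vocabulary

cfg = NeedleConfig()
head_dim = cfg.d_model // cfg.n_heads      # 64 dimensions per head
gqa_group = cfg.n_heads // cfg.n_kv_heads  # 2 query heads share each KV head
print(head_dim, gqa_group)                 # 64 2
```

One implication of the grouped-query setup: the KV cache is half the size it would be with full multi-head attention, which matters on memory-constrained devices.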

The encoder processes the text query. The decoder handles tool definitions and generates the function call output. Cross-attention connects the two. The design skips feed-forward networks entirely, relying only on attention and gated residual connections.
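An attention-only layer of this kind can be sketched in a few lines. The sketch below is a single-head simplification with a sigmoid gate on the residual path; the actual gating mechanism and head layout in Needle are not specified in the release, so treat this as a rough illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_residual_block(x, Wq, Wk, Wv, Wg):
    # One attention-only layer: no feed-forward network.
    # A sigmoid gate (hypothetical form) controls how much of the
    # attention output is mixed back into the residual stream.
    attn_out = attention(x @ Wq, x @ Wk, x @ Wv)
    gate = 1.0 / (1.0 + np.exp(-(x @ Wg)))
    return x + gate * attn_out

rng = np.random.default_rng(0)
d = 512
x = rng.normal(size=(4, d)) * 0.02                       # 4 token positions
W = [rng.normal(size=(d, d)) * 0.02 for _ in range(4)]   # Wq, Wk, Wv, Wg
y = gated_residual_block(x, *W)
print(y.shape)  # (4, 512)
```

Dropping the feed-forward networks removes the bulk of a transformer layer's parameters, which is a plausible route to fitting 20 layers into a 26M-parameter budget.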

Pretraining took 27 hours on 16 TPU v6e chips, covering 200 billion tokens. Post-training on the function calling dataset used 2 billion tokens and finished in 45 minutes.
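A quick back-of-envelope check of the pretraining throughput implied by those figures:

```python
tokens = 200e9  # pretraining tokens
hours = 27      # wall-clock time
chips = 16      # TPU v6e chips

per_chip = tokens / (hours * 3600) / chips
print(f"{per_chip:,.0f} tokens/sec per chip")  # ~128,600 tokens/sec per chip
```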

How It Compares

Cactus claims Needle beats several larger models on single-shot function calling: FunctionGemma at 270M parameters, Qwen at 600M, Granite at 350M, and LFM2.5 at 350M. All of these are 10x to 23x larger than Needle.

The company is upfront about limitations. Those larger models "have more scope/capacity and excel in conversational settings." Needle does one thing. If you need multi-turn conversation, general Q&A, or anything beyond structured tool calls, look elsewhere.

Small models can also be inconsistent. The team recommends testing with your specific tools and finetuning as needed.

Getting Started

The quickstart is straightforward. Clone the repository, run setup, and launch the playground UI. The web interface at localhost:7860 lets you test custom tools and finetune with your own data.

```bash
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
```

For Python integration, the API is minimal. Load the checkpoint, initialize the model and tokenizer, then call generate with your query and tool definitions.

```python
from needle import SimpleAttentionNetwork, load_checkpoint, generate, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
```

Finetuning on Your Own Data

The playground UI handles the full finetuning workflow: data generation via Gemini, training, evaluation, and bundling the result. For command-line users, pass a JSONL file to the finetune command.
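The exact JSONL schema isn't documented in the announcement, but a plausible record shape, mirroring the query/tools pair from the Python API, might look like the following. Field names here are assumptions; check the repo's finetuning docs for the real format:

```python
import json

# Hypothetical training record; verify the exact schema against the repo.
record = {
    "query": "Turn off the living room lights",
    "tools": [{"name": "set_light",
               "parameters": {"room": "string", "state": "string"}}],
    "output": [{"name": "set_light",
                "arguments": {"room": "living room", "state": "off"}}],
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per line
```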

Weights download automatically. The CLI supports single inference, full training runs, pretraining on synthetic data, and checkpoint evaluation.

Why This Matters for Edge AI

Most AI assistants route function calls through cloud APIs. Every request hits a server. That adds latency, requires connectivity, and raises privacy concerns for sensitive queries.

A 26M parameter model can run entirely on-device. Consumer phones have more than enough compute. Even wearables could handle inference at this scale.
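The memory math backs this up. A rough estimate of the weight footprint at common precisions, ignoring activations and runtime overhead:

```python
params = 26e6  # 26 million parameters

# Approximate storage for the weights alone at different precisions.
for fmt, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    mb = params * bytes_per_param / 1e6
    print(f"{fmt}: ~{mb:.0f} MB")
# fp32: ~104 MB, fp16: ~52 MB, int8: ~26 MB
```

Even at full fp32 precision the weights fit comfortably in the RAM of any modern phone, and quantized variants would fit on wearables.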

The trade-off is capability. Needle won't hold a conversation or answer general questions. It's a specialist. For personal AI assistants that need to control smart home devices, query calendars, or trigger app actions, that specialization might be enough.

Also Read
Windows 11's 'Low Latency Profile' Promises Faster Apps

More on optimizing software performance on consumer devices


Frequently Asked Questions

How big is the Needle model?

Needle has 26 million parameters, making it roughly 10x to 23x smaller than comparable function-calling models like FunctionGemma (270M) or Qwen (600M).

Can I run Needle on my local machine?

Yes. The model can be finetuned locally on Mac or PC. Weights download automatically when you run the setup.

What is Needle designed to do?

Needle handles single-shot function calling. It converts natural language queries into structured JSON tool calls. It does not support multi-turn conversation or general Q&A.

Is Needle open source?

Yes. The weights are available on Hugging Face under Cactus-Compute/needle, and the dataset generation code is also open.

How fast does Needle run?

On Cactus infrastructure, Needle achieves 6,000 tokens per second for prefill and 1,200 tokens per second for decode.


Huma Shazia

Senior AI & Tech Writer

Related Articles

Tesla's Remote Parking Feature: The Investigation That Didn't Quite Park Itself
Trending Tech·8 min


US auto safety regulators have closed their investigation into Tesla's remote parking feature, but what does that mean for the future of autonomous driving? We dive into the details of the investigation and what it reveals about the technology. The National Highway Traffic Safety Administration found that crashes were rare and minor, but the investigation's closure doesn't necessarily mean the feature is completely safe.