
Ornith-1.0 claims top open-source spot for agentic coding

Manaal Khan | July 1, 2026 at 9:17 AM | 5 min read

Key Takeaways

Source: Hacker News: Best
  • Ornith-1.0 offers four model sizes (9B to 397B parameters) under MIT license with no regional restrictions
  • The self-improving training framework jointly optimizes solution generation and problem-solving scaffolds via reinforcement learning
  • Benchmark claims show the 9B model outperforming the same-size Qwen3.5-9B on Terminal-Bench 2.1 and the larger Gemma4-31B on SWE-Bench Verified

Deepreinforce-ai has released Ornith-1.0, a family of open-source models built for agentic coding tasks. The project claims state-of-the-art performance among comparable open-source models on benchmarks including Terminal-Bench 2.1, SWE-Bench, NL2Repo, and OpenClaw. Four sizes ship under the MIT license: 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE, all post-trained on Google's Gemma 4 and Alibaba's Qwen 3.5 base models.


What makes Ornith-1.0 different from other coding models?

The headline feature is what deepreinforce-ai calls a "self-improving training framework." Most code generation models learn to produce outputs. Ornith learns to produce both solutions and the scaffolds that guide those solutions. The team uses reinforcement learning to jointly optimize both components, which they claim helps the model discover better search trajectories when solving complex coding problems.

Think of it this way: a standard model learns what answer to give. Ornith learns how to approach the search for an answer. That distinction matters for agentic coding, where the model must navigate real codebases, understand project structure, and execute multi-step tasks autonomously.

How do the benchmarks stack up?

The numbers look strong, though with caveats. The 9B model scores 69.4 on SWE-Bench Verified, compared to 53.2 for Qwen3.5-9B and 52 for Gemma4-31B, so it beats both a same-size baseline and a model more than three times its size. On Terminal-Bench 2.1 using the Terminus-2 framework, Ornith-1.0-9B hits 43.1 versus 21.3 for Qwen3.5-9B, roughly double the score at the same parameter count.

The 35B MoE variant posts 75.6 on SWE-Bench Verified, ahead of Qwen3.5-35B at 70 and well ahead of Gemma4-31B at 52. The flagship 397B model reaches 82.4 on the same benchmark, trailing only Claude Opus 4.8 at 87.6 among the published comparisons.
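To make the size of those gaps concrete, here is a minimal sketch that recomputes them from the SWE-Bench Verified scores quoted in this article. The dictionary keys and the `point_gap` helper are illustrative names, not part of any Ornith tooling, and the scores themselves are the project's own unverified claims.

```python
# SWE-Bench Verified scores exactly as quoted in this article.
# These are the project's claims and have not been independently reproduced.
SWE_BENCH_VERIFIED = {
    "Ornith-1.0-9B": 69.4,
    "Qwen3.5-9B": 53.2,
    "Gemma4-31B": 52.0,
    "Ornith-1.0-35B": 75.6,
    "Qwen3.5-35B": 70.0,
    "Ornith-1.0-397B": 82.4,
    "Claude Opus 4.8": 87.6,
}


def point_gap(model: str, baseline: str) -> float:
    """Return model score minus baseline score, rounded to one decimal."""
    return round(SWE_BENCH_VERIFIED[model] - SWE_BENCH_VERIFIED[baseline], 1)


if __name__ == "__main__":
    comparisons = [
        ("Ornith-1.0-9B", "Qwen3.5-9B"),
        ("Ornith-1.0-9B", "Gemma4-31B"),
        ("Ornith-1.0-35B", "Qwen3.5-35B"),
        ("Ornith-1.0-397B", "Claude Opus 4.8"),
    ]
    for model, baseline in comparisons:
        print(f"{model} vs {baseline}: {point_gap(model, baseline):+.1f} points")
```

A positive number means the Ornith model is ahead. Only the last comparison comes out negative, which matches the article's statement that the flagship trails Claude Opus 4.8.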

All benchmarks used temperature 1.0 with specific harness configurations. Terminal-Bench runs used 4-hour timeouts with 32 CPU cores and 48GB RAM, averaged over five runs. SWE-Bench tests ran through the OpenHands harness with a 256K context window.

Hardware requirements and deployment options

The 9B dense model fits on a single 80GB GPU for inference and fine-tuning. The MoE checkpoints require multi-GPU setups with tensor parallelism. All variants support a 256K token context window.

Multiple precision formats ship for each size. The 9B comes in bf16 and GGUF quantized formats. The 35B MoE offers bf16, FP8, and GGUF. GGUF builds target local inference via llama.cpp or Ollama. Because FP8 stores each weight in one byte instead of bf16's two, it roughly halves the VRAM needed for weights on compatible GPUs.
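The "roughly in half" figure follows from bytes per parameter. The sketch below is a back-of-the-envelope estimate for weights only. It assumes exactly 2 bytes per parameter for bf16 and 1 for FP8, and it ignores KV cache, activations, and framework overhead, all of which add to real-world usage. The function name is hypothetical, and GGUF is left out because its size depends on the quantization scheme chosen.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    For MoE models, pass the TOTAL parameter count: every expert must be
    resident in memory even though only some are active per token.
    """
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for size in (9, 35):
        for precision in ("bf16", "fp8"):
            gb = weight_memory_gb(size, precision)
            print(f"{size}B @ {precision}: ~{gb:.0f} GB for weights")
```

By this estimate the 9B model needs about 18 GB for bf16 weights, which is consistent with it fitting on a single 80GB GPU with headroom left for long-context KV cache.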

Runtime requirements are recent: Transformers 5.8.1 or newer, vLLM 0.19.1+, or SGLang 0.5.9+. Recommended sampling uses temperature 0.6, top_p 0.95, and top_k 20, though reproducing benchmarks requires temperature 1.0.
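Since the recommended settings and the benchmark settings differ, it can help to keep them as two named profiles. This is a minimal sketch with hypothetical names (`RECOMMENDED_SAMPLING`, `sampling_settings`). The benchmark profile sets only the temperature, because the article does not state the `top_p` or `top_k` values used in the benchmark runs.

```python
# Recommended everyday sampling values, as listed in this article.
RECOMMENDED_SAMPLING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}


def sampling_settings(reproduce_benchmarks: bool = False) -> dict:
    """Return sampling keyword arguments for the chosen use case.

    The benchmark profile contains only temperature=1.0; top_p and top_k
    for benchmark runs are not given in the article, so they are omitted
    rather than guessed.
    """
    if reproduce_benchmarks:
        return {"temperature": 1.0}
    return dict(RECOMMENDED_SAMPLING)  # copy so callers cannot mutate the default


if __name__ == "__main__":
    print("everyday use:", sampling_settings())
    print("benchmark reproduction:", sampling_settings(reproduce_benchmarks=True))
```

The keys use the same parameter names that vLLM's `SamplingParams` and the Transformers `generate` method accept, so the returned dictionary can usually be unpacked into either, though that wiring is left out here.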


The MIT license question

Ornith ships under MIT license with no regional restrictions. That's worth noting because some AI models, particularly from Chinese labs, carry geographic limitations or complex licensing terms. DeepSeek's models, for instance, have faced scrutiny over their license interpretation. MIT is about as permissive as it gets: use it commercially, modify it, distribute it, with minimal obligations.

The post-training builds on Gemma 4 (Google, permissive license) and Qwen 3.5 (Alibaba, Apache 2.0 for most variants). The combination keeps the downstream license clean for enterprise adoption.

Where Ornith fits in the agentic coding race

Agentic coding models don't just complete code. They autonomously navigate repositories, write tests, fix bugs across multiple files, and execute shell commands. The market has gotten crowded. DeepSeek Coder, CodeLlama, StarCoder, and proprietary options from Anthropic and OpenAI all compete for developer attention.

Ornith's pitch is that self-improving scaffolds produce better agents than pure output optimization. If the benchmarks hold up under independent testing, the 9B model in particular offers an interesting value proposition: a reported 69.4 on SWE-Bench Verified, against 82.4 for the 397B flagship, at a fraction of the compute cost.


Logicity's Take

Deepreinforce-ai is a new name, and benchmark claims from unknown labs deserve scrutiny until third parties reproduce them. That said, the architectural approach of jointly optimizing search strategy alongside outputs is technically interesting and aligns with recent research on reasoning-augmented models. For engineering teams evaluating self-hosted coding assistants, the 9B GGUF variant is the low-risk starting point. Run it locally, test it on your actual codebase, and compare against Qwen3.5-14B or CodeLlama-34B before committing infrastructure. If you're managing AI development workflows, tools like [n8n](https://logicity.in/r/n8n) or [Make](https://logicity.in/r/make) can help orchestrate model interactions across your pipeline.


Disclosure

Some links in this post are affiliate links. Logicity earns a commission if you sign up, at no extra cost to you. We only link products we have used or actively recommend.

Frequently Asked Questions

What hardware do I need to run Ornith-1.0 locally?

The 9B dense model fits on a single 80GB GPU for bf16 inference. GGUF quantized versions run on consumer hardware via llama.cpp or Ollama. The 35B and 397B MoE models need multi-GPU setups with tensor parallelism.

Can I use Ornith-1.0 commercially?

Yes. The MIT license permits commercial use, modification, and distribution with minimal restrictions. The underlying Gemma 4 and Qwen 3.5 base models also carry permissive licenses.

How does Ornith-1.0 compare to Claude for coding tasks?

On SWE-Bench Verified, Ornith-1.0-397B scores 82.4 versus Claude Opus 4.8 at 87.6. Claude leads on most benchmarks, but Ornith is open-source and self-hostable, which matters for teams with data residency requirements.

What does 'self-improving' mean in this context?

Ornith uses reinforcement learning to optimize both the final code output and the scaffold, meaning the strategic approach to problem-solving. The model learns better search trajectories rather than just better answers.

Is deepreinforce-ai a known organization?

Deepreinforce-ai appears to be a new entrant in the open-source AI space. The benchmarks have not yet been independently verified by third parties.





Manaal Khan

Tech & Innovation Writer

Produced with AI assistance and reviewed by the Logicity editorial team. Learn more in our Editorial Policy.
