Probably raises $9M to cut AI hallucinations with smaller models

Huma Shazia · 16 June 2026 at 6:57 pm · 4 min read

Key Takeaways

  • Probably raised $9M in seed funding from Andreessen Horowitz to build AI tools that prevent hallucinations from reaching users
  • The company's validator system lets it run on models that its founder describes as "four classes weaker" than frontier systems, which lowers token costs
  • Founder Peter Elias argues better harness engineering reduces the need for powerful models by eliminating ambiguity before inference

Probably, a new AI startup, has raised $9 million in seed funding from Andreessen Horowitz to tackle one of the most persistent problems in AI deployment: hallucinations. Rather than relying on brute-force model scaling, the company uses deterministic validation systems that catch errors before they reach end users.

Founder Peter Elias says Probably aims to hit 99.99% accuracy, the kind of reliability standard that deterministic software achieves routinely but AI systems struggle to match. The bet is that the path to reliable AI isn't necessarily bigger models. It's better scaffolding around smaller ones.

How does Probably catch AI hallucinations?

The company's first product is a data science tool that answers questions from complex datasets. Each response includes a citation and an audit trail showing how the answer was generated. That much is table stakes for enterprise AI tools in 2026.

What's different is what Elias calls the "data science mech suit." The LLM's initial responses pass through a deterministic validator system that checks results against the actual dataset. Any answer that doesn't match gets bounced back. The model has been trained against this validator, so the entire pipeline optimizes for fast, accurate responses rather than just plausible-sounding ones.
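The pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not Probably's implementation, which has not been published: the `SALES` dataset, the `validate` and `answer_with_validation` functions, and the `fake_model` stand-in for an LLM call are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Toy dataset standing in for a customer's table (made up for illustration).
SALES = [
    {"region": "north", "revenue": 120},
    {"region": "south", "revenue": 80},
    {"region": "north", "revenue": 50},
]


@dataclass
class Answer:
    value: int
    audit_trail: list = field(default_factory=list)


def validate(region: str, claimed: int) -> bool:
    """Deterministic check: recompute the total directly from the data."""
    actual = sum(row["revenue"] for row in SALES if row["region"] == region)
    return claimed == actual


def answer_with_validation(
    region: str,
    model: Callable[[str, int], int],
    max_attempts: int = 3,
) -> Optional[Answer]:
    """Ask the model, bounce back any answer the validator rejects."""
    trail = []
    for attempt in range(1, max_attempts + 1):
        claimed = model(region, attempt)
        if validate(region, claimed):
            trail.append(f"attempt {attempt}: {claimed} accepted")
            return Answer(claimed, trail)
        trail.append(f"attempt {attempt}: {claimed} rejected")
    # Nothing passed validation, so no answer reaches the user.
    return None


def fake_model(region: str, attempt: int) -> int:
    # Hypothetical stand-in for an LLM: wrong on the first try, then 170.
    return 999 if attempt == 1 else 170


result = answer_with_validation("north", fake_model)
print(result.value)        # 170
print(result.audit_trail)  # one rejected attempt, then one accepted
```

The design point is that the validator never trusts the model's output: it recomputes the result from the data, so a plausible-sounding but wrong number is rejected rather than shown. The audit trail here is only a list of strings; a real system would presumably record far more.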

"What we learned building this was that the better your harness engineering is, the weaker the model can be," Elias told TechCrunch. "If you can refine the context enough, the model does not have to work very hard to do the right thing. Basically, it's an exercise in reducing ambiguity."

Why smaller models matter for cost control

Probably's validator approach produces a practical side effect: the system runs on AI models that Elias describes as "four classes weaker than the frontier models." That's a significant gap. Frontier models today require data center infrastructure and rack up substantial token costs. Probably's tool can run on local hardware, a desktop computer rather than a server farm.

The timing matters. Token costs have been rising, and many enterprise customers are rethinking their AI budgets. A system that delivers comparable accuracy on cheaper infrastructure addresses a real pain point.


Elias argues the approach extends beyond data science. Accounting, medical services, and other precision-sensitive domains could benefit from the same validator architecture. The common thread is use cases where a 95% correct answer isn't good enough and errors carry real consequences.
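To make the gap between a 95% correct system and the 99.99% target concrete, the short calculation below counts expected wrong answers per 10,000 queries. The query volume is an arbitrary figure chosen for illustration, and the sketch assumes errors are independent and evenly spread, which real workloads may not satisfy.

```python
def expected_errors(accuracy: float, queries: int) -> float:
    """Expected number of wrong answers at a given accuracy rate."""
    return (1 - accuracy) * queries


QUERIES = 10_000  # illustrative volume, not a figure from the company

for accuracy in (0.95, 0.9999):
    errors = round(expected_errors(accuracy, QUERIES))
    print(f"{accuracy:.2%} accurate: about {errors} wrong answers")
# 95.00% accurate: about 500 wrong answers
# 99.99% accurate: about 1 wrong answers
```

At that volume, the difference is roughly 500 errors versus 1, which is why domains like accounting treat the two accuracy levels as different categories of tool.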

Are big AI labs ignoring this problem?

Elias takes a pointed stance on why the major AI labs haven't pursued this direction. "I think it's really interesting that the big AI labs have not even attempted to do this," he said. "They're incentivized not to, because they make money the more times you have to correct the model."

That's a strong claim. OpenAI, Anthropic, and Google have all invested in reducing hallucinations through techniques like retrieval-augmented generation and chain-of-thought prompting. But Elias is pointing at something structural: if your business model depends on token volume, you don't necessarily benefit from systems that get answers right on the first try.

Whether that fully explains the gap is debatable. But there's an undeniable market opening for startups building accuracy-first tooling, particularly as enterprise adoption moves from experimentation to production.

What comes next for Probably?

The $9 million seed round gives Probably runway to expand beyond its initial data science tool. Elias has signaled interest in accounting and medical applications, both fields where regulatory requirements demand audit trails and error rates have real consequences.

The broader question is whether Probably's approach can scale to more open-ended tasks. Data science queries against structured datasets are a relatively constrained domain. Extending the same validator logic to freeform text generation or multi-step reasoning would require new architectures. The company hasn't detailed plans for those use cases.


Frequently Asked Questions

What is Probably AI and what does it do?

Probably is a startup that builds AI tools designed to prevent hallucinations and factual errors. Its first product is a data science tool that runs LLM responses through a deterministic validator, which checks them against the underlying dataset before showing results to users.

How much funding did Probably raise?

Probably raised $9 million in seed funding from Andreessen Horowitz in June 2026.

How does Probably reduce AI hallucinations?

The company uses a validator harness that checks LLM outputs against the actual dataset. Results that don't match get rejected. The model is trained against this validator, optimizing the whole system for accuracy.

Can Probably's approach work for other industries?

Founder Peter Elias says the same validator architecture could extend to accounting, medical services, and other precision-sensitive use cases where errors carry significant consequences.

Logicity's Take

Probably's bet inverts the conventional AI scaling logic. Instead of throwing more parameters at reliability, it treats accuracy as an engineering problem around the model rather than inside it. If the approach holds up in production, it offers a template for cost-conscious enterprises: pair modest models with tight validation and skip the frontier model premium entirely. The open question is whether validator harnesses can generalize beyond structured data queries to messier, real-world use cases.

Source: TechCrunch / Russell Brandom

Huma Shazia

Senior AI & Tech Writer
