Probably raises $9M to cut AI hallucinations with smaller models

Key Takeaways

- Probably raised $9M seed funding from Andreessen Horowitz to build AI tools that prevent hallucinations from reaching users
- The company's validator system lets it run on models it describes as "four classes weaker" than frontier systems, cutting token costs significantly
- Founder Peter Elias argues better harness engineering reduces the need for powerful models by eliminating ambiguity before inference

Probably, a new AI startup, has raised $9 million in seed funding from Andreessen Horowitz to tackle one of the most persistent problems in AI deployment: hallucinations. The company's approach bypasses brute-force model scaling in favor of deterministic validation systems that catch errors before they ever reach end users.
Founder Peter Elias says Probably aims to hit 99.99% accuracy, the kind of reliability standard that deterministic software achieves routinely but AI systems struggle to match. The bet is that the path to reliable AI isn't necessarily bigger models. It's better scaffolding around smaller ones.
How does Probably catch AI hallucinations?
The company's first product is a data science tool that answers questions from complex datasets. Each response includes a citation and an audit trail showing how the answer was generated. That much is table stakes for enterprise AI tools in 2026.
What's different is what Elias calls the "data science mech suit." The LLM's initial responses pass through a deterministic validator system that checks results against the actual dataset. Any answer that doesn't match gets bounced back. The model has been trained against this validator, so the entire pipeline is optimized for fast, accurate responses rather than merely plausible-sounding ones.
"What we learned building this was that the better your harness engineering is, the weaker the model can be," Elias told TechCrunch. "If you can refine the context enough, the model does not have to work very hard to do the right thing. Basically, it's an exercise in reducing ambiguity."
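Probably has not published its implementation, so the following is only a minimal sketch of the general validate-and-retry pattern described above, not the company's code. It assumes a toy dataset (`SALES`), a model that returns both the aggregation it ran and the number it claims, and a stand-in `flaky_model` in place of a real LLM call. All of these names are hypothetical.

```python
from typing import Callable, Optional

# Hypothetical toy dataset standing in for a customer's real data.
SALES = [
    {"region": "east", "revenue": 120},
    {"region": "west", "revenue": 80},
    {"region": "east", "revenue": 50},
]


def recompute_total(rows: list, region: str) -> int:
    """Deterministically sum revenue for one region."""
    return sum(row["revenue"] for row in rows if row["region"] == region)


def validate(rows: list, proposal: dict) -> bool:
    """Re-run the proposed aggregation and compare it with the claimed number."""
    return proposal["claimed_total"] == recompute_total(rows, proposal["region"])


def answer_with_validation(
    model: Callable[[str, int], dict],
    rows: list,
    question: str,
    max_attempts: int = 3,
) -> tuple:
    """Return (answer, audit_trail); answer is None if no attempt validates."""
    audit = []
    answer: Optional[int] = None
    for attempt in range(1, max_attempts + 1):
        proposal = model(question, attempt)
        passed = validate(rows, proposal)
        # Every attempt is logged, so rejected answers stay visible in the audit trail.
        audit.append({"attempt": attempt, "proposal": proposal, "passed": passed})
        if passed:
            answer = proposal["claimed_total"]
            break
    return answer, audit


def flaky_model(question: str, attempt: int) -> dict:
    """Stand-in for an LLM call (assumption): wrong on the first try, right on the second."""
    claimed = 999 if attempt == 1 else 170
    return {"region": "east", "claimed_total": claimed}
```

Calling `answer_with_validation(flaky_model, SALES, "What is total revenue in the east region?")` rejects the first proposal and returns the second, along with a two-entry audit trail. The point of the design is that the user only ever sees an answer the deterministic check has confirmed, while the bounced attempt never leaves the pipeline.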
Why smaller models matter for cost control
Probably's validator approach produces a practical side effect: the system runs on AI models that Elias describes as "four classes weaker than the frontier models." That's a significant gap. Frontier models today require data center infrastructure and rack up substantial token costs. Probably's tool, by contrast, can run on local hardware: a desktop computer rather than a server farm.
The timing matters. Token costs have been rising, and many enterprise customers are rethinking their AI budgets. A system that delivers comparable accuracy on cheaper infrastructure addresses a real pain point.
Elias argues the approach extends beyond data science. Accounting, medical services, and other precision-sensitive domains could benefit from the same validator architecture. The common thread is use cases where a 95% correct answer isn't good enough and errors carry real consequences.
Are big AI labs ignoring this problem?
Elias takes a pointed stance on why the major AI labs haven't pursued this direction. "I think it's really interesting that the big AI labs have not even attempted to do this," he said. "They're incentivized not to, because they make money the more times you have to correct the model."
That's a strong claim. OpenAI, Anthropic, and Google have all invested in reducing hallucinations through techniques like retrieval-augmented generation and chain-of-thought prompting. But Elias is pointing at something structural: if your business model depends on token volume, you don't necessarily benefit from systems that get answers right on the first try.
Whether that fully explains the gap is debatable. But there's an undeniable market opening for startups building accuracy-first tooling, particularly as enterprise adoption moves from experimentation to production.
What comes next for Probably?
The $9 million seed round gives Probably runway to expand beyond its initial data science tool. Elias has signaled interest in accounting and medical applications, both fields where regulatory requirements demand audit trails and error rates have real consequences.
The broader question is whether Probably's approach can scale to more open-ended tasks. Data science queries against structured datasets are a relatively constrained domain. Extending the same validator logic to freeform text generation or multi-step reasoning would require new architectures. The company hasn't detailed plans for those use cases.
Frequently Asked Questions
What is Probably AI and what does it do?
Probably is a startup that builds AI tools designed to prevent hallucinations and factual errors. Its first product is a data science tool that runs LLM responses through a deterministic validator, checking them against the underlying dataset before showing results to users.
How much funding did Probably raise?
Probably raised $9 million in seed funding from Andreessen Horowitz in June 2026.
How does Probably reduce AI hallucinations?
The company uses a validator harness that checks LLM outputs against the actual dataset. Results that don't match get rejected. The model is trained against this validator, optimizing the whole system for accuracy.
Can Probably's approach work for other industries?
Founder Peter Elias says the same validator architecture could extend to accounting, medical services, and other precision-sensitive use cases where errors carry significant consequences.
Logicity's Take
Probably's bet inverts the conventional AI scaling logic. Instead of throwing more parameters at reliability, it treats accuracy as an engineering problem around the model rather than inside it. If the approach holds up in production, it offers a template for cost-conscious enterprises: pair modest models with tight validation and skip the frontier model premium entirely. The open question is whether validator harnesses can generalize beyond structured data queries to messier, real-world use cases.
Source: TechCrunch / Russell Brandom
Huma Shazia
Senior AI & Tech Writer