Can Claude Match Human Experts in Bioinformatics?

Key Takeaways

- Claude Mythos Preview achieves 82.6% accuracy on human-solvable bioinformatics problems
- On tasks that stumped all human experts, Claude manages only a 30% success rate
- BioMysteryBench uses 99 questions with objectively verifiable answers, not subjective interpretations
The Problem with Existing AI Biology Benchmarks
Measuring AI performance in biological research is tricky. Anthropic argues that current benchmarks each miss something important.
Knowledge tests like MMLU-Pro and GPQA check whether a model knows facts. They don't test whether it can actually do research. Benchmarks like BixBench use real datasets but evaluate models against individual scientists' conclusions. Those conclusions are subjective, shaped by each researcher's methods. Simulated lab environments like SciGym have clear right answers but lack the messiness of real biological data.
Anthropic built BioMysteryBench to address these gaps. The benchmark contains 99 questions across multiple bioinformatics domains. Specialists wrote each question using real, noisy datasets.
How BioMysteryBench Works
The benchmark's design centers on objectivity. Answers derive from verifiable properties of the data or from independently validated metadata, not from scientific interpretations that could vary between researchers.
Every question author had to submit a validation notebook proving the signal actually exists in the data. Because correctness is established by the data itself rather than by human solvers, the benchmark can also include questions that may be unsolvable for humans.
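As a concrete illustration, a validation check for a hypothetical gene-knockout question might reduce to a short statistical test like the sketch below. The file names, gene ID, and sample labels are invented for this example; Anthropic has not published its validation notebooks.

```python
# Hypothetical validation check: confirm the knockout signal exists
# in the data before the question is accepted into the benchmark.
import pandas as pd
from scipy.stats import mannwhitneyu

# Invented file and column names, for illustration only.
counts = pd.read_csv("expression_matrix.csv", index_col=0)  # genes x samples
labels = pd.read_csv("sample_metadata.csv", index_col=0)["condition"]

gene = "ENSG00000141510"  # the gene the question claims was knocked out
ko = counts.loc[gene, labels == "knockout"]
wt = counts.loc[gene, labels == "wildtype"]

# The signal "exists" if knockout samples show a clear expression drop.
stat, p = mannwhitneyu(ko, wt, alternative="less")
print(f"knockout mean={ko.mean():.1f}, wildtype mean={wt.mean():.1f}, p={p:.2e}")
assert p < 1e-3, "No verifiable knockout signal in the data"
```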
Typical tasks include identifying which organ a single-cell RNA dataset came from or figuring out which gene was knocked out in experimental samples. Claude gets a container with bioinformatics tools, access to databases like NCBI and Ensembl, and full freedom to choose its own analysis methods.
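To make the organ-identification task concrete, here is one naive approach an agent could take: score the dataset's average expression against known marker genes for each candidate organ. The marker lists, file name, and data layout are illustrative assumptions, not the benchmark's actual contents or Claude's actual method.

```python
# Illustrative marker-gene scoring for "which organ did this
# single-cell dataset come from?" -- a sketch, not Claude's method.
import pandas as pd

# Tiny, invented marker sets; a real analysis would use curated
# references such as PanglaoDB or CellMarker.
MARKERS = {
    "liver":  ["ALB", "APOA1", "TTR"],
    "lung":   ["SFTPC", "SFTPB", "AGER"],
    "kidney": ["UMOD", "NPHS2", "SLC34A1"],
}

expr = pd.read_csv("cells_by_genes.csv", index_col=0)  # cells x genes, invented name
mean_expr = expr.mean(axis=0)  # average expression per gene across all cells

scores = {
    organ: mean_expr.reindex(genes).fillna(0).mean()
    for organ, genes in MARKERS.items()
}
print(max(scores, key=scores.get))  # best-matching organ
```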
Only the final answer counts. The path Claude takes to get there doesn't affect its score.
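Because only the outcome is scored, the grader can in principle be as simple as a normalized exact-match comparison. The function below is a hypothetical sketch of that idea; Anthropic's actual grading code is not public.

```python
# Hypothetical outcome-only grader: the analysis path is ignored,
# only the submitted answer is compared against the verified key.
def grade(submitted: str, answer_key: str) -> bool:
    normalize = lambda s: s.strip().lower()
    return normalize(submitted) == normalize(answer_key)

assert grade(" Liver ", "liver")
assert not grade("lung", "liver")
```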
Results: Strong on Solvable Tasks, Weak on the Hardest
Anthropic split the 99 tasks into two groups. Seventy-six counted as "human-solvable" because at least one of up to five experts found the correct answer; the other 23 stumped every expert who tried them. Four questions originally planned for the benchmark were dropped because of flawed formulations, leaving the final set of 99.
On the human-solvable problems, Claude Mythos Preview reaches 82.6% accuracy. That matches human expert performance, according to Anthropic. The older Haiku 4.5 model scores just 36.8% on the same tasks.
The hard problems tell a different story. On the 23 tasks that no human expert could solve, Claude Mythos Preview achieves just a 30% success rate; Haiku 4.5 manages only 5.2%.
Anthropic acknowledges uncertainty about these hard tasks. It's unclear whether they're fundamentally unsolvable or just extremely difficult. A larger or differently composed expert panel might have cracked some of them.
What This Means for Bioinformatics Work
The benchmark suggests Claude can handle routine bioinformatics analysis at expert level. Identifying tissue types, spotting gene knockouts, and interpreting single-cell RNA data are bread-and-butter tasks in computational biology labs.
But the hard-problem results show clear limits. When tasks require genuinely novel reasoning or handling edge cases that stump human specialists, Claude's success rate drops to roughly one in three attempts.
For research teams, this points to a practical split. AI tools can likely accelerate standard analysis workflows. But frontier problems still need human scientists who can recognize when an approach isn't working and pivot.
Caveats Worth Noting
BioMysteryBench is Anthropic's own benchmark testing Anthropic's own model. The company has incentives to design tests that showcase Claude's strengths. Independent replication using different datasets would strengthen these claims.
The expert panel size matters too. Five experts per task is a small sample. Some "unsolvable" problems might simply need a specialist with the right background. Without knowing who the experts were or their specific domains, it's hard to gauge how representative their performance is.
Still, the benchmark design is more rigorous than many AI evaluations. Requiring validation notebooks and objectively verifiable answers removes some subjectivity that plagues other benchmarks.
Frequently Asked Questions
What is BioMysteryBench?
BioMysteryBench is Anthropic's benchmark for testing AI performance on bioinformatics problems. It contains 99 questions with objectively verifiable answers based on real, noisy biological datasets.
How accurate is Claude on bioinformatics tasks?
Claude Mythos Preview achieves 82.6% accuracy on human-solvable bioinformatics problems. On the hardest tasks, which stumped all human experts, it manages only a 30% success rate.
Can AI replace human bioinformatics experts?
Not yet. Claude matches human performance on routine tasks but struggles with the hardest problems. The benchmark suggests AI can accelerate standard analysis but frontier research still needs human scientists.
What tools does Claude use for bioinformatics analysis?
Claude gets access to a container with bioinformatics tools and databases like NCBI and Ensembl. It has full freedom to choose its own analysis methods to solve each problem.
Source: The Decoder / Maximilian Schreiner