Can Claude Match Human Experts in Bioinformatics?

Key Takeaways

- Claude Mythos Preview achieves 82.6% accuracy on human-solvable bioinformatics problems
- On tasks that stumped all human experts, Claude manages only a 30% success rate
- BioMysteryBench uses 99 questions with objectively verifiable answers, not subjective interpretations
The Problem with Existing AI Biology Benchmarks
Measuring AI performance in biological research is tricky. Anthropic argues that current benchmarks each miss something important.
Knowledge tests like MMLU-Pro and GPQA check whether a model knows facts. They don't test whether it can actually do research. Benchmarks like BixBench use real datasets but evaluate models against individual scientists' conclusions. Those conclusions are subjective, shaped by each researcher's methods. Simulated lab environments like SciGym have clear right answers but lack the messiness of real biological data.
Anthropic built BioMysteryBench to address these gaps. The benchmark contains 99 questions across multiple bioinformatics domains. Specialists wrote each question using real, noisy datasets.
How BioMysteryBench Works
The benchmark's design centers on objectivity. Answers derive from verifiable properties of the data or from independently validated metadata, not from scientific interpretations that could vary between researchers.
Every question author had to submit a validation notebook proving the signal actually exists in the data. Because correctness is established by the data itself rather than by human solvers, the benchmark can also include questions that may be unsolvable for humans.
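As a concrete illustration, a validation check for a hypothetical gene-knockout question might reduce to a short statistical test like the sketch below. The file names, gene ID, and sample labels are invented for this example; Anthropic has not published its validation notebooks.

```python
# Hypothetical validation check: confirm the knockout signal exists
# in the data before the question is accepted into the benchmark.
import pandas as pd
from scipy.stats import mannwhitneyu

# Invented file and column names, for illustration only.
counts = pd.read_csv("expression_matrix.csv", index_col=0)  # genes x samples
labels = pd.read_csv("sample_metadata.csv", index_col=0)["condition"]

gene = "ENSG00000141510"  # the gene the question claims was knocked out
ko = counts.loc[gene, labels == "knockout"]
wt = counts.loc[gene, labels == "wildtype"]

# The signal "exists" if knockout samples show a clear expression drop.
stat, p = mannwhitneyu(ko, wt, alternative="less")
print(f"knockout mean={ko.mean():.1f}, wildtype mean={wt.mean():.1f}, p={p:.2e}")
assert p < 1e-3, "No verifiable knockout signal in the data"
```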
Typical tasks include identifying which organ a single-cell RNA dataset came from or figuring out which gene was knocked out in experimental samples. Claude gets a container with bioinformatics tools, access to databases like NCBI and Ensembl, and full freedom to choose its own analysis methods.
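To make the organ-identification task concrete, here is one naive approach an agent could take: score the dataset's average expression against known marker genes for each candidate organ. The marker lists, file name, and data layout are illustrative assumptions, not the benchmark's actual contents or Claude's actual method.

```python
# Illustrative marker-gene scoring for "which organ did this
# single-cell dataset come from?" -- a sketch, not Claude's method.
import pandas as pd

# Tiny, invented marker sets; a real analysis would use curated
# references such as PanglaoDB or CellMarker.
MARKERS = {
    "liver":  ["ALB", "APOA1", "TTR"],
    "lung":   ["SFTPC", "SFTPB", "AGER"],
    "kidney": ["UMOD", "NPHS2", "SLC34A1"],
}

expr = pd.read_csv("cells_by_genes.csv", index_col=0)  # cells x genes, invented name
mean_expr = expr.mean(axis=0)  # average expression per gene across all cells

scores = {
    organ: mean_expr.reindex(genes).fillna(0).mean()
    for organ, genes in MARKERS.items()
}
print(max(scores, key=scores.get))  # best-matching organ
```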
Only the final answer counts. The path Claude takes to get there doesn't affect its score.
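Because only the outcome is scored, the grader can in principle be as simple as a normalized exact-match comparison. The function below is a hypothetical sketch of that idea; Anthropic's actual grading code is not public.

```python
# Hypothetical outcome-only grader: the analysis path is ignored,
# only the submitted answer is compared against the verified key.
def grade(submitted: str, answer_key: str) -> bool:
    normalize = lambda s: s.strip().lower()
    return normalize(submitted) == normalize(answer_key)

assert grade(" Liver ", "liver")
assert not grade("lung", "liver")
```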
Results: Strong on Solvable Tasks, Weak on the Hardest
Anthropic split the 99 tasks into two groups. Seventy-six counted as "human-solvable" because at least one of up to five experts found the correct answer; the other 23 stumped every expert who tried them. Four questions originally planned for the benchmark were dropped because of flawed formulations, leaving the final set of 99.
On the human-solvable problems, Claude Mythos Preview reaches 82.6% accuracy. That matches human expert performance, according to Anthropic. The older Haiku 4.5 model scores just 36.8% on the same tasks.
The hard problems tell a different story. On the 23 tasks that no human expert could solve, Claude Mythos Preview achieves just a 30% success rate; Haiku 4.5 manages only 5.2%.
Anthropic acknowledges uncertainty about these hard tasks. It's unclear whether they're fundamentally unsolvable or just extremely difficult. A larger or differently composed expert panel might have cracked some of them.
What This Means for Bioinformatics Work
The benchmark suggests Claude can handle routine bioinformatics analysis at expert level. Identifying tissue types, spotting gene knockouts, and interpreting single-cell RNA data are bread-and-butter tasks in computational biology labs.
But the hard-problem results show clear limits. When tasks require genuinely novel reasoning or handling edge cases that stump human specialists, Claude's success rate drops to roughly one in three attempts.
For research teams, this points to a practical split. AI tools can likely accelerate standard analysis workflows. But frontier problems still need human scientists who can recognize when an approach isn't working and pivot.
Caveats Worth Noting
BioMysteryBench is Anthropic's own benchmark testing Anthropic's own model. The company has incentives to design tests that showcase Claude's strengths. Independent replication using different datasets would strengthen these claims.
The expert panel size matters too. Five experts per task is a small sample. Some "unsolvable" problems might simply need a specialist with the right background. Without knowing who the experts were or their specific domains, it's hard to gauge how representative their performance is.
Still, the benchmark design is more rigorous than many AI evaluations. Requiring validation notebooks and objectively verifiable answers removes some subjectivity that plagues other benchmarks.
Frequently Asked Questions
What is BioMysteryBench?
BioMysteryBench is Anthropic's benchmark for testing AI performance on bioinformatics problems. It contains 99 questions with objectively verifiable answers based on real, noisy biological datasets.
How accurate is Claude on bioinformatics tasks?
Claude Mythos Preview achieves 82.6% accuracy on human-solvable bioinformatics problems. On the hardest tasks, which stumped all human experts, it manages only a 30% success rate.
Can AI replace human bioinformatics experts?
Not yet. Claude matches human performance on routine tasks but struggles with the hardest problems. The benchmark suggests AI can accelerate standard analysis but frontier research still needs human scientists.
What tools does Claude use for bioinformatics analysis?
Claude gets access to a container with bioinformatics tools and databases like NCBI and Ensembl. It has full freedom to choose its own analysis methods to solve each problem.
Source: The Decoder / Maximilian Schreiner