Can Claude Match Human Experts in Bioinformatics?

Huma Shazia · 30 April 2026, 6:38 pm · 5 min read

Key Takeaways

  • Claude Mythos Preview achieves 82.6% accuracy on human-solvable bioinformatics problems
  • On tasks that stumped all human experts, Claude manages only a 30% success rate
  • BioMysteryBench uses 99 questions with objectively verifiable answers, not subjective interpretations

The Problem with Existing AI Biology Benchmarks

Measuring AI performance in biological research is tricky. Anthropic argues that current benchmarks each miss something important.

Knowledge tests like MMLU-Pro and GPQA check whether a model knows facts. They don't test whether it can actually do research. Benchmarks like BixBench use real datasets but evaluate models against individual scientists' conclusions. Those conclusions are subjective, shaped by each researcher's methods. Simulated lab environments like SciGym have clear right answers but lack the messiness of real biological data.

Anthropic built BioMysteryBench to address these gaps. The benchmark contains 99 questions across multiple bioinformatics domains. Specialists wrote each question using real, noisy datasets.

How BioMysteryBench Works

The benchmark's design centers on objectivity. Answers derive from verifiable properties of the data or from independently validated metadata, not from scientific interpretations that could vary between researchers.

Every question author had to submit a validation notebook proving the signal actually exists in the data. Because correctness is established by the data itself rather than by human consensus, the benchmark can also include questions that may be unsolvable for humans.
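Anthropic hasn't published these notebooks, but the idea is easy to illustrate. Here is a minimal sketch of such a check for a hypothetical gene-knockout question; the file names, column names, and Foxp3 target are all illustrative assumptions, not Anthropic's actual setup:

```python
# Hypothetical validation check for a gene-knockout question.
# File names, column names, and the target gene are illustrative;
# Anthropic's actual validation notebooks are not public.
import pandas as pd
from scipy import stats

counts = pd.read_csv("expression_matrix.csv", index_col=0)  # genes x samples
meta = pd.read_csv("sample_metadata.csv", index_col=0)      # validated labels

ko_gene = "Foxp3"  # the gene the question claims was knocked out
ko = meta.index[meta["condition"] == "knockout"]
wt = meta.index[meta["condition"] == "control"]

# The signal counts as verifiable if the target gene's expression
# collapses in knockout samples relative to controls.
_, p = stats.mannwhitneyu(
    counts.loc[ko_gene, ko], counts.loc[ko_gene, wt], alternative="less"
)
assert p < 0.01, "signal not recoverable from the data -- reject question"
print(f"{ko_gene} depleted in knockout samples (p = {p:.2e})")
```

If the assertion fails, the signal can't be recovered from the data at all, and the question never makes it into the benchmark.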

Typical tasks include identifying which organ a single-cell RNA dataset came from or figuring out which gene was knocked out in experimental samples. Claude gets a container with bioinformatics tools, access to databases like NCBI and Ensembl, and full freedom to choose its own analysis methods.
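One plausible route for the organ-identification task is marker-gene scoring: normalize the counts, score a few tissue-specific marker panels, and pick the tissue with the strongest signal. The sketch below assumes this approach; the input file and marker panels are toy examples, and Claude is free to choose entirely different methods.

```python
# Sketch of marker-gene scoring with scanpy to guess a dataset's organ
# of origin. The input file and marker panels are hypothetical examples.
import scanpy as sc

adata = sc.read_h5ad("mystery_dataset.h5ad")
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

markers = {  # toy panels; a real analysis would use curated references
    "liver":  ["ALB", "APOA1", "TTR"],
    "lung":   ["SFTPC", "SFTPB", "AGER"],
    "kidney": ["UMOD", "NPHS2", "SLC34A1"],
}
for tissue, genes in markers.items():
    present = [g for g in genes if g in adata.var_names]
    sc.tl.score_genes(adata, gene_list=present, score_name=f"score_{tissue}")

# The tissue with the highest mean marker score across cells is the guess.
best = max(markers, key=lambda t: adata.obs[f"score_{t}"].mean())
print(f"Predicted organ: {best}")
```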

Only the final answer counts. The path Claude takes to get there doesn't affect its score.
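In other words, grading is answer-only. A toy sketch of what that implies; the normalization and exact-match rule here are assumptions, since Anthropic doesn't describe the grading procedure in detail:

```python
# Answer-only grading sketch: the agent's analysis path is never inspected.
def _norm(s: str) -> str:
    return s.strip().lower()

def grade(model_answer: str, ground_truth: str) -> bool:
    # Only the final answer is compared; intermediate steps are ignored.
    return _norm(model_answer) == _norm(ground_truth)

assert grade(" Liver ", "liver")     # any method that lands here scores
assert not grade("kidney", "liver")  # an elegant but wrong analysis scores zero
```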

Results: Strong on Solvable Tasks, Weak on the Hardest

Anthropic split the 99 tasks into two groups. Seventy-six counted as "human-solvable" because at least one of the up to five experts assigned to each task found the correct answer. The remaining 23 stumped every expert who tried them. Four additional questions were dropped from the original pool because of flawed formulations.

On the human-solvable problems, Claude Mythos Preview reaches 82.6% accuracy, which Anthropic says matches human expert performance. The older Claude Haiku 4.5 reaches only 36.8% on the same tasks.

The hard problems tell a different story. On the 23 tasks that no human expert could solve, Claude Mythos Preview achieves just a 30% success rate; Haiku 4.5 manages only 5.2%.

Anthropic acknowledges uncertainty about these hard tasks. It's unclear whether they're fundamentally unsolvable or just extremely difficult. A larger or differently composed expert panel might have cracked some of them.

What This Means for Bioinformatics Work

The benchmark suggests Claude can handle routine bioinformatics analysis at expert level. Identifying tissue types, spotting gene knockouts, interpreting single-cell RNA data: these are bread-and-butter tasks in computational biology labs.

But the hard-problem results show clear limits. When tasks require genuinely novel reasoning or handling edge cases that stump human specialists, Claude's success rate drops to roughly one in three attempts.

For research teams, this points to a practical split. AI tools can likely accelerate standard analysis workflows. But frontier problems still need human scientists who can recognize when an approach isn't working and pivot.

Caveats Worth Noting

BioMysteryBench is Anthropic's own benchmark testing Anthropic's own model. The company has incentives to design tests that showcase Claude's strengths. Independent replication using different datasets would strengthen these claims.

The expert panel size matters too. Five experts per task is a small sample. Some "unsolvable" problems might simply need a specialist with the right background. Without knowing who the experts were or their specific domains, it's hard to gauge how representative their performance is.

Still, the benchmark design is more rigorous than many AI evaluations. Requiring validation notebooks and objectively verifiable answers removes some of the subjectivity that plagues other benchmarks.


Frequently Asked Questions

What is BioMysteryBench?

BioMysteryBench is Anthropic's benchmark for testing AI performance on bioinformatics problems. It contains 99 questions with objectively verifiable answers based on real, noisy biological datasets.

How accurate is Claude on bioinformatics tasks?

Claude Mythos Preview achieves 82.6% accuracy on human-solvable bioinformatics problems. On the hardest tasks, the ones that stumped all human experts, it manages only a 30% success rate.

Can AI replace human bioinformatics experts?

Not yet. Claude matches human performance on routine tasks but struggles with the hardest problems. The benchmark suggests AI can accelerate standard analysis but frontier research still needs human scientists.

What tools does Claude use for bioinformatics analysis?

Claude gets access to a container with bioinformatics tools and databases like NCBI and Ensembl. It has full freedom to choose its own analysis methods to solve each problem.


Source: The Decoder / Maximilian Schreiner

Huma Shazia

Senior AI & Tech Writer