AI matches doctors in diagnosis, but faces a shelf-life problem

By Huma Shazia | 20 June 2026 at 10:12 am | 5 min read

Key Takeaways

Germany's MIRA system hit 88.9% diagnostic accuracy across more than 500 emergency cases; in a 311-case head-to-head it scored 87.8%, beating experienced specialists who scored 78.1%
Google's AMIE produced an appropriate first-visit treatment plan in 95% of cases versus 72% for primary care physicians
Both AI systems were built on base models that are already outdated, suggesting rapid obsolescence in medical AI

Two AI diagnostic systems matched or beat physicians in clinical decision-making, according to studies published simultaneously in Nature. The German system MIRA outperformed doctors in emergency diagnoses. Google's AMIE produced more accurate treatment plans than primary care physicians. But both systems run on base models that are already outdated, which raises an uncomfortable question: how long will any medical AI stay clinically relevant?

How did MIRA perform against emergency room doctors?

MIRA (Medical Intelligence for Reasoning and Action) was developed by TUD Dresden and Heidelberg University. Unlike chatbots that answer questions, MIRA operates as an autonomous agent inside a sealed, virtual electronic health record. It can choose from more than 85,000 options across eleven clinical tools. It takes patient histories, orders lab work and imaging, interprets results, generates differential diagnoses, and writes treatment plans including prescriptions and surgical planning.

The research team tested MIRA on over 500 real emergency department cases from the public MIMIC-IV dataset. A second AI agent played the patient, sharing only information from the actual medical record. Across eight disease categories, MIRA hit the right diagnosis 88.9 percent of the time.
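The key constraint in this setup is that the simulated patient can only reveal what the chart already contains. The study used a second AI agent for this; the sketch below swaps it for a plain lookup table to show the constraint in isolation. The class name, field names, and sample record are all hypothetical and are not taken from MIRA's code or from MIMIC-IV.

```python
from dataclasses import dataclass, field


@dataclass
class RecordBoundPatient:
    """Hypothetical stand-in for the patient agent: answers only from the record."""

    record: dict[str, str]
    disclosed: list[str] = field(default_factory=list)

    def answer(self, topic: str) -> str:
        key = topic.strip().lower()
        if key in self.record:
            # Track what has been revealed so a reviewer can audit the dialogue.
            self.disclosed.append(key)
            return self.record[key]
        # Nothing in the chart on this topic: decline rather than invent a detail.
        return "I'm not sure."


# Made-up record for illustration only.
patient = RecordBoundPatient(
    record={
        "chief complaint": "pain in the lower right abdomen since last night",
        "current medications": "ibuprofen as needed",
    }
)
```

A real patient agent would phrase its answers in natural language, but the same rule applies: anything absent from the record stays unsaid.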

In a direct head-to-head comparison using 311 cases under identical conditions, MIRA scored 87.8 percent. Four experienced specialists reached 78.1 percent. A mixed team of residents and specialists managed 71.1 percent.
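To gauge how much sampling noise sits behind a 311-case comparison, one can put a confidence interval around each accuracy figure. The sketch below uses the Wilson score interval, which is a choice made here and not a method the study is reported to have used. It also assumes each percentage was measured over all 311 cases, so the correct-case counts are reconstructed from the rounded percentages rather than taken from the paper.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumption: every group was scored on the same 311 cases.
N_CASES = 311
reported_accuracy = {"MIRA": 0.878, "specialists": 0.781, "mixed team": 0.711}


def summarize() -> dict[str, tuple[float, float]]:
    """Reconstruct counts from the rounded percentages and compute intervals."""
    intervals = {}
    for group, accuracy in reported_accuracy.items():
        correct = round(accuracy * N_CASES)
        intervals[group] = wilson_interval(correct, N_CASES)
    return intervals


if __name__ == "__main__":
    for group, (low, high) in summarize().items():
        print(f"{group}: {low:.3f} to {high:.3f}")
```

Intervals that sit well apart suggest the gap is unlikely to be sampling noise alone, though they say nothing about the study's other limitations.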

MIRA excelled at appendicitis (98.6 percent accuracy) and pancreatitis (92.3 percent). Both AI and doctors struggled more with pneumonia (72.4 percent) and urinary tract infections (77.6 percent).

Safety checks found no dangerous drug interactions, no incorrect dosing for patients with impaired kidney function, and no risky painkiller prescriptions. MIRA captured patients' current medications almost perfectly and didn't miss a single case that required hospitalization. Performance held steady even when test patients spoke only German or French, or acted particularly anxious. The source code is available on GitHub.

What makes Google's AMIE different?

Google's AMIE takes a different approach: managing patients across multiple visits rather than a single emergency encounter. The system pairs two agents. A conversational agent handles fast, friendly dialogue with the patient. A second agent works in the background, thinking more carefully and cross-referencing the case against medical guidelines.

Google compared AMIE with 21 primary care physicians across 100 cases spanning multiple visits. The benchmark was the UK's NICE Guidance and BMJ Best Practice guidelines. Actors portrayed patients via text chat.

At the first visit, AMIE's overall plan was rated appropriate in 95 percent of cases. For the physicians, that number was 72 percent. AMIE matched the physicians on treatment decisions and beat them on plan accuracy and guideline adherence. Both specialist reviewers and the patient actors preferred AMIE more often than the human doctors.

To test drug knowledge, the team built a dedicated benchmark called RxQA, based on two national drug formularies and verified by licensed pharmacists. AMIE outscored the primary care physicians on the harder questions. The test was tough for both sides. Even on the easier questions, the best score stayed below 75 percent.

Why these results come with asterisks

Both research teams warn against jumping to conclusions. MIRA recommended care that deviated from best practices for a "small but non-zero" share of patients. The simulated patient's answers may have been more structured than real speech from people in emergency departments.

There's also the data contamination question. The MIMIC-IV dataset is freely available, so the models behind these systems may already have seen it during training. If so, the measured performance would be more of a ceiling than a realistic estimate of how these systems would perform on truly novel cases.

The comparison physicians worked in the German emergency department system, which differs from those of other countries. The AMIE developers call their study a "milestone" but stress that neither the case selection nor the text-only communication reflects real clinical conditions.

The obsolescence problem hiding in plain sight

Here's the detail that should make hospital administrators pause: both AI systems run on base models that are already outdated. Medical AI development cycles are measured in years. Regulatory approval adds more time. By the time a system clears the hurdles for clinical deployment, the underlying AI may be two or three generations behind.
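The arithmetic behind that estimate is simple. The sketch below counts how many newer base models would ship while a system moves through development and approval. Every number in it is an illustrative assumption, not a figure from either study, and the function name is hypothetical.

```python
import math


def generations_behind(
    development_years: float,
    approval_years: float,
    release_interval_years: float,
) -> int:
    """Count base-model releases that land between model selection and deployment."""
    if release_interval_years <= 0:
        raise ValueError("release_interval_years must be positive")
    elapsed = development_years + approval_years
    return math.floor(elapsed / release_interval_years)


# Illustrative assumption: two years of development, one year of regulatory
# review, and a new base-model generation roughly once a year.
lag = generations_behind(development_years=2, approval_years=1, release_interval_years=1)
```

Shortening the release interval or lengthening the approval step pushes the lag up quickly, which is why the upgrade path matters as much as the benchmark score.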

This creates a strange dynamic. A hospital might adopt an AI diagnostic tool that performs brilliantly in trials, only to find that its foundation is technically obsolete before the first patient sees it in practice. The question isn't just whether AI can match doctors. It's whether any given AI system can maintain clinical relevance long enough to justify the implementation cost.

Logicity's Take

These studies demonstrate what's possible in controlled conditions. They don't demonstrate what's deployable. The bigger story here is structural: medical AI faces a shelf-life problem that other AI applications don't. A chatbot can update overnight. A diagnostic system embedded in hospital workflows, trained on specific datasets, and cleared by regulators cannot. Health systems considering medical AI investments should ask vendors hard questions about upgrade paths, not just accuracy benchmarks.

Frequently Asked Questions

Can AI diagnose diseases as accurately as doctors?

In controlled studies using simulated patients, yes. MIRA achieved 88.9% diagnostic accuracy across more than 500 cases. In a 311-case head-to-head, it scored 87.8% compared to 78.1% for experienced specialists. However, real clinical conditions introduce variables these studies did not test.

Is Google's AMIE available for clinical use?

No. AMIE is a research system. Google's developers describe the study as a milestone, not a product launch. Regulatory approval and real-world validation would be required before any clinical deployment.

What diseases can medical AI diagnose best?

In the MIRA study, AI performed best on appendicitis (98.6% accuracy) and pancreatitis (92.3%). Both AI and human doctors struggled more with pneumonia and urinary tract infections.

Are AI diagnostic tools safe for patients?

The MIRA study found no dangerous drug interactions or incorrect dosing in its recommendations. However, researchers noted the system still recommended care that deviated from best practices for a small number of patients.

When will AI replace doctors in hospitals?

These studies suggest AI can support diagnostic decisions, not replace physicians. Both research teams warn against jumping to conclusions, and the AMIE developers stress that their study setup does not reflect real clinical conditions.

Source: The Decoder / Maximilian Schreiner