OpenAI's ChatGPT for Clinicians Beats Doctors on Medical Benchmark

Key Takeaways

- GPT-5.4 scored 59.0 on HealthBench Professional versus 43.7 for human doctors with unlimited time and internet access
- The free tool is available to verified physicians, advanced-practice nurses, physician assistants, and pharmacists in the US
- OpenAI built both the benchmark and the product being tested, which raises methodological concerns

OpenAI released ChatGPT for Clinicians this week, a free AI assistant built for everyday medical work. The company claims its GPT-5.4 model outperforms human doctors on clinical tasks by a wide margin, even when those doctors have unlimited time and full internet access.
The tool is now available to verified healthcare professionals in the United States. Physicians, advanced-practice nurses, physician assistants, and pharmacists can access it at no cost.
What the Benchmark Shows
OpenAI published HealthBench Professional alongside the launch. The benchmark measures AI performance across three clinical areas: consultations, writing and documentation, and medical research. It uses doctor-written conversations, multi-level physician scoring, and targeted data filtering.
GPT-5.4 running in the ChatGPT for Clinicians workspace scored 59.0 overall. Doctor-written responses came in at 43.7. Every other AI model tested scored below the Clinicians version: the base GPT-5.4 hit 48.1, Anthropic's Claude Opus 4.7 reached 47.0, Google's Gemini 3.1 Pro scored 43.8, and xAI's Grok 4.2 landed at 36.1.

The clinical workspace version scored about 11 points higher than base GPT-5.4. OpenAI did not clarify how much of that gap comes from the clinical setup versus how the benchmark was built.
A Tough Test by Design
OpenAI says the benchmark was designed to be difficult. About a third of the examples come from targeted "red teaming," where doctors actively tried to find weaknesses in the models. The hardest conversations were overrepresented by a factor of 3.5.
The benchmark builds on the earlier HealthBench and includes multi-level physician scoring. OpenAI reports that 99.6 percent of answers were rated reliable by evaluators.
The Methodology Problem
There's an obvious issue with these results. OpenAI built the benchmark and tested its own product. That's not unusual in AI research, but it means the numbers deserve scrutiny.
Benchmark scores also don't translate directly to real clinical practice. A model that excels at structured evaluation tasks might perform differently in the chaos of an emergency room or the nuance of a long-term patient relationship.
What the Tool Actually Does
ChatGPT for Clinicians includes features aimed at daily medical work. The system offers real-time clinical searches across specialist literature, templates for recurring workflows, and automatic recognition of continuing medical education credits.
The tool is currently limited to US healthcare professionals who can verify their credentials. OpenAI hasn't announced plans for international expansion.
| Model | HealthBench Professional Score |
|---|---|
| GPT-5.4 (Clinicians workspace) | 59.0 |
| GPT-5.4 (base) | 48.1 |
| Claude Opus 4.7 | 47.0 |
| Gemini 3.1 Pro | 43.8 |
| Human doctors (unlimited time/internet) | 43.7 |
| Grok 4.2 | 36.1 |
What This Means in Practice
The 15.3-point gap between AI and human doctors looks striking. But context matters. Doctors don't typically have unlimited time to answer questions. They juggle patients, paperwork, and interruptions. An AI that scores higher under test conditions might still serve best as a second opinion rather than a replacement.
The more interesting number might be the 11-point gap between the Clinicians workspace and base GPT-5.4. That suggests specialized tuning and medical-specific features add real value, which could shape how healthcare organizations think about deploying AI tools.
Frequently Asked Questions
Is ChatGPT for Clinicians free?
Yes. OpenAI offers it at no cost to verified physicians, advanced-practice nurses, physician assistants, and pharmacists in the United States.
How did GPT-5.4 compare to human doctors?
GPT-5.4 in the Clinicians workspace scored 59.0 on HealthBench Professional. Human doctors scored 43.7, despite having unlimited time and internet access during the test.
Which AI models were tested on HealthBench Professional?
OpenAI tested GPT-5.4 (base and Clinicians versions), Anthropic's Claude Opus 4.7, Google's Gemini 3.1 Pro, and xAI's Grok 4.2. The Clinicians version of GPT-5.4 scored highest.
Is ChatGPT for Clinicians available outside the US?
Not currently. OpenAI has only announced availability for verified US healthcare professionals and has not shared international expansion plans.
Source: The Decoder / Matthias Bastian
Manaal Khan
Tech & Innovation Writer