Harvard Study: AI Outperformed Emergency Room Doctors on Diagnoses
By Hector Herrera | May 3, 2026 | Health
A Harvard-affiliated study published this week found that at least one large language model outperformed human emergency room physicians on diagnostic accuracy — using only electronic health records and brief clinical notes. This is not a simulation. It is a controlled benchmark against real ER cases, and the AI won.
The result matters because emergency medicine is one of the highest-stakes, highest-noise environments in healthcare. Doctors work fast, with incomplete information, under pressure. If AI can consistently outperform physicians in that setting, it changes the calculus for how hospitals should be deploying these tools right now.
What the Study Found
Researchers at Harvard-affiliated institutions benchmarked at least one frontier large language model against two human emergency room physicians on a set of real-world ER cases. The inputs were the same for both: electronic health records (EHRs) and brief clinical notes — the kind of structured documentation that flows through every hospital in the United States.
The AI model showed measurably higher diagnostic accuracy than the human physicians across the case set.
Key details:
- The study used real ER cases, not synthetic scenarios
- Inputs were limited to EHR data and clinical notes — no imaging, no physical exam findings
- The benchmark compared AI performance against two physicians, not a statistical average
- At least one LLM cleared the performance bar
The study adds to a growing body of research showing frontier AI models approaching or exceeding clinical-level diagnostic performance in narrowly defined tasks. Earlier this year, a separate evaluation of OpenAI's models on emergency department diagnostic benchmarks showed similar patterns.
Why the Two-Physician Comparison Still Matters
A common objection to AI-vs-doctor studies is sample size on the human side. Two physicians is a thin comparison group, and individual variability is high. That is a fair critique.
But here is the counterpoint: the study was not trying to prove AI should replace ER doctors. It was testing whether an AI model, given the same data a physician would see, could arrive at the correct diagnosis more often. The answer, in this controlled setting, was yes.
What the study does not tell us:
- Whether AI accuracy holds across diverse patient populations
- How AI performs on atypical presentations or rare conditions
- Whether AI guidance improves outcomes when used alongside physicians
These are open questions. They matter enormously before any clinical deployment.
What This Means for Hospitals and Health Systems
For hospital administrators and clinical informatics teams, this study adds pressure to a decision that many have been deferring: whether to integrate AI diagnostic support into the ER workflow.
Several health systems are already piloting AI triage and diagnostic tools in controlled settings. The Harvard findings will accelerate those conversations — and, more importantly, accelerate the ask from physicians and nurses who are already using AI tools outside the official channels.
The risk of moving too fast is real: AI diagnostic errors in a high-acuity environment can be fatal. The risk of moving too slowly is also real: if AI-assisted diagnosis reduces missed diagnoses by even a few percentage points, delays in deployment have a body count.
For patients, the practical near-term implication is not "the AI will diagnose you." It is that AI-assisted review of your EHR, flagging potential missed diagnoses, is closer to standard of care than most people realize.
What to Watch
The next signal to track is whether this study prompts FDA movement on AI diagnostic decision-support clearances, and whether any major health system announces an expanded ER deployment in response. Peer review of the methodology — specifically the case selection criteria and physician sample — will be the first test of whether these findings hold.
Hector Herrera is the founder of Hex AI Systems and editor of NexChron.com.