OpenAI's o1 Model Outperforms ER Doctors in Diagnostic Accuracy

OpenAI's o1 preview model achieved 67% diagnostic accuracy in real emergency room cases, surpassing two attending physicians who scored 55% and 50% respectively, according to a Harvard-led study published in the journal Science. The research, conducted in collaboration with Stanford, analyzed 76 patients from Beth Israel's emergency room and is one of the first peer-reviewed studies to report superior AI performance on actual clinical diagnostic tasks using real patient data.

O1 Model Excels at Emergency Triage Decisions

The study compared diagnoses from two internal medicine attending physicians against those generated by OpenAI's o1 and 4o models. Two additional attending physicians assessed all diagnoses without knowing their source. The o1 model offered "the exact or very close diagnosis" in 67% of triage cases, performing nominally better than or on par with human physicians and the 4o model at each diagnostic touchpoint.
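The blinded comparison described above can be illustrated with a short sketch. This is not the study's code or data: the function names, the record fields, and the toy cases below are all hypothetical, and exact string matching stands in for the reviewers' "exact or very close" judgment. It only shows the general idea of hiding each diagnosis's source from the grader and tallying accuracy per source afterward.

```python
import random
from collections import defaultdict


def blind(diagnoses, seed=0):
    """Shuffle diagnoses and strip the source label.

    Returns the blinded items plus a private key mapping item_id -> source,
    which the grader never sees.
    """
    rng = random.Random(seed)
    shuffled = list(diagnoses)
    rng.shuffle(shuffled)
    blinded, key = [], {}
    for item_id, d in enumerate(shuffled):
        blinded.append({"item_id": item_id, "case_id": d["case_id"], "text": d["text"]})
        key[item_id] = d["source"]
    return blinded, key


def accuracy_by_source(grades, key):
    """Unblind the grades (item_id -> bool) and return source -> fraction correct."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item_id, correct in grades.items():
        source = key[item_id]
        totals[source] += 1
        hits[source] += int(correct)
    return {source: hits[source] / totals[source] for source in totals}


# Toy, made-up cases (NOT study data), used only to exercise the functions.
diagnoses = [
    {"case_id": 1, "source": "model", "text": "appendicitis"},
    {"case_id": 1, "source": "physician", "text": "gastroenteritis"},
    {"case_id": 2, "source": "model", "text": "pulmonary embolism"},
    {"case_id": 2, "source": "physician", "text": "pulmonary embolism"},
]
reference = {1: "appendicitis", 2: "pulmonary embolism"}

blinded, key = blind(diagnoses)
# Stand-in for a human reviewer: here a grade is a simple exact match.
grades = {b["item_id"]: b["text"] == reference[b["case_id"]] for b in blinded}
accuracy = accuracy_by_source(grades, key)
print(accuracy)
```

Keeping the source key separate from the items handed to the grader is what makes the review blind; the scores are only attributed to a model or a physician after grading is complete.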

The performance gap was most pronounced at initial ER triage, the first diagnostic touchpoint, where physicians have the least information and the greatest urgency to decide correctly. Diagnostic errors at this stage can have serious consequences.

Step-by-Step Reasoning Proves Effective for Medical Diagnosis

The o1 preview model's distinctive step-by-step reasoning capabilities appear particularly well-suited for medical diagnosis, where systematic analysis of symptoms, patient history, and clinical presentation is essential. This approach differs fundamentally from earlier language models that generated responses without explicit reasoning traces.

Key study details:

76 real emergency room patients analyzed
Two attending physicians provided human diagnoses
Two additional physicians blind-reviewed all diagnoses
o1 and 4o models compared against human performance
Results published in the peer-reviewed journal Science

Clinical Significance for Emergency Medicine

The study demonstrates AI's potential to assist in the most challenging diagnostic scenarios, when physicians have limited information and must make rapid decisions. Emergency room triage is a high-stakes environment where accurate initial diagnosis directly affects patient outcomes and resource allocation.

This research marks a significant milestone in healthcare AI, moving beyond performance on medical licensing exams to demonstrating practical utility with actual patient data in real clinical settings.

Key Takeaways

OpenAI's o1 model achieved 67% diagnostic accuracy on 76 real ER cases, compared to 50-55% for attending physicians
The AI's advantage was most pronounced at initial triage when information is limited and decisions are most urgent
The study represents one of the first peer-reviewed demonstrations of superior AI performance using real patient data rather than simulated scenarios
o1's step-by-step reasoning approach appears particularly well suited to systematic medical diagnosis
Research was conducted by Harvard and Stanford teams and published in the journal Science
