OpenAI's o1-preview model achieved 67% diagnostic accuracy on real emergency room cases, ahead of two attending physicians who scored 55% and 50%, according to a Harvard-led study published in the journal Science. The research, conducted in collaboration with Stanford, analyzed 76 patients from Beth Israel's emergency room and is one of the first peer-reviewed studies to report an AI model matching or exceeding physician performance on actual clinical diagnostic tasks using real patient data.
o1 Model Excels at Emergency Triage Decisions
The study compared diagnoses from two internal medicine attending physicians against those generated by OpenAI's o1 and GPT-4o models. Two additional attending physicians assessed all diagnoses without knowing their source. The o1 model offered "the exact or very close diagnosis" in 67% of triage cases, performing nominally better than or on par with the human physicians and the GPT-4o model at each diagnostic touchpoint.
The performance gap was most pronounced at initial ER triage, the first diagnostic touchpoint, where physicians have the least information and face the greatest urgency. Diagnostic errors at this stage can have serious consequences for patients.
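A back-of-the-envelope check helps put "nominally better than or on par" in context. The sketch below, in Python, computes 95% Wilson score intervals for the three reported accuracy figures. It assumes each percentage was scored over all 76 cases and derives the correct-case counts by rounding; neither the counts nor this calculation comes from the study itself.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 76 patients is from the article; using it as the denominator
# for every rater is an assumption.
N_CASES = 76

# Reported accuracy figures. The labels physician_a / physician_b are
# hypothetical names for the two attending physicians.
reported = {"o1": 0.67, "physician_a": 0.55, "physician_b": 0.50}

intervals = {}
for name, rate in reported.items():
    # Counts are reconstructed by rounding, not taken from the study.
    correct = round(rate * N_CASES)
    intervals[name] = wilson_interval(correct, N_CASES)
    low, high = intervals[name]
    print(f"{name}: {correct}/{N_CASES} correct, 95% CI {low:.2f} to {high:.2f}")
```

Under these assumptions the intervals overlap (roughly 0.56 to 0.77 for o1 versus 0.39 to 0.61 for the lower-scoring physician), which fits the cautious "nominally better than or on par" wording. Because every rater saw the same patients, a formal comparison would need a paired test rather than this side-by-side view.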
Step-by-Step Reasoning May Suit Medical Diagnosis
The o1-preview model's step-by-step reasoning appears particularly well suited to medical diagnosis, where systematic analysis of symptoms, patient history, and clinical presentation is essential. This approach differs from earlier language models, which generated responses without explicit reasoning traces.
Key study details:
- 76 real emergency room patients analyzed
- Two attending physicians provided human diagnoses
- Two additional physicians blind-reviewed all diagnoses
- o1 and GPT-4o models compared against human performance
- Results published in the peer-reviewed journal Science
Clinical Significance for Emergency Medicine
The study points to AI's potential to assist in the most challenging diagnostic scenarios, when physicians have limited information and must make rapid decisions. Emergency room triage is a high-stakes environment where an accurate initial diagnosis directly affects patient outcomes and resource allocation.
This research marks a significant milestone in healthcare AI, moving beyond performance on medical licensing exams to demonstrating practical utility with actual patient data in real clinical settings.
Key Takeaways
- OpenAI's o1 model achieved 67% diagnostic accuracy on 76 real ER cases, compared to 50-55% for attending physicians
- The AI's advantage was most pronounced at initial triage when information is limited and decisions are most urgent
- The study is one of the first peer-reviewed reports of an AI model matching or exceeding physicians on real patient data rather than simulated scenarios
- o1's step-by-step reasoning approach appears well suited to systematic medical diagnosis
- Research was conducted by Harvard and Stanford teams and published in the journal Science