A study that randomized 13,917 participants across five conversational AI agents finds that agentic symptom interviews produce differential diagnoses significantly more accurate than those from independent clinicians. The research, posted to arXiv on May 5, 2026 by Joseph Breda and over 35 co-authors, deployed SymptomAI through the Fitbit app to conduct end-to-end patient interviewing and diagnosis.
SymptomAI Diagnoses Outperform Clinicians, With an Odds Ratio of 2.47
In a blinded randomized comparison, SymptomAI differential diagnoses were significantly more accurate than those from independent clinicians given identical dialogue transcripts (OR = 2.47, p < 0.001). The comparison evaluated 517 participants from a subset of 1,228 who reported clinician-provided diagnoses, with over 250 hours of clinical annotation. An auxiliary analysis of 1,509 conversations from a general US population panel indicated that the results generalize beyond wearable device users.
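An odds ratio compares odds, not accuracy rates, so what OR = 2.47 means in accuracy terms depends on the clinicians' baseline accuracy. The sketch below shows that conversion. The baseline accuracies are hypothetical values chosen for illustration and are not figures from the study, and the function name is an assumption. It also treats the odds ratio as a simple unadjusted one, which may not match the statistical model the authors used.

```python
def accuracy_from_odds_ratio(baseline_accuracy: float, odds_ratio: float) -> float:
    """Convert a baseline accuracy and an odds ratio into the implied accuracy."""
    # Odds are p / (1 - p); the odds ratio scales the odds, not the probability.
    baseline_odds = baseline_accuracy / (1.0 - baseline_accuracy)
    new_odds = baseline_odds * odds_ratio
    # Convert the scaled odds back into a probability.
    return new_odds / (1.0 + new_odds)


# Hypothetical clinician accuracies (not reported in the article).
for baseline in (0.30, 0.50, 0.70):
    implied = accuracy_from_odds_ratio(baseline, 2.47)
    print(f"baseline {baseline:.2f} -> implied {implied:.3f} "
          f"({implied / baseline:.2f}x the baseline accuracy)")
```

Under these assumptions the ratio of accuracies is always smaller than 2.47 and shrinks as the baseline rises, which is why an odds ratio should not be read as "times more accurate."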
Dedicated Symptom Interviews Substantially Outperform User-Guided Conversations
The researchers found that agentic strategies, which conduct a dedicated symptom interview before providing a diagnosis, perform substantially better than baseline user-guided conversations (p < 0.001). This finding challenges the default behavior of most consumer large language models, which follow user-led symptom discussions rather than actively guiding comprehensive information gathering. According to the study, AI-directed interviews that elicit additional symptom information produce superior diagnostic accuracy.
Wearable Data Analysis Identifies Strong Physiological Associations
Researchers analyzed over 500,000 days of wearable metrics across nearly 400 unique conditions, using SymptomAI diagnoses as labels for all 13,917 participants. The analysis identified strong associations between acute infections and physiological shifts, with odds ratios exceeding 7 for influenza. This integration of conversational AI with continuous physiological monitoring demonstrates potential for enhanced symptom assessment.
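For readers unfamiliar with how an association of this kind is quantified, here is a minimal sketch of an odds ratio computed from a 2x2 table, with a Woolf (log-based) 95% confidence interval. The article does not describe the authors' actual method, so this is a textbook calculation and not their pipeline. The counts, the function name, and the choice of "elevated resting heart rate" as the physiological shift are all hypothetical.

```python
import math


def odds_ratio_with_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    """Odds ratio and Woolf confidence interval for a 2x2 table.

    a: condition present, shift present    b: condition present, shift absent
    c: condition absent,  shift present    d: condition absent,  shift absent
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: influenza-labeled participants vs. everyone else,
# split by whether resting heart rate was elevated (assumed metric).
or_value, ci_low, ci_high = odds_ratio_with_ci(a=40, b=60, c=100, d=1200)
print(f"OR = {or_value:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

A confidence interval that stays well above 1 indicates that the association is unlikely to be due to chance alone, though it does not on its own establish causation.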
Real-World Deployment Captures Diverse Communication Patterns
The 13,917-participant corpus captures diverse communication patterns and a realistic distribution of illnesses from a real-world population. The deployment through the Fitbit app provided access to both conversational data and longitudinal wearable metrics, enabling comprehensive evaluation of symptom assessment accuracy and physiological correlates of diagnosed conditions.
Key Takeaways
- SymptomAI differential diagnoses were significantly more accurate than those of independent clinicians reviewing the same dialogue (OR = 2.47, p < 0.001)
- Agentic interview strategies where AI guides symptom gathering substantially outperform user-led conversations (p < 0.001)
- The study enrolled 13,917 participants; the clinician comparison covered 517 of them and drew on over 250 hours of clinical annotation
- Wearable integration enabled analysis of 500,000+ days of physiological data across nearly 400 conditions
- An auxiliary analysis of 1,509 conversations from a general US population panel indicated that the results generalize beyond wearable users