Researchers from ETH Zurich and the University of Zurich have developed RadAgent, a tool-using AI agent that generates chest CT radiology reports through a stepwise, interpretable process. The system achieves a 36.4% relative improvement in clinical accuracy over a baseline vision-language model while providing fully inspectable reasoning traces, addressing a critical gap in medical AI where clinicians are typically relegated to passive observers of black-box outputs.
RadAgent Structures CT Interpretation as Explicit Tool-Augmented Reasoning
Unlike existing vision-language models that generate radiology reports as end-to-end outputs, RadAgent breaks down chest CT interpretation into explicit, iterative steps using specialized tools. Each generated report comes with a complete trace of intermediate decisions and tool interactions, allowing clinicians to inspect, validate, and refine how reported findings are derived. This approach transforms medical AI from an opaque prediction system into a transparent reasoning process that physicians can audit and trust.
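The decomposition described above can be sketched as a simple agent loop that calls specialized tools and logs every intermediate step. The tool names, findings, and dispatch logic below are illustrative assumptions for exposition, not RadAgent's actual implementation:

```python
# Minimal sketch of a tool-augmented CT report loop.
# Tool names and findings are hypothetical, not RadAgent's real tools.

def measure_nodule(ct_volume):
    """Stub tool: return a hypothetical largest-nodule diameter in mm."""
    return {"finding": "nodule", "diameter_mm": 7.2}

def assess_effusion(ct_volume):
    """Stub tool: return a hypothetical pleural-effusion flag."""
    return {"finding": "pleural_effusion", "present": False}

TOOLS = [measure_nodule, assess_effusion]

def generate_report(ct_volume):
    """Run each tool in turn, recording an inspectable trace of calls."""
    trace = []     # every tool call and its result, open to clinician review
    findings = []
    for tool in TOOLS:
        result = tool(ct_volume)
        trace.append({"tool": tool.__name__, "result": result})
        findings.append(result)
    # Compose the report from the logged findings, not from a single
    # end-to-end forward pass, so each sentence maps back to a trace step.
    report = "; ".join(
        f"{f['finding']}: {f.get('diameter_mm', f.get('present'))}"
        for f in findings
    )
    return report, trace

report, trace = generate_report(ct_volume=None)
print(report)
for step in trace:
    print(step["tool"], "->", step["result"])
```

The key design point is that the trace is a first-class output: a clinician can audit each recorded tool call rather than trusting a single opaque prediction.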
Clinical Accuracy Improves by 6.0 Points in Macro-F1 Over 3D VLM Baseline
When evaluated against CT-Chat, a state-of-the-art 3D vision-language model baseline, RadAgent demonstrated substantial performance gains across multiple metrics. The system achieved a 6.0-point improvement in macro-F1 (36.4% relative improvement) and a 5.4-point improvement in micro-F1 (19.6% relative improvement). These gains indicate that the agent-based approach with explicit tool use outperforms end-to-end deep learning models in identifying and classifying radiological findings.
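The point gains and relative gains quoted above are consistent with each other; dividing the absolute point gain by the relative gain backs out the implied baseline score (a quick arithmetic check, not a figure reported in the paper):

```python
# Relative improvement = point gain / baseline score, so the baseline
# implied by the reported numbers is point gain / relative gain.
def implied_baseline(point_gain, relative_gain):
    """Back out the baseline score from an absolute and a relative gain."""
    return point_gain / relative_gain

macro_baseline = implied_baseline(6.0, 0.364)  # about 16.5 macro-F1
micro_baseline = implied_baseline(5.4, 0.196)  # about 27.6 micro-F1
print(round(macro_baseline, 1), round(micro_baseline, 1))
```

The lower implied macro-F1 baseline fits the usual pattern: macro-F1 weights rare finding classes equally with common ones, so it tends to sit below micro-F1 when rare findings are missed.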
Robustness Under Adversarial Conditions Improves by 41.9%
RadAgent showed particularly strong performance when facing challenging or unusual cases, improving robustness under adversarial conditions by 24.7 points—a 41.9% relative improvement over the baseline. This resilience suggests the system maintains accuracy better than traditional VLMs when encountering edge cases or atypical presentations, a critical requirement for clinical deployment where rare conditions and unusual manifestations are common.
System Achieves 37.0% Faithfulness Score—A Capability Absent in VLM Counterpart
Perhaps most significantly, RadAgent achieved a 37.0% faithfulness score, a metric measuring whether the reasoning trace accurately reflects how the model arrived at its conclusions. This capability was entirely absent in the 3D VLM baseline, indicating that the agent's step-by-step reasoning traces are meaningful explanations rather than post-hoc rationalizations, a crucial distinction for clinical trust and regulatory approval.
Key Takeaways
- RadAgent improves clinical accuracy by 36.4% relative to baseline vision-language models in chest CT report generation, achieving 6.0-point macro-F1 gains
- The system provides fully inspectable reasoning traces showing how each finding was derived, enabling clinicians to validate AI conclusions
- Robustness under adversarial conditions improves by 41.9% relative to baseline, indicating better performance on challenging or unusual cases
- RadAgent achieves 37.0% faithfulness, meaning its reasoning traces accurately reflect decision-making processes—a capability entirely absent in standard VLM approaches
- The tool-augmented agent approach demonstrates that interpretability and accuracy can be achieved simultaneously in medical AI systems