The study evaluated the diagnostic performance of four large language models on 200 tasks from the NEJM Image Challenge, each combining clinical text with images. OpenAI o4-mini-high achieved the highest overall accuracy at 94%. Human participants scored lower: three medical students averaged 38.5%, and an attending physician scored 70.5%. OpenAI o4-mini-high maintained high performance across easy, medium, and hard cases. Analysis of this model's 12 errors showed that 83.3% stemmed from flawed diagnostic reasoning rather than from processing of the input data. Simple prompt adjustments (such as a chain-of-thought instruction and few-shot examples in the prompt) corrected more than half of these initial errors. The authors conclude that the results demonstrate strong pattern-recognition ability and support the use of LLMs in educational and standardized-testing settings.
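The two prompt adjustments mentioned above can be sketched as plain prompt construction. This is a hypothetical illustration, not the study's actual prompts: the instruction wording, the example case, and the `build_prompt` helper are all assumptions introduced here.

```python
# Hypothetical sketch of the two prompt adjustments described above:
# a chain-of-thought instruction plus few-shot examples prepended to a case.
# The wording and the worked example are illustrative, not from the study.

COT_INSTRUCTION = (
    "Think step by step: first list the key clinical and image findings, "
    "then reason from them to a single most likely diagnosis."
)

# Few-shot examples: (case description, worked answer) pairs.
FEW_SHOT_EXAMPLES = [
    (
        "A 45-year-old with target-shaped skin lesions one week after an HSV outbreak.",
        "Findings: targetoid lesions; recent HSV infection. Diagnosis: erythema multiforme.",
    ),
]

def build_prompt(case_text: str) -> str:
    """Assemble a chain-of-thought, few-shot prompt for one clinical case."""
    parts = [COT_INSTRUCTION, ""]
    for question, answer in FEW_SHOT_EXAMPLES:
        parts += [f"Case: {question}", f"Answer: {answer}", ""]
    parts += [f"Case: {case_text}", "Answer:"]
    return "\n".join(parts)

prompt = build_prompt("A 60-year-old with painless jaundice and a palpable gallbladder.")
print(prompt)
```

The resulting string ends with an open `Answer:` slot, so the model is steered to reproduce the step-by-step findings-then-diagnosis pattern shown in the examples.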