The study evaluated the performance of nine large language models (LLMs) on 197 multiple-choice questions aligned with American Academy of Sleep Medicine standards. Each question was posed three times, and accuracy was scored under a strict consistency criterion requiring all three responses to match the correct answer. Model performance was significantly heterogeneous (χ² = 101.95, df = 8, p < 0.001), with accuracy ranging from 68.5% to 95.9%. Premium versions performed best: Gemini 2.5 Pro (95.9%, 95% CI: 93.2–98.7%), Claude Opus 4 (93.9%), and ChatGPT GPT-4o (93.4%). Paid models outperformed free models by 5.1 to 8.6 percentage points (all p < 0.05). Consistency was highest for secondary sleep disorders (92.0%) and lowest for diagnostic methods (85.9%). Eight of the nine models exceeded the 80% benchmark under all three scoring criteria.
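As a rough illustration of how a strict 3/3 consistency accuracy, a chi-square test of heterogeneity across models, and a 95% confidence interval for a proportion could be computed, the sketch below uses synthetic data. The simulated response matrices, model names, and the normal-approximation interval are assumptions for illustration only, not the study's actual analysis code.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical data: for each model, a 197 x 3 boolean array marking whether
# each of the three repeated answers matched the keyed (correct) answer.
rng = np.random.default_rng(0)
models = {f"model_{i}": rng.random((197, 3)) < 0.9 for i in range(9)}

# Strict 3/3 criterion: a question counts as correct only if all three
# repetitions are correct.
strict_correct = {name: int(resp.all(axis=1).sum()) for name, resp in models.items()}
accuracy = {name: k / 197 for name, k in strict_correct.items()}

# Chi-square test of heterogeneity across the nine models
# (correct vs. incorrect question counts per model).
table = np.array([[k, 197 - k] for k in strict_correct.values()])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.4g}")

# Normal-approximation 95% CI for one model's accuracy (the study does not
# state which interval method was used).
p_hat = accuracy["model_0"]
se = (p_hat * (1 - p_hat) / 197) ** 0.5
print(f"accuracy = {p_hat:.3f}, 95% CI = ({p_hat - 1.96 * se:.3f}, {p_hat + 1.96 * se:.3f})")
```

With nine models the contingency table has 8 degrees of freedom, matching the reported test; the real analysis would substitute the models' actual per-question response records for the simulated arrays.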