Evaluation of large language models in rheumatology and clinical immunology: a systematic assessment based on Chinese national health professional qualification examination


Source: Frontiers in Medicine

Original: https://www.frontiersin.org/articles/10.3389/fmed.2025.1716122...

Published: 2026-01-15T00:00:00Z

The study evaluated 11 large language models (LLMs), including DeepSeek, GPT, Llama, Gemma, and Qwen, on their capabilities in rheumatology and clinical immunology. The evaluation was based on the Chinese National Health Professional Qualification Examination and covered four dimensions: basic medical knowledge, related medical knowledge, immunological knowledge, and professional experience. The results showed substantial differences among the tested models: DeepSeek-R1 and Qwen3 performed best, with accuracy exceeding 90 percent. Even so, performance on professional practice questions remained comparatively low, highlighting the limitations of these models in complex clinical applications. The study thus demonstrates that although LLMs achieve high accuracy on theoretical medical knowledge, their practical clinical use requires further development.