Researchers tested the reliability of large language models (LLMs) as medical assistants for the general public in a randomized, pre-registered trial published in Nature Medicine on February 9, 2026.[1] The study enrolled 1,298 participants from the general population of Great Britain, who worked through ten medical scenarios in which they had to identify the condition and recommend a course of action.[1] Participants interacted with ChatGPT, presenting symptoms as imaginary patients, and reached the correct diagnosis only about 37% of the time.[1] By contrast, the LLM alone reached the correct diagnosis about 95% of the time when doctors provided it with the list of symptoms directly.[1] Other models performed similarly: Meta's Llama 3 scored 99% and Cohere's Command R+ 91%.[1] The main problem lay in how people communicated with the AI: they gave incomplete information, chose the wrong option from the model's suggestions, or asked closed questions.[1] The study thus showed that people assisted by an LLM performed significantly worse than the LLM alone.[1]