Holistic evaluation of large language models for medical tasks with MedHELM

Source: Nature Medicine

Original: https://www.nature.com/articles/s41591-025-04151-2...

Published: 2026-01-20

MedHELM is an extensible framework for the holistic evaluation of large language models on medical tasks. It introduces a new taxonomy for classifying medical tasks and benchmarks models on many datasets across the taxonomy's categories, with a focus on real-world clinical tasks. This provides a basis for testing how applicable language models actually are in healthcare and for systematically comparing models across different medical scenarios. The article was published online in Nature Medicine on January 20, 2026 (DOI: 10.1038/s41591-025-04151-2).