Evaluating the quality of large language model-generated preoperative patient education material: a comparative study across models and surgery types

Source: Frontiers in Medicine

Original: https://www.frontiersin.org/articles/10.3389/fmed.2025.1701344...

Published: 2025-12-11T00:00:00Z

The study evaluated the quality of preoperative patient education materials (PEMs) generated by six large language models (LLMs) across different types of surgery. Three groups of experts rated the materials for accuracy and completeness on a 5-point scale, while two researchers assessed comprehensibility and usability using the PEMAT-P and SAM instruments. Statistical analysis showed that all models achieved high accuracy, comprehensibility, and actionability, with no significant differences between them. Grok-4 and Claude-Opus-4 stood out overall, surpassing GPT-4o. Claude-Opus-4 received the best suitability rating, while Grok-4 scored worst in that category. Readability was highest for Grok-4 and Gemini-2.5-Pro, and lowest for Claude-Opus-4. In sentiment analysis, Gemini-2.5-Pro was the only model that consistently generated content with a positive emotional tone. The study emphasizes that no model is flawless, so medical personnel should review and supplement the materials before use[1].
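The summary does not state which readability formula the authors used, but patient education materials are commonly scored with classic readability indices. As an illustration only, the following sketch computes the Flesch Reading Ease score (higher means easier to read) using a rough syllable-counting heuristic; the sample text and the helper names are assumptions, not taken from the study.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Hypothetical snippet of preoperative instructions, for illustration.
sample = ("You will have surgery soon. Do not eat after midnight. "
          "Ask your nurse if you have questions.")
print(round(flesch_reading_ease(sample), 1))
```

Short sentences and short words push the score up, which is why patient-facing guidelines typically target scores in the "plain language" range; a simple loop over each model's generated PEMs would allow the per-model readability comparison the study reports.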