The study evaluated how well large language models (ChatGPT 4o, Gemini 2.0, and Claude 3.5) recommend treatment for hepatocellular carcinoma (HCC) compared with actual physician decisions. The analysis included 13,614 HCC patients diagnosed in South Korea between 2008 and 2020. Agreement between model recommendations and physician decisions was low: 31.1% for ChatGPT 4o, 32.7% for Gemini 2.0, and 26.8% for Claude 3.5. In early-stage patients (BCLC stage A), concordance with model recommendations was associated with better survival, whereas in advanced-stage patients (BCLC stage C) it was associated with worse outcomes. The analysis also suggested that physicians prioritize a patient's liver function when making decisions, while the models weight tumor characteristics more heavily. The authors conclude that large language models can serve as a helpful adjunct in straightforward cases, but their recommendations should be interpreted with caution and weighed against the clinician's judgment, especially in complex situations.
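As a minimal sketch of the agreement metric reported above (the function name and the hypothetical treatment labels are illustrative, not taken from the study), the concordance rate is simply the fraction of cases in which a model's recommended treatment matches the physician's actual choice:

```python
def concordance_rate(model_recs, physician_choices):
    """Fraction of cases where the model's recommendation matches
    the physician's decision (e.g., the study reports 31.1% for ChatGPT 4o)."""
    assert len(model_recs) == len(physician_choices), "paired per-patient lists"
    matches = sum(m == p for m, p in zip(model_recs, physician_choices))
    return matches / len(model_recs)

# Hypothetical toy cohort: treatments coded as strings
model = ["resection", "TACE", "sorafenib", "ablation"]
physician = ["resection", "resection", "sorafenib", "TACE"]
print(f"{concordance_rate(model, physician):.1%}")  # prints "50.0%"
```

In the study this comparison was made per patient against the documented treatment decision; stratifying the same calculation by BCLC stage is what reveals the diverging survival associations described above.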