Objective: Although large language models are increasingly used in clinical and research settings, the validity of the information they provide remains uncertain. This study aimed to evaluate the accuracy, consistency, and reliability of three large language models—ChatGPT 4.0, DeepSeek R1, and Gemini 2.0—in answering cervical cancer-related questions based on the ESGO/ESTRO/ESP guidelines.
Design: Prospective, comparative in silico benchmarking study.
Setting: Fondazione Policlinico Universitario A. Gemelli, Rome, Italy. Population or Sample: Fifty questions derived from the ESGO/ESTRO/ESP (European Society of Gynaecological Oncology/European Society for Radiotherapy and Oncology/European Society of Pathology) Guidelines for Cervical Cancer.
Methods: Each question was submitted simultaneously to ChatGPT 4.0, DeepSeek R1, and Gemini 2.0, and re-entered twice to assess response repeatability. Answers were rated for accuracy on a Global Quality Score (GQS) ranging from 1 (poor) to 5 (completely accurate). Consistency (intra-model response stability) and reliability (alignment with the guidelines) were assessed as binary outcomes. Main Outcome Measures: Median GQS, percentage of GQS 5 responses, consistency between repeated answers, and reliability.
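To make the outcome definitions concrete, the following is a minimal sketch (not the authors' code) of how these measures could be computed. The data, the rule that a question is "consistent" when all three repeated entries receive the same rating, and the proxy that treats GQS ≥ 4 as "reliable" are all illustrative assumptions, not definitions taken from the study.

```python
from statistics import median

# Hypothetical GQS ratings (1 = poor ... 5 = completely accurate) for a toy
# set of questions; each tuple holds the rating of the initial answer plus
# the two repeated entries. Values are invented for illustration.
gqs = {
    "ChatGPT 4.0": [(5, 5, 5), (4, 4, 4), (3, 4, 4)],
    "DeepSeek R1": [(4, 3, 3), (5, 5, 4), (3, 3, 3)],
    "Gemini 2.0":  [(4, 4, 4), (3, 3, 5), (5, 5, 5)],
}

for model, triplets in gqs.items():
    first_scores = [t[0] for t in triplets]
    med = median(first_scores)
    pct_gqs5 = 100 * sum(s == 5 for s in first_scores) / len(first_scores)
    # Consistency: one possible binary operationalisation — all repeated
    # entries of a question received the same rating.
    consistent = [len(set(t)) == 1 for t in triplets]
    # Reliability: binary guideline-alignment flag; approximated here as a
    # GQS of 4 or 5 on the first answer (an assumption, not the paper's rule).
    reliable = [s >= 4 for s in first_scores]
    print(f"{model}: median GQS {med}, {pct_gqs5:.0f}% GQS 5, "
          f"{sum(consistent)}/{len(consistent)} consistent, "
          f"{sum(reliable)}/{len(reliable)} reliable")
```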
Results: ChatGPT 4.0 achieved the highest performance, with 42% of responses rated GQS 5, followed by Gemini 2.0 (30%) and DeepSeek R1 (28%). Median GQS was lower for DeepSeek R1 and Gemini 2.0 (3.50 each) than for ChatGPT 4.0 (4.00). Response consistency differed significantly: both ChatGPT 4.0 and DeepSeek R1 differed from Gemini 2.0 (p = 0.034 and p = 0.044, respectively). Reliability did not differ significantly across models (p = 0.602).
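The abstract does not name the statistical test behind these pairwise p-values; for binary outcomes such as consistency, Fisher's exact test on 2x2 contingency tables is one plausible approach. The sketch below works under that assumption, with invented counts that do not correspond to the study's data.

```python
from scipy.stats import fisher_exact

# Hypothetical counts of (consistent, inconsistent) responses out of 50
# questions per model — invented for illustration, not the study's data.
counts = {
    "ChatGPT 4.0": (40, 10),
    "DeepSeek R1": (39, 11),
    "Gemini 2.0":  (30, 20),
}

for a, b in [("ChatGPT 4.0", "Gemini 2.0"), ("DeepSeek R1", "Gemini 2.0")]:
    table = [list(counts[a]), list(counts[b])]  # 2x2 contingency table
    _, p = fisher_exact(table)
    print(f"{a} vs {b}: p = {p:.3f}")
```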
Conclusions: All models demonstrated suboptimal accuracy in aligning with clinical guidelines. ChatGPT 4.0 was the most accurate and consistent, whereas DeepSeek R1 underperformed. Despite similar reliability across models, expert oversight remains essential to ensure safe clinical application and prevent misinformation.