Objective: Although large language models are increasingly used in clinical and research settings, the validity of the information they provide remains uncertain. This study aimed to evaluate the accuracy, consistency, and reliability of three large language models—ChatGPT 4.0, DeepSeek R1, and Gemini 2.0—in answering cervical cancer-related questions based on the ESGO/ESTRO/ESP guidelines.
Design: Prospective, comparative in silico benchmarking study.
Setting: Fondazione Policlinico Universitario A. Gemelli, Rome, Italy. Population or Sample: Fifty questions derived from the ESGO/ESTRO/ESP (European Society of Gynaecological Oncology/European Society for Radiotherapy and Oncology/European Society of Pathology) Guidelines for Cervical Cancer.
Methods: Each question was submitted simultaneously to ChatGPT 4.0, DeepSeek R1, and Gemini 2.0, and re-entered twice to assess response repeatability. Answers were rated for accuracy on a Global Quality Score (GQS) ranging from 1 (poor) to 5 (completely accurate). Consistency (intra-model response stability) and reliability (alignment with the guidelines) were assessed as binary outcomes. Main Outcome Measures: Median GQS, percentage of GQS 5 responses, consistency between repeated answers, and reliability.
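To make the outcome definitions concrete, the following is a minimal sketch (not the authors' code) of how these measures could be computed. The data, the rule that a question is "consistent" when all three repeated entries receive the same rating, and the proxy that treats GQS ≥ 4 as "reliable" are all illustrative assumptions, not definitions taken from the study.

```python
from statistics import median

# Hypothetical GQS ratings (1 = poor ... 5 = completely accurate) for a toy
# set of questions; each tuple holds the rating of the initial answer plus
# the two repeated entries. Values are invented for illustration.
gqs = {
    "ChatGPT 4.0": [(5, 5, 5), (4, 4, 4), (3, 4, 4)],
    "DeepSeek R1": [(4, 3, 3), (5, 5, 4), (3, 3, 3)],
    "Gemini 2.0":  [(4, 4, 4), (3, 3, 5), (5, 5, 5)],
}

for model, triplets in gqs.items():
    first_scores = [t[0] for t in triplets]
    med = median(first_scores)
    pct_gqs5 = 100 * sum(s == 5 for s in first_scores) / len(first_scores)
    # Consistency: one possible binary operationalisation — all repeated
    # entries of a question received the same rating.
    consistent = [len(set(t)) == 1 for t in triplets]
    # Reliability: binary guideline-alignment flag; approximated here as a
    # GQS of 4 or 5 on the first answer (an assumption, not the paper's rule).
    reliable = [s >= 4 for s in first_scores]
    print(f"{model}: median GQS {med}, {pct_gqs5:.0f}% GQS 5, "
          f"{sum(consistent)}/{len(consistent)} consistent, "
          f"{sum(reliable)}/{len(reliable)} reliable")
```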
Results: ChatGPT 4.0 achieved the highest performance, with 42% of responses rated GQS 5, followed by Gemini 2.0 (30%) and DeepSeek R1 (28%). Median GQS was lower for DeepSeek R1 and Gemini 2.0 (3.50 each) than for ChatGPT 4.0 (4.00). Response consistency differed significantly: both ChatGPT 4.0 and DeepSeek R1 differed from Gemini 2.0 (p = 0.034 and p = 0.044, respectively). Reliability did not differ significantly across models (p = 0.602).
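The abstract does not name the statistical test behind these pairwise p-values; for binary outcomes such as consistency, Fisher's exact test on 2x2 contingency tables is one plausible approach. The sketch below works under that assumption, with invented counts that do not correspond to the study's data.

```python
from scipy.stats import fisher_exact

# Hypothetical counts of (consistent, inconsistent) responses out of 50
# questions per model — invented for illustration, not the study's data.
counts = {
    "ChatGPT 4.0": (40, 10),
    "DeepSeek R1": (39, 11),
    "Gemini 2.0":  (30, 20),
}

for a, b in [("ChatGPT 4.0", "Gemini 2.0"), ("DeepSeek R1", "Gemini 2.0")]:
    table = [list(counts[a]), list(counts[b])]  # 2x2 contingency table
    _, p = fisher_exact(table)
    print(f"{a} vs {b}: p = {p:.3f}")
```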
Conclusions: All models demonstrated suboptimal accuracy in aligning with clinical guidelines. ChatGPT 4.0 was the most accurate and consistent, whereas DeepSeek R1 underperformed. Despite similar reliability across models, expert oversight remains essential to ensure safe clinical application and prevent misinformation.