Assessment of accuracy of an early artificial intelligence large language model at summarizing medical literature: ChatGPT 3.5 vs. ChatGPT 4.0

Joseph Weinberg; James Goldhardt; Stacie Patterson; John Kepros

doi:10.21037/jmai-24-48

Background: Within the field of medicine, large language models (LLMs) offer the potential to provide clinical decision support by integrating real-time clinical information with leading practice guidelines and evidence-based medicine. A prerequisite to this future is confirmation that these models accurately integrate the current best practices in each field. This study aimed to determine whether ChatGPT can reliably summarize the correct clinical conclusions (CCCs) of the existing medical literature.

Methods: This study evaluated ChatGPT’s ability to correctly summarize existing medical literature from the information it collected during its training. The 25 most cited New England Journal of Medicine publications retrieved with the keyword “trauma” were summarized by both ChatGPT 3.5 and ChatGPT 4.0. Two specialist physician evaluators assessed these summaries for accuracy in conveying each paper's rationale, methodology, conclusions, and primary clinical conclusion. Errors and failures to display comprehension were assessed quantitatively.

Results: ChatGPT 4.0 claimed no specific knowledge of 4 of the 25 papers, which were therefore excluded from analysis. Of the remaining 21 papers, 2 were deemed not clinical in nature and 1 prompt was performed in error, leaving 18 papers for analysis (N=18). Both reviewers found that ChatGPT 3.5 performed worse than ChatGPT 4.0, both in producing errors and in displaying comprehension in the queried summaries. Interobserver agreement was greatest on whether the summary accurately communicated the CCC of the queried article; on average, summaries failed to do so 44% of the time for ChatGPT 3.5 and 8% of the time for ChatGPT 4.0.

Conclusions: ChatGPT shows an impressive knowledge base of medical literature and has markedly improved in its ability to synthesize accurate information from its training. Specifically, ChatGPT 4.0 more frequently displayed comprehension of the primary clinical conclusion of the queried articles.

© Journal of Medical Artificial Intelligence. All rights reserved.
