Can ChatGPT pass the MRCP (UK) written examinations? Analysis of performance and errors using a clinical decision-reasoning framework

Objective
Large language models (LLMs) such as ChatGPT are being developed for use in research, medical education and clinical decision systems. However, as their usage increases, LLMs face ongoing regulatory concerns. This study aims to analyse ChatGPT’s performance on a postgraduate examination to identify areas of strength and weakness, which may provide further insight into their role in healthcare.

Design
We evaluated the performance of ChatGPT 4 (24 May 2023 version) on official MRCP (Membership of the Royal College of Physicians) parts 1 and 2 written examination practice questions. Statistical analysis was performed using Python. Spearman rank correlation assessed the relationship between the probability of correctly answering a question and two variables: question difficulty and question length. Incorrectly answered questions were analysed further using a clinical reasoning framework to assess the errors made.
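As an illustration of the analysis described above, a Spearman rank correlation of this kind can be computed in Python with scipy. The sketch below uses hypothetical data values and variable names (the authors' actual analysis script is not reproduced here); it only shows the shape of the calculation.

```python
# Minimal sketch of the correlation analysis described above. The data values
# and variable names are hypothetical; only the shape of the calculation is shown.
from scipy.stats import spearmanr

# One entry per practice question: whether ChatGPT answered it correctly (1/0),
# a difficulty metric for the question, and the question length in words.
correct    = [1, 1, 0, 1, 0, 1, 1, 0]
difficulty = [0.62, 0.48, 0.35, 0.71, 0.30, 0.55, 0.66, 0.41]
length     = [120, 95, 210, 80, 260, 150, 110, 230]

# Spearman rank correlation between answering correctly and each variable.
for name, values in [("difficulty", difficulty), ("question length", length)]:
    rho, p = spearmanr(correct, values)
    print(f"correct vs {name}: r={rho:.2f}, p={p:.3f}")
```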

Setting
Online, using the ChatGPT web interface.

Primary and secondary outcome measures
The primary outcome was the score (percentage of questions answered correctly) in the MRCP postgraduate written examinations. The secondary outcome was the qualitative categorisation of errors using a clinical decision-making framework.

Results
ChatGPT achieved accuracy rates of 86.3% (part 1) and 70.3% (part 2). Weak but significant correlations were found between ChatGPT’s accuracy and both just-passing rates in part 2 (r=0.34, p=0.0001) and question length in part 1 (r=–0.19, p=0.008). Eight types of error were identified, with the most frequent being factual errors, context errors and omission errors.

Conclusion
ChatGPT's performance greatly exceeded the passing mark for both exams. Multiple-choice examinations provide a benchmark for LLM performance that is comparable to human demonstrations of knowledge, while also highlighting the errors LLMs make. Understanding the reasons behind ChatGPT's errors allows us to develop strategies to prevent them in medical devices that incorporate LLM technology.

March 2024

Revolutionizing IBD Management: How Do ChatGPT & Google Bard Stand Up in Offering Comprehensive Management Solutions?

Artificial intelligence (AI) has notably transformed healthcare, especially the diagnosis and treatment of inflammatory bowel disease (IBD) and other digestive disorders. AI tools such as ChatGPT and Google Bard can interpret endoscopic imagery, analyze diverse samples, simplify administrative duties, and assist in assessing medical images and automating devices. By individualizing treatments and forecasting adverse reactions, these AI applications have markedly enhanced the management of digestive diseases.

January 2024

Abstract 16401: Optimizing ChatGPT to Detect VT Recurrence From Complex Medical Notes

Circulation, Volume 148, Issue Suppl_1, Page A16401-A16401, November 6, 2023.

Introduction: Large language models (LLMs), such as ChatGPT, have a remarkable ability to interpret natural language using text questions (prompts) applied to gigabytes of data on the World Wide Web. However, the performance of ChatGPT is less impressive when addressing nuanced questions from finite repositories of lengthy, unstructured clinical notes (Fig A).

Hypothesis: The performance of ChatGPT in identifying sustained ventricular tachycardia (VT) or fibrillation after ablation from free-text medical notes is improved by optimizing the question and adding in-context sample notes with correct responses ('prompt engineering').

Methods: We curated a dataset of N=125 patients with implantable defibrillators (32.0% female, LVEF 48.9±13.9%, 61.7±14.0 years), split into development (N=75) and testing (N=50) sets of 307 and 337 notes, with 256.8±95.1 and 289.8±103 words, respectively. Notes were deidentified. Gold-standard labels for recurrent VT (Yes, No, Unknown) were provided by experts. We applied GPT-3.5 to the test set (N=337 notes), using 1 of 3 prompts ("Does the patient have sustained VT or VF after ablation" or 2 others), systematically adding 1-5 "training" examples, and repeating experiments 10 times (51,561 inquiries).

Results: At baseline, GPT achieved an F1 score of 38.6%±19.4% (mean across 3 prompts; Fig B). Increasing the number of examples progressively improved mean accuracy and reduced variance. The optimal result was the illustrated prompt plus 5 in-context examples, with an F1 score of 84.6%±6.4% (p
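The 'prompt engineering' approach described here (a fixed question plus a handful of labelled example notes prepended to each query, scored with an F1 metric) can be sketched roughly as follows. The note texts, labels and the send_to_llm() helper are hypothetical placeholders, not the study's actual pipeline.

```python
# Illustrative sketch of few-shot prompting for VT-recurrence detection and
# F1 evaluation. All names and data below are placeholders, not the study's code.
from sklearn.metrics import f1_score

QUESTION = "Does the patient have sustained VT or VF after ablation?"

def build_prompt(note, examples):
    """Prepend k labelled example notes (in-context examples) to the query note."""
    parts = []
    for ex_note, ex_label in examples:
        parts.append(f"Note: {ex_note}\n{QUESTION}\nAnswer: {ex_label}")
    parts.append(f"Note: {note}\n{QUESTION}\nAnswer:")
    return "\n\n".join(parts)

def send_to_llm(prompt):
    """Placeholder for a call to GPT-3.5 (e.g. via the OpenAI API); returns
    one of the labels 'Yes', 'No' or 'Unknown'."""
    raise NotImplementedError

def evaluate(test_notes, gold_labels, examples):
    """Query the model for each test note and score predictions against
    the expert-provided gold-standard labels."""
    preds = [send_to_llm(build_prompt(n, examples)) for n in test_notes]
    return f1_score(gold_labels, preds, average="macro")
```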

November 2023