OpenAI’s GPT-4 correctly diagnosed 57% of complex challenge cases, compared with 36% for medical journal readers, and outperformed 99.98% of simulated human readers, according to a study published by the New England Journal of Medicine.
The evaluation, conducted by researchers in Denmark, used GPT-4 to diagnose 38 complex clinical case challenges, presented as text and published online between January 2017 and January 2023. GPT-4’s responses were compared with 248,614 answers from online medical journal readers.
Each complex clinical case included a medical history alongside a poll with six options for the most likely diagnosis. The prompt asked GPT-4 to identify the diagnosis by answering a multiple-choice question based on the full, unedited text of the clinical case report. Each case was presented to GPT-4 five times to evaluate reproducibility.
For comparison, researchers collected the votes cast by medical journal readers for each case and used them to simulate 10,000 sets of answers, resulting in a pseudopopulation of 10,000 human participants.
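The pseudopopulation idea can be illustrated with a minimal sketch. It assumes each simulated reader answers every case independently and is correct with probability equal to the share of real readers who chose the right option; the study's exact procedure may differ. The function names and the per-case vote shares below are hypothetical: the article reports only the 36% average, so that value is reused for all 38 cases purely for illustration.

```python
import random


def simulate_readers(correct_vote_share, n_readers=10_000, seed=0):
    """Return one total score per simulated reader.

    correct_vote_share: per-case fraction of real votes that were correct.
    Assumption: answers are independent across cases.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_readers):
        # A simulated reader gets a case right with probability p.
        score = sum(1 for p in correct_vote_share if rng.random() < p)
        scores.append(score)
    return scores


def share_outperformed(scores, model_score):
    """Fraction of simulated readers scoring strictly below model_score."""
    return sum(1 for s in scores if s < model_score) / len(scores)


# Hypothetical input: 38 cases, each with a 36% correct-vote share.
example_shares = [0.36] * 38
scores = simulate_readers(example_shares)

print("mean simulated score:", sum(scores) / len(scores))
print("share outperformed by a score of 21.8:", share_outperformed(scores, 21.8))
```

With real per-case vote shares in place of the placeholder list, the second figure would be the kind of "outperformed X% of simulated readers" statistic the study reports.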
The most common fields of diagnosis were infectious disease with 15 cases (39.5%), endocrinology with five cases (13.1%) and rheumatology with four cases (10.5%).
Patients in the clinical cases ranged from newborn to 89 years of age, and 37% were female.
The March 2023 edition of GPT-4 correctly diagnosed an average of 21.8 cases, or 57%, with good reproducibility, while medical journal readers correctly diagnosed 13.7 cases, or 36%, on average.
The March release of GPT-4 was trained on online material up to September 2021; therefore, researchers also evaluated separately the cases published before and after that training-data cutoff.
In that analysis, GPT-4 correctly diagnosed 52.7% of cases published up to September 2021 and 75% of cases published after September 2021.
“GPT-4 had a high reproducibility, and our temporal analysis suggests that the accuracy we observed is not due to these cases’ appearing in the model’s training data. However, performance did appear to change between different versions of GPT-4, with the newest version performing slightly worse. Although it demonstrated promising results in our study, GPT-4 missed almost every second diagnosis,” the researchers wrote.
“… our results, together with recent findings by other researchers, indicate that the current GPT-4 model may hold clinical promise today. However, proper clinical trials are needed to ensure that this technology is safe and effective for clinical use.”
WHY IT MATTERS
Researchers noted the study’s limitations, including unknowns around the medical journal readers’ medical skills, and that the researchers’ results may represent a best-case scenario favoring GPT-4.
Still, researchers concluded GPT-4 would perform better than 72% of human readers even with “maximally correlated correct answers” among medical journal readers.
The researchers highlighted the importance of including training data from developing countries in future models to ensure the global benefit of the technology, as well as the need for ethical considerations.
“As we move toward this future, the ethical implications surrounding the lack of transparency by commercial models such as GPT-4 also need to be addressed as well as regulatory issues on data protection and privacy,” the study’s authors wrote.
“Finally, clinical studies evaluating accuracy, safety and validity should precede future implementation. Once these issues have been addressed and AI improves, society is expected to increasingly rely on AI as a tool to support the decision-making process with human oversight, rather than as a replacement for physicians.”