Chatbot Outperformed Physicians in Clinical Reasoning in Head-To-Head Study

The artificial intelligence was also "just plain wrong" significantly more often

BOSTON – ChatGPT-4, an artificial intelligence program designed to understand and generate human-like text, outperformed internal medicine residents and attending physicians at two academic medical centers in processing medical data and demonstrating clinical reasoning. In a research letter published in JAMA Internal Medicine, physician-scientists at Beth Israel Deaconess Medical Center (BIDMC) compared the reasoning abilities of a large language model (LLM) directly to human performance using standards developed to evaluate physicians.

"It was clear from the beginning that LLMs can make diagnoses, but anyone who practices medicine knows that medicine is much more than that," said Adam Rodman MD, an internal medicine physician and researcher in the department of medicine at BIDMC. โ€œThere are several steps behind a diagnosis, so we wanted to evaluate whether LLMs are as good as doctors at doing that type of clinical reasoning. โ€œIt is a surprising finding that these things are able to show equivalent or better reasoning than people throughout the evolution of the clinical case.โ€

Rodman and colleagues used a previously validated tool developed to assess clinicians' clinical reasoning, the revised IDEA score (r-IDEA). The researchers recruited 21 attending physicians and 18 residents, each of whom worked through one of 20 selected clinical cases consisting of four sequential stages of diagnostic reasoning. The authors instructed the physicians to write down and justify their differential diagnoses at each stage. The GPT-4 chatbot was given a prompt with identical instructions and run on all 20 clinical cases. Responses were then scored on clinical reasoning (r-IDEA score) and several other reasoning measures.

"The first stage is triage data, when the patient tells you what's bothering them and you get vital signs," said lead author Stephanie Cabral, MD, a third-year internal medicine resident at BIDMC. โ€œThe second stage is the review of the system, when additional information is obtained from the patient. The third stage is the physical examination and the fourth is diagnostic tests and imaging.โ€

Rodman, Cabral, and their colleagues found that the chatbot had the highest r-IDEA scores, with an average score of 10 out of 10 for the LLM, 9 for attending physicians, and 8 for residents. Humans and the chatbot were closer to a tie on diagnostic accuracy (how high the correct diagnosis appeared on the list of diagnoses provided) and correct clinical reasoning. But the chatbot was also "just plain wrong": its answers contained instances of incorrect reasoning significantly more often than the residents' answers did, the researchers found. The finding underscores the notion that AI will likely be most useful as a tool to augment, not replace, the human reasoning process.

"More studies are needed to determine how LLMs can best be integrated into clinical practice, but even now, they could be useful as a checkpoint, helping us make sure we're not missing anything," Cabral said. โ€œMy ultimate hope is that AI will improve the doctor-patient interaction by reducing some of the inefficiencies we currently have and allowing us to focus more on the conversation we have with our patients.

"Early studies suggested that AI could make diagnoses, if it was given all the information," Rodman said. โ€œWhat our study shows is that AI demonstrates real reasoning, perhaps better than that of people through multiple steps of the process. โ€œWe have a unique opportunity to improve the quality and experience of patient care.โ€

Co-authors included Zahir Kanjee, MD, Philip Wilson, MD, and Byron Crowe, MD, of BIDMC; Daniel Restrepo, MD, of Massachusetts General Hospital; and Raja-Elie Abdulnour, MD, of Brigham and Women's Hospital.

This work was supported by Harvard Catalyst | The Harvard Clinical and Translational Sciences Center (National Center for Advancing Translational Sciences, National Institutes of Health) (award UM1TR004408) and financial contributions from Harvard University and its affiliated academic healthcare centers.

Potential conflicts of interest: Rodman reports receiving grants from the Gordon and Betty Moore Foundation. Crowe reports employment and equity in Solera Health. Kanjee reports receiving royalties for edited books and serving on a paid advisory board for non-AI medical education products from Wolters Kluwer, as well as fees for continuing medical education provided by Oakstone Publishing. Abdulnour reports employment by the Massachusetts Medical Society (MMS), a nonprofit organization that owns NEJM Healer. Abdulnour does not receive royalties from sales of NEJM Healer and holds no equity in NEJM Healer. The MMS did not provide funding for this study. Abdulnour reports receiving grants from the Gordon and Betty Moore Foundation through the National Academy of Medicine Scholars in Diagnostic Excellence program.
