Despite the increasing use of artificial intelligence (AI) in health care, a new study led by Mass General Brigham researchers from the MESH Incubator shows that generative AI models continue to fall short in their clinical reasoning capabilities.
By asking 21 different large language models (LLMs) to play doctor in a series of clinical scenarios, the researchers showed that LLMs often fail at navigating diagnostic workups and at generating a testable list of potential, or “differential,” diagnoses. Though all tested LLMs arrived at a correct final diagnosis more than 90% of the time when provided with all pertinent information in a patient case, they consistently performed poorly at the earlier, reasoning-driven steps of the diagnostic process, according to the results published in JAMA Network Open.
“Despite continued improvements, off-the-shelf large language models are not ready for unsupervised clinical-grade deployment,” said corresponding author Marc Succi, MD, executive director of the MESH Incubator at Mass General Brigham. “Differential diagnoses are central to clinical reasoning and underlie the ‘art of medicine’ that AI cannot currently replicate. The promise of AI in clinical medicine continues to lie in its potential to augment, not replace, physician reasoning, provided all the relevant data is available, which is not always the case.”

This new research is a follow-up to previous work led by Succi’s MESH group, in which researchers evaluated ChatGPT 3.5’s ability to accurately diagnose a series of clinical vignettes.
In the new study, the researchers developed a novel and more holistic measure of LLM performance that looks beyond accuracy. Called PrIME-LLM, it evaluates a model’s competency across different stages of clinical reasoning: generating potential diagnoses, ordering appropriate tests, arriving at a final diagnosis, and managing treatment. When a model performs well in one area but poorly in another, that imbalance is reflected in its PrIME-LLM score, as opposed to averaging competency across tasks, which can mask areas of weakness, according to the researchers.
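The article does not give the PrIME-LLM formula, but the idea of an imbalance-sensitive composite score can be illustrated with a toy sketch. Here a geometric mean (a hypothetical stand-in, not the study’s actual metric) is compared with a plain average: the average lets a strong final-diagnosis stage mask a weak differential stage, while the geometric mean is pulled down by it.

```python
from math import prod

def arithmetic_mean(scores):
    """Plain average: a strong stage can mask a weak one."""
    return sum(scores) / len(scores)

def imbalance_sensitive_score(scores):
    """Geometric mean: any weak stage drags the composite down.
    Purely illustrative; not the actual PrIME-LLM formula."""
    return prod(scores) ** (1 / len(scores))

# Hypothetical per-stage scores in [0, 1] for the four stages:
# differential, workup, final diagnosis, management.
balanced = [0.80, 0.80, 0.80, 0.80]
uneven = [0.40, 0.90, 0.95, 0.95]  # weak differential, strong finish

print(arithmetic_mean(uneven))            # 0.80, same as the balanced model
print(imbalance_sensitive_score(uneven))  # ~0.755, the weakness shows
```

Both profiles average 0.80, but only the imbalance-sensitive score distinguishes the model that stumbles at the differential step.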
The study compared 21 general-purpose LLMs, including the latest models of ChatGPT, DeepSeek, Claude, Gemini, and Grok at the time of submission. The researchers tested the models’ ability to work through 29 published clinical cases. To simulate the way that clinical cases unfold, the researchers gradually fed the models information, beginning with basics like a patient’s age, gender, and symptoms before adding physical examination findings and laboratory results. The LLMs’ performance at each stage was assessed by medical student evaluators, and these evaluations were used to calculate the models’ overall PrIME-LLM scores.
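The staged-disclosure protocol described above can be sketched in a few lines. Everything here is illustrative: the case stages, the question wording, and the `ask_model` stub (which stands in for a real LLM API call) are all assumptions, not the study’s actual materials.

```python
# Hypothetical sketch of staged disclosure: each round adds a new layer
# of case information before querying the model again.

CASE_STAGES = [
    ("history", "62-year-old man with acute chest pain and shortness of breath."),
    ("exam", "Hypotensive, with elevated jugular venous pressure."),
    ("labs", "Troponin mildly elevated; ECG shows low voltage."),
]

def ask_model(prompt):
    """Stub standing in for an LLM API call; returns a placeholder."""
    return f"[model response to {len(prompt)} chars of context]"

def run_staged_case(stages):
    """Query the model after each new piece of information is revealed."""
    context = []
    responses = {}
    for name, info in stages:
        context.append(info)
        prompt = "\n".join(context) + "\nWhat is your differential diagnosis now?"
        responses[name] = ask_model(prompt)
    return responses

for stage, reply in run_staged_case(CASE_STAGES).items():
    print(stage, reply)
```

In the study itself, the responses at each stage were graded by medical student evaluators rather than scored automatically.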
In line with their previous study, the researchers found that the LLMs were good at producing accurate final diagnoses. However, all of the models failed to produce an appropriate differential diagnosis more than 80% of the time. In the real world, a differential diagnosis is critical, but in this study, the models were given more information so that they could proceed to the next stage of the clinical workup even if they failed at the differential diagnosis step.
“By evaluating LLMs in a stepwise fashion, we move past treating them like test-takers and put them in the position of a doctor,” said Arya Rao, lead author, MESH researcher, and MD-PhD student at Harvard Medical School. “These models are great at naming a final diagnosis once the data is complete, but they struggle at the open-ended start of a case, when there isn’t much information.”
Most of the LLMs showed improved accuracy when provided with laboratory results and imaging in addition to text. More recently released models generally outperformed older models, showing that LLMs are improving incrementally. The models’ PrIME-LLM scores ranged from 64% for Gemini 1.5 Flash to 78% for Grok 4 and GPT-5.
According to Succi, PrIME-LLM represents a standardized way to evaluate AI’s clinical competency that could be used by AI developers and hospital leaders to benchmark new technologies as they are released.
“We want to help separate the hype from the reality of these tools as they apply to health care,” he said. “Our results reinforce that large language models in health care continue to require a ‘human in the loop’ and very close oversight.”