Can AI Chatbots Reason Like Doctors?

IEEE Spectrum
Greg Uyeno

One of the earliest stated goals for computing in medicine was to aid in clinical reasoning: the decision-making steps required to reach a diagnosis and form a treatment plan. Over the years, researchers have built many clinical decision support systems, which have typically been purpose-built, with painstakingly written rules about symptoms, test thresholds, and medication interactions. As artificial intelligence capabilities develop, clinical reasoning is a natural application.

Now, a large language model (LLM) from OpenAI has outperformed physicians on several clinical reasoning tasks using real emergency room records, according to a study published 30 April in Science.

The new findings arrive amid a wave of mixed evidence about medical information from chatbots: some studies show impressive diagnostic performance, while others document fabricated citations, flawed advice, and results that shift depending on how researchers score the systems. Despite that uncertainty, products aimed at medical professionals are already entering the market. This year, for example, OpenAI introduced ChatGPT for Clinicians and ChatGPT for Healthcare.

The performance of OpenAI’s o1-preview, a general-purpose model that has since been supplanted by newer models, was promising enough for the authors to recommend further testing of LLMs on real-life cases, with physicians seeking second opinions on diagnoses at specific checkpoints.

Mickael Tordjman, who studies AI in medical imaging at the Icahn School of Medicine in New York, agrees that the time is right for research focused on real-world applications. “We need more proof in prospective clinical trials,” he says, noting that newer LLMs, or models trained specifically for medical use, might perform even better.

While the authors of the Science paper expressed optimism about AI’s medical potential during a press briefing, they also stressed important limitations of LLMs and raised concerns about the ways their research could be misinterpreted. “I don’t think our findings mean that AI replaces doctors,” says coauthor Arjun Manrai, who studies AI at Harvard Medical School.

“I think this is really cool, don’t get me wrong,” says coauthor Adam Rodman, a medical educator at Beth Israel Deaconess Medical Center in Boston. “I get a little queasy about how some of these results might be used.”

How Reliable Are Chatbots on Medical Matters?

Other researchers investigating chatbots’ medical advice have recently found reason to doubt their trustworthiness. In one study, for example, nearly half of the responses that five popular chatbots gave to open-ended health questions were flawed. The chatbots fabricated information and citations, and presented their answers confidently regardless of their accuracy.

“These models are being used every day. There’s a certain risk there that’s not being quantified or mitigated,” says Arya Rao, who studies AI in medical practice in a different Harvard group than the Science authors.

Much of the research focuses on chatbots answering health questions from everyday users: the kinds of questions a person might ask before deciding to seek medical attention. Using an LLM as a clinical decision-support tool for doctors is a different task entirely. Physicians should have a much better sense of what information would help an LLM reach an accurate diagnosis or formulate a treatment plan, as well as the background knowledge to identify obvious mistakes.
However, detecting hallucinations could still be challenging for doctors. “The models are equally convincing whether they are right or wrong,” says Rodman. “We need to find workflows with a low rate of errors.”

[Figure: Researchers compared two physicians and two large language models on diagnostic tasks at multiple stages of emergency-room care. Credit: Peter G. Brodeur, Thomas A. Buckley, et al.]

Even studies focused on physician-facing clinical reasoning tasks can reach very different conclusions depending on how researchers define success. In a paper published 13 April in JAMA Network, Rao and colleagues tested 21 LLMs on clinical reasoning tasks similar to those in the Science paper. As with the Science paper, many performed well on their final diagnoses, including chatbots in the o1 series. However, Rao scored the LLMs poorly on differential diagnosis questions because she used a different evaluation system.

When doctors make a differential diagnosis, they note all of the potential causes of a patient’s symptoms. An LLM might correctly list six out of seven possible diagnoses. That could reasonably be scored as 86 percent or, as in Rao’s system, as an unacceptable failure (the sketch at the end of this article makes the difference concrete). There is no agreed-upon standard scoring system in place. “It is still something in progress,” says Tordjman. “There’s no perfect way to evaluate LLMs in clinical reasoning.”

Testing Medical AI in the Real World

For the Science study, the researchers tested the OpenAI model with several batteries of medical case studies, comparable to difficult open-ended medical exam questions. Instructions to the chatbot were sometimes lengthy and filled with details that could be either extraneous or critical clues to the correct diagnosis. “We went the extra step and showed that this performance also works in the real world,” says Rodman.

One part of the study used data from 76 actual emergency room visits. The researchers asked the LLM and physicians for diagnoses at several stages of care: upon arrival at the emergency room, after evaluation by a doctor, and after transfer to another part of the hospital. Though both the model and the humans became more accurate as more information became available, the LLM consistently edged out the humans. For example, it provided an “exact or very close diagnosis” 82 percent of the time at the final checkpoint, compared with 79 percent and 70 percent for the two physicians.

LLMs as we know them are not even a decade old, and the landscape is rapidly evolving. Updated versions of flagship LLMs arrive faster than the typical pace of medical studies and academic literature, and many questions about regulation and liability remain unanswered. With many patients and doctors already consulting these machines, researchers told IEEE Spectrum that there is an urgent need to understand their benefits, their risks, and the best ways to use them.

While comparing AI performance against human physicians was important to the study, Manrai says the more important question is how doctors will actually use the technology. “We have to very rapidly move away from ‘AI vs. humans’ toward how humans interact with this technology,” he says.

Despite the many unresolved questions, Harvard’s Rao says the technology is advancing too quickly for medicine to ignore. “I would say it’s important to be careful, it’s important to evaluate, but it’s perhaps even more important to innovate,” she says. “We don’t want to rain on the parade; we think responsible innovation is the way to go.”
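To make the scoring disagreement described above concrete, here is a minimal Python sketch. The chest-pain differential, the diagnosis names, and both scoring rules are illustrative assumptions for this article, not the actual rubrics used in the Science or JAMA Network studies: one rule gives partial credit for each reference diagnosis the model recovers, while the other fails any list that misses even one.

```python
# Illustrative only: hypothetical diagnosis lists and scoring rules,
# not the rubrics from the Science or JAMA Network papers.

REFERENCE = {  # a plausible 7-item differential for acute chest pain
    "acute coronary syndrome", "pulmonary embolism", "aortic dissection",
    "pneumothorax", "pericarditis", "esophageal rupture", "costochondritis",
}

# Suppose the model lists six of the seven reference diagnoses.
MODEL_OUTPUT = REFERENCE - {"esophageal rupture"}

def partial_credit(reference: set, predicted: set) -> float:
    """Fraction of reference diagnoses the model recovered (recall)."""
    return len(reference & predicted) / len(reference)

def all_or_nothing(reference: set, predicted: set) -> bool:
    """Pass only if every reference diagnosis appears in the model's list."""
    return reference <= predicted

score = partial_credit(REFERENCE, MODEL_OUTPUT)
print(f"Partial credit: {score:.0%}")  # Partial credit: 86%
verdict = "pass" if all_or_nothing(REFERENCE, MODEL_OUTPUT) else "fail"
print(f"All-or-nothing: {verdict}")    # All-or-nothing: fail
```

Both rules are defensible, which is exactly why two groups can test similar models on similar tasks and report very different results.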