Can AI Outperform Human Doctors?

By Crystal Lindell

It may not be long before a trip to the emergency room means telling your symptoms to an AI robot, potentially before you even talk to a human doctor. 

New research published in Science highlights the potential for artificial intelligence to bring about such a future in health care.

The study, which was conducted by Harvard and Stanford researchers, tested OpenAI’s experimental “o1 preview” models against human physicians. OpenAI makes ChatGPT.

They asked the o1 models to diagnose patients and create a diagnostic testing plan, then compared their skill in clinical reasoning to that of experts and generalist physicians.

They also assessed the AI on 76 real-life emergency room patients at a Boston hospital at three stages: initial triage on arrival, first contact with a physician, and admission to the hospital.

The results showed that the new AI model outperformed human physicians and improved on earlier generations of AI.

“Our findings suggest the urgent need for prospective trials to evaluate these technologies in real-world patient care settings and for health care systems to prepare for investments for computing infrastructure and design for clinician-AI interaction that can facilitate the safe integration of AI tools into patient-care workflows,” wrote lead authors Arjun Manrai, PhD, Assistant Professor of Biomedical Informatics at Harvard University, and Adam Rodman, MD, Director of AI Programs at Beth Israel Deaconess Medical Center.

In the emergency department cases, the o1 model was diagnostically correct 67.1% of the time during the initial triage, outperforming two expert attending physicians (55.3% and 50.0%).

Physicians who reviewed the diagnostic results – without knowing if they were made by AI or human doctors – were unable to distinguish between the two. 

“AI models are evolving from static question-and-answer tools into agents that can, for example, analyze patient records, monitor clinical encounters through ambient listening, and interact in real time with predictive models built on patient data,” Ashley Hopkins, PhD, and Erik Cornelisse, PhD candidate, at the College of Medicine and Public Health at Flinders University in Australia, wrote in an op-ed on the study.

“This advance sets a new evaluation benchmark — testing AI against physician performance, and ideally alongside physicians, on authentic clinical tasks.”

Interestingly, Hopkins and Cornelisse pushed back on the idea that physicians collaborating with AI is necessarily the ideal way to evaluate patients. They suggest that for some tasks, AI may perform better on its own.

“That collaborative configuration itself must be tested,” they write. “It has been argued that for certain well-defined tasks across health care, AI may operate more effectively independently.”

They also wrote that since many doctors are already using AI in their practices, sometimes without institutional oversight, further studies are urgently needed to determine when AI improves patient care and when it does not.

In an article about the AI study published in Harvard Magazine, Arjun Manrai, the senior co-author of the study, said the results do not show that “AI replaces doctors, despite what some (AI) companies are likely to say.”

“I think it does mean that we’re witnessing a really profound change in technology that will reshape medicine,” Manrai said. “We need to evaluate this technology now and rigorously conduct prospective clinical trials.”

Manrai also makes an important point: the AI study was based entirely on text-based inputs, while practicing physicians evaluate many other forms of information and communication, such as listening to a patient, observing how a patient behaves, examining images and X-rays, and evaluating other test results.

AI can’t do all those things – at least not yet.

Manrai’s co-author, Adam Rodman, also thinks it’s premature for AI to replace doctors in clinical settings. AI might prove useful in providing second opinions and finding diagnostic mistakes, but Rodman doesn’t want to see “AI doctor companies” replacing human physicians.

“I do not think that these results support that,” Rodman said. “What these results support is a robust and ambitious research agenda to try to figure out how we can use these technologies to make patients’ lives better.”

ChatGPT Is Replacing Dr. Google

By Andrew Leonard, KFF Health News

As a fourth-year ophthalmology resident at Emory University School of Medicine, Riley Lyons’ biggest responsibilities include triage: When a patient comes in with an eye-related complaint, Lyons must make an immediate assessment of its urgency.

He often finds patients have already turned to “Dr. Google.” Online, Lyons said, they are likely to find that “any number of terrible things could be going on based on the symptoms that they’re experiencing.”

So, when two of Lyons’ fellow ophthalmologists at Emory came to him and suggested evaluating the accuracy of the AI chatbot ChatGPT in diagnosing eye-related complaints, he jumped at the chance.

In June, Lyons and his colleagues reported in medRxiv, an online publisher of health science preprints, that ChatGPT compared quite well to human doctors who reviewed the same symptoms — and performed vastly better than the symptom checker on the popular health website WebMD.

And despite the much-publicized “hallucination” problem known to afflict ChatGPT — its habit of occasionally making outright false statements — the Emory study reported that the most recent version of ChatGPT made zero “grossly inaccurate” statements when presented with a standard set of eye complaints.

The relative proficiency of ChatGPT, which debuted in November 2022, was a surprise to Lyons and his co-authors. The artificial intelligence engine “is definitely an improvement over just putting something into a Google search bar and seeing what you find,” said co-author Nieraj Jain, an assistant professor at the Emory Eye Center who specializes in vitreoretinal surgery and disease.

But the findings underscore a challenge facing the health care industry as it assesses the promise and pitfalls of generative AI, the type of artificial intelligence used by ChatGPT: The accuracy of chatbot-delivered medical information may represent an improvement over Dr. Google, but there are still many questions about how to integrate this new technology into health care systems with the same safeguards historically applied to the introduction of new drugs or medical devices.

The smooth syntax, authoritative tone, and dexterity of generative AI have drawn extraordinary attention from all sectors of society, with some comparing its future impact to that of the internet itself. In health care, companies are working feverishly to implement generative AI in areas such as radiology and medical records.

When it comes to consumer chatbots, though, there is still caution, even though the technology is already widely available — and better than many alternatives. Many doctors believe AI-based medical tools should undergo an approval process similar to the FDA’s regime for drugs, but that would be years away. It’s unclear how such a regime might apply to general-purpose AIs like ChatGPT.

“There’s no question we have issues with access to care, and whether or not it is a good idea to deploy ChatGPT to cover the holes or fill the gaps in access, it’s going to happen and it’s happening already,” said Jain. “People have already discovered its utility. So, we need to understand the potential advantages and the pitfalls.”

The Emory study is not alone in ratifying the relative accuracy of the new generation of AI chatbots. A report published in Nature in early July by a group led by Google computer scientists said answers generated by Med-PaLM, an AI chatbot the company built specifically for medical use, “compare favorably with answers given by clinicians.”

AI may also have better bedside manner. Another study, published in April by researchers from the University of California-San Diego and other institutions, even noted that health care professionals rated ChatGPT answers as more empathetic than responses from human doctors.

Indeed, a number of companies are exploring how chatbots could be used for mental health therapy, and some investors in the companies are betting that healthy people might also enjoy chatting and even bonding with an AI “friend.” The company behind Replika, one of the most advanced of that genre, markets its chatbot as, “The AI companion who cares. Always here to listen and talk. Always on your side.”

“We need physicians to start realizing that these new tools are here to stay and they’re offering new capabilities both to physicians and patients,” said James Benoit, an AI consultant. While a postdoctoral fellow in nursing at the University of Alberta in Canada, he published a study in February reporting that ChatGPT significantly outperformed online symptom checkers in evaluating a set of medical scenarios. “They are accurate enough at this point to start meriting some consideration,” he said.

A ‘Band-Aid’ Solution

Still, even the researchers who have demonstrated ChatGPT’s relative reliability are cautious about recommending that patients put their full trust in the current state of AI. For many medical professionals, AI chatbots are an invitation to trouble: They cite a host of issues relating to privacy, safety, bias, liability, transparency, and the current absence of regulatory oversight.

The proposition that AI should be embraced because it represents a marginal improvement over Dr. Google is unconvincing, these critics say.

“That’s a little bit of a disappointing bar to set, isn’t it?” said Mason Marks, a professor and MD who specializes in health law at Florida State University. He recently wrote an opinion piece on AI chatbots and privacy in the Journal of the American Medical Association.

“I don’t know how helpful it is to say, ‘Well, let’s just throw this conversational AI on as a band-aid to make up for these deeper systemic issues,’” he said to KFF Health News.

The biggest danger, in his view, is the likelihood that market incentives will result in AI interfaces designed to steer patients to particular drugs or medical services. “Companies might want to push a particular product over another,” said Marks. “The potential for exploitation of people and the commercialization of data is unprecedented.”

OpenAI, the company that developed ChatGPT, also urged caution.

“OpenAI’s models are not fine-tuned to provide medical information,” a company spokesperson said. “You should never use our models to provide diagnostic or treatment services for serious medical conditions.”

John Ayers, a computational epidemiologist who was the lead author of the UCSD study, said that as with other medical interventions, the focus should be on patient outcomes.

“If regulators came out and said that if you want to provide patient services using a chatbot, you have to demonstrate that chatbots improve patient outcomes, then randomized controlled trials would be registered tomorrow for a host of outcomes,” Ayers said.

He would like to see a more urgent stance from regulators.

“One hundred million people have ChatGPT on their phone,” said Ayers, “and are asking questions right now. People are going to use chatbots with or without us.”

At present, though, there are few signs that rigorous testing of AIs for safety and effectiveness is imminent. In May, Robert Califf, the commissioner of the FDA, described “the regulation of large language models as critical to our future,” but aside from recommending that regulators be “nimble” in their approach, he offered few details.

In the meantime, the race is on. In July, The Wall Street Journal reported that the Mayo Clinic was partnering with Google to integrate the Med-PaLM 2 chatbot into its system. In June, WebMD announced it was partnering with a Pasadena, California-based startup, HIA Technologies Inc., to provide interactive “digital health assistants.” And the ongoing integration of AI into both Microsoft’s Bing and Google Search suggests that Dr. Google is already well on its way to being replaced by Dr. Chatbot.

KFF Health News is a national newsroom that produces in-depth journalism about health issues.