
When Dr. Maybe Meets Real Medicine

Markus Brinsa · March 16, 2026 · 6 min read


A pair of new studies shows what happens when medical confidence outruns medical judgment

The chatbot in the waiting room

There is now a particular kind of modern person who will not call a doctor, will not call urgent care, will absolutely not pay for an out-of-network consultation, but will very happily open a chatbot and type, “I have a terrible headache and my neck feels weird.” Then, because the machine replies in complete sentences with the calm tone of a sleep app and the confidence of a bad intern, that person feels strangely reassured. Maybe not fully reassured. Just reassured enough to stay home.

That tiny gap between “I should probably get checked” and “maybe I’ll wait until tomorrow” is where this story lives. And according to two new studies in Nature Medicine, it is also where chatbots start becoming dangerous.

The problem is not that these systems know nothing. The problem is almost worse. They know just enough to sound useful, just enough to sound informed, and just enough to make people think they are getting something close to clinical judgment. They are not.

In one of the studies, researchers tested whether ordinary people could use large language models to make better medical decisions. On paper, the bots looked impressive. When tested alone, the models identified the condition correctly almost 95 percent of the time. That sounds like the beginning of a marketing campaign, a TED Talk, and three venture rounds. Then humans entered the picture.

Once real people had to interact with the models, describe symptoms, interpret the answers, and decide what to do next, the whole performance dropped through the floor. Participants correctly identified the relevant conditions in less than 34.5 percent of cases and chose the correct next step in less than 44.2 percent of cases. In other words, the machine may have had the answer somewhere in its synthetic little heart, but the human-machine combination still performed badly in the real world. The researchers concluded that using the LLMs was no better than relying on more traditional tools.

That is a brutal finding because it punctures one of the most persistent fantasies in AI. The fantasy says that if the model scores well on exams, benchmarks, and polished demos, it must be ready to help actual people with actual problems. Medicine, once again, has stepped forward to explain that reality is not a benchmark. Reality is a frightened human at 11:47 p.m. using vague language, leaving out key symptoms, misunderstanding follow-up advice, and desperately hoping the answer is “don’t worry about it.”

And then came the other study, which is the one that should make everyone sit up straighter.

Researchers at Mount Sinai evaluated ChatGPT Health, OpenAI’s consumer-facing health tool, using 60 clinician-authored vignettes across 21 clinical domains and 16 contextual conditions. This was not a few cherry-picked prompts tossed into a chatbot for sport. It was a structured stress test designed to see whether the system could handle the question that matters most in triage: how urgent is this, really?

The answer was not encouraging. According to the study, ChatGPT Health under-triaged 52 percent of gold-standard emergencies. Not occasionally. Not by a little. More than half.

That phrase under-triaged sounds dry and technical, which is unfortunate, because what it really means is that a system looked at emergencies and, far too often, acted like they were not emergencies. Cases involving diabetic ketoacidosis and impending respiratory failure were sometimes sent toward slower evaluation rather than the emergency department. The model reportedly performed better with textbook examples like stroke and anaphylaxis, which is exactly the sort of result you would expect from a system that has ingested massive amounts of medical language but does not possess anything remotely like bedside judgment.

A clinician knows that medicine is full of ugly edge cases. Real danger often arrives dressed as something a little off, a little unclear, a little easy to dismiss.

A chatbot, by contrast, tends to like the obvious stuff. It loves the dramatic symptom constellation. It loves a crisp pattern. It loves a case that sounds like a board exam question. What it does not reliably handle is the murky middle where patients actually live.

That is where one of the most revealing details from the Oxford study becomes almost darkly funny. Two users described essentially the same life-threatening scenario. One said it was the “worst headache ever” and got told to go to the hospital. Another used less dramatic wording and got advice that amounted to lie down in a dark room. Same underlying risk, different phrasing, wildly different outcome. That is not intelligence. That is a linguistic tripwire.

It also exposes a basic truth that the AI industry has been trying very hard not to dwell on. Medical use is not just a model problem. It is an interaction problem. A chatbot can only work with what the user gives it, and ordinary users are not trained historians of their own symptoms. Doctors know which questions to ask because medicine is not just recall. It is elicitation. It is pattern recognition. It is noticing what the patient did not think to mention. It is interrupting the story at exactly the right moment to ask, “When you say dizzy, do you mean lightheaded, spinning, or weak?” That question alone can change the entire picture.

A chatbot does not really know what it is missing. It just keeps going.

And that confidence matters, because in medicine, reassuring language is not neutral. A wrong answer wrapped in uncertainty might still scare someone into getting help. A wrong answer wrapped in calm, plausible fluency can delay care. The studies suggest exactly that risk: not just error, but error delivered in a form people can mistake for guidance.

This is what makes the launch timing of products like ChatGPT Health so fascinating and so absurd. OpenAI’s own materials say Health is designed to support, not replace, medical care. Fine. Sensible. Necessary. Also entirely insufficient. Because once you build a dedicated health product, connect medical records and wellness apps, and market it as a more informed way to navigate health decisions, you are no longer operating in the theoretical universe of “people should know better.” You are operating in the real universe where millions of people will absolutely use it at moments of stress, fear, confusion, and urgency.

And millions already are. OpenAI has said that health is one of the most common topics people bring to ChatGPT, with more than 230 million people worldwide asking health and wellness questions each week. That is not edge behavior anymore. That is consumer behavior.

So the stakes are not academic. They are infrastructural. If this category is becoming part of how people decide whether to seek care, then its failures are not quirky model glitches. They are public-health failures waiting for a body count.

The studies do not prove that chatbots are useless in medicine. In fact, even critics and cautious clinicians point to more realistic uses that may genuinely help. A chatbot can help someone prepare questions before an appointment, summarize jargon after a visit, explain the difference between two tests, or help a patient navigate administrative nonsense that American healthcare manufactures with industrial pride. Used that way, the machine is less fake doctor and more translator, organizer, and paperwork hostage negotiator.

That distinction matters. The safe use case is not “replace the clinician.” It is “help the patient participate more effectively when a clinician is involved.”

The danger begins when people start treating conversational competence as medical competence. A chatbot that sounds composed is not the same thing as a system that can triage safely. A model that aces a benchmark is not the same thing as a tool that can handle a panicked user who does not know which symptom matters. And a product disclaimer saying this does not replace doctors is not a force field. It does not undo the psychological effect of a machine that sounds like it knows what it is talking about.

That is the larger Chatbots Behaving Badly lesson here. Chatbots do not fail only when they say bizarre things. They also fail when they say normal things in situations where normal is exactly the wrong tone. In medical contexts, the scariest output may not be lunacy. It may be reassurance.

And that is the real problem with putting a chatbot in the waiting room. It does not need to be wildly wrong to cause harm. It only needs to be wrong in a way that sounds reasonable enough for someone to stay home.

About the Author

Markus Brinsa is the Founder & CEO of SEIKOURI Inc., an international strategy firm that gives enterprises and investors human-led access to pre-market AI—then converts first looks into rights and rollouts that scale. As an AI Risk & Governance Strategist, he created "Chatbots Behaving Badly," a platform and podcast that investigates AI’s failures, risks, and governance. With over 30 years of experience bridging technology, strategy, and cross-border growth in the U.S. and Europe, Markus partners with executives, investors, and founders to turn early signals into a durable advantage.

© 2026 Markus Brinsa | brinsa.com™