Published Mar 14, 2026 | 3:11 PM · Updated Mar 14, 2026 | 3:11 PM
Synopsis: A Nature Medicine study shows AI chatbots (e.g., GPT-4o) identify medical conditions accurately when tested alone (90-99%), but real human users fare worse than non-AI controls, correctly spotting conditions in under 35% of cases and choosing appropriate care in under 44%. Inconsistent outputs, incomplete user inputs, and contradictory advice (e.g., rest vs. emergency care for a brain bleed) reveal critical failures. With India accounting for roughly 16.5% of global ChatGPT traffic and heavy health-related use amid limited healthcare access, the tools are deemed unsafe for direct patient care.
It is past midnight in Hyderabad. A 34-year-old IT employee sits on the edge of her bed, phone screen cutting through the dark. Her head pounds. Her vision swims at the edges. She types her symptoms into ChatGPT. It responds in seconds, calm and thorough. She reads the answer, feels a flicker of relief, and goes back to sleep.
She does not call an ambulance. ChatGPT did not tell her to.
She is not unusual. She is, in fact, representative of a number that should stop us cold. India accounts for 16.5 percent of ChatGPT’s global traffic. The United States leads by just 0.6 percentage points. But it is India that drives the daily visits, the returning users, the people who have folded this tool into the texture of ordinary life.
Run that 16.5 percent against India’s 1.4 billion people and you arrive at roughly 23.1 crore users. That is more people than live in the whole of Brazil, turning to an AI chatbot, often for something as consequential as their health.
Now a study published in Nature Medicine has examined what happens when they do. The findings do not reassure.
Somewhere in this study, two people described the same brain-bleed emergency to the same AI. One was told to call for help. The other was told to rest. Keep that in mind as you read what follows.
Researchers at the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences in the UK built a controlled experiment around a simple question: does using an AI chatbot actually help people make better medical decisions?
They recruited 1,298 participants across the United Kingdom. Each person received one of ten medical scenarios, written by doctors, and had to do two things. Identify the likely condition. Then decide on the right course of action, on a scale that ran from staying home to calling an ambulance.
The scenarios were not obscure. A young man develops a thunderclap headache after a night out with friends. A new mother finds herself constantly breathless and exhausted. These were the kinds of situations that send people reaching for their phones at midnight.
One group used AI tools, specifically GPT-4o, Llama 3, or Command R+. Another group used whatever they would normally use at home: a search engine, their own knowledge, a phone call to a relative.
Then the researchers watched what happened.
Test the AI alone, without any human in the conversation, and it performs. GPT-4o identified at least one relevant medical condition in 94.7 percent of cases. Llama 3 managed 99.2 percent. Command R+ reached 90.8 percent.
Then put a real person in the conversation. Watch those numbers fall.
Participants using AI correctly identified a relevant condition in fewer than 34.5 percent of cases. They chose the appropriate level of care in fewer than 44.2 percent of cases. And here is the part that lands hardest: people who used no AI at all, who searched the internet or relied on their own judgment, performed better at identifying conditions than those who had the most powerful chatbots in the world at their fingertips.
“Despite LLMs alone having high proficiency in the task,” the authors write, “the combination of LLMs and human users was no better than the control group in assessing clinical acuity and worse at identifying relevant conditions.”
The AI had the answer. It just could not get it to the person asking.
The researchers dissected 30 interactions closely, one for each combination of model and scenario. What they found was not a single failure but a chain of them.
The first break happens before the AI even responds. In 16 of those 30 conversations, users opened with only partial information. They described what felt significant to them. They left out what they did not know to include. A doctor conducting a patient interview knows which questions to ask, knows that a headache after exertion means something different to a headache on waking. The AI waited for information the user did not know to give.
“In clinical practice, doctors conduct patient interviews to collect the key information because patients may not know what symptoms are important,” the authors note, “and similar skills will be required for patient-facing AI systems.”
The second break happens inside the AI’s response. On average, the chatbot offered 2.21 possible conditions per conversation. Only 34 percent of those suggestions were correct. The user then had to choose which one mattered. Most could not make that call accurately. The AI handed over a list and the responsibility to interpret it landed on the person least equipped to do so.
The third break is the one that should disturb regulators most. The AI proved inconsistent in ways that could kill someone.
“In an extreme case, two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice (Extended Data Table 2). One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care,” said the authors.
Same condition. Same words. Opposite outcomes.
“The sensitivity of LLMs to small variations in inputs creates challenges for forming mental models of LLM behaviour,” the authors write. “Even occasional factual and contextual errors could lead users to disregard advice from LLMs.”
The study also recorded something that reads almost like dark comedy, except that it is not.
In one interaction, the AI told a user in the United Kingdom to call a partial US emergency number. Then, in the same conversation, it switched to recommending “Triple Zero,” the emergency number used in Australia. It had lost track of where the person even lived. It was navigating a medical emergency without knowing which continent it was on.
In two other cases, the AI latched onto a single word in the user’s message and built its entire response around that word, missing the actual clinical picture. In two more cases, it gave a correct answer first, then reversed itself after the user added further details, landing on something wrong.
Dr Rebecca Payne, a General Practitioner (GP) and lead medical practitioner on the study, does not soften her assessment.
“Despite all the hype, AI just isn’t ready to take on the role of the physician. Patients need to be aware that asking a large language model about their symptoms can be dangerous, giving wrong diagnoses and failing to recognise when urgent help is needed,” she said in a statement.
The AI industry measures its medical competence through standardised tests. GPT-4o scores above 80 percent on MedQA, a benchmark built around medical licensing exam questions. That number circulates as evidence that these tools have reached clinical-grade knowledge.
The Oxford study looked at what that number actually predicts in practice.
In several scenarios, benchmark scores above 80 percent corresponded to human experimental scores below 20 percent. The test and the real world did not speak to each other at all.
Associate Professor Adam Mahdi of the Oxford Internet Institute calls this a systemic failure.
“The disconnect between benchmark scores and real-world performance should be a wake-up call for AI developers and regulators. We cannot rely on standardised tests alone to determine if these systems are safe for public use. Just as we require clinical trials for new medications, AI systems need rigorous testing with diverse, real users to understand their true capabilities in high-stakes settings like healthcare,” he said in a statement.
The researchers also tried replacing human users with AI-simulated patients to test the system. Those simulated users scored 57-60 percent accuracy. Real humans scored far lower. This means the safety testing method that developers rely on cannot catch the collapse that happens when a real, anxious, sleep-deprived person sits down and types.
India’s healthcare infrastructure operates under pressures that the UK, where this study ran, does not face at the same scale. The doctor-to-patient ratio in rural India stretches thin. The nearest GP can be hours away. The nearest hospital with emergency care can be farther still.
Into that gap, ChatGPT arrived. It spoke clearly. It responded instantly. It cost nothing. For 23.1 crore people, many of them in cities but many more in places where a second opinion means a long journey, it became the first call rather than the last resort.
The study’s verdict on that arrangement is unambiguous. “We found that none of the tested language models were ready for deployment in direct patient care.”
Andrew Bean, the lead author and a DPhil student at the Oxford Internet Institute, frames the problem as one that the industry needs to solve urgently.
“Designing robust testing for large language models is key to understanding how we can make use of this new technology. In this study, we show that interacting with humans poses a challenge even for top LLMs. We hope this work will contribute to the development of safer and more useful AI systems,” he said in a statement.
The symptoms our techie described that night – pounding head and swimming vision – match several conditions on the Oxford study’s scenario list. Some of those conditions resolve on their own. Some of them, without treatment in the next hour, do not.
ChatGPT gave her an answer. Whether it gave her the right one depended on exactly how she phrased her question, which details she thought to include, and which version of the AI’s response she happened to receive that night.
The researchers put it plainly: “The transmission of information between the LLM and the user” is “a particular point of failure.” Both sides of the conversation break down. The user does not know what to say. The AI cannot reliably bridge that gap.
Millions of Indians will open ChatGPT tonight with a symptom and a question. The machine will answer. That answer, the study tells us, carries no guarantee of being right, consistent, or safe.
The developers and regulators know this now. However, the question is what either of them intends to do before the next midnight AI consultation goes the wrong way.
(Edited by Amit Vasudev)