TAIPEI (Taiwan News) — Popular AI tools, including ChatGPT, show limited accuracy on stroke-related clinical questions, a Taiwanese study finds, raising concerns about the reliability of large language models in medical contexts.
The study, presented Thursday by Lee Ta-yu (李達宇), an associate professor at National Taiwan University’s Department of Public Health, assessed AI performance across four stages of stroke care. Lee highlighted that as patients increasingly turn from doctors to the internet and AI for medical information, the accuracy of these tools requires careful evaluation, per CNA.
Researchers tested ChatGPT-4o, Claude 3 Sonnet, and Gemini 1.0 Ultra using simulated clinical scenarios. They applied three prompting methods (Zero-Shot Learning, Chain of Thought, and Talking Out Your Thoughts) and evaluated responses for accuracy, hallucinations, specificity, empathy, and actionability, according to the study published in npj Digital Medicine.
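The three prompting methods differ mainly in how the same question is framed for the model. A minimal sketch, with hypothetical wording (the study's exact prompts are not given here), might look like:

```python
# Hypothetical illustration of the three prompting strategies named in
# the study; the question and phrasing below are illustrative, not the
# researchers' actual prompts.

QUESTION = "What should I do if someone shows sudden facial drooping?"

# Zero-Shot Learning: the question is posed directly, with no guidance.
zero_shot = QUESTION

# Chain of Thought: the model is asked to reason step by step.
chain_of_thought = f"{QUESTION}\nLet's think step by step before answering."

# Talking Out Your Thoughts: the model is asked to verbalize its
# reasoning as it works toward an answer.
talk_out_loud = f"{QUESTION}\nExplain your thinking out loud as you answer."

for name, prompt in [("zero-shot", zero_shot),
                     ("chain-of-thought", chain_of_thought),
                     ("talk-out-loud", talk_out_loud)]:
    print(f"--- {name} ---\n{prompt}\n")
```

The same clinical question is sent three times, so differences in scored accuracy and actionability can be attributed to the prompting style rather than the scenario.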
Most AI scores fell below the clinical competency threshold of 60 out of 100. The models were inconsistent in providing actionable guidance, especially during high-risk stages of stroke treatment, where errors or incomplete responses were common.
Lee said that while AI may help with general health information, it is unreliable in urgent or high-stakes situations requiring professional judgment. He added that including personal details, such as age, gender, family medical history, environmental exposures, and medication use, can improve the relevance of AI-generated responses.
Chen Pau-chung (陳保中), an attending physician at National Taiwan University Hospital, echoed these concerns, emphasizing that AI tools should only serve as supplementary references, not replacements for healthcare professionals.
The study also found that each prompting method has distinct strengths. Among the three models, GPT-4o performed best in accuracy, specificity, and actionability.
However, all three large language models showed significant limitations in delivering clinically relevant and actionable guidance. Researchers concluded that AI-generated health information may not consistently meet clinical standards, underscoring the need for patients to approach these resources with caution.