
expert reaction to study suggesting potential patient harms associated with use of AI medical outcome-prediction models

A study published in Patterns looks at potential patient harms linked to the use of AI medical outcome prediction models. 

 

Professor Peter Bannister, Fellow and Healthcare expert at the Institution of Engineering and Technology, said:

“AI is trained on real-world data, which include biases as well as the potential to enable better decisions. In the case of healthcare, there is a risk that if AI is widely used for clinical decision making, it may further marginalise groups who already have poor access to treatments. An example would be patients for whom AI predicts a low survival rate and who are then not offered potentially lifesaving treatments.

“This paper shows that in many clinical decision-making processes, relying only on an AI model’s ability to make accurate predictions can sometimes lead to worse outcomes for those patients. While the authors make it clear there are further, more complex scenarios that need to be studied, this work reinforces the need for AI technologies used in real-world settings to be assessed with a “whole system” approach, where the overall health outcome of the patient is used to decide whether the AI is contributing to improved care.”

 

Professor Ibrahim Habli, Research Director, Centre for Assuring Autonomy, University of York, said:

“The study warns us about the risks of relying too much on one technology and judging it only by its accuracy, without considering who it’s for and in what situations. For AI to be used safely in healthcare, it needs to fit into the real-world practices of doctors and the specific needs of patients. The study is encouraging in that it focuses on AI safety, especially as it follows a recently published White Paper, ‘Avoiding the AI off switch’, which highlights the need for AI to be a benefit, not a liability, to both clinicians and patients. Treating patients is a process that changes over time, depending on their needs and available treatments. Focusing only on accuracy and outcomes can be misleading and even dangerous. AI might also show bias, such as against people with disabilities or rare diseases, making it safer for some people but not for everyone.”

 

Prof Ian Simpson, Professor of Biomedical Informatics, University of Edinburgh, said:

When asked how widely these outcome-prediction AI models are used in the NHS/NHS Scotland right now:

“It’s reasonable to say that AI OPMs are not that widely used at the moment in the NHS/NHS Scotland. Decision support tends to be used more in association with medical hardware systems that were very early adopters of ML techniques, i.e. things like MRI machines. Here they tend to be used in parallel with existing clinical management policies, often for assisting diagnostics and/or speeding up processes like image segmentation.

“Whilst diagnostics can fall foul of the issues raised in the paper, it’s not quite the same as the scenarios they explore, in that diagnosis is deterministic and subsequent clinical decisions would likely be made using existing processes. Issues here tend to be more performance-oriented, i.e. false positives (over-diagnosis) and false negatives (incorrect or missed diagnoses). These are the metrics that are currently scrutinised in approval processes. So, in short, the issues raised in this paper are in my opinion not quite so acute for diagnostics as currently deployed.”

 

Professor Ewen Harrison, Professor of Surgery and Data Science and Co-Director of Centre for Medical Informatics at the University of Edinburgh, said:

“Artificial intelligence and computer algorithms are increasingly used in medicine to help make difficult decisions. While these tools promise more accurate and personalised care, this study highlights one of a number of concerning downsides: predictions themselves can unintentionally harm patients by influencing treatment decisions.

“Say a hospital introduces a new AI tool to estimate who is likely to have a poor recovery after knee replacement surgery. The tool uses characteristics such as age, body weight, existing health problems, and physical fitness.

“Initially, doctors intend to use this tool to decide which patients would benefit from intensive rehabilitation therapy. However, due to limited availability and cost, it is decided instead to reserve intensive rehab primarily for patients predicted to have the best outcomes. Patients labelled by the algorithm as having a “poor predicted recovery” receive less attention, fewer physiotherapy sessions, and less encouragement overall.

“As a result, these patients indeed experience slower recovery, higher pain, and reduced mobility, seemingly confirming the accuracy of the prediction tool. In reality, however, it was the reduced support and resources – triggered by the algorithm’s predictions – that contributed to their poor outcomes. The model has thus created a harmful self-fulfilling prophecy, with accuracy metrics wrongly interpreted as evidence of its success.

“These are real issues affecting AI development in the UK. The researchers emphasise that hospitals and policymakers need to carefully monitor how predictive algorithms are actually used in practice. Doing so can help ensure that AI-driven decisions genuinely benefit patients, rather than inadvertently harming those who most need help.”
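To make the feedback loop Professor Harrison describes concrete, here is a minimal simulation sketch in Python. All probabilities, thresholds and effect sizes are invented for illustration and are not taken from the study or from real rehabilitation data; it only shows how withholding intensive rehab from patients labelled “poor predicted recovery” can make the prediction look more accurate while leaving more patients with poor outcomes.

```python
import random

random.seed(0)

def simulate(n_patients=10_000, withhold_rehab_from_predicted_poor=True):
    """Toy model of a self-fulfilling prophecy in post-surgery rehab allocation.

    All numbers are invented for illustration only.
    """
    correct = 0
    poor_outcomes = 0
    for _ in range(n_patients):
        frailty = random.random()                 # unobserved patient frailty, 0..1
        predicted_poor = frailty > 0.6            # model flags higher-frailty patients as 'poor predicted recovery'

        # Deployment policy: predicted-poor patients receive less intensive rehab.
        gets_intensive_rehab = not (withhold_rehab_from_predicted_poor and predicted_poor)

        # Outcome depends on frailty AND on the rehab actually received.
        p_poor_recovery = 0.2 + 0.5 * frailty - (0.25 if gets_intensive_rehab else 0.0)
        poor_recovery = random.random() < max(0.0, p_poor_recovery)

        poor_outcomes += poor_recovery
        correct += (predicted_poor == poor_recovery)

    return correct / n_patients, poor_outcomes / n_patients

acc_policy, poor_policy = simulate(withhold_rehab_from_predicted_poor=True)
acc_all, poor_all = simulate(withhold_rehab_from_predicted_poor=False)

print(f"Policy follows model: accuracy={acc_policy:.2f}, poor outcomes={poor_policy:.2%}")
print(f"Rehab for everyone:   accuracy={acc_all:.2f}, poor outcomes={poor_all:.2%}")
```

With these made-up numbers, the model appears more accurate when care follows its predictions, because withholding rehab helps confirm them, even though the overall rate of poor recoveries is higher than if everyone received intensive rehab.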

 

Prof Ian Simpson, Professor of Biomedical Informatics, University of Edinburgh, said:

“This is an important and timely study, adding to emerging evidence that the long-established dependence on predictive performance when evaluating AI models is not sufficient to support their deployment in healthcare settings. This study takes a formal theoretical approach to explore the relationship between model performance (how well a model predicts) and model calibration (how reliable the probabilities of those predictions are) in both pre- and post-deployment scenarios. The study finds that, even in simple settings, models that have good performance and calibration properties could lead to worse patient outcomes if deployed.

“Intuitively, it would seem that implementing the models with the best performance would be desirable, if not essential; however, these models are typically trained on historical data. This bakes in historical relationships, so that any future departure from the historical treatment process which changes a patient outcome favourably would paradoxically result in a drop in model performance during deployment. Positive changes in treatment decisions could therefore lead to the withdrawal of a model because its performance falls below an acceptable level, despite those changes improving patient outcomes. One of the interesting findings in this study is that a drop in model performance on deployment could actually be evidence of a model performing well, and that where a model’s performance does not change upon deployment it may mean that the model is in fact not effective at all; it simply reinforces existing practice.

“The authors find that over a wide range of settings there is a risk of “self-fulfilling prophecy”, where the historical training used to develop models hard-wires decisions or, worse, actively disadvantages groups of patients for whom treatment changes from the established process would be beneficial. They posit a scenario where patients with a fast-growing tumour receive a decision not to undergo palliative radiotherapy based on the poor survival time predicted by the model. Patients with slower-growing tumours are recommended for treatment as the model predicts a longer survival time, justifying the side-effects of the treatment. However, in this scenario radiotherapy is ineffective for slow-growing tumours but highly effective for aggressive ones; the model supports exactly the wrong decision.

“This work, building on findings by others in recent years, provides further evidence of the need to shift focus from predictive performance to an explicit consideration of how changes in treatment choice affect patient outcomes. The gold standard for this is long established in healthcare: randomised controlled trials designed to directly measure the effectiveness of new interventions in deployment. Regulation of AI tools is evolving rapidly around the world, but it is primarily focussed on performance both pre- and post-deployment, which, as this study shows, fails to capture their effectiveness in practice and risks reinforcing bias from historical data.

“Whilst at first glance this work might seem alarming, it is in fact a very encouraging development, highlighting essential considerations for how to evaluate and use AI models in healthcare. These considerations deepen our understanding of how to improve safety and clinical effectiveness and, crucially, emphasise the importance of randomised controlled trials and the deep integration of clinical knowledge into model development.”
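Below is a rough numerical sketch of the radiotherapy scenario Professor Simpson describes; the outcome probabilities are invented for illustration and are not drawn from the paper. It also illustrates his point about performance drops on deployment: switching to a policy that treats the fast-growing tumours (for whom radiotherapy works in this toy setup) improves outcomes while making the model’s apparent accuracy fall.

```python
import random

random.seed(1)

# Invented outcome probabilities, for illustration only:
# radiotherapy helps fast-growing tumours a lot and slow-growing ones not at all.
P_GOOD_OUTCOME = {
    ("fast", "radiotherapy"): 0.70,
    ("fast", "no treatment"): 0.15,
    ("slow", "radiotherapy"): 0.60,
    ("slow", "no treatment"): 0.60,
}

def model_predicts_good_outcome(tumour):
    """A model trained on historical data, in which fast growers were rarely treated,
    learns 'fast tumour -> poor outcome' and predicts accordingly."""
    return tumour == "slow"

def run(policy, n=20_000):
    correct, good = 0, 0
    for _ in range(n):
        tumour = "fast" if random.random() < 0.5 else "slow"
        predicted_good = model_predicts_good_outcome(tumour)
        treatment = policy(tumour, predicted_good)
        outcome_good = random.random() < P_GOOD_OUTCOME[(tumour, treatment)]
        correct += (predicted_good == outcome_good)
        good += outcome_good
    return correct / n, good / n

# Policy A: follow the model - treat only patients predicted to do well.
follow_model = lambda tumour, predicted_good: "radiotherapy" if predicted_good else "no treatment"
# Policy B: treat the fast-growing tumours, for whom radiotherapy actually works here.
treat_fast   = lambda tumour, predicted_good: "radiotherapy" if tumour == "fast" else "no treatment"

for name, policy in [("follow model", follow_model), ("treat fast growers", treat_fast)]:
    acc, good = run(policy)
    print(f"{name:>20}: apparent model accuracy={acc:.2f}, good outcomes={good:.2%}")
```

Under these assumptions, following the model yields high apparent accuracy but fewer good outcomes, whereas the better policy improves outcomes and makes the model look worse, which is why judging a deployed model by its predictive performance alone can be misleading.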

 

Dr Catherine Menon, Principal Lecturer at the University of Hertfordshire’s Department of Computer Science, said:

“This study presents results that show the risks of doctors using AI prediction models to make treatment decisions. This happens when AI models have been trained on historical data that do not necessarily account for factors such as historical under-treatment of some medical conditions or demographics. These models will accurately predict poor outcomes for patients in these demographics. This creates a “self-fulfilling prophecy” if doctors decide not to treat these patients due to the associated treatment risks and the fact that the AI predicts a poor outcome for them. Even worse, this perpetuates the same historic error: under-treating these patients means that they will continue to have poorer outcomes. Use of these AI models therefore risks worsening outcomes for patients who have typically been historically discriminated against in medical settings due to factors such as race, gender or educational background.

“This demonstrates the inherent importance of evaluating AI decisions in context, and applying human reasoning and assessment to AI judgements. AIs might be accurate, but they can only understand a limited subset of the entire landscape around treatment decisions. This has important real-world implications because it shows that human oversight and sound ethical assessment of AI models is necessary if treatment decisions are going to be made based on the predictions of these AI models. Use of AI without human oversight in this context risks embedding further discrimination and disenfranchisement into medical systems.

“This also has important real-world implications beyond the medical domain. Uses of AI such as the “homicide prediction project” highlighted in https://www.theguardian.com/uk-news/2025/apr/08/uk-creating-prediction-tool-to-identify-people-most-likely-to-kill may also lead to the same result. Certain demographics which have historically been over-policed and are over-represented within the justice system may suffer from the same AI-predicted poorer outcomes as those discussed within this medical study. This demonstrates the wider power of such predictive AI models, and the necessity to fully understand their training and scope before using them.”
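As a small illustration of the historical-bias mechanism Dr Menon describes, the sketch below generates an invented historical dataset in which two groups (labelled A and B, purely hypothetical) respond equally well to treatment, but group B was historically treated far less often. A model that simply learns each group’s historical outcome rate will predict worse outcomes for group B, and allocating care on that prediction would perpetuate the gap.

```python
import random

random.seed(2)

def historical_record(n=10_000):
    """Invented historical data: both groups respond equally well to treatment,
    but group B was historically treated far less often."""
    rows = []
    for _ in range(n):
        group = "A" if random.random() < 0.5 else "B"
        treated = random.random() < (0.8 if group == "A" else 0.2)  # historical under-treatment of B
        good_outcome = random.random() < (0.7 if treated else 0.3)  # treatment helps both groups equally
        rows.append((group, treated, good_outcome))
    return rows

history = historical_record()

# A naive 'model': predict each group's historical good-outcome rate.
for group in ("A", "B"):
    outcomes = [good for g, _, good in history if g == group]
    print(f"Group {group}: historical good-outcome rate = {sum(outcomes) / len(outcomes):.2f}")
```

The gap in “predicted” outcomes here reflects historical access to treatment, not any difference in how treatable the two groups are.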

 

Dr James N. Weinstein, Innovation and Health Equity, Microsoft Research, Health Futures, said:

“While prediction models are often praised for their accuracy, this research highlights a critical flaw: even well-performing models can lead to harmful self-fulfilling prophecies when used for treatment decisions. It’s essential to evaluate these models based on their real-world impact on patient outcomes rather than just predictive accuracy. Emphasizing “informed choice,” where medical decisions are guided by a patient’s values and preferences, is crucial to ensure that treatment and outcome decisions evolve with the patient’s condition over time.”

References:

William B. Weeks and James N. Weinstein, ‘Patient-Reported Data Can Help People Make Better Health Care Choices’, Harvard Business Review, 21 September 2015.

Jon D. Lurie, Kevin F. Spratt, Emily A. Blood, Tor D. Tosteson, Anna N. A. Tosteson and James N. Weinstein (Dartmouth Medical School, Hanover, NH, USA), ‘Effects of Viewing an Evidence-Based Video Decision Aid on Patients’ Treatment Preferences for Spine Surgery’, Spine (Phila Pa 1976), 15 August 2011; 36(18): 1501–1504. doi: 10.1097/BRS.0b013e3182055c1e.

Peter Bonis and Jim Weinstein, ‘GenAI and Patient Choice: A New Era of Informed Healthcare’, Patient Safety & Quality Healthcare, 28 February 2025.

 

 

‘When accurate prediction models yield harmful self-fulfilling prophecies’ by Wouter A.C. van Amsterdam et al. was published in Patterns at 16:00 UK time on Friday 11 April 2025.

 

DOI: 10.1016/j.patter.2025.101229

 

 

Declared interests

Prof Ewen Harrison: EMH receives grant funding from the NIHR, Wellcome Leap, UKRI and the Bill and Melinda Gates Foundation.

Prof Ian Simpson: I have consulted for, and received funding from, pharmaceutical companies including UCB and AstraZeneca. I also lead the UKRI AI Centre for Doctoral Training in Biomedical Innovation that has many industry partners.

Dr Jim Weinstein: employee of Microsoft Research which is a research subsidiary of Microsoft.

For all other experts, no reply to our request for DOIs was received.

 

 
