A study published in Nature looks at the use of large language models (LLMs) for medical question-answering.
Prof Maria Liakata, Professor in Natural Language Processing, Queen Mary University of London (QMUL), said:
“The paper makes important steps in advancing the state of the art in using LLMs for medical question-answering (Q-A) and in providing a multi-faceted framework for evaluating the quality of generated content. It introduces a new medical Q-A benchmark, which combines a number of previous multiple-choice Q-A datasets with a new dataset of consumer medical questions, and a new LLM tuned for medical Q-A that surpasses previous models, making use of a new LLM adaptation strategy, instruction prompt tuning. The most significant contribution is a human evaluation framework for assessing answers according to different aspects and criteria, including compatibility with the scientific consensus, reading comprehension, likelihood and extent of harm, recall of clinical knowledge, completeness of responses, potential for bias, and helpfulness. This evaluation framework, although still a pilot and not exhaustive, is much more thorough than similar previous work and reveals significant limitations of even the best-performing LLMs for this task, such as generation of inappropriate and potentially harmful responses. These limitations will need to be overcome for LLMs to be used in real-world medical Q-A settings, and more research is needed in the area of evaluation frameworks for LLMs.”
Prof James Davenport, Hebron and Medlock Professor of Information Technology, University of Bath, said:
“The press release is accurate as far as it goes, describing how this paper advances our knowledge of using Large Language Models (LLMs) to answer medical questions. But there is an elephant in the room, which is the difference between “medical questions” and actual medicine. Practicing medicine does not consist of answering medical questions – if it were purely about medical questions, we wouldn’t need teaching hospitals and doctors wouldn’t need years of training after their academic courses.
“This gap is best illustrated by the emergency room doctor’s experience described at https://inflecthealth.medium.com/im-an-er-doctor-here-s-what-i-found-when-i-asked-chatgpt-to-diagnose-my-patients-7829c375a9da.
“The Nature paper, and press release, state that “Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks”. The authors have built a much larger benchmark, but one with precisely the same limitations. In particular, the authors do not consider the question of how much of their benchmark is already known, or might be known, to the LLMs.
“The authors do an excellent job of evaluating the answers, using “physician and lay user evaluation to assess multiple axes of LLM performance beyond accuracy on multiple-choice datasets”, but this is evaluating the answers to medical questions, not the diagnosing (and possibly solving) of genuine clinical problems.”
Prof Anthony G Cohn FREng, Professor of Automated Reasoning, University of Leeds, and Turing Fellow at The Alan Turing Institute, said:
“Whilst impressively improved performance on multiple-choice question (MCQ) answering has been achieved on the MultiMedQA benchmark, the accuracy levels are still below human expert-level performance. The authors report many limitations which will need to be addressed before LLMs could possibly be considered for actual use in the medical domain, which is indeed “complex” as the authors note. Whether LLMs could ever reach the required levels of robustness, safety, ethical adherence and alignment to human values remains an open question.
“LLMs have been widely reported to reflect bias in the training set, and this needs further investigation, as do the questions of safety and equity which the authors raise. Although MCQs are routinely used as part of the examination of medical students, they are not truly reflective of the situation a health professional finds themselves in when confronted with an actual patient, which requires generation of a diagnosis and treatment plan without the benefit of a pre-given shortlist of possibilities. Indeed, it is interesting, and not entirely surprising, to note that when the LLM was evaluated by human evaluators rather than mechanical MCQ scoring, performance dropped back considerably, even after further tuning.
“Contamination of the training data by the test data is always a concern when evaluating machine learning systems. The authors have conducted some analysis and suggest that any contamination does not explain the level of results achieved, but without further detailed information on the training set this cannot be verified.
“The authors report that they use different prompts for each of the benchmarks to elicit the best performance. Whilst this is common practice, it does not reflect the real-life situation when a doctor is in front of a patient – a real patient does not have to think about the best “prompting strategy” to get a good answer from their doctor, and will not want to, or be expert enough to, do so if consulting a virtual “LLM doctor”.
“The problem of “hallucinations” that LLMs are prone to (i.e. giving false information) has been widely reported in the literature, and is briefly mentioned when discussing safety. Because of the statistical nature of LLMs, it is likely that this will always be a problem, and thus LLMs should always be regarded as assistants rather than the final decision makers, especially in critical fields such as medicine; indeed, ethical considerations make this especially true in medicine, where the question of legal liability is also ever present.
“A further issue is that best medical practice is constantly changing, and the question of how LLMs can be adapted to take such new knowledge into account remains a challenging problem, especially when they require such huge amounts of time and money to train.
“Finally, I would comment on the title of the paper, “Large language models encode clinical knowledge”: the key word here is “encode” – the knowledge is not explicit in the model. This has some advantages in terms of flexibility, but also disadvantages: it cannot be directly inspected or verified, unlike in a traditional symbolic knowledge base. Moreover, finding the limits of the model’s knowledge is an impossible task, except by sampling it with questions such as those found in the benchmark. Whilst a human doctor is likely to know when they are unsure of an answer, and when to seek a second opinion, this ability is not (yet) present in LLMs.
“Note that I have not been able to view the supplementary material yet when composing this commentary.”
‘Large language models encode clinical knowledge’ by Karan Singhal et al. was published in Nature at 16:00 UK time Wednesday 12 July 2023.
DOI: 10.1038/s41586-023-06291-2
Declared interests
Prof Tony Cohn: “No conflicts of interest to declare. I do not have any medical training so these comments are offered as a computer scientist working in the field of the evaluation of Foundation Models.”
Prof James Davenport: Prof Davenport was not involved in the research and has no commercial interests. He does sit on various AI standardisation bodies.
For all other experts, no reply to our request for declarations of interest was received.