October 10, 2022

expert reaction to studies presenting a standardized approach for quantifying the strength of evidence that supports associations between risk factors and health outcomes

Six studies published in Nature Medicine present a standardized approach for quantifying the strength of evidence that supports associations between risk factors and health outcomes.

Prof Sir David Spiegelhalter, Chair, Winton Centre for Risk and Evidence Communication, University of Cambridge, said:

“This is an interesting proposal to give a star-rating to all those newsworthy claims about how routine things we are exposed to, for example diet and lifestyle, increase our risk of bad outcomes such as cancer. It is based, essentially, on the size of the minimum plausible increased risk linked to someone with an average exposure. It follows a complex, almost algorithmic, process for combining evidence from multiple studies, in which outlying results are discarded and the remainder pooled to produce an overall risk profile.

“I see two main problems. First, as they acknowledge, the procedure avoids expertise about the topic being studied, and I am suspicious of attempts to automate scientific understanding. Second, it only deals with relative risks, and so cannot take into account practical importance: we may be confident that a five-star exposure doubles your risk, but twice tiny is still tiny.”
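The point that "twice tiny is still tiny" can be made concrete with a little arithmetic. The sketch below is purely illustrative: the baseline risks (1 in 100,000 and 1 in 10) and the function name `absolute_risk_increase` are assumptions chosen for the example, not figures or methods taken from the studies.

```python
def absolute_risk_increase(baseline_risk: float, relative_risk: float) -> float:
    """Extra absolute risk for exposed people.

    baseline_risk: probability of the outcome for unexposed people (0 to 1).
    relative_risk: risk in exposed people divided by risk in unexposed people.
    """
    if not 0.0 <= baseline_risk <= 1.0:
        raise ValueError("baseline_risk must be a probability between 0 and 1")
    if relative_risk < 0.0:
        raise ValueError("relative_risk must be non-negative")
    return baseline_risk * relative_risk - baseline_risk


# Hypothetical numbers: the same doubling of risk (relative risk of 2)
# applied to a rare outcome and to a common one.
rare = absolute_risk_increase(baseline_risk=1 / 100_000, relative_risk=2.0)
common = absolute_risk_increase(baseline_risk=1 / 10, relative_risk=2.0)

print(f"Rare outcome:   {rare * 100_000:.0f} extra case(s) per 100,000 people")
print(f"Common outcome: {common * 100_000:.0f} extra case(s) per 100,000 people")
```

Under these assumed baselines the relative risk is identical in both cases, yet the number of extra cases differs by several orders of magnitude, which is why a relative measure alone cannot convey practical importance.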

Dr Duane Mellor, Registered Dietitian and Senior Teaching Fellow, Aston Medical School, Aston University, said:

“These are an interesting set of papers, which aim to produce a simple star rating to highlight the burden of proof for the link between a risk factor and a health outcome, with the authors spotlighting smoking, blood pressure, low vegetable consumption and unprocessed red meat intake. The approach gave more stars (maximum of 5) if there was a clearer relationship between risk and that risk behaviour. These papers appeared to suggest there was only weak evidence to associate unprocessed red meat intake with diseases such as colon cancer, heart disease and type 2 diabetes. This was estimated using a new statistical approach which combines the results from previously published research papers which measured intake of unprocessed red meat and the chances of developing these diseases. This finding is perhaps not that surprising, as the associated risk between unprocessed red meat intake and disease in most studies is generally very low. Typically it is intake of processed red meat, such as bacon and sausages, which has been associated with a higher risk of disease, which these papers did not report on.

“Although only weak evidence was seen with unprocessed red meat, a clearer link was seen between low vegetable intake and stroke (both haemorrhagic and ischaemic types), ischaemic heart disease, oesophageal cancer and type 2 diabetes, with ischaemic stroke and ischaemic heart disease gaining 3 and 2 stars. This is far below the 5 star good quality evidence reported in these papers for the link between blood pressure and these conditions. But it needs to be remembered that blood pressure is the result of many different risk factors and behaviours, whereas vegetable intake is just one part of an overall diet.

“Although these analyses are interesting, they perhaps only confirm that it is hard to accurately measure diet, with authors acknowledging that their methods were not able to account for biases in individual studies, and with studies measuring diet intake in a range of ways, from food diaries to questionnaires, which may lead to inaccurate reporting of food intake. Also, it is plausible that people who eat more of one type of food tend to eat less of another. So, in this case people who eat less red meat may eat more vegetables, so it might have been more appropriate to look at an overall dietary pattern rather than one component of a healthy or unhealthy diet. It is also interesting that only unprocessed red meat was considered, given the data for processed red meat is much clearer.

“Overall, although these papers are interesting in highlighting the strength and quality of evidence of risk factors for diseases like heart disease, they may lack the sensitivity and ability to look at how a diet changes as someone eats more or fewer vegetables. Also, as they tend to rely on studies where food intake is only measured once, this assumes a snapshot of someone’s diet is a reliable measurement of their diet over decades. So, although this approach might work for a behaviour like smoking which you either do or don’t do, when it comes to diet, which is more subtle and where one type of food can be swapped for another, it may not be appropriate to measure the effect of just one part of the diet on the risk of developing disease.

“So, ultimately the findings of this study do not really suggest that current recommendations for a healthy diet, basing it on plenty of vegetables, fruit, pulses, nuts and seeds with modest amounts of meat, fish, dairy or alternatives need to change. As when we change one food in our diet, it is replaced with another, it is almost as if in studies like this researchers are suggesting we compare apples with pears – both statistically and almost literally!”

Prof Kevin McConway, Emeritus Professor of Applied Statistics, The Open University, said:

“I suspect that what’s interesting about this set of studies to me, a statistician, may well not match what is interesting to most journalists and members of the public. So really what I’ve done is to try to clarify what’s been done and what’s new about it.

“My overall feeling is that it’s a little hard to see exactly what the point of the new approach is, in that it boils down what might be a great deal of complexity from a number of studies into a few numbers, or indeed to a single star rating (one to five stars), and a great deal is inevitably lost in that process, however cleverly it is done. It’s true that the researchers say that their measures should be considered alongside other existing methods of summarising the results of many studies. Their method is statistically plausible, though the facts that it’s pretty complex and has not yet been widely applied mean that I’m not rushing to judgment on precisely how well it works. But I’m far from sure, so far, what its place will be amongst the other approaches, or to whom it will really be useful. I’m also a little concerned that it appears to produce only relative measures of risk, while in many circumstances absolute measures might be more relevant to what people need to know.

“Another important issue is that the star ratings apparently can’t, on their own, distinguish between a situation where there is good evidence that an association between an exposure to a potential risk and a health outcome is very weak or non-existent, and a situation where there is not (yet) good evidence of how strong the association might be. This is problematic, because what should be done about a low star rating would be different in the two cases – either don’t pay much attention to this risk (if there’s good evidence that the association is very weak), or get more evidence on how big the risk is (if the evidence so far is inadequate).

“In a way, the four papers looking at specific exposures (unprocessed red meat, vegetables, smoking, high systolic blood pressure) and various health outcomes don’t tell us much that was not already known about the risks and about where there’s good evidence – but they do demonstrate that the new methods do work in circumstances that are already pretty well understood, giving an indication that they are likely to work in other areas too. Also, those papers do arguably make some of the uncertainties about the risks even clearer, which is a good thing in my view. And they show yet again (though it bears repetition) the importance of looking at all the data, not just the latest finding, and of taking the uncertainties into account properly. (I won’t comment further on those papers, interesting though they are.)

“In many biomedical research studies, the aim is to measure the association, if any, between the level of exposure to some potential risk, such as smoking cigarettes or eating certain forms of meat, or not eating many vegetables, and some potential health outcome, such as being diagnosed with lung cancer or having a heart attack. Ideally this would be done in such a way that it gives an idea of how far the exposure causes the health outcome, though that can be difficult. Sometimes the exposure might reduce the risk of an adverse health outcome; for example eating lots of vegetables may reduce the risk of some diseases.

“Investigating these risks or benefits is, however, problematic in several ways. Humans are complicated biologically, and we’re exposed to all kinds of potential risks or potentially beneficial influences, at the same time, which usually makes it difficult to get to the bottom of what might be causing what. So individual pieces of research are often subject to many potential biases. As one example, many studies of the association between eating certain foods and being diagnosed with various illnesses have to be observational – that is, people are asked to record what they eat, and are followed up to see if and when they become ill with the diseases being studied. The problem is that people who eat different amounts of the foods involved will also differ in terms of many other factors too, and those factors might be the real cause of any associations with disease that the researchers find. There are statistical methods of taking these other factors, which are called potential confounders, into account, but one can never be sure that everything relevant has been dealt with in this way. So it’s pretty well impossible that the results of a single observational study of this kind can actually establish that an exposure is related to a health consequence in a cause-and-effect way. So there’s a risk of bias.

“Potential risks of bias occur in other types of study too, though perhaps for different reasons. In a randomised clinical trial (RCT), the researchers can get round some of the issues of cause and effect, because they allocate the treatments at random to the participants. So on average, groups of people who receive different exposures are the same in terms of other factors. But in many RCTs the participants are not very typical of the population who will eventually be treated with the treatments being investigated, for various reasons, and often the trials go on for a relatively short time compared with the length of time that the treatments will be used for if they go into general use. So again there can be a risk of bias in taking the results of RCTs to indicate what might happen in a general population outside the context of an RCT.

“If a possible association between an exposure and the risk of disease looks as if it is important or medically interesting, it’s very unlikely that only one study of that exposure will have been done. There are often lots of studies, and for many reasons their quantitative results won’t all be the same. Different populations and types of participants might have been involved, the ways of measuring the exposure may differ, the statistical adjustments for potential confounders are often different, and so on. But looking at patterns of association and risk across many studies can be very informative, and pieces of research that put together findings from several studies of the same association have been around for a long time. They would often go under the name of systematic reviews, and the statistical techniques used to summarise the results of the different studies generally go under the name of meta-analysis.

“The thread running through all the new research papers could be thought of as a new way of doing meta-analysis, one that takes into account more than has usually been the case. The final aim is to boil down the findings from all the studies of a particular association between a potential risk and a particular health outcome into a single number, called a Risk-Outcome Score or ROS – and indeed to go one step further and to boil down the ROS for a particular pair of a potential risk and a health outcome to a star rating. Five stars means that there’s a consistent, strong association between the risk and the outcome, across many studies, so very strong evidence that an association exists. One star is the lowest star rating, and means that there’s no evidence of an association. Two stars means weak evidence of an association, three stars means moderate evidence, and four means strong evidence (but not as strong as for five stars).

“I’ve mainly been talking in the context of an exposure, such as cigarette smoking, that’s generally likely to be harmful to health, but the ROS scale and the star ratings can work in the other direction too, and look at possible associations between an exposure that may well improve a health outcome, and the outcome. (For example, associations between consuming more vegetables and various diseases such as strokes and some cancers.)

“There have been many approaches to combining the results from different studies before, of course. What’s really new about this approach, I’d say, is that it incorporates several different aspects and measures of the quality of each of the included studies into the star rating (and indeed into the ROS, and, further back, into the quantity from which the ROS is calculated, the BPRF or Burden of Proof Risk Function). Systematic reviews and meta-analyses in the past have typically reported various aspects of the quality of the studies they reviewed, but have not generally incorporated them into their overall numerical results. The new approach, to a considerable extent, puts together existing statistical approaches into a new overall package, though there are some statistical novelties in there too, and some issues in previous meta-analytic approaches have been dealt with.

“My overall feeling is that, if one does want to reduce a set of studies of association between a potentially risky exposure and a health outcome to one number, or a few numbers, this new method seems to be a plausible and generally appealing way to do that – though really I reserve judgement, because the method is complicated, so far we’ve only seen the results from a small set of demonstration applications of it, and I need to think about it much more and see it in wider operation to be clearer on how well it works. But before I even get that far, I feel the need to ask what it is useful for (apart from keeping statisticians in gainful employment).

“This approach certainly makes the final result simple, by putting everything into a single score – but the process boils down what will typically be the results of a considerable number of different studies, each with its own peculiarities, to a single point on a five-level star scale. (They don’t award fractions of stars – it’s got to be exactly 1, 2, 3, 4 or 5 of them.) Obviously that will allow a large number of exposure-outcome pairs to be reported in a simple table or diagram – but one has to wonder what is being lost in the process of boiling down the complexity of all those original studies.

“Who might actually use the findings? The methodology paper by the researchers (the one by Peng Zheng as lead author, with many others) discusses two types of potential user – individuals making health choices, and policy-makers making decisions on guidelines on health behaviour (such as recommended diets) and related aspects of health policy. It’s interesting that potential use by clinicians is hardly mentioned explicitly. But for individuals and for policy-makers, the authors make it clear that an obvious simplistic approach, of just worrying about potential risks that have a lot of stars, isn’t entirely appropriate.

“Yes, risk-outcome pairs with four or five stars indicate that the exposure probably makes a considerable difference to the risk. The way that the scores are calculated is conservative, in the sense that, if there is considerable uncertainty about the size of the risks, what is reported is essentially the lowest level of risk that remains statistically consistent with the data – so, if one could magically know the true risk, it could sometimes be higher than the star rating indicates. So a four- or five-star exposure-outcome pair, you might think, should be something that an individual should definitely seek to avoid.

“But this misses out the fact that the risk measures used in these scores are relative – they basically measure the increase in risk of the disease or other outcome, if one is exposed to cigarette smoke or whatever it is, as a percentage of the risk of the disease at a pre-determined low level of exposure (that could be zero exposure). Now, even if there’s strong evidence of an increased risk in exposed people, and even if that risk might be 50% or more greater than for unexposed people, then if the risk for unexposed people happens to be very low, an individual may have very little to worry about from having a risk 50% higher than that very low level. Adding on half of a very low number to the original very low number might well lead to a slightly less low number that an individual is happy to bear, if the exposure is something they like doing. An individual would have to look further than just the relative measure given by the ROS and star rating to make a decision. The researchers who developed the new measures do point this out clearly, but the lack of an interpretation of the proposed measures in terms of absolute risks is important in my view.

“But isn’t it the case that a risk-outcome pair with a low rating, one or two stars, is less of a concern? Well, maybe, but the researchers rightly point out that a one-star or two-star rating might arise, not because the risk is actually low or zero, but because the quality of evidence for that risk is low. Particularly for policy-makers, the researchers write that “The precautionary principle implies that public policy should pay attention to all potential risks”, and they point out that policies that reduce exposure even to risks with one or two stars would, on average, improve health. But resource constraints mean that not everything that might improve health can, in practice, be done. It would be good if the star ratings, ROSs and BPRF could guide the necessary resource decisions, but the facts that those measures are relative, and that a lot more has to be taken into account than just the evidence and strength of association, mean that the authorities need to consider far more than just a simple score, in my view.

“It’s perhaps worth pointing out that the standard summary of findings from a Cochrane Review of studies, that comes from a very long-established global collaboration, provides three pieces of data for each exposure-outcome pair that it considers, not just one. There are two absolute risks, typically for unexposed people and for people with a stated level of exposure, so that an absolute risk comparison can be made (even if the original studies provided only relative measures). There is also a score related to the quality of the evidence, the so-called GRADE score, on a four-point star scale. Using these tables does require the reader to put together the quality measure and the quantitative risk measures themselves, though it could be argued that the freedom to do that would allow for different trade-offs in different circumstances, whereas the approach in the new research is arguably one-size-fits-all. But the Cochrane approach would usually make the comparison only for one level of exposure, whereas the new score averages over different levels of exposure, which might well be preferable (if harder to understand) in certain circumstances. Also the researchers on the new approach do suggest that their ratings and scores are considered alongside the results from Cochrane Reviews and other relevant sources, so it’s not as if they are claiming that their approach should supersede the Cochrane approach.

“The researchers who developed the new method do mention several limitations of their approach. Several of these are technical aspects of the statistics, but not all of them. They mention that some issues of study quality may be “hard to capture” as part of their method of adjusting the scores to allow for the quality of studies – so something important could be missed by the aim of providing a standard method. Also, the method for allowing for bias can’t do its job properly if all, or the great majority of, the studies that are included are biased. One possibility here might be if all the available studies are observational, and none of them adjusted for an important potential confounder, which can happen if it’s very difficult or impossible to get data on that potential confounder. The fact that the new method averages the increase in risk over a number of different levels of exposure, while it will often be helpful, may produce strange results if (say) only very high levels of exposure affect the risk. In that case, looking at more detail of the calculation results would make this clear, but the researchers don’t appear to want to go further, writing: “Giving different star ratings to different ranges of exposure would, however, add a further degree of complexity that we sought to avoid.” Not every type of study can be included – for instance there is no direct way of including animal studies. One has to remember that a lot of the evidence for the health harms caused by cigarette smoke came from animal studies (though the evidence from observational studies in humans, on smoking and many diseases, is anyway pretty overwhelming).”
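The "conservative" scoring described above, which reports roughly the lowest risk that remains statistically consistent with the data, can be illustrated with a one-sided bound on a pooled relative risk. This is a minimal sketch and not the authors' Burden of Proof calculation: the normal approximation on the log scale, the 5% one-sided level, the example numbers and the function name `conservative_relative_risk` are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def conservative_relative_risk(pooled_rr: float, se_log_rr: float,
                               level: float = 0.05) -> float:
    """One-sided bound on a relative risk, taken on the side nearest 1 (no effect).

    Assumes log(relative risk) is approximately normal with standard error
    se_log_rr. For a harmful exposure (RR >= 1) this is a lower bound; for a
    protective exposure (RR < 1) it is an upper bound.
    """
    if pooled_rr <= 0.0 or se_log_rr < 0.0:
        raise ValueError("pooled_rr must be positive and se_log_rr non-negative")
    z = NormalDist().inv_cdf(1.0 - level)  # about 1.645 for level=0.05
    log_rr = math.log(pooled_rr)
    if log_rr >= 0.0:
        bound = log_rr - z * se_log_rr  # pull a harmful estimate down towards 1
    else:
        bound = log_rr + z * se_log_rr  # pull a protective estimate up towards 1
    return math.exp(bound)


# Hypothetical: the same pooled estimate (RR = 1.5) with precise and imprecise evidence.
precise = conservative_relative_risk(pooled_rr=1.5, se_log_rr=0.05)
imprecise = conservative_relative_risk(pooled_rr=1.5, se_log_rr=0.30)
print(f"Precise evidence:   conservative RR = {precise:.2f}")
print(f"Imprecise evidence: conservative RR = {imprecise:.2f}")
```

In this toy setting, imprecise evidence pulls the conservative bound past 1 even though the central estimate is unchanged, which mirrors the concern that a low rating may reflect weak evidence rather than a small risk.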

‘The Burden of Proof studies: assessing the evidence of risk’ by Peng Zheng et al. was published in Nature Medicine at 16:00 hours UK time Monday 10 October 2022. DOI: 10.1038/s41591-022-01973-2

‘Health effects associated with consumption of unprocessed red meat: a Burden of Proof study’ by Haley Lescinsky et al. was published in Nature Medicine at 16:00 hours UK time Monday 10 October 2022. DOI: 10.1038/s41591-022-01968-z

‘Health effects associated with vegetable consumption: a Burden of Proof study’ by Jeffrey D. Stanaway et al. was published in Nature Medicine at 16:00 hours UK time Monday 10 October 2022. DOI: 10.1038/s41591-022-01970-5

‘Effects of elevated systolic blood pressure on ischemic heart disease: a Burden of Proof study’ by Christian Razo et al. was published in Nature Medicine at 16:00 hours UK time Monday 10 October 2022. DOI: 10.1038/s41591-022-01974-1

‘Health effects associated with smoking: a Burden of Proof study’ by Xiaochen Dai et al. was published in Nature Medicine at 16:00 hours UK time Monday 10 October 2022. DOI: 10.1038/s41591-022-01978-x

‘The Global Burden of Disease Study at 30 years’ by Christopher J. L. Murray was published in Nature Medicine at 16:00 hours UK time Monday 10 October 2022. DOI: 10.1038/s41591-022-01990-1

Declared interests

Dr Duane Mellor: “No conflicts of interest to declare.”

Prof Kevin McConway: “I am a Trustee of the SMC and a member of its Advisory Committee. My quote above is in my capacity as an independent professional statistician.”

For all other experts, no reply to our request for declarations of interest was received.
