Sometimes You Just Have to Choose

The bread and butter of my work life is thinking about the question of “Does XX treatment affect YY outcome, and how much?” Does caffeine impact miscarriage rates? Do epidurals cause C-sections? Does breastfeeding make your kid thinner? Do vaccines reduce the risk of COVID-19 transmission to others? Do COVID-19 antibodies in breastmilk protect my baby? And, if yes, how big is that effect?

I spend a lot of time in this newsletter talking about why these questions are difficult to answer and, often, highlighting issues with distinguishing correlation from causality. It’s difficult to establish a causal link between breastfeeding and child weight, because there are other underlying factors (parental weight, resources) which influence both the choice to breastfeed and later child weight.

Today, though, I want to talk about a different problem with learning from data, one which arises even if we are able to put aside these correlation-versus-causality concerns: statistical power. Specifically, I want to talk about why it can be very difficult to learn about treatment effects when the outcome is rare.

(Bear with me through some of the theory and at the end I’ll come back to why this is important to keep in mind when we think about evidence.)

Example: Vaccine Trials

It is simplest to think about these issues in a context we understand well, so let’s focus on vaccine trials. When a drug company runs a vaccine trial — for COVID-19 or anything else — they ensure their effects are causal by randomizing. They take a sample of people and pick some of them to get the vaccine and others to get a placebo. They do not tell them which they got, and then they track the participants over time and see how many get sick in each group. Evaluating the success of the vaccine depends on comparing disease rates in the vaccinated to the unvaccinated group.

When results are reported from trials like this, they typically report the difference in infection rates, along with a “p-value”. So, the results will be something like: 2% of unvaccinated participants and 1% of vaccinated participants were infected (p<0.05). Colloquially, the p-value is taken to measure how confident we are in the results, how sure we should be that there is a difference in infection rates.

More formally, the p-value relates to sampling.

Imagine that there is some true effect of the vaccine, which we could learn about if we had a randomized trial run over the entire population of the world. We do not have that. What we have is a trial run over a (much smaller) sample of people at a particular point in time. Because we do not see the entire world, we worry that any results we see in the sample might be just due to chance.

For example: imagine your trial was 10 people, and you vaccinated 5 of them, and 1 of the unvaccinated people got the disease versus none of the vaccinated people. In terms of infection shares, your vaccine looks great — 20% of the unvaccinated group was infected versus 0% of the vaccinated group! — but it seems very plausible this just occurred by chance. Imagine if, instead, you had 10,000 people with 5,000 vaccinated and 20% of the unvaccinated got the disease versus none of the vaccinated. Intuitively, this should make you a lot more confident about the effectiveness of the vaccine, because it feels less likely this was some kind of chance coincidence.

The p-value is a way to measure this. A p-value of 0.05 means, formally, that if the true impact of the vaccine were zero, and you ran this same study (with different samples) 100 times, in only about 5 cases would you get a result at least as extreme as what you observed. Put differently: a p-value of 0.05 implies it’s unlikely you’d see an effect of this magnitude by chance alone. The smaller the p-value, the harder it is to explain the result as a statistical accident.
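To make the 10-person example concrete, here is a minimal sketch that counts, under the assumption that the vaccine does nothing, how often the infections would split between the groups at least as lopsidedly as observed. This is a one-sided exact (Fisher-style) calculation, which is an assumption on my part and not necessarily the test a trial would report; the function name and the 50-per-group comparison are made up for illustration.

```python
from math import comb


def one_sided_exact_p(cases_unvax, n_unvax, cases_vax, n_vax):
    """Chance that at least `cases_unvax` of the cases land in the
    unvaccinated group if the vaccine truly does nothing."""
    total_cases = cases_unvax + cases_vax
    n_total = n_unvax + n_vax
    # With no vaccine effect, every way of spreading the cases across
    # all participants is equally likely.
    all_ways = comb(n_total, total_cases)
    most_possible = min(total_cases, n_unvax)
    as_extreme = sum(
        comb(n_unvax, k) * comb(n_vax, total_cases - k)
        for k in range(cases_unvax, most_possible + 1)
    )
    return as_extreme / all_ways


# The 10-person trial: 1 of 5 unvaccinated infected, 0 of 5 vaccinated.
tiny_trial = one_sided_exact_p(1, 5, 0, 5)
# Hypothetical comparison: same 20% vs. 0% shares with 50 people per group.
bigger_trial = one_sided_exact_p(10, 50, 0, 50)
print(f"10 people:  p = {tiny_trial:.3f}")
print(f"100 people: p = {bigger_trial:.5f}")
```

In the 10-person trial the single infection is a coin flip between the two groups, so the p-value is 0.5. With the same shares in 100 people it falls well below 0.01.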

It’s easy(ish) to see why having a larger sample can make you more confident in results, but it may be harder to see why the overall risk of the outcome matters. But it does.

Consider the big trial, with 5,000 people in each group. But now, rather than 20% of the unvaccinated group getting infected, it’s 0.02% (1 person rather than 1,000 people). You’re still at zero infections in the vaccinated group. At this point, it feels much more difficult to draw strong conclusions. Your intuition is much more likely to make you think, well, that one infection could have been a random thing. And the statistics will bear that out. With a 20% infection rate in the unvaccinated group, the p-value for the comparison of means is tiny, less than 0.000001. With a 0.02% infection rate, it’s 0.31, not considered significant.
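Here is a rough sketch of how p-values like these can be computed. It assumes a two-sided, pooled two-proportion z-test, which may not be exactly the comparison used for the figures above; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(cases_a, n_a, cases_b, n_b):
    """Two-sided p-value for a difference in infection rates
    (pooled z-test, normal approximation)."""
    pooled = (cases_a + cases_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        # No variation at all (nobody or everybody infected): no evidence.
        return 1.0
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (cases_a / n_a - cases_b / n_b) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# 1,000 of 5,000 unvaccinated infected vs. 0 of 5,000 vaccinated.
common = two_proportion_p_value(1000, 5000, 0, 5000)
# 1 of 5,000 unvaccinated infected vs. 0 of 5,000 vaccinated.
rare = two_proportion_p_value(1, 5000, 0, 5000)
print(f"common outcome: p = {common:.6f}")
print(f"rare outcome:   p = {rare:.2f}")
```

Under these assumptions the common-outcome comparison gives a p-value indistinguishable from zero, while the single infection gives roughly 0.32, in line with the figure quoted above. The normal approximation is shaky with only one case, so treat the second number as a ballpark.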

If the sample were, instead, a million people, then you’d be able to statistically detect an infection rate difference of 0.02%.

In the end, our ability to draw strong statistical conclusions depends on both the number of people in the sample and the risk of the outcome. This is what researchers mean by statistical power: the larger your sample, the smaller an effect you have the “power” to detect.
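A back-of-the-envelope sketch of that tradeoff: given assumed true infection rates and a group size, how likely is the trial to come out statistically significant? This uses a standard normal approximation, which is crude when only a handful of infections are expected. The function name is hypothetical, and treating the rates from the example (with a vaccinated rate of zero) as the true rates is an assumption made purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(rate_unvax, rate_vax, n_per_group, alpha=0.05):
    """Rough chance that a trial of this size yields p < alpha, given
    assumed TRUE infection rates (normal approximation, two-sided test)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    pooled = (rate_unvax + rate_vax) / 2
    # Spread of the observed difference if there were no true effect...
    se_null = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    # ...and its spread under the assumed true rates.
    se_alt = sqrt(
        (rate_unvax * (1 - rate_unvax) + rate_vax * (1 - rate_vax)) / n_per_group
    )
    gap = abs(rate_unvax - rate_vax)
    return NormalDist().cdf((gap - z_crit * se_null) / se_alt)


for rate, n in [(0.20, 5_000), (0.0002, 5_000), (0.0002, 500_000)]:
    print(f"rate {rate:.2%}, {n:,} per group: power ~ {approx_power(rate, 0.0, n):.2f}")
```

With these inputs, the common outcome is detected essentially every time at 5,000 per group, the rare outcome is detected well under half the time at 5,000 per group, and it is detected essentially every time at 500,000 per group.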

So interesting! (Or maybe not.) Why am I telling you this?

The limits of studies

Understanding the idea of statistical power is crucial for thinking about what we can possibly learn from data.

A lot of the questions parents have, whether in the COVID-19 context or elsewhere, are about very rare outcomes. Monday’s newsletter had a brief discussion of the SNOO and SIDS risk. Several people wrote to ask: does the SNOO actually reduce SIDS? To answer this convincingly in a randomized trial, you’d need an enormous sample size. Detecting even a 50% reduction in SIDS risk, which would be an astronomical effect, would require a sample size of between 100,000 and 150,000 infants, which may be impractical even putting aside the cost of the SNOO. It just isn’t feasible to answer the question this way.
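For a sense of where numbers of this size come from, here is a sketch of the textbook sample-size formula for comparing two proportions, assuming a two-sided 5% significance level and 80% power (conventional choices, not stated above). The baseline risk of 4 in 10,000 is a placeholder chosen for illustration, not an actual SIDS rate, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

# HYPOTHETICAL baseline risk, for illustration only (not a real SIDS figure).
HYPOTHETICAL_BASELINE = 0.0004


def infants_per_group(baseline_risk, relative_reduction, alpha=0.05, power=0.80):
    """Approximate infants needed in EACH arm of a randomized trial to
    detect the given relative reduction in a rare outcome."""
    treated_risk = baseline_risk * (1 - relative_reduction)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = baseline_risk * (1 - baseline_risk) + treated_risk * (1 - treated_risk)
    effect = baseline_risk - treated_risk
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


needed = infants_per_group(HYPOTHETICAL_BASELINE, 0.50)
print(f"about {needed:,} infants per group")
# Halving the baseline risk roughly doubles the requirement.
print(f"about {infants_per_group(HYPOTHETICAL_BASELINE / 2, 0.50):,} per group")
```

With this placeholder baseline the requirement comes out above 100,000 infants per group, and it grows quickly as the outcome gets rarer or the effect gets smaller.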

Or think about this question: “How much breastmilk do I need to give my infant to give them protective antibodies against COVID-19?” To get at this, you’d need enough statistical power to differentiate across breastfeeding levels. This is basically impossible. Even answering the simpler question “Does breastfeeding protect infants from COVID-19?” is difficult. We know breastmilk contains antibodies, but do they actually lower infant disease risk given the delivery mechanism, etc.? Infants are unlikely to be infected with COVID-19, and certainly very unlikely to be detectably ill. So you’d need a huge, huge sample of babies to answer even this pretty basic version of the question.

To circle back to vaccines: in Phase 3 vaccine trials this year we saw very compelling results for adults, with highly significant impacts on COVID-19 rates. The results for adolescents were also statistically compelling, despite smaller sample sizes. We are expecting vaccine data for children ages 2-11 in the fall and, frankly, I’d be surprised if we see the same statistical power. It is going to be very difficult to draw strong statistical conclusions about efficacy in this population in this time period. It’s a low-risk group and the study data will be from a lower-risk time; the number of children in either group who get COVID-19 is likely to be very low.

In a way, this is good! It will be a good sign about this population and the overall course of the pandemic. And we will still be able to learn about side effects and how kids tolerate vaccines. But it also means our statistical confidence about efficacy is going to be more limited.

And, finally, these limitations mean that some of the small-risk concerns people raise about the vaccines (What if there is a negative impact on some organ function for 1 out of every 10,000,000 people?) are simply not something we’ll ever be able to convincingly dismiss. Detecting an effect that small is not feasible in trials, or in realized post-trial data. We can point out that you take a lot of serious health risks which are well in excess of 1 in 10 million, but we cannot literally rule this out.
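One way to see the scale of the problem is to ask how many people you would need to follow, observing zero cases, before a risk of that size starts to look implausible. The sketch below assumes independent outcomes and a 95% confidence threshold; both are simplifying assumptions, and the function name is hypothetical.

```python
from math import ceil, log, log1p


def people_needed_to_rule_out(risk, confidence=0.95):
    """Smallest group in which seeing ZERO events would be surprising
    (probability below 1 - confidence) if the true risk were `risk`."""
    # P(zero events in n people) = (1 - risk) ** n; solve for n.
    return ceil(log(1 - confidence) / log1p(-risk))


needed = people_needed_to_rule_out(1 / 10_000_000)
print(f"about {needed:,} people with zero events")
```

This lands at roughly 30 million people, consistent with the usual "rule of three" shortcut (about 3 divided by the risk). And that is the best case: it requires not a single event in the group, with no background rate of the same problem to muddy the comparison.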

Confronting the limits of statistical power is frustrating, but maybe also liberating. If you’re someone who likes to use evidence to make decisions, there can be a kind of decision paralysis in realizing that the evidence you want doesn’t exist. It’s easy to start to think, okay, I’ll wait to make this decision until the evidence is good enough to be helpful. What this discussion forces us to recognize, though, is that there are some cases where that will never happen.

In these cases, we have to make the decision with whatever imperfect data we do have, combined with other factors (like, say, preferences). It may not be the way we most want to make decisions, but at least it may allow us to move on.