Selection in Practice: The Value of Randomness

Lately, we all seem to be wearing a lot of hats; I certainly feel like I have five or six too many on at the moment. But even away from the world of COVID, I tend to divide my professional life into two arenas. The first is my book-author-pregnancy-and-baby-lady-hat, from which this newsletter primarily comes. And, second, there’s my research.

With that second hat on, in the past few years I’ve been thinking a lot about statistical methods, about selection, and about sampling. One of my most recent publications is about the problems of learning about a general population of people from a selected sample of volunteers. That paper is pretty technical (it’s written with an actual econometrician, which I am not), and feels a far cry from much of what I do here.

And yet: sometimes the two arenas merge, and lately I have been thinking a lot about how these problems of sampling arise in the context of the virus. And, in particular, in the question of how we can understand the level of current virus exposure.

Exposure Rates: The Problem

There are a lot of unanswered questions in COVID-19 — how far it travels in the air, how best to treat it, why some groups and people are so much more affected than others.

I believe among the very biggest questions is simply how widespread the virus is — how many people have already been infected? This is an extremely important question, but it’s also very hard. Why?

A lot of our predictions about the path of the virus (and the world, the economy, etc) over the next few months rely on epidemic modeling. Many of these models are forms of an “S-I-R” model — “Susceptible-Infected-Recovered” — which chart out dynamics as a population moves from entirely virus-susceptible to infected and finally into recovered.

The results from these models vary a lot; the basic structure of the models is mostly similar, but depending on what numbers you insert in them, they give wildly different answers! We’ve seen that with movements in predictions about hospitalizations and deaths over time. To make the models better — both to figure out which ones are right, and to improve the best ones — we need to fit them to data.
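To make that sensitivity concrete, here is a minimal sketch of a textbook S-I-R model in Python. It is not any particular group's forecasting model, and the numbers are illustrative assumptions rather than estimates: `beta` is the daily transmission rate, `gamma` is the daily recovery rate, and `initial_infected_share` is the share of people infected on day zero.

```python
def simulate_sir(beta, gamma, initial_infected_share, days):
    """Step a basic S-I-R model forward one day at a time.

    Returns a list of (susceptible, infected, recovered) population shares.
    """
    s = 1.0 - initial_infected_share
    i = initial_infected_share
    r = 0.0
    path = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i   # susceptible people who meet infected people
        new_recoveries = gamma * i      # infected people who recover today
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        path.append((s, i, r))
    return path


def peak_infected_share(path):
    """Largest share of the population infected at the same time."""
    return max(i for _, i, _ in path)


# Same model structure, two assumed transmission rates (illustrative only).
low_path = simulate_sir(beta=0.15, gamma=0.1, initial_infected_share=0.001, days=365)
high_path = simulate_sir(beta=0.30, gamma=0.1, initial_infected_share=0.001, days=365)

print(f"Peak infected share, beta=0.15: {peak_infected_share(low_path):.1%}")
print(f"Peak infected share, beta=0.30: {peak_infected_share(high_path):.1%}")
```

The structure is identical in both runs; only one input changes, and the peak of the epidemic moves a great deal. The same is true of the starting shares, which is why the inputs matter so much.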

But that means actually knowing what share of people are susceptible, infected or recovered at any given time. Without that information, we are basically just guessing.

You may think: surely we know that! Don’t we see information on infections and hospitalizations and deaths over time? I feel like I’ve seen a lot of graphs about this.

Well, yes. But in the context of COVID-19, that’s not close to enough. Many infections with COVID-19 are very mild and non-specific. A large share of people — perhaps half or even 75% — who are infected have no symptoms. Even people who are symptomatic are still often not tested.

This means for every case we see, there are at least some we do not see. How many is really unclear. Some people think there are 10 missing cases for every one we see; others think it’s just one or two.

The implications of these two views are hugely different. If 1% of the population has already been infected, then 99% of people are still susceptible. On the other hand, if 20% have already been infected, well, that’s a different story.
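The arithmetic behind that gap is simple, and a few lines of Python make it explicit. This is only a sketch: the `confirmed_share` value below is a hypothetical placeholder, not a real figure, and the function name is my own.

```python
def implied_infected_share(confirmed_share, missed_per_confirmed):
    """Total share infected if each confirmed case stands in for some unseen ones."""
    return confirmed_share * (1 + missed_per_confirmed)


# Hypothetical: suppose 0.5% of the population has a confirmed infection.
confirmed_share = 0.005

for missed in (1, 2, 10):
    infected = implied_infected_share(confirmed_share, missed)
    print(
        f"{missed} missed per confirmed case: "
        f"{infected:.1%} infected, {1 - infected:.1%} not yet infected"
    )
```

The same confirmed case count supports very different pictures of the epidemic, depending entirely on a multiplier we do not currently know.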

Among our #1 priorities should be to learn about this number. And here is where I’ve been contemplating the problems of selection.

The Problem of Sampling

The best way to learn about the share of the population who have been exposed to the virus is to either test everyone (best case, probably infeasible in the US) or to test a random sample of people. This testing could be for active current infection or testing for past infection using antibodies. (This antibody testing has started to come online in the last couple of weeks and promises to be even more useful than active infection testing.)

Regardless of which type of testing we are doing, it is crucial to have a random sample of people. There are a few examples of this, though only a very few. Iceland did some random population testing recently, which showed about 1% of the general population had active infection (half of them asymptomatic). There is one town in Italy which tested everyone early in the epidemic (3% active infection, about half asymptomatic). Antibody testing (which includes past infections) among a random sample in Germany showed 15% either were actively infected or had been infected in the past.

Second best to a random sample may be universal testing among a known population. We had a recent example of this among, actually, pregnant women in NY. A publication earlier this week in the New England Journal of Medicine showed active COVID-19 infection among almost 15% of women admitted for delivery. (A very large share of these infections were asymptomatic. I’m still unpacking what this might mean for those of you who are pregnant; more on that next week.)

This isn’t as good as a random sample since pregnant women are different in many ways (gender, age, exposure to medical care) than the general population. Still, it has value in part because we can understand the sources of bias.

Most people agree that random or universal testing is the best approach. But it’s also very hard to execute. Identifying a random sample of people and testing them is much, much more challenging than testing what we’d call a “convenience sample”: people who are easy to find and access. And given the difference in difficulty, you might be tempted to think, well, some data is better than no data. I’ll do the easier thing and at least learn something.

This thinking is really problematic, though. Put simply: if we do not understand the biases in our sampling, the resulting data is garbage. One recent frustrating example of this is a large NIH study which aims to do antibody testing among 10,000 volunteers. Volunteers are being solicited in various ways, like over Twitter and with other public postings. People are asked to email the NIH to enroll, at which point they may be sent a home test kit.

Dr. Fauci has suggested that this will give us a “clearer picture of the true magnitude of the COVID-19 pandemic in the United States.” But it will not! It will give a clear picture of the magnitude among people who, say, scroll Twitter for opportunities to be in studies like this. Are these people more or less likely to have had COVID-19? I have no idea. Maybe you pull more people who know they’ve been exposed (higher prevalence), or maybe you pull people who are more careful about exposure (lower prevalence). Maybe it’s a weird mix of both. We simply do not know. We’ll get some number out of this and it will be completely uninterpretable.

This is worse than nothing, since people will think that they’ve learned something.
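A small simulation shows why a volunteer number is uninterpretable. This is a sketch with made-up inputs, not a model of the NIH study: the true prevalence and the volunteering rates below are assumptions chosen only to illustrate the two directions the bias could run.

```python
import random


def simulate_population(size, true_prevalence, rng):
    """One True/False entry per person: has this person been infected?"""
    return [rng.random() < true_prevalence for _ in range(size)]


def random_sample_estimate(population, n, rng):
    """Prevalence estimated from n people drawn at random."""
    sample = rng.sample(population, n)
    return sum(sample) / n


def volunteer_sample_estimate(population, n, rate_if_infected, rate_if_not, rng):
    """Prevalence estimated from up to n people who chose to sign up.

    The chance of volunteering depends on infection status, which is
    exactly what we cannot observe in a real volunteer study.
    """
    volunteers = [
        infected
        for infected in population
        if rng.random() < (rate_if_infected if infected else rate_if_not)
    ]
    sample = rng.sample(volunteers, min(n, len(volunteers)))
    return sum(sample) / len(sample)


rng = random.Random(0)
true_prevalence = 0.05  # assumed for illustration
population = simulate_population(200_000, true_prevalence, rng)

random_estimate = random_sample_estimate(population, 10_000, rng)
# Scenario A: people who suspect they were exposed are keener to sign up.
exposed_keen_estimate = volunteer_sample_estimate(population, 10_000, 0.15, 0.05, rng)
# Scenario B: careful, never-exposed people are keener to sign up.
careful_keen_estimate = volunteer_sample_estimate(population, 10_000, 0.05, 0.15, rng)

print(f"True prevalence:             {true_prevalence:.1%}")
print(f"Random sample estimate:      {random_estimate:.1%}")
print(f"Volunteers, exposed keener:  {exposed_keen_estimate:.1%}")
print(f"Volunteers, careful keener:  {careful_keen_estimate:.1%}")
```

The random sample lands near the truth. The two volunteer samples miss in opposite directions, and from the data alone there is no way to tell which scenario you are in.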

I have similar problems with testing blood donors as a measure of prevalence. Yes, it’s convenient. But it’s not going to tell us anything broadly useful.

What to do? I’m afraid that despite how hard it is, we simply have no choice but to do better sampling when we test. As someone who is trying to get some random testing off the ground in various populations, I can attest to the many, many challenges of doing so. But it is worthwhile. We need to do this.

So if someone shows up at your door and tells you you’ve been randomly selected for testing, please, please consent.