There is a familiar refrain in discussing science and decision-making around COVID. Namely: there isn’t enough data.

We need more data.
We’re hoping to get more data on that.
That’s an area with incomplete data.

The fact is that 18 months into the pandemic, we are still not where we should be in our ability to quickly collect and analyze the data needed for public health decisions.

Here’s an example. Back in spring 2020, many of us (here’s my particular take) argued that the only way to really track case rates (and serious illness rates and so on) was to engage in a program of random testing. Some countries, like the U.K., did that. But the U.S. did not, and we still do not have such a program. While our tracking of the pandemic has improved tremendously with better testing and better data reporting, random testing would still be helpful.

Consider this. Right now South Dakota has low case rates relative to most places with its vaccination rate. Is this because of natural immunity from very, very high infection rates in the winter? Or is it, more plausibly, a lack of testing? Without random-sample testing, we cannot know.
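To make the logic concrete, here is a minimal sketch of why reported case rates and random-sample estimates can tell different stories. Everything in it is an assumption for illustration: the two states, the 1% true prevalence, the shares of infections detected, and the sample of 5,000 are made-up numbers, not data from South Dakota or anywhere else. The function names are my own.

```python
import math


def wilson_interval(positives: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion from a random sample."""
    if n <= 0 or not 0 <= positives <= n:
        raise ValueError("need n > 0 and 0 <= positives <= n")
    p = positives / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def reported_cases_per_100k(true_prevalence: float, share_detected: float) -> float:
    """Official case rate when only a share of true infections is ever tested."""
    return true_prevalence * share_detected * 100_000


# Hypothetical: two states with the SAME true prevalence (1%) but different testing.
for name, share_detected in [("State A (heavy testing)", 0.50),
                             ("State B (light testing)", 0.15)]:
    rate = reported_cases_per_100k(0.01, share_detected)
    print(f"{name}: {rate:.0f} reported cases per 100k")

# Hypothetical random sample in either state: 5,000 people tested, 50 positive.
low, high = wilson_interval(positives=50, n=5000)
print(f"random-sample estimate: 1.0% infected (95% CI {low:.2%} to {high:.2%})")
```

In this toy setup the two states look very different in official counts even though the underlying infection rate is identical, while a random sample recovers roughly the same answer in both. That is the whole argument for random testing.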

This is a big-picture thing. There are also many smaller data questions that we need answered, where it seems possible to answer them, and yet we have failed to do so. I want to talk about three examples: the problem, and how we might have done better (or might still do better!). As someone who thinks all the time about data, and what data we need, it seems clear that there are some feasible opportunities we have passed by.

And in case anyone is listening, I’ll end with my pitch for a “Data Force.”

Example 1: Breakthrough Infections

At some point around this past May, the CDC stopped systematically tracking breakthrough COVID infections in vaccinated people. Obviously, ex post, this was a mistake. It left the country with limited insight into the frequency of such cases and their potential for spread. Ex ante it’s less clear that this was a mistake, given that tracking such cases is significant work, and resources are limited.

However: what seems clearly an ex ante mistake was not developing a comprehensive plan for some ongoing tracking with systematic data analysis. The best way to do this would have been to enroll a random sample cohort of vaccinated individuals and track them over time. But even if that was infeasible, the CDC could have done a much more transparent job of reporting out the cohorts that are being tracked.

Many universities (including my own) have continued universal testing even after vaccines. Sports leagues have done the same; remember when Chris Paul had to sit out some of the playoffs? He was fully vaccinated. That was a breakthrough.

The CDC has said that it is “monitoring” these data, but I don’t know what that means! It could have created a cohort (and still could) of these data sources — it’s not random, but it’s a lot of people — and reported out at least summary data on breakthrough infections, which vaccines the individuals had, and symptoms. This could be really, really helpful in answering questions like: Are some of the vaccines more effective against Delta than others? Does it matter when you were vaccinated?

This isn’t even a data-collection exercise! Someone else is doing the data collection. It’s an exercise in asking these organizations to participate (some of these data are already public), and putting some time into visualizing and reporting the data. I feel like a motivated hackathon could do this in two days.
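For a sense of scale, here is a rough sketch of the kind of summary I have in mind: pool records from organizations that already test everyone, then report breakthrough infections by vaccine and by time since vaccination. The record layout, the source and vaccine labels, and every number below are hypothetical; real data would need agreed definitions and privacy protections that this ignores.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CohortRecord:
    source: str             # hypothetical label, e.g. a university or sports league
    vaccine: str            # hypothetical label
    months_since_dose: int  # months between final dose and start of observation
    person_days: int        # days this person was under regular testing
    breakthrough: bool
    symptomatic: bool = False


def summarize(records, key):
    """Group records by key(record); report breakthroughs per 100,000 person-days."""
    groups = defaultdict(lambda: {"people": 0, "person_days": 0,
                                  "breakthroughs": 0, "symptomatic": 0})
    for r in records:
        g = groups[key(r)]
        g["people"] += 1
        g["person_days"] += r.person_days
        g["breakthroughs"] += int(r.breakthrough)
        g["symptomatic"] += int(r.breakthrough and r.symptomatic)
    for g in groups.values():
        days = g["person_days"]
        g["per_100k_person_days"] = (
            100_000 * g["breakthroughs"] / days if days else float("nan")
        )
    return dict(groups)


# Made-up records, purely to show the shape of the report.
records = [
    CohortRecord("University X", "Vaccine 1", 2, 60, False),
    CohortRecord("University X", "Vaccine 1", 5, 60, True, symptomatic=True),
    CohortRecord("League Y", "Vaccine 2", 3, 90, False),
    CohortRecord("League Y", "Vaccine 2", 6, 90, True),
]

by_vaccine = summarize(records, key=lambda r: r.vaccine)
by_timing = summarize(
    records,
    key=lambda r: "4+ months" if r.months_since_dose >= 4 else "under 4 months",
)
for label, row in {**by_vaccine, **by_timing}.items():
    print(label, row)
```

Using person-days rather than head counts is a design choice on my part: people in these cohorts are tested for different lengths of time, so a per-person rate would not be comparable across sources.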

Example 2: Severity of Illness in Kids

Last week, I wrote about risks for kids and surfaced one of the open questions: Is the Delta variant causing more severe disease in children? Our tracking makes this difficult to figure out. Case tracking in children is fairly poor, since testing is somewhat haphazard and many infections are asymptomatic. Hospitalization tracking is better but still not great. Deaths are well tracked but, in this case, thankfully so rare that they are difficult to analyze.

This has left us with a gap. We have little or no way to look at the ratio of hospitalizations to cases among children specifically. This is true even if we look at the whole range of ages, and even more true if we want to narrow in on younger ages. This is a problem. Public health officials from Celine Gounder to Rochelle Walensky have noted that we need more data. But… what would it be?

The tracking-cohort approach here might be useful, but serious illness risks are sufficiently low in kids that it’s probably not feasible to have a large enough cohort to do this. Instead, I think at this moment the problem calls for someone to dig into data kept directly by hospitals, perhaps combined with case surveillance data in particular local areas. It wouldn’t be perfect, but at the moment we’re relying on anecdotes and press conferences. Line lists from hospitals would tell us something more about the characteristics of admissions with positive COVID-19 tests and the severity over time.
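Two pieces of arithmetic sit behind this. The sketch below shows both, under loudly stated assumptions: the "1 in 10,000" hospitalization risk, the target of 50 events, and the monthly counts are placeholders I invented to show the calculation, not estimates of anything.

```python
def cohort_size_for_events(target_events: int, one_in: int) -> int:
    """Children to enroll so the EXPECTED number of hospitalizations equals
    target_events, if the risk over the study period is 1 in `one_in`."""
    if target_events <= 0 or one_in <= 0:
        raise ValueError("both arguments must be positive")
    return target_events * one_in


def hospitalizations_per_1000_cases(admissions: int, reported_cases: int) -> float:
    """Severity proxy for one area and period: admissions per 1,000 reported cases."""
    if reported_cases <= 0:
        raise ValueError("reported_cases must be positive")
    return 1000 * admissions / reported_cases


# Hypothetical risk: to expect even 50 hospitalizations, a cohort would need:
print(cohort_size_for_events(target_events=50, one_in=10_000), "children")

# Hypothetical local data: (pediatric admissions from hospital line lists,
# pediatric cases from local surveillance) for two periods.
local_counts = {"before Delta": (12, 4_000), "during Delta": (15, 5_000)}
for period, (admissions, cases) in local_counts.items():
    ratio = hospitalizations_per_1000_cases(admissions, cases)
    print(f"{period}: {ratio:.1f} admissions per 1,000 reported cases")
```

The second calculation inherits the weakness of case tracking: if testing changes between periods, the denominator changes too, so the ratio is only comparable over time within an area where testing has been reasonably stable.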

Example 3: Schools

Of course we’d end here. Back in summer 2020, I looked ahead to school openings in July and realized that no one was planning to collect data on COVID cases and mitigation. Others, like Burbio, looked and saw no systematic plan to track which schools were open for in-person learning. So, many researchers, NGOs, think tanks, and just people in their basements spent time collecting these and other data. But such collections are necessarily less good than what government actors could do, since these groups had no ability to compel reporting.

In summer 2021, as schools reopen amid the Delta variant, we find ourselves — incredibly — in the same place. On the basic questions of which schools are open or closed at a given time, or how many families opt for remote learning, there isn’t a centralized plan to collect data nationally, and only a subset of states will likely be able to do it.

We are also still unclear on the precise value of various mitigation factors. For example: The best-quality data in the U.S. showing limited spread of COVID in schools (e.g. in New York or North Carolina) is from schools with mask mandates. This, to my mind, argues strongly for starting school years with mask mandates, since that’s part of what makes us confident about limited spread.

However, our understanding of the effects of these mandates will be better if we have better data. At the moment, we have very little data directly comparing schools with mask mandates to those without (as this op-ed points out), and we know that Europe does not universally mask children in schools. The very limited direct data we have from the U.S. (i.e. in Georgia and Florida) is mixed. This isn’t to discount masks; it is just to say that better data would allow us to understand them better, and to make a stronger case for their value.

Better data would necessitate a real study, or at least some school-level data collection. But it’s doable. I know, because we did a version of it last year.

A careful observational study of spread in schools over September would likely provide information about not only masking but also asymptomatic testing, types of ventilation, quarantine policies and other factors. It would also literally allow us to understand how much COVID is in schools, which we currently have no plan to do. We owe it to kids to optimize these prevention measures, to provide the best COVID protection and also the best school experience.
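The most basic output of such a study would be case rates by policy. Here is a minimal sketch of that tabulation, assuming each school reports enrollment, weeks observed, case counts, and its policies. The schools, fields, and counts are invented; I deliberately chose placeholder numbers that come out identical across groups so that nothing can be read into them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SchoolReport:
    school: str          # hypothetical school label
    mask_mandate: bool
    weekly_testing: bool  # asymptomatic screening in place
    students: int
    weeks: int
    cases: int


def rates_by_policy(reports, policy):
    """Cases per 1,000 student-weeks, grouped by a policy attribute name."""
    totals = defaultdict(lambda: [0, 0])  # policy value -> [cases, student-weeks]
    for r in reports:
        group = totals[getattr(r, policy)]
        group[0] += r.cases
        group[1] += r.students * r.weeks
    return {value: 1000 * cases / student_weeks
            for value, (cases, student_weeks) in totals.items()
            if student_weeks > 0}


# Placeholder reports, chosen so both groups have the same rate.
reports = [
    SchoolReport("School A", True, True, 500, 4, 4),
    SchoolReport("School B", True, False, 250, 4, 2),
    SchoolReport("School C", False, False, 400, 4, 3),
    SchoolReport("School D", False, True, 600, 4, 5),
]

for policy in ("mask_mandate", "weekly_testing"):
    print(policy, rates_by_policy(reports, policy))
```

A crude comparison like this does not adjust for anything that differs between schools, such as how much COVID is circulating in the surrounding community, which is part of why I say a careful study is needed rather than a spreadsheet.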

The Data Force

Why isn’t our data better? There are probably several answers, but one is that it wasn’t actually anyone’s job. During the Trump administration, Betsy DeVos said she didn’t think it was her responsibility to collect data on COVID in schools; that was the job of the CDC. But the CDC couldn’t do this alone, since it required coordination. Beyond that, in the middle of the pandemic everyone was working all the time on their core jobs. It is easy to see why there wasn’t really capacity to ramp up an entirely new data infrastructure, one that could nimbly respond to what we needed.

One approach would be to have a small, flexible data team that could be ramped up in response to this type of situation. I think Scott Gottlieb has a similar suggestion in his new book (which looks great but isn’t out yet), although I suspect his idea is broader than data alone. I would call it the “Data Force,” though that’s just me.

Maybe such a team already exists somewhere in the Biden administration. I hope so. I have some ideas for them.