TL;DR: This is a post about the COVID-19 School Data Hub, which you can visit here.
Last year, I did a lot of work on schools. Much of it was scaffolded and supported by the existence of this newsletter, which produced many of the connections that allowed that work to happen. If you have been a longer-time reader of the newsletter, you’ll recall that during the 2020-2021 school year I worked with a team to track COVID cases in schools.
I haven’t talked as much about this over the past few months, as other issues (COVID and non-) have been more pressing.
But today I wanted to break that streak and write a bit about the COVID-19 School Data Hub, which is the data culmination of much of what we did last year. In short, this hub (which you can see here) is a public data resource documenting schooling modes over the 2020-21 school year in most of the U.S. I’m going to talk about why we built it, how we built it, and what we hope we (and others) will learn from it.
There are a few things I’m aiming to convey here. One goal is to provide data for others — parents, guardians, teachers, policy makers, institutions, etc — to use when making decisions. A second is to give more context for what went on last year across the U.S. The reaction to this recent piece in the Washington Post suggests that there is still confusion about the extent to which there was variation in schooling access over the past year. Finally, I just think this is an interesting example of where data comes from, of the messy chaos and strokes of luck that can make data production on this scale possible.
The context, and the goal
Throughout the 2020-21 school year, real-time tracking of school openings fell to a few different groups. Burbio, a company that pre-pandemic published information on neighborhood events, started a data collection project that tracked 1,200 of the largest school districts monthly, and then weekly, over the year. For context, the U.S. has about 13,000 school districts, but their sizes are very skewed, so the top 1,200 have about 56% of the nation’s students.
This Burbio data, which was scraped from district websites by a team of employees, was the source of most of the national-level data that was reported by media sources and the federal government. Also useful was the American Enterprise Institute’s Return to Learn Tracker, which combined a number of sources, including data from several states with more consistent reporting.
However: at the end of the school year, there was no consistent source of national data on how schools had operated during the year. Many districts were completely missing information about whether they were virtual, in-person, or hybrid; some had information for only part of the year. The data that did exist was often incomplete or inconsistent.
So we set out, in this Data Hub project, to create a data set that — in our dream — would let us track the opening mode of every school in the U.S. over this pandemic school year. We knew our ideal at the start: information on every school, every week, for the entire year. We also knew this was unrealistic, so instead we tried to get as much information as we could toward it.
You may ask: Why did you want to know this? Two reasons (at least). One was just a desire to document what happened — a historical question. But the second was related to research goals. The consequences of school closures will likely be felt for years, probably decades. If we want to understand the consequences of these disruptions, we need systematic information on where they occurred.
You might ask, too: Why was this information not collected, either in real time or later, by the U.S. Department of Education or another federal agency? That’s a complicated question and I’m not entirely sure of the answer, although it almost certainly relates to a) resources: not having the funding; b) the fact that schooling in the U.S. is extremely decentralized and, as a result, this information is not readily accessible.
Data sources
With our goal in mind, last spring we began to collect data. We focused on getting “official” data from state education agencies (SEAs) where possible. We got enormous help from the Council of Chief State School Officers (CCSSO), which connected us with individuals at SEAs, who could then provide us with the data we needed. Or, well, sometimes they could.
We quickly learned that there was an absolutely astonishing variation in the information that states collected on schools during the pandemic.
In some cases — North Dakota and Rhode Island come to mind — the state had official data for every school, with counts of children in each schooling mode, by day, for the entire school year.
In other cases — Arizona, for example — information was collected from school districts a few times a year, providing a series of snapshots, but included much less detail.
And some states had nothing at all — seemingly no systematic data collected by the state on school or district learning mode. Or, anyway, nothing we could really use.
Over the course of last summer, the project team, headed by the irreplaceable Clare Halloran, took the data we got from states and supervised a team of people to organize it. For research purposes, data is most useful if it is clean.
What does clean mean?
If you’re going to work with data like this from many states, it is much easier to use if every state’s data has the same variable names, for example. It’s also helpful if every data set uses the national-level school codes, so it can be easily merged with other sources. This kind of organization, if you’re doing it on a large scale and you want to get it right, takes a lot of time and careful hand-checking, more than you might think. We were all hands on deck and down to the wire on launch timing; I spent some of my vacation time last August reviewing National Center for Education Statistics codes and double-checking Excel files on Clare’s instruction. I do not think I was the best employee she had.
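To make that concrete, here is a minimal sketch, in Python with pandas, of the kind of harmonization involved. The file names, column names, and mode labels are invented for illustration; this is not our actual pipeline, just the general shape of the work.

```python
# A sketch (hypothetical files and columns): put two state files into a common
# schema keyed on NCES school IDs, then merge with an outside school-level source.
import pandas as pd

# Each state delivers data in its own shape; map its columns to shared names.
state_a = pd.read_csv("state_a_modes.csv").rename(
    columns={"SchoolID": "ncessch", "LearnModel": "learning_mode", "Mon": "month"}
)
state_b = pd.read_csv("state_b_modes.csv").rename(
    columns={"nces_code": "ncessch", "instruction_type": "learning_mode", "period": "month"}
)

# Harmonize the labels each state uses into one set of categories.
mode_map = {"Remote": "Virtual", "Distance": "Virtual", "Face-to-face": "In-person"}
combined = pd.concat([state_a, state_b], ignore_index=True)
combined["learning_mode"] = combined["learning_mode"].replace(mode_map)

# NCES school codes are fixed-width strings; zero-pad so merges don't silently fail.
combined["ncessch"] = combined["ncessch"].astype(str).str.zfill(12)

# Once everything is keyed on the national codes, it can be joined to other
# school-level sources, e.g. an enrollment or demographics file.
directory = pd.read_csv("nces_school_directory.csv", dtype={"ncessch": str})
merged = combined.merge(directory, on="ncessch", how="left", validate="many_to_one")
```

The hand-checking comes in at every step of a sketch like this: catching schools whose codes don’t match, mode labels that don’t map cleanly, and files whose layouts change partway through the year.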
When we launched the Data Hub in September, the data was clean, but incomplete. We had data from 30 states (including D.C.), covering about 56,000 schools. And there were some notable states missing — California, for example.
It seemed like we might be stuck at this stage, with perhaps a few more states trickling in. But then there was a breakthrough: the realization that some of this data was collected through an unrelated program. This is one of those lucky data breaks you do not get every day.
Here’s the story. During the pandemic, when schools were closed or had reduced hours, the federal government funded a program called Pandemic Electronic Benefit Transfer, or P-EBT. The program provided funds to students who were eligible for free and reduced-price lunch, to make up for the fact that when they didn’t have access to full in-person school, they weren’t able to receive their lunch. It started in the spring and summer of 2020 and continued through the 2020-21 school year (it has continued through 2021-22 as well).
A key element of the program in the 2020-21 school year was that funds were owed to students to cover time when their schools were closed. If a school was open, and therefore providing lunch, payments weren’t necessary. States could apply to get federal funds through this program. But in order to do so, they needed to document if schools were closed or operating with reduced hours or attendance for at least five consecutive days, leading students to miss meals they would have otherwise received. Most often, states decided to do this by retroactively tracking schooling mode by school and month.
Of course, this was precisely the information we were looking for. And it turned out, when we dug into it, that there were a number of states where this information was available, even when the SEA had initially told us it didn’t have information (because the P-EBT data collection process wasn’t a part of states’ regular data collection systems) or that it was incomplete. In several cases, the P-EBT data came from state departments of health or social services (in collaboration with departments of education).
For example: in California, we struggled for months to get data from the state’s department of education. In the end, what we got was limited — a snapshot of opening information from one moment in the spring for a small subset of districts. It wasn’t enough to populate the Data Hub. But when we flagged the P-EBT program, we were able to get a full data set, by school and month, almost immediately. The information was there; it was just living in a different agency.
Using this program, and a few other data breaks, the Data Hub now has complete data for 42 states for most months.
We’re still missing a few. Some are underway — Montana, South Dakota, Tennessee. Others have proved extremely difficult. Despite our best efforts, we haven’t been able to unlock anything in Pennsylvania, North Carolina, Oklahoma, or Delaware. Sometimes we know the data is there, but the state agencies are unwilling to share; in other cases, we can’t tell if they collected it at all.
Our goal was to fill in the whole map. I’m still optimistic (although if you’ve got any Pennsylvania or North Carolina contacts, let me know).
Why is this useful?
What’s the purpose of having all this data? One purpose is simply to map out the experience of the past year. The main Data Hub page lets people scroll through the opening picture over the course of the year. Below, I’ve extracted graphs for October, January, and April. You can see both the increase in opening over time and the considerable geographic variation: more districts offered in-person instruction by the spring, but there were huge differences across the country throughout the year.
We can also use this data to look at where schools opened, and for whom. A very strong pattern is that schools and districts with more students of color were less likely to open for in-person learning. This is especially true for Black students. Districts with more students eligible for free and reduced-price lunch were also less likely to open. On the flip side, districts in more Republican-leaning areas (as proxied by Trump vote share in 2020) were more likely to open. This is true across and within states. COVID case rates are weakly positively correlated with opening.
These correlations aren’t causal and they aren’t independent; the racial makeup of districts is highly correlated with their income, so it’s hard to separate those out. What the data shows very starkly, though, is that there are large differences across areas in who had access to in-person schooling, and these differences are predictable. This winter, we were able to use the data to show that the areas that closed in January 2022 had the same characteristics — indeed, were generally the same districts — that had more limited school access in the 2020-21 school year.
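For readers who want to poke at these patterns themselves, here is an illustrative sketch, not our actual analysis code, of what “across and within states” means in practice: regress a district’s share of time in-person on its characteristics, with and without state fixed effects. The file name and variable names are hypothetical.

```python
# Illustrative only: pooled vs. within-state correlations between district
# characteristics and in-person schooling access. Variable names are invented.
import pandas as pd
import statsmodels.formula.api as smf

districts = pd.read_csv("district_year_summary.csv")  # hypothetical merged file

# Pooled (across-state) relationship
pooled = smf.ols(
    "share_in_person ~ pct_black + pct_frl + trump_share_2020 + covid_case_rate",
    data=districts,
).fit(cov_type="HC1")

# Adding state fixed effects isolates the within-state variation
within = smf.ols(
    "share_in_person ~ pct_black + pct_frl + trump_share_2020 + covid_case_rate + C(state)",
    data=districts,
).fit(cov_type="HC1")

print(pooled.params[["pct_black", "trump_share_2020"]])
print(within.params[["pct_black", "trump_share_2020"]])
```

Comparing the two sets of coefficients is one simple way to see whether a pattern is driven by differences between states or shows up among districts within the same state as well.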
Our team has also used the data to show that test scores in the spring of 2021 were reduced everywhere, but were reduced more in areas with less access to in-person schooling. The detailed nature of the data allowed us to show that this is true even within small areas (specifically, commuting zones). What we observe is that if you take two close-by districts but one had more in-person schooling than the other, the one with more in-person schooling showed smaller test-score declines during the pandemic.
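Here is a hedged sketch of that within-area comparison, again with invented column names: demean both test-score changes and in-person share within commuting zones, so the remaining comparison is only between districts that share a commuting zone.

```python
# A sketch of a within-commuting-zone comparison (hypothetical file and columns):
# after demeaning within zones, the only variation left is between nearby districts.
import pandas as pd

df = pd.read_csv("district_scores_and_modes.csv")  # hypothetical merged file

for col in ["score_change_2019_2021", "share_in_person"]:
    df[col + "_dm"] = df[col] - df.groupby("commuting_zone")[col].transform("mean")

# Within-zone regression slope: a positive value means that, among districts in
# the same commuting zone, the one with more in-person schooling saw the smaller
# test-score decline.
slope = (
    (df["score_change_2019_2021_dm"] * df["share_in_person_dm"]).sum()
    / (df["share_in_person_dm"] ** 2).sum()
)
print(slope)
```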
There is more that we think can be done with this data. Our team is working on additional research, but we hope that by making the data accessible to all, we can encourage others to do their own projects. So please explore and use the data. You can find all of it together, plus some additional data sets cleaned for merging, at the page here.
Funding
Quite simply, a project like this takes resources. Processing this type of data takes human effort, beyond what I could do alone. It requires a team, and that team needs to be fairly compensated for their time. We were fortunate to have some wonderful volunteer talent, especially last summer, but core ongoing team members are paid.
This work — both the COVID-19 School Response Dashboard, which tracked COVID cases in schools through the school year, and the Data Hub — was funded through many sources. Initial funding came through my research funds at Brown University. Subsequently we have been lucky to get funding from private foundations including the Chan Zuckerberg Initiative, Arnold Ventures, the Silver Giving Foundation, the John Templeton Foundation, the Walton Family Foundation, and the Emergent Ventures Fast Grants program.
We have received criticism for some of this funding, with allegations that the research is influenced, especially by the political right. The Emergent Ventures program, for example, is run through the Mercatus Center at George Mason University, which itself has funding from the Koch family. It has been alleged — directly and indirectly — that these groups dictate how our research is done, what I write, or what we publish.
This is emphatically not the case. Our sources of funding have no influence. Full stop. The funding for this project has run through Brown, which has strict rules that would not allow funders to influence research findings. Moreover, even if that were not true, I want to be clear that we have never been asked to change what we are doing, write any specific content, or hold any data back. I wouldn’t do that, and it has never come up.
I think it is fair to say that it would have been better if work like this had been funded directly through the federal government. But the federal government didn’t choose to do this, and there weren’t avenues for us to apply for federal funding on an appropriate time frame. I am incredibly grateful for the funding we received from foundations, and hopeful that we’ll be able to apply for funding through these avenues in the future to keep the Data Hub running.
Final thoughts
Please use our data! And if you have ideas for either funding opportunities or for how we can wrest data from the great state of Pennsylvania, reach out…