In the last couple of years I’ve spent a huge share of my professional time writing. Books, opinion pieces, this newsletter, and so on. But when I introduce myself and my job, I still list “Economist who studies health economics and statistical methods” first. It’s my “real job”. And, today, this newsletter comes to you with that hat on. We’re going to talk data. Specifically, COVID case data, positivity data and testing.

Graphs, tables and, yes, equations coming.

The TL;DR here is that I think there’s a lot of confusion in how we track the pandemic and how we compare across locations, and it may lead us to poor conclusions.

(If you want to nerd out with me, the data and code for this is here. And, no, I do not want to hear about how you could have done it better with R, thanks. I’m old, I’m committed to Stata, and I can’t hear you la la la).

**The Basic Issue: Case Rates, Positivity Rates & An Identity**

Broadly, there is a lot of interest in tracking the COVID-19 pandemic over time. I don’t think I need to convince you of that, and I’ll talk a bit more below about where this tracking impacts policy. And it’s easy to find trackers online. Johns Hopkins has a good one, as does the New York Times. For the most part, they rely on data from USA Facts, which is what I’ll use here.

Our goal, really, is to track the *infection rate*. That is: in the population overall, what share of people have COVID-19 right now? We do not know this. If we really wanted to be able to accurately track the evolution of the pandemic we’d want to randomly test a segment of the population over time in a consistent manner. I made this point, as did many, many others, early in the pandemic. We never did this, which is unfortunate and has left us with imperfect options.

In practice, there are two primary variables we use in tracking: case rates (positive cases per population, usually quoted per 100,000 people) and positivity rates (share of tests which are positive). Both are flawed. Case rate is an undercount, because not all cases are detected; positivity is an over-count, because sick people are more likely to be tested.

This post is about how to understand their relationship and what they mean.

We can start with the data that goes into these calculations. There are three relevant pieces of information: (1) total population (I’ll work with state-level data, so this is state population), (2) number of tests performed and (3) number of positive tests. Given these data, we can calculate case rates and positivity rates and think about how they relate. The simple calculations are below.

The positivity rate is positive tests divided by total tests. The case rate is positive tests divided by population.

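If you put these together, you can see the relationship between the *case rate* and the *positivity rate*: the case rate equals the positivity rate multiplied by the number of tests per person (case rate = positivity rate × total tests / population).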

This result is what we’d refer to as an “identity”; it’s just a feature of how these variables are calculated. But what it tells us is that case rate and positivity rate are *related*, but the number of tests also matters. Put differently, you can get the same case rate with wildly different positivity rates, depending on the number of tests being done.

Why does this matter? Who cares?!?!

I’d argue it matters because although we sometimes talk about these two things as both measuring the pandemic, they actually measure somewhat different things. And while they do tend to move together, there is definitely *not* a perfect correlation.

You can see this in the graph below, where I illustrate the relationship — at the state-month level — between case rates and positivity rates. The graph slopes up but there is a lot of variation.

I’ve marked some example points. For example, the case rate in Arizona in January was very similar to South Dakota in November; both had rates around 130 per 100,000. However, their positivity was wildly different: 50% in South Dakota versus 16% in Arizona. Alaska and Idaho both had rates around 72 per 100,000 at some point in the fall, but this reflected a positivity of 3.8% in Alaska and 54% in Idaho.

These differences are largest when the case rates are higher, but they persist at low rates. If we look at the state-months with case rates between 10 and 20 per 100,000 people (basically fairly low), we see positivity rates which range from 1.2% (Massachusetts in October) to 18% (Idaho, September).

Given the identity linking the two measures, these differences must be driven by testing volume. Another way to see this is in the table below, which shows the case rate, positivity rate and testing volume for January (so far) for states with the highest and lowest case rates. There is huge overlap in positivity rates *even at these tail ends of case rates*. Iowa has a case rate about a third as large as Rhode Island, but the positivity rate is about 6 times *higher*. This reflects the fact that Rhode Island did 17 times as many tests per capita as Iowa during this period.
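
The Iowa and Rhode Island comparison follows directly from the identity. Here is a rough check in Python, using only the rounded ratios quoted above ("about a third" and "about 6 times"); the function name is mine, and the unrounded table values will give a slightly different answer.

```python
def implied_testing_ratio(case_rate_ratio, positivity_ratio):
    """Tests-per-capita ratio of place A to place B implied by
    case rate = positivity rate x tests per capita."""
    return case_rate_ratio / positivity_ratio


# Iowa relative to Rhode Island: case rate about 1/3, positivity about 6x.
iowa_vs_ri = implied_testing_ratio(1 / 3, 6)
ri_vs_iowa = 1 / iowa_vs_ri

print(f"Rhode Island tests per capita relative to Iowa: about {ri_vs_iowa:.0f}x")
```

With these rounded inputs the implied gap is about 18-fold, in the same ballpark as the 17-fold difference in the actual data.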

**Who Is Doing It Right?**

It has become common to use case rates as our primary pandemic measure, and the analysis above suggests they might be misleading, in the sense that the same case rate may reflect different underlying *infection rates* depending on how much testing is going on. Looking at the data above, one natural conclusion is that if two places have the same case rate but one has a higher positivity rate, the one with the higher positivity rate probably has a higher infection rate.

But a skeptical person might argue: well, maybe the case rates are right, and places like Rhode Island and Alaska are just testing a lot of low-risk negative people for no reason. Maybe Idaho is doing a good job, only using their tests on positive people. Maybe Alaska in November (case rate 72, positivity rate 3.8%) actually has the same underlying infection rate as Idaho in December (case rate 72, positivity rate 54%). Maybe it’s just that Idaho is a very efficient tester.

This is very unlikely to be true. For one thing, it simply strains credulity to think that Idaho would have some magical solution to finding positive cases. The positivity rates imply they are testing only about 0.8 negative people for each positive person, versus roughly 25 in Alaska.
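
Converting a positivity rate into negative tests per positive test is one line of arithmetic. A small sketch, using the rounded positivity figures quoted in this post (54% and 3.8%), so the results are approximate and the unrounded data may give somewhat different numbers; the function name is mine.

```python
def negatives_per_positive(positivity):
    """Number of negative tests for each positive test, given the
    share of tests that are positive (between 0 and 1)."""
    return (1 - positivity) / positivity


idaho = negatives_per_positive(0.54)    # roughly 0.85
alaska = negatives_per_positive(0.038)  # roughly 25

print(f"Idaho: {idaho:.2f} negatives per positive")
print(f"Alaska: {alaska:.1f} negatives per positive")
```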

A second thing that makes this unlikely is the relationship with hospitalizations. If similar case rates reflected similar underlying infection rates, we’d expect them to translate into similar hospitalizations. In fact, what we see is that in places which are doing more testing (say, the state-months in the highest 20% in terms of per capita testing rates) each detected case has about a 2.8% chance of showing up as a hospitalization 4 weeks later. In those in the bottom 20%, this figure is 4%. Assuming that the actual risk of hospitalization is similar, this suggests we are missing cases in the lower testing locations.
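
To see what these two figures imply, suppose (as the argument above does) that the true chance of hospitalization per *infection* is the same everywhere. Then the ratio of the two figures gives the share of infections that low-testing places detect relative to high-testing places. A back-of-the-envelope sketch with that assumption baked in:

```python
def relative_detection(hosp_per_case_high_testing, hosp_per_case_low_testing):
    """Detection in low-testing places relative to high-testing places.

    Hospitalizations per detected case
        = (hospitalizations per infection) / (detected cases per infection).
    If hospitalizations per infection are equal in both groups (an assumption),
    the ratio of detection shares (low / high) is the inverse ratio of
    hospitalizations per detected case.
    """
    return hosp_per_case_high_testing / hosp_per_case_low_testing


share = relative_detection(0.028, 0.04)
print(f"Low-testing places detect about {share:.0%} as many infections")
```

Under that assumption, low-testing places are picking up only about 70% of the infections that high-testing places would find. And since high-testing places surely miss cases too, the true shortfall is larger.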

None of this is to say that case rates are not reflecting infection rates at all. If infection rates are very low, as in a place like New Zealand, even with a lot of testing you won’t turn up many cases. But in our US context, it’s likely that case rates do not tell the whole story.

**Why Does This Matter?**

This is a bit of a cautionary tale about case rates. When we look at, say, the New York Times rankings of states, it’s probably not quite right to assume those with higher case rates necessarily have higher infection rates. But does this actually matter for anything?

I would argue yes. For example: we are often basing reopening decisions on case rate numbers. California has said, for example, that schools can open if case rates drop below 28 per 100,000. But case rates in this range (say, 24 to 28) in the data reflect ranges of positivity between 2.3% and 17%. Moreover, the link with testing volume makes this manipulable. If you want fewer cases, testing less can deliver that.
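
The identity makes both points easy to quantify. The sketch below takes 28 per 100,000 as a representative case rate and uses the two positivity extremes quoted above to back out the implied testing volume. It then shows what happens to the case rate if testing is cut in half while positivity stays put. Holding positivity fixed when testing changes is a simplifying assumption for illustration, not something the data guarantee, and the function names are mine.

```python
def implied_tests_per_100k(case_rate_per_100k, positivity):
    """Testing volume implied by case rate = positivity x tests per capita."""
    return case_rate_per_100k / positivity


def case_rate_after_testing_change(case_rate_per_100k, testing_multiplier):
    """New case rate if testing volume is scaled by testing_multiplier
    and positivity is assumed (for illustration) not to change."""
    return case_rate_per_100k * testing_multiplier


heavy_testing = implied_tests_per_100k(28, 0.023)  # positivity of 2.3%
light_testing = implied_tests_per_100k(28, 0.17)   # positivity of 17%
halved = case_rate_after_testing_change(28, 0.5)

print(f"Tests per 100k at 2.3% positivity: about {heavy_testing:.0f}")
print(f"Tests per 100k at 17% positivity: about {light_testing:.0f}")
print(f"Case rate after halving testing: {halved:.0f} per 100k")
```

The same case rate of 28 is consistent with roughly 1,200 tests per 100,000 people or roughly 165, and in this simple setup halving testing halves the reported case rate without any change in the underlying pandemic.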

This is another reason to worry. A focus on case rates disincentivizes testing. I realize I’m a Rhode Island booster in general but one of the things that the state has done really, really well is testing. Last weekend I took my kids by the zoo after violin lesson for a COVID test; drive up, out in 5 minutes, text result in an hour. But ramping up testing has contributed to higher case rates. From a policymaker’s standpoint, there is a temptation to lower testing rates to make it seem like things are improving.

From a purely practical standpoint, if you’re hoping to predict hospitalizations, this matters because the share of detected cases that end up in the hospital varies with the testing rate: the same case count predicts more hospitalizations in a place that tests less.

All this adds up to say: how we measure the pandemic does matter, and this particular case-focused approach has some limitations.

**Could We Do Better?**

Can we improve on the current reporting? It’s worth noting first that there is no way to back out infection rates reliably. Case rates are what we call a *lower bound*: the infection rate cannot be below that. But we have no good way to get an upper bound, or a good point estimate. *That’s why we all said back in April we needed random testing.* But anyway.

There are clearly more sophisticated ways to engage in modeling here that could improve what we can learn. One option I toyed with for the purposes of this post was to look at the relationship between marginal increases in testing and positivity rate holding constant underlying prevalence (say, with state-week fixed effects), which could be applied to the distribution broadly and used to correct the rates. (Don’t worry if this doesn’t make sense!) In the end, I think virtually anything in this space is likely to be subject to a lot of concern about what assumptions go into it. So I’m not sure there is a “right” solution (although smarter people than I may have ideas, which I would love to hear).
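
For the curious, here is a toy version of what a fixed-effects approach does, on simulated data. This is not the analysis described above (which was not carried out for this post). Everything here is an assumption made for illustration: the simulated numbers, the built-in slope of -0.5, the idea that there are several observations (days or counties, say) within each state-week, and all function names. The sketch only shows why comparing within a state-week can recover the testing-positivity relationship when a pooled comparison cannot.

```python
import random
from collections import defaultdict


def pooled_slope(x, y):
    """Plain OLS slope of y on x, ignoring groups."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    den = sum((xi - mean_x) ** 2 for xi in x)
    return num / den


def within_group_slope(groups, x, y):
    """OLS slope of y on x after subtracting each group's mean from x and y
    (the 'within' estimator, equivalent to including group fixed effects)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for g, xi, yi in zip(groups, x, y):
        totals[g][0] += xi
        totals[g][1] += yi
        totals[g][2] += 1
    num = den = 0.0
    for g, xi, yi in zip(groups, x, y):
        sum_x, sum_y, n = totals[g]
        x_dev = xi - sum_x / n
        y_dev = yi - sum_y / n
        num += x_dev * y_dev
        den += x_dev * x_dev
    return num / den


def simulate(true_slope=-0.5, n_groups=200, obs_per_group=7, seed=0):
    """Simulated (not real) data: each group is a hypothetical state-week."""
    rng = random.Random(seed)
    groups, log_tests, log_positivity = [], [], []
    for g in range(n_groups):
        prevalence = rng.gauss(0.0, 1.0)  # unobserved, constant within a group
        for _ in range(obs_per_group):
            # Assumption: places with more infection also test more.
            t = 0.8 * prevalence + rng.gauss(0.0, 0.3)
            # Assumption: holding prevalence fixed, more testing lowers positivity.
            p = prevalence + true_slope * t + rng.gauss(0.0, 0.1)
            groups.append(g)
            log_tests.append(t)
            log_positivity.append(p)
    return groups, log_tests, log_positivity


groups, log_tests, log_positivity = simulate()
naive = pooled_slope(log_tests, log_positivity)
within = within_group_slope(groups, log_tests, log_positivity)
print(f"pooled slope: {naive:.2f}, within state-week slope: {within:.2f}")
```

In this simulation the pooled slope comes out positive, because places with more infection both test more and have higher positivity, while the within-group slope lands close to the negative value that was built in. Whether real data cooperate this nicely is exactly the kind of assumption that would draw concern.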

I think we can do better, though, by recognizing these issues and focusing on a variety of variables rather than targeting just one. I would argue *no* reopening decisions should be based on a single metric — not case rates, not positivity rates, not any one number. Ideally we’d have an approach which combines a whole set of information to make an informed decision.

Either that, or random testing. I’m never giving up on that.