My own data collection efforts continue. If you do want to help contribute to the child care data, as imperfect as it is, check out the constantly updated COVID-Explained post here. Lots of notes there on what data we have, and how you can help us get more. We’ve included a draft letter to send to your provider. We’d love to start including more schools and camps.

Data Sources: What are they and How to Understand Them

I was planning this post last week, and then of course this weekend our President said that we should do less testing since we’d then have fewer infections, and my head exploded a little bit. So let’s start by saying that doing less testing WILL NOT PRODUCE FEWER INFECTIONS. This isn’t how it works.

If it were not so important, it would be a great introductory stats example for confusing the direction of causality. You see more testing and more cases move together, and you assume the testing is driving the cases. Students could explain why this makes no sense.

So, this idea is ridiculous. But in practice it is the case that with our current data availability, tracking the epidemic is hard. What’s the best way to do it? To think about this, we’ll start with the first question: Why do you want to know? And then check out where we can find data, and how to read them.

Why do you want to know?

There are a few reasons why one might want data on the pandemic. One is to study different policy responses and which of them worked; this is probably a task for a much later time. Another reason is to track whether hospitals are getting overwhelmed and try to prepare.

From the standpoint of an individual, though, I think the main reason we might want data is to figure out how much spread there is in our community so we can decide on our level of external engagement. When we think about how much to get out, it clearly matters if the COVID-19 rate in our area is 1%, or 10% or 0.1% since that impacts the chance that we encounter someone with the virus.
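To see why the difference between 0.1%, 1% and 10% matters so much, here is a rough sketch of the chance that at least one person you meet carries the virus. It assumes each person you encounter is infected independently with the same probability, which is a simplification; the function name and the choice of 10 contacts are made up for illustration.

```python
def chance_of_encounter(prevalence: float, contacts: int) -> float:
    """Chance that at least one of `contacts` people is infected.

    Assumption (not from any data source): each person is infected
    independently with probability `prevalence`.
    """
    return 1.0 - (1.0 - prevalence) ** contacts


# Hypothetical scenario: you come across 10 people on an outing.
for prevalence in (0.001, 0.01, 0.10):
    chance = chance_of_encounter(prevalence, contacts=10)
    print(f"prevalence {prevalence:.1%}: chance of meeting a carrier {chance:.1%}")
```

Under these assumptions the chance is roughly 1% at 0.1% prevalence, a bit under 10% at 1% prevalence, and about 65% at 10% prevalence. Encountering someone with the virus is not the same as catching it, so read this as a measure of exposure, not of infection risk.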

So for this post I really want to focus on this question: how can I best understand the current viral level in my location so I can make good choices? Questions like “What’s the best way for states to track their progress and protect their hospitals?” I will leave for another day.

What would be ideal?

Best case scenario would be random or universal testing of everyone, symptoms or not. I wrote about why in an earlier newsletter here. The bottom line is that if we sampled people randomly and tested them, we’d be able to get a sense of the level of infection in the population, including anyone without symptoms or with mild symptoms.

As companies start to return to work and universities contemplate their return to campus, I think we will start seeing more testing like this in certain populations (for example, see some discussions here). But we do not have it yet.

Where’s the Data we do Have?

Perhaps surprisingly, this is an easier question than how to interpret the data. The best tracking data on the pandemic come from Johns Hopkins. You can check out their site here. There is a tremendous amount on the site: they have excellent data on testing and tracking, and their case count data go down to the county level in many places.

The other excellent source is the New York Times, although some pieces are behind a paywall. They also do not have an obvious place where they report the number of tests. These data are easier to read but less comprehensive.

Individual states also have some of their own data. For example, if you live in Rhode Island, you can check out all our data here!

What data should I pay attention to?

We have basically four pieces of data that you could look at.

  • Deaths
  • Hospitalizations
  • Number of cases (positive tests)
  • Total number of tests run

Of these, the first two are the least complicated to understand and the most reliable. It’s not perfect, but we generally do a good job of recording deaths. Similarly, hospitalizations are fairly well reported in most areas and we know COVID-19 status for most hospitalized patients.

When we go back to analyze the pandemic from the standpoint of the future, I suspect these are the data we will use most extensively. However, from the standpoint of current decision-making, the timing of these data is not great. For your choices, you want to know how many people have the virus now. Generally we think hospitalizations will lag cases by a couple of weeks, and deaths by perhaps a month. Basically, by the time you see a lot of people in the hospital, there have already been several weeks of high case rates.

For current tracking, the best we can do is data on case counts, which comes from tests. You can access data for your state from Johns Hopkins.

The first set of pictures is number of positive cases, the second is counts of tests per person and the third is the share of tests that are positive.

What you want to know is: of all the people around, what share of them carry the coronavirus? This is really the decision relevant information. The question is: how much can you learn about this thing you’re interested in from the information in the above graphs?

Based on the data we have, there are really two things that you could rely on: (1) the number of positive tests (sometimes called “the case count”, the thing most commonly reported) or (2) the share of tests which are positive. Neither is quite right, although the biases are in opposite directions.

To start, think about the case counts. Let’s imagine 10% of people in some population have the virus, and you test 200 of them. You’ll get 20 cases. Now test 2,000 of them. You’ll get 200 cases. Now test 20,000 of them. You’ll get 2,000 cases. The decision-relevant thing, the share of people with the virus, hasn’t changed. But your case counts have gone up a lot.

What this means is that as you test more, you’ll get more cases. Conversely, if you test less, you’ll detect fewer cases. This does not mean the virus is less prevalent. (In fact, testing is really, really important since it allows us to detect and isolate asymptomatic positive people.) What this does mean is that as places test more, we expect them to detect more cases and it doesn’t necessarily mean the virus is getting more prevalent.
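The arithmetic in the example above can be written out as a short sketch. It assumes tests are given to a random sample of the population, so the expected number of cases is just prevalence times tests; the function name is made up for illustration.

```python
def expected_cases(prevalence: float, tests: int) -> float:
    """Expected positive tests if people are tested at random.

    Assumption: every tested person has the same chance
    (`prevalence`) of carrying the virus.
    """
    return prevalence * tests


prevalence = 0.10  # the 10% from the example in the text
for tests in (200, 2_000, 20_000):
    cases = expected_cases(prevalence, tests)
    share = cases / tests
    print(f"{tests:>6} tests -> {cases:>5.0f} cases, share positive {share:.0%}")
```

The case count grows a hundredfold across the three rows while the share positive stays at 10%, which is the whole point: with random testing, the count tracks how much you test and the share tracks how much virus there is.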

Now let’s think about our other quantity, the share of positive tests. You could say: “We tested people, and only 1% of them were positive”, and you might well perceive this as being better than seeing 10% positive or 20% positive. And if we were testing people randomly, this would be exactly the right number to report and would tell us what we want to know.

BUT: in practice, testing is not random and typically the highest risk people are tested first. If you had a shortage of tests, as many places had and still do, you would likely not use them on healthy people. What this means is that the share of positive tests is typically higher when fewer tests are run. Note that this is not mechanical the way that the link between tests and case counts is. It isn’t necessarily true. But in practice, this is the pattern, just given the way tests are used.

What we tend to see, then, is that as places increase their testing the share of positive tests goes down. This is good in the sense that it’s getting closer to capturing the actual population level rate, but it’s worth being cautious about the trends in this over time. It can be tempting to conclude that the virus is diminishing (which it might be!), but the decline could instead reflect changes in the tested population.
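Here is a toy sketch of why the share positive falls as testing expands even when nothing about the virus changes. Every number in it is invented for illustration: a population of 100,000 where 5% are "high risk" (say, symptomatic) with a 30% infection rate and everyone else has a 1% infection rate, and where tests go to the high-risk group first.

```python
def share_positive(tests: int,
                   population: int = 100_000,
                   high_risk_share: float = 0.05,
                   high_risk_rate: float = 0.30,
                   low_risk_rate: float = 0.01) -> float:
    """Expected share of positive tests when high-risk people are tested first.

    All default values are hypothetical, chosen only to illustrate the pattern.
    """
    if tests <= 0:
        raise ValueError("tests must be positive")
    high_risk_people = population * high_risk_share
    # Tests go to the high-risk group until it is exhausted...
    tests_high = min(tests, high_risk_people)
    # ...and whatever is left goes to everyone else.
    tests_low = min(tests - tests_high, population - high_risk_people)
    positives = tests_high * high_risk_rate + tests_low * low_risk_rate
    return positives / (tests_high + tests_low)


for tests in (1_000, 5_000, 20_000, 100_000):
    print(f"{tests:>7} tests -> share positive {share_positive(tests):.2%}")
```

In this made-up population the true infection rate is 2.45% throughout. The share positive starts at 30% when only high-risk people are tested, drops to about 8% at 20,000 tests, and only reaches 2.45% when everyone is tested. The virus never changed; only who got tested did.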

Bottom line, though — what should you be looking at, if nothing is perfect? If you live in a state with a fairly robust testing program — that is, they are doing a lot of tests, and a pretty consistent number per day — then my sense is the best thing to look at is the share of tests which are positive. This will come closest to the figure you care about.

For example: In the Johns Hopkins data for Florida, tests are very flat at around 1.3 per 1000 people, and the share positive is going up sharply. This tells you the epidemic is getting worse. Case counts are going up, sure, but it looks like rates are too.

In Illinois, things look a lot better. Testing rates are pretty flat and the share positive is falling fast.

Much harder to learn from is a place like South Carolina. Tests have increased a lot. Cases are going up. Share positive is all over the map. Even putting aside the fact that the big spike must be some kind of odd reporting thing, the fact that the share positive has declined in the last week may reflect the fact that they are testing more.
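The rule of thumb I am using in these three examples can be sketched roughly as follows. This is only an illustration of the reasoning, not a real analysis tool: the function name, the 15% tolerance for calling testing "flat", and the sample numbers are all invented, and comparing just the first and last values is a crude way to read a trend.

```python
def read_trend(tests_per_1000: list[float],
               share_positive: list[float],
               flat_tolerance: float = 0.15) -> str:
    """Very rough reading of a state's recent testing data.

    `flat_tolerance` is an arbitrary, hypothetical cutoff: testing counts
    as "flat" if its range is within 15% of its average.
    """
    mean_tests = sum(tests_per_1000) / len(tests_per_1000)
    spread = (max(tests_per_1000) - min(tests_per_1000)) / mean_tests
    if spread > flat_tolerance:
        # Who is being tested is probably changing, so share positive is murky.
        return "testing is changing: hard to interpret"
    change = share_positive[-1] - share_positive[0]
    if change > 0:
        return "testing is flat, share positive rising: likely getting worse"
    if change < 0:
        return "testing is flat, share positive falling: likely improving"
    return "testing is flat, share positive flat: roughly stable"


# Invented numbers, loosely shaped like the three cases described above.
print(read_trend([1.3, 1.3, 1.2, 1.3], [0.04, 0.06, 0.09, 0.12]))
print(read_trend([1.5, 1.5, 1.6, 1.5], [0.08, 0.06, 0.04, 0.03]))
print(read_trend([0.8, 1.2, 1.8, 2.5], [0.07, 0.12, 0.06, 0.09]))
```

The first call stands in for a Florida-like pattern, the second for an Illinois-like one, and the third for a South Carolina-like one where testing has grown too much for the share positive to be read cleanly.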

Hope this helps a bit. Or that you are now so confused you storm down to your local State legislature and demand that the state do better universal or random testing. Either way.