I teach a lot of classes on data, to a lot of different audiences: college students, graduate students, businesspeople, people who find me through the newsletter or Instagram. Once, I taught a data class to a middle-school assembly. My favorite way to start these classes — an opening that I’ve used with all of these audiences — is this:

Here is a fact from the CDC: In 2017-2018, 42.4% of American adults were obese. And here is my question: How do they know?

This is a simple question, and one that many people have a knee-jerk reaction to: “They weighed people.” But of course, it’s not that simple. They do weigh people. But which people? And who does the weighing? And how do you go from weighing some people to a statement about all Americans?

Today I want to unpack the answers to these questions, a bit like the way I would in class. Understanding the answers here is a lens into thinking more generally about where data comes from, and whether we can be confident in what it tells us.

Important note: I’m using obesity rate throughout today’s post because it is a number that is often cited in policy discussions, and it’s a good illustration of general principles. There are many good arguments for why BMI-based measures like this do not reflect a person’s health and why we should move away from a focus on them. (For more, I highly recommend Virginia Sole-Smith — we’ll be discussing her upcoming book, Fat Talk: Parenting in the Age of Diet Culture, in the newsletter early next year.) I’m using this example to talk about analytic concepts, not because there’s great value in the data itself. 

The data source

Let’s start by asking what the ideal way to measure this would be, from a data standpoint. If you wanted to know the exact right number at all times, you’d want to basically force everyone to weigh themselves every day and have that data be uploaded to some government server. From this, we’d be able to precisely calculate the share of people at any particular weight. Data-wise, great. In all other ways, terrifying and horrible. It’s not The Handmaid’s Tale over here (yet), so that’s not the way this is done.

Slightly more realistically, there are a number of existing data sources for this information. A large-scale example is anonymized medical records, which would provide weight from yearly check-ups. Apps are another possibility. People who use a fitness or diet tracker may enter their weight as part of that; certain diet trackers will link electronically to a scale, so this information is entered automatically. Some states will collect your weight when you apply for or renew a driver’s license.

In principle, these may seem like good ways to measure America’s weight. They’re easily accessible, in the sense that the information doesn’t have to be collected anew, and they potentially cover many, many people. However: there are significant issues with these sources. One is that, in some cases, they rely on self-reports, which may not be accurate. A more pernicious issue is that all of these samples are selected in various ways. That is to say: none are representative of the full U.S. population.

This lack of representativeness is obvious if we think about dieters or the wearers of fitness trackers. But even a population of medical records — while better — may not be representative. People who are engaged with their health such that they go to the doctor for well visits are, at least in some ways, different from those who do not. Using only that population, we’ll get a biased estimate of what we want to know.

In order to make a statement like the one that I attributed to the CDC above, we need to have data from a representative sample of Americans — the simplest way to think about this is a random sample. It is important to say that we do not need to sample all Americans. One of the wondrous, magic things about statistics and sampling is that we can sample a subset of people — even quite a small share — and make statements about the whole population. These statements will come with some error, and we can quantify that error. The catch is that this only works if our sample is representative of the whole population.
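To see that magic concretely, here is a minimal simulation. All the numbers are invented for illustration: a made-up population of a million people, 42.4% of whom have some trait, surveyed with a random sample of just 5,000.

```python
import math
import random

random.seed(0)

# Invented population: 1,000,000 people, 42.4% of whom have the trait.
population = [1] * 424_000 + [0] * 576_000

# Survey a simple random sample of just 5,000 of them.
sample = random.sample(population, 5_000)
p_hat = sum(sample) / len(sample)

# A standard 95% confidence interval for a proportion: this is the
# "quantified error" — the estimate plus a margin we can compute.
se = math.sqrt(p_hat * (1 - p_hat) / len(sample))
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se

print(f"estimate: {p_hat:.3f}, 95% CI: ({low:.3f}, {high:.3f})")
```

With 5,000 people the interval is only about a percentage point and a half wide, even though we sampled half a percent of the population. But this arithmetic is only valid because `random.sample` draws everyone with equal probability.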

The way the CDC actually gets these data on obesity (and on much other health data) is with a survey called the National Health and Nutrition Examination Survey (NHANES). The NHANES survey began in the 1960s and has run in more or less its current form since 1999. It includes roughly 5,000 individuals each year, and it’s run continuously with data released every two years.

The NHANES has two components. There is a survey component, which asks questions about demographics (race, income, education), health conditions, and diet. This is the source for a lot of data on American dietary patterns; participants do a one- or two-day “dietary recall” in which they list everything they ate during those days. We also get detailed information about any existing health conditions and health behaviors.

The second component is the examination. This consists of a series of measurements, including medical and dental exams and laboratory tests. It is in this examination segment that information is collected on weight, along with blood pressure, laboratory measurements like triglycerides, and so on. The examination is done by NHANES staff in a series of carefully designed mobile examination units.

The NHANES is designed as a representative sample. In an ideal world, the way we’d do that is to randomly choose 5,000 people from the American population of 330 million and survey them. This is infeasible with a study like this for many reasons, most notably that you’d need to get your mobile examination units all over the country. Instead, what the survey does is choose 15 random counties each year, then random households within those counties, and then random people within those households. This approach allows the researchers to have the mobile units in a smaller number of locations. It also allows them to advertise the existence of the survey and to let people know what is going on.
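The county-then-household-then-person idea can be sketched in a few lines. Everything below is a toy sampling frame with invented sizes; the real NHANES design is more sophisticated (for example, it selects counties with probability related to their population), but the nesting is the same.

```python
import random

random.seed(1)

# Toy sampling frame: 500 counties, each with 50-100 households, each
# household with 1-4 residents. (Invented sizes; the real frame differs.)
frame = {
    f"county_{c}": {
        f"household_{h}": [f"person_{p}" for p in range(random.randint(1, 4))]
        for h in range(random.randint(50, 100))
    }
    for c in range(500)
}

# Stage 1: choose 15 counties at random.
chosen_counties = random.sample(sorted(frame), 15)

selected = []
for county in chosen_counties:
    households = frame[county]
    # Stage 2: choose 25 households within each chosen county.
    for household in random.sample(sorted(households), 25):
        # Stage 3: choose one person at random within each household.
        selected.append((county, household, random.choice(households[household])))

print(len(selected))  # 15 counties x 25 households = 375 people
```

The payoff is logistical: all 375 selected people live in just 15 counties, so 15 mobile-unit locations suffice.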

Again, it might seem like magic, but actually this approach to sampling — when done randomly — will give you a representative sample that you can use to reflect the U.S. population. Statistics is cool! That magic, though, happens when you are able to actually survey and examine everyone you sample. That is: the survey picks a set of people within each county, and the ability to draw conclusions based on that subset of the population is reliant on them actually surveying and examining those people they picked.

The main issue with this is non-response. Not everyone you contact wants to be surveyed, and even fewer people want to be weighed and have their blood drawn. It takes time, and it can be invasive. In the data, about half of the people contacted are willing to be surveyed, and slightly fewer are willing to undergo the examination. If refusal were random, this would be okay — you’d need to start with twice as many people, but you’d still do all right on being representative. The problem is that refusal is not random.

For example: Likely due to long-standing issues of mistreatment by the medical system, Black individuals are less likely to opt into the survey than those of other races. More-educated individuals are more likely to agree to be surveyed, on average, as are richer people. This means that the sample that you get is not random, and the data cannot simply be used as it is. The NHANES approaches (say) 10,000 people to get 5,000 responses; but even though the 10,000 people were randomly selected, the 5,000 are not.
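A small simulation shows why non-random refusal matters. All of the rates below are invented: two groups, one with a higher obesity rate and a higher propensity to respond.

```python
import random

random.seed(2)

# Invented numbers: group A has a higher obesity rate AND responds more often.
population = []
for _ in range(100_000):  # group A
    population.append({"obese": random.random() < 0.50,
                       "responds": random.random() < 0.8})
for _ in range(100_000):  # group B
    population.append({"obese": random.random() < 0.30,
                       "responds": random.random() < 0.2})

contacted = random.sample(population, 10_000)         # randomly chosen...
responders = [p for p in contacted if p["responds"]]  # ...non-randomly observed

true_rate = sum(p["obese"] for p in population) / len(population)
naive_rate = sum(p["obese"] for p in responders) / len(responders)
print(f"true: {true_rate:.3f}, naive estimate: {naive_rate:.3f}")
```

Because group A is over-represented among responders, the naive estimate lands well above the true rate, even though the contact list itself was perfectly random.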

At the end of the NHANES process, there is a data set of 5,000 individuals. On average, there are more white people in the sample than in the overall population, and more people with more education (among other imbalances). Reporting the obesity rate in the observed data would not be representative of the overall population.

So… what do you do?

Reweighting data

The short answer is that you “reweight” the data. Imagine that your data is on 10 people: 9 white people and 1 Black person. But your overall population has 7 white people and 3 Black people. If you want your data to represent your population in terms of race, you need to count your one Black person three times and each of your white people only 7/9ths of a time. In doing this, you are giving more weight to the person representing the group you do not have enough of and less weight to the people who you have too many of.
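Here is that toy calculation in code, using the counts from the example above:

```python
# The toy example from the text: a sample of 9 white people and 1 Black
# person, drawn from a population of 7 white people and 3 Black people.
sample_count = {"white": 9, "Black": 1}
population_count = {"white": 7, "Black": 3}

# Each person's weight is their group's population count divided by its
# sample count (both groups of 10 people, so the shares cancel).
weights = {g: population_count[g] / sample_count[g] for g in sample_count}

print(weights)  # each white person counts 7/9 of a time; the Black person counts 3 times
print(9 * weights["white"] + 1 * weights["Black"])  # the weights sum back to 10
```

The check in the last line matters: a valid set of weights redistributes influence across the sample without changing its total size.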

This reweighting can get very complicated when the sample is imbalanced in a lot of ways, as it is in the NHANES. Typically, the way it is done is by grouping people based on a set of characteristics (e.g. 20-to-39-year-old non-Hispanic white women living in urban areas with an intermediate median income) and then asking how the share of the people in the survey with that set of characteristics compares with the share of the overall U.S. population. Participants in this group are then assigned a survey weight, which tells researchers whether to up- or down-weight them in any overall statistics.
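Once every respondent carries a weight, any overall statistic becomes a weighted average. A sketch, with invented respondents and invented weights of the sort the grouping procedure might produce:

```python
# Invented respondents: (is_obese, survey_weight) pairs.
respondents = [
    (1, 0.78), (0, 0.78), (1, 0.78), (0, 3.00), (1, 1.20), (0, 1.20),
]

raw_rate = sum(o for o, _ in respondents) / len(respondents)
weighted_rate = sum(o * w for o, w in respondents) / sum(w for _, w in respondents)

print(f"unweighted: {raw_rate:.3f}, weighted: {weighted_rate:.3f}")
```

The two numbers can differ substantially whenever the up-weighted people look different from the down-weighted ones, which is exactly the situation the weights are designed for.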

There are two important subtleties to this. The first is that in order to do this, you need a good number of people in each group. If there is literally one Black person in your entire sample, you cannot count them for the entire Black population of the U.S. One implication of this is that a survey like the NHANES starts out by “oversampling” smaller population groups, to make sure they have enough people to do their weighting.
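Oversampling and weighting fit together neatly: sample the small group at several times its natural rate, then let inverse-probability weights undo the distortion. A sketch with invented group sizes and rates:

```python
import random

random.seed(3)

# Invented frame: the "small" group is 5% of a 1,000,000-person population.
population = ["small"] * 50_000 + ["large"] * 950_000

base_rate = 5_000 / len(population)                   # aim for ~5,000 respondents
rates = {"small": 4 * base_rate, "large": base_rate}  # oversample the small group 4x

sample = [g for g in population if random.random() < rates[g]]
n_small = sum(1 for g in sample if g == "small")

# Inverse-probability weights undo the oversampling in any statistic.
weights = {g: 1 / rates[g] for g in rates}
small_share = n_small * weights["small"] / sum(weights[g] for g in sample)

print(n_small, round(small_share, 3))
```

Instead of ~250 small-group members, the sample contains ~1,000 of them (enough to slice by age, sex, and so on), and the weighted share still comes out near the true 5%.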

A second issue, more pernicious, is that you can only weight based on things you see. This was brought home to me by a reader email, which was about a survey in the U.K. but has a similar feel:

I’ve just signed up for the U.K. future health study: 

I live in an ethnically diverse city with significant areas of deprivation.

However, at the initial screening I was struck by how similar the cohort was, mostly white middle-aged professionals who worked in The City (or at least had a job where they could get time off to attend). I’d say 75%+ had a Garmin [fitness watch] or equivalent, and we were all quite excited to get a free cholesterol check.

Some of the issues this person identifies (imbalances in race or age) are things we can deal with using the weighting procedures described. But some of these features — like Garmin ownership — are not things we measure, and therefore not things we can base our weights on.

The bottom line is that if the non-response to a survey like the NHANES is in part a function of unobservable differences across people (which it surely is), then we retain concerns about lack of representativeness. It can be difficult to know how important this is in magnitude, or even in what direction it biases our conclusions. We can sometimes speculate, but we cannot be sure.

I wish I could tell you that there is a good way to address these problems! It’s a topic I’ve studied in my academic work (see “A simple approximation for evaluating external validity bias,” written with the incomparable Isaiah Andrews), but we do not come up with any airtight solutions. The fundamental problem is that if your sample is selected based on features you cannot see — unobservables — then you’re kind of out of luck for making precise conclusions.

Summary and other contexts

When we want to make statements about characteristics of whole populations, be they the entire U.S. or something smaller or larger, we must use representative samples. If my goal were just to get the weights of 5,000 people, there are much simpler ways to collect that data than mobile clinics across the U.S. I could weigh people at the New York City Marathon, or people who go to a Packers game. But those approaches would yield a biased sample.

Yet even when we do our very, very best to sample in a representative way, we still run into problems if not everyone responds (and they do not!). We can up-weight and down-weight, and even when we do that, we are still not usually all the way there. It’s better than the Packers game! Not perfect, though.

The issues here come up all the time if you’re looking for them. Political polling, for example. Pollsters randomly sample people to call, but they definitely do not get a random sample of people answering them. There are many reweighting approaches to addressing these imbalances, but they do all run into the problem of unobservable selection (also, lying, but that’s for another day).

It is worth looking for these issues. We spend a lot of time, in this newsletter and in media in general, talking about issues like correlation versus causation. Those are important! But the more mundane question of where data comes from, and what it really measures — this is crucially important too.