One of the classes I sometimes teach at Brown is a first-year seminar called The Power of Data. On the first day of class, I start with this fact, from the CDC: In 2017-2018, 42.7% of Americans were obese. I ask students a simple question: How do we know this?
Usually someone will respond with their first instinct: “Well, we know because we weighed them.” And then I ask the obvious follow-up: “Do you remember being weighed?” They do not, of course, and we are off and running. In fact, the figure is based on data from the National Health and Nutrition Examination Survey, which includes about 11,000 people each year. This leads into a discussion about sampling — when is this sample size sufficient, and for what conclusions?
It also leads us to a conversation about how we communicate about data. Why do we say 42.7% of Americans are obese, when we really mean 42.7% of this particular sample? When do we need to give more details?
The rest of the course is much like this, trying to help my students see where data comes from, what we can do with it, and what its limitations are.
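To put a rough number on the sample-size question from the first day of class, here is a minimal sketch of the textbook margin of error for a proportion. It assumes a simple random sample, which is a simplification I am making for illustration: the actual survey uses a more complex design with weights, so its true uncertainty is somewhat larger. The function name `margin_of_error` is hypothetical, and the inputs are just the figures quoted above.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample (an illustrative simplification).
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


p_hat = 0.427  # share quoted in the text
for n in (100, 1_000, 11_000):  # 11,000 is the sample size quoted in the text
    moe = margin_of_error(p_hat, n)
    print(f"n = {n:>6,}: {p_hat:.1%} +/- {moe:.1%}")
```

Under these assumptions, a sample of about 11,000 pins down a national rate to within roughly one percentage point. The same sample gets thin quickly if you start slicing it into small subgroups, which is the "for what conclusions?" part of the question.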
The point of this story is that I like data. A lot. It’s what gets me up in the morning. I like collecting data, but even more than that, I like thinking about how to learn from it. How can we use what we see to tease out relationships that might be hard to see? What statistical methods are useful, and which are less useful? How can we make them better, or at least evaluate their quality?
Formalizing statistical methods is a big part of my academic profile. Explaining statistics is a huge part of my teaching and writing. In a sense, a roundup of what I write about statistics seems like it would be just a list of everything I do. But! In the past year, especially through the lens of COVID, I’ve been writing more directly about particular statistical topics. So I thought it would be useful to pull everything together in one place. Think of it as your own mini-course.
I’ve tried to organize these roughly in order of how I’d teach them in a class.
- Selection in Practice: The Value of Randomness. Main lesson: If you want to know how common some characteristic is in the whole population, you have to randomly sample people. You cannot learn the obesity rate in the U.S. by weighing only people in Providence, Rhode Island, just as you cannot learn the COVID rate in the population by testing only the people who show up for your study.
- Positivity and Case Rates. Main lesson: A corollary to the above (or, maybe better stated, an example). Because of our garbage approach to data collection during COVID-19, it is extremely difficult to learn anything from case rates or positivity. Some places test more than others, and this mucks up our ability to compare.
- Power (the Statistical Kind). Main lesson: There may be limits to the kinds of questions data can answer, because the sample size you need grows as the effect you’re looking for shrinks. If you want to identify very small effects, you need to have a really, really large amount of data. Which might be infeasible!
- Epidurals and Autism. Main lesson: Correlation is not the same as causality. (Note: I write a version of this lesson frequently; this is just the newsletter that came to mind. Also, some people didn’t like the title. It was sarcastic! Maybe I am not as funny as I think I am.)
- Welcome to Econ 1430. Main lesson: Let’s say we do not have a randomized trial. How can we use data to draw more convincing causal conclusions? This post is based on another class I teach at Brown, about social policy. It goes through how economists think about evaluating the impacts of interventions without the benefit of randomization.
- Coffee, Vitamins and Longevity. Main lesson: The problems with observational data might be even worse than we think! When we produce public health advice, the people who respond may be different from those who do not, and the result can reinforce (incorrect) research findings.
- Bayes Rule Is My Faves Rule. Main lesson: Data is great, but there is also a role for our prior beliefs in making decisions. If something is highly unlikely to be true, then we should be skeptical even in the face of data that makes it slightly more likely to be true.
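
Two of the lessons above, power and Bayes' rule, come down to short formulas, so here is a rough sketch of both. The power function uses the standard normal-approximation formula for comparing two group means with a standardized effect size; the 5% significance level and 80% power are conventional defaults I am assuming, not numbers from the newsletters. The function names and the example probabilities are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """People needed in each of two groups to detect a standardized
    difference in means (two-sided test, normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def posterior(prior, p_data_if_true, p_data_if_false):
    """Bayes' rule: probability a claim is true after seeing the data."""
    numerator = p_data_if_true * prior
    return numerator / (numerator + p_data_if_false * (1 - prior))


# Power: smaller effects need much more data.
for d in (0.5, 0.1, 0.01):
    print(f"effect size {d}: {sample_size_per_group(d):,} per group")

# Bayes: a claim with a 1% prior, and data that is twice as likely
# if the claim is true as if it is false (made-up numbers).
print(f"posterior: {posterior(0.01, 0.8, 0.4):.3f}")
```

Because the effect size is squared in the denominator, an effect one-tenth the size needs about a hundred times the sample. And in the Bayes example, evidence that is twice as likely under the claim only moves a 1% prior to about 2%, which is why skepticism is still warranted.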
Read the newsletters and still dying for more equations? I’ve got a bit of academic work on this, too.
- Unobservable Selection and Coefficient Stability
- Health Recommendations and Selection in Health Behaviors
- A Simple Approximation for Evaluating External Validity Bias
If you really want to dive into econometrics and statistics, you’ll have to go (well) beyond me. The best place I can send you is Mostly Harmless Econometrics. Not completely harmless, but mostly.