Statistical Significance—and Why It Matters for Parenting

Fake studies, p-hacking, publication bias, and more

Emily Oster

3 min Read
One of my big goals with ParentData is to enhance data literacy. Yes, I want to be here to talk you down from the panic headlines you read. But I also want to give you the tools to talk yourself down, at least some of the time.

Today, we’re doing data literacy. I’m going to talk about statistical significance.

The phrase “statistically significant” is probably one most people are familiar with, or have read before. Media coverage of academic results often uses it — as in, “In this study, the relationship was statistically significant.” In common parlance, I think people often read it as meaning “true”: the phrase seems to imply that the result is right, or real.

One reason that this conclusion might be wrong, of course, is that a lot of studies estimate correlations, not causal relationships. That’s an issue I discuss all the time. There is a second issue, though, which is what I’ll be going into here. Even in an experiment — where we have a randomized treatment, so we are more confident that the results we see are causal — understanding what the results mean requires a real understanding of statistical significance.  

That’s our job for today. No bottom line on this one — you’ve got to read it! But if you get to the end, there is some funny stuff about a dead fish, so hang in there.

What does “statistically significant” mean?

When we say an effect is “statistically significant at the 5% level,” what this means is that there is less than a 5% chance that we’d see an effect of this size if the true effect were zero. (The “5% level” is a common cutoff, but things can be significant at the 1% or 10% level also.) 

The natural follow-up question is: Why would any effect we see occur by chance? The answer lies in the fact that data is “noisy”: it comes with error. To see this a bit more, we can think about what would happen if we studied a setting where we know our true effect is zero. 

My fake study 

Imagine the following (fake) study. Participants are randomly assigned to eat a package of either blue or green M&Ms, and then each one flips a (fair) coin and we record whether it comes up heads. The analysis compares the share of heads flipped after eating blue versus green M&Ms and reports whether the difference is “statistically significant at the 5% level.”

To be clear: because this is a fair coin, there is no way that the color of the M&Ms eaten would influence how many heads you flip. So the “right” answer is that there is no relationship.

I used a computer program to simulate what would happen if we ran this study, assuming 50 people in each group. First, I ran the study one time. I calculated the share of heads in the blue and green groups and subtracted one from the other. This is my “treatment effect” — the impact of eating blue M&Ms relative to green on the share of heads. In this run, the blue group flipped 58% heads and the green group 54% — a slightly higher share for blue, but a small difference.

When I ran my statistical test, that difference was not statistically significant. Basically, I get nothing — I can’t rule out that the two groups are tossing heads at the same rate.

However: we all know that when we flip a coin, or do anything else random, sometimes you get a string of heads — just by chance. Sometimes if you do 50 coin flips, 40 of them come up heads. It’s not likely! But it’s not impossible. And that fact means that, sometimes, even if there is no real relationship between M&M color and flipping heads, you can find one by accident.

After I did my study one time, with the results above, I did it 99 more times. In the end, I have 100 (computer-generated) versions of the same test. For each of them, I calculated the difference in the share of heads produced by the blue group versus the green group, and whether that difference was significant. The graph below shows the differences; the bars in yellow are differences I found that were significant. The dark-blue bar is the original result from my first experimental run.

Look at the first bar — all the way to the left, in yellow. In that version of my experiment, the blue M&M group was much less likely to flip heads than the green group, and the result showed up as statistically significant even though it was definitely, definitely just by chance.

In fact, out of the 100 times I ran this experiment, the data showed 5 in which there appeared to be a statistically significant relationship at the standard 5% level. This isn’t a mistake; it’s not something I messed up. It is, in fact, the definition of statistical significance at the 5% level: if the true impact of some treatment is zero, and you run the experiment 100 times, you expect to see a significant effect 5 of the 100 times.
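The simulation above is easy to reproduce. Here is a minimal Python sketch (my own, not the original program, which the article doesn’t show) that runs the fake M&M study 100 times with 50 people per group and counts how many runs come out significant; it uses a pooled two-proportion z-test, which is an assumption on my part since the article doesn’t name the test used.

```python
import math
import random

def run_fake_study(n_per_group=50, rng=None):
    """One run of the fake study: every participant flips a fair coin.
    Returns (difference in share of heads, two-sided p-value)."""
    rng = rng or random.Random()
    heads_blue = sum(rng.random() < 0.5 for _ in range(n_per_group))
    heads_green = sum(rng.random() < 0.5 for _ in range(n_per_group))
    share_blue = heads_blue / n_per_group
    share_green = heads_green / n_per_group
    # Pooled two-proportion z-test; the true difference is zero by construction.
    pooled = (heads_blue + heads_green) / (2 * n_per_group)
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    if se == 0:
        return 0.0, 1.0
    z = (share_blue - share_green) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return share_blue - share_green, p_value

rng = random.Random(42)  # fixed seed so the run is reproducible
results = [run_fake_study(rng=rng) for _ in range(100)]
n_significant = sum(p < 0.05 for _, p in results)
print(f"{n_significant} of 100 null experiments significant at the 5% level")
```

Each run should land near 5 significant results out of 100 on average, though any single batch of 100 will bounce around that number — which is itself a reminder that chance is noisy.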

The real world 

How does this help us understand studies in the real world?

Now imagine someone says they have run a study to evaluate the impact of eating blue versus green M&Ms on performance on a test of multiplication speed. We know from the above example that even if there were no impact, if I ran this study 100 times we would expect to find a significant effect 5 of the 100 times.  

The researchers report an effect: green M&Ms make you multiply faster. There are two possible explanations for that finding. One is that there is actually an impact of M&M color on multiplication speed. The other is that this is a result that arises by chance — it’s one of the 5 in 100.  

If we were very sure that the researchers ran this study only one time, we could be pretty confident in the result — there is always the possibility that this is one of the “chance” significant findings, but the high level of significance gives us confidence. 

But: in the real world, we cannot be sure that we are seeing all the research that actually happened before publication. This leads to concerns about “publication bias” and “p-hacking,” which, in turn, should make us skeptical. More on these below, but for a quick visual explanation, the xkcd cartoon discussed in the next section is a good place to start.

Publication bias and p-hacking

Publication bias and p-hacking are two shorthand, jargony ways to describe journal and researcher behaviors that make it more likely that the results we observe in published papers are occurring just by chance.

First: Academic journals are more likely to publish papers that find significant results. It’s not hard to see why this might be true. It’s not very interesting to publish a result saying that M&M color doesn’t impact multiplication speed — that’s kind of what we expected. But a result that says it does matter — that’s more surprising, and more likely to spark the interest of a journal editor.

This is what we call publication bias, and it turns out that this pattern means that the results we see in print are actually a lot more likely to be statistical accidents. Often, many researchers are looking into the same question. It’s not just my research team who is interested in the M&M-multiplication relationship — imagine there are 99 other teams doing the same thing. Even if there is no relationship, on average 5 of those teams will find something significant. 

These 5 “successful” teams are more likely to get their results published. That’s what we all see in journals, but what we do not see is the 95 times it didn’t work. When we read these studies, we’re assuming, implicitly, that we are seeing all the studies that were run. But we’re not, and we’re more likely to see the significant-by-chance results.

The issue of publication bias would be problematic just on its own. But it’s even more problematic when it interacts with researchers’ incentives. Researchers need to publish, and (see above) it is easier to do so when results are significant. This can lead to what people sometimes call p-hacking (the “p” stands for probability).

When researchers run a study, there are often a lot of ways to analyze the data. You can analyze the impact on different subgroups of the population. You can analyze the impact of different circumstances. You can test many different treatments. The idea of the xkcd cartoon is that you could test the impact of all the different M&M colors on some outcome. 

The more of these tests you do, the more likely you are to get a significant effect by chance. If you do 100 tests, you expect 5 of them to be significant at the 5% level. And then, because of publication bias, you write up the results focusing only on the significant subgroups or significant M&M colors. Of course, those results are just accidental. But as consumers of research, we do not see all the other things that happened in the background.
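A bit of arithmetic shows how fast multiple testing inflates the chance of a fluke. If each test has a 5% false-positive rate and the tests are independent, the chance that at least one of 20 tests (say, one per candy color, in the spirit of the cartoon — the count of 20 is my illustration) comes up significant by accident is:

```python
alpha = 0.05  # false-positive rate of a single test at the 5% level
k = 20        # number of independent tests run

# Probability that at least one test is significant purely by chance:
# the complement of all k tests coming back non-significant.
p_at_least_one = 1 - (1 - alpha) ** k
print(f"{p_at_least_one:.0%}")  # about 64%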

For these two reasons, some of what we see published, even if it is from a randomized experiment, is likely to be a result of statistical chance. There is a somewhat notorious paper arguing that “most” research findings are false; I think that overstates the case, but it’s a perspective.

Is everything wrong?

Is all lost? Is all research garbage?

No, definitely not. Much research is great, and randomized trials are often compelling and convincing. How do we know which new studies are worth paying attention to? There is no simple answer here, but there are a couple of features that matter.

One very, very important issue is the number of people in the study, especially relative to the expected size of the impact. Before researchers run a study, they generally have an idea of the effect size they might find, based on prior information. For example, if I am studying the impact of M&M color on math performance, I would expect that effect to be very small (if it is there at all). On the other hand, if I’m studying the impact on math performance of keeping someone awake for 48 hours, I might expect that effect to be large.

The larger the effect you expect, the smaller the sample of people you need to detect it statistically. This is called “statistical power,” and in well-designed experiments, researchers calculate this before they even start their study. They make an educated guess at the size of the impact, and they figure out how many people they need to detect that effect. 

When this doesn’t happen — when researchers do a small experiment on something where we expect the impact to be very small — we should not expect to learn anything from the data. The study is just not powered to detect the likely effects. If we do see an impact, it’s extremely likely that it is a statistical accident.
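To see how expected effect size drives the sample size you need, here is a back-of-the-envelope power calculation using the standard normal-approximation formula for comparing two proportions, at the common defaults of 5% significance and 80% power. The specific scenarios (51% vs. 50% heads, 65% vs. 50%) are my own illustrations, not numbers from the article.

```python
import math

def n_per_group(p_control, p_treated):
    """Approximate participants needed per group to detect a difference
    between two proportions at 5% significance with 80% power
    (standard normal-approximation formula)."""
    z_alpha = 1.96  # critical value for a two-sided 5% test
    z_power = 0.84  # value giving 80% power
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    delta = p_treated - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

# A tiny expected effect (51% vs. 50% heads) needs tens of thousands of
# people per group, while a large one (65% vs. 50%) needs only a couple hundred.
print(n_per_group(0.50, 0.51))
print(n_per_group(0.50, 0.65))
```

The contrast is the whole point: shrinking the expected effect from 15 percentage points to 1 multiplies the required sample size by a factor of well over a hundred.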

Shorthand here: large studies = better, especially when the expected effects are small. 

A second good feature to look for is confirmatory results. We should be more confident if we see multiple experiments with the same result (as is commonly required in clinical drug trials), or one experiment with a suggestive finding followed by a confirmatory experiment. More independent confirmation increases confidence.

In practice, what this means is that the set of papers we should be paying attention to is a lot smaller than the set that gets media attention. 

A story about fish

I would like to end by talking about one of my very favorite academic papers ever.

To understand this story, you need to know something about neuroscience. One thing people who study the brain like to do is put people into brain scanners, using a technique called fMRI (functional magnetic resonance imaging). To somewhat oversimplify, these experiments show people different stimuli (pictures, words, etc.), and the researchers look at which parts of the brain light up. The idea is to figure out which parts of the brain do what.

An issue with this method is that the brain has many places that can light up. The scan divides the brain into small units called voxels, and there are about 130,000 of them. This has enormous potential for the problems I talked about above — in principle, you could be running 130,000 different tests, one per voxel. With that approach, you will get a lot of things that are significant at the 5% level.
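The scale of the problem follows directly from the numbers in the article: at a 5% threshold, testing every voxel independently would be expected to flag thousands of voxels by chance alone.

```python
voxels = 130_000  # approximate number of voxels in an fMRI scan
alpha = 0.05      # the standard 5% significance threshold

# Expected number of voxels that "light up" by pure chance
# if every voxel is tested independently and no real effect exists.
false_positives = voxels * alpha
print(false_positives)  # 6500.0
```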

In practice, researchers take approaches to try to mitigate this problem, but the people who wrote my favorite paper felt those approaches were definitely not sufficient to fix the problem. To show this, they did a study on a fish. A dead fish.  

Specifically, they studied a dead salmon. They put this dead salmon into an fMRI machine, and they gave it a task. As they write: “The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence, either socially inclusive or socially exclusive.” The study is careful. They provide a mirror to ensure the salmon can see the faces.

The authors then looked at which areas of the salmon’s brain lit up when the fish saw the photos, compared with a period of no photos. They found two areas that appeared significantly more active during the photos. That is: if this had been a standard study, they would have reported two areas of the brain that were responsive to emotionally charged photos.

I want to mention again that, at the time, the salmon was dead. (Brains can have electrical activity postmortem.) In addition, salmon are not well known for being responsive to the facial emotions of people, even while alive. 

As the authors say, either they have discovered something amazing about the cognition of dead fish, or the methods that people are using in these fMRI studies are statistically problematic. They favor the latter explanation, and argue that these issues need to be addressed before we can really learn about the brain.

Thanks to Jesse Shapiro, Isaiah Andrews, and Lukasz Kowalik for helpful comments. All mistakes remain my own! 
