Statistical Significance—and Why It Matters for Parenting

Fake studies, p-hacking, publication bias, and more

Emily Oster

3 min Read

One of my big goals with ParentData is to enhance data literacy. Yes, I want to be here to talk you down from the panic headlines you read. But I also want to give you the tools to talk yourself down, at least some of the time.

Today, we’re doing data literacy: I’m going to talk about statistical significance.

The phrase “statistically significant” is probably one that most people are familiar with, or have at least read before. Media coverage of academic results will often use it — as in, “In this study, the relationship was statistically significant.” In common parlance, I think people often read it as meaning “true”: the phrase somehow implies that the result is right, or real.

One reason that this conclusion might be wrong, of course, is that a lot of studies estimate correlations, not causal relationships. That’s an issue I discuss all the time. There is a second issue, though, which is what I’ll be going into here. Even in an experiment — where we have a randomized treatment, so we are more confident that the results we see are causal — understanding what the results mean requires a real understanding of statistical significance.  

That’s our job for today. No bottom line on this one — you’ve got to read it! But if you get to the end, there is some funny stuff about a dead fish, so hang in there.

What does “statistically significant” mean?

When we say an effect is “statistically significant at the 5% level,” what this means is that there is less than a 5% chance that we’d see an effect at least this large if the true effect were zero. (The “5% level” is a common cutoff, but results can also be significant at the 1% or 10% level.)

The natural follow-up question is: Why would any effect we see occur by chance? The answer lies in the fact that data is “noisy”: it comes with error. To see this more concretely, we can think about what would happen if we studied a setting where we know the true effect is zero.

My fake study 

Imagine the following (fake) study. Participants are randomly assigned to eat a package of either blue or green M&Ms, and then each participant flips a (fair) coin and records whether it comes up heads. Your analysis will compare the share of heads flipped after eating blue versus green M&Ms and report whether the difference is “statistically significant at the 5% level.”

To be clear: because this is a fair coin, there is no way that the color of the M&Ms eaten would influence how many heads you flip. So the “right” answer is that there is no relationship.

I used a computer program to simulate what would happen if we ran this study. I assumed we had 50 people in each group. First, I ran the study one time. I calculated the share of heads in the blue and green groups and took the difference between them. This is my “treatment effect” — the impact of blue M&M eating relative to green on the share of heads. The difference is very small: in this run, the blue group flipped 58% heads and the green group 54%, a slightly higher share for the blue group.

In my fake study, the blue M&M group had a slightly higher share of heads, but this difference is small. When I ran my statistical test, the difference was not statistically significant. Basically, I get nothing — I can’t rule out that the two groups are tossing heads at the same rate.
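If you want to see the mechanics, here is a minimal sketch of one run of this fake study, in Python with NumPy and SciPy (my choice of tools; the article doesn’t say which program was used, and nothing depends on it):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seed, so the run is reproducible
n = 50  # participants per M&M group

# Each participant flips one fair coin: 1 = heads, 0 = tails.
# M&M color cannot matter, so both groups share the same 50% heads rate.
blue = rng.binomial(1, 0.5, size=n)
green = rng.binomial(1, 0.5, size=n)

# The "treatment effect": the difference in the share of heads.
diff = blue.mean() - green.mean()

# A two-sample t-test on the 0/1 outcomes is a standard way to test
# whether the two heads rates differ; p < 0.05 means "significant."
t_stat, p_value = stats.ttest_ind(blue, green)
print(f"difference = {diff:+.2f}, p = {p_value:.3f}")
```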

However: we all know that when we flip a coin, or do anything else random, sometimes you get a string of heads — just by chance. Sometimes if you do 50 coin flips, 40 of them come up heads. It’s not likely! But it’s not impossible. And that fact means that, sometimes, even if there is no real relationship between M&M color and flipping heads, you can find one by accident.

After I did my study one time, with the results above, I did it 99 more times. In the end, I have 100 (computer-generated) versions of the same test. For each of them, I calculated the difference in the share of heads produced by the blue group versus the green group, and whether that difference was significant. The graph below shows the differences; the bars in yellow are differences I found that were significant. The dark-blue bar is the original result from my first experimental run.

Look at the first bar — all the way to the left on your screen, in yellow. In that version of my experiment, the blue M&M group was much less likely to flip heads than the green group, and that result showed up as statistically significant even though it was definitely, definitely just by chance.

In fact, out of the 100 times I ran this experiment, 5 showed what appears to be a statistically significant relationship at the standard 5% level. This isn’t a mistake; it’s not something I messed up. It is, in fact, the definition of statistical significance at the 5% level: if you have a setting where the true impact of some treatment is zero and you run the experiment 100 times, you expect to see a significant effect about 5 of those 100 times.
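Here is the same sketch extended to the 100-run version: repeat the simulation and count how often pure chance clears the 5% bar.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, runs, alpha = 50, 100, 0.05

significant = 0
for _ in range(runs):
    blue = rng.binomial(1, 0.5, size=n)
    green = rng.binomial(1, 0.5, size=n)
    _, p = stats.ttest_ind(blue, green)
    significant += p < alpha

# With a true effect of zero, about 5 of 100 runs should come up
# "significant" at the 5% level, by construction.
print(f"{significant} of {runs} runs significant at the {alpha:.0%} level")
```

The exact count bounces around from seed to seed, but over many repetitions it averages out to about 5 in 100.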

The real world 

How does this help us understand studies in the real world?

Now imagine someone says they have run a study to evaluate the impact of eating blue versus green M&Ms on performance on a test of multiplication speed. We know from the above example that even if there were no impact, if I ran this study 100 times we would expect to find a significant effect 5 of the 100 times.  

The researchers report an effect: green M&Ms make you multiply faster. There are two possible explanations for that finding. One is that there is actually an impact of M&M color on multiplication speed. The other is that this is a result that arises by chance — it’s one of the 5 in 100.  

If we were very sure that the researchers ran this study only one time, we could be pretty confident in the result — there is always the possibility that this is one of the “chance” significant findings, but the high level of significance gives us confidence. 

But: in the real world, we cannot be sure that we are seeing all the research that actually happened before publication. This leads to concerns about “publication bias” and “p-hacking,” which, in turn, make us skeptical. More on these below, but for a simple explanation, try this cartoon.

Publication bias and p-hacking

Publication bias and p-hacking are two shorthand, jargony ways to describe journal and researcher behaviors that make it more likely that the results we observe in published papers are occurring just by chance.

First: Academic journals are more likely to publish papers that find significant results. It’s not hard to see why this might be true. It’s not very interesting to publish a result saying that M&M color doesn’t impact multiplication speed — that’s kind of what we expected. But a result that says it does matter — that’s more surprising, and more likely to spark the interest of a journal editor.

This is what we call publication bias, and this pattern means that the results we see in print are actually a lot more likely to be statistical accidents. Often, many researchers are looking into the same question. It’s not just my research team that is interested in the M&M-multiplication relationship — imagine there are 99 other teams doing the same thing. Even if there is no relationship, on average 5 of those 100 teams will find something significant.

These 5 “successful” teams are more likely to get their results published. That’s what we all see in journals, but what we do not see is the 95 times it didn’t work. When we read these studies, we’re assuming, implicitly, that we are seeing all the studies that were run. But we’re not, and we’re more likely to see the significant-by-chance results.

The issue of publication bias would be problematic just on its own. But it’s even more problematic when it interacts with researchers’ incentives. Researchers need to publish, and (see above) it is easier to do so when results are significant. This can lead to what people sometimes call p-hacking (the “p” stands for probability).

When researchers run a study, there are often a lot of ways to analyze the data. You can analyze the impact on different subgroups of the population. You can analyze the impact under different circumstances. You can test many different treatments. The idea of the xkcd cartoon is that you could test the impact of every M&M color on some outcome.

The more of these tests you do, the more likely you are to get a significant effect by chance. If you do 100 tests, you expect 5 of them to be significant at the 5% level. And then, because of publication bias, you write up the results focusing only on the significant subgroups or significant M&M colors. Of course, those results are just accidents. But as consumers of research, we do not see all the other tests that happened in the background.
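The arithmetic behind this is easy to check. Assuming the tests are independent, the chance of getting at least one false positive at the 5% level grows quickly with the number of tests:

```python
# Chance of at least one false positive among k independent tests,
# each run at the 5% significance level: 1 - 0.95^k.
for k in [1, 5, 20, 100]:
    p_any = 1 - 0.95 ** k
    print(f"{k:>3} tests -> {p_any:.0%} chance of at least one 'significant' result")
```

Twenty tests already give you better-than-even odds (about 64%) of at least one accidental “finding.”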

For these two reasons, some of what we see published, even when it comes from a randomized experiment, is likely to be a result of statistical chance. There is a somewhat notorious paper that suggests that “most” research findings are false; I think this is overkill, but it’s a perspective.

Is everything wrong?

Is all lost? Is all research garbage?

No, definitely not. Much research is great, and randomized trials are often compelling and convincing. How do we know which new studies are worth paying attention to? There is no simple answer here, but there are a couple of features that matter.

One very, very important issue is the number of people in the study, especially as it relates to the expected size of the impact. Before researchers run a study, they generally have an idea of the size of the effect they might find, based on what they already know from other evidence. For example, if I am studying the impact of M&M color on math performance, I would expect that effect to be very small (if it is there at all). On the other hand, if I’m studying the impact on math performance of keeping someone awake for 48 hours, I might expect that effect to be large.

The larger the effect you expect, the smaller the sample of people you need to detect it statistically. This is called “statistical power,” and in well-designed experiments, researchers calculate this before they even start their study. They make an educated guess at the size of the impact, and they figure out how many people they need to detect that effect. 
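Power is also easy to approximate by simulation, in the same style as the fake study above. The numbers here (a 10-percentage-point true effect, and the candidate sample sizes) are made up purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def power(n, true_diff, sims=2000, alpha=0.05):
    """Share of simulated studies that detect a real difference of
    `true_diff` in heads rates, with n participants per group."""
    hits = 0
    for _ in range(sims):
        a = rng.binomial(1, 0.5, size=n)
        b = rng.binomial(1, 0.5 + true_diff, size=n)
        _, p = stats.ttest_ind(a, b)
        hits += p < alpha
    return hits / sims

# A 10-point true effect is badly underpowered at 50 people per group;
# detecting it reliably takes hundreds of participants per group.
for n in [50, 200, 800]:
    print(f"n = {n:>3} per group -> power ~ {power(n, 0.10):.0%}")
```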

When this doesn’t happen — when researchers do a small experiment on something where we expect the impact to be very small — we should not expect to learn anything from the data. The study is just not powered to detect the likely effects. If we do see an impact, it’s extremely likely that it is a statistical accident.

Shorthand here: large studies = better, especially when the expected effects are small. 

A second good feature to look for is confirmatory results. It should make us more confident if we see multiple experiments with the same result (commonly required in clinical drug trials), or if we see one experiment with a suggestive finding and then a confirmatory experiment after that. More independent confirmation increases confidence.
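One way to see why replication helps so much, under the simplifying assumption that the studies are independent: noise clears the 5% bar once with probability 0.05, but twice in a row with probability only 0.05 × 0.05 = 0.0025, or about 1 in 400.

```python
# Under independence, the chance that a zero-effect result comes up
# "significant" in two separate studies in a row:
p_once = 0.05
print(f"once: {p_once:.1%}; twice in a row: {p_once ** 2:.2%}")
```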

In practice, what this means is that the set of papers we should be paying attention to is a lot smaller than the set that gets media attention. 

A story about fish

I would like to end by talking about one of my very favorite academic papers ever.

To understand this story, you need to understand something about neuroscience. One thing people who study the brain like to do is put people into brain scanners, using a technique called fMRI. To somewhat oversimplify, these experiments show people different stimuli (pictures, words, etc.), and the researchers look at what parts of the brain light up. The idea is to figure out which parts of the brain do what.

An issue with this method is that the brain has many places that can light up. The scan divides the brain into small units called voxels, and there are about 130,000 of them. This fact has enormous potential for the problems I talked about above — in principle, you could be running 130,000 different tests, one for each voxel. At the 5% level, that would flag roughly 130,000 × 0.05 = 6,500 voxels as significant by chance alone.

In practice, researchers take approaches to try to mitigate this problem, but the people who wrote my favorite paper felt those approaches were definitely not sufficient to fix the problem. To show this, they did a study on a fish. A dead fish.  

Specifically, they studied a dead salmon. They put this dead salmon into an fMRI machine, and they gave it a task. As they write: “The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence, either socially inclusive or socially exclusive.” The study was careful: they provided a mirror to ensure the salmon could see the faces.

The authors then looked at what areas of the salmon’s brain lit up when the fish saw the photos, compared with a period of no photos. They found two areas of the brain that appeared significantly more active during the photos. That is: if they had been doing a standard type of study, they would have reported that two areas of the brain were responsive to emotionally charged photos.

I want to mention again that, at the time, the salmon was dead. (Brains can have electrical activity postmortem.) In addition, salmon are not well known for being responsive to the facial emotions of people, even while alive. 

As the authors say, either they have discovered something amazing about the cognition of dead fish, or the methods that people are using in these fMRI studies are statistically problematic. They favor the latter explanation, and argue that these issues need to be addressed before we can really learn about the brain.

Thanks to Jesse Shapiro, Isaiah Andrews, and Lukasz Kowalik for helpful comments. All mistakes remain my own! 
