Statistical Significance—and Why It Matters for Parenting

Fake studies, p-hacking, publication bias, and more

Emily Oster

3 min read

One of my big goals with ParentData is to enhance data literacy. Yes, I want to be here to talk you down from the panic headlines you read. But I also want to give you the tools to talk yourself down, at least some of the time.

Today, we’re doing data literacy. I’m going to talk about statistical significance.

The phrase “statistically significant” is probably one most people are familiar with, or have at least read before. Media coverage of academic results often uses it — as in, “In this study, the relationship was statistically significant.” In common parlance, I think people often read it as meaning “true”: the phrase somehow implies that the result is right, or real.

One reason that this conclusion might be wrong, of course, is that a lot of studies estimate correlations, not causal relationships. That’s an issue I discuss all the time. There is a second issue, though, which is what I’ll be going into here. Even in an experiment — where we have a randomized treatment, so we are more confident that the results we see are causal — understanding what the results mean requires a real understanding of statistical significance.  

That’s our job for today. No bottom line on this one — you’ve got to read it! But if you get to the end, there is some funny stuff about a dead fish, so hang in there.

What does “statistically significant” mean?

When we say an effect is “statistically significant at the 5% level,” what this means is that there is less than a 5% chance that we’d see an effect at least this large if the true effect were zero. (The “5% level” is a common cutoff, but results can also be significant at the 1% or 10% level.)

The natural follow-up question is: Why would any effect we see occur by chance? The answer lies in the fact that data is “noisy”: it comes with error. To see this a bit more, we can think about what would happen if we studied a setting where we know our true effect is zero. 

My fake study 

Imagine the following (fake) study. Participants are randomly assigned to eat a package of either blue or green M&Ms, then flip a (fair) coin, and we record whether it comes up heads. The analysis compares the share of heads flipped after eating blue versus green M&Ms and reports whether the difference is “statistically significant at the 5% level.”

To be clear: because this is a fair coin, there is no way that the color of the M&Ms eaten would influence how many heads you flip. So the “right” answer is that there is no relationship.

I used a computer program to simulate what would happen if we ran this study, with 50 people in each group. First, I ran the study one time. I calculated the share of heads in the blue and green groups and took the difference between them. This is my “treatment effect”: the impact of eating blue M&Ms, relative to green, on the share of heads. In this run, the blue group flipped 58% heads and the green group flipped 54%, a slightly higher share for blue. The difference is small.

In my fake study, the blue M&M group had a slightly higher chance of heads, but this difference is small. When I ran my statistical test, the difference was not statistically significant. Basically, I get nothing — I can’t rule out that the two groups are tossing heads at the same rate. 
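If you want to see the mechanics, here is a minimal sketch of a single run of this fake study in Python. The setup (50 people per group, a fair coin, a 5% cutoff) follows the description above; the choice of a two-proportion z-test is my assumption about how one might test the difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 50  # people per M&M group

# Each person flips one fair coin: 1 = heads, 0 = tails
blue = rng.binomial(1, 0.5, n)
green = rng.binomial(1, 0.5, n)

# The "treatment effect": the difference in the share of heads
diff = blue.mean() - green.mean()

# Two-proportion z-test under the null that both rates are equal
pooled = np.concatenate([blue, green]).mean()
se = np.sqrt(pooled * (1 - pooled) * (2 / n))
p_value = 2 * stats.norm.sf(abs(diff) / se)

print(f"difference in share of heads: {diff:+.2f}, p = {p_value:.2f}")
```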

However: we all know that when we flip a coin, or do anything else random, sometimes you get a string of heads — just by chance. Sometimes if you do 50 coin flips, 40 of them come up heads. It’s not likely! But it’s not impossible. And that fact means that, sometimes, even if there is no real relationship between M&M color and flipping heads, you can find one by accident.

After I did my study one time, with the results above, I did it 99 more times. In the end, I had 100 (computer-generated) versions of the same test. For each of them, I calculated the difference in the share of heads between the blue group and the green group, and whether that difference was significant. The graph below shows the differences; the bars in yellow are differences that came out significant. The dark-blue bar is the original result from my first experimental run.

Look at the first bar, all the way to the left, in yellow. In that version of my experiment, the blue M&M group was much less likely to flip heads than the green group, and that result showed up as statistically significant even though it was definitely, definitely just by chance.

In fact, out of the 100 times I ran this experiment, there were 5 runs in which the data showed an apparently statistically significant relationship at the standard 5% level. This isn’t a mistake; it’s not something I messed up. It is, in fact, the definition of statistical significance at the 5% level: if the true impact of some treatment is zero and you run the experiment 100 times, you should expect to see a significant effect in about 5 of those 100 runs.
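Here is the same sketch run 100 times (same simulation assumptions as above). With a true effect of zero, roughly 5 of the 100 runs should clear the 5% bar; the exact count varies with the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n, runs = 50, 100

def p_value_of_one_run():
    """Simulate one fake M&M study and return its p-value."""
    blue = rng.binomial(1, 0.5, n)
    green = rng.binomial(1, 0.5, n)
    diff = blue.mean() - green.mean()
    pooled = np.concatenate([blue, green]).mean()
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    return 2 * stats.norm.sf(abs(diff) / se)

significant = sum(p_value_of_one_run() < 0.05 for _ in range(runs))
print(f"{significant} of {runs} null experiments came out 'significant'")
```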

The real world 

How does this help us understand studies in the real world?

Now imagine someone says they have run a study to evaluate the impact of eating blue versus green M&Ms on performance on a test of multiplication speed. We know from the above example that even if there were no impact, if I ran this study 100 times we would expect to find a significant effect 5 of the 100 times.  

The researchers report an effect: green M&Ms make you multiply faster. There are two possible explanations for that finding. One is that there is actually an impact of M&M color on multiplication speed. The other is that this is a result that arises by chance — it’s one of the 5 in 100.  

If we were very sure that the researchers ran this study only one time, we could be pretty confident in the result — there is always the possibility that it is one of the “chance” significant findings, but if only one test was run, that chance is just 5%.

But: in the real world, we cannot be sure that we are seeing all the research that actually happened before publication. This leads to concerns about “publication bias” and “p-hacking,” which, in turn, should make us skeptical. More on these below, but for a simple illustration, try this xkcd cartoon.

Publication bias and p-hacking

Publication bias and p-hacking are two shorthand, jargony ways to describe journal and researcher behaviors that make it more likely that the results we observe in published papers are occurring just by chance.

First: Academic journals are more likely to publish papers that find significant results. It’s not hard to see why this might be true. It’s not very interesting to publish a result saying that M&M color doesn’t impact multiplication speed — that’s kind of what we expected. But a result that says it does matter — that’s more surprising, and more likely to spark the interest of a journal editor.

This is what we call publication bias, and this pattern means that the results we see in print are a lot more likely than the full body of research to be statistical accidents. Often, many research teams are looking into the same question. It’s not just my team that is interested in the M&M-multiplication relationship — imagine there are 99 other teams doing the same thing. Even if there is no relationship, on average 5 of those 100 teams will find something significant.

These 5 “successful” teams are more likely to get their results published. That’s what we all see in journals, but what we do not see is the 95 times it didn’t work. When we read these studies, we’re assuming, implicitly, that we are seeing all the studies that were run. But we’re not, and we’re more likely to see the significant-by-chance results.
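To make that filtering concrete, here is a sketch (same simulation assumptions as the earlier snippets) in which 100 teams run the identical null experiment and only the significant results reach a journal. Every “published” effect looks sizable, even though the true effect is exactly zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n, teams = 50, 100

published = []
for _ in range(teams):
    blue = rng.binomial(1, 0.5, n)
    green = rng.binomial(1, 0.5, n)
    diff = blue.mean() - green.mean()
    pooled = np.concatenate([blue, green]).mean()
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    p = 2 * stats.norm.sf(abs(diff) / se)
    if p < 0.05:  # journals mostly see the "successes"
        published.append(diff)

print(f"{len(published)} of {teams} teams 'publish'")
print("published effect sizes:", [f"{d:+.2f}" for d in published])
```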

The issue of publication bias would be problematic just on its own. But it’s even more problematic when it interacts with researchers’ incentives. Researchers need to publish, and (see above) it is easier to do so when results are significant. This can lead to what people sometimes call p-hacking (the “p” stands for probability).

When researchers run a study, there are often many ways to analyze the data. You can analyze the impact on different subgroups of the population. You can analyze the impact under different circumstances. You can test many different treatments. The idea of the xkcd cartoon is that you could test the impact of each of the different M&M colors on some outcome.

The more of these tests you do, the more likely you are to get a significant effect by chance. If you do 100 tests, you expect 5 of them to be significant at the 5% level. And then, because of publication bias, you write up the results focusing only on the significant subgroups or significant M&M colors. Of course, those results are just accidents. But as consumers of research, we do not see all the other tests that happened in the background.
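A quick sketch of that multiple-testing problem, in the spirit of the cartoon. The 20 colors are illustrative (they are not from the study above); each color is tested against the same null outcome.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n, n_colors = 50, 20

false_positives = 0
for color in range(n_colors):
    treated = rng.binomial(1, 0.5, n)  # ate this color
    control = rng.binomial(1, 0.5, n)  # did not
    diff = treated.mean() - control.mean()
    pooled = np.concatenate([treated, control]).mean()
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    p = 2 * stats.norm.sf(abs(diff) / se)
    false_positives += p < 0.05

# With 20 independent tests at the 5% level, the chance of at least
# one false positive is 1 - 0.95**20, about 64%.
print(f"{false_positives} of {n_colors} colors came out 'significant'")
print(f"chance of at least one false positive: {1 - 0.95**n_colors:.0%}")
```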

For these two reasons, some of what we see published, even when it comes from a randomized experiment, is likely to be the result of statistical chance. There is a somewhat notorious paper arguing that “most” published research findings are false; I think that overstates the case, but it’s a perspective.

Is everything wrong?

Is all lost? Is all research garbage?

No, definitely not. Much research is great, and randomized trials are often compelling and convincing. How do we know which new studies are worth paying attention to? There is no simple answer here, but there are a couple of features that matter.

One very, very important feature is the number of people in the study, especially relative to the expected size of the impact. Before researchers run a study, they generally have an idea of the size of the effect they might find, based on what is already known. For example, if I am studying the impact of M&M color on math performance, I would expect that effect to be very small (if it is there at all). On the other hand, if I’m studying the impact on math performance of keeping someone awake for 48 hours, I might expect that effect to be large.

The larger the effect you expect, the smaller the sample of people you need to detect it statistically. This is called “statistical power,” and in well-designed experiments, researchers calculate this before they even start their study. They make an educated guess at the size of the impact, and they figure out how many people they need to detect that effect. 
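For illustration, here is a minimal power calculation using the statsmodels library. The effect sizes (in standard-deviation units) are made-up guesses standing in for the M&M and sleep-deprivation examples, and the 80% power target is a common convention, not something from this article.

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# How many people per group are needed to detect each effect
# with 80% power at the 5% significance level?
for label, effect_size in [("tiny effect (M&M color)", 0.05),
                           ("large effect (48 hours awake)", 0.80)]:
    n = power_calc.solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
    print(f"{label}: about {n:,.0f} people per group")
```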

When this doesn’t happen — when researchers do a small experiment on something where we expect the impact to be very small — we should not expect to learn anything from the data. The study is just not powered to detect the likely effects. If we do see an impact, it’s extremely likely that it is a statistical accident.

Shorthand here: large studies = better, especially when the expected effects are small. 

A second good feature to look for is confirmatory results. It should make us more confident if we see multiple experiments with the same result (as is common in clinical drug trials), or one experiment with a suggestive finding followed by a confirmatory experiment. More independent confirmation increases confidence, as the arithmetic below shows.
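The reason replication helps is simple: independent chance findings multiply. A pure-chance result clears the 5% bar once with probability 0.05; clearing it twice, in two independent experiments, happens only 0.25% of the time.

```python
alpha = 0.05
print(f"chance a null result is 'significant' once:       {alpha:.1%}")
print(f"chance it also replicates in a second experiment: {alpha**2:.2%}")
```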

In practice, what this means is that the set of papers we should be paying attention to is a lot smaller than the set that gets media attention. 

A story about fish

I would like to end by talking about one of my very favorite academic papers ever.

To understand this story, you need to understand something about neuroscience. One thing people who study the brain like to do is put people into brain scanners and run fMRI (functional MRI) studies. To somewhat oversimplify, these experiments show people different stimuli (pictures, words, etc.), and the researchers look at what parts of the brain light up. The idea is to figure out which parts of the brain do what.

An issue with this method is that the brain has many places that can light up. These small volumes of brain tissue are called voxels, and a scan can contain about 130,000 of them. This has enormous potential for the problems I talked about above: in principle, you could be running 130,000 separate tests, one per voxel. At the 5% level, you would expect roughly 6,500 voxels to come up significant purely by chance.

In practice, researchers use corrections to try to mitigate this problem, but the people who wrote my favorite paper felt those corrections were definitely not sufficient. To show this, they did a study on a fish. A dead fish.

Specifically, they studied a dead salmon. They put this dead salmon into an fMRI machine, and they gave it a task. As they write: “The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence, either socially inclusive or socially exclusive.” The study is careful. They provide a mirror to ensure the salmon can see the faces.

The authors then looked at what areas of the salmon’s brain lit up when the fish saw the photos, compared with a period of no photos. They found two areas of the brain that appeared significantly more electronically active during the photos. That is: if they had been running a standard study, they would have reported two areas of the brain that were responsive to emotionally charged photos.

I want to mention again that, at the time, the salmon was dead. (Brains can have electrical activity postmortem.) In addition, salmon are not well known for being responsive to the facial emotions of people, even while alive. 

As the authors say, either they have discovered something amazing about the cognition of dead fish, or the methods that people are using in these fMRI studies are statistically problematic. They favor the latter explanation, and argue that these issues need to be addressed before we can really learn about the brain.

Thanks to Jesse Shapiro, Isaiah Andrews, and Lukasz Kowalik for helpful comments. All mistakes remain my own! 
