One of my big goals with ParentData is to enhance data literacy. Yes, I want to be here to talk you down from the panic headlines you read. But I also want to give you the tools to talk yourself down, at least some of the time.
Today, we’re doing data literacy. I’m going to talk about statistical significance.
The phrase “statistically significant” is probably familiar to most people. Media coverage of academic results often uses it, as in, “In this study, the relationship was statistically significant.” In common parlance, I think people often read it as meaning “true”: the phrase somehow implies that the result is right, or real.
One reason that this conclusion might be wrong, of course, is that a lot of studies estimate correlations, not causal relationships. That’s an issue I discuss all the time. There is a second issue, though, which is what I’ll be going into here. Even in an experiment — where we have a randomized treatment, so we are more confident that the results we see are causal — understanding what the results mean requires a real understanding of statistical significance.
That’s our job for today. No bottom line on this one — you’ve got to read it! But if you get to the end, there is some funny stuff about a dead fish, so hang in there.
What does “statistically significant” mean?
When we say an effect is “statistically significant at the 5% level,” what this means is that there is less than a 5% chance that we’d see an effect at least this large if the true effect were zero. (The “5% level” is a common cutoff, but results can also be significant at the 1% or 10% level.)
The natural follow-up question is: Why would any effect we see occur by chance? The answer lies in the fact that data is “noisy”: it comes with error. To see this more concretely, we can think about what would happen if we studied a setting where we know the true effect is zero.
My fake study
Imagine the following (fake) study. Participants are randomly assigned to eat a package of either blue or green M&Ms, and then each flips a (fair) coin while we record whether it comes up heads. The analysis compares the share of heads flipped after eating blue versus green M&Ms and reports whether the difference is “statistically significant at the 5% level.”
To be clear: because this is a fair coin, there is no way that the color of the M&Ms eaten would influence how many heads you flip. So the “right” answer is that there is no relationship.
I used a computer program to simulate what would happen if we ran this study, with 50 people in each group. First, I ran the study one time. I calculated the share of heads in the blue and green groups and took the difference between them. This difference is my “treatment effect”: the impact of eating blue M&Ms, relative to green, on the share of heads. In this run the blue group flipped 58% heads and the green group 54%, a slightly higher share for blue.
This difference is small, and when I ran my statistical test, it was not statistically significant. Basically, I get nothing: I can’t rule out that the two groups are tossing heads at the same rate.
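If you want to see the mechanics, here is a minimal sketch of this kind of simulation in Python. This is my own illustration, not the program used for the article; I've assumed a standard two-proportion z-test as the significance test, and the function name `run_study` is mine.

```python
import random
from statistics import NormalDist

def run_study(n=50, seed=None):
    """Simulate one run of the fake study: n fair-coin flips per M&M group,
    then a two-proportion z-test on the difference in heads rates."""
    rng = random.Random(seed)
    blue = sum(rng.random() < 0.5 for _ in range(n))   # heads in the blue group
    green = sum(rng.random() < 0.5 for _ in range(n))  # heads in the green group
    diff = blue / n - green / n                        # the "treatment effect"
    pooled = (blue + green) / (2 * n)                  # pooled heads rate
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5        # standard error of diff
    if se == 0:                                        # all heads or all tails
        return diff, 1.0
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value
    return diff, p_value

diff, p = run_study(seed=1)
print(f"difference = {diff:+.2f}, p = {p:.3f}, significant at 5%? {p < 0.05}")
```

Most single runs look like the one described above: a small difference and a p-value too large to call significant.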
However: we all know that when we flip a coin, or do anything else random, sometimes you get a string of heads — just by chance. Sometimes if you do 50 coin flips, 40 of them come up heads. It’s not likely! But it’s not impossible. And that fact means that, sometimes, even if there is no real relationship between M&M color and flipping heads, you can find one by accident.
After I did my study one time, with the results above, I did it 99 more times. In the end, I have 100 (computer-generated) versions of the same test. For each of them, I calculated the difference in the share of heads produced by the blue group versus the green group, and whether that difference was significant. The graph below shows the differences; the bars in yellow are differences I found that were significant. The dark-blue bar is the original result from my first experimental run.
Look at that first bar, the yellow one all the way to the left. In that version of my experiment, the blue M&M group was much less likely to flip heads than the green group, and the result showed up as statistically significant even though it was definitely, definitely just by chance.
In fact, out of the 100 times I ran this experiment, there were 5 runs in which there appeared to be a statistically significant relationship at the standard 5% level. This isn’t a mistake; it’s not something I messed up. It is, in fact, the definition of statistical significance at the 5% level: if you have a setting where the true impact of some treatment is zero, and you run the study 100 times, you expect to see a significant effect in about 5 of the 100 runs.
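Repeating the simulation many times shows this false-positive rate directly. Here is a sketch (again my own, assuming a two-proportion z-test); with a true effect of zero, roughly 5% of runs come up "significant":

```python
import random
from statistics import NormalDist

def one_run_significant(rng, n=50, alpha=0.05):
    """One simulated study: does the blue-vs-green difference test significant?"""
    blue = sum(rng.random() < 0.5 for _ in range(n))
    green = sum(rng.random() < 0.5 for _ in range(n))
    pooled = (blue + green) / (2 * n)
    se = (pooled * (1 - pooled) * 2 / n) ** 0.5
    if se == 0:
        return False
    z = ((blue - green) / n) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha

rng = random.Random(2024)
runs = 10_000
hits = sum(one_run_significant(rng) for _ in range(runs))
print(f"{hits / runs:.1%} of {runs} runs were significant at the 5% level")
```

Using 10,000 runs instead of 100 just makes the share land closer to 5%; with only 100 runs, getting 3 or 8 false positives instead of exactly 5 would be unremarkable.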
The real world
How does this help us understand studies in the real world?
Now imagine someone says they have run a study to evaluate the impact of eating blue versus green M&Ms on performance on a test of multiplication speed. We know from the above example that even if there were no impact, if I ran this study 100 times we would expect to find a significant effect 5 of the 100 times.
The researchers report an effect: green M&Ms make you multiply faster. There are two possible explanations for that finding. One is that there is actually an impact of M&M color on multiplication speed. The other is that this is a result that arises by chance — it’s one of the 5 in 100.
If we were very sure that the researchers ran this study only one time, we could be pretty confident in the result. There is always the possibility that this is one of the “chance” significant findings, but for a single study that possibility is only 5 in 100.
But: in the real world, we cannot be sure that we are seeing all the research that actually happened before publication. This leads to concerns about “publication bias” and “p-hacking,” which, in turn, make us skeptical. More on these below, but for a simple illustration, try this cartoon.
Publication bias and p-hacking
Publication bias and p-hacking are two shorthand, jargony ways to describe journal and researcher behaviors that make it more likely that the results we observe in published papers are occurring just by chance.
First: Academic journals are more likely to publish papers that find significant results. It’s not hard to see why this might be true. It’s not very interesting to publish a result saying that M&M color doesn’t impact multiplication speed — that’s kind of what we expected. But a result that says it does matter — that’s more surprising, and more likely to spark the interest of a journal editor.
This is what we call publication bias, and it turns out that this pattern means that the results we see in print are actually a lot more likely to be statistical accidents. Often, many researchers are looking into the same question. It’s not just my research team who is interested in the M&M-multiplication relationship — imagine there are 99 other teams doing the same thing. Even if there is no relationship, on average 5 of those teams will find something significant.
These 5 “successful” teams are more likely to get their results published. That’s what we all see in journals, but what we do not see is the 95 times it didn’t work. When we read these studies, we’re assuming, implicitly, that we are seeing all the studies that were run. But we’re not, and we’re more likely to see the significant-by-chance results.
The issue of publication bias would be problematic just on its own. But it’s even more problematic when it interacts with researchers’ incentives. Researchers need to publish, and (see above) it is easier to do so when results are significant. This can lead to what people sometimes call p-hacking (the “p” stands for probability).
When researchers run a study, there are often a lot of ways to analyze the data. You can analyze the impact on different subgroups of the population. You can analyze the impact of different circumstances. You can test many different treatments. The idea of the xkcd cartoon is that you could test the impact of all the different M&M colors on some outcome.
The more of these tests you do, the more likely you are to get a significant effect by chance. If you do 100 tests, you expect about 5 of them to be significant at the 5% level. And then, because of publication bias, you write up the results focusing only on the significant subgroups or significant M&M colors. Of course, those are just accidents. But as consumers of research, we do not see all the other things that happened in the background.
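The arithmetic behind "more tests, more accidental findings" is simple: if each test has a 5% false-positive chance, the chance that at least one of k independent tests comes up significant is 1 − 0.95^k. A few lines make the point:

```python
# Chance of at least one false positive across k independent tests,
# each run at the 5% significance level, when every true effect is zero.
alpha = 0.05
for k in (1, 5, 20, 100):
    p_any = 1 - (1 - alpha) ** k
    print(f"{k:>3} tests -> {p_any:.0%} chance of at least one significant result")
```

That works out to about 5%, 23%, 64%, and 99% for 1, 5, 20, and 100 tests. Test all the M&M colors and you are more likely than not to "find" something.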
For these two reasons, some of what we see published, even if it is from a randomized experiment, is likely to be a result of statistical chance. There is a somewhat notorious paper arguing that “most” research findings are false; I think that overstates the case, but it’s a perspective.
Is everything wrong?
Is all lost? Is all research garbage?
No, definitely not. Much research is great, and randomized trials are often compelling and convincing. How do we know which new studies are worth paying attention to? There is no simple answer here, but there are a couple of features that matter.
One very, very important issue is the number of people in the study, especially as it relates to the expected size of the impact. Before researchers run a study, they generally have an idea of the size of the effect they might find, based on prior evidence. For example, if I am studying the impact of M&M color on math performance, I would expect that effect to be very small (if it is there at all). On the other hand, if I’m studying the impact on math performance of keeping someone awake for 48 hours, I might expect that effect to be large.
The larger the effect you expect, the smaller the sample of people you need to detect it statistically. This is called “statistical power,” and in well-designed experiments, researchers calculate this before they even start their study. They make an educated guess at the size of the impact, and they figure out how many people they need to detect that effect.
When this doesn’t happen — when researchers do a small experiment on something where we expect the impact to be very small — we should not expect to learn anything from the data. The study is just not powered to detect the likely effects. If we do see an impact, it’s extremely likely that it is a statistical accident.
Shorthand here: large studies = better, especially when the expected effects are small.
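To give a feel for what a power calculation looks like, here is a standard back-of-the-envelope sample-size formula. This is a textbook normal-approximation sketch, not anything from a specific study: it asks how many people per group you need to detect a standardized effect size (difference in means divided by the standard deviation) with 80% power at the 5% level.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per arm to detect a standardized
    effect size with a two-sided test (textbook normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, ~1.96
    z_beta = NormalDist().inv_cdf(power)           # power term, ~0.84
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

for d in (0.8, 0.5, 0.2):  # conventionally "large", "medium", "small" effects
    print(f"effect size {d}: about {n_per_group(d)} people per group")
```

The key pattern: halving the expected effect roughly quadruples the required sample. That is why a 50-person study of a tiny expected effect tells us essentially nothing.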
A second good feature to look for is confirmatory results. It should make us more confident if we see multiple experiments with the same results (commonly done in clinical drug trials). Or if we see one experiment that has a suggestive fact, and then a confirmatory experiment after that. More independent confirmation increases confidence.
In practice, what this means is that the set of papers we should be paying attention to is a lot smaller than the set that gets media attention.
A story about fish
I would like to end by talking about one of my very favorite academic papers ever.
To understand this story, you need to understand something about neuroscience. One thing people who study the brain like to do is put people into brain-scanning machines, using a technique called fMRI. To somewhat oversimplify, these experiments show people different stimuli (pictures, words, etc.), and the researchers look at which parts of the brain light up. The idea is to figure out which parts of the brain do what.
An issue with this method is that there are a lot of places in the brain that can light up. The scan divides the brain into small units of tissue called voxels, and there are about 130,000 of them. This fact has enormous potential for the problems I talked about above: in principle, you could be running 130,000 different tests, one for each voxel. With this approach you will get a lot of things that are significant at the 5% level purely by chance.
In practice, researchers take approaches to try to mitigate this problem, but the people who wrote my favorite paper felt those approaches were definitely not sufficient to fix the problem. To show this, they did a study on a fish. A dead fish.
Specifically, they studied a dead salmon. They put this dead salmon into an fMRI machine and gave it a task. As they write: “The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence, either socially inclusive or socially exclusive.” The study is careful: the researchers provided a mirror to ensure the salmon could see the photos.
The authors then looked at which areas of the salmon’s brain lit up when the fish saw the photos, compared with a period of no photos. They found two areas of the brain that appeared significantly more active during the photos. That is: if they had been running a standard type of study, they would have reported that two areas of the brain were responsive to emotionally charged photos.
I want to mention again that, at the time, the salmon was dead. (Brains can have electrical activity postmortem.) In addition, salmon are not well known for being responsive to the facial emotions of people, even while alive.
As the authors say, either they have discovered something amazing about the cognition of dead fish, or the methods that people are using in these fMRI studies are statistically problematic. They favor the latter explanation, and argue that these issues need to be addressed before we can really learn about the brain.
Thanks to Jesse Shapiro, Isaiah Andrews, and Lukasz Kowalik for helpful comments. All mistakes remain my own!