Statistical Significance—and Why It Matters for Parenting

Fake studies, p-hacking, publication bias, and more

Emily Oster

3 min read

One of my big goals with ParentData is to enhance data literacy. Yes, I want to be here to talk you down from the panic headlines you read. But I also want to give you the tools to talk yourself down, at least some of the time.

Today, we’re doing data literacy. I’m going to talk about statistical significance.

The phrase “statistically significant” is probably one most people are familiar with, or have at least read before. Media coverage of academic results often uses it — as in, “In this study, the relationship was statistically significant.” In common parlance, I think people often read it as meaning “true”: the phrase somehow implies that the result is right, or real.

One reason that this conclusion might be wrong, of course, is that a lot of studies estimate correlations, not causal relationships. That’s an issue I discuss all the time. There is a second issue, though, which is what I’ll be going into here. Even in an experiment — where we have a randomized treatment, so we are more confident that the results we see are causal — understanding what the results mean requires a real understanding of statistical significance.  

That’s our job for today. No bottom line on this one — you’ve got to read it! But if you get to the end, there is some funny stuff about a dead fish, so hang in there.

What does “statistically significant” mean?

When we say an effect is “statistically significant at the 5% level,” what this means is that there is less than a 5% chance that we’d see an effect at least this large if the true effect were zero. (The “5% level” is a common cutoff, but results can also be significant at the 1% or 10% level.)

The natural follow-up question is: Why would any effect we see occur by chance? The answer lies in the fact that data is “noisy”: it comes with error. To see this a bit more, we can think about what would happen if we studied a setting where we know our true effect is zero. 

My fake study 

Imagine the following (fake) study. Participants are randomly assigned to eat a package of either blue or green M&Ms, then flip a (fair) coin, and we record whether it comes up heads. The analysis compares the share of heads flipped after eating blue versus green M&Ms and reports whether the difference is “statistically significant at the 5% level.”

To be clear: because this is a fair coin, there is no way that the color of the M&Ms eaten would influence how many heads you flip. So the “right” answer is that there is no relationship.

I used a computer program to simulate what would happen if we ran this study, with 50 people in each group. First, I ran the study one time. I calculated the share of heads in the blue and green groups and took the difference between them. This is my “treatment effect”: the impact of eating blue M&Ms, relative to green, on the share of heads. In this run, the blue group flipped 58% heads and the green group flipped 54%, a slightly higher share for blue. The difference is small.

In my fake study, the blue M&M group had a slightly higher chance of heads, but this difference is small. When I ran my statistical test, the difference was not statistically significant. Basically, I get nothing — I can’t rule out that the two groups are tossing heads at the same rate. 
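If you want to see the mechanics, here is a minimal sketch of a single run of this fake study in Python. The setup (50 people per group, a fair coin, a 5% cutoff) follows the description above; the choice of a two-proportion z-test is my assumption about how one might test the difference.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)
n = 50  # people per M&M group

# Each person flips one fair coin: 1 = heads, 0 = tails
blue = rng.binomial(1, 0.5, n)
green = rng.binomial(1, 0.5, n)

# The "treatment effect": the difference in the share of heads
diff = blue.mean() - green.mean()

# Two-proportion z-test under the null that both rates are equal
pooled = np.concatenate([blue, green]).mean()
se = np.sqrt(pooled * (1 - pooled) * (2 / n))
p_value = 2 * stats.norm.sf(abs(diff) / se)

print(f"difference in share of heads: {diff:+.2f}, p = {p_value:.2f}")
```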

However: we all know that when we flip a coin, or do anything else random, sometimes you get a string of heads — just by chance. Sometimes if you do 50 coin flips, 40 of them come up heads. It’s not likely! But it’s not impossible. And that fact means that, sometimes, even if there is no real relationship between M&M color and flipping heads, you can find one by accident.

After I did my study one time, with the results above, I did it 99 more times. In the end, I had 100 (computer-generated) versions of the same test. For each of them, I calculated the difference in the share of heads between the blue group and the green group, and whether that difference was significant. The graph below shows the differences; the bars in yellow are differences that came out significant. The dark-blue bar is the original result from my first experimental run.

Look at the first bar, all the way to the left, in yellow. In that version of my experiment, the blue M&M group was much less likely to flip heads than the green group, and that result showed up as statistically significant even though it was definitely, definitely just by chance.

In fact, out of the 100 times I ran this experiment, there were 5 runs in which the data showed an apparently statistically significant relationship at the standard 5% level. This isn’t a mistake; it’s not something I messed up. It is, in fact, the definition of statistical significance at the 5% level: if the true impact of some treatment is zero and you run the experiment 100 times, you should expect to see a significant effect in about 5 of those 100 runs.
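Here is the same sketch run 100 times (same simulation assumptions as above). With a true effect of zero, roughly 5 of the 100 runs should clear the 5% bar; the exact count varies with the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n, runs = 50, 100

def p_value_of_one_run():
    """Simulate one fake M&M study and return its p-value."""
    blue = rng.binomial(1, 0.5, n)
    green = rng.binomial(1, 0.5, n)
    diff = blue.mean() - green.mean()
    pooled = np.concatenate([blue, green]).mean()
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    return 2 * stats.norm.sf(abs(diff) / se)

significant = sum(p_value_of_one_run() < 0.05 for _ in range(runs))
print(f"{significant} of {runs} null experiments came out 'significant'")
```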

The real world 

How does this help us understand studies in the real world?

Now imagine someone says they have run a study to evaluate the impact of eating blue versus green M&Ms on performance on a test of multiplication speed. We know from the above example that even if there were no impact, if I ran this study 100 times we would expect to find a significant effect 5 of the 100 times.  

The researchers report an effect: green M&Ms make you multiply faster. There are two possible explanations for that finding. One is that there is actually an impact of M&M color on multiplication speed. The other is that this is a result that arises by chance — it’s one of the 5 in 100.  

If we were very sure that the researchers ran this study only one time, we could be pretty confident in the result — there is always the possibility that it is one of the “chance” significant findings, but if only one test was run, that chance is just 5%.

But: in the real world, we cannot be sure that we are seeing all the research that actually happened before publication. This leads to concerns about “publication bias” and “p-hacking,” which, in turn, should make us skeptical. More on these below, but for a simple illustration, try this xkcd cartoon.

Publication bias and p-hacking

Publication bias and p-hacking are two shorthand, jargony ways to describe journal and researcher behaviors that make it more likely that the results we observe in published papers are occurring just by chance.

First: Academic journals are more likely to publish papers that find significant results. It’s not hard to see why this might be true. It’s not very interesting to publish a result saying that M&M color doesn’t impact multiplication speed — that’s kind of what we expected. But a result that says it does matter — that’s more surprising, and more likely to spark the interest of a journal editor.

This is what we call publication bias, and this pattern means that the results we see in print are a lot more likely than the full body of research to be statistical accidents. Often, many research teams are looking into the same question. It’s not just my team that is interested in the M&M-multiplication relationship — imagine there are 99 other teams doing the same thing. Even if there is no relationship, on average 5 of those 100 teams will find something significant.

These 5 “successful” teams are more likely to get their results published. That’s what we all see in journals, but what we do not see is the 95 times it didn’t work. When we read these studies, we’re assuming, implicitly, that we are seeing all the studies that were run. But we’re not, and we’re more likely to see the significant-by-chance results.
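To make that filtering concrete, here is a sketch (same simulation assumptions as the earlier snippets) in which 100 teams run the identical null experiment and only the significant results reach a journal. Every “published” effect looks sizable, even though the true effect is exactly zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=2)
n, teams = 50, 100

published = []
for _ in range(teams):
    blue = rng.binomial(1, 0.5, n)
    green = rng.binomial(1, 0.5, n)
    diff = blue.mean() - green.mean()
    pooled = np.concatenate([blue, green]).mean()
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    p = 2 * stats.norm.sf(abs(diff) / se)
    if p < 0.05:  # journals mostly see the "successes"
        published.append(diff)

print(f"{len(published)} of {teams} teams 'publish'")
print("published effect sizes:", [f"{d:+.2f}" for d in published])
```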

The issue of publication bias would be problematic just on its own. But it’s even more problematic when it interacts with researchers’ incentives. Researchers need to publish, and (see above) it is easier to do so when results are significant. This can lead to what people sometimes call p-hacking (the “p” stands for probability).

When researchers run a study, there are often many ways to analyze the data. You can analyze the impact on different subgroups of the population. You can analyze the impact under different circumstances. You can test many different treatments. The idea of the xkcd cartoon is that you could test the impact of each of the different M&M colors on some outcome.

The more of these tests you do, the more likely you are to get a significant effect by chance. If you do 100 tests, you expect 5 of them to be significant at the 5% level. And then, because of publication bias, you write up the results focusing only on the significant subgroups or significant M&M colors. Of course, those results are just accidents. But as consumers of research, we do not see all the other tests that happened in the background.
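A quick sketch of that multiple-testing problem, in the spirit of the cartoon. The 20 colors are illustrative (they are not from the study above); each color is tested against the same null outcome.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n, n_colors = 50, 20

false_positives = 0
for color in range(n_colors):
    treated = rng.binomial(1, 0.5, n)  # ate this color
    control = rng.binomial(1, 0.5, n)  # did not
    diff = treated.mean() - control.mean()
    pooled = np.concatenate([treated, control]).mean()
    se = np.sqrt(pooled * (1 - pooled) * (2 / n))
    p = 2 * stats.norm.sf(abs(diff) / se)
    false_positives += p < 0.05

# With 20 independent tests at the 5% level, the chance of at least
# one false positive is 1 - 0.95**20, about 64%.
print(f"{false_positives} of {n_colors} colors came out 'significant'")
print(f"chance of at least one false positive: {1 - 0.95**n_colors:.0%}")
```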

For these two reasons, some of what we see published, even when it comes from a randomized experiment, is likely to be the result of statistical chance. There is a somewhat notorious paper arguing that “most” published research findings are false; I think that overstates the case, but it’s a perspective.

Is everything wrong?

Is all lost? Is all research garbage?

No, definitely not. Much research is great, and randomized trials are often compelling and convincing. How do we know which new studies are worth paying attention to? There is no simple answer here, but there are a couple of features that matter.

One very, very important feature is the number of people in the study, especially relative to the expected size of the impact. Before researchers run a study, they generally have an idea of the size of the effect they might find, based on what is already known. For example, if I am studying the impact of M&M color on math performance, I would expect that effect to be very small (if it is there at all). On the other hand, if I’m studying the impact on math performance of keeping someone awake for 48 hours, I might expect that effect to be large.

The larger the effect you expect, the smaller the sample of people you need to detect it statistically. This is called “statistical power,” and in well-designed experiments, researchers calculate this before they even start their study. They make an educated guess at the size of the impact, and they figure out how many people they need to detect that effect. 
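For illustration, here is a minimal power calculation using the statsmodels library. The effect sizes (in standard-deviation units) are made-up guesses standing in for the M&M and sleep-deprivation examples, and the 80% power target is a common convention, not something from this article.

```python
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()

# How many people per group are needed to detect each effect
# with 80% power at the 5% significance level?
for label, effect_size in [("tiny effect (M&M color)", 0.05),
                           ("large effect (48 hours awake)", 0.80)]:
    n = power_calc.solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
    print(f"{label}: about {n:,.0f} people per group")
```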

When this doesn’t happen — when researchers do a small experiment on something where we expect the impact to be very small — we should not expect to learn anything from the data. The study is just not powered to detect the likely effects. If we do see an impact, it’s extremely likely that it is a statistical accident.

Shorthand here: large studies = better, especially when the expected effects are small. 

A second good feature to look for is confirmatory results. It should make us more confident if we see multiple experiments with the same result (as is common in clinical drug trials), or one experiment with a suggestive finding followed by a confirmatory experiment. More independent confirmation increases confidence, as the arithmetic below shows.
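The reason replication helps is simple: independent chance findings multiply. A pure-chance result clears the 5% bar once with probability 0.05; clearing it twice, in two independent experiments, happens only 0.25% of the time.

```python
alpha = 0.05
print(f"chance a null result is 'significant' once:       {alpha:.1%}")
print(f"chance it also replicates in a second experiment: {alpha**2:.2%}")
```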

In practice, what this means is that the set of papers we should be paying attention to is a lot smaller than the set that gets media attention. 

A story about fish

I would like to end by talking about one of my very favorite academic papers ever.

To understand this story, you need to understand something about neuroscience. One thing people who study the brain like to do is put people into brain scanners and run fMRI (functional MRI) studies. To somewhat oversimplify, these experiments show people different stimuli (pictures, words, etc.), and the researchers look at what parts of the brain light up. The idea is to figure out which parts of the brain do what.

An issue with this method is that the brain has many places that can light up. These small volumes of brain tissue are called voxels, and a scan can contain about 130,000 of them. This has enormous potential for the problems I talked about above: in principle, you could be running 130,000 separate tests, one per voxel. At the 5% level, you would expect roughly 6,500 voxels to come up significant purely by chance.

In practice, researchers use corrections to try to mitigate this problem, but the people who wrote my favorite paper felt those corrections were definitely not sufficient. To show this, they did a study on a fish. A dead fish.

Specifically, they studied a dead salmon. They put this dead salmon into an fMRI machine, and they gave it a task. As they write: “The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence, either socially inclusive or socially exclusive.” The study is careful. They provide a mirror to ensure the salmon can see the faces.

The authors then looked at what areas of the salmon’s brain lit up when the fish saw the photos, compared with a period of no photos. They found two areas of the brain that appeared significantly more electronically active during the photos. That is: if they had been running a standard study, they would have reported two areas of the brain that were responsive to emotionally charged photos.

I want to mention again that, at the time, the salmon was dead. (Brains can have electrical activity postmortem.) In addition, salmon are not well known for being responsive to the facial emotions of people, even while alive. 

As the authors say, either they have discovered something amazing about the cognition of dead fish, or the methods that people are using in these fMRI studies are statistically problematic. They favor the latter explanation, and argue that these issues need to be addressed before we can really learn about the brain.

Thanks to Jesse Shapiro, Isaiah Andrews, and Lukasz Kowalik for helpful comments. All mistakes remain my own! 
