If you are a regular reader, you’ll know that among my biggest frustrations are studies that confuse correlation and causality, especially when it comes to food. This is a topic that I research in my academic work and have written about frequently here (see, for example, this post on coffee, closely related to today’s rant).

Last week, we got a new study, reported in the New York Times with the characteristic headline “Coffee Drinking Linked to Lower Mortality Risk, New Study Finds.” I briefly pooh-poohed this headline (and the whole study) on Twitter, but I will take the opportunity here to expand on that criticism. Although I’ve made many of these points before in general terms, I want to walk through the details of how I unpack a particular study of this type.

To begin: here is the study. It’s published in the Annals of Internal Medicine, which is a prestigious journal (impact factor 25.39!).

The contours of the study are pretty typical. The researchers use a large sample (about 171,000 people) in the U.K., created as part of a broader study called the UK Biobank. They have information at baseline about the consumption of (among other things) coffee, including whether people drink their coffee unsweetened, sweetened, or artificially sweetened. They follow these individuals for seven years and look at the relationship between coffee consumption and death. They find that consumption of unsweetened or sugar-sweetened coffee is associated with a reduced risk of death. There doesn’t seem to be an association with artificially sweetened coffee.

The effects are sizable, up to a 30% reduction in the hazard rate of death over this period. The sweet spot seems to be between 1.5 and 3.5 cups a day.

The obvious concern with this paper is that it’s not the coffee but the other differences across people that drive the results. The approach this paper takes, like many others, is to try to adjust for observable differences between individuals. In this case, the data set is detailed, and they are able to control for differences in underlying health, other dietary choices, and socioeconomic status.

The key question in these papers is: are these controls sufficient? Or, alternatively, are there important unobserved variables that might be driving the results?

I have two ways into thinking about this question. The first is conceptual. Generating a causal estimate here is going to require isolating some variation in coffee consumption that is effectively random. That is: if we think these observational data generate a causal link, it must be that we think that once the controls are included, we’ve isolated variation in coffee consumption that isn’t related to other important characteristics. This could happen if, say, people choose how much coffee they drink at random, conditional on their observed characteristics, or if their choice was driven by some external factor unrelated to their health overall.

I find this idea implausible. I don’t think that coffee consumption is chosen at random, even conditional on controls; it’s part of a larger diet and lifestyle. And all of the external factors I can think of that might influence coffee consumption — stress, how busy you are, sleep — are also variables that influence health. However, this is an inherently untestable view. I find it implausible, but others might not.
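To make this confounding logic concrete, here is a toy simulation (Python, standard library only; all numbers are invented for illustration and have nothing to do with the actual study). A single unobserved “health-consciousness” variable drives both coffee drinking and mortality, coffee has no causal effect at all, and yet coffee drinkers die at a visibly lower rate:

```python
import random

random.seed(0)
N = 100_000

deaths = {True: 0, False: 0}   # deaths among drinkers / non-drinkers
counts = {True: 0, False: 0}

for _ in range(N):
    u = random.random()                        # unobserved health-consciousness, 0..1
    drinks = random.random() < 0.2 + 0.6 * u   # healthier lifestyles -> more coffee (made up)
    dies = random.random() < 0.10 - 0.06 * u   # healthier lifestyles -> less death; coffee has NO effect
    counts[drinks] += 1
    deaths[drinks] += dies

rate_c = deaths[True] / counts[True]
rate_n = deaths[False] / counts[False]
print(f"death rate, coffee drinkers: {rate_c:.3f}")
print(f"death rate, non-drinkers:    {rate_n:.3f}")
```

Adjusting for `u` would erase the gap entirely, but by construction `u` is not in the data set, which is exactly the worry about lifestyle variables the researchers cannot see.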

The second way I try to evaluate this question is with data, specifically by looking at the differences across groups in the observed controls. Why? Because I tend to think that differences we see in the observed controls are likely mirrored by differences in characteristics that remain unobserved.

In the case of this particular paper, we can look at Table 1, which gives characteristics for the four groups considered (non-consumers, unsweetened consumers, sugar-sweetened consumers, and artificially sweetened consumers).

There are a few sizable differences I notice.

  • Gender: 60% of the sugar-sweetened coffee group identify as men, versus 40% to 44% in the other three groups.
  • Race: all coffee-drinker groups are more likely to be white (with the unsweetened group much more so).
  • Education: unsweetened coffee drinkers are much more likely to have a degree than any other group.
  • Smoking: large differences, in various directions (sugar-sweetened consumers are more likely to smoke; unsweetened are less likely).
  • Diabetes: sugar-sweetened consumers are much less likely to be diabetic; artificially sweetened are much more.
  • Blood-pressure-drug use: artificially sweetened are much more likely to use.
  • Sugar consumption: all three coffee groups, but especially the unsweetened group, report less sugar consumption.

When I look at this table, it paints a picture of quite different people in each group. Relative to the non-consumers, those who drink unsweetened coffee are better educated, less likely to smoke, exercise more, and consume less sugar. It’s a picture of a better-off, more health-conscious group.

In contrast, the group that consumes artificially sweetened coffee seems to be less healthy (more diabetes, higher weight, more likely to be former smokers). They look like a group that may be drinking artificially sweetened coffee partly in an effort to improve their health.

The group that consumes sugar-sweetened coffee actually looks most similar to the non-consumers, except they are 20 percentage points more likely to be men. Which … is a large difference.

The bottom line is that on observable dimensions, these groups have some very important differences. This makes me concerned about differences in characteristics we do not observe. Most of the controls we see are pretty coarse — two categories of education, some general measure of diet based on a small number of 24-hour recalls — and my main worry is that these do not capture anything like the full picture of individuals. To the extent that there are differences in unobserved features that are reflected in the differences in observed features, this could drive some of the results.
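The coarseness worry can be simulated too (again, all numbers invented for illustration). Collapse a continuous unobserved confounder into a two-category “degree” control, stratify on it the way a regression adjustment would, and a spurious coffee-mortality gap survives within both strata, because the confounder still varies inside each category:

```python
import random

random.seed(1)
N = 500_000
# per stratum: [deaths among drinkers, drinkers, deaths among non-drinkers, non-drinkers]
strata = {True: [0, 0, 0, 0], False: [0, 0, 0, 0]}

for _ in range(N):
    u = random.random()                        # continuous unobserved confounder
    degree = u > 0.5                           # the coarse observed control (two categories)
    drinks = random.random() < 0.2 + 0.6 * u   # confounder -> coffee (made up)
    dies = random.random() < 0.20 - 0.16 * u   # confounder -> death; coffee has NO effect
    row = strata[degree]
    if drinks:
        row[0] += dies
        row[1] += 1
    else:
        row[2] += dies
        row[3] += 1

rates = {}
for degree, (dc, nc, dn, nn) in strata.items():
    rates[degree] = (dc / nc, dn / nn)
    print(f"degree={degree}: drinkers {dc / nc:.3f} vs non-drinkers {dn / nn:.3f}")
```

The coarse control shrinks the bias but does not remove it: residual confounding remains within each education category, which is the sense in which a two-category control cannot stand in for the full picture of a person.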

This concern is also fundamentally untestable. For me, observing these large differences across groups on variables we do observe points to likely differences on variables we do not, which could generate bias. The authors might disagree: they could argue that while there are differences in observed variables, there are no important unobserved ones. Their argument becomes more persuasive as we observe more and more controls; mine becomes more persuasive as the differences on observable dimensions get larger. But neither view can be confirmed for sure.

In the end, I find the idea that the findings in this paper are causal to be implausible for these two key reasons. What I cannot do — what no one can do — is prove that. Which may be why studies with this structure continue to be published, and people like me continue to complain.