I spend a lot of time thinking about how difficult it is to understand the relationship between diet and health, and the two examples I come back to frequently are coffee and alcohol. Both of these choices are sometimes linked to worse health and sometimes to better health. One day, coffee helps you live forever; three weeks later, another study says it causes early death. Alcohol consumption is subject to similar fluctuations — a glass of red wine a day is key to heart health; no, actually, any drinking is dangerous.

In both cases (as with much in diet), the underlying problem is that dietary choices are not random, and it’s hard to separate these choices from the other choices or characteristics they typically go along with. As a result, researchers turn to more sophisticated statistical techniques to try to answer these questions.

One of these techniques surfaced a couple of weeks ago in the context of alcohol. A new study on alcohol and cardiovascular effects came out in late March that used a technique called Mendelian randomization to try to better isolate causal effects. It got attention, not least because of this exotic empirical strategy, based on genetics.

However, this technique is somewhat confusing and, in my view, poorly understood even by some of the people who use it. So today, I want to do a deep dive. I’ll first try to explain, in a stylized example, how this works (and what the pitfalls might be). I’ll then talk about the particular details of this study.

Yes, this post is a little more technical than usual. The newsletter is called ParentData, after all! Stick with me. It’s interesting, I promise. 

Teaching example overview

Let’s put aside drinking and heart disease and turn to perhaps the most canonical relationship in economics: the relationship between education and wages. If people get more education, do they make more money and, if so, how much? If you think about it for a moment, you can see why it might be difficult to learn the answer to this question just by looking at wages across education groups. There are many other factors (family background, circumstances, ability, patience) that likely contribute to education but also to wages directly.

When researchers study this question, then, they look for strategies to get around these confounding factors. The ideal (from a research standpoint) would be to randomize how much education people get. Since we typically cannot do that for practical or ethical reasons, a common approach is to look for some other external factor that impacts individual education. In a famous example, researchers noted that because of compulsory schooling laws, the quarter of the year that you were born in impacted your educational attainment. They could then use the time of birth as what we call an “instrumental variable” to estimate a (more plausibly) causal impact of education on wages.

Very broadly, the idea in Mendelian randomization (which I will now call “MR” for word count reasons) is to recognize that your genetic code could be used as this instrumental variable.

How might that work in practice?

Quick biology reminder: You have two copies of each of your 23 chromosomes, one inherited from each parent. Each chromosome contains a number of genes, which together make up your genetic code.

Imagine for a moment that we’ve identified a genetic variant (a “SNP”) that strongly predicts college attendance. Let’s imagine it’s on chromosome 3, and we’re going to call this variant “COLLEGE.”[1] Let’s say your mother has one copy of the COLLEGE variant, on one of her two chromosome 3s.

When you are conceived, you get one copy of each chromosome from your mother and your father. This means you get only one of the chromosome 3 copies from your mother. And — here’s the key — which copy you get is random. As a result, there is a 50% chance you get her COLLEGE variant and a 50% chance you get the other copy, with no variant. (You’ll get your other copy of chromosome 3 from your father; here, I’m going to assume he doesn’t have the COLLEGE variant at all, so you definitely do not get it from him.)

In this scenario, you have a few siblings, and each of them also gets a copy of chromosome 3 from your mother. Some of you get the COLLEGE variant copy, and some get the other one. In expectation, half get each. But we have now generated random variation in the propensity to go to college within your family, based on this genetic lottery. We can potentially use that to estimate the effect of college on wages. Potentially being the key word, as doing so is going to require additional assumptions.
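
To make the lottery concrete, here is a small simulation sketch in Python. Everything in it is an assumption for illustration: families of four children, a mother with exactly one COLLEGE copy, a father with none, and function names I made up. It only checks that about half of the children end up with the variant and that most families contain a mix of carriers and non-carriers.

```python
import random

CHILDREN_PER_FAMILY = 4  # assumed family size, purely illustrative


def inherit_from_mother(rng):
    """Return 1 if the child receives the mother's COLLEGE copy of chromosome 3, else 0."""
    # The mother carries the variant on exactly one of her two copies,
    # and each copy is equally likely to be passed on.
    return rng.choice([1, 0])


def simulate_family(n_children, rng):
    """Variant status (1 or 0) for each child in one family."""
    return [inherit_from_mother(rng) for _ in range(n_children)]


rng = random.Random(0)
families = [simulate_family(CHILDREN_PER_FAMILY, rng) for _ in range(10_000)]

# Share of all children who carry the variant (should be close to one half).
total_children = CHILDREN_PER_FAMILY * len(families)
share_with_variant = sum(sum(f) for f in families) / total_children

# Share of families with at least one carrier and at least one non-carrier.
mixed = sum(1 for f in families if 0 < sum(f) < CHILDREN_PER_FAMILY) / len(families)

print(f"share of children with the variant: {share_with_variant:.3f}")
print(f"share of families with a mix:       {mixed:.3f}")
```

The mixed families are the ones that carry the information: a family where every child got the same copy tells you nothing about what the variant does.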

One thing I want to be clear about: The “randomness” in genetic makeup here is necessarily conditional on your parental genes. Genetic variants are not, in general, randomly allocated around the population of the world. Since your parents’ (and other ancestors’!) genes impact their behavior and outcomes, and those behaviors and outcomes can impact you directly, it’s really only among siblings who share both parents that there is a condition of randomness.

Simple approach: The simplest approach to the data here would be to compare wages across children in the same family who got different versions of the COLLEGE variant. What we can say with confidence, comparing siblings within a family, is whether the child who gets the COLLEGE variant of the gene has higher wages.

This impact may be causal, but it is also uninteresting. The question we are interested in is to what extent going to college increases wages. There is a simple way to imagine translating between the two. Specifically:

  • Calculate how much having the COLLEGE variant increases the chance of going to college within a group of siblings.
  • Calculate how much having the COLLEGE variant increases wages within a group of siblings.
  • Divide the second number by the first number. This effectively translates the impact of the COLLEGE gene on wages into an impact of college-going on wages. It re-scales the impact to get what you want.

This is called an IV (instrumental variables) estimator or a Wald estimator. The calculation is straightforward, but interpreting what you get as the causal impact of college-going on wages requires additional assumptions. (Want a more technical explanation of all of the following? Start with this seminal 1996 JASA paper, among the origins of last year’s Nobel Prize in Economics. Or the less technical explainer here.)
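
If it helps to see the three steps run, here is a minimal sketch on simulated data. The world it simulates is entirely made up: the effect sizes, the unobserved "ability" trait, and the shared family background are my assumptions, and the exclusion restriction holds by construction (the variant touches wages only through college). I use sibling pairs in which exactly one child carries the variant, so each step is a simple within-pair difference.

```python
import random
from statistics import mean

TRUE_EFFECT = 10.0  # hypothetical causal wage gain from college, arbitrary units


def simulate_sibling(has_variant, family_effect, rng):
    """Return (college, wage) for one child in the made-up world."""
    ability = rng.gauss(0, 1)  # unobserved trait that raises both college-going and wages
    propensity = 1.2 * has_variant + ability + family_effect + rng.gauss(0, 1) - 0.6
    college = 1 if propensity > 0 else 0
    # The variant does not appear here: it affects wages only through college.
    wage = 30 + TRUE_EFFECT * college + 3 * ability + 3 * family_effect + rng.gauss(0, 1)
    return college, wage


def simulate_discordant_pairs(n_pairs, rng):
    """Sibling pairs in which exactly one child inherited the variant."""
    pairs = []
    for _ in range(n_pairs):
        family_effect = rng.gauss(0, 1)  # shared background, identical for both siblings
        carrier = simulate_sibling(1, family_effect, rng)
        non_carrier = simulate_sibling(0, family_effect, rng)
        pairs.append((carrier, non_carrier))
    return pairs


def wald_estimate(pairs):
    # Step 1: how much the variant raises college-going within sibling pairs.
    first_stage = mean(c[0] - n[0] for c, n in pairs)
    # Step 2: how much the variant raises wages within sibling pairs.
    reduced_form = mean(c[1] - n[1] for c, n in pairs)
    # Step 3: divide the second number by the first.
    return reduced_form / first_stage, first_stage, reduced_form


def naive_estimate(pairs):
    """Raw wage gap between college-goers and everyone else, ignoring confounding."""
    people = [person for pair in pairs for person in pair]
    college_wages = [w for c, w in people if c == 1]
    other_wages = [w for c, w in people if c == 0]
    return mean(college_wages) - mean(other_wages)


rng = random.Random(1)
pairs = simulate_discordant_pairs(20_000, rng)
iv, first_stage, reduced_form = wald_estimate(pairs)
naive = naive_estimate(pairs)

print(f"variant -> college (step 1): {first_stage:.3f}")
print(f"variant -> wages   (step 2): {reduced_form:.3f}")
print(f"Wald estimate      (step 3): {iv:.2f}")
print(f"naive college wage gap:      {naive:.2f}")
```

In this toy world the Wald ratio should land near the true effect of 10, while the naive comparison overshoots because ability and family background push up both college-going and wages. That good behavior depends on the assumptions baked into the simulation, which is exactly why they matter.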

What are the assumptions, and how do they work in the genetic case?

Causal interpretation: The key assumption here is what is called the exclusion restriction. Intuitively, the exclusion restriction says that in order to interpret our simple estimate as a causal impact, it must be the case that the random variable (in this example, the COLLEGE variant) impacts the outcome (wages) only as a result of its impact on the intermediate behavior (college-going). That is, the COLLEGE gene doesn’t lead to higher wages on its own.

In the case of these genetic analyses, there are several primary ways the exclusion restriction might be violated.

The first problem has a name: pleiotropy. This is the phenomenon whereby a single gene influences multiple traits. For example, imagine that this COLLEGE gene influences college-going but also height. We know that taller people make more money (seriously). If so, the differences we see in wages within the family might be due to differences in height, not differences in college-going, and it would be a mistake to attribute all of the impact of the COLLEGE variant to college-going.

A related issue is linkage disequilibrium. Genes that are near each other on a chromosome are more likely to be inherited together. If the COLLEGE gene is right next to a HEIGHT gene, you could get a form of the pleiotropy problem, even if they are distinct genes.

A final issue is that most of the time we have no idea what the gene really does. My hypothetical COLLEGE gene doesn’t, like, fill out the Common App for you. A gene that is predictive of college attendance could be predictive for any number of reasons — because it influences patience, because it influences ability, because it influences the likelihood of being good enough at tennis to play in college. But in some cases, these other factors could also independently predict wages. Again, it would be a mistake, then, to attribute all the effect of the gene to its impact on college-going.
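
All three problems share the same arithmetic: the variant moves wages through some path other than college, and the Wald ratio folds that extra path into the answer. Here is a back-of-the-envelope sketch. The numbers and the function name are hypothetical, and I am assuming the simplest possible world, in which the effects are constant and just add up.

```python
def wald_with_direct_effect(true_effect, first_stage, direct_effect):
    """Wald ratio when the variant also moves wages through a path other than college.

    true_effect:   causal effect of college on wages (what we want to learn)
    first_stage:   effect of the variant on the chance of attending college
    direct_effect: effect of the variant on wages through anything else
                   (height, patience, a neighboring gene, ...)
    """
    reduced_form = true_effect * first_stage + direct_effect
    return reduced_form / first_stage


TRUE_EFFECT = 10.0  # hypothetical, arbitrary wage units

# Exclusion restriction holds: the ratio recovers the true effect.
clean = wald_with_direct_effect(TRUE_EFFECT, first_stage=0.25, direct_effect=0.0)

# Same variant, but it also raises wages by 1 unit through another channel.
contaminated = wald_with_direct_effect(TRUE_EFFECT, first_stage=0.25, direct_effect=1.0)

# Same violation with a variant that only weakly predicts college.
weak = wald_with_direct_effect(TRUE_EFFECT, first_stage=0.05, direct_effect=1.0)

print(f"no direct effect:             {clean:.1f}")
print(f"direct effect, strong variant: {contaminated:.1f}")
print(f"direct effect, weak variant:   {weak:.1f}")
```

Under these assumptions the error is the direct effect divided by the first stage. A small violation gets magnified when the variant only weakly predicts the behavior, which is the usual situation with genetic variants.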

None of this is to say that these analyses cannot be useful, or cannot deliver causal estimates. There are cases (breast cancer, for example) in which we have genes that clearly lead to a dramatically increased risk of cancer. We could use within-family variation in these genes to estimate the impact of getting breast cancer on various outcomes or behaviors.

Even here, though, it wouldn’t necessarily be appropriate to use this to try to (for example) estimate the impact of life expectancy on happiness, even though these genetic variants do impact life expectancy, because they could influence happiness for other reasons too. As in virtually all cases where we use these instrumental variables strategies, it is necessary to think really carefully about what, exactly, is going on.

The literature using these techniques does, in fact, think about these exclusion restrictions. My point is simply that, in some of these settings, violations are hard to rule out.

New study: alcohol and heart disease

I dragged us through that long discussion of the logic of MR in order to discuss this new study about the link between alcohol consumption and cardiovascular disease. In the paper, the authors aim to use the MR approach to generate causal estimates of this relationship. As they note, when you look at a cross section of people, you tend to find that light drinking is associated with better heart health but heavy drinking with worse. These authors are rightly concerned that this result might simply reflect the fact that people who drink lightly tend to be better educated, wealthier, and less likely to smoke than either abstainers or those who drink heavily.

Their proposed solution is to exploit a set of genetic variants that have been associated with alcohol dependence. They use individual-level data that includes drinking amounts, heart disease, and these genetic variants. The authors use a form of the analysis described above to estimate a relationship between drinking and heart disease, using the genetic variation as the instrumental variable. Their conclusion is that this approach shows no health advantage to light drinking and a large health risk to heavy drinking: all drinking is at least slightly bad, and drinking more is worse.

The paper has gotten some significant attention. The New York Times wrote about it, quoting a doctor who said the conclusions of the study “totally changed my life.”

I, however, remain skeptical. The analysis here is subject to a number of the complex concerns raised above. In several cases, the variants the authors explore are associated with outcomes even for non-drinkers. They attempt to exclude these particular variants from their main results, but this raises the general concern that these genes are impacting outcomes for reasons unrelated to drinking (which would violate the exclusion restriction). In a more technical sense, they instrument for both a linear and a squared term in the analysis, and it isn’t clear that this will generate causal estimates even putting aside confounding concerns.

The biggest issue with this paper, though, is simply that the authors do not do this analysis within sibling groups. I noted up top that genes are randomly assigned and that you could use that randomization for identification. This is true only within families; genes aren’t randomly allocated around the population overall. In this paper, however, the authors do not observe family groups. So rather than compare two siblings who got a different set of genetic endowments at random, they are comparing people whose family genetic makeup is different.

This approach is subject to more basic concerns. The individuals in the study whose genetic makeup is associated with more alcohol consumption are also more likely to have had parents who consumed more alcohol. This could matter for all kinds of reasons having nothing to do with their own consumption. The authors of the paper do not, as far as I can tell, observe anything about family background, parental drinking, or anything else like that.
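
Here is a toy simulation of why this matters. To be clear, this is not the paper's data or model: it is a made-up linear world in which growing up with a carrier parent raises heart risk directly (through household environment, say), on top of whatever your own drinking does. All names and numbers are assumptions for illustration.

```python
import random
from statistics import mean

TRUE_EFFECT = 1.0    # hypothetical effect of one's own drinking on heart risk
PARENT_EFFECT = 2.0  # hypothetical direct effect of growing up with a carrier parent


def simulate_families(n_families, rng):
    """Two children per family; each child is (variant, drinking, risk)."""
    families = []
    for _ in range(n_families):
        mother_carrier = rng.random() < 0.3  # mother carries one copy of the variant
        children = []
        for _ in range(2):
            # A child can inherit the variant only if the mother carries it.
            z = 1 if (mother_carrier and rng.random() < 0.5) else 0
            drinking = z + rng.gauss(0, 1)
            risk = TRUE_EFFECT * drinking + PARENT_EFFECT * mother_carrier + rng.gauss(0, 1)
            children.append((z, drinking, risk))
        families.append(children)
    return families


def population_wald(families):
    """Compare carriers with non-carriers across the whole population."""
    people = [child for fam in families for child in fam]
    carriers = [p for p in people if p[0] == 1]
    others = [p for p in people if p[0] == 0]
    first_stage = mean(p[1] for p in carriers) - mean(p[1] for p in others)
    reduced_form = mean(p[2] for p in carriers) - mean(p[2] for p in others)
    return reduced_form / first_stage


def within_family_wald(families):
    """Compare siblings, using only pairs where exactly one child carries the variant."""
    diffs = []
    for a, b in families:
        if a[0] != b[0]:
            carrier, other = (a, b) if a[0] == 1 else (b, a)
            diffs.append((carrier[1] - other[1], carrier[2] - other[2]))
    return mean(d[1] for d in diffs) / mean(d[0] for d in diffs)


rng = random.Random(2)
families = simulate_families(40_000, rng)
population = population_wald(families)
within = within_family_wald(families)

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"population-level Wald:  {population:.2f}")
print(f"within-family Wald:     {within:.2f}")
```

In this sketch the sibling comparison should recover something close to the true effect, because the parent's influence is the same for both children and cancels out. The population comparison loads the parent's influence onto the child's drinking and overstates the harm. Whether anything like this is going on in the actual study is exactly what we cannot check without family data.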

In the end, then, the Mendelian randomization used in this paper is … not random. Forget about exclusion restrictions or interpretation concerns. This paper falls on a much more basic sword.

To be clear: I am quite sympathetic to the authors’ views that the slight positive effect of moderate alcohol consumption is correlation rather than causation. I felt that way before, and I still do. And there are methodologically stronger papers that use this approach to answer the same question (notably, this one). This particular paper, however, is too problematic to move the needle very much.

This post was a challenge, and I’m grateful for help from Jonathan Roth, Peter Hull, Dan Benjamin, and Penelope Shapiro.

  1. This is all very hypothetical. Although there are some SNPs associated with education, none of them are strongly associated, and I have no idea if they are on chromosome 3. Also, genetic variants have names like “rs1260326,” not “COLLEGE.”