We make decisions based on the data we see. One restaurant serves higher-quality food than another. One presidential candidate aligns more appropriately with our values. One surgical technique yields better outcomes. One applicant submits a stronger job application than a competitor. From these data, we decide what course of action to take. In many cases, these decisions are inconsequential. In others, however, a poor decision may lead to dangerous results. Let’s consider danger.
Imagine you are a surgeon. A patient arrives in your clinic with a particular condition. Let us call this condition, for illustrative purposes, phantasticolithiasis. The patient is in an immense amount of pain. After reviewing the literature on phantasticolithiasis, you discover that this condition can be fatal if left untreated. The review also describes two surgical techniques, which we shall call “A” and “B” here. Procedure A, according to the review, has a 69% success rate. Procedure B, however, seems much more promising, having a success rate of 81%. Based on these data, you prepare for procedure B. You tell the patient which procedure you will be performing and share some of the information you learned. You tell a few colleagues about your plan. On the eve of the procedure, you call your old friend, a fellow surgeon practicing on another continent. You tell him about this interesting disease, phantasticolithiasis, what you learned about it, and your assessment and plan. There is a pause on the other end of the line. “What is the mass of the lesion?” he asks. You respond that it is much smaller than average. “Did you already perform the procedure?” he continues. You tell him that you have not and that the procedure is tomorrow morning.
“Switch to procedure A.”
Confused, you ask your friend why. He explains the review a bit further. The two procedures were performed on various categories of phantasticolithiasis. However, what the review failed to mention was that procedure A was more commonly performed on the largest lesions, and procedure B on the smallest lesions. Larger lesions, as you might imagine, have a much lower success rate than their smaller counterparts. If you separate the patient population into two categories, large lesions and small lesions, the results change dramatically. In the large-lesion category, procedure A has a success rate of 63% (250/400) and procedure B has a success rate of 57% (40/70). For the small lesions, procedure A is 99% successful (88/89) and procedure B is 88% successful (210/240). In other words, when controlling for lesion size, procedure A is more successful than procedure B in both categories. You follow your friend’s advice. The patient’s surgery is a success, and you remain dumbfounded.
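To see the reversal in the numbers themselves, here is a minimal Python sketch that recomputes the rates from the counts your friend quotes. The variable and function names (counts, rate, pooled_rate) are my own, chosen only for illustration.

```python
# (successes, total) per procedure and lesion size, as quoted from the review.
counts = {
    "A": {"large": (250, 400), "small": (88, 89)},
    "B": {"large": (40, 70), "small": (210, 240)},
}


def rate(successes, total):
    """Success rate within a single lesion-size category."""
    return successes / total


def pooled_rate(strata):
    """Success rate when both lesion sizes are lumped together."""
    successes = sum(s for s, _ in strata.values())
    total = sum(t for _, t in strata.values())
    return successes / total


for procedure, strata in counts.items():
    per_size = ", ".join(
        f"{size}: {rate(*cell):.1%}" for size, cell in strata.items()
    )
    print(f"Procedure {procedure} -> {per_size}, pooled: {pooled_rate(strata):.1%}")
```

Run as written, this prints 62.5% and 98.9% for procedure A against 57.1% and 87.5% for procedure B within each lesion size, yet pooled rates of 69.1% for A and 80.6% for B. Procedure B looks better overall only because most of its patients had small lesions, which do well under either procedure.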
What’s happening here is something called Simpson’s paradox. The idea is simple: when two variables are considered (for example, the choice of procedure and the outcome), one association results (procedure B is more successful). However, upon conditioning on a third variable (lesion size), the association reverses (procedure A is more successful). This phenomenon has far-reaching implications. For example, since 2000, the median US wage has increased by 1% when adjusted for inflation, a statistic many politicians like to boast about. However, within every educational subgroup, the median wage has decreased. Something similar can be said for the gender pay gap. Barack Obama in both of his campaigns fought against the gap, reminding us that women only make 77 cents for every dollar a man earns. However, the problem is more than just a paycheck, and the differences change and may even disappear if you control for job sector or level of education. In other words, policy changes to reduce the gap need to be more nuanced than a campaign snippet. A particularly famous case of the paradox arose at UC Berkeley, where the school was accused of gender bias in admissions. The school admitted 44% of its male applicants and only 35% of its female applicants. However, upon conditioning on department, it was found that women applied more often to those departments with lower rates of admission. In 2/3 of the departments, women had a higher admission rate than men.
The paradox seems simple. When analyzing data and making a decision, simply control for other variables and the correct answer will emerge. Right? Not exactly. How do you know which variables should be controlled? In the case of phantasticolithiasis, how would you know to control for lesion size? Why couldn’t you just as easily control for the patient’s age or comorbidities? Could you control for all of them? If you do see the paradox emerge, what decision should you then make? Is the correct answer that of the conditioned data or that of the raw data? The paradox becomes complicated once again.
Judea Pearl wrote an excellent description of the problem and proposed a solution to the above questions. He cites the use of “do-calculus,” a technique rooted in the study of Bayesian networks. Put more simply, his methods model the causal relationships among a number of variables. With such a model in hand, one can identify which variables to condition on and can then decide whether the conditioned data or the raw data are best for decision-making. The set of variables that dictates causality is the one that should be used. If you are interested in the technique and have some experience with the notation, I recommend this brief review on arXiv.
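To give a flavor of what such an analysis yields, here is a minimal sketch of the adjustment formula from Pearl’s framework, P(success | do(procedure)) = Σ_z P(success | procedure, z) · P(z), applied to the phantasticolithiasis counts. It rests on a loud assumption of mine that the story does not establish: lesion size (z) is the only variable that influences both the choice of procedure and the outcome. The names size_distribution and adjusted_rate are hypothetical helpers, not part of any library.

```python
# (successes, total) per procedure and lesion size, as quoted from the review.
counts = {
    "A": {"large": (250, 400), "small": (88, 89)},
    "B": {"large": (40, 70), "small": (210, 240)},
}


def size_distribution(counts):
    """P(z): share of all patients in each lesion-size category."""
    totals = {}
    for strata in counts.values():
        for size, (_, n) in strata.items():
            totals[size] = totals.get(size, 0) + n
    grand_total = sum(totals.values())
    return {size: n / grand_total for size, n in totals.items()}


def adjusted_rate(strata, weights):
    """Sum over z of P(success | procedure, z) * P(z).

    ASSUMPTION: lesion size is the only confounder of procedure and outcome.
    """
    return sum(weights[size] * s / n for size, (s, n) in strata.items())


weights = size_distribution(counts)
adjusted = {p: adjusted_rate(strata, weights) for p, strata in counts.items()}

for procedure, value in adjusted.items():
    print(f"Procedure {procedure}: adjusted success rate {value:.1%}")
```

Under that assumption, the adjusted rates come out to about 77.5% for procedure A and 69.6% for procedure B. These estimate how each procedure would fare if it were given to the whole patient population rather than mostly to one lesion size. If other confounders exist, such as age or comorbidities, this simple weighting is not enough, which is exactly why the causal model matters.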
Of course, rapid and rather inconsequential decisions need not be based on such formalities. On the other hand, it serves all of us well if we at least consider the possibility of Simpson’s paradox on a day-to-day basis. Be skeptical when reading the paper, speaking with colleagues, and making decisions. Finally, if you’re ever lucky enough to be the first patient with phantasticolithiasis, opt for procedure A.