Section: 12 πŸ”– Multiple Hypothesis Testing

12.1 Introduction

We often consider multiple possible hypotheses in our search for discovery, hoping to find one with the lowest possible p-value. Consciously or subconsciously, we are engaging in what is often called p-value hunting. We have to be very careful! We may "discover" a pattern that is purely random even if we correctly calculate the p-value and compare it with the significance level. It is very important to learn about the multiple hypothesis trap very early in the process of learning data science.

For example, assume that we are looking for associations between sales of individual items in a supermarket. Does bread sell with butter? Does coffee sell with spring water? There are N choose 2 possible pairs, where N is the number of items, so the number of combinations grows quadratically with N and quickly becomes very large. For each such pair we perform a hypothesis test. If one test is performed at the 5% significance level and the corresponding null hypothesis is true, there is only a 5% chance of incorrectly rejecting the null hypothesis. However, if 100 tests are each conducted at the 5% significance level and all corresponding null hypotheses are true, the expected number of incorrect rejections (also known as false positives or Type I errors) is 5. If the tests are statistically independent of each other, the probability of at least one incorrect rejection is approximately 99.4%. Thus, we will almost surely find at least one false positive! In other words, we will be fooled by data.
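The arithmetic behind these numbers is easy to reproduce. A minimal sketch in R, assuming only the 5% level and 100 independent tests from the example above:

```r
# Expected number of false positives, and the chance of at least one,
# when m independent tests are all run at the 5% significance level
# and every null hypothesis is true.
alpha <- 0.05
m <- 100

m * alpha            # expected false positives: 5
1 - (1 - alpha)^m    # probability of at least one: ~0.994
```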

The Bonferroni correction is a method to counteract the multiple hypothesis problem (often called the multiple comparison problem): make it harder to reject null hypotheses by dividing the significance level by the number of hypotheses. The Bonferroni correction compensates for the increased chance of false positives by testing each individual hypothesis at a significance level of α / m, where m is the number of hypotheses. For example, if a trial is testing m = 20 hypotheses with a desired α = 0.05, then the Bonferroni correction would test each individual hypothesis at

\[\begin{equation} \frac{\alpha}{m} = \frac{0.05}{20} = 0.0025 \end{equation}\]
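The same calculation can be done in R, together with base R's p.adjust(), which applies the Bonferroni correction by inflating the p-values rather than shrinking the threshold. The three p-values below are invented purely for illustration:

```r
# Bonferroni-adjusted threshold for the example above:
# m = 20 hypotheses tested at a desired overall alpha of 0.05.
alpha <- 0.05
m <- 20
alpha / m                                                 # 0.0025

# Hypothetical raw p-values, made up purely for illustration.
p_values <- c(0.0004, 0.003, 0.02)
p_values < alpha / m                                      # TRUE FALSE FALSE
p.adjust(p_values, method = "bonferroni", n = m) < alpha  # same decisions
```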

Thus, there is a very simple remedy for the multiple hypothesis trap: just divide the significance level by the number of (potential) hypotheses tested. This will make it harder, often much harder, to reject the null hypothesis and yell Eureka! Critics say that the Bonferroni correction is in fact too conservative: too "pro-null" and too tough on alternative hypotheses.

The unwanted side effect of the Bonferroni correction is that we may fail to reject the null hypothesis too often. The Bonferroni correction sometimes makes discovery too hard, making data scientists too conservative and leading them to accept null hypotheses that should be rejected. It may also be the case that even the Bonferroni correction will not protect us, as we show in the example below. But at least we will be much less likely to make fools of ourselves with false discoveries that could lead to very wrong business decisions.

There are other, less conservative methods of correcting for multiple hypotheses, such as the Benjamini-Hochberg method described in the attached slides.
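For readers who want to experiment, base R's p.adjust() also implements Benjamini-Hochberg (method = "BH"), which controls the false discovery rate rather than the family-wise error rate. A small sketch with invented p-values contrasts the two corrections:

```r
# Bonferroni vs. Benjamini-Hochberg with base R's p.adjust().
# The p-values are invented for illustration only.
p_values <- c(0.0005, 0.009, 0.012, 0.041, 0.22, 0.74)
alpha <- 0.05

p.adjust(p_values, method = "bonferroni") < alpha  # TRUE FALSE FALSE FALSE FALSE FALSE
p.adjust(p_values, method = "BH") < alpha          # TRUE TRUE  TRUE  FALSE FALSE FALSE
```

On these made-up numbers Bonferroni rejects only one null hypothesis while Benjamini-Hochberg rejects three, which illustrates why it is considered the less conservative of the two.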

We illustrate the p-value hunt in snippet 12.1.1 below. It is based on a synthetic data set of summer temperatures in New Jersey townships. Table 12.1 below shows a sample of the data set, which contains hypothetical temperature readings from various municipalities of New Jersey over the summer. Is one township experiencing higher average temperatures than another? Can we find such a pair? This is the ultimate p-value hunt: let's compare townships pair by pair until we find a pair with a sufficiently large difference of mean temperatures and a sufficiently low p-value. Careful! You may come up with a false discovery if you do not correct for multiple hypotheses!

Snippet 12.1.1 runs permutation tests of the difference of mean temperatures for five pairs of townships. Two of the first four pairs show p-values below the customary significance level of 5%. Should we then reject the null hypotheses and conclude that Ocean Grove is indeed warmer than New Brunswick and that New Brunswick is warmer than Holmdel? Both pairs result in p-values well below 5%, and if we incorrectly disregard the number of hypotheses considered, we may come to wrong conclusions supporting these two alternative hypotheses. But there are around 20 townships in the Temp data set, so there are around 200 possible hypotheses (200 pairs of townships) which we may consider in our p-value hunt. If we apply the Bonferroni correction for m = 200, the significance level becomes 200 times lower: instead of 5%, it is 0.025%. Neither of the two hypotheses (Ocean Grove vs New Brunswick and New Brunswick vs Holmdel) meets the new significance level; in both cases the p-values are significantly larger than 0.025%. Thus, for none of the first four pairs can we reject the null hypothesis.

Now we can disclose that we created our Temp data set completely randomly, assigning random temperatures between 50 and 100 degrees to each township. Thus, without the Bonferroni correction we would be fooled by data not once but twice in the first four tests. We would find a trend where none exists; it is simply random deviation.

It turns out, however, that even the Bonferroni correction is not sufficient to protect us against incorrectly rejecting a null hypothesis. Indeed, for Red Bank and Holmdel we conclude that Red Bank is warmer than Holmdel with a p-value of 0.0002, i.e. 0.02% (see the last permutation test in snippet 12.1.1). This p-value falls even below the significance level adjusted with the Bonferroni correction (0.025%). It only shows that dealing with multiple hypotheses is a risky adventure. We may end up being fooled by data even when we apply the Bonferroni correction. But at least we are far less likely to fall into the trap of multiple hypotheses when we do.
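For completeness, the whole hunt can be automated. The sketch below assumes the Permutation() function and the Temp data frame from snippet 12.1.1 further down are already loaded; it tests every pair of townships and reports only those that survive the Bonferroni-adjusted threshold (each call also draws its permutation histogram):

```r
# A sketch of the full p-value hunt with a Bonferroni guard.
# Assumes Permutation() and Temp from snippet 12.1.1 below are loaded.
townships <- unique(as.character(Temp$Township))
all_pairs <- combn(townships, 2)              # every pair of townships
alpha_adj <- 0.05 / ncol(all_pairs)           # Bonferroni-adjusted level

for (k in seq_len(ncol(all_pairs))) {
  p <- Permutation(Temp, "Township", "Temprature", 1000,
                   all_pairs[1, k], all_pairs[2, k])
  if (p < alpha_adj) {
    cat(all_pairs[1, k], "vs", all_pairs[2, k], ": p =", p,
        "(rejected even after the Bonferroni correction)\n")
  }
}
```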

Download: Tempratures.csv

Table 12.1: Snippet of the Temperature Dataset (the first column is the row number in the full data set)

        Township          Temprature
1770    Red Bank          100
608     Red Bank          58
1787    Ocean Grove       90
161     North Brunswick   55
1515    Princeton         51
553     Trenton           66
1179    Morristown        90
1466    Morristown        96
843     Jersey City       90
81      Princeton         100


The Temp data set assigns random temperatures between 50 and 100 degrees to around 20 townships in New Jersey.
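A data set of this shape is easy to simulate. The sketch below is only an illustration of how such random data could be generated; the township list and the number of rows are assumptions, not the authors' exact script:

```r
# Simulating a Temp-like data set: every temperature is random,
# so any apparent trend between townships is pure chance.
# The township list is partial (the real data set has around 20)
# and the row count is an assumption.
set.seed(101)
townships <- c("Red Bank", "Ocean Grove", "New Brunswick", "Holmdel",
               "Princeton", "Trenton", "Passaic", "Newark",
               "Morristown", "Jersey City", "North Brunswick")
Temp_sim <- data.frame(
  Township   = sample(townships, 2000, replace = TRUE),
  Temprature = sample(50:100, 2000, replace = TRUE)
)
head(Temp_sim)
```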

Without the Bonferroni correction we would incorrectly reject the null hypothesis in the last three permutation tests (3rd, 4th, and 5th). Even with the correction we still reject the null hypothesis (a false positive discovery) in the last, 5th case: the 5th p-value is 0.0002, which is less than the significance level after the Bonferroni correction (0.00025).

12.1.1 Multiple Permutation Tests

```r
# Permutation test for the difference of mean values of column c2
# between the two groups w1 and w2 of column c1 in data frame df1.
# n is the number of random permutations; the function plots the
# null distribution of the difference and returns the estimated p-value.
Permutation <- function(df1, c1, c2, n, w1, w2) {
  df <- as.data.frame(df1)
  D_null <- c()
  V1 <- df[, c1]
  V2 <- df[, c2]
  sub.value1 <- df[df[, c1] == w1, c2]
  sub.value2 <- df[df[, c1] == w2, c2]
  # Observed absolute difference of means between the two groups
  D <- abs(mean(sub.value2, na.rm = TRUE) - mean(sub.value1, na.rm = TRUE))
  m <- length(V1)
  l <- length(V1[V1 == w2])
  for (jj in 1:n) {
    # Randomly reassign the group labels and recompute the difference
    null <- rep(w1, length(V1))
    null[sample(m, l)] <- w2
    nf <- data.frame(Key = null, Value = V2)
    names(nf) <- c("Key", "Value")
    w1_null <- nf[nf$Key == w1, 2]
    w2_null <- nf[nf$Key == w2, 2]
    D_null <- c(D_null, mean(w2_null, na.rm = TRUE) - mean(w1_null, na.rm = TRUE))
  }
  # Plot the null distribution with the observed difference marked in red
  myhist <- hist(D_null, prob = TRUE)
  multiplier <- myhist$counts / myhist$density
  mydensity <- density(D_null, adjust = 2)
  mydensity$y <- mydensity$y * multiplier[1]
  plot(myhist)
  lines(mydensity, col = 'blue')
  abline(v = D, col = 'red')
  # Estimated p-value: fraction of permuted differences exceeding D
  M <- mean(D_null > D)
  return(M)
}

Temp <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/Tempratures.csv") # web load

Permutation(Temp, "Township", "Temprature", 1000, "Princeton", "Trenton")
Permutation(Temp, "Township", "Temprature", 1000, "Passaic", "Newark")
Permutation(Temp, "Township", "Temprature", 1000, "Ocean Grove", "New Brunswick")
Permutation(Temp, "Township", "Temprature", 1000, "New Brunswick", "Holmdel")
Permutation(Temp, "Township", "Temprature", 1000, "Red Bank", "Holmdel")
```

12.2 Additional References

https://multithreaded.stitchfix.com/blog/2015/10/15/multiple-hypothesis-testing/