Section: 8 🔖 Hypothesis Testing

8.1 Introduction

Randomness is the biggest enemy of data scientists. How do we avoid falling into its trap? How do we distinguish what’s real from what’s random? This is the goal of hypothesis testing. One does not need statistics to understand the key idea. Of course, statistics, probability distributions, means, and standard deviations will come in due time. But we do not have to start with these sometimes intimidating concepts!

The permutation test is the shortest path to understanding randomness and to avoiding the randomness trap. To illustrate the permutation test, let us start with a simple example: a dataset storing information about traffic in the Lincoln and Holland tunnels.

By calculating the average traffic volume per minute in each of the tunnels from the traffic data (a small sample is shown in Table 8.1), we observe that Lincoln tunnel traffic is higher than Holland tunnel traffic.
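As a minimal sketch of this calculation in R, the block below uses only the ten sample rows listed in Table 8.1, so it will not reproduce the averages of the full dataset. The name `traffic_sample` is an assumption made for this illustration. Notice that in such a tiny sample the difference can even point the other way, which is exactly the kind of random fluctuation this chapter is about.

```r
# Ten sample rows copied from Table 8.1 (NOT the full traffic dataset).
traffic_sample <- data.frame(
  TUNNEL = c("Lincoln", "Lincoln", "Lincoln", "Holland", "Lincoln",
             "Holland", "Lincoln", "Lincoln", "Lincoln", "Lincoln"),
  DAY = c("weekday", "weekend", "weekend", "weekend", "weekend",
          "weekday", "weekend", "weekday", "weekend", "weekend"),
  VOLUME_PER_MINUTE = c(55.0, 72.0, 59.0, 52.5, 59.0,
                        80.5, 74.0, 68.0, 56.0, 86.0)
)

# Average traffic volume per minute in each tunnel.
tunnel_means <- tapply(traffic_sample$VOLUME_PER_MINUTE,
                       traffic_sample$TUNNEL, mean)
print(tunnel_means)

# Difference of means, Lincoln minus Holland, for this small sample.
D_sample <- unname(tunnel_means["Lincoln"] - tunnel_means["Holland"])
print(D_sample)
```

Running the same two steps on the full TRAFFIC data frame would give the per-tunnel averages discussed in the text.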

We obtain 68.54 for the Lincoln tunnel and 67.71 for the Holland tunnel. This seems to indicate that Lincoln traffic is higher than Holland traffic. But is it? Or is it just a random deviation, so that if we took more measurements the trend might be reversed? This is where the permutation test comes in handy. First, let us talk about the null hypothesis and the alternative hypothesis.

The null hypothesis for the Lincoln-Holland tunnel observation is, not surprisingly, that there is no difference in traffic between the two tunnels.

The alternative hypothesis states that the Lincoln tunnel is busier than the Holland tunnel. Does the observed data (the observed traffic difference) provide us with enough evidence to reject the null hypothesis and, in fact, support the alternative hypothesis? To answer this question, we need to decide whether the observed result is reasonably likely to arise randomly under the condition that the null hypothesis holds.

How likely is it that the observed difference (D=0.83) arises randomly? The permutation test helps us estimate the chance that D=0.83 will come up randomly under the condition that traffic in the Holland and Lincoln tunnels is equal. We can measure and observe this by running a permutation test, that is, by randomly scrambling the traffic table.

The permutation test is run many times, typically 10,000 or even 100,000 times, and each permutation simulates a random process by simply reassigning the traffic volume values randomly between the tunnels. The numbers of traffic measurements for Holland and for Lincoln remain the same. Only the values differ: the existing values are scrambled, breaking any relationship between volume numbers and tunnel names. Think of each permutation as a roll of a die. How often will this random process produce a result at least as extreme as the D=0.83 we observed? The less often it happens, the more likely it is that what we observe is NOT random. Thus, if we get a result at least as extreme as ours only 3 times in 1000 rolls of the die (permutations), the estimated p-value is 0.003: if there were truly no difference between the tunnels, chance alone would produce such a result only about 0.3% of the time, which is strong evidence that our observed result is not random.
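The procedure described above can be sketched in a few lines of R. This is an illustration under stated assumptions, not the Permutation() function used in this textbook: the helper name `perm_test`, its arguments, and the synthetic data are all made up for the example.

```r
# Hypothetical helper (an assumption for this sketch): one-sided permutation
# test of whether mean(values) in group2 is larger than in group1.
perm_test <- function(values, labels, group1, group2, n_perm = 10000) {
  observed <- mean(values[labels == group2]) - mean(values[labels == group1])
  null_diffs <- replicate(n_perm, {
    shuffled <- sample(labels)  # scramble tunnel names; group sizes stay the same
    mean(values[shuffled == group2]) - mean(values[shuffled == group1])
  })
  # Share of permutations at least as extreme as the observed difference.
  list(observed = observed, p_value = mean(null_diffs >= observed))
}

set.seed(101)  # synthetic data, for illustration only
values <- c(rnorm(200, mean = 67, sd = 10), rnorm(200, mean = 70, sd = 10))
labels <- rep(c("Holland", "Lincoln"), each = 200)

result <- perm_test(values, labels, "Holland", "Lincoln", n_perm = 2000)
print(result)
```

Shuffling the labels while keeping the values in place is equivalent to reassigning the values between the tunnels: either way, the link between volume and tunnel name is broken.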

The permutation test provides an almost palpable experience with randomness. Just roll the die many times and see how often you get the observed result or more. If you get a D of 0.83 or more relatively often (more often than the so-called significance level, which is usually 5% of the time), then you cannot reject the null hypothesis. In other words, you cannot rule out that your observation appeared RANDOMLY. Otherwise, we can conclude that the observation was not random, and we reject the null hypothesis. There is no need to calculate standard deviations or means, and no need to know anything about the z-test or the t-test. These tests are the next step after you get a taste of randomness and understand how powerful randomness is. You would never guess!

The next step, which we illustrate here only through the z-test function, is to calculate the p-value using descriptive statistics and the so-called central limit theorem. This is much cheaper computationally. And the result is always the same, as opposed to the p-value computed by the permutation test. The latter changes slightly every time we run a permutation test, because each run draws a different random set of permutations: one cannot permute the data in all possible ways. Tests based on descriptive statistics, like the z-test, are asymptotic. There are strings attached to them (for example, the data set has to be large enough). And they hide randomness behind the armor of mathematics.
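To make the point about run-to-run variation concrete, here is a small sketch on synthetic data. The helper name `perm_p` and the group labels "A" and "B" are assumptions for this illustration. The first two lines show why exhaustive permutation is out of reach even for modest sample sizes.

```r
# Number of distinct ways to split the values into two equal groups.
print(choose(20, 10))    # 184756 ways for 20 values
print(choose(200, 100))  # about 9e58 ways for 200 values

set.seed(1)  # synthetic data, for illustration only
values <- c(rnorm(30, mean = 50, sd = 10), rnorm(30, mean = 54, sd = 10))
labels <- rep(c("A", "B"), each = 30)
observed <- mean(values[labels == "B"]) - mean(values[labels == "A"])

# Hypothetical helper: estimate the one-sided p-value from n_perm shuffles.
perm_p <- function(n_perm) {
  null_diffs <- replicate(n_perm, {
    shuffled <- sample(labels)
    mean(values[shuffled == "B"]) - mean(values[shuffled == "A"])
  })
  mean(null_diffs >= observed)
}

# Five runs of the same test on the same data: the estimates differ slightly.
p_runs <- replicate(5, perm_p(1000))
print(p_runs)
```

Increasing the number of permutations per run narrows the spread between runs, at the price of more computation.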

You can understand the data analyst's struggle with randomness purely through the permutation test. Here in this active textbook we discuss hypothesis testing only through the permutation test. We show you the z-test and, later, chi-square tests primarily as black-box functions, but we leave the know-how behind these tests to the statistics class.

The following synthetic data describes daily traffic on weekdays and weekend days in the Lincoln and Holland tunnels. The data frame has three attributes: TUNNEL, DAY and VOLUME_PER_MINUTE. Below we show a small sample of the TRAFFIC data frame.

Table 8.1: Snippet of Traffic Dataset
ROW   TUNNEL   DAY      VOLUME_PER_MINUTE
1574  Lincoln  weekday  55.0
2584  Lincoln  weekend  72.0
2538  Lincoln  weekend  59.0
1249  Holland  weekend  52.5
2531  Lincoln  weekend  59.0
761   Holland  weekday  80.5
2673  Lincoln  weekend  74.0
2006  Lincoln  weekday  68.0
2710  Lincoln  weekend  56.0
2688  Lincoln  weekend  86.0


Snippet 8.2 below shows the code for a hypothesis test of the difference of means.

Is the mean traffic (VOLUME_PER_MINUTE) in the Lincoln tunnel bigger than the mean traffic (VOLUME_PER_MINUTE) in the Holland tunnel?

8.2 Snippet 1: Permutation test

Run this in RStudio, since we cannot install our package in the DataCamp service we are using to run the code snippets.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IlBlcm11dGF0aW9uIDwtIGZ1bmN0aW9uKGRmMSxjMSxjMixuLHcxLHcyKXtcbiAgZGYgPC0gYXMuZGF0YS5mcmFtZShkZjEpXG4gIERfbnVsbDwtYygpXG4gIFYxPC1kZlssYzFdXG4gIFYyPC1kZlssYzJdXG4gIHN1Yi52YWx1ZTEgPC0gZGZbZGZbLCBjMV0gPT0gdzEsIGMyXVxuICBzdWIudmFsdWUyIDwtIGRmW2RmWywgYzFdID09IHcyLCBjMl1cbiAgRCA8LSAgYWJzKG1lYW4oc3ViLnZhbHVlMiwgbmEucm09VFJVRSkgLSBtZWFuKHN1Yi52YWx1ZTEsIG5hLnJtPVRSVUUpKVxuICBtPWxlbmd0aChWMSlcbiAgbD1sZW5ndGgoVjFbVjE9PXcyXSlcbiAgZm9yKGpqIGluIDE6bil7XG4gICAgbnVsbCA8LSByZXAodzEsbGVuZ3RoKFYxKSlcbiAgICBudWxsW3NhbXBsZShtLGwpXSA8LSB3MlxuICAgIG5mIDwtIGRhdGEuZnJhbWUoS2V5PW51bGwsIFZhbHVlPVYyKVxuICAgIG5hbWVzKG5mKSA8LSBjKFwiS2V5XCIsXCJWYWx1ZVwiKVxuICAgIHcxX251bGwgPC0gbmZbbmYkS2V5ID09IHcxLDJdXG4gICAgdzJfbnVsbCA8LSBuZltuZiRLZXkgPT0gdzIsMl1cbiAgICBEX251bGwgPC0gYyhEX251bGwsbWVhbih3Ml9udWxsLCBuYS5ybT1UUlVFKSAtIG1lYW4odzFfbnVsbCwgbmEucm09VFJVRSkpXG4gIH1cbiAgbXloaXN0PC1oaXN0KERfbnVsbCwgcHJvYj1UUlVFKVxuICBtdWx0aXBsaWVyIDwtIG15aGlzdCRjb3VudHMgLyBteWhpc3QkZGVuc2l0eVxuICBteWRlbnNpdHkgPC0gZGVuc2l0eShEX251bGwsIGFkanVzdD0yKVxuICBteWRlbnNpdHkkeSA8LSBteWRlbnNpdHkkeSAqIG11bHRpcGxpZXJbMV1cbiAgcGxvdChteWhpc3QpXG4gIGxpbmVzKG15ZGVuc2l0eSwgY29sPSdibHVlJylcbiAgYWJsaW5lKHY9RCwgY29sPSdyZWQnKVxuICBNPC1tZWFuKERfbnVsbD5EKVxuICByZXR1cm4oTSlcbn0iLCJzYW1wbGUiOiIjaW5zdGFsbC5wYWNrYWdlcyhcImRldnRvb2xzXCIpXG4jZGV2dG9vbHM6Omluc3RhbGxfZ2l0aHViKFwiZGV2YW5zaGFnci9QZXJtdXRhdGlvblRlc3RTZWNvbmRcIilcblxuI1Blcm11dGF0aW9uVGVzdFNlY29uZDo6UGVybXV0YXRpb24oZCwgXCJDYXRcIiwgXCJWYWxcIiwxMDAwMCwgXCJHcm91cEFcIiwgXCJHcm91cEJcIilcbnRyYWZmaWM8LXJlYWQuY3N2KFwiaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvVHJhZmZpYzIwMjIuY3N2XCIpXG5QZXJtdXRhdGlvbih0cmFmZmljLCBcIlRVTk5FTFwiLCBcIlZPTFVNRV9QRVJfTUlOVVRFXCIsMTAwMCxcIkhvbGxhbmRcIiwgXCJMaW5jb2xuXCIpXG4gXG4jVGhlIFBlcm11dGF0aW9uIGZ1bmN0aW9uIHJldHVybnMgdGhlIGFic29sdXRlIHZhbHVlIG9mIHRoZSBkaWZmZXJlbmNlLiBTbyB0aGUgcmVkIGxpbmUgaXMgdGhlIGFic29sdXRlIHZhbHVlIG9mIHRoZSBvYnNlcnZlZCBkaWZmZXJlbmNlLiBZb3Ugd2lsbCBz
ZWUgYSBoaXN0b2dyYW0gaGF2aW5nIGEgbm9ybWFsIGRpc3RyaWJ1dGlvbiB3aXRoIGEgcmVkIHNob3dpbmcgdGhlIG9ic2VydmVkIGRpZmZlcmVuY2UuIn0=

8.3 Snippet 2: z-test

Null Hypothesis - Traffic in the Holland tunnel is the same as traffic in the Lincoln tunnel.

Alternative Hypothesis - Traffic in the Lincoln tunnel is larger than traffic in the Holland tunnel.

In snippet 8.3 we end up calculating a p-value that leads to rejection of the null hypothesis (good news for the data scientist, bad news for the sceptic). Indeed, the p-value is less than the significance level of 5%. This means that, under the null hypothesis, it is very unlikely (less than a 5% chance) to see a result at least as big as the observed difference of means.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6InpfdGVzdCA8LSBmdW5jdGlvbihkYXRhLGNvbDEsY29sMixzdWIxLHN1YjIpIHtcbiAgZGF0YSA8LSBhcy5kYXRhLmZyYW1lKGRhdGEpXG4gIFYxPC1kYXRhWyxjb2wxXVxuICBWMjwtZGF0YVssY29sMl1cbiAgI2RhdGEgY2xlYW4gYW5kIHN1YnNldCwgZWl0aGVyXG4gIGxpbmNvbG4uZGF0YSA8LSBzdWJzZXQoZGF0YSwgVjEgPT0gc3ViMSlcbiAgaG9sbGFuZC5kYXRhIDwtIHN1YnNldChkYXRhLCBWMSA9PSBzdWIyKVxuICBcbiAgI3RyYWZmaWMgYXQgbGluY29sblxuICBsaW5jb2xuLnRyYWZmaWMgPC0gbGluY29sbi5kYXRhWyxjb2wyXVxuICAjdHJhZmZpYyBhdCBob2xsYW5kXG4gIGhvbGxhbmQudHJhZmZpYyA8LSBob2xsYW5kLmRhdGFbLGNvbDJdXG4gIFxuICAjIHN0YW5kYXJkIGRldmlhdGlvbiBvZiB0d28gc2FtcGxlcy5cbiAgc2QubGluY29sbiA8LSBzZChsaW5jb2xuLnRyYWZmaWMpXG4gIHNkLmhvbGxhbmQgPC0gc2QoaG9sbGFuZC50cmFmZmljKVxuICBcbiAgI2xlbmd0aCBvZiBsaW5jb2xuIGFuZCBob2xsYW5kXG4gIGxlbl9saW5jb2xuIDwtIGxlbmd0aChsaW5jb2xuLnRyYWZmaWMpXG4gIGxlbl9ob2xsYW5kIDwtIGxlbmd0aChob2xsYW5kLnRyYWZmaWMpXG4gIGxlbl9saW5jb2xuXG4gIGxlbl9ob2xsYW5kXG4gIFxuICAjc3RhbmRhcmQgZGV2aWF0aW9uIG9mIGRpZmZlcmVuY2UgdHJhZmZpY1xuICBzZC5saW4uaG9sIDwtIHNxcnQoc2QubGluY29sbl4yL2xlbl9saW5jb2xuICsgc2QuaG9sbGFuZF4yL2xlbl9ob2xsYW5kKVxuICBzZC5saW4uaG9sXG4gIFxuICAjbWVhbnMgb2YgdHdvIHNhbXBsZXNcbiAgbWVhbi5saW5jb2xuIDwtIG1lYW4obGluY29sbi50cmFmZmljKVxuICBtZWFuLmhvbGxhbmQgPC0gbWVhbihob2xsYW5kLnRyYWZmaWMpXG4gIG1lYW4ubGluY29sblxuICBtZWFuLmhvbGxhbmRcbiAgXG4gICN6IHNjb3JlXG4gIHpldGEgPC0gKG1lYW4ubGluY29sbiAtIG1lYW4uaG9sbGFuZCkvc2QubGluLmhvbFxuICBwcmludChwYXN0ZSh6ZXRhLFwiIGlzIHRoZSB6LXZhbHVlXCIpKVxuICBcbiAgI3Bsb3QgcmVkIGxpbmVcbiAgcGxvdCh4PXNlcShmcm9tID0gLTUsIHRvPSA1LCBieT0wLjEpLHk9ZG5vcm0oc2VxKGZyb20gPSAtNSwgdG89IDUsICBieT0wLjEpLG1lYW49MCksdHlwZT0nbCcseGxhYiA9ICdtZWFuIGRpZmZlcmVuY2UnLCAgeWxhYj0ncG9zc2liaWxpdHknKVxuICBhYmxpbmUodj16ZXRhLCBjb2w9J3JlZCcpXG4gIFxuICAjZ2V0IHBcbiAgcCA9IDEtcG5vcm0oemV0YSlcbiAgcHJpbnQocGFzdGUocCwgXCIgaXMgdGhlIHAtdmFsdWVcIikpXG59Iiwic2FtcGxlIjoiVFJBRkZJQzwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L1RyYWZmaWMyMDIyLmNzdicpXG5cbnpfdGVzdChUUkFGRklDLFwiVFVOTkVMXCIsIFwi
Vk9MVU1FX1BFUl9NSU5VVEVcIixcIkxpbmNvbG5cIiwgXCJIb2xsYW5kXCIpIn0=
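For readers who want to peek inside the black box, a one-sided two-sample z-test can be sketched as follows. This is a simplified illustration on synthetic data; the helper name `z_test_sketch` and the variables `lincoln` and `holland` are assumptions made for the example, and the textbook's own z_test() function may differ in detail.

```r
# Hypothetical helper (an assumption for this sketch): one-sided z-test of
# whether mean(x) is larger than mean(y).
z_test_sketch <- function(x, y) {
  # Standard error of the difference of the two sample means.
  se <- sqrt(var(x) / length(x) + var(y) / length(y))
  z <- (mean(x) - mean(y)) / se
  p <- 1 - pnorm(z)  # upper-tail probability of the standard normal
  list(z = z, p_value = p)
}

set.seed(8)  # synthetic data, for illustration only
lincoln <- rnorm(500, mean = 69, sd = 10)
holland <- rnorm(500, mean = 67, sd = 10)

res <- z_test_sketch(lincoln, holland)
print(res)

# Compare the p-value with the 5% significance level.
alpha <- 0.05
decision <- if (res$p_value < alpha) "reject the null hypothesis" else "cannot reject the null hypothesis"
print(decision)
```

Unlike the permutation test, nothing here is random once the data is fixed, so repeated calls on the same data return the same p-value.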

8.4 Snippet 3: Make your own data and see how p-value changes

For students familiar with basic descriptive statistics (mean, standard deviation), we build a synthetic data set ourselves and see how the difference of means and the difference of standard deviations affect the p-value. We will build our two distributions ourselves, varying the means and standard deviations. We will use rnorm() to generate normal distributions with given means and standard deviations. Then we will use a permutation test (a z-test works as well) to test the difference of means for these two synthetic distributions. See for yourself the impact that means and standard deviations have on p-values.

Build the data frame with two attributes, Cat and Val, using the rnorm() function. Our null hypothesis is that Group A and Group B have identical mean(Val).

The alternative hypothesis is that mean(Val) for Group B is higher than mean(Val) for Group A. We will change the mean and standard deviation of the data distributions for Group A and Group B and see how these changes affect the p-value. We will first use a permutation test and a single-step permutation test (just to illustrate what happens at each single step when we run a permutation test). Then we finish off with the z-test.
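A data frame of this shape can be built along the following lines. The helper name `make_groups` and its parameters are assumptions introduced for this sketch; the means and standard deviations are arbitrary starting values that you are meant to change.

```r
# Hypothetical helper (an assumption for this sketch): n values per group,
# drawn from normal distributions with the given means and standard deviations.
make_groups <- function(n, mean_a, sd_a, mean_b, sd_b) {
  Val <- c(rnorm(n, mean = mean_a, sd = sd_a),
           rnorm(n, mean = mean_b, sd = sd_b))
  Cat <- rep(c("GroupA", "GroupB"), each = n)
  data.frame(Cat, Val)
}

set.seed(42)  # fix the random draws so the example is repeatable
d <- make_groups(n = 10, mean_a = 25, sd_a = 10, mean_b = 30, sd_b = 10)
print(head(d))

# Observed difference of means, GroupB minus GroupA.
observed_difference <- mean(d$Val[d$Cat == "GroupB"]) - mean(d$Val[d$Cat == "GroupA"])
print(observed_difference)
```

With only 10 values per group and a standard deviation of 10, the observed difference can land far from the true difference of 5, and it may even come out negative for some seeds.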

8.4.1 Permutation test

Exercise - How the p-value is affected by the difference of means and standard deviations

Build the data frame with two attributes, Cat and Val, using the rnorm() function.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6IlBlcm11dGF0aW9uIDwtIGZ1bmN0aW9uKGRmMSxjMSxjMixuLHcxLHcyKXtcbiAgZGYgPC0gYXMuZGF0YS5mcmFtZShkZjEpXG4gIERfbnVsbDwtYygpXG4gIFYxPC1kZlssYzFdXG4gIFYyPC1kZlssYzJdXG4gIHN1Yi52YWx1ZTEgPC0gZGZbZGZbLCBjMV0gPT0gdzEsIGMyXVxuICBzdWIudmFsdWUyIDwtIGRmW2RmWywgYzFdID09IHcyLCBjMl1cbiAgRCA8LSAgYWJzKG1lYW4oc3ViLnZhbHVlMiwgbmEucm09VFJVRSkgLSBtZWFuKHN1Yi52YWx1ZTEsIG5hLnJtPVRSVUUpKVxuICBtPWxlbmd0aChWMSlcbiAgbD1sZW5ndGgoVjFbVjE9PXcyXSlcbiAgZm9yKGpqIGluIDE6bil7XG4gICAgbnVsbCA8LSByZXAodzEsbGVuZ3RoKFYxKSlcbiAgICBudWxsW3NhbXBsZShtLGwpXSA8LSB3MlxuICAgIG5mIDwtIGRhdGEuZnJhbWUoS2V5PW51bGwsIFZhbHVlPVYyKVxuICAgIG5hbWVzKG5mKSA8LSBjKFwiS2V5XCIsXCJWYWx1ZVwiKVxuICAgIHcxX251bGwgPC0gbmZbbmYkS2V5ID09IHcxLDJdXG4gICAgdzJfbnVsbCA8LSBuZltuZiRLZXkgPT0gdzIsMl1cbiAgICBEX251bGwgPC0gYyhEX251bGwsbWVhbih3Ml9udWxsLCBuYS5ybT1UUlVFKSAtIG1lYW4odzFfbnVsbCwgbmEucm09VFJVRSkpXG4gIH1cbiAgbXloaXN0PC1oaXN0KERfbnVsbCwgcHJvYj1UUlVFKVxuICBtdWx0aXBsaWVyIDwtIG15aGlzdCRjb3VudHMgLyBteWhpc3QkZGVuc2l0eVxuICBteWRlbnNpdHkgPC0gZGVuc2l0eShEX251bGwsIGFkanVzdD0yKVxuICBteWRlbnNpdHkkeSA8LSBteWRlbnNpdHkkeSAqIG11bHRpcGxpZXJbMV1cbiAgcGxvdChteWhpc3QpXG4gIGxpbmVzKG15ZGVuc2l0eSwgY29sPSdibHVlJylcbiAgYWJsaW5lKHY9RCwgY29sPSdyZWQnKVxuICBNPC1tZWFuKERfbnVsbD5EKVxuICByZXR1cm4oTSlcbn0iLCJzYW1wbGUiOiJWYWwxPC1ybm9ybSgxMCxtZWFuPTI1LCBzZD0xMClcblZhbDI8LXJub3JtKDEwLG1lYW49MzAsIHNkPTEwKVxuIFxuQ2F0MTwtcmVwKFwiR3JvdXBBXCIsMTApICAjIGZvciBleGFtcGxlIEdyb3VwQSBjYW4gYmUgSG9sbGFuZCBUdW5uZWxcbkNhdDI8LXJlcChcIkdyb3VwQlwiLDEwKSAgIyBmb3IgZXhhbXBsZSBHcm91cCBCIHdpbGwgYmUgTGluY29sbiBUdW5uZWxcblxuQ2F0MVxuQ2F0MlxuXG4jVGhlIHJlcCBjb21tYW5kIHdpbGwgcmVwZWF0LCB0aGUgdmFyaWFibGVzIHdpbGwgYmUgb2YgdHlwZSBjaGFyYWN0ZXIgYW5kIHdpbGwgY29udGFpbiAxMCB2YWx1ZXMgZWFjaC5cblxuQ2F0PC1jKENhdDEsQ2F0MikgIyBBIHZhcmlhYmxlIHdpdGggZmlyc3QgMTAgdmFsdWVzIEdyb3VwQSBhbmQgbmV4dCAxMCB2YWx1ZXMgR3JvdXBCXG5DYXRcblxuVmFsPC1jKFZhbDEsVmFsMilcblZhbFxuXG5kPC1kYXRhLmZyYW1lKENhdCxWYWwpXG5kXG5cblBlcm11dGF0aW9uKGQsIFwiQ2F0XCIsIFwiVmFsXCIsMTAwMCxcIkdyb3VwQVwiLCBcIkdyb3VwQlwiKVxuXG5P
YnNlcnZlZF9EaWZmZXJlbmNlPC1tZWFuKGRbZCRDYXQ9PSdHcm91cEInLDJdKS1tZWFuKGRbZCRDYXQ9PSdHcm91cEEnLDJdKVxuT2JzZXJ2ZWRfRGlmZmVyZW5jZVxuXG4jVGhpcyB3aWxsIGNhbGN1bGF0ZSB0aGUgbWVhbiBvZiB0aGUgc2Vjb25kIGNvbHVtbiAoaGF2aW5nIDEwIHJhbmRvbSB2YWx1ZXMgZm9yIGVhY2ggZ3JvdXApLCBhbmQgdGhlIG1lYW4gb2YgZ3JvdXBCIHZhbHVlcyBpcyBzdWJ0cmFjdGVkIGZyb20gdGhlIG1lYW4gb2YgZ3JvdXBBIHZhbHVlcywgd2hpY2ggd2lsbCBnaXZlIHlvdSB0aGUgdmFsdWUgb2YgdGhlIGRpZmZlcmVuY2Ugb2YgdGhlIG1lYW4uXG4gXG4gI1RyeSBjaGFuZ2luZyBtZWFuIGFuZCBzZCB2YWx1ZXMuIFdoZW4geW91IHJ1biB0aGlzIHlvdSB3aWxsIHNlZSB0aGF0IHRoZSBkaWZmZXJlbmNlIGlzIHNvbWV0aW1lcyBuZWdhdGl2ZSAjb3Igc29tZXRpbWVzIHBvc2l0aXZlLiJ9

8.4.2 One permutation at a time

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJ0cmFmZmljPC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvVHJhZmZpYzIwMjIuY3N2JylcblxucmFuTnVtIDwtIHNhbXBsZSgxOm5yb3codHJhZmZpYyksbnJvdyh0cmFmZmljKSlcbnJhbk51bVsxOjVdXG5cblZPTFVNRV9QRVJfTUlOVVRFPC10cmFmZmljJFZPTFVNRV9QRVJfTUlOVVRFW3Jhbk51bV1cblRVTk5FTDwtdHJhZmZpYyRUVU5ORUxcblxuUGVybXV0ZWRfdHJhZmZpYzwtZGF0YS5mcmFtZShUVU5ORUwsIFZPTFVNRV9QRVJfTUlOVVRFKVxuXG5tZWFuKHRyYWZmaWNbdHJhZmZpYyRUVU5ORUw9PSdMaW5jb2xuJywgXSRWT0xVTUVfUEVSX01JTlVURSkgLW1lYW4odHJhZmZpY1t0cmFmZmljJFRVTk5FTD09J0hvbGxhbmQnLCBdJFZPTFVNRV9QRVJfTUlOVVRFKVxuXG5tZWFuKFBlcm11dGVkX3RyYWZmaWNbUGVybXV0ZWRfdHJhZmZpYyRUVU5ORUw9PSdMaW5jb2xuJywgXSRWT0xVTUVfUEVSX01JTlVURSktbWVhbihQZXJtdXRlZF90cmFmZmljW1Blcm11dGVkX3RyYWZmaWMkVFVOTkVMPT0nSG9sbGFuZCcsIF0kVk9MVU1FX1BFUl9NSU5VVEUpIn0=

8.4.3 z-test

How the p-value is affected by the difference of means and standard deviations.

We will build two distributions ourselves, varying the means and standard deviations. We will use rnorm() to generate normal distributions with given means and standard deviations. Then we will use the z-test to test the difference of means for these two synthetic distributions. See for yourself the impact that means and standard deviations have on p-values. You can do this by changing the values of the mean and standard deviation in the rnorm() function.

Clearly, the further apart the mean values are, the lower the p-value. But how do standard deviations affect the p-value? See for yourself.
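One way to explore this question without random noise is to plug chosen numbers directly into the z-test formula. The sketch below is idealized: it treats the difference of means `delta` and the common standard deviation `s` as if they were the exact sample statistics of two groups of size `n`. The helper name `p_for` and all parameter values are assumptions for this illustration.

```r
# Hypothetical helper: one-sided z-test p-value for two groups of equal size n
# with a common standard deviation s and a difference of means delta.
p_for <- function(delta, s, n) {
  z <- delta / sqrt(2 * s^2 / n)  # standard error is sqrt(s^2/n + s^2/n)
  1 - pnorm(z)
}

# Keep the difference of means fixed and vary only the standard deviation.
sds <- c(5, 10, 20, 40)
p_values <- sapply(sds, function(s) p_for(delta = 10, s = s, n = 10))
print(data.frame(sd = sds, p_value = signif(p_values, 3)))
```

With real rnorm() draws the p-values will wobble around this pattern from run to run, which is worth comparing against the idealized numbers.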

Build the data frame with two attributes, Cat and Val, using the rnorm() function.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6InpfdGVzdCA8LSBmdW5jdGlvbihkYXRhLGNvbDEsY29sMixzdWIxLHN1YjIpIHtcbiAgZGF0YSA8LSBhcy5kYXRhLmZyYW1lKGRhdGEpXG4gIFYxPC1kYXRhWyxjb2wxXVxuICBWMjwtZGF0YVssY29sMl1cbiAgI2RhdGEgY2xlYW4gYW5kIHN1YnNldCwgZWl0aGVyXG4gIGxpbmNvbG4uZGF0YSA8LSBzdWJzZXQoZGF0YSwgVjEgPT0gc3ViMSlcbiAgaG9sbGFuZC5kYXRhIDwtIHN1YnNldChkYXRhLCBWMSA9PSBzdWIyKVxuICBcbiAgI3RyYWZmaWMgYXQgbGluY29sblxuICBsaW5jb2xuLnRyYWZmaWMgPC0gbGluY29sbi5kYXRhWyxjb2wyXVxuICAjdHJhZmZpYyBhdCBob2xsYW5kXG4gIGhvbGxhbmQudHJhZmZpYyA8LSBob2xsYW5kLmRhdGFbLGNvbDJdXG4gIFxuICAjIHN0YW5kYXJkIGRldmlhdGlvbiBvZiB0d28gc2FtcGxlcy5cbiAgc2QubGluY29sbiA8LSBzZChsaW5jb2xuLnRyYWZmaWMpXG4gIHNkLmhvbGxhbmQgPC0gc2QoaG9sbGFuZC50cmFmZmljKVxuICBcbiAgI2xlbmd0aCBvZiBsaW5jb2xuIGFuZCBob2xsYW5kXG4gIGxlbl9saW5jb2xuIDwtIGxlbmd0aChsaW5jb2xuLnRyYWZmaWMpXG4gIGxlbl9ob2xsYW5kIDwtIGxlbmd0aChob2xsYW5kLnRyYWZmaWMpXG4gIGxlbl9saW5jb2xuXG4gIGxlbl9ob2xsYW5kXG4gIFxuICAjc3RhbmRhcmQgZGV2aWF0aW9uIG9mIGRpZmZlcmVuY2UgdHJhZmZpY1xuICBzZC5saW4uaG9sIDwtIHNxcnQoc2QubGluY29sbl4yL2xlbl9saW5jb2xuICsgc2QuaG9sbGFuZF4yL2xlbl9ob2xsYW5kKVxuICBzZC5saW4uaG9sXG4gIFxuICAjbWVhbnMgb2YgdHdvIHNhbXBsZXNcbiAgbWVhbi5saW5jb2xuIDwtIG1lYW4obGluY29sbi50cmFmZmljKVxuICBtZWFuLmhvbGxhbmQgPC0gbWVhbihob2xsYW5kLnRyYWZmaWMpXG4gIG1lYW4ubGluY29sblxuICBtZWFuLmhvbGxhbmRcbiAgXG4gICN6IHNjb3JlXG4gIHpldGEgPC0gKG1lYW4ubGluY29sbiAtIG1lYW4uaG9sbGFuZCkvc2QubGluLmhvbFxuICBwcmludChwYXN0ZSh6ZXRhLFwiIGlzIHRoZSB6LXZhbHVlXCIpKVxuICBcbiAgI3Bsb3QgcmVkIGxpbmVcbiAgcGxvdCh4PXNlcShmcm9tID0gLTUsIHRvPSA1LCBieT0wLjEpLHk9ZG5vcm0oc2VxKGZyb20gPSAtNSwgdG89IDUsICBieT0wLjEpLG1lYW49MCksdHlwZT0nbCcseGxhYiA9ICdtZWFuIGRpZmZlcmVuY2UnLCAgeWxhYj0ncG9zc2liaWxpdHknKVxuICBhYmxpbmUodj16ZXRhLCBjb2w9J3JlZCcpXG4gIFxuICAjZ2V0IHBcbiAgcCA9IDEtcG5vcm0oemV0YSlcbiAgcHJpbnQocGFzdGUocCwgXCIgaXMgdGhlIHAtdmFsdWVcIikpXG59Iiwic2FtcGxlIjoiVmFsMTwtcm5vcm0oMTAsbWVhbj0yNSwgc2Q9MTApXG5WYWwyPC1ybm9ybSgxMCxtZWFuPTM1LCBzZD0xMClcbkNhdDE8LXJlcChcIkdyb3VwQVwiLDEwKSAgXG5DYXQyPC1yZXAoXCJHcm91cEJcIiwxMCkgIFxuQ2F0PC1jKENhdDEsQ2F0MikgXG5WYWw8LWMo
VmFsMSxWYWwyKVxuXG5kPC1kYXRhLmZyYW1lKENhdCxWYWwpXG5PYnNlcnZlZF9EaWZmZXJlbmNlPC1tZWFuKGRbZCRDYXQ9PSdHcm91cEInLDJdKS1tZWFuKGRbZCRDYXQ9PSdHcm91cEEnLDJdKVxuT2JzZXJ2ZWRfRGlmZmVyZW5jZVxuXG56X3Rlc3QoZCxcIkNhdFwiLCBcIlZhbFwiLFwiR3JvdXBCXCIsIFwiR3JvdXBBXCIpIn0=