Section: 4 🔖 Data puzzles secrets

  • Lecture slides:

4.1 Moody Data Puzzle

Table 4.1: Snippet of Moody Dataset
SCORE GRADE DOZES_OFF TEXTING_IN_CLASS PARTICIPATION
21.33 F     never     never            0.29
71.57 C     always    rarely           0.11
90.11 A     always    never            0.26
31.52 D     sometimes rarely           0.03
95.94 A     always    rarely           0.21


The Moody Data Puzzle is our first example of a data puzzle. By data puzzles we mean synthetically generated data sets that have some embedded patterns. Your goal is to find the embedded pattern(s). You may also find patterns that are similar to, or implied by, the patterns embedded in the data puzzle. This is fine too. The goal of data puzzles is to excite you about exploratory data analysis. In many ways it is like a game.

Puzzle description:

Professor Moody has been teaching a Statistics 101 class for many years. His teaching evaluations went considerably south, with the chief complaint being that he DOES NOT seem to assign grades fairly. Students compared their scores among themselves and found quite a few discrepancies! But their complaints went nowhere, since the professor promptly disappeared after posting the final grades and scores.

A brave new TA managed to get hold of Professor Moody’s carefully maintained grading table (spanning multiple years) by... messing a bit with Moody’s computer. Well, let’s not explain the details, because he would get in trouble. What he found was a remarkably structured account of how Professor Moody assigns his grades.

It looks like Professor Moody is in fact very alert in class. He is aware of what students do, detecting texting during class and remembering exactly who dozed off. He also keeps a mysterious “participation index”, which is a numerical score from 0 to 1. This is probably related to questions asked and answered by students as well as their general attentiveness in class. Remarkable but a little creepy, isn’t it?

What is the best advice the new TA can give future students on how to get a good grade in Professor Moody’s class? What factors influence the grade besides the score? Back your recommendation up with plots and evidence from the attached data.

What are examples of patterns we are looking for here?

Here are some:

  • “Students who text a lot have a lower chance of getting an A in the class”
  • “Students whose participation is lower than 0.25 fail the class more often”
  • “Dozing off does not matter if your score is more than 90; you still get an A”
  • “If your score is less than 30, you fail the class regardless of what your other attributes are”
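Candidate patterns of this kind can be checked by tabulating grades within each level of an attribute. The following is a minimal Python sketch of that idea, using only the five rows shown in Table 4.1. The helper name `grade_share` is made up for illustration, and five rows are far too few to support any conclusion; the same tabulation would have to be run on the full dataset.

```python
# The five rows shown in Table 4.1 (a snippet, not the full dataset).
rows = [
    {"SCORE": 21.33, "GRADE": "F", "DOZES_OFF": "never",
     "TEXTING_IN_CLASS": "never", "PARTICIPATION": 0.29},
    {"SCORE": 71.57, "GRADE": "C", "DOZES_OFF": "always",
     "TEXTING_IN_CLASS": "rarely", "PARTICIPATION": 0.11},
    {"SCORE": 90.11, "GRADE": "A", "DOZES_OFF": "always",
     "TEXTING_IN_CLASS": "never", "PARTICIPATION": 0.26},
    {"SCORE": 31.52, "GRADE": "D", "DOZES_OFF": "sometimes",
     "TEXTING_IN_CLASS": "rarely", "PARTICIPATION": 0.03},
    {"SCORE": 95.94, "GRADE": "A", "DOZES_OFF": "always",
     "TEXTING_IN_CLASS": "rarely", "PARTICIPATION": 0.21},
]


def grade_share(rows, attribute, grade):
    """Hypothetical helper: share of rows with `grade` per level of `attribute`."""
    totals, hits = {}, {}
    for row in rows:
        level = row[attribute]
        totals[level] = totals.get(level, 0) + 1
        hits[level] = hits.get(level, 0) + (row["GRADE"] == grade)
    return {level: hits[level] / totals[level] for level in totals}


# "Students who text a lot have a lower chance of getting an A":
# compare the share of A grades across texting levels.
share_a_by_texting = grade_share(rows, "TEXTING_IN_CLASS", "A")
print(share_a_by_texting)

# "If your score is less than 30, you fail": check every low-score row.
all_low_scores_fail = all(r["GRADE"] == "F" for r in rows if r["SCORE"] < 30)
print(all_low_scores_fail)
```

A pattern is only interesting if the shares differ clearly between levels once enough rows are included.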

4.1.1 Secrets Revealed- Patterns in Professor Moody’s data?


Many student solutions falsely attribute higher grades to higher values of the participation attribute.

This is a classic example of a hidden variable described in the reference attached below.

The truth is that the participation value affects the score: generally, the higher the participation in Moody’s class, the higher the score. But it is the score that has a direct impact on the grade. Thus, the score is the real “hidden variable” impacting the final grade.

Thus, the score already reflects participation. Besides the score, Professor Moody seems to have looked only at the texting and dozing-off attributes when determining grades (see the PowerPoint slides above for the explanation).
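The effect described here can be reproduced with a small simulation. The sketch below is a toy model under stated assumptions, not Professor Moody's actual grading rule: participation raises the score, and passing depends on the score alone (texting and dozing off are ignored). All constants, the pass threshold of 60, and the helper `pass_rate` are made up for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed toy model: participation influences the score,
# and the pass/fail outcome depends on the score only.
students = []
for _ in range(5000):
    participation = random.random()                         # index in [0, 1)
    score = 40 + 50 * participation + random.gauss(0, 10)   # made-up relation
    passed = score >= 60                                     # made-up threshold
    students.append((participation, score, passed))


def pass_rate(group):
    """Share of students in `group` who passed."""
    return sum(passed for _, _, passed in group) / len(group)


# Naive comparison: participation appears to drive the outcome.
high = [s for s in students if s[0] >= 0.5]
low = [s for s in students if s[0] < 0.5]
print("all students, high vs low participation:",
      round(pass_rate(high), 2), round(pass_rate(low), 2))

# Hold the score roughly fixed (a band from 60 to 70) and compare again:
# participation no longer makes any difference.
band = [s for s in students if 60 <= s[1] < 70]
band_high = [s for s in band if s[0] >= 0.5]
band_low = [s for s in band if s[0] < 0.5]
print("score 60-70, high vs low participation:",
      pass_rate(band_high), pass_rate(band_low))
```

In the first comparison the pass rate differs between the two participation groups; once the score is held fixed, the difference disappears, which is what one expects when participation acts only through the score.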

Compare with examples of hidden variables in the following reference about correlation and causation.

https://www.stewartmath.com/precalc_7e_dp/precalc_7e_dp6.html

4.1.2 Best Student’s Submissions 2022

4.2 Movies Data Hunt

Table 4.2: Snippet of Movies Dataset
      country content imdb_score Gross  Budget genre
5111  France  PG-13   6.37       Medium Medium Action
8361  USA     R       6.78       Low    Medium Comedy
3111  USA     R       2.46       Low    High   Drama
7851  USA     PG      7.62       Low    Low    History
12481 France  R       6.70       Low    Low    History


Puzzle description:

The dataset contains IMDb scores of 12,800+ movies along with several attributes, including budget, gross, genre, and content rating.

What are the most promising alternative hypotheses about imdb scores to test? Name your top three candidates along with the evidence that backs them up, in the form of either R instruction(s) or a plot.
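One simple way to look for candidate hypotheses is to compare the mean imdb_score across the levels of each attribute and keep the attributes with the largest gaps. The puzzle itself asks for R; the Python sketch below only illustrates the kind of summary involved, using the five rows of Table 4.2 (row labels omitted). The helper name `mean_score_by` is made up for illustration, and five rows are not evidence for any hypothesis.

```python
from collections import defaultdict

# The five rows shown in Table 4.2 (a snippet, not the full dataset).
movies = [
    {"country": "France", "content": "PG-13", "imdb_score": 6.37,
     "Gross": "Medium", "Budget": "Medium", "genre": "Action"},
    {"country": "USA", "content": "R", "imdb_score": 6.78,
     "Gross": "Low", "Budget": "Medium", "genre": "Comedy"},
    {"country": "USA", "content": "R", "imdb_score": 2.46,
     "Gross": "Low", "Budget": "High", "genre": "Drama"},
    {"country": "USA", "content": "PG", "imdb_score": 7.62,
     "Gross": "Low", "Budget": "Low", "genre": "History"},
    {"country": "France", "content": "R", "imdb_score": 6.70,
     "Gross": "Low", "Budget": "Low", "genre": "History"},
]


def mean_score_by(movies, attribute):
    """Hypothetical helper: mean imdb_score for each level of `attribute`."""
    scores = defaultdict(list)
    for movie in movies:
        scores[movie[attribute]].append(movie["imdb_score"])
    return {level: sum(vals) / len(vals) for level, vals in scores.items()}


# Compare group means for each categorical attribute.
for attribute in ("country", "content", "Gross", "Budget", "genre"):
    print(attribute, mean_score_by(movies, attribute))

by_budget = mean_score_by(movies, "Budget")
by_country = mean_score_by(movies, "country")
```

An attribute whose group means differ a lot on the full dataset suggests an alternative hypothesis worth testing, for example that movies in one budget class score higher than movies in another.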

4.2.1 Secrets Revealed- Patterns in Movies data?

4.2.2 Best Student’s Submissions 2022

4.3 Minimarket Data Hunt

Table 4.3: Snippet of Minimarket Dataset
     BREAD BUTTER COOKIES COFFEE TEA
7972 0     1      1       1      1
3405 0     0      1       0      0
8179 1     0      0       0      1
316  0     0      1       1      0
3088 0     1      0       1      0


Puzzle description:

Each row of Minimarket.csv contains one customer transaction, represented as a binary vector (treat these as NUM values). 1 means that the customer bought an item; 0 means that the customer did not buy that item. For example, if a customer bought Bread but did not buy Butter, you will see 1 in the Bread column and 0 in the Butter column.

Summary

Here’s what you’d do:
1. Come up with a null hypothesis: “Bread does not impact the sales of butter”
2. Come up with an alternative hypothesis: “Bread impacts the sale of butter”
3. Compute the mean value of the Butter column for all the rows where Bread value = 0. Let’s say this is mean1.
4. Compute the mean value of the Butter column for all the rows where Bread value = 1. Let’s say this is mean2.
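Steps 3 and 4 can be sketched as follows. This is a minimal Python illustration on the five rows of Table 4.3 only (row labels omitted), not on the full Minimarket.csv; the helper name `mean_given` and the final `difference` are assumptions added for illustration. Because the columns hold only 0 and 1, each mean is simply the share of the selected transactions that include Butter.

```python
# The five rows shown in Table 4.3 (a snippet, not the full dataset).
transactions = [
    {"BREAD": 0, "BUTTER": 1, "COOKIES": 1, "COFFEE": 1, "TEA": 1},
    {"BREAD": 0, "BUTTER": 0, "COOKIES": 1, "COFFEE": 0, "TEA": 0},
    {"BREAD": 1, "BUTTER": 0, "COOKIES": 0, "COFFEE": 0, "TEA": 1},
    {"BREAD": 0, "BUTTER": 0, "COOKIES": 1, "COFFEE": 1, "TEA": 0},
    {"BREAD": 0, "BUTTER": 1, "COOKIES": 0, "COFFEE": 1, "TEA": 0},
]


def mean_given(transactions, target, condition, value):
    """Hypothetical helper: mean of `target` over rows where `condition` == `value`."""
    selected = [t[target] for t in transactions if t[condition] == value]
    return sum(selected) / len(selected)


# Step 3: mean of Butter where Bread = 0.
mean1 = mean_given(transactions, "BUTTER", "BREAD", 0)
# Step 4: mean of Butter where Bread = 1.
mean2 = mean_given(transactions, "BUTTER", "BREAD", 1)

# Assumed next quantity of interest: how far apart the two means are.
difference = mean2 - mean1
print(mean1, mean2, difference)
```

On the full dataset, a difference close to zero is what the null hypothesis predicts, while a large difference favours the alternative; whether a given difference is large enough still has to be judged with a proper test.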

4.3.1 What were the secret associations between items in the minimarket?

4.4 Predicting grades in Professor Moody’s class

4.4.1 How did I cook the Professor Moody Prediction challenge data?