Section: 4 Data puzzles secrets

4.1 Moody Data Puzzle

Table 4.1: Snippet of Moody Dataset
SCORE	GRADE	DOZES_OFF	TEXTING_IN_CLASS	PARTICIPATION
21.33	F	never	never	0.29
71.57	C	always	rarely	0.11
90.11	A	always	never	0.26
31.52	D	sometimes	rarely	0.03
95.94	A	always	rarely	0.21

The Moody Data Puzzle is our first example of a data puzzle. By data puzzles we mean synthetically generated data sets that contain embedded patterns. Your goal is to find the embedded pattern(s). You may also find patterns that are similar to, or implied by, the embedded ones. This is fine too. The goal of data puzzles is to get you excited about exploratory data analysis. In many ways it is like a game.

Puzzle description:

Professor Moody has been teaching the Statistics 101 class for many years. His teaching evaluations have gone considerably south, with one chief complaint: he DOES NOT seem to assign grades fairly. Students compared their scores among themselves and found quite a few discrepancies! But their complaints went nowhere, since the Professor promptly disappeared after posting the final grades and scores.

A brave new TA managed to get hold of Professor Moody’s carefully maintained grading table (spanning multiple years) by … messing a bit with Moody’s computer … well, let’s not explain the details, because he would get in trouble. What he found was a remarkably structured account of how Professor Moody assigns his grades.

It looks like Professor Moody is in fact very alert in class. He is aware of what students do, detecting texting during class and remembering exactly who dozed off. He also keeps a mysterious “participation index”, a numerical score from 0 to 1. This is probably related to the questions students ask and answer, as well as their general attentiveness in class. Remarkable but a little creepy, isn’t it?

What is the best advice the new TA can give future students on how to get a good grade in Professor Moody’s class? What factors influence the grade besides the score? Back up your recommendation with plots and evidence from the attached data.

What are examples of patterns we are looking for here?

Here are some:

“Students who text a lot have a lower chance of getting an A in the class”
“Students whose participation is lower than 0.25 fail the class more often”
“Dozing off does not matter if your score is more than 90; you still get an A”
“If your score is less than 30, you fail the class regardless of your other attributes”
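Candidate patterns like these can be checked by counting. The sketch below is a minimal illustration in Python (standard library only) that uses just the five rows shown in Table 4.1, so the numbers it prints say nothing about the full dataset. The helper name `grade_share` is hypothetical.

```python
from collections import defaultdict

# The five rows of Table 4.1 only; far too few to support any conclusion.
FIELDS = ["SCORE", "GRADE", "DOZES_OFF", "TEXTING_IN_CLASS", "PARTICIPATION"]
ROWS = [
    (21.33, "F", "never", "never", 0.29),
    (71.57, "C", "always", "rarely", 0.11),
    (90.11, "A", "always", "never", 0.26),
    (31.52, "D", "sometimes", "rarely", 0.03),
    (95.94, "A", "always", "rarely", 0.21),
]
students = [dict(zip(FIELDS, row)) for row in ROWS]


def grade_share(rows, attribute, grade):
    """Share of rows with the given grade, for each level of `attribute`."""
    counts = defaultdict(lambda: [0, 0])  # level -> [rows with grade, all rows]
    for r in rows:
        counts[r[attribute]][1] += 1
        if r["GRADE"] == grade:
            counts[r[attribute]][0] += 1
    return {level: hits / total for level, (hits, total) in counts.items()}


# Pattern of the form "texting changes the chance of an A".
share_a_by_texting = grade_share(students, "TEXTING_IN_CLASS", "A")

# Pattern of the form "a score above 90 gives an A regardless of dozing off".
high_scorers_all_a = all(r["GRADE"] == "A" for r in students if r["SCORE"] > 90)

print(share_a_by_texting, high_scorers_all_a)
```

On the full data, the same two checks would be run on every row of the file instead of the hard-coded snippet.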

4.1.1 Secrets Revealed- Patterns in Professor Moody’s data?

Many student solutions falsely attribute higher grades to higher values of the participation attribute.

This is a classic example of a hidden variable, as described in the reference below.

The truth is that the participation attribute influences the score attribute: generally, the higher the participation in Moody’s class, the higher the score. But it is the score that has a direct impact on the grade. The score is therefore the real “hidden variable” behind the final grade: participation and grade move together only because both are tied to the score.

In other words, the score already reflects participation. Besides the score, Professor Moody seems to look only at the texting and dozing-off attributes when determining grades (see the PowerPoint slides above for the explanation).
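One way to see a hidden variable at work is to compare groups overall and then again within a narrow range of the suspected hidden variable. The sketch below does this in Python on made-up toy records. These records are an assumption for illustration only, are not taken from the Moody data, and are built so that participation tracks the score while grades within a score band differ for other reasons. The name `participation_gap` and the cut-off of 85 are also hypothetical.

```python
from statistics import mean

# Made-up toy records (score, grade, participation); NOT from the Moody data.
TOY = [
    (95, "A", 0.8), (92, "A", 0.6), (90, "A", 0.7), (88, "B", 0.7),
    (75, "B", 0.3), (72, "B", 0.1), (70, "B", 0.2), (78, "A", 0.2),
]


def participation_gap(records):
    """Mean participation of A students minus mean participation of B students."""
    a = [p for _, grade, p in records if grade == "A"]
    b = [p for _, grade, p in records if grade == "B"]
    return mean(a) - mean(b)


# Comparison that ignores the score.
overall_gap = participation_gap(TOY)

# Same comparison within each score band (hypothetical cut-off at 85).
gap_high = participation_gap([r for r in TOY if r[0] >= 85])
gap_low = participation_gap([r for r in TOY if r[0] < 85])

print(round(overall_gap, 3), round(gap_high, 3), round(gap_low, 3))
```

In this toy example, A students show clearly higher participation overall (a gap of about 0.25), yet the gap is essentially zero once students with similar scores are compared. That is the signature of participation acting only through the score.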

Compare with examples of hidden variables in the following reference about correlation and causation.

https://www.stewartmath.com/precalc_7e_dp/precalc_7e_dp6.html

4.1.2 Best Student’s Submissions 2022

4.2 Movies Data Hunt

Table 4.2: Snippet of Movies Dataset
	country	content	imdb_score	Gross	Budget	genre
5111	France	PG-13	6.37	Medium	Medium	Action
8361	USA	R	6.78	Low	Medium	Comedy
3111	USA	R	2.46	Low	High	Drama
7851	USA	PG	7.62	Low	Low	History
12481	France	R	6.70	Low	Low	History

Puzzle description:

The Movies dataset contains the IMDb scores of 12,800+ movies along with several attributes, including budget, gross, genre, content rating, etc.

What are the most promising alternative hypotheses about IMDb scores to test? Name your top three candidates along with the evidence that backs them up, in the form of either R instruction(s) or a plot.
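A simple first step in picking candidate hypotheses is to compare the mean imdb_score across the levels of each attribute. The sketch below does this in Python (standard library only) using just the five rows of Table 4.2, so it only shows the mechanics. The helper name `mean_score_by` is hypothetical, and a difference in means is a lead to test, not evidence by itself.

```python
from collections import defaultdict
from statistics import mean

# The five rows of Table 4.2 only (row index column omitted).
FIELDS = ["country", "content", "imdb_score", "Gross", "Budget", "genre"]
ROWS = [
    ("France", "PG-13", 6.37, "Medium", "Medium", "Action"),
    ("USA", "R", 6.78, "Low", "Medium", "Comedy"),
    ("USA", "R", 2.46, "Low", "High", "Drama"),
    ("USA", "PG", 7.62, "Low", "Low", "History"),
    ("France", "R", 6.70, "Low", "Low", "History"),
]
movies = [dict(zip(FIELDS, row)) for row in ROWS]


def mean_score_by(rows, attribute):
    """Mean imdb_score for each level of the given attribute."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[attribute]].append(r["imdb_score"])
    return {level: mean(scores) for level, scores in groups.items()}


# Attributes whose levels differ most in mean score are natural candidates.
for attribute in ("Budget", "country", "genre"):
    print(attribute, mean_score_by(movies, attribute))
```

Run on the full dataset, attributes with large gaps between group means would be the ones to carry forward as alternative hypotheses.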

4.2.1 Secrets Revealed- Patterns in Movies data?

4.2.2 Best Student’s Submissions 2022

4.3 Minimarket Data Hunt

Table 4.3: Snippet of Minimarket Dataset
	BREAD	BUTTER	COOKIES	COFFEE	TEA
7972	0	1	1	1	1
3405	0	0	1	0	0
8179	1	0	0	0	1
316	0	0	1	1	0
3088	0	1	0	1	0

Puzzle description:

Each row of Minimarket.csv contains one customer transaction, represented as a binary vector (treat the entries as NUM values). 1 means that the customer bought an item; 0 means that the customer did not buy that item. For example, if a customer bought Bread but did not buy Butter, you will see 1 in the Bread column and 0 in the Butter column.

Summary

Here’s what you’d do:
1. Come up with a null hypothesis: “Bread does not impact the sales of butter”
2. Come up with an alternative hypothesis: “Bread impacts the sales of butter”
3. Compute the mean value of the Butter column for all the rows where Bread value = 0. Let’s say this is mean1.
4. Compute the mean value of the Butter column for all the rows where Bread value = 1. Let’s say this is mean2.
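Steps 3 and 4 can be sketched as follows. This is a minimal Python illustration (standard library only) that uses just the five rows of Table 4.3 instead of reading Minimarket.csv, so the values are illustrative. The helper name `conditional_mean` is hypothetical.

```python
from statistics import mean

# The five rows of Table 4.3 only (row index column omitted).
COLUMNS = ["BREAD", "BUTTER", "COOKIES", "COFFEE", "TEA"]
ROWS = [
    (0, 1, 1, 1, 1),
    (0, 0, 1, 0, 0),
    (1, 0, 0, 0, 1),
    (0, 0, 1, 1, 0),
    (0, 1, 0, 1, 0),
]
transactions = [dict(zip(COLUMNS, row)) for row in ROWS]


def conditional_mean(rows, target, given, value):
    """Mean of column `target` over the rows where column `given` equals `value`."""
    selected = [r[target] for r in rows if r[given] == value]
    return mean(selected) if selected else None


# Step 3: share of transactions with Butter among those without Bread.
mean1 = conditional_mean(transactions, "BUTTER", "BREAD", 0)
# Step 4: share of transactions with Butter among those with Bread.
mean2 = conditional_mean(transactions, "BUTTER", "BREAD", 1)

print(mean1, mean2)
```

Because the columns are 0/1, each mean is simply the fraction of the selected transactions that include Butter.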

4.3.1 What were the secret associations between items in the minimarket?

4.4 Predicting grades in Professor Moody’s class

4.4.1 How did I cook the Professor Moody Prediction challenge data?