Section: 1 Introduction

The objective of this textbook is to provide you with the shortest path to exploring your data, visualizing it, forming hypotheses, and validating and defending them. In other words, to introduce you to data science. We call it an active textbook because students can interact with the book by running and modifying snippets of R code. Students can also test themselves using Query and Code Roulette, which offer questions and simple coding tasks. The concepts we discuss in the book are widely covered on the web, including in countless YouTube video tutorials. We briefly introduce the basic concepts here as well, but our focus is on problem solving and simple coding. In other words, on honing the skills to do something with the data, not just talk about it. To learn a concept you have to know how to code it. To put it bluntly: no coding, no learning!

The active textbook is data-centric. We stress that prior to data analysis and exploration, you have to know your data. Paraphrasing the famous saying that real estate is about “location, location and location”, data science is about data, data and data. Using numerous data sets, we guide the student through the process of getting acquainted with their data. We call these data sets data puzzles, since each of them is synthetically created and has hidden patterns embedded in the process of data generation. For each of our data puzzles we show the process of getting familiar with the data, beginning with simple scripts called queries and proceeding through ad hoc hypothesis testing as well as Bayesian reasoning. Each data puzzle hides interesting and non-trivial patterns which are to be discovered. This process of discovery makes data science similar to the work of a detective. Discoveries range from the grading methods of Professor Moody and the factors influencing the quality of a party to voter profiles in local town elections and the determinants of sleep quality.

Given a data set, you want to be able to make any plot you wish, find plots which show something actionable and interesting, explore the data by slicing and dicing it, and present your results in a statistically convincing manner, perhaps in a colorful and visually appealing way. You will also be able to apply some basic machine learning methods to build, train and test prediction models. All of this will be accomplished in a succinct and crisp way using a small subset of R instructions.
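As a rough illustration of what "slicing and dicing" followed by a plot can look like, here is a minimal sketch. It uses R's built-in mtcars data set as a stand-in (an assumption made for this example; it is not one of the book's data puzzles), and the variable names heavy and mean_mpg_by_cyl are chosen only for illustration.

```r
# Stand-in data: R's built-in mtcars data set (not one of the book's data puzzles)
cars <- mtcars

# Slice: keep only the rows for heavy cars (weight above 3.5 thousand lbs)
heavy <- cars[cars$wt > 3.5, ]

# Dice: average miles per gallon within each cylinder group
mean_mpg_by_cyl <- tapply(cars$mpg, cars$cyl, mean)

# Plot: one bar per cylinder group
barplot(mean_mpg_by_cyl,
        xlab = "Number of cylinders",
        ylab = "Average miles per gallon",
        col = "steelblue")
```

Changing the threshold 3.5, or swapping mpg for another column, is the kind of "what if" modification the text has in mind.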

We assume no prior programming background. We will teach you as little R as necessary to achieve the goals of this book: explore the data, visualize it, verify hypotheses and build prediction models. Thus, you will be able to do a good chunk of the work that data scientists do. We will accomplish this goal through active snippets of executable code. These are examples of R code (around 100 executable snippets) embedded in the textbook itself. More importantly, you will be able to modify the code and execute the modified code without needing to install any application on your machine. This will allow you to understand the code in the book through a “what if” exploratory process. Thus, every code snippet is an invitation to endless modifications. This is why we call this textbook active.

Another unique aspect of this textbook is its reliance on data puzzles. These are synthetic data sets with embedded patterns and rules, generated by our tool called DataMaker. We will present our data puzzles (a dynamic list, which may vary from year to year) in section 8, following the introduction to plots, and then proceed to freestyle data exploration. This will allow us to learn more about our data, form leads, and finally state our hypotheses. We will follow up with an elementary introduction to hypothesis testing through a permutation test. We will learn how to calculate p-values and how to use them to defend our findings against the randomness trap. This will be particularly important in the case of multiple hypotheses, when one has to be especially careful to avoid “false positives”. We will introduce Bayesian reasoning and learn how to compute the posterior odds of a belief given an observation. All these important concepts will be introduced via executable snippets of code and “what if” practice. We will then enter the key section of the book, the data puzzles section. In the data puzzles section, for each data set we will go through the process of getting to know the data and applying the concepts learned so far by executing code tailored to each of the data puzzles.
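To give a flavor of the two ideas mentioned above, here is a minimal sketch of a permutation test and of a posterior odds calculation. The scores, group labels and probabilities are made up for illustration, and all variable names are hypothetical; this is one simple way to code these ideas, not necessarily the way the later sections do it.

```r
# --- Permutation test (made-up data, for illustration only) ---
set.seed(1)
scores <- c(78, 85, 90, 72, 88, 65, 70, 74, 68, 80)
group  <- c(rep("A", 5), rep("B", 5))

# Observed difference between the group means
observed <- mean(scores[group == "A"]) - mean(scores[group == "B"])

# Shuffle the group labels many times; each shuffle shows what the
# difference could look like if the labels had nothing to do with the scores
n_perm <- 10000
perm_diffs <- replicate(n_perm, {
  shuffled <- sample(group)
  mean(scores[shuffled == "A"]) - mean(scores[shuffled == "B"])
})

# One-sided p-value: share of shuffles with a difference at least as large
p_value <- mean(perm_diffs >= observed)

# --- Posterior odds (made-up probabilities, for illustration only) ---
prior_prob       <- 0.2   # assumed P(belief) before the observation
p_obs_if_true    <- 0.9   # assumed P(observation | belief is true)
p_obs_if_false   <- 0.3   # assumed P(observation | belief is false)

prior_odds       <- prior_prob / (1 - prior_prob)
likelihood_ratio <- p_obs_if_true / p_obs_if_false

# Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio
posterior_odds   <- prior_odds * likelihood_ratio
posterior_prob   <- posterior_odds / (1 + posterior_odds)
```

A small p_value suggests that the observed difference would rarely arise from random label shuffling alone. With the assumed numbers, the prior odds of 0.25 are multiplied by a likelihood ratio of 3, giving posterior odds of 0.75.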

In the second part of the book we discuss prediction models. We focus on decision trees (rpart(), recursive partitioning) and linear regression, but we also show how to use other machine learning methods from R's rich set of libraries. We go over cross-validation and show how to build prediction models which combine multiple machine learning models. We stress the importance of knowing your data first, instead of blindly applying machine learning packages. Keeping humans in the loop is very important: prior data exploration and visualization lead to better prediction models. Students can practice building prediction models on dedicated prediction snippets to prepare themselves for the Kaggle-based prediction challenge, which takes up the last months of the Data 101 class. The final leaderboard of the 2022 challenge is presented in LeaderBoard.
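A minimal sketch of building and testing a decision tree with rpart() might look as follows. It assumes the rpart package is available (it ships with standard R installations) and uses the built-in iris data set as a stand-in for a data puzzle; the 70/30 split and the variable names are illustrative choices, and a single train/test split is a simpler check than full cross-validation.

```r
library(rpart)

# Stand-in data: R's built-in iris data set (not one of the book's data puzzles)
set.seed(42)
n <- nrow(iris)

# Hold out roughly 30% of the rows for testing
test_idx <- sample(n, size = round(0.3 * n))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Train a classification tree that predicts Species from all other columns
tree <- rpart(Species ~ ., data = train, method = "class")

# Predict the class of each held-out row and measure accuracy
pred <- predict(tree, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)
```

Measuring accuracy on rows the tree has never seen is what guards against a model that merely memorizes its training data.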

We will use as few R functions as possible to achieve our goals. In fact, we will demonstrate that fewer than ten R functions are sufficient for us. In the appendix, we show many more useful R commands which you will eventually have to use. However, our goal in this short textbook is to present the shortest path to data analysis, one which will let you import the data, plot it, do some analysis yourself and use R libraries to build machine learning models. In this textbook and in this class we do not teach how to clean the data (data wrangling) or how to deal with a wide variety of data types. We also do not address complex data transformations such as multi-frame operations like the merge function. Nor do we explain how different machine learning methods work; we only show you how to use them. It is similar to teaching someone how to drive a car without explaining how a car engine works.

Sections 2.5 and 2.6 provide the lists of all concepts which we cover in our active textbook and all R functions which are needed. Notice how small the set of R functions is. It is important for programming novices to start small and also see how far this small set of functions can get you.

Our question roulette allows self-testing on nearly 100 questions relevant to the material. Each question comes with an answer, but students are encouraged to answer the questions themselves first and only then check the correct answer. The code roulette, on the other hand, consists of around 100 simple, common data science coding tasks.