Section: 16 🔖 Free Style: Prediction

16.1 Introduction

What is a prediction model?

A prediction model is a set of rules which, given the values of the independent variables (predictors), determines the value of the predicted (dependent) variable. Here are examples of such rules:

If score > 80 and participation > 0.6 then grade = 'A'

If score > 60 and score < 70 and major = 'Psychology' and Ask_questions = 'always' then grade = 'B'

If score < 50 and score > 40 and Doze_off = 'always' then grade = 'F'
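
Such rules translate directly into R as subsetting assignments into a decision vector. Below is a minimal sketch, not from the course datasets: the toy data frame df and its columns (score, participation, major, Ask_questions, Doze_off) are made up to match the rules above.

```r
# Hypothetical toy data so the sketch runs on its own
df <- data.frame(score = c(85, 65, 45),
                 participation = c(0.9, 0.5, 0.2),
                 major = c('Biology', 'Psychology', 'History'),
                 Ask_questions = c('sometimes', 'always', 'never'),
                 Doze_off = c('never', 'never', 'always'))

# Each rule becomes an assignment into the rows its condition selects
grade <- rep(NA, nrow(df))
grade[df$score > 80 & df$participation > 0.6] <- 'A'
grade[df$score > 60 & df$score < 70 & df$major == 'Psychology' &
      df$Ask_questions == 'always'] <- 'B'
grade[df$score < 50 & df$score > 40 & df$Doze_off == 'always'] <- 'F'
grade
```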

By freestyle prediction we mean building a prediction model without R library functions such as rpart() and other machine learning packages. In freestyle prediction one develops models from scratch, on the basis of plots as well as exploratory queries. Freestyle prediction is important for two reasons. First, building prediction models from scratch allows an aspiring data scientist to "feel the data", as opposed to the often blind, direct application of library functions. Second, even when one uses prediction models based on library functions, the best models are often created by combining several such models. These combinations often arise from skillful subsetting of datasets and applying different models to different subsets.

As our prediction challenge competitions indicate, the winning prediction models (the ones with the lowest error) are predominantly combinations of different models applied to different subsets of the data. Thus freestyle prediction is almost always a part of prediction model building. We start by showing an example of a simple freestyle prediction model.

16.2 Example of a simple freestyle prediction model

```r
# A simple freestyle model: predict the grade from Score thresholds alone
test <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/MoodyMarch2022b.csv")

summary(test)

myprediction <- test
# Decision vector: start with 'F' everywhere, then overwrite by increasing score threshold
decision <- rep('F', nrow(myprediction))
decision[myprediction$Score > 40] <- 'D'
decision[myprediction$Score > 60] <- 'C'
decision[myprediction$Score > 70] <- 'B'
decision[myprediction$Score > 80] <- 'A'
myprediction$Grade <- decision
# Error: fraction of rows where the predicted grade differs from the true grade
error <- mean(test$Grade != myprediction$Grade)
error
```

16.3 How to build a freestyle (your own code) prediction model?

The key idea behind building freestyle prediction models is to subset the data and select the most frequent value of the predicted variable as the prediction. Of course, we are interested in finding highly discriminative subsets of the data, ones with a single dominant (most frequent) value, since using such a dominant value as the prediction leads to a small error. But how do we find data subsets with such dominant values? It is a bit of a trial-and-error process. As the snippet below shows, it is a sequence of one-line exploratory queries which the programmer can rely on. In the next section we show how rpart() generates such discriminative subsets of data automatically, through recursive partitioning.

```r
moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/MoodyMarch2022b.csv")

# How do we build a freestyle prediction model? Definitely start with plots like the
# boxplot from Section 5 (data exploration). But then follow up with exploratory
# queries as in the recent quizzes. The examples here use the table() function and
# look for situations where one grade is absolutely dominant. That would be your
# prediction. Thus, the goal is to slice the data using subsetting in such a way
# that each slice has a clear "winner grade". Then combine these subset rules into
# a decision vector, just as in the snippet in Section 16.2.

# Below are some examples of such exploratory queries with clear grade winners.

summary(moody)
table(moody$Grade)
table(moody[moody$Score > 80, ]$Grade)
table(moody[moody$Score > 80 & moody$Major == 'Psychology', ]$Grade)
table(moody[moody$Score < 40 & moody$Major == 'Economics', ]$Grade)
table(moody[moody$Score < 40 & moody$Seniority == 'Freshman', ]$Grade)
```
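
Once the exploratory queries expose slices with a clear winner grade, the corresponding rules can be combined into a decision vector, just as in Section 16.2. Here is a minimal sketch; the default grade 'C' and the winner grades 'A', 'F', and 'D' are assumptions for illustration - read the actual winners off your own table() output.

```r
moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/MoodyMarch2022b.csv")

# Default prediction: the overall most frequent grade (check table(moody$Grade))
decision <- rep('C', nrow(moody))
# Overwrite the default on slices where the queries showed a dominant grade;
# the winners used here are assumptions, not tuned values
decision[moody$Score > 80] <- 'A'
decision[moody$Score < 40 & moody$Major == 'Economics'] <- 'F'
decision[moody$Score < 40 & moody$Seniority == 'Freshman'] <- 'D'

# Training error of the combined rules
mean(moody$Grade != decision)
```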

16.4 One-step cross-validation

How do we know if our prediction model is any good? After all, we may easily build a model which is close to perfect on the training data set but performs miserably on new, testing data. This is a nightmare for every prediction model builder, and we call it a Kaggle surprise. Kaggle surprises happen quite often during our prediction competitions: students build models which overfit the training data, giving them a false sense of a great, low error, only to yield a miserably high error on the testing data.

To avoid this, or at least to protect against it, cross-validation is needed. We illustrate cross-validation in the next snippet. We split the training data into a "real" training part and a testing part, the remaining portion of the training data set; thus we use part of the training data as testing data. We do this by randomly splitting the data set. Although we show just one step of cross-validation here, it should be done multiple times. This lets us observe how the model behaves on different random subsets of the training data and spot inconsistent results (a high variance of the error), which is a warning sign of a future Kaggle surprise.

```r
train <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/MoodyMarch2022b.csv")
summary(train)

# Scramble the train frame
v <- sample(1:nrow(train))
v[1:5]
trainScrambled <- train[v, ]

# One step of cross-validation.
# Cross-validation: divide the train data set into two disjoint subsets, T (train)
# and its complement (train MINUS T). Use T to derive your prediction model and the
# complement of T to validate (test) it.
# Here we assume the model was built looking only at T = trainScrambled[1:990, ];
# the complement of T, the last 10 rows, is held out as trainSample for validation.
# Note the parentheses: (nrow - 9):nrow selects the last 10 rows; without them,
# ":" binds tighter than "-" and the expression selects the wrong rows.
trainSample <- trainScrambled[(nrow(trainScrambled) - 9):nrow(trainScrambled), ]
myprediction <- trainSample

# You can repeat this multiple times and observe the error and its stability.
# The model below is a VERY SIMPLISTIC illustration; your model should be more
# sophisticated and achieve a much better error.
decision <- rep('F', nrow(myprediction))
decision[myprediction$Score > 40] <- 'D'
decision[myprediction$Score > 60] <- 'C'
decision[myprediction$Score > 70] <- 'B'
decision[myprediction$Score > 80] <- 'A'

myprediction$Grade <- decision
error <- mean(trainSample$Grade != myprediction$Grade)
error
```
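
The snippet above performs a single random split. A minimal sketch of repeating it, say, ten times and looking at the spread of the error (the simplistic threshold model again stands in for your real model):

```r
train <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/MoodyMarch2022b.csv")

errors <- replicate(10, {
  v <- sample(1:nrow(train))
  holdout <- train[v[1:10], ]          # 10 randomly chosen validation rows
  decision <- rep('F', nrow(holdout))
  decision[holdout$Score > 40] <- 'D'
  decision[holdout$Score > 60] <- 'C'
  decision[holdout$Score > 70] <- 'B'
  decision[holdout$Score > 80] <- 'A'
  mean(holdout$Grade != decision)      # validation error for this split
})
errors
mean(errors)
sd(errors)   # a large sd signals unstable error - a warning sign of a Kaggle surprise
```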

We use selected data puzzles from Section 4 in the prediction challenges. Given a data puzzle (such as 4.1), we separate it into a training subset and a testing subset. The training data is given to students to build and cross-validate their prediction models. We then use Kaggle to evaluate their models on the testing subset of the data puzzle. Each prediction challenge is structured as a competition, and Kaggle ranks students' models by prediction score: for categorical target variables, the score is the fraction of values predicted correctly; for numerical targets, it is the mean squared error (MSE).
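
For concreteness, here is how the two scores are computed in R, on small made-up vectors of predicted and true values:

```r
# Categorical target: accuracy = fraction of values predicted correctly
actual <- c('A', 'B', 'C', 'A')
pred   <- c('A', 'B', 'B', 'A')
mean(pred == actual)        # 0.75

# Numerical target: mean squared error
y    <- c(10, 20, 30)
yhat <- c(12, 18, 33)
mean((yhat - y)^2)          # (4 + 4 + 9) / 3, approximately 5.67
```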

16.5 General Structure of the Prediction Challenges

Submission takes place on Kaggle, which hosts these prediction challenges online, validates submissions, enforces submission deadlines, calculates the prediction scores, and ranks all the submissions.

The datasets provided for each prediction challenge are as follows:

  • Training Dataset

    • It is used for training and cross-validation purposes in the prediction challenge. It contains all the predictor attributes along with the values of the attribute being predicted (the so-called target attribute).
    • Models for prediction are to be trained using this dataset only.
    • The training data set is the one used when you build your prediction model, since it is the only data set containing all values of the target attribute.
  • Testing Dataset

    • It is used for applying your prediction model to new data. You do this only once you are finished building your prediction model.

    • The testing data set contains all the attributes that were used for training, but no values of the target attribute.

    • It is disjoint from the training data set: it contains new data and is missing the target variable.

  • Submission Dataset

    • After predicting on the testing dataset, copy the predicted attribute column into the submission dataset, which has only 2 columns: an index column (e.g. ID or name) and the predicted attribute column. After copying the predicted column into this dataset, save it back to the submission file, which can then be uploaded to Kaggle.

To read the datasets use the read.csv() function, and to write a dataset to a file use the write.csv() function. A common mistake when writing a data frame from R to a CSV file is to also write the row names, which results in an error upon submission of the file to Kaggle.

To avoid this, add the parameter row.names = FALSE to the write.csv() call, e.g. write.csv(*dataframe*, *fileaddress*, row.names = FALSE).
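
A quick illustration of the difference; the two-column submission frame here is made up:

```r
# Hypothetical two-column submission frame
submission <- data.frame(ID = 1:3, Grade = c('A', 'B', 'C'))

# WRONG for Kaggle: the default row.names = TRUE adds an extra, unnamed first column
write.csv(submission, 'submission_bad.csv')

# RIGHT: exactly the two expected columns
write.csv(submission, 'submission.csv', row.names = FALSE)
```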

16.5.1 Preparing submission.csv for Kaggle

```r
# Here you just need the test table (without grades) to apply your prediction model
# and calculate the predicted grades, plus the submission data frame to fill in
# with the predicted grades.

test <- read.csv('https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/M2022testSNoGrade.csv')
submission <- read.csv('https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/M2022submission.csv')

myprediction <- test
# Here is your model; we just show a trivial prediction model as an example
decision <- rep('F', nrow(myprediction))
decision[myprediction$Score > 40] <- 'D'
decision[myprediction$Score > 60] <- 'C'
decision[myprediction$Score > 70] <- 'B'
decision[myprediction$Score > 80] <- 'A'
# Now make your submission file: it will have the IDs and the predicted grades
submission$Grade <- decision
submission
# Use write.csv(submission, 'submission.csv', row.names = FALSE) to store the
# submission as a csv file on your machine, then submit it on Kaggle
```


Data League: https://data101.cs.rutgers.edu/?q=node/155
Kaggle competition: https://www.kaggle.com/competitions/predictive-challenge-2-2022/overview
Kaggle submission instructions: https://data101.cs.rutgers.edu/?q=node/150

16.6 Additional Reference