17 Predictions with rpart

17.1 Introduction

Decision trees are among the most powerful and popular tools for classification and prediction. They are popular because they generate rules that are easier to understand than most other models, they require relatively little computation for training and prediction, and they handle both continuous/numerical and categorical variables naturally.

17.2 Use of Rpart

The Recursive PARTitioning and Regression Trees (RPART) library is a collection of routines that implements decision trees. The resulting model can be represented as a binary tree. To illustrate rpart, we will continue to use the data puzzle 3.1 set - the Professor Moody data set.

The library associated with RPART is called rpart. Install it using install.packages("rpart").

Syntax for building the decision tree using rpart():

  • rpart( formula , method, data, control,...)
    • formula: the prediction (target) column and the related columns (predictors) on which the prediction will be based.
      • prediction ~ predictor1 + predictor2 + predictor3 + ...
    • method: the type of decision tree we want. If nothing is provided, the function makes an intelligent guess. We can use “anova” for regression, “class” for classification, etc.
    • data: the dataset on which we want to fit the decision tree.
    • control: the control parameters for the decision tree, explained in more detail later in this chapter.

For more information on the rpart() function, visit the rpart documentation

Let's look at an example on the Moody 2022 dataset.

  • We will use the rpart() function with the following inputs:
    • prediction -> GRADE
    • predictors -> SCORE, DOZES_OFF, TEXTING_IN_CLASS, PARTICIPATION
    • data -> moody dataset
    • method -> “class” for classification.

17.2.1 rpart()

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uLlxucnBhcnQoR1JBREUgfiBTQ09SRStET1pFU19PRkYrVEVYVElOR19JTl9DTEFTUytQQVJUSUNJUEFUSU9OLCBkYXRhID0gbW9vZHksbWV0aG9kID0gXCJjbGFzc1wiKSJ9

We can see that the output of the rpart() function is the decision tree, with details of:

  • node -> node number
  • split -> split conditions/tests
  • n -> number of records in the branch, i.e. the subset
  • yval -> output value, i.e. the predicted target value
  • yprob -> probability of obtaining a particular category as the predicted output
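As a quick, self-contained illustration of this output format, the following sketch fits a small classification tree on R's built-in iris data set (standing in for the Moody data, which requires a network download); printing the fitted object shows exactly the node/split/n/yval/yprob lines described above.

```r
# rpart ships with standard R installations.
library(rpart)

# Fit a small classification tree on the built-in iris data set.
fit <- rpart(Species ~ ., data = iris, method = "class")

# Printing the fitted object lists one line per node:
# node number, split condition, n, yval (predicted class), and yprob.
print(fit)
```

Leaf nodes are marked with a trailing `*` in this printout, so you can read off the predicted class and class probabilities for each terminal subset directly.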

Using the output tree, we can use the predict() function to predict the grades of the test data; we will look at this process later, in section 17.6.

Coming back to the output of the rpart() function: the plain-text output is useful but difficult to read and understand. We will look at visualizing the decision tree in the next section.

17.3 Visualize the Decision tree

To visualize and understand the rpart() tree output in the easiest way possible, we use a library called rpart.plot. Its rpart.plot() function is used to visualize decision trees.

NOTE: The online runnable code block does not support the rpart.plot library and its functions, so the output of the following code examples is provided directly.

17.3.1 rpart.plot()

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEZpcnN0IGxldHMgaW1wb3J0IHRoZSBycGFydCBsaWJyYXJ5XG5saWJyYXJ5KHJwYXJ0KVxuXG4jIEltcG9ydCBkYXRhc2V0XG5tb29keTwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L21vb2R5MjAyMl9uZXcuY3N2JylcblxuIyBVc2Ugb2YgdGhlIHJwYXJ0KCkgZnVuY3Rpb24uXG5ycGFydChHUkFERSB+IFNDT1JFK0RPWkVTX09GRitURVhUSU5HX0lOX0NMQVNTK1BBUlRJQ0lQQVRJT04sIGRhdGEgPSBtb29keSxtZXRob2QgPSBcImNsYXNzXCIpXG5cbiMgTm93IGxldHMgaW1wb3J0IHRoZSBycGFydC5wbG90IGxpYnJhcnkgdG8gdXNlIHRoZSBycGFydC5wbG90KCkgZnVuY3Rpb24uXG4jbGlicmFyeShycGFydC5wbG90KVxuXG4jIFVzZSBvZiB0aGUgcnBhcnQucGxvdCgpIGZ1bmN0aW9uICB0byB2aXN1YWxpemUgdGhlIGRlY2lzaW9uIHRyZWUuXG4jcnBhcnQucGxvdCh0cmVlKSJ9

Output Plot of rpart.plot() function

We can see that after plotting the tree with the rpart.plot() function, the tree is more readable and provides better information about the splitting conditions and the probability of outcomes. Each leaf node shows:

  • the grade category.
  • the outcome probability of each grade category.
  • the percentage of records out of the total records.

To study the arguments that can be passed to the rpart.plot() function in more detail, please see the guides rpart.plot and Plotting with rpart.plot (PDF)

NOTE: From this point forward in this chapter, the rpart.plot() output in any example is shown as an image, and the code that generates those plots is commented out in the interactive code blocks. If you want to generate these plots yourself, please use a local RStudio or R environment.

17.4 Rpart Control

Now let’s look at the rpart.control() function used to pass the control parameters to the control argument of the rpart() function.

  • rpart.control( *minsplit*, *minbucket*, *cp*,...)
  • minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. For example, minsplit=500 means a node must contain at least 500 observations before a split is attempted at its testing condition.
  • minbucket: the minimum number of observations in any terminal (leaf) node. For example, minbucket=500 means every terminal/leaf node of the tree must contain at least 500 observations.
  • cp: the complexity parameter. Any split that does not improve the fit by a factor of at least cp is not made in the tree.

For more information on the other arguments of the rpart.control() function, visit rpart.control

Let's look at a few examples.
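These parameters are easiest to compare side by side. The following sketch uses the built-in iris data as a stand-in for the Moody data set (the parameter values are illustrative, chosen for iris's 150 rows, not the values used in the Moody examples below); counting rows of the fitted object's frame component gives the number of nodes in each tree.

```r
library(rpart)

# Unconstrained tree on iris, for comparison.
tree_default <- rpart(Species ~ ., data = iris, method = "class")

# Require at least 120 observations in a node before attempting a split:
# with only 150 rows, the tree stops growing after very few splits.
tree_minsplit <- rpart(Species ~ ., data = iris, method = "class",
                       control = rpart.control(minsplit = 120))

# A large cp keeps only splits that improve the fit substantially.
tree_cp <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0.5))

# nrow(<tree>$frame) counts the nodes; the constrained trees are no larger.
nrow(tree_default$frame)
nrow(tree_minsplit$frame)
nrow(tree_cp$frame)
```

Raising minsplit, minbucket, or cp can only prune or stop growth, so the node counts of the constrained trees never exceed the default tree's.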

Suppose you want to set the control parameter minsplit=200.

17.4.1 rpart(): Minsplit = 200

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIG1pbnNwbGl0PTIwMFxudHJlZSA8LSBycGFydChHUkFERSB+IFNDT1JFK0RPWkVTX09GRitURVhUSU5HX0lOX0NMQVNTK1BBUlRJQ0lQQVRJT04sIGRhdGEgPSBtb29keSwgbWV0aG9kID0gXCJjbGFzc1wiLGNvbnRyb2w9cnBhcnQuY29udHJvbChtaW5zcGxpdCA9IDIwMCkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUsZXh0cmEgPSAyKSJ9

Output tree plot after setting minsplit=200 in the rpart.control() function

17.4.2 rpart(): Minsplit = 100

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIG1pbnNwbGl0PTEwMFxudHJlZSA8LSBycGFydChHUkFERSB+IFNDT1JFK0RPWkVTX09GRitURVhUSU5HX0lOX0NMQVNTK1BBUlRJQ0lQQVRJT04sIGRhdGEgPSBtb29keSwgbWV0aG9kID0gXCJjbGFzc1wiLGNvbnRyb2w9cnBhcnQuY29udHJvbChtaW5zcGxpdCA9IDEwMCkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUsZXh0cmEgPSAyKSJ9

Output tree plot after setting minsplit=100 in the rpart.control() function

We can see from the output of tree$splits and the tree plot that at each split the number of observations is at least 200 (respectively 100). Also, in comparison to the tree without control parameters, the controlled tree has a lower height and fewer splits.

Now, let's set the minbucket parameter to 100 and see how that affects the tree.

17.4.3 rpart(): Minbucket = 100

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIE1pbmJ1Y2tldD0xMDBcbnRyZWUgPC0gcnBhcnQoR1JBREUgfiBTQ09SRStET1pFU19PRkYrVEVYVElOR19JTl9DTEFTUytQQVJUSUNJUEFUSU9OLCBkYXRhID0gbW9vZHksIG1ldGhvZCA9IFwiY2xhc3NcIixjb250cm9sPXJwYXJ0LmNvbnRyb2wobWluYnVja2V0ID0gMTAwKSlcblxudHJlZVxuXG4jbGlicmFyeShycGFydC5wbG90KVxuI3JwYXJ0LnBsb3QodHJlZSxleHRyYSA9IDIpIn0=

Output tree plot after setting minbucket=100 in the rpart.control() function

We can see from the output and the tree plot that the count of observations in each leaf node is at least 100. Also, the tree height has shortened, suggesting that the control parameter reduced the tree size.

17.4.4 rpart(): Minbucket = 200

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIE1pbmJ1Y2tldD0yMDBcbnRyZWUgPC0gcnBhcnQoR1JBREUgfiBTQ09SRStET1pFU19PRkYrVEVYVElOR19JTl9DTEFTUytQQVJUSUNJUEFUSU9OLCBkYXRhID0gbW9vZHksIG1ldGhvZCA9IFwiY2xhc3NcIixjb250cm9sPXJwYXJ0LmNvbnRyb2wobWluYnVja2V0ID0gMjAwKSlcblxudHJlZVxuXG4jbGlicmFyeShycGFydC5wbG90KVxuI3JwYXJ0LnBsb3QodHJlZSxleHRyYSA9IDIpIn0=

Output tree plot after setting minbucket=200 in the rpart.control() function

We can see from the output and the tree plot that the count of observations in each leaf node is at least 200. Also, the tree height has shortened, suggesting that the control parameter reduced the tree size.

Let's now use the cp parameter and see its effect on the tree.

17.4.5 rpart(): cp = 0.05

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIGNwPTAuMlxudHJlZSA8LSBycGFydChHUkFERSB+IC4sIGRhdGEgPSBtb29keSxtZXRob2QgPSBcImNsYXNzXCIsY29udHJvbD1ycGFydC5jb250cm9sKGNwID0gMC4wNSkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUpIn0=

Output tree plot after setting cp=0.05 in the rpart.control() function

17.4.6 rpart(): cp = 0.005

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIGNwPTAuMDA1XG50cmVlIDwtIHJwYXJ0KEdSQURFIH4gLiwgZGF0YSA9IG1vb2R5LG1ldGhvZCA9IFwiY2xhc3NcIixjb250cm9sPXJwYXJ0LmNvbnRyb2woY3AgPSAwLjAwNSkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUpIn0=

Output tree plot after setting cp=0.005 in the rpart.control() function

We can see from the output and the tree plot that the tree size has increased, with more splits and more leaf nodes. We can also see that the minimum CP value in the output is 0.005.

17.5 Cross Validation

Overfitting takes place when you have high accuracy on the training dataset but low accuracy on the test dataset. But how do you know whether you are overfitting, especially since you cannot determine accuracy on the test dataset? That is where cross-validation comes into play.

Because we cannot determine accuracy on the test dataset, we partition our training dataset into train and validation (testing) partitions. We train our model (rpart or lm) on the train partition and test it on the validation partition. The partition is defined by the split ratio: if the split ratio is 0.7, 70% of the training dataset is used for the actual training of your model (rpart or lm), and 30% is used for validation (testing). The accuracy on this validation data is called cross-validation accuracy.

To know if you are overfitting or not, compare the training accuracy with the cross-validation accuracy. If your training accuracy is high, and cross-validation accuracy is low, that means you are overfitting.

  • cross_validate(*data*, *tree*, *n_iter*, *split_ratio*, *method*)
    • data: The dataset on which cross validation is to be performed.
    • tree: The decision tree generated using rpart.
    • n_iter: Number of iterations.
    • split_ratio: The splitting ratio of the data into train data and validation data.
    • method: Method of the prediction. “class” for classification.

The way the function works is as follows:

  • It randomly partitions your data into training and validation.
  • It then constructs the following two decision trees on the training partition:
    • the tree that you pass to the function;
    • a tree constructed on all attributes as predictors, with no control parameters.
  • It then determines the accuracy of the two trees on the validation partition and returns the accuracy values for both trees.

The values in the first column (accuracy_subset) returned by the cross-validation function are the more important ones when it comes to detecting overfitting. If these values are much lower than the training accuracy you get, you are overfitting.

We would also want the values in accuracy_subset to be close to each other (in other words, have low variance). If the values are quite different from each other, that means your model (or tree) has a high variance which is not desired.

The second column (accuracy_all) tells you what happens if you construct a tree based on all attributes. If these values are larger than accuracy_subset, you are probably leaving out attributes from your tree that are relevant.

Each iteration of cross-validation creates a different random partition of train and validation, and so you have possibly different accuracy values for every iteration.
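The procedure described above can be sketched by hand in a few lines. This is a minimal manual version of the idea, not the CrossValidation package itself: it uses the built-in iris data in place of the Moody data set, a 0.7 split ratio, 5 iterations, and tracks only the accuracy of the passed formula (the accuracy_subset column).

```r
library(rpart)

set.seed(42)            # for reproducible random partitions
n_iter <- 5
split_ratio <- 0.7
acc <- numeric(n_iter)  # one accuracy value per iteration

for (i in 1:n_iter) {
  # Randomly partition the data into train and validation.
  idx   <- sample.int(nrow(iris), size = floor(split_ratio * nrow(iris)))
  train <- iris[idx, ]
  valid <- iris[-idx, ]

  # Fit on the training partition, score on the validation partition.
  tree   <- rpart(Species ~ ., data = train, method = "class")
  pred   <- predict(tree, valid, type = "class")
  acc[i] <- mean(pred == valid$Species)
}

mean(acc)  # average cross-validation accuracy
var(acc)   # low variance across iterations is desirable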

Let’s look at the cross_validate() function in action in the example below.

We will pass the tree with formula GRADE ~ SCORE+DOZES_OFF+TEXTING_IN_CLASS+PARTICIPATION and the control parameter minsplit=100. For the cross_validate() function, we will use n_iter=5 and split_ratio=0.7

NOTE: The Cross-Validation repository is already preloaded for the following interactive code block, so you can use the cross_validate() function there directly. But if you wish to use the cross_validate() function locally, please use
install.packages("devtools") 
devtools::install_github("devanshagr/CrossValidation")
CrossValidation::cross_validate()

17.5.1 cross_validate()

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImNyb3NzX3ZhbGlkYXRlIDwtIGZ1bmN0aW9uKGRmLCB0cmVlLCBuX2l0ZXIsIHNwbGl0X3JhdGlvLCBtZXRob2QgPSAnY2xhc3MnKVxue1xuICAjIHRyYWluaW5nIGRhdGEgZnJhbWUgZGZcbiAgZGYgPC0gYXMuZGF0YS5mcmFtZShkZilcblxuICAjIG1lYW5fc3Vic2V0IGlzIGEgdmVjdG9yIG9mIGFjY3VyYWN5IHZhbHVlcyBnZW5lcmF0ZWQgZnJvbSB0aGUgc3BlY2lmaWVkIGZlYXR1cmVzIGluIHRoZSB0cmVlIG9iamVjdFxuICBtZWFuX3N1YnNldCA8LSBjKClcblxuICAjIG1lYW5fYWxsIGlzIGEgdmVjdG9yIG9mIGFjY3VyYWN5IHZhbHVlcyBnZW5lcmF0ZWQgZnJvbSBhbGwgdGhlIGF2YWlsYWJsZSBmZWF0dXJlcyBpbiB0aGUgZGF0YSBmcmFtZVxuICBtZWFuX2FsbCA8LSBjKClcblxuICAjIGNvbnRyb2wgcGFyYW1ldGVycyBmb3IgdGhlIGRlY2lzaW9uIHRyZWVcbiAgY29udHJvID0gdHJlZSRjb250cm9sXG5cbiAgIyB0aGUgZm9sbG93aW5nIHNuaXBwZXQgd2lsbCBjcmVhdGUgcmVsYXRpb25zIHRvIGdlbmVyYXRlIGRlY2lzaW9uIHRyZWVzXG4gICMgcmVsYXRpb25fYWxsIHdpbGwgY3JlYXRlIGEgZGVjaXNpb24gdHJlZSB3aXRoIGFsbCB0aGUgZmVhdHVyZXNcbiAgIyByZWxhdGlvbl9zdWJzZXQgd2lsbCBjcmVhdGUgYSBkZWNpc2lvbiB0cmVlIHdpdGggb25seSB1c2VyLXNwZWNpZmllZCBmZWF0dXJlcyBpbiB0cmVlXG4gIGRlcCA8LSBhbGwudmFycyh0ZXJtcyh0cmVlKSlbMV1cbiAgaW5kZXAgPC0gbGlzdCgpXG4gIHJlbGF0aW9uX2FsbCA9IGFzLmZvcm11bGEocGFzdGUoZGVwLCAnLicsIHNlcCA9IFwiflwiKSlcbiAgaSA8LSAxXG4gIHdoaWxlIChpIDwgbGVuZ3RoKGFsbC52YXJzKHRlcm1zKHRyZWUpKSkpIHtcbiAgICBpbmRlcFtbaV1dIDwtIGFsbC52YXJzKHRlcm1zKHRyZWUpKVtpICsgMV1cbiAgICBpIDwtIGkgKyAxXG4gIH1cbiAgYiA8LSBwYXN0ZShpbmRlcCwgY29sbGFwc2UgPSBcIitcIilcbiAgcmVsYXRpb25fc3Vic2V0IDwtIGFzLmZvcm11bGEocGFzdGUoZGVwLCBiLCBzZXAgPSBcIn5cIikpXG5cbiAgIyBjcmVhdGluZyB0cmFpbiBhbmQgdGVzdCBzYW1wbGVzIHdpdGggdGhlIGdpdmVuIHNwbGl0IHJhdGlvXG4gICMgcGVyZm9ybWluZyBjcm9zcy12YWxpZGF0aW9uIG5faXRlciB0aW1lc1xuICBmb3IgKGkgaW4gMTpuX2l0ZXIpIHtcbiAgICBzYW1wbGUgPC1cbiAgICAgIHNhbXBsZS5pbnQobiA9IG5yb3coZGYpLFxuICAgICAgICAgICAgICAgICBzaXplID0gZmxvb3Ioc3BsaXRfcmF0aW8gKiBucm93KGRmKSksXG4gICAgICAgICAgICAgICAgIHJlcGxhY2UgPSBGKVxuICAgIHRyYWluIDwtIGRmW3NhbXBsZSxdXG4gICAgdGVzdGluZyAgPC0gZGZbLXNhbXBsZSxdXG4gICAgdHlwZSA9IHR5cGVvZih1bmxpc3QodGVzdGluZ1tkZXBdKSlcblxuICAgICMgZGVjaXNpb24gdHJlZSBmb3IgcmVncmVzc2lvbiBpZiB0aGUgbWV0aG9kIHNwZWNp
ZmllZCBpcyBcImFub3ZhXCJcbiAgICBpZiAobWV0aG9kID09ICdhbm92YScpIHtcbiAgICAgIGZpcnN0LnRyZWUgPC1cbiAgICAgICAgcnBhcnQoXG4gICAgICAgICAgcmVsYXRpb25fc3Vic2V0LFxuICAgICAgICAgIGRhdGEgPSB0cmFpbixcbiAgICAgICAgICBjb250cm9sID0gY29udHJvLFxuICAgICAgICAgIG1ldGhvZCA9ICdhbm92YSdcbiAgICAgICAgKVxuICAgICAgc2Vjb25kLnRyZWUgPC0gcnBhcnQocmVsYXRpb25fYWxsLCBkYXRhID0gdHJhaW4sIG1ldGhvZCA9ICdhbm92YScpXG4gICAgICBwcmVkMS50cmVlIDwtIHByZWRpY3QoZmlyc3QudHJlZSwgbmV3ZGF0YSA9IHRlc3RpbmcpXG4gICAgICBwcmVkMi50cmVlIDwtIHByZWRpY3Qoc2Vjb25kLnRyZWUsIG5ld2RhdGEgPSB0ZXN0aW5nKVxuICAgICAgbWVhbjEgPC0gbWVhbigoYXMubnVtZXJpYyhwcmVkMS50cmVlKSAtIHRlc3RpbmdbLCBkZXBdKSBeIDIpXG4gICAgICBtZWFuMiA8LSBtZWFuKChhcy5udW1lcmljKHByZWQyLnRyZWUpIC0gdGVzdGluZ1ssIGRlcF0pIF4gMilcbiAgICAgIG1lYW5fc3Vic2V0IDwtIGMobWVhbl9zdWJzZXQsIG1lYW4xKVxuICAgICAgbWVhbl9hbGwgPC0gYyhtZWFuX2FsbCwgbWVhbjIpXG4gICAgfVxuXG4gICAgIyBkZWNpc2lvbiB0cmVlIGZvciBjbGFzc2lmaWNhdGlvblxuICAgICMgaWYgdGhlIG1ldGhvZCBzcGVjaWZpZWQgaXMgbm90IFwiYW5vdmFcIiwgdGhlbiB0aGlzIGJsb2NrIGlzIGV4ZWN1dGVkXG4gICAgIyBpZiB0aGUgbWV0aG9kIGlzIG5vdCBzcGVjaWZpZWQgYnkgdGhlIHVzZXIsIHRoZSBkZWZhdWx0IG9wdGlvbiBpcyB0byBwZXJmb3JtIGNsYXNzaWZpY2F0aW9uXG4gICAgZWxzZXtcbiAgICAgIGZpcnN0LnRyZWUgPC1cbiAgICAgICAgcnBhcnQoXG4gICAgICAgICAgcmVsYXRpb25fc3Vic2V0LFxuICAgICAgICAgIGRhdGEgPSB0cmFpbixcbiAgICAgICAgICBjb250cm9sID0gY29udHJvLFxuICAgICAgICAgIG1ldGhvZCA9ICdjbGFzcydcbiAgICAgICAgKVxuICAgICAgc2Vjb25kLnRyZWUgPC0gcnBhcnQocmVsYXRpb25fYWxsLCBkYXRhID0gdHJhaW4sIG1ldGhvZCA9ICdjbGFzcycpXG4gICAgICBwcmVkMS50cmVlIDwtIHByZWRpY3QoZmlyc3QudHJlZSwgbmV3ZGF0YSA9IHRlc3RpbmcsIHR5cGUgPSAnY2xhc3MnKVxuICAgICAgcHJlZDIudHJlZSA8LVxuICAgICAgICBwcmVkaWN0KHNlY29uZC50cmVlLCBuZXdkYXRhID0gdGVzdGluZywgdHlwZSA9ICdjbGFzcycpXG4gICAgICBtZWFuMSA8LVxuICAgICAgICBtZWFuKGFzLmNoYXJhY3RlcihwcmVkMS50cmVlKSA9PSBhcy5jaGFyYWN0ZXIodGVzdGluZ1ssIGRlcF0pKVxuICAgICAgbWVhbjIgPC1cbiAgICAgICAgbWVhbihhcy5jaGFyYWN0ZXIocHJlZDIudHJlZSkgPT0gYXMuY2hhcmFjdGVyKHRlc3RpbmdbLCBkZXBdKSlcbiAgICAgIG1lYW5fc3Vic2V0IDwtIGMobWVhbl9zdWJzZXQsIG1lYW4xKVxuICAgICAgbWVhbl9hbGwgPC0gYyhtZWFuX2FsbCwg
bWVhbjIpXG4gICAgfVxuICB9XG5cbiAgIyBhdmVyYWdlX2FjY3VyYWN5X3N1YnNldCBpcyB0aGUgYXZlcmFnZSBhY2N1cmFjeSBvZiBuX2l0ZXIgaXRlcmF0aW9ucyBvZiBjcm9zcy12YWxpZGF0aW9uIHdpdGggdXNlci1zcGVjaWZpZWQgZmVhdHVyZXNcbiAgIyBhdmVyYWdlX2FjdXJhY3lfYWxsIGlzIHRoZSBhdmVyYWdlIGFjY3VyYWN5IG9mIG5faXRlciBpdGVyYXRpb25zIG9mIGNyb3NzLXZhbGlkYXRpb24gd2l0aCBhbGwgdGhlIGF2YWlsYWJsZSBmZWF0dXJlc1xuICAjIHZhcmlhbmNlX2FjY3VyYWN5X3N1YnNldCBpcyB0aGUgdmFyaWFuY2Ugb2YgYWNjdXJhY3kgb2Ygbl9pdGVyIGl0ZXJhdGlvbnMgb2YgY3Jvc3MtdmFsaWRhdGlvbiB3aXRoIHVzZXItc3BlY2lmaWVkIGZlYXR1cmVzXG4gICMgdmFyaWFuY2VfYWNjdXJhY3lfYWxsIGlzIHRoZSB2YXJpYW5jZSBvZiBhY2N1cmFjeSBvZiBuX2l0ZXIgaXRlcmF0aW9ucyBvZiBjcm9zcy12YWxpZGF0aW9uIHdpdGggYWxsIHRoZSBhdmFpbGFibGUgZmVhdHVyZXNcbiAgY3Jvc3NfdmFsaWRhdGlvbl9zdGF0cyA8LVxuICAgIGxpc3QoXG4gICAgICBcImF2ZXJhZ2VfYWNjdXJhY3lfc3Vic2V0XCIgPSBtZWFuKG1lYW5fc3Vic2V0LCBuYS5ybSA9IFQpLFxuICAgICAgXCJhdmVyYWdlX2FjY3VyYWN5X2FsbFwiID0gbWVhbihtZWFuX2FsbCwgbmEucm0gPSBUKSxcbiAgICAgIFwidmFyaWFuY2VfYWNjdXJhY3lfc3Vic2V0XCIgPSB2YXIobWVhbl9zdWJzZXQsIG5hLnJtID0gVCksXG4gICAgICBcInZhcmlhbmNlX2FjY3VyYWN5X2FsbFwiID0gdmFyKG1lYW5fYWxsLCBuYS5ybSA9IFQpXG4gICAgKVxuXG4gICMgY3JlYXRpbmcgYSBkYXRhIGZyYW1lIG9mIGFjY3VyYWN5X3N1YnNldCBhbmQgYWNjdXJhY3lfYWxsXG4gICMgYWNjdXJhY3lfc3Vic2V0IGNvbnRhaW5zIG5faXRlciBhY2N1cmFjeSB2YWx1ZXMgb24gY3Jvc3MtdmFsaWRhdGlvbiB3aXRoIHVzZXItc3BlY2lmaWVkIGZlYXR1cmVzXG4gICMgYWNjdXJhY3lfYWxsIGNvbnRhaW5zIG5faXRlciBhY2N1cmFjeSB2YWx1ZXMgb24gY3Jvc3MtdmFsaWRhdGlvbiB3aXRoIGFsbCB0aGUgYXZhaWxhYmxlIGZlYXR1cmVzXG4gIGNyb3NzX3ZhbGlkYXRpb25fZGYgPC1cbiAgICBkYXRhLmZyYW1lKGFjY3VyYWN5X3N1YnNldCA9IG1lYW5fc3Vic2V0LCBhY2N1cmFjeV9hbGwgPSBtZWFuX2FsbClcbiAgcmV0dXJuKGxpc3QoY3Jvc3NfdmFsaWRhdGlvbl9kZiwgY3Jvc3NfdmFsaWRhdGlvbl9zdGF0cykpXG59Iiwic2FtcGxlIjoiIyBGaXJzdCBsZXRzIGltcG9ydCB0aGUgcnBhcnQgbGlicmFyeVxubGlicmFyeShycGFydClcbiMgSW1wb3J0IGRhdGFzZXRcbm1vb2R5PC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvbW9vZHkyMDIyX25ldy5jc3YnLHN0cmluZ3NBc0ZhY3RvcnMgPSBUKVxuIyBVc2Ugb2YgdGhlIHJwYXJ0KCkgZnVu
Y3Rpb24uXG50cmVlIDwtIHJwYXJ0KEdSQURFIH4gU0NPUkUrRE9aRVNfT0ZGK1RFWFRJTkdfSU5fQ0xBU1MsIGRhdGEgPSBtb29keSxtZXRob2QgPSBcImNsYXNzXCIsY29udHJvbCA9IHJwYXJ0LmNvbnRyb2wobWluc3BsaXQgPSAxMDApKVxudHJlZVxuIyBOb3cgbGV0cyBwcmVkaWN0IHRoZSBHcmFkZXMgb2YgdGhlIE1vb2R5IERhdGFzZXQuXG5wcmVkIDwtIHByZWRpY3QodHJlZSwgbW9vZHksIHR5cGU9XCJjbGFzc1wiKVxuaGVhZChwcmVkKVxuIyBMZXRzIGNoZWNrIHRoZSBUcmFpbmluZyBBY2N1cmFjeVxubWVhbihtb29keSRHUkFERT09cHJlZClcbiMgTGV0cyB1cyB0aGUgY3Jvc3NfdmFsaWRhdGUoKSBmdW5jdGlvbi5cbmNyb3NzX3ZhbGlkYXRlKG1vb2R5LHRyZWUsNSwwLjcpIn0=

You can see that the cross-validation accuracies for the tree that was passed (accuracy_subset) are fairly high and close to our training accuracy of 84%. This means we are not overfitting. Also observe that accuracy_subset and accuracy_all have the same values, which means that the only relevant attributes are score and participation, and adding more attributes doesn't make any difference to the tree. Finally, the values in accuracy_subset are reasonably close to each other, which means low variance.

17.6 Prediction using rpart

Now that we have seen how to create a decision tree and plot it, we would like to use the output tree to predict the required attribute.

From the Moody example, we are trying to predict the grades of students. Let's look at the predict() function used to predict the outcomes.

  • predict(*object*,*data*,*type*,...)
    • object: the generated tree from the rpart function.
    • data: the data on which the prediction is to be performed.
    • type: the type of prediction required. One of “vector”, “prob”, “class” or “matrix”.
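The difference between the type values is easiest to see on a small self-contained sketch; here the built-in iris data again stands in for the Moody data set.

```r
library(rpart)

tree <- rpart(Species ~ ., data = iris, method = "class")

# type = "class": one predicted category per row.
p_class <- predict(tree, iris, type = "class")
head(p_class)

# type = "prob": one probability column per category, rows summing to 1.
p_prob <- predict(tree, iris, type = "prob")
head(p_prob)
```

"class" is what you submit as a prediction vector; "prob" is useful when you want to inspect how confident the tree is at each leaf.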

Now let's use the predict() function to predict the grades of students using the tree generated on the Moody dataset.

17.6.1 predict()

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEZpcnN0IGxldHMgaW1wb3J0IHRoZSBycGFydCBsaWJyYXJ5XG5saWJyYXJ5KHJwYXJ0KVxuXG4jIEltcG9ydCBkYXRhc2V0XG5tb29keTwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L21vb2R5MjAyMl9uZXcuY3N2JylcblxuIyBVc2Ugb2YgdGhlIHJwYXJ0KCkgZnVuY3Rpb24uXG50cmVlIDwtIHJwYXJ0KEdSQURFIH4gU0NPUkUrRE9aRVNfT0ZGK1RFWFRJTkdfSU5fQ0xBU1MrUEFSVElDSVBBVElPTiwgZGF0YSA9IG1vb2R5ICxtZXRob2QgPSBcImNsYXNzXCIpXG50cmVlXG5cbiMgTm93IGxldHMgcHJlZGljdCB0aGUgR3JhZGVzIG9mIHRoZSBNb29keSBEYXRhc2V0LlxucHJlZCA8LSBwcmVkaWN0KHRyZWUsIG1vb2R5LCB0eXBlPVwiY2xhc3NcIilcbmhlYWQocHJlZCkifQ==

17.7 Combining multiple prediction models

How to build highly predictive models?

This is the million-dollar question which many students ask in the context of our Prediction Challenges (see the leaderboard for 2022). These usually consist of 4-5 prediction tasks, and students who achieve the lowest cumulative error make it to the top of the leaderboard and are widely celebrated. What is the secret of building a competitive prediction model? It is not blind application of machine learning library functions such as rpart(). Even with a great choice of parameter values and careful cross-validation, a single model will usually not be very competitive. The top prediction models combine human ingenuity and knowledge of the data with machine learning library functions. But how do you combine different prediction models to build the "supermodel"? First, know your data: do some preliminary freestyle data exploration, make some plots, and see how the data is distributed. Possibly identify subsets of the data which behave very differently and may require different prediction models - either "hand made" or ML made.

We will start by showing how to combine two different prediction models applied to different partitions of the data set. We assume that the partition is "given"; it is usually the result of preliminary data exploration and plotting. In the next section we show an elegant and generic method of combining an arbitrary number of prediction models using the rpart() function.

For now, let us assume that we have partitioned the moody data set based on the attribute SCORE into two subsets: one with SCORE > 50 and another with SCORE <= 50. Furthermore, we have trained a separate rpart() prediction model for each of the two partitions. Now we want to combine these two models into one combined model and apply it to the testing data set moodyTest.

The following snippet 17.7.1 shows how to do it. Two models, model1 and model2, are trained by running rpart() on two partitions of moody (the training data set) based on SCORE. Then we use the predict() function, applying model1 to the SCORE > 50 partition and model2 to the SCORE <= 50 subset of the testing data set, moodyTest. Finally, the last lines build the decision vector which combines the predictions of models 1 and 2 into one prediction vector on moodyTest.

17.7.1 Combining rpart prediction models

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KSBcblxuIyBJbXBvcnQgZGF0YXNldCBcblxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5tb29keVRlc3Q8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpIFxuXG4jIFdlIG5lZWQgdHdvIHNldHMgaGVyZTogdHJhaW5pbmcgbW9vZHkgYW5kIHRlc3RpbmcgbW9vZHkgKHRoZSBmdWxsIHRlc3RpbmcgbGlrZSBLYWdnbGUgc3RvcmVzIGJ1dCBkb2VzIG5vdCBnaXZlIHRvIHN0dWRlbnRzXG5tb2RlbDE8LXJwYXJ0KEdSQURFfi4sIGRhdGE9bW9vZHlbbW9vZHkkU0NPUkU+NTAsXSk7IFxubW9kZWwyPC1ycGFydChHUkFERX4uLCBkYXRhPW1vb2R5W21vb2R5JFNDT1JFPD01MCxdKTsgXG5tb2RlbDEgXG5tb2RlbDIgXG5cbnByZWQxIDwtIHByZWRpY3QobW9kZWwxLCBuZXdkYXRhPW1vb2R5VGVzdFttb29keVRlc3QkU0NPUkU+NTAsXSwgdHlwZT1cImNsYXNzXCIpIFxucHJlZDIgPC0gcHJlZGljdChtb2RlbDIsIG5ld2RhdGE9bW9vZHlUZXN0W21vb2R5VGVzdCRTQ09SRTw9NTAsXSwgdHlwZT1cImNsYXNzXCIpIFxubXlwcmVkaWN0aW9uPC1tb29keVRlc3QgXG5cbmRlY2lzaW9uIDwtIHJlcCgnRicsbnJvdyhteXByZWRpY3Rpb24pKSBcblxuZGVjaXNpb25bbXlwcmVkaWN0aW9uJFNDT1JFPjUwXSA8LSBhcy5jaGFyYWN0ZXIocHJlZDEpIFxuIFxuZGVjaXNpb25bbXlwcmVkaWN0aW9uJFNDT1JFPD01MF0gPC1hcy5jaGFyYWN0ZXIocHJlZDIpIFxuIFxubXlwcmVkaWN0aW9uJEdSQURFIDwtZGVjaXNpb24gXG5cbmVycm9yIDwtIG1lYW4obW9vZHkkR1JBREUhPSBteXByZWRpY3Rpb24kR1JBREUpIFxuZXJyb3IifQ==

17.7.2 Combining multiple prediction models using rpart

We describe here an elegant method which will allow us to build a combined model in two (or more) phases. Let us start with two prediction models: pred1, which is a freestyle model, and pred2, which is an rpart() model. We faced this situation in our prediction challenges in the spring of 2022. Students were asked to create two prediction models: one which was their own code (freestyle prediction) and another through application of rpart(). It turned out that the top freestyle prediction models had a lower error on the testing data than rpart(). The challenging task was to combine the two models and make the best of both, hopefully getting a combined model which beats both the freestyle and the rpart() models. But how do you build such a model? In the previous section we described the mechanics of combining two models - by splitting the data set into two disjoint partitions and applying each model to just one partition. But how do we find such partitions? Fortunately, we have rpart() to help us.

We will demonstrate the proposed method using some pseudo-code and then illustrate it further with an executable snippet combining two specific prediction models. We first expand the training (and testing) data with two additional, derived attributes, one for each prediction model. Call these attributes model1 and model2. Then we use rpart() to find the best model which uses the original attributes of the data set as well as these two new attributes. In effect, we let rpart() decide what the best use of the two new attributes is.

Let df_train be the training data set (data frame) and let df_test be the testing data frame. Let pred_yourModel be the freestyle prediction function which returns a decision vector according to a freestyle prediction model. For example, the snippet below shows such a very simplistic model for the moody data set, which assigns grades based on disjoint intervals of the SCORE attribute.

df_train$model1 <- pred_yourModel(df_train)
tree <- rpart(F, data = df_train, method = "class")
df_train$model2 <- predict(tree, df_train, type = "class")

Now, the training data set has two extra attributes: model1 and model2.

Finally we create a compound model by using the extended attribute set of moody.

Tree_combined <- rpart(F, data = df_train, method = "class")

F is of the form T ~ ., where T is the target attribute of df (the one we predict). We let rpart() use all attributes, including the new ones: model1 and model2.

Tree_combined will use both prediction models as attributes; depending on their information gain, these two new attributes may play an important role. We can cross-validate as before and estimate the error of this combined prediction model on the training data set.

If we are satisfied with the combined model, we then repeat the same process on testing data.

df_test$model1 <- pred_yourModel(df_test)
df_test$model2 <- predict(tree, df_test, type = "class")

Now the testing data set has the same two extra attributes: model1 and model2.

And we calculate the final prediction using the predict() function:

predict(Tree_combined, df_test, type = "class")
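The pseudo-code above can be made concrete in a short, self-contained sketch. It uses the built-in iris data as a stand-in for the Moody training data, and the hand-made rule below is an illustrative assumption, not the book's freestyle model; for brevity it skips the separate testing data frame and only measures training accuracy.

```r
library(rpart)

df_train <- iris

# model1: a simplistic hand-made (freestyle) prediction from one attribute.
df_train$model1 <- ifelse(df_train$Petal.Length < 2.5, "setosa", "versicolor")

# model2: predictions of a plain rpart() tree on the original attributes only.
base_tree <- rpart(Species ~ Sepal.Length + Sepal.Width +
                             Petal.Length + Petal.Width,
                   data = df_train, method = "class")
df_train$model2 <- predict(base_tree, df_train, type = "class")

# Let rpart() decide how to use the two model columns alongside the
# original attributes.
tree_combined <- rpart(Species ~ ., data = df_train, method = "class")

# Training accuracy of the combined model.
mean(predict(tree_combined, df_train, type = "class") == df_train$Species)
```

Because model2 is itself a strong predictor of the target, the combined tree typically splits on it early; as the chapter explains, the information gain of the derived attributes decides how much they are used.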


The next snippet illustrates this process for the moody data set. The freestyle model is very simplistic:

decision <- rep('F', nrow(moody))
decision[moody$SCORE>40] <- 'D'
decision[moody$SCORE>60] <- 'C'
decision[moody$SCORE>70] <- 'B'
decision[moody$SCORE>80] <- 'A'
moody$model2 <- decision

This prediction model assigns grades solely on the basis of the SCORE attribute: A for SCORE over 80, B for SCORE between 70 and 80, C for SCORE between 60 and 70, D for SCORE between 40 and 60, and finally F for SCORE below 40.

We combine this model with model1 which uses rpart().

The last two lines of the code show where model1 and model2 differ and how often they differ (on almost 25% of the data set).

17.7.2.1 Combining two prediction models using rpart() for moody data set

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEZpcnN0IGxldHMgaW1wb3J0IHRoZSBycGFydCBsaWJyYXJ5XG5saWJyYXJ5KHJwYXJ0KVxuI2luc3RhbGwucGFja2FnZXMoJ3JwYXJ0LnBsb3QnKVxuI2xpYnJhcnkocnBhcnQucGxvdClcblxuIyBVc2Ugb2YgdGhlIHJwYXJ0LnBsb3QoKSBmdW5jdGlvbiAgdG8gdmlzdWFsaXplIHRoZSBkZWNpc2lvbiB0cmVlLlxuXG5cbiMgSW1wb3J0IGRhdGFzZXRcbm1vb2R5PC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvbW9vZHkyMDIyX25ldy5jc3YnKVxubW9vZHlUZXN0PC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvbW9vZHkyMDIyX25ldy5jc3YnKSBcblxuI1RoZXNlIGFyZSB0aGUgc2FtZSB0d28gc2V0cyAtIGJ1dCBJIHdhbnRlZCB0byBkaXN0aW5ndWlzaCB0aGF0IG9uZSBvZiB0aGVtIGlzIHRyYWluaW5nIGFuZCBhbm90aGVyIGlzIHRlc3RpbmcuXG4jTWF5YmUgd2UgY2FuIHRoZW4gdXNlIG1vb2R5ICBhbmQgbW9vZHlUZXN0IGFzIHR3byBwYXJ0aXRpb25zIG9mIG1vb2R5LCBsaWtlIGluIGNyb3NzIHZhbGlkYXRpb24uXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uLlxudHJlZSA8LSBycGFydChHUkFERSB+IFNDT1JFK0RPWkVTX09GRitURVhUSU5HX0lOX0NMQVNTK1BBUlRJQ0lQQVRJT04sIGRhdGEgPSBtb29keSAsbWV0aG9kID0gXCJjbGFzc1wiKVxudHJlZVxuXG4jIE5vdyBsZXQncyBwcmVkaWN0IHRoZSBHcmFkZXMgb2YgdGhlIE1vb2R5IERhdGFzZXQuXG5tb29keSRtb2RlbDE8LXJlcCgnRicsbnJvdyhtb29keSkpXG5tb29keSRtb2RlbDI8LXJlcCgnRicsbnJvdyhtb29keSkpXG5tb29keSRtb2RlbDE8LSBwcmVkaWN0KHRyZWUsIG1vb2R5LCB0eXBlPVwiY2xhc3NcIilcbmRlY2lzaW9uIDwtIHJlcCgnRicsbnJvdyhtb29keSkpXG5kZWNpc2lvblttb29keSRTQ09SRT40MF0gPC0gJ0QnXG5kZWNpc2lvblttb29keSRTQ09SRT42MF0gPC0gJ0MnXG5kZWNpc2lvblttb29keSRTQ09SRT43MF0gPC0gJ0InXG5kZWNpc2lvblttb29keSRTQ09SRT44MF0gPC0gJ0EnXG5tb29keSRtb2RlbDIgPC1kZWNpc2lvblxuY29sbmFtZXMobW9vZHkpXG5tb29keVRlc3QkbW9kZWwxPC1yZXAoJ0YnLG5yb3cobW9vZHkpKVxubW9vZHlUZXN0JG1vZGVsMjwtcmVwKCdGJyxucm93KG1vb2R5KSlcbm1vb2R5VGVzdCRtb2RlbDE8LSBwcmVkaWN0KHRyZWUsIG1vb2R5VGVzdCwgdHlwZT1cImNsYXNzXCIpXG5kZWNpc2lvbiA8LSByZXAoJ0YnLG5yb3cobW9vZHkpKVxuZGVjaXNpb25bbW9vZHlUZXN0JFNDT1JFPjQwXSA8LSAnRCdcbmRlY2lzaW9uW21vb2R5VGVzdCRTQ09SRT42MF0gPC0gJ0MnXG5kZWNpc2lvblttb29keVRlc3QkU0NPUkU+NzBdIDwtICdCJ1xu
ZGVjaXNpb25bbW9vZHlUZXN0JFNDT1JFPjgwXSA8LSAnQSdcbm1vb2R5VGVzdCRtb2RlbDIgPC1kZWNpc2lvblxuY29sbmFtZXMobW9vZHkpXG50cmVlX2NvbWJpbmVkPC1ycGFydChHUkFERX4uLCBkYXRhPW1vb2R5LCBtZXRob2Q9J2NsYXNzJylcbnRyZWVfY29tYmluZWRcbmNvbG5hbWVzKG1vb2R5VGVzdClcbiNycGFydC5wbG90KHRyZWVfY29tYmluZWQpXG5wcmVkaWN0KHRyZWVfY29tYmluZWQsIG1vb2R5VGVzdCwgdHlwZT0nY2xhc3MnKVxubnJvdyhtb29keVttb29keSRtb2RlbDEhPW1vb2R5JG1vZGVsMixdKVxubnJvdyhtb29keSlcbmVycm9yPC1tZWFuKG1vb2R5VGVzdCRHUkFERSE9cHJlZGljdCh0cmVlX2NvbWJpbmVkLCBtb29keVRlc3QsIHR5cGU9J2NsYXNzJykpXG5lcnJvciJ9
Table 17.1: Snippet of combined models Dataset
SCORE GRADE DOZES_OFF TEXTING_IN_CLASS PARTICIPATION model1 model2
758 69.03 B never never 0.60 B C
156 46.93 C sometimes rarely 0.03 C D
364 34.50 F sometimes rarely 0.47 D F
753 46.07 C sometimes rarely 0.83 C D
183 58.99 C sometimes never 0.29 C D

Output tree plot after combining model1 and model2

The next snippet shows how to submit your prediction vector to Kaggle by creating your submission data frame.

17.8 Submission of your prediction vector

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxudGVzdDwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L00yMDIydGVzdFNOb0dyYWRlLmNzdicpXG5zdWJtaXNzaW9uPC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvTTIwMjJzdWJtaXNzaW9uLmNzdicpXG50cmFpbiA8LSByZWFkLmNzdihcImh0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L00yMDIydHJhaW4uY3N2XCIpXG5cbnRyZWUgPC0gcnBhcnQoR3JhZGUgfiBNYWpvcitTY29yZStTZW5pb3JpdHksIGRhdGEgPSB0cmFpbiwgbWV0aG9kID0gXCJjbGFzc1wiLGNvbnRyb2w9cnBhcnQuY29udHJvbChtaW5idWNrZXQgPSAyMDApKVxudHJlZVxuXG5wcmVkaWN0aW9uIDwtIHByZWRpY3QodHJlZSwgdGVzdCwgdHlwZT1cImNsYXNzXCIpXG5cbiNOb3cgbWFrZSB5b3VyIHN1Ym1pc3Npb24gZmlsZSAtIGl0IHdpbGwgaGF2ZSB0aGUgSURzIGFuZCBub3cgdGhlIHByZWRpY3RlZCBncmFkZXNcbnN1Ym1pc3Npb24kR3JhZGU8LXByZWRpY3Rpb24gXG5cbiMgdXNlIHdyaXRlLmNzdihzdWJtaXNzaW9uLCAnc3VibWlzc2lvbi5jc3YnLCByb3cubmFtZXM9RkFMU0UpIHRvIHN0b3JlIHN1Ym1pc3Npb24gYXMgY3N2IGZpbGUgb24geW91ciBtYWNoaW5lIGFuZCBzdWJzZXF1ZW50bHkgc3VibWl0IGl0IG9uIEthZ2dsZSJ9

17.9 Additional Reference