Section: 16 🔖 Predictions with rpart

  • Lecture slides:

Decision trees are one of the most powerful and popular tools for classification and prediction. The reason decision trees are very popular is that they can generate rules which are easier to understand as compared to other models. They require much less computations for performing modeling and prediction. Both continuous/numerical and categorical variables are handled easily while creating the decision trees.

16.1 Use of Rpart

Recursive Partitioning and Regression Tree RPART library is a collection of routines which implements a Decision Tree.The resulting model can be represented as a binary tree.

The library associated with this RPART is called rpart. Install this library using install.packages("rpart").

Syntax for building the decision tree using rpart():

  • rpart( formula , method, data, control,...)
    • formula: here we mention the prediction column and the other related columns(predictors) on which the prediction will be based on.
      • prediction ~ predictor1 + predictor2 + predictor3 + ...
    • method: here we describe the type of decision tree we want. If nothing is provided, the function makes an intelligent guess. We can use “anova” for regression, “class” for classification, etc.
    • data: here we provide the dataset on which we want to fit the decision tree on.
    • control: here we provide the control parameters for the decision tree. Explained more in detail in the section further in this chapter.

For more info on the rpart function visit rpart documentation

Lets look at an example on the Moody 2022 dataset.

  • We will use the rpart() function with the following inputs:
    • prediction -> GRADE
    • predictors -> SCORE, DOZES_OFF, TEXTING_IN_CLASS, PARTICIPATION
    • data -> moody dataset
    • method -> “class” for classification.

16.1.1 Snippet 1

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uLlxucnBhcnQoR1JBREUgfiBTQ09SRStET1pFU19PRkYrVEVYVElOR19JTl9DTEFTUytQQVJUSUNJUEFUSU9OLCBkYXRhID0gbW9vZHksbWV0aG9kID0gXCJjbGFzc1wiKSJ9

We can see that the output of the rpart() function is the decision tree with details of,

  • node -> node number
  • split -> split conditions/tests
  • n -> number of records in either branch i.e. subset
  • yval -> output value i.e. the target predicted value.
  • yprob -> probability of obtaining a particular category as the predicted output.

Using the output tree, we can use the predict function to predict the grades of the test data. We will look at this process later in section 16.5

But coming back to the output of the rpart() function, the text type output is useful but difficult to read and understand, right! We will look at visualizing the decision tree in the next section.

16.2 Visualize the Decision tree

To visualize and understand the rpart() tree output in the easiest way possible, we use a library called rpart.plot. The function rpart.plot() of the rpart.plot library is the function used to visualize decision trees.

NOTE: The online runnable code block does not support rpart.plot library and functions, thus the output of the following code examples are provided directly.

16.2.1 Snippet 2

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEZpcnN0IGxldHMgaW1wb3J0IHRoZSBycGFydCBsaWJyYXJ5XG5saWJyYXJ5KHJwYXJ0KVxuXG4jIEltcG9ydCBkYXRhc2V0XG5tb29keTwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L21vb2R5MjAyMl9uZXcuY3N2JylcblxuIyBVc2Ugb2YgdGhlIHJwYXJ0KCkgZnVuY3Rpb24uXG5ycGFydChHUkFERSB+IFNDT1JFK0RPWkVTX09GRitURVhUSU5HX0lOX0NMQVNTK1BBUlRJQ0lQQVRJT04sIGRhdGEgPSBtb29keSxtZXRob2QgPSBcImNsYXNzXCIpXG5cbiMgTm93IGxldHMgaW1wb3J0IHRoZSBycGFydC5wbG90IGxpYnJhcnkgdG8gdXNlIHRoZSBycGFydC5wbG90KCkgZnVuY3Rpb24uXG4jbGlicmFyeShycGFydC5wbG90KVxuXG4jIFVzZSBvZiB0aGUgcnBhcnQucGxvdCgpIGZ1bmN0aW9uICB0byB2aXN1YWxpemUgdGhlIGRlY2lzaW9uIHRyZWUuXG4jcnBhcnQucGxvdCh0cmVlKSJ9

Output Plot of rpart.plot() function

We can see that after plotting the tree using rpart.plot() function, the tree is more readable and provides better information about the splitting conditions, and the probability of outcomes. Each leaf node has information about

  • the grade category.
  • the outcome probability of each grade category.
  • the records percentage out of total records.

To study more in detail the arguments that can be passed to the rpart.plot() function, please look at these guides rpart.plot and Plotting with rpart.plot (PDF)

NOTE: In this chapter, from this point forward, the rpart.plots() generated in any example below will be shown as images, and also the code to generate those rpart.plots will be commented in the interactive code blocks. If you want to generate these plots yourself, please use a local Rstudio or R environment.

16.3 Rpart Control

Now let’s look at the rpart.control() function used to pass the control parameters to the control argument of the rpart() function.

  • rpart.control( *minsplit*, *minbucket*, *cp*,...)
  • minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. For example, minsplit=500 -> the minimum number of observations in a node must be 500 or up, in order to perform the split at the testing condition.
  • minbucket: minimum number of observations in any terminal(leaf) node. For example, minbucket=500 -> the minimum number of observation in the terminal/leaf node of the trees must be 500 or above.
  • cp: complexity parameter. Using this informs the program that any split which does not increase the accuracy of the fit by cp, will not be made in the tree.

For more information of the other arguments of the rpart.control() function visit rpart.control

Let look at few examples.

Suppose you want to set the control parameter minsplit=200.

16.3.1 Snippet 3: Minsplit = 200

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIG1pbnNwbGl0PTIwMFxudHJlZSA8LSBycGFydChHUkFERSB+IFNDT1JFK0RPWkVTX09GRitURVhUSU5HX0lOX0NMQVNTK1BBUlRJQ0lQQVRJT04sIGRhdGEgPSBtb29keSwgbWV0aG9kID0gXCJjbGFzc1wiLGNvbnRyb2w9cnBhcnQuY29udHJvbChtaW5zcGxpdCA9IDIwMCkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUsZXh0cmEgPSAyKSJ9

Output tree plot of after setting minsplit=200 in rpart.control() function

16.3.2 Snippet 4: Minsplit = 100

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIG1pbnNwbGl0PTEwMFxudHJlZSA8LSBycGFydChHUkFERSB+IFNDT1JFK0RPWkVTX09GRitURVhUSU5HX0lOX0NMQVNTK1BBUlRJQ0lQQVRJT04sIGRhdGEgPSBtb29keSwgbWV0aG9kID0gXCJjbGFzc1wiLGNvbnRyb2w9cnBhcnQuY29udHJvbChtaW5zcGxpdCA9IDEwMCkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUsZXh0cmEgPSAyKSJ9

Output tree plot of after setting minsplit=100 in rpart.control() function

We can see from the output of tree$splits and the tree plot, that at each split the total amount of observations are above 200 and 100. Also, in comparison to the tree without control, the tree with control has lower height, and lesser count of splits.

Now, lets set the minbucket parameter to 100, and see how that affects the tree parameters.

16.3.3 Snippet 5: Minbucket = 100

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIE1pbmJ1Y2tldD0xMDBcbnRyZWUgPC0gcnBhcnQoR1JBREUgfiBTQ09SRStET1pFU19PRkYrVEVYVElOR19JTl9DTEFTUytQQVJUSUNJUEFUSU9OLCBkYXRhID0gbW9vZHksIG1ldGhvZCA9IFwiY2xhc3NcIixjb250cm9sPXJwYXJ0LmNvbnRyb2wobWluYnVja2V0ID0gMTAwKSlcblxudHJlZVxuXG4jbGlicmFyeShycGFydC5wbG90KVxuI3JwYXJ0LnBsb3QodHJlZSxleHRyYSA9IDIpIn0=

Output tree plot of after setting minbucket=100 in rpart.control() function

We can see for the output and the tree plot, that the count of observations in each leaf node is greater than 100. Also, the tree height has shortened, suggesting that the control method was able to shorten the tree size.

16.3.4 Snippet 6: Minbucket = 200

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIE1pbmJ1Y2tldD0yMDBcbnRyZWUgPC0gcnBhcnQoR1JBREUgfiBTQ09SRStET1pFU19PRkYrVEVYVElOR19JTl9DTEFTUytQQVJUSUNJUEFUSU9OLCBkYXRhID0gbW9vZHksIG1ldGhvZCA9IFwiY2xhc3NcIixjb250cm9sPXJwYXJ0LmNvbnRyb2wobWluYnVja2V0ID0gMjAwKSlcblxudHJlZVxuXG4jbGlicmFyeShycGFydC5wbG90KVxuI3JwYXJ0LnBsb3QodHJlZSxleHRyYSA9IDIpIn0=

Output tree plot of after setting minbucket=200 in rpart.control() function

We can see for the output and the tree plot, that the count of observations in each leaf node is greater than 200. Also, the tree height has shortened, suggesting that the control method was able to shorten the tree size.

Lets now use the cp parameter and see its effect on the tree.

16.3.5 Snippet 7: cp = 0.05

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIGNwPTAuMlxudHJlZSA8LSBycGFydChHUkFERSB+IC4sIGRhdGEgPSBtb29keSxtZXRob2QgPSBcImNsYXNzXCIsY29udHJvbD1ycGFydC5jb250cm9sKGNwID0gMC4wNSkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUpIn0=

Output tree plot of after setting cp=0.05 in rpart.control() function

16.3.6 Snippet 8: cp = 0.005

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5cbiMgVXNlIG9mIHRoZSBycGFydCgpIGZ1bmN0aW9uIHdpdGggdGhlIGNvbnRyb2wgcGFyYW1ldGVyIGNwPTAuMDA1XG50cmVlIDwtIHJwYXJ0KEdSQURFIH4gLiwgZGF0YSA9IG1vb2R5LG1ldGhvZCA9IFwiY2xhc3NcIixjb250cm9sPXJwYXJ0LmNvbnRyb2woY3AgPSAwLjAwNSkpXG5cbnRyZWVcblxuI2xpYnJhcnkocnBhcnQucGxvdClcbiNycGFydC5wbG90KHRyZWUpIn0=

Output tree plot of after setting cp=0.005 in rpart.control() function We can see for the output and the tree plot, that the tree size has increased, with increase in number of splits, and leaf nodes. Also we can see that the minimum CP value in the output is 0.005.

16.4 Cross Validation

Overfitting takes place when you have a high accuracy on training dataset, but a low accuracy on the test dataset. But how do you know whether you are overfitting or not? Especially since you cannot determine accuracy on the test dataset? That is where cross-validation comes into play.

Because we cannot determine accuracy on test dataset, we partition our training dataset into train and validation (testing). We train our model (rpart or lm) on train partition and test on the validation partition. The partition is defined by split ratio. If split ratio =0.7, 70% of the training dataset will be used for the actual training of your model (rpart or lm), and 30 % will be used for validation (or testing). The accuracy of this validation data is called cross-validation accuracy.

To know if you are overfitting or not, compare the training accuracy with the cross-validation accuracy. If your training accuracy is high, and cross-validation accuracy is low, that means you are overfitting.

  • cross_validate(*data*, *tree*, *n_iter*, *split_ratio*, *method*)
    • data: The dataset on which cross validation is to be performed.
    • tree: The decision tree generated using rpart.
    • n_iter: Number of iterations.
    • split_ratio: The splitting ratio of the data into train data and validation data.
    • method: Method of the prediction. “class” for classification.

The way the function works is as follows:

  • It randomly partitions your data into training and validation.
  • It then constructs the following two decision trees on training partition:
    • The tree that you pass to the function.
    • The tree is constructed on all attributes as predictors and with no control parameters. -It then determines the accuracy of the two trees on validation partition and returns you the accuracy values for both the trees.

The values in the first column(accuracy_subset) returned by cross-validation function are more important when it comes to detecting overfitting. If these values are much lower than the training accuracy you get, that means you are overfitting.

We would also want the values in accuracy_subset to be close to each other (in other words, have low variance). If the values are quite different from each other, that means your model (or tree) has a high variance which is not desired.

The second column(accuracy_all) tells you what happens if you construct a tree based on all attributes. If these values are larger than accuracy_subset, that means you are probably leaving out attributes from your tree that are relevant.

Each iteration of cross-validation creates a different random partition of train and validation, and so you have possibly different accuracy values for every iteration.

Let’s look at the cross_validate() function in action in the example below.

We will pass the tree with formula as GRADE ~ SCORE+DOZES_OFF+TEXTING_IN_CLASS+PARTICIPATION, and control parameter, with minsplit=100. And for cross_validate() function, we will usen_iter=5, and split_raitio=0.7

NOTE: Cross-Validation repository is already preloaded for the following interactive code block. Thus you can directly use the cross_validate() function in the following interactive code block. But if you wish to use the code_validate() function locally, please use
install.packages("devtools") 
devtools::install_github("devanshagr/CrossValidation")
CrossValidation::cross_validate()

16.4.1 Snippet 9

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImNyb3NzX3ZhbGlkYXRlIDwtIGZ1bmN0aW9uKGRmLCB0cmVlLCBuX2l0ZXIsIHNwbGl0X3JhdGlvLCBtZXRob2QgPSAnY2xhc3MnKVxue1xuICAjIHRyYWluaW5nIGRhdGEgZnJhbWUgZGZcbiAgZGYgPC0gYXMuZGF0YS5mcmFtZShkZilcblxuICAjIG1lYW5fc3Vic2V0IGlzIGEgdmVjdG9yIG9mIGFjY3VyYWN5IHZhbHVlcyBnZW5lcmF0ZWQgZnJvbSB0aGUgc3BlY2lmaWVkIGZlYXR1cmVzIGluIHRoZSB0cmVlIG9iamVjdFxuICBtZWFuX3N1YnNldCA8LSBjKClcblxuICAjIG1lYW5fYWxsIGlzIGEgdmVjdG9yIG9mIGFjY3VyYWN5IHZhbHVlcyBnZW5lcmF0ZWQgZnJvbSBhbGwgdGhlIGF2YWlsYWJsZSBmZWF0dXJlcyBpbiB0aGUgZGF0YSBmcmFtZVxuICBtZWFuX2FsbCA8LSBjKClcblxuICAjIGNvbnRyb2wgcGFyYW1ldGVycyBmb3IgdGhlIGRlY2lzaW9uIHRyZWVcbiAgY29udHJvID0gdHJlZSRjb250cm9sXG5cbiAgIyB0aGUgZm9sbG93aW5nIHNuaXBwZXQgd2lsbCBjcmVhdGUgcmVsYXRpb25zIHRvIGdlbmVyYXRlIGRlY2lzaW9uIHRyZWVzXG4gICMgcmVsYXRpb25fYWxsIHdpbGwgY3JlYXRlIGEgZGVjaXNpb24gdHJlZSB3aXRoIGFsbCB0aGUgZmVhdHVyZXNcbiAgIyByZWxhdGlvbl9zdWJzZXQgd2lsbCBjcmVhdGUgYSBkZWNpc2lvbiB0cmVlIHdpdGggb25seSB1c2VyLXNwZWNpZmllZCBmZWF0dXJlcyBpbiB0cmVlXG4gIGRlcCA8LSBhbGwudmFycyh0ZXJtcyh0cmVlKSlbMV1cbiAgaW5kZXAgPC0gbGlzdCgpXG4gIHJlbGF0aW9uX2FsbCA9IGFzLmZvcm11bGEocGFzdGUoZGVwLCAnLicsIHNlcCA9IFwiflwiKSlcbiAgaSA8LSAxXG4gIHdoaWxlIChpIDwgbGVuZ3RoKGFsbC52YXJzKHRlcm1zKHRyZWUpKSkpIHtcbiAgICBpbmRlcFtbaV1dIDwtIGFsbC52YXJzKHRlcm1zKHRyZWUpKVtpICsgMV1cbiAgICBpIDwtIGkgKyAxXG4gIH1cbiAgYiA8LSBwYXN0ZShpbmRlcCwgY29sbGFwc2UgPSBcIitcIilcbiAgcmVsYXRpb25fc3Vic2V0IDwtIGFzLmZvcm11bGEocGFzdGUoZGVwLCBiLCBzZXAgPSBcIn5cIikpXG5cbiAgIyBjcmVhdGluZyB0cmFpbiBhbmQgdGVzdCBzYW1wbGVzIHdpdGggdGhlIGdpdmVuIHNwbGl0IHJhdGlvXG4gICMgcGVyZm9ybWluZyBjcm9zcy12YWxpZGF0aW9uIG5faXRlciB0aW1lc1xuICBmb3IgKGkgaW4gMTpuX2l0ZXIpIHtcbiAgICBzYW1wbGUgPC1cbiAgICAgIHNhbXBsZS5pbnQobiA9IG5yb3coZGYpLFxuICAgICAgICAgICAgICAgICBzaXplID0gZmxvb3Ioc3BsaXRfcmF0aW8gKiBucm93KGRmKSksXG4gICAgICAgICAgICAgICAgIHJlcGxhY2UgPSBGKVxuICAgIHRyYWluIDwtIGRmW3NhbXBsZSxdXG4gICAgdGVzdGluZyAgPC0gZGZbLXNhbXBsZSxdXG4gICAgdHlwZSA9IHR5cGVvZih1bmxpc3QodGVzdGluZ1tkZXBdKSlcblxuICAgICMgZGVjaXNpb24gdHJlZSBmb3IgcmVncmVzc2lvbiBpZiB0aGUgbWV0aG9kIHNwZWNpZmllZCBpcyBcImFub3ZhXCJcbiAgICBpZiAobWV0aG9kID09ICdhbm92YScpIHtcbiAgICAgIGZpcnN0LnRyZWUgPC1cbiAgICAgICAgcnBhcnQoXG4gICAgICAgICAgcmVsYXRpb25fc3Vic2V0LFxuICAgICAgICAgIGRhdGEgPSB0cmFpbixcbiAgICAgICAgICBjb250cm9sID0gY29udHJvLFxuICAgICAgICAgIG1ldGhvZCA9ICdhbm92YSdcbiAgICAgICAgKVxuICAgICAgc2Vjb25kLnRyZWUgPC0gcnBhcnQocmVsYXRpb25fYWxsLCBkYXRhID0gdHJhaW4sIG1ldGhvZCA9ICdhbm92YScpXG4gICAgICBwcmVkMS50cmVlIDwtIHByZWRpY3QoZmlyc3QudHJlZSwgbmV3ZGF0YSA9IHRlc3RpbmcpXG4gICAgICBwcmVkMi50cmVlIDwtIHByZWRpY3Qoc2Vjb25kLnRyZWUsIG5ld2RhdGEgPSB0ZXN0aW5nKVxuICAgICAgbWVhbjEgPC0gbWVhbigoYXMubnVtZXJpYyhwcmVkMS50cmVlKSAtIHRlc3RpbmdbLCBkZXBdKSBeIDIpXG4gICAgICBtZWFuMiA8LSBtZWFuKChhcy5udW1lcmljKHByZWQyLnRyZWUpIC0gdGVzdGluZ1ssIGRlcF0pIF4gMilcbiAgICAgIG1lYW5fc3Vic2V0IDwtIGMobWVhbl9zdWJzZXQsIG1lYW4xKVxuICAgICAgbWVhbl9hbGwgPC0gYyhtZWFuX2FsbCwgbWVhbjIpXG4gICAgfVxuXG4gICAgIyBkZWNpc2lvbiB0cmVlIGZvciBjbGFzc2lmaWNhdGlvblxuICAgICMgaWYgdGhlIG1ldGhvZCBzcGVjaWZpZWQgaXMgbm90IFwiYW5vdmFcIiwgdGhlbiB0aGlzIGJsb2NrIGlzIGV4ZWN1dGVkXG4gICAgIyBpZiB0aGUgbWV0aG9kIGlzIG5vdCBzcGVjaWZpZWQgYnkgdGhlIHVzZXIsIHRoZSBkZWZhdWx0IG9wdGlvbiBpcyB0byBwZXJmb3JtIGNsYXNzaWZpY2F0aW9uXG4gICAgZWxzZXtcbiAgICAgIGZpcnN0LnRyZWUgPC1cbiAgICAgICAgcnBhcnQoXG4gICAgICAgICAgcmVsYXRpb25fc3Vic2V0LFxuICAgICAgICAgIGRhdGEgPSB0cmFpbixcbiAgICAgICAgICBjb250cm9sID0gY29udHJvLFxuICAgICAgICAgIG1ldGhvZCA9ICdjbGFzcydcbiAgICAgICAgKVxuICAgICAgc2Vjb25kLnRyZWUgPC0gcnBhcnQocmVsYXRpb25fYWxsLCBkYXRhID0gdHJhaW4sIG1ldGhvZCA9ICdjbGFzcycpXG4gICAgICBwcmVkMS50cmVlIDwtIHByZWRpY3QoZmlyc3QudHJlZSwgbmV3ZGF0YSA9IHRlc3RpbmcsIHR5cGUgPSAnY2xhc3MnKVxuICAgICAgcHJlZDIudHJlZSA8LVxuICAgICAgICBwcmVkaWN0KHNlY29uZC50cmVlLCBuZXdkYXRhID0gdGVzdGluZywgdHlwZSA9ICdjbGFzcycpXG4gICAgICBtZWFuMSA8LVxuICAgICAgICBtZWFuKGFzLmNoYXJhY3RlcihwcmVkMS50cmVlKSA9PSBhcy5jaGFyYWN0ZXIodGVzdGluZ1ssIGRlcF0pKVxuICAgICAgbWVhbjIgPC1cbiAgICAgICAgbWVhbihhcy5jaGFyYWN0ZXIocHJlZDIudHJlZSkgPT0gYXMuY2hhcmFjdGVyKHRlc3RpbmdbLCBkZXBdKSlcbiAgICAgIG1lYW5fc3Vic2V0IDwtIGMobWVhbl9zdWJzZXQsIG1lYW4xKVxuICAgICAgbWVhbl9hbGwgPC0gYyhtZWFuX2FsbCwgbWVhbjIpXG4gICAgfVxuICB9XG5cbiAgIyBhdmVyYWdlX2FjY3VyYWN5X3N1YnNldCBpcyB0aGUgYXZlcmFnZSBhY2N1cmFjeSBvZiBuX2l0ZXIgaXRlcmF0aW9ucyBvZiBjcm9zcy12YWxpZGF0aW9uIHdpdGggdXNlci1zcGVjaWZpZWQgZmVhdHVyZXNcbiAgIyBhdmVyYWdlX2FjdXJhY3lfYWxsIGlzIHRoZSBhdmVyYWdlIGFjY3VyYWN5IG9mIG5faXRlciBpdGVyYXRpb25zIG9mIGNyb3NzLXZhbGlkYXRpb24gd2l0aCBhbGwgdGhlIGF2YWlsYWJsZSBmZWF0dXJlc1xuICAjIHZhcmlhbmNlX2FjY3VyYWN5X3N1YnNldCBpcyB0aGUgdmFyaWFuY2Ugb2YgYWNjdXJhY3kgb2Ygbl9pdGVyIGl0ZXJhdGlvbnMgb2YgY3Jvc3MtdmFsaWRhdGlvbiB3aXRoIHVzZXItc3BlY2lmaWVkIGZlYXR1cmVzXG4gICMgdmFyaWFuY2VfYWNjdXJhY3lfYWxsIGlzIHRoZSB2YXJpYW5jZSBvZiBhY2N1cmFjeSBvZiBuX2l0ZXIgaXRlcmF0aW9ucyBvZiBjcm9zcy12YWxpZGF0aW9uIHdpdGggYWxsIHRoZSBhdmFpbGFibGUgZmVhdHVyZXNcbiAgY3Jvc3NfdmFsaWRhdGlvbl9zdGF0cyA8LVxuICAgIGxpc3QoXG4gICAgICBcImF2ZXJhZ2VfYWNjdXJhY3lfc3Vic2V0XCIgPSBtZWFuKG1lYW5fc3Vic2V0LCBuYS5ybSA9IFQpLFxuICAgICAgXCJhdmVyYWdlX2FjY3VyYWN5X2FsbFwiID0gbWVhbihtZWFuX2FsbCwgbmEucm0gPSBUKSxcbiAgICAgIFwidmFyaWFuY2VfYWNjdXJhY3lfc3Vic2V0XCIgPSB2YXIobWVhbl9zdWJzZXQsIG5hLnJtID0gVCksXG4gICAgICBcInZhcmlhbmNlX2FjY3VyYWN5X2FsbFwiID0gdmFyKG1lYW5fYWxsLCBuYS5ybSA9IFQpXG4gICAgKVxuXG4gICMgY3JlYXRpbmcgYSBkYXRhIGZyYW1lIG9mIGFjY3VyYWN5X3N1YnNldCBhbmQgYWNjdXJhY3lfYWxsXG4gICMgYWNjdXJhY3lfc3Vic2V0IGNvbnRhaW5zIG5faXRlciBhY2N1cmFjeSB2YWx1ZXMgb24gY3Jvc3MtdmFsaWRhdGlvbiB3aXRoIHVzZXItc3BlY2lmaWVkIGZlYXR1cmVzXG4gICMgYWNjdXJhY3lfYWxsIGNvbnRhaW5zIG5faXRlciBhY2N1cmFjeSB2YWx1ZXMgb24gY3Jvc3MtdmFsaWRhdGlvbiB3aXRoIGFsbCB0aGUgYXZhaWxhYmxlIGZlYXR1cmVzXG4gIGNyb3NzX3ZhbGlkYXRpb25fZGYgPC1cbiAgICBkYXRhLmZyYW1lKGFjY3VyYWN5X3N1YnNldCA9IG1lYW5fc3Vic2V0LCBhY2N1cmFjeV9hbGwgPSBtZWFuX2FsbClcbiAgcmV0dXJuKGxpc3QoY3Jvc3NfdmFsaWRhdGlvbl9kZiwgY3Jvc3NfdmFsaWRhdGlvbl9zdGF0cykpXG59Iiwic2FtcGxlIjoiIyBGaXJzdCBsZXRzIGltcG9ydCB0aGUgcnBhcnQgbGlicmFyeVxubGlicmFyeShycGFydClcbiMgSW1wb3J0IGRhdGFzZXRcbm1vb2R5PC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvbW9vZHkyMDIyX25ldy5jc3YnLHN0cmluZ3NBc0ZhY3RvcnMgPSBUKVxuIyBVc2Ugb2YgdGhlIHJwYXJ0KCkgZnVuY3Rpb24uXG50cmVlIDwtIHJwYXJ0KEdSQURFIH4gU0NPUkUrRE9aRVNfT0ZGK1RFWFRJTkdfSU5fQ0xBU1MsIGRhdGEgPSBtb29keSxtZXRob2QgPSBcImNsYXNzXCIsY29udHJvbCA9IHJwYXJ0LmNvbnRyb2wobWluc3BsaXQgPSAxMDApKVxudHJlZVxuIyBOb3cgbGV0cyBwcmVkaWN0IHRoZSBHcmFkZXMgb2YgdGhlIE1vb2R5IERhdGFzZXQuXG5wcmVkIDwtIHByZWRpY3QodHJlZSwgbW9vZHksIHR5cGU9XCJjbGFzc1wiKVxuaGVhZChwcmVkKVxuIyBMZXRzIGNoZWNrIHRoZSBUcmFpbmluZyBBY2N1cmFjeVxubWVhbihtb29keSRHUkFERT09cHJlZClcbiMgTGV0cyB1cyB0aGUgY3Jvc3NfdmFsaWRhdGUoKSBmdW5jdGlvbi5cbmNyb3NzX3ZhbGlkYXRlKG1vb2R5LHRyZWUsNSwwLjcpIn0=

You can see that the cross-validation accuracies for the tree that was passed (accuracy_subset) are fairly high and close to our training accuracy of 84%. This means we are not overfitting. Also observe that accuracy_subset and accuracy_all have the same values, which means that the only relevant attributes are score and participation, and adding more attributes doesn’t make any difference to the tree. Finally, the values in accuracy_subset are reasonably close to each other, which mean low variance.

16.5 Prediction using rpart.

Now that we have seen the process to create a decision tree and also plot it, we will like to use the output tree to predict the required attribute.

From the moody example, we are trying to predict the grade of students. Lets look at the predict() function to predict the outcomes.

  • predict(*object*,*data*,*type*,...)
    • object: the generated tree from the rpart function.
    • data: the data on which the prediction is to be performed.
    • type: the type of prediction required. One of “vector”, “prob”, “class” or “matrix”.

Now lets use the predict function to predict the grades of students using the tree generated on the Moody dataset.

16.5.1 Snippet 10

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjIEZpcnN0IGxldHMgaW1wb3J0IHRoZSBycGFydCBsaWJyYXJ5XG5saWJyYXJ5KHJwYXJ0KVxuXG4jIEltcG9ydCBkYXRhc2V0XG5tb29keTwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L21vb2R5MjAyMl9uZXcuY3N2JylcblxuIyBVc2Ugb2YgdGhlIHJwYXJ0KCkgZnVuY3Rpb24uXG50cmVlIDwtIHJwYXJ0KEdSQURFIH4gU0NPUkUrRE9aRVNfT0ZGK1RFWFRJTkdfSU5fQ0xBU1MrUEFSVElDSVBBVElPTiwgZGF0YSA9IG1vb2R5ICxtZXRob2QgPSBcImNsYXNzXCIpXG50cmVlXG5cbiMgTm93IGxldHMgcHJlZGljdCB0aGUgR3JhZGVzIG9mIHRoZSBNb29keSBEYXRhc2V0LlxucHJlZCA8LSBwcmVkaWN0KHRyZWUsIG1vb2R5LCB0eXBlPVwiY2xhc3NcIilcbmhlYWQocHJlZCkifQ==

16.6 Snippet 11: Your Model with rpart

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiIjSG93IHRvIGNvbWJpbmUgeW91ciBmcmVlc3R5bGUgcHJlZGljdGlvbiBtb2RlbCB3aXRoIHRoZSBycGFydD8gXG5cbiNPbmUgd2F5IG9mIGRvaW5nIGl0IGlzIHRvIGRpdmlkZSB0aGUgZGF0YSBzZXRzIGludG8gdHdvIG11dHVhbGx5IGV4Y2x1c2l2ZSBzdWJzZXRzICh3aGljaCBjb3ZlciBhbGwgZGF0YSBhbHNvKS4gIEhvdyBkbyB5b3UgbWFrZSB0aGVzZSBzdWJzZXRzPyAgVW5mb3J0dW5hdGVseSB0aGVyZSBpcyBubyBhbGdvcml0aG0gZm9yIHRoaXMgYW5kIGl0IGlzIG1vcmUgcmVseWluZyBvbiBob3cgd2VsbCBpcyB5b3VyIG1vZGVsIGRvaW5nIGZvciBkaWZmZXJlbnQgc2xpY2VzIG9mIHRoZSBkYXRhLiAgXG5cbiNJbiB0aGlzIGV4YW1wbGUgKHNpbWlsYXJseSB0byBzbmlwcGV0IDE2Ljcgd2hlcmUgd2UgY29tYmluZSB0d28gcnBhcnQgbW9kZWxzLCB3ZSBhc3N1bWUgdGhhdCBpbml0aWFsIHNwbGl0IHdlIGRlY2lkZWQgb24gaXMgYmFzZWQgb24gU0NPUkUuIEJ1dCBpbnN0ZWFkIG9mIGhhdmluZyB0d28gcnBhcnQgbW9kZWxzICAoMTYuNyksIHdlIHdpbGwgdXNlIG91ciBwcmVkaWN0aW9uICBtb2RlbCBmcm9tIHByZWRpY3Rpb24gY2hhbGxlbmdlIDEgIGZvciBTQ09SRSA+NTAgYW5kIHJwYXJ0IGZvciBTQ09SRSA8PTUwLlxuXG4jTGV0cyBhc3N1bWUgdGhhdCB5b3VyUHJlZGljdGlvbiBpcyBvdXIgbW9kZWwgZnJvbSBQcmVkaWN0aW9uIENoYWxsZW5nZSAxICh5b3VyIGVudGlyZSBjb2RlIGhhcyB0byBiZSBhcHBsaWVkIGhlcmUgdG8gdGhlIGRhdGEgc2V0IChtb29keSwgYmVsb3cpXG5cbm1vb2R5PC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvbW9vZHkyMDIyX25ldy5jc3YnKVxuXG4jcnBhcnRNb2RlbDwtcnBhcnQoR1JBREV+LiwgZGF0YT1tb29keVttb29keSRTQ09SRTw9NTAsXSk7XG4jcHJlZF9ycGFydE1vZGVsIDwtIHByZWRpY3QocnBhcnRNb2RlbCwgbmV3ZGF0YT1tb29keVttb29keSRTQ09SRTw9NTAsXSwgdHlwZT1cImNsYXNzXCIpXG4jcHJlZF95b3VyTW9kZWwgPC0geW91clByZWRpY3Rpb25bbW9vZHkkU0NPUkU8PTUwXVxuI215cHJlZGljdGlvbjwtbW9vZHlcblxuIyMgSGVyZSB3ZSBjb21iaW5lIHR3byBtb2RlbHMgLSBvdXIgbW9kZWwgZnJvbSBwcmVkaWN0aW9uIDEgY2hhbGxlbmdlIGFuZCBycGFydC5cblxuI2RlY2lzaW9uIDwtIHJlcCgnRicsbnJvdyhteXByZWRpY3Rpb24pKVxuI2RlY2lzaW9uW215cHJlZGljdGlvbiRTQ09SRT41MF0gPC0gcHJlZF95b3VyTW9kZWxcbiNkZWNpc2lvbltteXByZWRpY3Rpb24kU0NPUkU8PTUwXSA8LWFzLmNoYXJhY3RlcihwcmVkX3JwYXJ0TW9kZWwgKVxuI215cHJlZGljdGlvbiRHUkFERSA8LWRlY2lzaW9uXG4jZXJyb3IgPC0gbWVhbihtb29keSRHUkFERSE9IG15cHJlZGljdGlvbiRHUkFERVxuI2Vycm9yIn0=

16.7 Snippet 12: Freestyle + rpart: Combining rpart prediction models

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxuIyBJbXBvcnQgZGF0YXNldFxubW9vZHk8LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjJfbmV3LmNzdicpXG5tb2RlbDE8LXJwYXJ0KEdSQURFfi4sIGRhdGE9bW9vZHlbbW9vZHkkU0NPUkU+NTAsXSk7XG5tb2RlbDI8LXJwYXJ0KEdSQURFfi4sIGRhdGE9bW9vZHlbbW9vZHkkU0NPUkU8PTUwLF0pO1xubW9kZWwxXG5tb2RlbDJcbnByZWQxIDwtIHByZWRpY3QobW9kZWwxLCBuZXdkYXRhPW1vb2R5W21vb2R5JFNDT1JFPjUwLF0sIHR5cGU9XCJjbGFzc1wiKVxucHJlZDIgPC0gcHJlZGljdChtb2RlbDIsIG5ld2RhdGE9bW9vZHlbbW9vZHkkU0NPUkU8PTUwLF0sIHR5cGU9XCJjbGFzc1wiKVxubXlwcmVkaWN0aW9uPC1tb29keVxuZGVjaXNpb24gPC0gcmVwKCdGJyxucm93KG15cHJlZGljdGlvbikpXG5kZWNpc2lvbltteXByZWRpY3Rpb24kU0NPUkU+NTBdIDwtIGFzLmNoYXJhY3RlcihwcmVkMSlcbmRlY2lzaW9uW215cHJlZGljdGlvbiRTQ09SRTw9NTBdIDwtYXMuY2hhcmFjdGVyKHByZWQyKVxubXlwcmVkaWN0aW9uJEdSQURFIDwtZGVjaXNpb25cbmVycm9yIDwtIG1lYW4obW9vZHkkR1JBREUhPSBteXByZWRpY3Rpb24kR1JBREUpXG5lcnJvciJ9

16.8 Snippet 13: Submission with rpart

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxudGVzdDwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L00yMDIydGVzdFNOb0dyYWRlLmNzdicpXG5zdWJtaXNzaW9uPC1yZWFkLmNzdignaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvTTIwMjJzdWJtaXNzaW9uLmNzdicpXG50cmFpbiA8LSByZWFkLmNzdihcImh0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L00yMDIydHJhaW4uY3N2XCIpXG5cbnRyZWUgPC0gcnBhcnQoR3JhZGUgfiBNYWpvcitTY29yZStTZW5pb3JpdHksIGRhdGEgPSB0cmFpbiwgbWV0aG9kID0gXCJjbGFzc1wiLGNvbnRyb2w9cnBhcnQuY29udHJvbChtaW5idWNrZXQgPSAyMDApKVxudHJlZVxuXG5wcmVkaWN0aW9uIDwtIHByZWRpY3QodHJlZSwgdGVzdCwgdHlwZT1cImNsYXNzXCIpXG5cbiNOb3cgbWFrZSB5b3VyIHN1Ym1pc3Npb24gZmlsZSAtIGl0IHdpbGwgaGF2ZSB0aGUgSURzIGFuZCBub3cgdGhlIHByZWRpY3RlZCBncmFkZXNcbnN1Ym1pc3Npb24kR3JhZGU8LXByZWRpY3Rpb24gXG5cbiMgdXNlIHdyaXRlLmNzdihzdWJtaXNzaW9uLCAnc3VibWlzc2lvbi5jc3YnLCByb3cubmFtZXM9RkFMU0UpIHRvIHN0b3JlIHN1Ym1pc3Npb24gYXMgY3N2IGZpbGUgb24geW91ciBtYWNoaW5lIGFuZCBzdWJzZXF1ZW50bHkgc3VibWl0IGl0IG9uIEthZ2dsZSJ9