Section: 8 Data Transformation with Derived Attributes
R allows creating new data frame attributes (columns) "on the fly". These are new vectors, often defined as functions of existing attributes; hence the name derived attributes.
Derived attributes play an important role both in data exploration and in building prediction models. Very often, derived attributes allow the discovery of important patterns in data, and they may be more predictive than the original attributes of the imported data set.
The term feature engineering is often used in machine learning to describe the creation of derived attributes.
8.1 Making new categorical attributes
Here we define a new attribute PF (Pass/Fail), initially set to "Pass". Students who got an A, B, or C passed; students who received an F failed. We are grouping the values of Grade into the two categories of a new categorical attribute, PF. The final step replaces "Pass" with "Fail" for students who received an F.
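A minimal sketch of this transformation is shown below; the toy moody data frame and its Grade values are assumptions used only for illustration.

```r
# A minimal sketch, assuming a data frame moody with a Grade column
# holding letter grades such as "A", "B", "C", "F".
moody <- data.frame(Grade = c("A", "C", "F", "B", "F"))

moody$PF <- "Pass"                        # start with "Pass" for every student
moody$PF[moody$Grade == "F"] <- "Fail"    # replace "Pass" with "Fail" for F students
moody$PF <- as.factor(moody$PF)           # treat PF as a categorical attribute

table(moody$PF)
```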
8.2 Making a categorical attribute from a numerical attribute using the cut() function
- The cut() function divides the range of x into intervals and provides the ability to label those intervals as well. It plays an important role in defining derived attributes from numerical attributes. A short sketch of its use follows.
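The break points and labels below are assumptions chosen only to illustrate cut(); adjust them to your own data.

```r
# A minimal sketch of cut(); break points and labels are illustrative.
score <- c(35, 58, 72, 81, 95)

# Divide the range of score into four intervals and attach labels to them
scoreBand <- cut(score,
                 breaks = c(0, 50, 70, 90, 100),
                 labels = c("low", "medium", "high", "top"),
                 include.lowest = TRUE)

scoreBand    # a factor with values: low, medium, high, high, top
```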
8.3 Making a new numerical attribute from numerical attributes
Suppose we would like to combine score and participation into one combined score. We can define a new numerical attribute from score and participation, and the moody data frame will be expanded by this additional attribute.
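A minimal sketch of one possible combined score follows; it assumes moody has numeric score (0-100) and participation (0-1) columns, and the bonus formula is just an illustration.

```r
# A minimal sketch, assuming numeric columns score (0-100) and
# participation (0-1); names and values are illustrative.
moody <- data.frame(score = c(78, 62, 90),
                    participation = c(0.4, 0.9, 0.7))

# Derived attribute: the score plus up to 10 bonus points for participation
moody$combinedScore <- moody$score + moody$participation * 10

moody    # the data frame now contains the additional combinedScore column
```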
8.4 A more complex example of defining derived attributes
The way we defined the combined score attribute rewards students even for poor participation: their combined score is always higher than their in-class score, even if their participation was quite low. It would make more sense to define the combined score so that it penalizes poor participation and rewards good participation.
The next snippet illustrates the definition of such a new numerical attribute, adjustedScore, in the moody data frame. adjustedScore penalizes low participation and rewards good participation.
The score is adjusted by the value of the participation attribute in the following way:
- If participation is larger than 0.5, a bonus of participation * 10 points is added to the score.
- If participation is smaller than 0.5, a penalty of (1 - participation) * 10 points is subtracted from the score.
In this way, someone with very low participation is penalized by close to 10 points (subtracted from the score), while someone with perfect participation (1.0) receives a 10-point bonus.
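A sketch of such an adjustedScore definition is given below, under the same assumptions about the score and participation columns; the rule for participation of exactly 0.5 is not specified above, and here it falls into the penalty branch.

```r
# A minimal sketch, assuming numeric columns score (0-100) and participation (0-1).
moody <- data.frame(score = c(78, 62, 90),
                    participation = c(0.05, 0.9, 0.7))

moody$adjustedScore <- ifelse(moody$participation > 0.5,
                              moody$score + moody$participation * 10,         # bonus
                              moody$score - (1 - moody$participation) * 10)   # penalty

moody    # e.g. participation 0.05 loses 9.5 points, 0.9 gains 9 points
```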
We are now able to transform our data by slicing and dicing rows and columns using the subset() function (or sub-data frames), and we can also add new attributes as shown above. Data transformation is critical not only in data exploration and plotting but, above all, in building high-quality prediction models, as we will show later.