Basic Functions
mean()
- The mean() function returns the average of the values in a numeric vector.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjBiLmNzdlwiKVxuXG4jTGV0cyBsb29rIGF0IHRoZSBtZWFuIG9mIHNjb3JlIGNvbHVtbi5cbm1lYW4obW9vZHkkc2NvcmUpIn0=
length()
- The length() function returns the number of elements in any vector.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjBiLmNzdlwiKVxuXG4jTGV0cyBsb29rIGF0IHRoZSBsZW5ndGggb2YgdGhlIGdyYWRlIGNvbHVtbiBcbmxlbmd0aChtb29keSRncmFkZSkifQ==
max()
- The max() function returns the maximum value in a numeric vector.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjBiLmNzdlwiKVxuXG4jbGV0cyBsb29rIGF0IHRoZSBtYXhpbXVtIHZhbHVlIG9mIHRoZSBzY29yZSBpbiB0aGUgc2NvcmUgY29sdW1uXG5tYXgobW9vZHkkc2NvcmUpIn0=
min()
- The min() function returns the minimum value in a numeric vector.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjBiLmNzdlwiKVxuXG4jTGV0cyBsb29rIGF0IHRoZSBtaW5pbXVtIHZhbHVlIG9mIHNjb3JlIGluIHRoZSBzY29yZSBjb2x1bW4uXG5taW4obW9vZHkkc2NvcmUpIn0=
sd()
- The sd() function returns the standard deviation of a numeric vector.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjBiLmNzdlwiKVxuXG4jTGV0cyBsb29rIGF0IHRoZSBzdGFuZGFyZCBkZXZpYXRpb24gb2Ygc2NvcmUgY29sdW1uXG5zZChtb29keSRzY29yZSkifQ==
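As a quick check that does not depend on downloading the Moody data, the five functions above can be tried on a small hand-made vector. The vector `scores` and the variable names below are made up for illustration and are not part of the data set.

```r
# Hypothetical scores, invented for this illustration only
scores <- c(70, 85, 90, 55, 100)

avg     <- mean(scores)    # average of the values: 400 / 5 = 80
n       <- length(scores)  # number of elements: 5
highest <- max(scores)     # largest value: 100
lowest  <- min(scores)     # smallest value: 55
spread  <- sd(scores)      # sample standard deviation (divides by n - 1)

print(c(mean = avg, length = n, max = highest, min = lowest, sd = spread))
```

Note that sd() computes the sample standard deviation, so it divides by n - 1 rather than n.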
Subset
Snippet 1 - Example of the subset function
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuI1N1YnNldCBvZiByb3dzXG5tb29keV9uZXZlcl9zbWFydHBob25lPC1zdWJzZXQobW9vZHksT05fU01BUlRQSE9ORT09XCJuZXZlclwiKVxubnJvdyhtb29keSlcbm5yb3cobW9vZHlfbmV2ZXJfc21hcnRwaG9uZSkifQ==
Snippet 2 - Example of the subset function
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuI1N1YnNldCBvZiByb3dzXG5tb29keTE8LXN1YnNldChtb29keSxPTl9TTUFSVFBIT05FPT1cIm5ldmVyXCIpXG4jIFlvdSBjYW4gc2VlIG9ubHkgc3R1ZGVudCBuZXZlciBvbiBzbWFydHBob25lIGFyZSBpbiB0aGUgc3Vic2V0LlxudGFibGUobW9vZHkxJE9OX1NNQVJUUEhPTkUpICJ9
Snippet 3 - Subset as a subframe
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuI0FsdGVybmF0ZSB3YXkgdG8gc3Vic2V0LlxubW9vZHkyPC1tb29keVttb29keSRPTl9TTUFSVFBIT05FPT1cIm5ldmVyXCIsIF1cbiMgWW91IGNhbiBzZWUgYSBzaW1pbGFyIHRhYmxlIGFzIGFib3ZlLlxudGFibGUobW9vZHkyJE9OX1NNQVJUUEhPTkUpICJ9
Snippet 4 - Subsetting columns
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuY29sbmFtZXMobW9vZHkpXG4jc3Vic2V0IG9mIGNvbHVtbnNcbm1vb2R5Mzwtc3Vic2V0KG1vb2R5LCBzZWxlY3QgPSAtYygxKSlcbm5jb2wobW9vZHkzKVxuIyBZb3UgY2FuIHNlZSB0aGUgbnVtYmVyIG9mIGNvbHVtbnMgaGFzIGJlZW4gcmVkdWNlZCBieSAxLCBkdWUgdG8gc3ViLXNldHRpbmcgd2l0aG91dCBjb2x1bW4gMVxubmNvbChtb29keTMpIn0=
Snippet 5 - Subsetting rows and columns
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuI1N1YnNldCBvZiBSb3dzIGFuZCBDb2x1bW5zXG5tb29keTE8LXN1YnNldChtb29keSwgc2VsZWN0ID0gYygyOjQpLCBPTl9TTUFSVFBIT05FID09IFwibmV2ZXJcIilcbmNvbG5hbWVzKG1vb2R5MSlcbiNOb3RpY2UgdGhhdCBvbmx5IDMgY29sdW1ucyBhcmUgcmVtYWluaW5nXG5kaW0obW9vZHkxKSJ9
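The snippets above can also be tried offline. The sketch below is only an illustration: `toy` is a made-up miniature data frame whose column names imitate the Moody data, and its values are invented. It places subset() side by side with the equivalent bracket indexing.

```r
# Hypothetical miniature of the Moody data; all values are made up
toy <- data.frame(
  STUDENTID     = 1:5,
  SCORE         = c(91, 48, 77, 62, 85),
  GRADE         = c("A", "F", "B", "C", "B"),
  ON_SMARTPHONE = c("never", "always", "never", "sometimes", "never")
)

# Rows only: keep students who are never on a smartphone
rows_subset   <- subset(toy, ON_SMARTPHONE == "never")
rows_brackets <- toy[toy$ON_SMARTPHONE == "never", ]

# Rows and columns: same filter, keeping only SCORE and GRADE
both_subset   <- subset(toy, ON_SMARTPHONE == "never", select = c(SCORE, GRADE))
both_brackets <- toy[toy$ON_SMARTPHONE == "never", c("SCORE", "GRADE")]

nrow(rows_subset)                      # 3 of the 5 toy rows match
all.equal(rows_subset, rows_brackets)  # both forms select the same rows here
dim(both_subset)                       # 3 rows, 2 columns
```

The two forms agree on this toy data because ON_SMARTPHONE has no missing values. When the condition evaluates to NA for some rows, subset() drops those rows, while bracket indexing returns them as rows of NA.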
Code Review
What would R say?
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxubW9vZHlbbW9vZHkkU0NPUkU+PTkwLDNdXG4jIFdoYXQgd2lsbCBSIHNheT9cblxuXG4jIEEuIEdldCBzdWJzZXQgb2YgYWxsIGNvbHVtbnMgd2hpY2ggY29udGFpbnMgc3R1ZGVudHMgd2hvIHNjb3JlZCBtb3JlIHRoYW4gZXF1YWwgdG8gOTBcbiMgQi4gZXJyb3JcbiMgQy4gZ2V0IGFsbCBzY29yZSB2YWx1ZXMgd2hpY2ggYXJlIG1vcmUgdGhhbiBlcXVhbCB0byA5MFxuIyBELiBnZXQgc3Vic2V0IG9mIG9ubHkgdGhlIGdyYWRlcyBvZiBzdHVkZW50cyB3aXRoIHNjb3JlIGdyZWF0ZXIgdGhhbiBlcXVhbCB0byA5MCJ9
What would R say?
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuXG5tb29keVttb29keSRTQ09SRT49ODAuMCAmIG1vb2R5JEdSQURFID09J0InLF0gXG4jIFdoYXQgd2lsbCBSIHNheT9cblxuIyBBLiBzdWJzZXQgb2YgbW9vZHkgZGF0YSBmcmFtZSB3aG8gZ290IEIgZ3JhZGUuXG4jIEIuIGVycm9yLlxuIyBDLiBzdWJzZXQgb2YgbW9vZHkgZGF0YSBmcmFtZSB3aXRoIHNjb3JlIGdyZWF0ZXIgdGhhbiA4MC5cbiMgRC4gc3Vic2V0IG9mIG1vb2R5IGRhdGEgZnJhbWUgd2l0aCBzY29yZSBtb3JlIHRoYW4gODAgYW5kIGdvdCBCIGdyYWRlLiJ9
Derived Attribute
R allows new data frame attributes (columns) to be created “on the fly”. These are new vectors, often defined as functions of existing attributes, hence the name derived attributes.
Derived attributes will play an important role in data exploration as well as in building prediction models. Very often, derived attributes allow discovery of important patterns in data. Similarly, derived attributes may be more predictive than original attributes in the imported data sets.
The term feature engineering is often used in machine learning to describe creation of derived attributes.
Snippet 1 - Making new categorical attribute.
Line 4 initializes the new attribute PF (Pass/Fail) to “Pass”. Line 5 replaces “Pass” with “Fail” for students who received an F. This new attribute, PF, allows exploratory analysis to find “How to pass Professor Moody’s class”. The answer to this question may be different from the answer to “How to get a good grade in Professor Moody’s class”.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuXG4jIEN1dCBFeGFtcGxlIHVzaW5nIGJyZWFrcyAtIEN1dHRpbmcgZGF0YSB1c2luZyBkZWZpbmVkIHZlY3Rvci4gXG5tb29keSRQRjwtJ1Bhc3MnXG5tb29keVttb29keSRHUkFERT09J0YnLF0kUEY8LSdGYWlsJ1xuXG4jIGxldHMgc2VlIG91ciBhZGRlZCBjb2x1bW4gUEZcbm1vb2R5In0=
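The same two-step pattern (initialize, then overwrite selected rows) can be sketched on a small made-up data frame. The data frame `toy` and its grades are invented for illustration, and the overwrite is written as `toy$PF[...]`, a slight variant of the row-indexing form used in the snippet above.

```r
# Hypothetical grades, invented for this illustration only
toy <- data.frame(GRADE = c("A", "F", "B", "C", "F"))

toy$PF <- "Pass"                    # every row starts as "Pass"
toy$PF[toy$GRADE == "F"] <- "Fail"  # overwrite the rows with grade F

table(toy$PF)                       # counts: Fail 2, Pass 3
```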
Cut
- The cut() function divides the range of x into intervals and can also label those intervals. It plays an important role in defining derived attributes from numerical attributes.
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuXG4jIEN1dCBFeGFtcGxlIHVzaW5nIGJyZWFrcyAtIEN1dHRpbmcgZGF0YSB1c2luZyBkZWZpbmVkIHZlY3Rvci4gXG5zY29yZTEgPC0gY3V0KG1vb2R5JFNDT1JFLGJyZWFrcz1jKDAsNTAsMTAwKSxsYWJlbHM9YyhcIkZcIixcIlBcIikpXG50YWJsZShzY29yZTEpIn0=
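It may help to see how cut() treats values that fall exactly on a break. By default the intervals are open on the left and closed on the right, so breaks of 0, 50 and 100 give (0,50] and (50,100]. The sketch below uses a made-up vector `scores`, chosen to hit the boundaries; it is not taken from the Moody data.

```r
# Hypothetical scores, chosen to land on the interval boundaries
scores <- c(0, 25, 50, 51, 100)

# Default behaviour: intervals are (0,50] and (50,100]
pf <- cut(scores, breaks = c(0, 50, 100), labels = c("F", "P"))
as.character(pf)            # NA "F" "F" "P" "P": the score 0 falls outside (0,50]

# include.lowest = TRUE closes the first interval on the left: [0,50]
pf_with_zero <- cut(scores, breaks = c(0, 50, 100), labels = c("F", "P"),
                    include.lowest = TRUE)
as.character(pf_with_zero)  # "F" "F" "F" "P" "P"
```

If the data can contain a score of exactly 0, include.lowest = TRUE (or a lowest break below 0) keeps that score from turning into NA.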
Code Review
What would R say?
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuXG5jdXQobW9vZHkkU0NPUkUsIGJyZWFrcz1jKDAsMjUsNzAsMTAwKSxsYWJlbHM9YyhcImxvd1wiLCBcIm1lZGl1bVwiLCBcImhpZ2hcIikpXG4jV2hhdCB3b3VsZCBSIHNheT9cblxuIyBBLiA1IGludGVydmFscyBvZiBhdHRyaWJ1dGUgc2NvcmVcbiMgQi4gMyBpbnRlcnZhbHMgKDAsMjUpICgyNSw3MCkgKDc1LDEwMClcbiMgQy4gMyBjYXRlZ29yaWNhbCB2YWx1ZXMgXCJsb3dcIiwgXCJtZWRpdW1cIiBhbmQgXCJoaWdoXCIgZm9yIGRpZmZlcmVudCBzY29yZSBpbnRlcnZhbHNcbiMgRC4gMyBzZXBhcmF0ZSBkYXRhc2V0cyB3aXRoIHNpbWlsYXIgc2NvcmUgdmFsdWVzIn0=
What would R say?
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuXG5vdXRwdXQ8LWN1dChtb29keSRTQ09SRSwgNSlcbnN1bW1hcnkob3V0cHV0KVxuI1doYXQgd291bGQgUiBzYXk/XG5cbiMgQS4gNSBpbnRlcnZhbHMgb2YgYXR0cmlidXRlIHNjb3JlIG9mIHVuZXF1YWwgY291bnQgb2YgZWxlbWVudHNcbiMgQi4gNSBpbnRlcnZhbHMgb2YgYXR0cmlidXRlIHNjb3JlIG9mIGVxdWFsIGNvdW50IG9mIGVsZW1lbnRzXG4jIEMuIDUgY2F0ZWdvcmljYWwgdmFsdWVzIGZvciBkaWZmZXJlbnQgc2NvcmUgaW50ZXJ2YWxzXG4jIEQuIDUgc2VwYXJhdGUgZGF0YXNldCB3aXRoIHNpbWlsYXIgc2NvcmUgdmFsdWVzIn0=
What would R say?
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NT09EWS0yMDE5LmNzdlwiKVxuXG5vdXRwdXQ8LWN1dChtb29keSRBU0tTX1FVRVNUSU9OUywgMilcbnN1bW1hcnkob3V0cHV0KVxuI1doYXQgd291bGQgUiBzYXk/XG5cbiMgQS4gMiBpbnRlcnZhbHMgb2YgYXR0cmlidXRlIGFza19xdWVzdGlvbnMgb2YgdW5lcXVhbCBjb3VudCBvZiBlbGVtZW50cyBpbiBlYWNoIGludGVydmFsXG4jIEIuIDIgaW50ZXJ2YWxzIG9mIGF0dHJpYnV0ZSBhc2tfcXVlc3Rpb25zIG9mIGVxdWFsIGNvdW50IG9mIGVsZW1lbnRzIGluIGVhY2ggaW50ZXJ2YWxcbiMgQy4gMiBjYXRlZ29yaWNhbCB2YWx1ZXMgZm9yIGRpZmZlcmVudCBhc2tfcXVlc3Rpb25zIGludGVydmFsc1xuIyBELiBFcnJvci4ifQ==
More complex example of defining derived attributes
The next snippet illustrates defining a new numerical attribute, adjustedScore, for each student in the Moody data frame (the snippet below stores it in the column named conditional).
The score is adjusted by the value of the participation attribute in the following way:
If participation is 0.5 or larger, a bonus of participation * 10 is added to the score.
If participation is smaller than 0.5, a penalty of (1 - participation) * 10 is subtracted from the score.
In this way, someone with participation close to 0 loses nearly 10 points from the score. Conversely, someone with perfect participation (1.0) receives a 10-point bonus.
Snippet 1
eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keTwtcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9tb29keTIwMjBiLmNzdlwiKVxuXG5cbm1vb2R5JGNvbmRpdGlvbmFsIDwtMFxubW9vZHlbbW9vZHkkcGFydGljaXBhdGlvbjwwLjUwLCBdJGNvbmRpdGlvbmFsIDwtIG1vb2R5W21vb2R5JHBhcnRpY2lwYXRpb248MC41MCwgXSRzY29yZSAtMTAqKDEtbW9vZHlbbW9vZHkkcGFydGljaXBhdGlvbjwwLjUwLCBdJHBhcnRpY2lwYXRpb24pXG5tb29keVttb29keSRwYXJ0aWNpcGF0aW9uPj0wLjUwLCBdJGNvbmRpdGlvbmFsIDwtIG1vb2R5W21vb2R5JHBhcnRpY2lwYXRpb24+PTAuNTAsIF0kc2NvcmUgKzEwKm1vb2R5W21vb2R5JHBhcnRpY2lwYXRpb24+PTAuNTAsIF0kcGFydGljaXBhdGlvblxuXG4jIHByaW50IHRoZSBjb2x1bW4gbmFtZXNcbmNvbG5hbWVzKG1vb2R5KVxuXG4jIGxldHMgbG9vayBhdCB0aGUgY29uZGl0aW9uYWwgYXR0cmlidXRlIFxuaGVhZChtb29keSlcblxuI3N1YnNldCB0aGUgbW9vZHkgZGF0YXNldCByb3dzID0gMSB0byAxMCBhbmQgY29scyA9IDEsNVxubW9vZHlbMToxMCwgYygxLDUpXVxuXG4jc3Vic2V0IHRoZSBtb29keSBkYXRhc2V0IHJvd3MgPSAxIHRvIDEwIGFuZCBjb2xzID0gMSw1LDZcbm1vb2R5WzE6MTAsIGMoMSw1LDYpXVxuXG4jIHByaW50IHN1bW1hcnkgb2YgaW5pZGl2aWR1YWwgY29sdW1uc1xuc3VtbWFyeShtb29keSRzY29yZSlcbnN1bW1hcnkobW9vZHkkY29uZGl0aW9uYWwpXG5cbiMgUGxvdHRpbmcgdGhlIGNvbmRpdGlvbmFsIGF0dHJpYnV0ZSB1c2luZyBib3hwbG90XG5ib3hwbG90KG1vb2R5JGNvbmRpdGlvbmFsLGNvbCA9IGMoXCJyZWRcIiksbWFpbj1cIkNvbXBsZXggRXhhbXBsZVwiKVxuXG4jIFBsb3R0aW5nIHRoZSBzY29yZSBhdHRyaWJ1dGUgdXNpbmcgYm94cGxvdFxuYm94cGxvdChtb29keSRzY29yZSxjb2wgPSBjKFwiYmx1ZVwiKSxtYWluPVwiQ29tcGxleCBFeGFtcGxlXCIpIn0=
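The adjustment rule described above can also be checked by hand on a few made-up rows. The sketch below is an illustration only: the data frame `toy` is invented, and the rule is written with ifelse() rather than the row-indexed assignments used in the snippet, which is a swapped-in but equivalent way of expressing the same two cases.

```r
# Hypothetical students: same score, different participation (made-up values)
toy <- data.frame(
  score         = c(80, 80, 80, 80),
  participation = c(0.0, 0.4, 0.5, 1.0)
)

# Bonus of 10 * participation when participation >= 0.5,
# otherwise a penalty of 10 * (1 - participation)
toy$adjustedScore <- ifelse(
  toy$participation >= 0.5,
  toy$score + 10 * toy$participation,
  toy$score - 10 * (1 - toy$participation)
)

toy  # adjusted scores: 70, 74, 85, 90
```

With no participation the full 10 points are subtracted (80 becomes 70), and with perfect participation the full 10 points are added (80 becomes 90).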