Section: 3 🔖 Basic R Instructions

In this section we introduce the most basic R instructions, call them R101, which will be sufficient for the entire data 101 class. This is a very small subset of R. The good news is that this small subset is enough to accomplish all coding objectives for data 101!

The set we present below is a mix of simple aggregate functions such as mean() (3.4.1), max() (3.4.3), and sum(); basic data structures such as vectors and data frames; and finally two core functions defined for data frames, subset() (3.5) and tapply() (3.6), together with the table() function (3.3), which is defined on vectors.

3.1 Vector

  • A vector is simply a list of items that are of the same type.

3.1.1 Categorical vectors

Let's look at an example of creating a vector:

    # Let's create a vector of color names.
    color <- c('Red', 'Blue', 'Yellow', 'Green')

    # Let's look at the created vector.
    color

3.1.2 Numerical vectors

To create a vector of numerical values in a sequence, use the : operator:

    # Let's create a vector holding a numerical sequence.
    year <- 2018:2022

    # Let's look at the created vector.
    year
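
As a small additional sketch, c() also builds numerical vectors from arbitrary values, and arithmetic and comparisons are applied to every element at once. The names scores and passed, and the threshold of 50, are made up for this illustration:

```r
# Illustrative scores (assumed values, not loaded from a file)
scores <- c(97, 41, 70, 86, 85)

# Elements are accessed by position, starting at 1
scores[1]        # first element
scores[2:3]      # second and third elements

# Operations are applied element by element
scores + 5                # add 5 to every score
passed <- scores >= 50    # logical vector: TRUE where the score is at least 50
passed

# All elements of a vector share one type; class() reports it
class(scores)
class(passed)
```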

3.2 Data Frames

  • A data frame stores data in the form of a table: each row is one record and each column is one attribute.

3.2.1 Data Frame creation

Data frames serve as containers for imported data, typically data provided in csv format, like the moody data set above. Snippet 4.21 shows how to populate a data frame using the read.csv() function. Notice that the moody data frame, which is the container for the imported data set, automatically inherits the attribute names (columns) of the underlying data set.

    # Load the dataset into the moody variable
    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Now let's view the first few rows of the moody data frame
    head(moody)
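
When no csv file is at hand, a data frame can also be built directly from vectors with data.frame(). The sketch below is not how the class loads moody; it only copies the five rows shown in Table 3.1 into a toy frame (the name toy is an assumption made for this illustration) to show that the column names are taken from the names of the vectors:

```r
# Toy data frame holding the five rows shown in Table 3.1
toy <- data.frame(
  Major     = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score     = c(97, 41, 70, 86, 85),
  Seniority = c("Junior", "Sophomore", "Senior", "Senior", "Senior"),
  GPA       = c(2.0, 1.8, 1.0, 4.0, 4.0),
  Grade     = c("A", "D", "C", "B", "B")
)

colnames(toy)   # attribute (column) names
nrow(toy)       # number of rows
ncol(toy)       # number of columns
str(toy)        # type of each column
```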

3.2.2 Data frame subsetting

We can select subsets of the rows and columns of a data frame using the notation data[rows, columns]. Leaving rows or columns empty selects all of them:

    # Load the dataset into the moody variable
    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Return row 1
    moody[1, ]

    # Return column 5
    moody[, 5]

    # Rows 1 to 5 of column 2
    moody[1:5, 2]

    # Rows 1 to 3 of columns 2 and 4
    # (c(2, 4) picks columns 2 and 4; 2:4 would pick columns 2 through 4)
    moody[1:3, c(2, 4)]
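
Columns can also be selected by name rather than by position, which tends to be easier to read. A minimal sketch on a toy data frame (toy is a made-up name; its values are copied from Table 3.1):

```r
# Toy data frame with three of the columns shown in Table 3.1
toy <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# A single column as a vector, using $
toy$Score

# Rows 1 to 3 of the columns named Major and Grade
toy[1:3, c("Major", "Grade")]

# 2:3 means columns 2 through 3, while c(1, 3) means columns 1 and 3 only
toy[1:2, 2:3]
toy[1:2, c(1, 3)]
```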

3.3 Table

3.3.1 table()

The table() function counts how many times each distinct value occurs in a vector. The examples below show how to use it:

    # moody <- read.csv("../files/dataset/moody2020b.csv")  # static load
    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")  # web load

    # Let's make a table of the number of students for each Grade.
    grades <- table(moody$Grade)
    grades

    # Joint distribution of grade and major
    table(moody$Grade, moody$Major)
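
To see what table() returns without loading any file, here is a small sketch on two hand-made vectors (grade and major are illustrative names; the values are the ones shown in Table 3.1):

```r
grade <- c("A", "D", "C", "B", "B")
major <- c("Statistics", "Psychology", "Economics", "CS", "CS")

counts <- table(grade)
counts            # frequency of each grade
counts["B"]       # look up a single count by name
names(counts)     # the distinct values, in sorted order

# Two vectors give a two-way table: rows are grades, columns are majors
table(grade, major)
```

Each cell of the two-way table counts the records that have that particular combination of grade and major.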

3.4 Basic Functions

Table 3.1: Snippet of moody Dataset

     Major       Score  Seniority  GPA  Grade
197  Statistics     97  Junior     2.0  A
632  Psychology     41  Sophomore  1.8  D
363  Economics      70  Senior     1.0  C
252  CS             86  Senior     4.0  B
136  CS             85  Senior     4.0  B

3.4.1 mean()

  • The mean() function is used to find the average of the values in a numerical vector.

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Let's look at the mean of the Score column.
    mean(moody$Score)

3.4.2 length()

  • The length() function is used to get the number of elements in any vector.

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Let's look at the length of the Grade column.
    length(moody$Grade)

3.4.3 max()

  • The max() function is used to get the maximum value in a numerical vector.

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Let's look at the maximum value in the Score column.
    max(moody$Score)

3.4.4 min()

  • The min() function is used to get the minimum value in a numerical vector.

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Let's look at the minimum value in the Score column.
    min(moody$Score)

3.4.5 sd()

  • The sd() function is used to find the standard deviation of a numerical vector.

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Let's look at the standard deviation of the Score column.
    sd(moody$Score)
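
The introduction to this section also lists sum() alongside mean() and max(). As a brief sketch on a hand-made vector (score is an illustrative name; the values are the five scores shown in Table 3.1), sum() adds up the elements of a numerical vector and relates to mean() and length() as follows:

```r
score <- c(97, 41, 70, 86, 85)

sum(score)                    # total of all values: 379
length(score)                 # number of values: 5
sum(score) / length(score)    # the same value as mean(score)
mean(score)

# Logical comparisons can be summed: TRUE counts as 1, FALSE as 0
sum(score > 80)               # how many scores are above 80: 3
```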

Now we are ready to introduce basic data transformation techniques such as slicing and dicing. Slicing, otherwise known as subsetting, allows the selection of data frame subsets. These subsets are defined by boolean conditions built from Attribute op value expressions, where op is one of the comparison operators such as ==, !=, <, etc. For example, (Score > 70) & (Grade == 'A') refers to the subset of a data frame describing students who scored more than 70 points and got an A.

Dicing refers to eliminating some of the attributes from a data frame. It is vertical slicing, which results in a "narrower" frame. Finally, we can also expand our data frame with new, so-called derived, attributes. This is a very useful operation in data analysis since it allows so-called "feature engineering". These new user-defined features can lead to totally new insights into the data.
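
The three operations described above can be sketched on a toy data frame. The frame toy, the derived column Pass, and the pass threshold of 50 are all assumptions made for this illustration; they are not part of the moody data set:

```r
# Toy data frame with values copied from Table 3.1
toy <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# Slicing: keep only the rows that satisfy a boolean condition
toy[toy$Score > 70 & toy$Grade == "A", ]

# Dicing: keep only some of the columns
toy[, c("Major", "Score")]

# Derived attribute: add a new column computed from an existing one
toy$Pass <- toy$Score >= 50
toy
```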

3.5 Subset

The following snippets demonstrate two ways of subsetting a data frame: first through the explicit function subset(), and second through the native sub-data-frame notation df[ ].

3.5.1 subset()

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Subset of rows
    moody_psychology <- subset(moody, Major == 'Psychology')
    nrow(moody)
    nrow(moody_psychology)

3.5.2 Subframe

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Alternate way to subset.
    moody[moody$Major == "Psychology", ]
    moody[moody$Major != "Psychology", ]
    moody[moody$Score > 80, ]
    moody[moody$Score > 80 & moody$Grade == 'B', ]

3.5.3 Subsetting columns

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
    colnames(moody)
    ncol(moody)

    # Subset of columns: drop column 1
    moody3 <- subset(moody, select = -c(1))

    # The number of columns has been reduced by 1, because column 1 was dropped
    ncol(moody3)

3.5.4 Subsetting rows and columns

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Subset of rows and columns
    moody1 <- subset(moody, select = c(2:4), Major == "Psychology")
    colnames(moody1)

    # Notice that only 3 columns remain
    dim(moody1)
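
As a rough check that subset() and the bracket notation select the same data, the sketch below applies both to a toy data frame. The names toy, by_function and by_brackets are assumptions made for this illustration:

```r
# Toy data frame with values copied from Table 3.1
toy <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# Rows with Score above 80, columns Major and Score, written two ways
by_function <- subset(toy, Score > 80, select = c(Major, Score))
by_brackets <- toy[toy$Score > 80, c("Major", "Score")]

by_function
by_brackets

# all.equal() reports TRUE when the two results match
all.equal(by_function, by_brackets)
```

One difference worth knowing: when the condition evaluates to NA for some row, subset() drops that row, while the bracket notation returns a row filled with NA values.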

One of the most important R instructions is tapply(). It applies an aggregate function separately to each group of rows defined by the values of a categorical variable.

3.6 tapply

tapply() takes three arguments: a numerical attribute of a data frame df, a categorical attribute of df, and an aggregate function (mean, max, min, etc.). The syntax of tapply() is as follows:

tapply(df$numerical_attribute, df$categorical_attribute, aggregate_function)

  • tapply() first slices the data frame df by the different values of the categorical attribute and then applies the aggregate (mean, median, min, max, etc.) to the numerical attribute within each slice.
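
The slice-then-aggregate behaviour can be checked by hand on a toy data frame. In this sketch the names toy and mean_by_major are made up for illustration, and the values are copied from Table 3.1:

```r
toy <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85)
)

# Mean score for each major: one result per distinct value of Major
mean_by_major <- tapply(toy$Score, toy$Major, mean)
mean_by_major

# The same number for a single group, computed by slicing manually
mean(toy[toy$Major == "CS", ]$Score)
mean_by_major["CS"]
```

The result of tapply() is a named vector, so the value for a single group can be looked up by its name.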

3.6.1 tapply()

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Mean score for each grade, among juniors only
    juniors <- moody[moody$Seniority == 'Junior', ]
    tapply(juniors$Score, juniors$Grade, mean)

3.6.2 Combining table() and subset()

    moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")

    # Distribution of grades for juniors
    table(moody[moody$Seniority == 'Junior', ]$Grade)

    # Distribution of grades for seniors who major in Economics
    table(moody[moody$Seniority == 'Senior' & moody$Major == 'Economics', ]$Grade)
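
The same kind of question can also be written with the subset() function instead of bracket notation. A minimal sketch on a toy data frame (the names toy, seniors and grade_counts are assumptions made for this illustration; the values are copied from Table 3.1):

```r
toy <- data.frame(
  Major     = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Seniority = c("Junior", "Sophomore", "Senior", "Senior", "Senior"),
  Grade     = c("A", "D", "C", "B", "B")
)

# Distribution of grades for seniors
seniors <- subset(toy, Seniority == "Senior")
grade_counts <- table(seniors$Grade)
grade_counts

# Distribution of grades for seniors who major in CS, in one expression
table(subset(toy, Seniority == "Senior" & Major == "CS")$Grade)
```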