Section: 3 🔖 Basic R Instructions

In this section we introduce the most basic R instructions, which we call R101 and which will be sufficient for the entire data 101 class. This is a very small subset of R. The good news is that this small subset is enough to accomplish all coding objectives for data 101!

The set we present below is a mix of simple aggregate functions such as mean() (3.4.1), max() (3.4.3), and sum(); basic data structures such as vectors and data frames; two core functions defined for data frames, subset() (3.5) and tapply() (3.6); and the table() function (3.3), defined on vectors.

3.1 Vector

  • A vector is simply a list of items that are of the same type.

3.1.1 Categorical vectors

Let's look at an example of creating a vector:

# Create a vector of color names.
color <- c('Red','Blue','Yellow','Green')

3.1.2 Numerical vectors

To create a vector of consecutive numerical values, use the : operator:

# Create a vector with a numerical sequence.
year <- 2018:2022
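As a minimal sketch of the same-type rule, the snippet below checks the type of each vector with class(). The vector name mixed is a hypothetical example and does not come from the original text.

```r
# A categorical (character) vector and a numerical sequence.
color <- c('Red', 'Blue', 'Yellow', 'Green')
year <- 2018:2022

class(color)   # "character"
class(year)    # "integer"

# Hypothetical example: mixing types converts every item to one common type.
mixed <- c(1, 'a')
class(mixed)   # "character"
```

Because a vector holds a single type, the number 1 in mixed is stored as the text "1".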

3.2 Data Frames

  • A data frame stores data in a table format, with rows and named columns.

3.2.1 Data Frame creation

Data frames will serve as containers of imported data, typically data provided in CSV format, like the moody data set above. Snippet 4.21 shows how to populate a data frame using the read.csv() instruction. Notice that the moody data frame, which is the container for the imported data set, automatically inherits the attribute names (columns) of the underlying data set.

# Load the dataset into the moody variable
moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
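To see how column names are inherited without downloading the file, here is a small sketch that reads CSV text from a string. The two data rows are copied from Table 3.1; the names csv_text and moody_sample are hypothetical.

```r
# Hypothetical in-memory CSV: a header line followed by two rows from Table 3.1.
csv_text <- "Major,Score,Seniority,GPA,Grade
Statistics,97,Junior,2.0,A
CS,86,Senior,4.0,B"

# read.csv() accepts a text= argument in place of a file name or URL.
moody_sample <- read.csv(text = csv_text)

colnames(moody_sample)   # "Major" "Score" "Seniority" "GPA" "Grade"
nrow(moody_sample)       # 2
```

The header line of the CSV becomes the column names of the data frame, which is what happens to moody when it is loaded from the URL.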

3.2.2 Data frame subsetting

We can select subsets of columns and subsets of rows of a data frame using the notation data[rows, columns]:

# Load the dataset into the moody variable
moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
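The data[rows, columns] notation can be sketched as follows. To keep the example self-contained, it uses a hypothetical data frame moody_sample built from the five rows shown in Table 3.1 rather than the full moody data set.

```r
# Hypothetical sample: the five rows shown in Table 3.1.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Seniority = c("Junior", "Sophomore", "Senior", "Senior", "Senior"),
  GPA = c(2.0, 1.8, 1.0, 4.0, 4.0),
  Grade = c("A", "D", "C", "B", "B")
)

first_rows <- moody_sample[1:3, ]                  # rows 1 to 3, all columns
two_cols <- moody_sample[, c("Major", "Score")]    # all rows, two columns
corner <- moody_sample[1:3, c("Major", "Score")]   # rows 1 to 3 of those columns

dim(corner)   # 3 2
```

Leaving the rows or the columns position empty keeps all rows or all columns, respectively.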

3.3 Table

3.3.1 Table()

The table() function counts how many times each distinct value occurs in a vector. The examples below show how to use it:

# moody <- read.csv("../files/dataset/moody2020b.csv")  # static load
moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")  # web load
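A minimal sketch of table(), using the Major and Grade values of the five rows in Table 3.1 instead of the full moody data set. The vector names major and grade are assumptions made for illustration.

```r
# Hypothetical sample: the Major and Grade values of the rows in Table 3.1.
major <- c("Statistics", "Psychology", "Economics", "CS", "CS")
grade <- c("A", "D", "C", "B", "B")

# One vector: how often each distinct major occurs.
major_counts <- table(major)
major_counts["CS"]          # CS occurs 2 times

# Two vectors: a cross-tabulation of majors against grades.
major_by_grade <- table(major, grade)
major_by_grade["CS", "B"]   # 2 CS students received a B
```

With the full data set loaded, the same calls would presumably take moody$Major and moody$Grade as arguments.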

3.4 Basic Functions

Table 3.1: Snippet of moody Dataset

Row  Major       Score  Seniority  GPA  Grade
197  Statistics  97     Junior     2.0  A
632  Psychology  41     Sophomore  1.8  D
363  Economics   70     Senior     1.0  C
252  CS          86     Senior     4.0  B
136  CS          85     Senior     4.0  B

3.4.1 mean()

  • The mean() function returns the average of the values in a numerical vector.

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
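A minimal sketch of mean(), assuming only the five Score values shown in Table 3.1 rather than the full moody data set. The vector name score is hypothetical.

```r
# The five Score values shown in Table 3.1.
score <- c(97, 41, 70, 86, 85)

mean(score)   # 75.8, that is (97 + 41 + 70 + 86 + 85) / 5
```

With the full data set loaded, the same call would presumably be written as mean(moody$Score).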

3.4.2 length()

  • The length() function returns the number of elements in any vector.

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
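A short sketch of length() on values copied from Table 3.1. It works the same way for numerical and categorical vectors; the names score and major are hypothetical.

```r
# Score and Major values from the five rows shown in Table 3.1.
score <- c(97, 41, 70, 86, 85)
major <- c("Statistics", "Psychology", "Economics", "CS", "CS")

length(score)   # 5
length(major)   # 5
```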

3.4.3 max()

  • The max() function returns the maximum value in a numerical vector.

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
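A short sketch of max(), assuming the Score and GPA values from Table 3.1. The vector names score and gpa are hypothetical.

```r
# Score and GPA values from the five rows shown in Table 3.1.
score <- c(97, 41, 70, 86, 85)
gpa <- c(2.0, 1.8, 1.0, 4.0, 4.0)

max(score)   # 97
max(gpa)     # 4
```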

3.4.4 min()

  • The min() function returns the minimum value in a numerical vector.

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
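A short sketch of min(), assuming the Score and GPA values from Table 3.1. The vector names score and gpa are hypothetical.

```r
# Score and GPA values from the five rows shown in Table 3.1.
score <- c(97, 41, 70, 86, 85)
gpa <- c(2.0, 1.8, 1.0, 4.0, 4.0)

min(score)   # 41
min(gpa)     # 1
```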

3.4.5 sd()

  • The sd() function returns the standard deviation of a numerical vector.

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
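A minimal sketch of sd() on the five Score values from Table 3.1. It also spells out the sample standard deviation by hand, using the n - 1 denominator that sd() applies. The names score and by_hand are hypothetical.

```r
# The five Score values shown in Table 3.1.
score <- c(97, 41, 70, 86, 85)

sd(score)   # about 21.7

# The same value computed step by step: squared deviations from the mean,
# summed, divided by n - 1, then square-rooted.
by_hand <- sqrt(sum((score - mean(score))^2) / (length(score) - 1))
```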

Now we are ready to introduce basic data transformation techniques such as slicing and dicing. Slicing, otherwise known as subsetting, allows the selection of data frame subsets. These subsets are defined by boolean conditions built from Attribute op value expressions, where op is one of the comparison operators such as ==, !=, <, etc. For example, (Score > 70) & (Grade == 'A') refers to the subset of a data frame describing students who scored more than 70 points and got an A.

Dicing refers to eliminating some of the attributes from a data frame; it is vertical slicing, which results in a "narrower" frame. Finally, we can also expand our data frame with new, so-called derived, attributes. This is a very useful operation in data analysis since it allows so-called "feature engineering". These new user-defined features can lead to totally new insights into the data.
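As a rough sketch of a boolean condition and a derived attribute, the snippet below works on a hypothetical data frame moody_sample built from three of the columns in Table 3.1. The column name HighScore is made up for illustration.

```r
# Hypothetical sample: three of the columns from Table 3.1.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# Boolean condition: one TRUE/FALSE value per row.
high_a <- (moody_sample$Score > 70) & (moody_sample$Grade == 'A')
sum(high_a)   # 1 row satisfies the condition

# Derived attribute: a new column computed from an existing one.
moody_sample$HighScore <- moody_sample$Score > 70
colnames(moody_sample)   # "Major" "Score" "Grade" "HighScore"
```

Note that equality is tested with ==, since a single = is an assignment in R.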

3.5 Subset

The following snippets demonstrate two ways of subsetting a data frame: first through the explicit function subset(), and second through the native sub-data-frame notation df[ ].

3.5.1 subset()

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
# Subset of rows
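A minimal sketch of row selection with subset(), assuming a hypothetical data frame moody_sample built from three of the columns in Table 3.1. Inside subset(), columns are referred to by name, without the df$ prefix.

```r
# Hypothetical sample: three of the columns from Table 3.1.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# Students who scored more than 70 points and got an A.
high_a <- subset(moody_sample, Score > 70 & Grade == 'A')
nrow(high_a)   # 1
```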

3.5.2 Subframe

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
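The df[ ] notation can express the same kind of row selection. This sketch assumes a hypothetical data frame moody_sample built from three of the columns in Table 3.1. Here the condition must name the data frame explicitly.

```r
# Hypothetical sample: three of the columns from Table 3.1.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# Rows where Major is CS; the empty position after the comma keeps all columns.
cs_rows <- moody_sample[moody_sample$Major == "CS", ]
nrow(cs_rows)   # 2
```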

3.5.3 Subsetting columns

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
colnames(moody)
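A minimal sketch of column selection, shown both with the df[ ] notation and with the select argument of subset(). The data frame moody_sample is a hypothetical stand-in built from three of the columns in Table 3.1.

```r
# Hypothetical sample: three of the columns from Table 3.1.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

colnames(moody_sample)   # "Major" "Score" "Grade"

# Keep only Major and Grade, first with df[ ] and then with subset().
major_grade <- moody_sample[, c("Major", "Grade")]
same_cols <- subset(moody_sample, select = c(Major, Grade))

colnames(major_grade)    # "Major" "Grade"
```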

3.5.4 Subsetting rows and columns

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
# Subset of rows and columns
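Rows and columns can be selected in a single step. The sketch below does so in both notations, assuming a hypothetical data frame moody_sample built from three of the columns in Table 3.1.

```r
# Hypothetical sample: three of the columns from Table 3.1.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# Major and Score of the students who scored more than 70 points.
with_subset <- subset(moody_sample, Score > 70, select = c(Major, Score))
with_brackets <- moody_sample[moody_sample$Score > 70, c("Major", "Score")]

dim(with_subset)   # 3 2
```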

One of the most important R instructions is tapply(). It applies an aggregate function separately to each group of rows defined by the values of a categorical variable.

3.6 tapply

tapply() takes three arguments: a numerical attribute of a data frame df, a categorical attribute of df, and an aggregate function (mean, max, min, etc.). The syntax of tapply() is as follows:

tapply(df$numerical_attribute, df$categorical_attribute, aggregate_function)

  • tapply() first slices the data frame df by the different values of the categorical attribute and then computes an aggregate (mean, median, min, max, etc.) of the numerical attribute for each slice.

3.6.1 tapply()

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
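A minimal sketch of tapply(), assuming a hypothetical data frame moody_sample built from three of the columns in Table 3.1. It computes the mean and the maximum Score for each Major.

```r
# Hypothetical sample: three of the columns from Table 3.1.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B")
)

# One result per distinct value of Major.
avg_by_major <- tapply(moody_sample$Score, moody_sample$Major, mean)
avg_by_major["CS"]   # 85.5, the mean of 86 and 85

max_by_major <- tapply(moody_sample$Score, moody_sample$Major, max)
max_by_major["CS"]   # 86
```

The aggregate function is passed by name, without parentheses.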

3.6.2 Combining table() and subset()

moody <- read.csv("https://raw.githubusercontent.com/dev7796/data101_tutorial/main/files/dataset/moody2022.csv")
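One way to combine the two functions is to slice the data frame first and then count values in the slice. The sketch below assumes a hypothetical data frame moody_sample built from three of the columns in Table 3.1; it is an illustration, not necessarily the combination the original example used.

```r
# Hypothetical sample: three of the columns from Table 3.1.
# stringsAsFactors = FALSE keeps Major as plain text in every R version.
moody_sample <- data.frame(
  Major = c("Statistics", "Psychology", "Economics", "CS", "CS"),
  Score = c(97, 41, 70, 86, 85),
  Grade = c("A", "D", "C", "B", "B"),
  stringsAsFactors = FALSE
)

# Keep the students who scored more than 70, then count them by major.
high <- subset(moody_sample, Score > 70)
major_counts <- table(high$Major)

major_counts["CS"]   # 2 of the high scorers are CS majors
```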