In this section we introduce the most basic R instructions, which we call R101 and which are sufficient for the entire Data 101 class. This is a very small subset of R. The good news is that this small subset is enough to accomplish all coding objectives for Data 101!
The set we present below is a mix of simple arithmetic aggregate functions such as mean() (3.4.1), max() (3.4.3) and sum(); basic data structures such as vectors and data frames; two core functions defined for data frames, subset() (3.5) and tapply() (3.6); and the table() (3.3) function defined on vectors.
3.1 Vector
A vector is simply an ordered sequence of items that are all of the same type.
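As a minimal sketch of the idea, the snippet below builds a few vectors with c() and applies some of the aggregate functions mentioned earlier. The variable names and values are made up for illustration.

```r
# Hypothetical scores; c() combines values into a vector
scores <- c(82, 67, 91, 74)
grades <- c("B", "D", "A", "C")   # a character vector

# Every element of a vector has the same type
class(scores)   # "numeric"
class(grades)   # "character"

# Mixing types forces a common type: here everything becomes character
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"

# Elements are accessed by position, starting at 1
scores[1]       # 82
scores[2:3]     # 67 91

# Aggregate functions work on the whole vector
mean(scores)    # 78.5
max(scores)     # 91
sum(scores)     # 314
```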
A data frame holds data in the form of a table: each row is an observation and each column is an attribute.
3.2.1 Data Frame creation
Data frames will serve as containers of imported data, typically data provided in CSV format, like the moody data set above. Snippet 4.21 shows how to populate a data frame using the read.csv() function. Notice that the moody data frame, which is the container for the imported data set, automatically inherits the attribute names (columns) of the underlying data set.
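The sketch below illustrates the same idea in a self-contained way. It first writes a tiny CSV file to a temporary location and then imports it with read.csv(). The file contents and column names (Score, Grade, Major) are hypothetical stand-ins and not the actual moody data set.

```r
# Write a tiny CSV file so the example runs on its own.
# The column names and values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Score,Grade,Major",
  "82,B,CS",
  "67,D,Math",
  "91,A,CS"
), csv_path)

# read.csv() returns a data frame; the header row becomes the column names
moody <- read.csv(csv_path)

colnames(moody)   # "Score" "Grade" "Major"
nrow(moody)       # 3
str(moody)        # shows the type of each column
```

In practice the argument of read.csv() would be the path or URL of the real data file.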
Now we are ready to introduce basic data transformation techniques such as slicing and dicing. Slicing, otherwise known as subsetting, selects a subset of the rows of a data frame. These subsets are defined by boolean conditions built from expressions of the form Attribute op value, where op is one of the comparison operators such as ==, !=, < etc. Conditions can be combined with logical operators such as & (and) and | (or). For example, (Score > 70) & (Grade == "A") refers to the subset of a data frame describing students who scored more than 70 points and got an A.
Dicing refers to eliminating some of the attributes from a data frame (it is vertical slicing), which results in a more "narrow" frame.
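Slicing and dicing can be sketched on a small data frame as follows. The data frame df and its values are hypothetical and serve only to show how a boolean condition picks rows and how a list of column names picks attributes.

```r
# Hypothetical student data
df <- data.frame(
  Score = c(82, 67, 91, 74),
  Grade = c("B", "D", "A", "A"),
  Major = c("CS", "Math", "CS", "Stat")
)

# A boolean condition evaluates to one TRUE/FALSE value per row
keep <- (df$Score > 70) & (df$Grade == "A")
keep              # FALSE FALSE  TRUE  TRUE

# Slicing: keep only the rows where the condition is TRUE
sliced <- df[keep, ]
nrow(sliced)      # 2

# Dicing: keep only some of the columns (a narrower frame)
diced <- df[, c("Score", "Grade")]
colnames(diced)   # "Score" "Grade"
```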
Finally, we can also expand our data frame with new, so-called derived, attributes. This is a very useful operation in data analysis since it allows so-called "feature engineering". These new user-defined features can lead to totally new insights into the data.
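A derived attribute is created by assigning to a column name that does not exist yet. The sketch below adds two such columns to a hypothetical data frame; the names Passed and AboveMean and the cutoff of 70 are illustrative assumptions.

```r
# Hypothetical student data
df <- data.frame(
  Score = c(82, 67, 91, 74),
  Grade = c("B", "D", "A", "C")
)

# Derived attribute: TRUE when the score is above a hypothetical cutoff of 70
df$Passed <- df$Score > 70

# Derived attribute: distance of each score from the class mean
df$AboveMean <- df$Score - mean(df$Score)

colnames(df)   # "Score" "Grade" "Passed" "AboveMean"
df$Passed      # TRUE FALSE  TRUE  TRUE
df$AboveMean   # 3.5 -11.5  12.5  -4.5
```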
3.5 Subset
The following snippets demonstrate two ways of subsetting a data frame: first through the explicit function subset(), and second through the native sub-data frame notation df[ ].
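As a minimal sketch of the two approaches, assuming a hypothetical data frame df with columns Score, Grade and Major, the same selection can be written both ways:

```r
# Hypothetical student data
df <- data.frame(
  Score = c(82, 67, 91, 74),
  Grade = c("B", "D", "A", "A"),
  Major = c("CS", "Math", "CS", "Stat")
)

# 1) subset(): column names are used directly inside the condition;
#    select= picks the columns to keep
a1 <- subset(df, Score > 70 & Grade == "A", select = c(Score, Major))

# 2) Bracket notation: df[rows, columns]; columns are referred to as df$...
a2 <- df[df$Score > 70 & df$Grade == "A", c("Score", "Major")]

# Both approaches select the same rows and columns
nrow(a1)            # 2
nrow(a2)            # 2
all.equal(a1, a2)   # TRUE
```

subset() is usually easier to read, while the bracket notation makes the row and column selection explicit.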
One of the most important R instructions is tapply(). It applies an aggregate function separately to each group of rows defined by the values of a categorical variable.
3.6 tapply
tapply() takes three arguments: a numerical attribute of the data frame df, a categorical attribute of df, and an aggregate function (mean, max, min, etc.). The syntax of tapply() is as follows:
tapply(df$numerical, df$categorical, FUN)
tapply() first slices the data frame df by the different values of the categorical attribute and then computes an aggregate (mean, median, min, max, etc.) of the numerical attribute for each slice.
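The slicing-then-aggregating behavior can be sketched on a small example. The data frame and its columns Score and Major are hypothetical.

```r
# Hypothetical student data
df <- data.frame(
  Score = c(80, 60, 90, 70, 100),
  Major = c("CS", "Math", "CS", "Math", "Stat")
)

# Mean Score for each value of Major
avg <- tapply(df$Score, df$Major, mean)
avg
#   CS Math Stat
#   85   65  100

# Individual results are looked up by category name
avg["CS"]     # 85

# Any aggregate function can be used in place of mean
top <- tapply(df$Score, df$Major, max)
top
#   CS Math Stat
#   90   70  100
```

The result has one entry per category, named after the category value, so it reads like a small summary table.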