Section: 17 🔖 Linear Regression

Lecture slides:

Linear regression is a linear approach to modeling the relationship between a numerical response ($Y$) and one or more independent variables ($X_i$).

Usually in linear regression, models are used to predict only one scalar variable. But there are two subtype if these models: - First when there is only one explanatory variable and one output variable. This type of linear regression model known as simple linear regression. - Second, when there are multiple predictors, i.e. explanatory/dependent variables for the output variable. This type of linear regression model known as multiple linear regression.

Linear models fitted to various different type of data spread. This illustrates the pitfalls of relying solely on a fitted model to understand the relationship between variables. Credits: Wikipedia.

17.1 Linear regression using lm() function

Syntax for building the regression model using the lm() function is as follows:

lm(formula, data, ...)
- formula: here we mention the prediction column and the other related columns(predictors) on which the prediction will be based on.
  - prediction ~ predictor1 + predictor2 + predictor3 + ...
- data: here we provide the dataset on which the linear regression model is to be trained.

For more info on the lm() function visit lm()

Lets look at the example on the Moody dataset.

Table 17.1: Snippet of Moody Num Dataset
Midterm	Project	FinalExam	ClassScore
73	8	70	39.60000
61	100	20	68.20000
58	88	38	67.00000
93	41	46	52.47565
85	52	85	68.50000
97	48	19	49.10000
26	59	22	41.30000
58	62	25	50.10000
53	56	27	46.70000
66	27	17	34.80494

Now we can build a simple linear regression model to predict the ClassScore attribute based on the various other attributes present in the dataset, as shown above.

Since we will be predicting only one attribute values, this model will be called simple linear regression model.

17.1.1 Snippet 1: How much do Midterm, Project and Final Exam count?

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJtb29keU5VTTwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L01vb2R5TlVNLmNzdicpXG5zcGxpdDwtMC43Km5yb3cobW9vZHlOVU0pXG5zcGxpdFxubW9vZHlOVU1UcjwtbW9vZHlOVU1bMTpzcGxpdCxdXG5tb29keU5VTVRyXG5tb29keU5VTVRzPC1tb29keU5VTVtzcGxpdDpucm93KG1vb2R5TlVNKSxdXG4jV2UgdXNlIGxpbmVhciByZWdyZXNzaW9uIHRvICNmaW5kIG91dCB0aGUgd2VpZ2h0cyBvZiAjTWlkdGVybSwgUHJvamVjdCBhbmQgRmluYWwgI0V4YW0gaW4gY2FsY3VsYXRpb24gb2YgdGhlICNmaW5hbCBjbGFzcyBzY29yZS4gRWFjaCBvZiAjdGhlbSBhcmUgc2NvcmVkIG91dCBvZiAxMDAgYW5kICN0aGUgZmluYWwgY2xhc3Mgc2NvcmUgaXMgYWxzbyAjc2NvcmVkIG91dCBvZiAxMDAgYXMgd2VpZ2h0ZWQgI3N1bSBvZiBNaWR0ZXJtLCBQcm9qZWN0IGFuZCAjRmluYWwgRXhhbSBzY29yZXMuXG50cmFpbiA8LSBsbShDbGFzc1Njb3Jlfi4sICBkYXRhPW1vb2R5TlVNVHIpXG50cmFpblxucHJlZCA8LSBwcmVkaWN0KHRyYWluLG5ld2RhdGE9bW9vZHlOVU1Ucylcbm1lYW4oKHByZWQgLSBtb29keU5VTVRzJENsYXNzU2NvcmUpXjIpIn0=

We can see that,

The summary of the lm model give us information about the parameters of the model, the residuals and coefficients, etc.
The predicted values are obtained from the predict function using the trained model and the test data.

17.2 Calculating the Error using mse()

As was the simple case in the categorical predictions of the classification models, where we could just compare the predicted categories and the actual categories, this type of direct comparison as an accuracy test won’t prove useful now in our numerical predictions scenario.

We don’t want to eyeball every time we predict, to find the accuracy of our predictions each row by row, so lets see a method to calculate the accuracy of our predictions, using some statistical technique.

To do this we will use the Mean Squared Error(MSE).

The MSE is a measure of the quality of an predictor/estimator
It is always non-negative
Values closer to zero are better.

The equation to calculate the MSE is as follows:

\[\begin{equation} MSE=\frac{1}{n} \sum_{i=1}^{n}{(Y_i - \hat{Y_i})^2} \\ \text{where $n$ is the number of data points, $Y_i$ are the observed value}\\ \text{and $\hat{Y_i}$ are the predicted values} \end{equation}\]

To implement this, we will use the mse() function present in the Metrics Package, so remember to install the Metrics package and use library(Metrics) in the code for local use.

The syntax for mse() function is very simple:

mse(actual,predicted)
- actual: vector of the actual values of the attribute we want to predict.
- predicted: vector of the predicted values obtained using our model.

17.3 Snippet 2: Cross Validate your prediction

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KE1vZGVsTWV0cmljcylcblxudHJhaW4gPC0gcmVhZC5jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9Nb29keU5VTS5jc3ZcIilcbiNzY3JhbWJsZSB0aGUgdHJhaW4gZnJhbWVcbnY8LXNhbXBsZSgxOm5yb3codHJhaW4pKVxudlsxOjVdXG50cmFpblNjcmFtYmxlZDwtdHJhaW5bdiwgXVxuXG4jb25lIHN0ZXAgY3Jvc3N2YWxpZGF0aW9uXG5uIDwtIDEwMFxudHJhaW5TYW1wbGU8LXRyYWluU2NyYW1ibGVkW25yb3codHJhaW5TY3JhbWJsZWQpLW46bnJvdyh0cmFpblNjcmFtYmxlZCksIF1cbnRlc3RTYW1wbGUgPC0gdHJhaW5TY3JhbWJsZWRbMTpuLF1cblxubG0udHJlZSA8LSBsbShDbGFzc1Njb3Jlfi4sICBkYXRhPXRyYWluU2FtcGxlKVxubG0udHJlZVxuXG5wcmVkIDwtIHByZWRpY3QobG0udHJlZSxuZXdkYXRhPXRlc3RTYW1wbGUpXG5wcmVkXG5cbm1zZSh0ZXN0U2FtcGxlJENsYXNzU2NvcmUscHJlZCkifQ==

We can see that,

The summary of the lm model give us information about the parameters of the model, the residuals and coefficients, etc.
The predicted values are obtained form the predict function using the trained model and the test data. In comparison to the previous model we are using the cross validation technique to check that if we have more accurate predictions, thus increasing the overall accuracy of the model.

17.4 Snippet 3: Submission with lm

eyJsYW5ndWFnZSI6InIiLCJzYW1wbGUiOiJsaWJyYXJ5KHJwYXJ0KVxudGVzdDwtcmVhZC5jc3YoJ2h0dHBzOi8vcmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbS9kZXY3Nzk2L2RhdGExMDFfdHV0b3JpYWwvbWFpbi9maWxlcy9kYXRhc2V0L01vb2R5TlVNX3Rlc3QuY3N2JylcbnN1Ym1pc3Npb248LXJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vZGV2Nzc5Ni9kYXRhMTAxX3R1dG9yaWFsL21haW4vZmlsZXMvZGF0YXNldC9NMjAyMnN1Ym1pc3Npb24uY3N2JylcbnRyYWluIDwtIHJlYWQuY3N2KFwiaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Rldjc3OTYvZGF0YTEwMV90dXRvcmlhbC9tYWluL2ZpbGVzL2RhdGFzZXQvTW9vZHlOVU0uY3N2XCIpXG5cbnRyZWUgPC0gbG0oQ2xhc3NTY29yZX4uLCAgZGF0YT10cmFpbilcbnRyZWVcblxucHJlZGljdGlvbiA8LSBwcmVkaWN0KHRyZWUsIG5ld2RhdGE9dGVzdClcblxuI05vdyBtYWtlIHlvdXIgc3VibWlzc2lvbiBmaWxlIC0gaXQgd2lsbCBoYXZlIHRoZSBJRHMgYW5kIG5vdyB0aGUgcHJlZGljdGVkIGdyYWRlc1xuc3VibWlzc2lvbiRHcmFkZTwtcHJlZGljdGlvbiBcblxuIyB1c2Ugd3JpdGUuY3N2KHN1Ym1pc3Npb24sICdzdWJtaXNzaW9uLmNzdicsIHJvdy5uYW1lcz1GQUxTRSkgdG8gc3RvcmUgc3VibWlzc2lvbiBhcyBjc3YgZmlsZSBvbiB5b3VyIG1hY2hpbmUgYW5kIHN1YnNlcXVlbnRseSBzdWJtaXQgaXQgb24gS2FnZ2xlIn0=