Section: 18 Linear Regression
18.1 Introduction
How do we build prediction models for numerical variables?
So far we have discussed prediction models for categorical target variables. To predict numerical variables, we often use linear regression.
18.2 Linear regression using lm() function
Syntax for building the regression model using the lm() function is as follows:
lm(formula, data, ...)
- formula: here we specify the column to predict and the related columns (predictors) on which the prediction is based:
prediction ~ predictor1 + predictor2 + predictor3 + ...
- data: here we provide the dataset on which the linear regression model is to be trained.
For more information on the lm() function, see its R documentation (?lm).
Let's look at an example on the Moody dataset.
Midterm | Project | FinalExam | ClassScore
---|---|---|---
73 | 8 | 70 | 39.60000
61 | 100 | 20 | 68.20000
58 | 88 | 38 | 67.00000
93 | 41 | 46 | 52.47565
85 | 52 | 85 | 68.50000
97 | 48 | 19 | 49.10000
26 | 59 | 22 | 41.30000
58 | 62 | 25 | 50.10000
53 | 56 | 27 | 46.70000
66 | 27 | 17 | 34.80494
Imagine that we do not know the weights of the midterm, project and final exam, but we do have data from previous semesters. Can we recover these weights? The answer is yes, by using linear regression.
18.2.1 How much do Midterm, Project and Final Exam count?
From the output of the model, we can see that:
- The summary of the lm model gives us information about the parameters of the model, such as the residuals and the coefficients.
- The predicted values are obtained from the predict function using the trained model and the test data.
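The steps above can be sketched in R using the Moody table from this section. The data frame name `moody` and the hypothetical new student are our own choices for illustration:

```r
# Moody data from the table above
moody <- data.frame(
  Midterm    = c(73, 61, 58, 93, 85, 97, 26, 58, 53, 66),
  Project    = c(8, 100, 88, 41, 52, 48, 59, 62, 56, 27),
  FinalExam  = c(70, 20, 38, 46, 85, 19, 22, 25, 27, 17),
  ClassScore = c(39.60000, 68.20000, 67.00000, 52.47565, 68.50000,
                 49.10000, 41.30000, 50.10000, 46.70000, 34.80494)
)

# Train the model: ClassScore predicted from Midterm, Project and FinalExam
model <- lm(ClassScore ~ Midterm + Project + FinalExam, data = moody)

# Residuals, coefficients and other parameters of the fitted model
summary(model)

# The fitted coefficients are the estimated weights of each component
coef(model)

# Predict the class score for a new (hypothetical) student
new_student <- data.frame(Midterm = 80, Project = 90, FinalExam = 75)
predict(model, new_student)
```

The formula `ClassScore ~ Midterm + Project + FinalExam` follows exactly the `prediction ~ predictor1 + predictor2 + predictor3` pattern shown earlier.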
18.3 Calculating the Error using mse()
In classification, checking accuracy was simple: we could directly compare the predicted categories with the actual categories. This kind of direct comparison won't work for numerical predictions, since a predicted number rarely matches the actual value exactly.
We don't want to eyeball the predictions row by row every time, so let's look at a statistical method for measuring the accuracy of our predictions.
To do this we will use the Mean Squared Error(MSE).
- The MSE is a measure of the quality of a predictor/estimator.
- It is always non-negative
- Values closer to zero are better.
The equation to calculate the MSE is as follows:
\[\begin{equation} MSE=\frac{1}{n} \sum_{i=1}^{n}{(Y_i - \hat{Y_i})^2} \\ \text{where $n$ is the number of data points, $Y_i$ are the observed values}\\ \text{and $\hat{Y_i}$ are the predicted values} \end{equation}\]
To implement this, we will use the mse() function from the Metrics package, so remember to install the Metrics package and load it with library(Metrics) in your code for local use.
The syntax for mse() function is very simple:
mse(actual,predicted)
- actual: vector of the actual values of the attribute we want to predict.
- predicted: vector of the predicted values obtained using our model.
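As a quick sanity check of the formula, the MSE can also be computed by hand in base R. The toy vectors below are purely illustrative; the result matches what Metrics::mse() returns for the same inputs:

```r
actual    <- c(3.0, 5.0, 2.0)   # observed values Y_i
predicted <- c(2.5, 5.5, 2.0)   # predicted values Y_hat_i

# MSE = (1/n) * sum((Y_i - Y_hat_i)^2)
mse_by_hand <- mean((actual - predicted)^2)
mse_by_hand   # 0.1666667

# Equivalent call using the Metrics package:
# library(Metrics)
# mse(actual, predicted)
```

Here the squared errors are 0.25, 0.25 and 0, so their mean is 1/6.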
18.4 Cross Validate your prediction
From the output of the model, we can see that:
- The summary of the lm model gives us information about the parameters of the model, such as the residuals and the coefficients.
- The predicted values are obtained from the predict function using the trained model and the test data. In contrast to the previous approach, here we use cross-validation, which gives us a more reliable estimate of the model's prediction accuracy.
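A minimal k-fold cross-validation loop can be written in base R. This is a sketch using the small Moody table from this section; the fold count, seed and variable names are our own choices:

```r
# Moody data from the table in Section 18.2
moody <- data.frame(
  Midterm    = c(73, 61, 58, 93, 85, 97, 26, 58, 53, 66),
  Project    = c(8, 100, 88, 41, 52, 48, 59, 62, 56, 27),
  FinalExam  = c(70, 20, 38, 46, 85, 19, 22, 25, 27, 17),
  ClassScore = c(39.60000, 68.20000, 67.00000, 52.47565, 68.50000,
                 49.10000, 41.30000, 50.10000, 46.70000, 34.80494)
)

set.seed(1)                                   # reproducible fold assignment
k <- 5
folds <- sample(rep(1:k, length.out = nrow(moody)))

# For each fold: train on the other folds, test on this one, record the MSE
fold_mse <- sapply(1:k, function(i) {
  train <- moody[folds != i, ]
  test  <- moody[folds == i, ]
  fit   <- lm(ClassScore ~ Midterm + Project + FinalExam, data = train)
  mean((test$ClassScore - predict(fit, test))^2)
})

mean(fold_mse)   # cross-validated estimate of the model's MSE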