Multiple Linear Regression
Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable (target) and one or more independent variables (predictors).
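In its general form, the model can be written as

y = β0 + β1x1 + β2x2 + … + βpxp + ε

where y is the dependent variable, x1 … xp are the p predictors, β0 … βp are the regression coefficients, and ε is the error term.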
MLR is based on ordinary least squares (OLS): the model is fit such that the sum of squares of the differences between the observed and predicted values is minimized.
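As a concrete illustration, here is a minimal sketch of an OLS fit via the normal equations, b = (X'X)-1X'y. The data and variable names are made up for illustration; numpy is assumed to be available.

```python
import numpy as np

# Made-up data: n = 5 observations, p = 2 predictors (illustrative only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.0, 4.0, 7.0, 8.0, 11.0])

# Design matrix with an intercept column, then solve the normal equations (X'X)b = X'y
Xd = np.column_stack([np.ones(len(y)), X])
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)   # [b0, b1, b2]

y_hat = Xd @ beta                             # predicted values
sse = np.sum((y - y_hat) ** 2)                # sum of squared errors, minimized by OLS
```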
The MLR model is based on several assumptions (e.g., the errors are normally distributed with zero mean and constant variance). Provided the assumptions are satisfied, the regression estimators are optimal in the sense that they are unbiased, efficient, and consistent. Unbiased means that the expected value of the estimator is equal to the true value of the parameter. Efficient means that the estimator has a smaller variance than any other unbiased estimator. Consistent means that the bias and variance of the estimator approach zero as the sample size approaches infinity.
How good is the model?
R2, also called the coefficient of determination, summarizes the explanatory power of the regression model and is computed from the sums-of-squares terms.
R2 describes the proportion of variance of the dependent variable explained by the regression model. If the regression model is “perfect”, SSE is zero and R2 is 1. If the regression model is a total failure, SSE is equal to SST, no variance is explained by the regression, and R2 is zero. It is important to keep in mind that a high R2 does not imply causation.
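In terms of the sums of squares, SST = SSR + SSE and R2 = SSR/SST = 1 − SSE/SST. Continuing the OLS sketch above (the variables y, y_hat, and sse come from that sketch):

```python
sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)    # regression sum of squares
r2 = 1.0 - sse / sst                     # coefficient of determination
```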
How significant is the model?
The F-ratio estimates the statistical significance of the regression model and is computed from the mean squared terms in the ANOVA table. The significance of the F-ratio is obtained by referring to the F distribution table using two degrees of freedom (dfMSR, dfMSE). p is the number of independent variables (e.g., p is one for simple linear regression).
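Concretely, MSR = SSR/p and MSE = SSE/(n − p − 1), so F = MSR/MSE with dfMSR = p and dfMSE = n − p − 1. Continuing the sketch (scipy is assumed to be available for the p-value):

```python
from scipy import stats

n, p = X.shape                                  # sample size and number of predictors
msr = ssr / p                                   # mean square due to regression
mse = sse / (n - p - 1)                         # mean squared error
f_ratio = msr / mse
f_pvalue = stats.f.sf(f_ratio, p, n - p - 1)    # upper-tail probability of the F distribution
```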
The advantage of the F-ratio over R2 is that the F-ratio incorporates the sample size and the number of predictors when assessing the significance of the regression model. A model can have a high R2 and still not be statistically significant.
How significant are the coefficients?
If the regression model is significant, we can use a t-test to estimate the statistical significance of each coefficient.
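For coefficient bj, t = bj / SE(bj), where SE(bj) is the square root of MSE times the j-th diagonal element of (X'X)-1, and the statistic is referred to a t distribution with n − p − 1 degrees of freedom. Continuing the sketch:

```python
# Coefficient covariance matrix: MSE * (X'X)^-1, using the intercept-augmented design matrix
cov_beta = mse * np.linalg.inv(Xd.T @ Xd)
se_beta = np.sqrt(np.diag(cov_beta))

t_stats = beta / se_beta
t_pvalues = 2 * stats.t.sf(np.abs(t_stats), n - p - 1)   # two-sided p-values
```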
Example
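The worked example that originally accompanied this section is not reproduced here. As a stand-in, here is a minimal end-to-end sketch on made-up data using statsmodels (an assumed dependency); it reports the same quantities discussed above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                               # made-up predictors
y = 2.0 + X @ np.array([1.5, -0.7, 0.0]) + rng.normal(scale=0.5, size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.rsquared)                  # R2
print(model.fvalue, model.f_pvalue)    # F-ratio and its significance
print(model.tvalues, model.pvalues)    # t statistics and p-values for the coefficients
```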
Multicollinearity
A high degree of multicollinearity between predictors produces unreliable regression coefficient estimates. Signs of multicollinearity include:
- High correlation between pairs of predictor
variables.
- Regression coefficients whose signs or magnitudes do not make
good physical sense.
- Statistically nonsignificant regression coefficients on important
predictors.
- Extreme sensitivity of sign or magnitude of
regression coefficients to insertion or deletion of a predictor.
The diagonal values of the (X'X)-1 matrix (computed from the standardized predictors) are called Variance Inflation Factors (VIFs), and they are very useful measures of multicollinearity. If any VIF exceeds 5, multicollinearity is a problem.
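Equivalently, the VIFs are the diagonal of the inverse of the predictors' correlation matrix. A minimal sketch (numpy assumed; X holds the predictor columns):

```python
corr = np.corrcoef(X, rowvar=False)      # correlation matrix of the predictors
vif = np.diag(np.linalg.inv(corr))       # variance inflation factors
problematic = vif > 5                    # flag predictors with VIF above 5
```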
Model Selection
A frequent problem in data mining is to avoid predictors that do not contribute significantly to model prediction. First, it has been shown that dropping predictors that have insignificant coefficients can reduce the average error of predictions. Second, estimation of regression coefficients is likely to be unstable due to multicollinearity in models with many variables. Finally, a simpler model is a better model, giving more insight into the influence of the predictors. There are two main methods of model selection:
- Forward selection: the best predictors are entered into the model, one by one (see the sketch after this list).
- Backward elimination: the worst predictors are eliminated from the model, one by one.
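As a rough illustration, here is a minimal sketch of forward selection using adjusted R2 as the entry criterion (statsmodels assumed, as in the example above; the stopping rule is one common choice among several):

```python
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y):
    """Greedily add the predictor that most improves adjusted R2; stop when nothing helps."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_adj_r2 = -np.inf
    improved = True
    while improved and remaining:
        improved = False
        scores = []
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            scores.append((fit.rsquared_adj, j))
        adj_r2, best_j = max(scores)
        if adj_r2 > best_adj_r2:
            best_adj_r2 = adj_r2
            selected.append(best_j)
            remaining.remove(best_j)
            improved = True
    return selected
```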