Assumptions of Linear Regression

Tulsipatro · Mar 2, 2021


Why is it important to understand the assumptions of Linear Regression?

ANSCOMBE’S QUARTET

Anscombe's quartet is a set of four data sets that look completely different from each other yet share the same regression line [ y = 0.5x + 3 ].
They also share the same mean for both x and y.
This illustrates how deceptive summary numbers can be when performing regression.
Violating any of the regression assumptions puts us at risk of an inaccurate model.
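As a quick illustration, here is a minimal Python sketch (my own, not from the original post) that fits a line to each of the four Anscombe data sets using seaborn's built-in copy of the quartet; all four come out with roughly y = 0.5x + 3 and the same means:

```python
import numpy as np
import seaborn as sns

# Anscombe's quartet ships with seaborn as a sample dataset.
df = sns.load_dataset("anscombe")

# Fit a simple least-squares line to each of the four data sets.
for name, group in df.groupby("dataset"):
    slope, intercept = np.polyfit(group["x"], group["y"], deg=1)
    print(f"Dataset {name}: y = {slope:.2f}x + {intercept:.2f}, "
          f"mean x = {group['x'].mean():.2f}, mean y = {group['y'].mean():.2f}")
```

Plotting the four data sets, however, shows how different they really are, which is exactly the point of the quartet.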

1. Linearity
First, there must be a linear relationship between the variables. Regression lines will be misleading if the data isn't approximately linear.
One way to check this condition is to make a scatter plot of your data, as sketched below. If the points look like they can roughly fit a line, the condition is satisfied; but even when the data is generally linear, extreme values deserve attention.

Linear regression is sensitive to outliers. Because a single outlier can significantly shift the regression line, the best course of action is usually to remove extreme outliers from the data.
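A scatter-plot check might look like the following sketch; the simulated data and variable names here are placeholders of mine, not something from the post:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3 + 0.5 * x + rng.normal(0, 1, size=100)   # roughly linear toy data

# Scatter plot plus a fitted line to eyeball linearity.
slope, intercept = np.polyfit(x, y, deg=1)
xs = np.sort(x)
plt.scatter(x, y, alpha=0.6)
plt.plot(xs, intercept + slope * xs, color="red")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Do the points roughly follow a straight line?")
plt.show()
```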

2. Homoscedasticity
Secondly, the variance of the errors should be constant; the technical term for this is homoscedasticity.
When the residuals are plotted against the predicted values, the points should form a tube-like band of roughly constant width.

A cone-shaped pattern, where the errors keep getting larger, indicates heteroscedasticity (see the sketch below).
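A common way to check this is a residuals-versus-predicted plot. Here is a minimal sketch; the simulated, deliberately heteroscedastic data is my own assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
# The error spread grows with x, so this toy data is heteroscedastic.
y = 3 + 0.5 * x + rng.normal(0, 0.3 * x)

slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x
residuals = y - predicted

# A tube-like band means constant variance (homoscedasticity);
# a cone that widens means heteroscedasticity.
plt.scatter(predicted, residuals, alpha=0.6)
plt.axhline(0, color="red")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```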

3. Independence of Errors
Thirdly, the error terms, which are also called residuals, should not be correlated with one another.

In other words, the second error should not be predictable from the first, the third from the second, and so on.
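One quick, informal check (a sketch of mine; the formal Durbin-Watson test appears later in this post) is the correlation between each residual and the previous one:

```python
import numpy as np

# Residuals from a fitted regression, in the order the observations
# were collected (the order matters for this check); placeholder values.
residuals = np.array([0.5, 0.7, 0.6, -0.2, -0.4, -0.3, 0.1, 0.3])

# Lag-1 autocorrelation: correlate the residuals with themselves shifted by one.
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(f"Lag-1 autocorrelation of residuals: {lag1_corr:.2f}")

# Values near 0 suggest independent errors; values near +1 or -1 suggest
# autocorrelated errors.
```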

4. Normality of Error Distribution
The data points should be normally distributed around the regression line; this is called normality of the error distribution.
If we draw a histogram of the residuals, it should follow a normal distribution, meaning that the majority of the residuals have values close to zero, with only a few outliers.
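A histogram, optionally paired with a Q-Q plot, is easy to produce; a rough sketch (the residuals here are simulated stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
residuals = rng.normal(0, 1, size=200)   # stand-in for real model residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: should be roughly bell-shaped and centred near zero.
ax1.hist(residuals, bins=20)
ax1.set_title("Histogram of residuals")

# Q-Q plot: points should lie close to the straight reference line.
stats.probplot(residuals, dist="norm", plot=ax2)
plt.show()
```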

To summarize,
Firstly, there should be a linear trend in the data without the presence of extreme outliers.

Next, variance of errors should be constant.

Thirdly, independence of errors means that there is no correlation among the errors themselves.
Lastly, Normality of error distribution means that most of the errors are close to zero with few outliers.

For Multiple Linear Regression, there are two additional assumptions to check:
i. Overfitting
ii. Multicollinearity

Overfitting
There's a possibility that the algorithm fits the training data too well. Overfitting occurs when the model captures the trends in the training data so closely that it performs poorly on new, unseen data. One solution is to reduce the number of parameters, for example through regularization, which introduces mathematical constraints that favor simpler regression lines with fewer terms.
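As one example of regularization, ridge regression penalizes large coefficients. The scikit-learn sketch below is mine; the data and the alpha value are placeholders, not something from the post:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 10))            # 10 input variables, only 2 matter
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 0.5, size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)       # alpha sets the penalty strength

# The penalized model shrinks the coefficients of the irrelevant variables
# toward zero, which tends to help on unseen data.
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```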

Multicollinearity
When we add more input variables, relationships can form among them, and strongly correlated input variables make the individual coefficient estimates unreliable.

How many variables should you include in your training data?
The goal is to avoid overloading the training data with useless variables while also not eliminating potentially useful ones.

Working in Excel
i. Non-Normality of Residuals

> Compute the regression using the Data Analysis tool.
> Find the Standard Residuals.
> Create a bin range table from -3 to +3.
[The standard residuals are z-values, which range roughly from -3 to +3.]
> Plot a Histogram using Standard Residuals and Bin range.
> Interpret from the plot whether residuals are normally distributed or not.
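If you want to replicate these steps outside Excel, a rough Python equivalent might look like this (the residual values and the simple standardization are my own placeholders, not Excel's exact output):

```python
import numpy as np
import matplotlib.pyplot as plt

# Residuals from the regression output (placeholder values).
residuals = np.array([1.2, -0.8, 0.3, 0.5, -1.5, 2.1, -0.2, 0.9, -0.4, 0.1])

# Approximate "standard residuals": residuals divided by their standard
# deviation, so they behave like z-values and mostly fall between -3 and +3.
standard_residuals = residuals / residuals.std(ddof=1)

# Bin range from -3 to +3, then plot the histogram.
bins = np.arange(-3, 3.5, 0.5)
plt.hist(standard_residuals, bins=bins, edgecolor="black")
plt.xlabel("Standard residuals")
plt.ylabel("Frequency")
plt.title("Are the residuals roughly normal?")
plt.show()
```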

ii. Heteroscedasticity of Residuals
> Calculate the square of Standard Residuals.
> Plot a graph between Predicted values & square of Standard residuals.

From the graph, if the squared residuals increase as the predicted values increase, heteroscedasticity might be present in our model.
Residual plots also reveal heteroscedasticity through their tendency to fan out as the value of x increases.
This means that as x becomes larger, there is increasing dispersion or uncertainty in the response y.
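The same check sketched in Python (placeholder arrays; in practice the predicted values and residuals come from your regression output):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder outputs from a fitted regression.
predicted = np.array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
residuals = np.array([0.2, -0.3, 0.5, -0.6, 0.9, -1.1, 1.4, -1.8, 2.2, -2.5])

# Standardize the residuals, then square them.
standard_residuals = residuals / residuals.std(ddof=1)
squared = standard_residuals ** 2

# If the squared residuals trend upward with the predicted values,
# heteroscedasticity may be present.
plt.scatter(predicted, squared)
plt.xlabel("Predicted values")
plt.ylabel("Squared standard residuals")
plt.show()
```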

Breusch-Pagan Test for Heteroscedasticity
Run another regression with the squared residuals as y and the original explanatory variables as x, then check the p-value for F.
Here, the p-value (0.001031) is significant, so we reject the null hypothesis of homoscedastic error terms and accept the alternative hypothesis: the error terms are heteroscedastic.
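Outside Excel, the same test is available in statsmodels. A minimal sketch, assuming simulated data of my own (so the p-value will differ from the 0.001031 in the example above):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=100)
y = 3 + 0.5 * x + rng.normal(0, 0.4 * x)     # error spread grows with x

X = sm.add_constant(x)                        # add the intercept column
model = sm.OLS(y, X).fit()

# Breusch-Pagan regresses the squared residuals on the explanatory variables.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid,
                                                        model.model.exog)
print(f"F statistic: {f_stat:.3f}, p-value for F: {f_pvalue:.5f}")
# A small p-value (e.g. below 0.05) rejects the null of homoscedastic errors.
```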

iii. Autocorrelation
The Durbin-Watson statistic will always have a value between 0 and 4.
A value of 2.0 means that no autocorrelation is detected in the sample. Values from 0 to less than 2 indicate positive autocorrelation, and
values from 2 to 4 indicate negative autocorrelation.

Formula for Durbin-Watson Statistic
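For residuals e1, e2, …, en taken in observation order, the statistic is

DW = Σ (e_t - e_(t-1))² / Σ e_t²

with the numerator summed over t = 2 to n and the denominator over t = 1 to n. The steps below compute exactly this ratio.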

> Calculate the square of the Residuals.
> In a new column, leave a one-unit lag: subtract each residual from the one after it and square the difference. Continue down the column.
> Sum both columns, then divide the sum of the squared lagged differences by the sum of the squared residuals (see the sketch below).
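The same arithmetic in Python, alongside the statsmodels built-in, as a sketch with placeholder residuals:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Placeholder residuals, in observation order.
residuals = np.array([0.5, 0.7, 0.6, -0.2, -0.4, -0.3, 0.1, 0.3, 0.2, -0.1])

# Column 1: squared residuals.  Column 2: squared lag-1 differences.
sum_squared = np.sum(residuals ** 2)
sum_lag_diff_squared = np.sum(np.diff(residuals) ** 2)

dw_manual = sum_lag_diff_squared / sum_squared
print(f"Durbin-Watson (manual):      {dw_manual:.3f}")
print(f"Durbin-Watson (statsmodels): {durbin_watson(residuals):.3f}")
```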

iv. Multicollinearity
> Perform a regression for each explanatory variable, treating that variable as y and the remaining explanatory variables as x.
> Collect the R-squared from each of these regressions.
> Compute (1 - R squared).
> Calculate VIF using the formula 1 / (1 - R squared).
VIF < 5 : There is little or no evidence of multicollinearity with the other explanatory variables.
> Create Correlation Matrix
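The same calculation sketched in Python, using the statsmodels VIF helper and a pandas correlation matrix (the column names and simulated data are placeholders of mine):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(0, 0.3, size=100)   # deliberately correlated with x1
x3 = rng.normal(size=100)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# VIF = 1 / (1 - R squared), where R squared comes from regressing each
# explanatory variable on all the other explanatory variables.
exog = sm.add_constant(X)
for i, col in enumerate(X.columns, start=1):    # index 0 is the constant
    vif = variance_inflation_factor(exog.values, i)
    print(f"VIF for {col}: {vif:.2f}")

# Correlation matrix as a second look at pairwise relationships.
print(X.corr().round(2))
```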
