Linear Regression – Is It Really That Simple?



Special Notes:

Listen to our L&L lectures online: WHRI Lunch & Learn Series – Women’s Health Research Institute

Visit our Stats corner in the e-blast for previously published tips on data management and analysis: E-Blast Archive – Women’s Health Research Institute

Information about future lectures from my series can be found here: Sabina’s Stats Series – Women’s Health Research Institute

What is linear regression?

Simple linear regression is used to estimate the relationship between two quantitative variables. You can use simple linear regression when you want to know:

  • How strong the relationship is between two variables
  • The value of the dependent (continuous) variable at a certain value of the independent variable (any type)
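
As a minimal sketch of these two uses, the snippet below fits a straight line to a small made-up data set (the numbers are hypothetical, not from the lecture), reports the Pearson correlation as a measure of strength, and predicts the dependent variable at a new value of the independent variable. It assumes NumPy is available.

```python
import numpy as np

# Hypothetical example data (not from the lecture).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])

# Least-squares fit of y = intercept + slope * x.
# np.polyfit returns coefficients from the highest power down.
slope, intercept = np.polyfit(x, y, deg=1)

# Strength of the linear relationship: Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]

# Predicted value of the dependent variable at x = 6.
y_at_6 = intercept + slope * 6.0

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.3f}, prediction={y_at_6:.2f}")
```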

Important Steps when Performing Linear Regression

  • Check the assumptions.
  • If the assumptions of linear regression are not met, your estimates may be biased and your p-values and confidence intervals may be unreliable

Linear regression assumptions and how to check them:

  • Linear relationship: The relationship between the independent and dependent variables should be linear. The easiest way to check this assumption is to draw a scatter plot of the dependent variable against the independent variable.
  • It is also very important to check for outliers, since linear regression is sensitive to them.
  • Normality of residuals: This assumption is often described as multivariate normality. It refers to the idea that the model’s error terms, or residuals, should be normally distributed. In other words, the residuals should have a mean of zero and follow a bell-shaped curve. This assumption is important because it allows us to use various statistical tests and confidence intervals to make inferences about the model and its parameters. One common check is to plot a histogram of the residuals and visually inspect the distribution for evidence of normality. A normal probability plot (Q-Q plot) can also be used to assess the residuals’ normality graphically. Another option is a formal normality test, such as the Shapiro-Wilk test (usually used for small samples) or the Anderson-Darling or Kolmogorov-Smirnov test (usually used for larger samples), which tests the hypothesis that the residuals are normally distributed. If the normality assumption is not met, there are several potential implications for the analysis. One of the most serious is that the estimated standard errors and confidence intervals for the model’s parameters may be inaccurate. This, in turn, can affect the results of hypothesis tests and lead to incorrect inferences about the relationship between the dependent and independent variables. Furthermore, the validity of other statistical results, such as the F-test for the overall significance of the model, may also be compromised.

There are several ways to address the violation of this assumption:

  • Transform the dependent variable to make the residuals more normal (I’m not a big fan of transformation, as the results are then valid for the transformed variable and not the original variable).
  • Use non-linear regression or a more robust regression model.
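
The Q-Q plot check for residual normality can be sketched as follows. This is an illustration under assumptions, not a prescribed procedure: the function names are hypothetical, the plotting positions (i - 0.5)/n are one common convention among several, and the "residuals" are simulated rather than taken from a fitted model. The correlation between theoretical and observed quantiles is used here only as a rough numeric summary of how straight the Q-Q plot is.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, observed) quantiles for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Plotting positions (i - 0.5) / n: an assumed convention.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    # Standardize the sorted residuals so both axes share a scale.
    observed = (r - r.mean()) / r.std(ddof=1)
    return theoretical, observed

def qq_correlation(residuals):
    """Correlation of the Q-Q points; values near 1 suggest normality."""
    theoretical, observed = qq_points(residuals)
    return np.corrcoef(theoretical, observed)[0, 1]

# Simulated residuals for illustration only.
rng = np.random.default_rng(0)
normal_like = rng.normal(size=200)
skewed = rng.exponential(size=200)

print("normal-like:", round(qq_correlation(normal_like), 3))
print("skewed:     ", round(qq_correlation(skewed), 3))
```

Plotting the two arrays returned by qq_points against each other gives the Q-Q plot itself. Points that bend away from a straight line, as they would for the skewed sample, point to non-normal residuals.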

Multicollinearity of independent variables

Multicollinearity of independent variables can be tested in three simple ways:

  • Correlation matrix: when computing the matrix of Pearson’s bivariate correlations among all independent variables, the correlation coefficients should be well below 1 in absolute value (a common rule of thumb treats values above about 0.8 as a warning sign).
  • Tolerance: the tolerance is calculated from an initial set of linear regressions in which each independent variable is regressed on all the other independent variables. Tolerance is defined as T = 1 – R² for these first-step regressions. With T < 0.1 there might be multicollinearity in the data, and with T < 0.01 there almost certainly is.
  • Variance Inflation Factor (VIF): defined as VIF = 1/T. With VIF > 5, there is an indication that multicollinearity may be present; with VIF > 10, multicollinearity among the variables is very likely. These cut-offs, like the tolerance cut-offs, are rules of thumb rather than strict limits.
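
The three checks can be sketched with NumPy alone. The helper name tolerance_and_vif and the simulated predictors are assumptions made for illustration: x3 is deliberately built as nearly the sum of x1 and x2, and x4 is unrelated to the others.

```python
import numpy as np

def tolerance_and_vif(X):
    """Tolerance (1 - R^2) and VIF (1 / tolerance) for each column of X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    tol = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # First-step regression: predictor j on all other predictors.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        tol[j] = 1.0 - r2
    return tol, 1.0 / tol

# Simulated predictors for illustration only.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.1 * rng.normal(size=n)  # almost a linear combination
x4 = rng.normal(size=n)                  # unrelated predictor
X = np.column_stack([x1, x2, x3, x4])

corr = np.corrcoef(X, rowvar=False)      # Pearson correlation matrix
tol, vif = tolerance_and_vif(X)

print(np.round(corr, 2))
print("tolerance:", np.round(tol, 3))
print("VIF:      ", np.round(vif, 1))
```

In data built this way, the pairwise correlations between x3 and each of x1 and x2 are only moderate (roughly 0.7), yet the VIFs for x1, x2 and x3 are large. This is one reason to look at tolerance or VIF and not only at the correlation matrix.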

No autocorrelation

Autocorrelation refers to the degree of correlation between values of the same variable at successive time intervals. It occurs when the residuals are not independent of each other, in other words, when the value of y(x+1) is not independent of the value of y(x). Autocorrelation can be tested with the Durbin-Watson test, whose statistic is denoted d. While d can take values between 0 and 4, values around 2 indicate no autocorrelation. As a rule of thumb, values of 1.5 < d < 2.5 suggest that there is no autocorrelation in the data. The Durbin-Watson test has some limitations: it only detects linear autocorrelation, and only between direct neighbours (first-order effects). It is nevertheless usually sufficient for checking this assumption. A scatter plot (again) is a good visualization tool for spotting autocorrelation patterns in your data, for example by plotting the residuals against time or observation order.
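
The Durbin-Watson statistic is simple enough to compute by hand: the sum of squared differences between neighbouring residuals divided by the sum of squared residuals. The sketch below uses two tiny made-up residual series, which are assumed to be in time order.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson d for residuals given in time (or observation) order."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Made-up residual series for illustration only.
d_runs = durbin_watson([1, 1, 1, -1, -1, -1])  # neighbours tend to agree
d_flip = durbin_watson([1, -1, 1, -1])         # neighbours tend to disagree

print(round(d_runs, 3))  # below 1.5: positive autocorrelation
print(round(d_flip, 3))  # above 2.5: negative autocorrelation
```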


Homoscedasticity means that the variance of the residuals is constant along the regression line, that is, across all values of the independent variable. The Goldfeld-Quandt test can be used to test for heteroscedasticity. The test splits the data into two groups and tests whether the variances of the residuals are similar across the groups.
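
A simplified sketch of the Goldfeld-Quandt idea follows. It makes several assumptions: the data are sorted by the independent variable and split into two equal halves with no middle observations dropped, a separate straight line is fitted to each half, and only the test statistic is computed. Under the null hypothesis of equal variances this ratio follows an F distribution, so a p-value would need an F-distribution routine that is not included here. The function names and data are hypothetical.

```python
import numpy as np

def rss_of_line_fit(x, y):
    """Residual sum of squares from a straight-line fit of y on x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return resid @ resid

def goldfeld_quandt_stat(x, y):
    """Ratio of residual variances: upper half over lower half of x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    half = x.size // 2
    rss_low = rss_of_line_fit(x[:half], y[:half])
    rss_high = rss_of_line_fit(x[-half:], y[-half:])
    df = half - 2  # observations minus two fitted parameters per half
    return (rss_high / df) / (rss_low / df)

# Deterministic made-up data: an alternating +/- "noise" pattern.
x = np.arange(1.0, 41.0)
sign = np.where(np.arange(40) % 2 == 0, -1.0, 1.0)
y_equal = 2.0 * x + 0.5 * sign      # constant spread
y_fan = 2.0 * x + 0.1 * x * sign    # spread grows with x

print(round(goldfeld_quandt_stat(x, y_equal), 2))  # close to 1
print(round(goldfeld_quandt_stat(x, y_fan), 2))    # well above 1
```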

Please note that we talked about homoscedasticity and heteroscedasticity during our previous lectures so you can refresh your memory by reviewing them.

What's Next? Conducting & Interpreting our Linear Regression

Now that all the assumptions are checked and hopefully met, we can conduct and interpret our linear regression.

Conducting Linear Regression and Checking Model Fit

The most common method of performing linear regression is Ordinary Least Squares (OLS) regression. Three statistics are used in OLS regression to evaluate model fit:

  • R-squared
  • The overall F-test
  • Root Mean Square Error (RMSE)

All three are based on two sums of squares: Sum of Squares Total (SST) and Sum of Squares Error (SSE).
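
For reference, the usual textbook definitions can be written out as below. The notation is an assumption, since the lecture does not define symbols: n is the number of observations, p the number of predictors, y-bar the mean of the observed values and y-hat the fitted values. The RMSE is shown with the degrees-of-freedom denominator. Some software divides by n instead.

```latex
\begin{align}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2, &
\mathrm{SSE} &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
R^2 &= 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, &
F &= \frac{(\mathrm{SST} - \mathrm{SSE})/p}{\mathrm{SSE}/(n - p - 1)}, \\
\mathrm{RMSE} &= \sqrt{\frac{\mathrm{SSE}}{n - p - 1}}
\end{align}
```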


Most statistical software will produce two R-squared estimates: one unadjusted and one adjusted for the number of independent variables in the model. R-squared has the useful property that its scale is intuitive. It ranges from zero to one. Zero indicates that the proposed model does not improve prediction over the mean model. One indicates perfect prediction. Improvement in the regression model results in proportional increases in R-squared. One pitfall of R-squared is that it never decreases as predictors are added to the regression model. This increase is artificial when the added predictors are not actually improving the model’s fit. To remedy this, a related statistic, Adjusted R-squared, incorporates the model’s degrees of freedom.
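
The pitfall can be seen in a small simulation. The sketch below (hypothetical helper name, simulated data) fits one model with a single genuine predictor and a second model that adds three unrelated "junk" predictors, then reports R-squared and Adjusted R-squared for both.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """R-squared and Adjusted R-squared for an OLS fit with intercept."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ coef) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted

# Simulated data for illustration only.
rng = np.random.default_rng(42)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
junk = rng.normal(size=(n, 3))  # predictors unrelated to y

r2_small, adj_small = r2_and_adjusted(x.reshape(-1, 1), y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x, junk]), y)

print(f"1 predictor : R2={r2_small:.4f}  adjusted={adj_small:.4f}")
print(f"4 predictors: R2={r2_big:.4f}  adjusted={adj_big:.4f}")
```

R-squared for the larger model cannot be lower than for the smaller one. Adjusted R-squared carries a penalty for the extra predictors, so it may move in either direction.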

The F-test

The F-test evaluates the null hypothesis that all regression coefficients are equal to zero versus the alternative that at least one is not. An equivalent null hypothesis is that R-squared equals zero. A significant F-test indicates that the observed R-squared is reliable and is not a spurious result of oddities in the data set. Thus the F-test determines whether the proposed relationship between the response variable and the set of predictors is statistically reliable. It can be useful when the research objective is either prediction or explanation.


The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data – how close the observed data points are to the model’s predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance. It has the useful property of being in the same units as the response variable. Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response. It’s the most important criterion for fit if the main purpose of the model is prediction.
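
To make the three fit statistics concrete, the sketch below computes them from SST and SSE for a small made-up data set with one predictor. The data are hypothetical, and RMSE is computed with the residual degrees of freedom (n - p - 1) in the denominator, which is an assumption about the convention used.

```python
import numpy as np

# Hypothetical example data (not from the lecture).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.0])
n, p = x.size, 1  # observations and number of predictors

# OLS fit and fitted values.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)  # Sum of Squares Total
sse = np.sum((y - fitted) ** 2)    # Sum of Squares Error

r_squared = 1.0 - sse / sst
f_stat = ((sst - sse) / p) / (sse / (n - p - 1))
rmse = np.sqrt(sse / (n - p - 1))  # same units as y

print(f"SST={sst:.3f}  SSE={sse:.3f}")
print(f"R2={r_squared:.4f}  F={f_stat:.1f}  RMSE={rmse:.3f}")
```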

Interpretation of linear regression results


As we discussed in our previous lectures, regression analysis is an inferential analysis. The p-value helps us judge whether the relationship we observe in our sample also exists in the larger population. In linear regression, the p-value for each independent variable tests the null hypothesis that the variable has no linear association with the dependent variable, that is, that its coefficient is zero. If there is no association, changes in the independent variable are not accompanied by shifts in the dependent variable. A p-value above the chosen significance level means there is insufficient evidence to conclude that there is an effect at the population level. If the p-value is less than the chosen significance level, there is enough evidence to reject the null hypothesis (that there is no association).


The sign of a linear regression coefficient tells us whether the relationship between the two variables is positive or negative. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

The coefficient value shows how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.
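
As an illustration of reading signs and sizes of coefficients, the sketch below simulates data with one positive and one negative effect and recovers the coefficients, their standard errors and t-statistics with NumPy. The data-generating values (1, 2, -3) and variable names are assumptions for the example. A p-value would then come from comparing each t-statistic with a t distribution on n - p - 1 degrees of freedom, a step left out here to keep the sketch dependent on NumPy only.

```python
import numpy as np

# Simulated data: y rises with x1 and falls with x2 (illustration only).
rng = np.random.default_rng(7)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# OLS fit with an intercept column.
design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
resid = y - design @ coef
dof = n - design.shape[1]  # n - p - 1
sigma2 = resid @ resid / dof
cov = sigma2 * np.linalg.inv(design.T @ design)
std_err = np.sqrt(np.diag(cov))
t_stats = coef / std_err

for name, b, se, t in zip(["intercept", "x1", "x2"], coef, std_err, t_stats):
    print(f"{name:>9}: coef={b:+.3f}  se={se:.3f}  t={t:+.2f}")
```

The x1 coefficient estimates the change in the mean of y for a one-unit increase in x1 with x2 held constant, and likewise for x2.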

Summary – steps of performing linear regression (or any other regression)

  1. Prepare and clean your data
  2. Use exploratory analysis to understand your data and the important relationships between key variables
  3. Identify level of missingness and outliers
  4. Check for the linear regression assumptions
  5. If assumptions are met – perform the linear regression and interpret the results
  6. If not – consider data transformation or alternative, more sophisticated and robust models
  7. Please contact me if you have any questions regarding the fitting of linear regression for your specific analysis

Good luck with your linear regression adventure!

Contact Sabina for statistics help or questions here: