A simple linear regression assesses the linear relationship between two continuous variables to predict the value of a dependent variable based on the value of an independent variable. More specifically, it will let you: (a) determine whether the linear relationship between these two variables is statistically significant; (b) determine how much of the variation in the dependent variable is explained by the independent variable; (c) understand the direction and magnitude of any relationship; and (d) predict values of the dependent variable based on different values of the independent variable.
Note: This test is also known by a number of different names, including a bivariate linear regression, but it is often referred to simply as a ‘linear regression’. Furthermore, the dependent variable is also referred to as the outcome, target or criterion variable, and the independent variable as the predictor, explanatory or regressor variable.
For example, you can use simple linear regression to predict lawyers’ salaries based on the number of years they have practiced law (i.e., your dependent variable would be “salary” and your independent variable would be “years practicing law”). You could also determine how much of the variation in lawyers’ salaries can be attributed to the number of years they have practiced law. You could also use linear regression to predict the distance women can run in 30 minutes based on their VO2max, which is a measure of fitness (i.e., your dependent variable would be “distance run” and your independent variable would be “VO2max”). Again, you could determine how much of the variation in the distance run could be attributed to the women’s VO2max scores.
In order to run a linear regression analysis, there are seven assumptions that need to be considered. The first two assumptions relate to your choice of study design and the measurements you chose to make, whilst the other five assumptions relate to how your data fits the linear regression model. These assumptions are:
- Assumption #1: You have one dependent variable that is measured at the continuous level. Examples of continuous variables include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth.
Note: The dependent variable is also referred to as the outcome, target or criterion variable.
- Assumption #2: You have one independent variable that is measured at the continuous level. See the bullet above for examples of continuous variables.
Note: The independent variable is also referred to as the predictor, explanatory or regressor variable.
- Assumption #3: There needs to be a linear relationship between the dependent and independent variables
There needs to be a linear relationship between your dependent and independent variables. In this example, there needs to be a linear relationship between ‘time spent watching TV’ and ‘cholesterol concentration’ (the variables time_tv and cholesterol, respectively). There is more than one way to determine if a linear relationship exists. In this guide, we show you how to visually inspect a scatterplot of the dependent variable plotted against the independent variable to see if a linear relationship exists. If the relationship approximately follows a straight line, you have a linear relationship. However, if you have something other than a straight line, for example, a curved line, you do not have a linear relationship.
- Assumption #4: You should have independence of observations, which you can easily check using the Durbin-Watson statistic
An important assumption of linear regression is that the errors are independent. In linear regression, the residuals act in place of the errors, so the residuals need to be independent. If the errors/residuals are not independent, they are referred to as correlated. Basically, having independent residuals means that one residual cannot provide any information about any other residual. Independence can be broken in many ways, but arguably the most common cause in linear regression is time series data (e.g., weather forecasts). In this situation, it is very likely that observations close together in time are more alike than observations further apart, which can lead to correlated errors/residuals. A lack of independent errors can also occur if there are improvements or detriments over time in how the dependent variable is measured. For example, you might have measured body fat with skin calipers, but not practiced enough before you started the study. As such, your ability to use the skin calipers improves throughout the study, leading to correlated errors/residuals. There are many other ways in which errors/residuals might be correlated.
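The Durbin-Watson statistic mentioned in this assumption ranges from 0 to 4, with values near 2 suggesting no first-order autocorrelation in the residuals. SPSS Statistics reports it for you; the sketch below, with made-up residuals, simply shows how it is calculated:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive residual
    differences divided by the sum of squared residuals. Ranges from
    0 to 4; values near 2 suggest independent (uncorrelated) residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Perfectly alternating residuals are strongly negatively
# autocorrelated, so the statistic sits well above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))   # 3.0
```

In practice you would pass in the residuals from your fitted regression rather than hand-typed values.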
- Assumption #5: There should be no significant outliers
Outliers, leverage points and influential cases are all examples of unusual points. We focus on outliers, which are cases where the observed value of the dependent variable is very different from its predicted value (i.e., the difference is evident on the y-axis, where an observation [data point] does not follow the usual pattern of points). Outliers can have: (a) a detrimental effect on the regression equation and statistical inferences; (b) a large effect on the variability of residuals, leading to problems with normality or homoscedasticity, which leads to a reduction in the accuracy of prediction; and (c) a significant effect on the line of best fit (regression line).
- Assumption #6: Your data needs to show homoscedasticity
The assumption of homoscedasticity is an important assumption of linear regression and indicates that the variance of the errors (residuals) is constant across all values of the independent variable. Because the residuals act as estimates of the errors, this assumption of equal error variances can be checked by inspecting a plot of the unstandardized or standardized residual values against the unstandardized or standardized predicted values (also known as the fitted values). We show you how to interpret a plot of the latter – the standardized residuals against the standardized predicted values. If you have heteroscedasticity, such that your residuals are not evenly spread (but are spread in an increasing funnel, decreasing funnel or fan shape, for example), we explain ways to proceed with your analysis.
- Assumption #7: You need to check that the residuals (errors) of the regression line are approximately normally distributed
SPSS Statistics produces two graphical measures that can be used to assess normality: (a) a histogram (with superimposed normal curve) of the standardized residuals; and (b) a normal probability plot (i.e., a Normal P-P Plot).
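The idea behind the Normal P-P plot can be sketched numerically: compare the observed cumulative proportion of each sorted residual with the cumulative probability a fitted normal distribution assigns to it. Points hugging the diagonal (small deviations) indicate approximately normal residuals. The residuals below are invented for illustration:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical residuals from a regression fit (illustrative only).
residuals = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]

# Fit a normal distribution to the residuals.
dist = NormalDist(mean(residuals), stdev(residuals))
ordered = sorted(residuals)
n = len(ordered)

# Each pair is one point on the P-P plot:
# (expected cumulative probability, observed cumulative proportion).
pp_points = [(dist.cdf(e), (i + 0.5) / n) for i, e in enumerate(ordered)]
max_dev = max(abs(exp - obs) for exp, obs in pp_points)
print(round(max_dev, 3))   # small deviation -> close to the diagonal
```

In SPSS Statistics you would simply read the plot; large, systematic departures from the diagonal suggest non-normal residuals.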
After running the linear regression procedure and testing that your data meet the assumptions of a linear regression in the previous two sections, SPSS Statistics will have generated a number of tables that contain all the information you need to report the results of your linear regression.
There are three main objectives that you can achieve with the output from a simple linear regression: (1) determine the proportion of the variation in the dependent variable explained by the independent variable; (2) predict dependent variable values based on new independent variable values; and (3) determine how much the dependent variable changes for a one unit change in the independent variable. Each of these objectives is addressed in the following sections.
When interpreting and reporting your results from a linear regression, we suggest working through three stages: (a) determine whether the linear regression model is a good fit for the data; (b) understand the coefficients of the regression model; and (c) make predictions of the dependent variable based on values of the independent variable. To recap:
- First, you need to determine whether the linear regression model is a good fit for the data: There are a number of statistics you can use to determine whether the linear regression model is a good fit for the data. These are: (a) the percentage of variance explained; (b) the statistical significance of the overall model; and (c) the precision of the predictions from the regression model. The Model Summary and ANOVA tables contain all the information you need to evaluate (a) and (b), whilst (c) is addressed in the third bullet below.
- Second, you need to understand the coefficients of the regression model: Now that you have interpreted the overall model fit, you can interpret and report the coefficients of the regression model. These coefficients are useful in order to understand whether there is a linear relationship between the two variables. In addition, you can use the regression equation to calculate predicted values of cholesterol concentration for given values of average daily time spent watching TV.
- Third, you can use SPSS Statistics to make predictions of the dependent variable based on values of the independent variable: For example, you can use the regression equation from the previous section to predict cholesterol concentration for different average daily times watching TV (e.g., cholesterol concentration for an average of 160 minutes of TV per day).