A multiple regression is used to predict a continuous dependent variable based on multiple independent variables. As such, it extends simple linear regression, which is used when you have only one continuous independent variable. Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained.
Note 1: The dependent variable can also be referred to as the “outcome”, “target” or “criterion” variable, whilst the independent variables can be referred to as “predictor”, “explanatory” or “regressor” variables. It does not matter which of these you use, but we will continue to use “dependent variable” and “independent variable” for consistency.
Note 2: This guide deals with “standard” multiple regression rather than a specific type of multiple regression, such as hierarchical multiple regression or stepwise regression, amongst others.
For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance, course studied and gender. Here, your continuous dependent variable would be “exam performance”, whilst you would have five independent variables: three continuous independent variables (“revision time”, measured in hours; “test anxiety”, measured using the TAI index; and “lecture attendance”, measured as a percentage of classes attended), one nominal independent variable (“course studied”, which has four groups: business, psychology, biology and mechanical engineering) and one dichotomous independent variable (“gender”, which has two groups: “males” and “females”). You could also use multiple regression to determine how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance, course studied and gender “as a whole”, as well as the “relative contribution” of each of these independent variables in explaining the variance.
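Written as an equation, the exam performance example might look like the sketch below. The symbols (b for the coefficients, e for the error term), the choice of business as the reference category for course studied, and the 0/1 coding of gender are assumptions made for illustration. Note that the nominal variable with four groups enters the model as three dummy variables.

```latex
\begin{aligned}
\text{exam performance}_i ={}& b_0 + b_1\,\text{revision time}_i + b_2\,\text{test anxiety}_i + b_3\,\text{lecture attendance}_i \\
&+ b_4\,\text{psychology}_i + b_5\,\text{biology}_i + b_6\,\text{mechanical engineering}_i + b_7\,\text{gender}_i + e_i
\end{aligned}
```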
- Assumption #1: You have one dependent variable that is measured at the continuous level (i.e., the interval or ratio level). Examples of continuous variables include height (measured in centimetres), temperature (measured in °C), salary (measured in US dollars), revision time (measured in hours), intelligence (measured using IQ score), firm size (measured in terms of the number of employees), age (measured in years), reaction time (measured in milliseconds), grip strength (measured in kg), weight (measured in kg), power output (measured in watts), test performance (measured from 0 to 100), sales (measured in number of transactions per month), academic achievement (measured in terms of GMAT score), and so forth.
Note 1: You should note that SPSS Statistics refers to continuous variables as Scale variables.
Note 2: The dependent variable can also be referred to as the “outcome”, “target” or “criterion” variable. It does not matter which of these you use, but we will continue to use “dependent variable” for consistency.
- Assumption #2: You have two or more independent variables that are measured either at the continuous or nominal level. Examples of continuous variables are provided above. Examples of nominal variables include gender (e.g., two categories: male and female), ethnicity (e.g., three categories: Caucasian, African American, and Hispanic), physical activity level (e.g., four categories: sedentary, low, moderate and high) and profession (e.g., five categories: surgeon, doctor, nurse, dentist, and therapist).
Note 1: The “categories” of the independent variable are also referred to as “groups” or “levels”. The term “levels” is usually reserved for the categories of an ordinal variable (e.g., an ordinal variable such as “fitness level”, which has three levels: “low”, “moderate” and “high”), but in practice these three terms (“categories”, “groups” and “levels”) are often used interchangeably. We refer to them as categories in this guide.
Note 2: An independent variable with only two categories is known as a dichotomous variable whereas an independent variable with three or more categories is referred to as a polytomous variable.
Important: If one of your independent variables was measured at the ordinal level, it can still be entered in a multiple regression, but it must be treated as either a continuous or nominal variable. It cannot be entered as an ordinal variable. Examples of ordinal variables include Likert items (e.g., a 7-point scale from strongly agree through to strongly disagree), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), customer liking a product (ranging from “Not very much”, to “It is OK”, to “Yes, a lot”), and so forth.
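Outside SPSS Statistics, entering a nominal variable into a regression usually means converting it to dummy (0/1) variables, one for each category except a reference category. The following is a minimal Python sketch of that idea, assuming NumPy is available. The helper name dummy_code, the sample values and the choice of “business” as the reference category are all hypothetical.

```python
import numpy as np

def dummy_code(values, reference):
    """Return (matrix, names): one 0/1 column per category except the reference."""
    values = np.asarray(values)
    # Every category other than the reference gets its own indicator column.
    names = [c for c in sorted(set(values.tolist())) if c != reference]
    matrix = np.column_stack([(values == c).astype(float) for c in names])
    return matrix, names

# Hypothetical sample of the "course studied" variable from the example above.
course = ["business", "psychology", "biology", "mechanical engineering", "biology"]
dummies, names = dummy_code(course, reference="business")

print(names)    # column order of the dummy variables
print(dummies)  # rows for the reference category contain only zeros
```

A variable with four categories therefore contributes three columns to the model, and each coefficient is interpreted relative to the reference category.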
- Assumption #3: You should have independence of observations (i.e., independence of residuals). This assumption means that the errors of adjacent observations must not be correlated with one another; when they are, the data show 1st-order autocorrelation. This is largely a study design issue: if the observations are related (e.g., repeated measurements taken over time), you would need to use a different statistical approach, such as time series methods. In SPSS Statistics, independence of observations can be checked using the Durbin-Watson statistic, which tests for 1st-order autocorrelation.
- Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively. The assumption of linearity in a multiple regression needs to be tested in two parts (but in no particular order). First, you need to establish whether a linear relationship exists between the dependent variable and the independent variables collectively, which can be achieved by plotting a scatterplot of the studentized residuals against the (unstandardized) predicted values. Second, you need to establish whether a linear relationship exists between the dependent variable and each of your independent variables, which can be achieved using partial regression plots between each independent variable and the dependent variable (although you can ignore any categorical independent variables; e.g., gender).
- Assumption #5: Your data needs to show homoscedasticity of residuals (equal error variances). The assumption of homoscedasticity is that the variance of the residuals is the same for all values of the predicted dependent variable (i.e., the spread of the residuals around the line of best fit remains similar as you move along the line). To check for heteroscedasticity, you can use the plot you created to check linearity (Assumption #4), namely a scatterplot of the studentized residuals against the unstandardized predicted values.
- Assumption #6: Your data must not show multicollinearity. Multicollinearity occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model.
You can use SPSS Statistics to detect multicollinearity through an inspection of correlation coefficients and Tolerance/VIF values.
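To see where Tolerance and VIF values come from, the sketch below computes them by regressing each independent variable on all of the others: Tolerance is 1 minus the resulting R², and VIF is 1 divided by Tolerance. This is a minimal illustration using NumPy, not a reproduction of the SPSS Statistics procedure. The function name and the data are hypothetical, and the data are synthetic, with x3 deliberately built as a near copy of x1.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return a list of (tolerance, vif) pairs, one per column of X.

    X is an (n, k) array of independent variables without an intercept column.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    results = []
    for j in range(k):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        tolerance = 1.0 - r2
        results.append((tolerance, 1.0 / tolerance))
    return results

# Synthetic data: x3 is almost a copy of x1, so both should show a high VIF.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
results = vif_and_tolerance(np.column_stack([x1, x2, x3]))

for name, (tol, vif) in zip(["x1", "x2", "x3"], results):
    print(f"{name}: Tolerance = {tol:.3f}, VIF = {vif:.1f}")
```

A commonly quoted rule of thumb treats a VIF above about 10 (a Tolerance below 0.1) as a warning sign, although cut-offs vary between sources.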
- Assumption #7: There should be no significant outliers, high leverage points or highly influential points. Outliers, high leverage points and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line, and an observation can be classified as more than one type of unusual point. All of these points can distort the regression equation that is used to predict the value of the dependent variable based on the independent variables. This can change the output that SPSS Statistics produces and reduce both the predictive accuracy of your results and their statistical significance. Fortunately, when using SPSS Statistics to run multiple regression on your data, you can detect possible outliers, high leverage points and highly influential points.
- Assumption #8: You need to check that the residuals (errors) are approximately normally distributed. In order to be able to run inferential statistics (i.e., determine statistical significance), the errors in prediction (the residuals) need to be approximately normally distributed. Two common methods you can use to check for the assumption of normality of the residuals are: (a) a histogram with superimposed normal curve and a P-P Plot; or (b) a Normal Q-Q Plot of the studentized residuals.
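Several of the quantities mentioned in the assumptions above can be computed directly from a fitted model. The sketch below, which assumes NumPy and uses synthetic data, calculates the Durbin-Watson statistic, leverage values, studentized residuals and Cook's distance. The function name is hypothetical, the residuals are internally studentized, and the leverage values are the unadjusted diagonal of the hat matrix, so the numbers may be scaled differently from the values SPSS Statistics saves.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Fit OLS with an intercept and return common diagnostic quantities.

    X is an (n, k) array of independent variables; y is the (n,) dependent variable.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])  # design matrix with intercept
    p = A.shape[1]
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    predicted = A @ coef
    resid = y - predicted

    # Durbin-Watson: ranges from 0 to 4; values near 2 suggest no
    # 1st-order autocorrelation in the residuals.
    durbin_watson = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

    # Leverage: diagonal of the hat matrix A (A'A)^-1 A'.
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))

    # Internally studentized residuals and Cook's distance.
    mse = (resid @ resid) / (n - p)
    studentized = resid / np.sqrt(mse * (1.0 - leverage))
    cooks_distance = studentized ** 2 * leverage / (p * (1.0 - leverage))

    return {
        "predicted": predicted,
        "durbin_watson": durbin_watson,
        "leverage": leverage,
        "studentized": studentized,
        "cooks_distance": cooks_distance,
    }

# Synthetic data with one planted outlier at case 10.
rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 2))
y = 2.0 + X @ np.array([1.5, -0.5]) + rng.normal(scale=0.5, size=n)
y[10] += 8.0

diag = regression_diagnostics(X, y)
print("Durbin-Watson:", round(float(diag["durbin_watson"]), 2))
print("Case with largest |studentized residual|:",
      int(np.argmax(np.abs(diag["studentized"]))))
```

Plotting the "studentized" values against the "predicted" values gives the scatterplot described under Assumptions #4 and #5.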
After running the multiple regression procedure and testing that your data meet the assumptions of a multiple regression in the previous two sections, SPSS Statistics will have generated a number of tables that contain all the information you need to report the results of your multiple regression.
There are three main objectives that you can achieve with the output from a multiple regression: (1) determine the proportion of the variation in the dependent variable explained by the independent variables; (2) predict dependent variable values based on new values of the independent variables; and (3) determine how much the dependent variable changes for a one unit change in the independent variables. All of these objectives will be addressed in the following sections.
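As a rough illustration of objective (1), the sketch below computes the usual model fit statistics by hand: the multiple correlation coefficient R, R², adjusted R², the F statistic for the overall model and the standard error of the estimate. It assumes NumPy, uses synthetic data, and the function name fit_summary is hypothetical. The p-value for the F statistic is omitted because it requires an F distribution function that NumPy does not provide.

```python
import numpy as np

def fit_summary(X, y):
    """Return coefficients and overall fit statistics for OLS with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef

    ss_res = resid @ resid                  # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    std_error_estimate = np.sqrt(ss_res / (n - k - 1))

    return {
        "coef": coef,
        "R": np.sqrt(r2),
        "R2": r2,
        "adj_R2": adj_r2,
        "F": f_stat,
        "std_error_estimate": std_error_estimate,
    }

# Synthetic data generated from known (made-up) coefficients.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 10.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

summary = fit_summary(X, y)
for key in ["R", "R2", "adj_R2", "F", "std_error_estimate"]:
    print(f"{key}: {summary[key]:.3f}")
```

R² is the proportion of variance explained, adjusted R² corrects it for the number of independent variables, and the standard error of the estimate reflects the precision of the predictions.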
When interpreting and reporting your results from a multiple regression, we suggest working through three stages: (a) determine whether the multiple regression model is a good fit for the data; (b) understand the coefficients of the regression model; and (c) make predictions of the dependent variable based on values of the independent variables. To recap:
- First, you need to determine whether the multiple regression model is a good fit for the data: There are a number of statistics you can use to determine whether the multiple regression model is a good fit for the data. These are: (a) the multiple correlation coefficient; (b) the percentage (or proportion) of variance explained; (c) the statistical significance of the overall model; and (d) the precision of the predictions from the regression model.
- Second, you need to understand the coefficients of the regression model: These coefficients tell you how much the dependent variable changes for a one unit change in each independent variable, and whether there is a linear relationship between the dependent variable and each independent variable. Together, the coefficients form the regression equation, which you can use to calculate predicted values of the dependent variable (e.g., VO2max for a given set of values for age, weight, heart rate and gender).
- Third, you can use SPSS Statistics to make predictions of the dependent variable based on values of the independent variables: For example, you can use the regression equation from the previous section to predict VO2max for a different set of values for age, weight, heart rate and gender (e.g., the VO2max for a 30 year old male weighing 80 kg with a heart rate of 133 bpm).
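The prediction step can be sketched as follows, assuming NumPy. The data here are synthetic: the generating coefficients are invented purely to produce example data and are not estimates from any study, and the coding of gender (0 = female, 1 = male) is an assumption. The predicted value therefore illustrates the calculation only, not a real VO2max estimate.

```python
import numpy as np

# Synthetic data set standing in for the VO2max example.
rng = np.random.default_rng(7)
n = 100
age = rng.uniform(20, 60, n)           # years
weight = rng.uniform(50, 110, n)       # kg
heart_rate = rng.uniform(100, 180, n)  # bpm
gender = rng.integers(0, 2, n).astype(float)  # assumed coding: 0 = female, 1 = male

# Invented generating equation, for illustration only.
vo2max = (90.0 - 0.2 * age - 0.3 * weight - 0.1 * heart_rate
          + 10.0 * gender + rng.normal(0, 3, n))

# Estimate the regression coefficients (intercept first).
A = np.column_stack([np.ones(n), age, weight, heart_rate, gender])
coef, *_ = np.linalg.lstsq(A, vo2max, rcond=None)

# Predict for a 30 year old male weighing 80 kg with a heart rate of 133 bpm.
new_case = np.array([1.0, 30.0, 80.0, 133.0, 1.0])
predicted = float(new_case @ coef)

print("Coefficients:", np.round(coef, 3))
print("Predicted VO2max (synthetic example):", round(predicted, 2))
```

The prediction is simply the intercept plus each coefficient multiplied by the corresponding value of the independent variable, which is the same calculation you would perform with the regression equation reported by SPSS Statistics.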