Binomial Logistic Regression - AMSTAT Consulting

A binomial logistic regression attempts to predict the probability that an observation falls into one of two categories of a dichotomous dependent variable based on one or more independent variables that can be either continuous or categorical.

In many ways, binomial logistic regression is similar to linear regression, with the exception of the measurement type of the dependent variable (i.e., linear regression uses a continuous dependent variable rather than a dichotomous one). However, unlike linear regression, you are not attempting to determine the predicted value of the dependent variable, but the probability of being in a particular category of the dependent variable given the independent variables. An observation is assigned to whichever category is predicted as most likely. As with other types of regression, binomial logistic regression can also use interactions between independent variables to predict the dependent variable.
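
As a sketch in standard notation (the symbols $p$, $\beta_j$ and $x_j$ are introduced here for illustration and are not used elsewhere in this guide), the model links the probability $p$ of being in the target category to $k$ independent variables through the logit:

```latex
\[
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right)
  = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
\]
```

The right-hand expression always lies between 0 and 1, which is why the model predicts a probability rather than a value of the dependent variable.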

Note: Binomial logistic regression is often referred to as just logistic regression.

For example, you could use binomial logistic regression to predict whether students will pass or fail an exam based on the amount of time they spend revising, whether English is their first language and their pre-exam stress levels. Here, your dichotomous dependent variable would be “exam performance”, which has two categories – “pass” and “fail” – and you would have three independent variables: the continuous variable, “time spent revising”, measured in hours, the dichotomous independent variable, “English as a first language”, which has two categories – “yes” and “no” – and the ordinal independent variable, “pre-exam stress levels”, which has three levels: “low stress”, “medium stress” and “high stress”.

In order to run a binomial logistic regression, there are seven assumptions that need to be considered. The first four assumptions relate to your choice of study design and the measurements you chose to make, whilst the other three assumptions relate to how your data fits the binomial logistic regression model. These assumptions are:

Assumption #1: You have one dependent variable that is dichotomous (i.e., a nominal variable with two outcomes). Examples of dichotomous variables include gender (two outcomes: “males” or “females”), presence of heart disease (two outcomes: “yes” or “no”), employment status (two outcomes: “employed” or “unemployed”), transport type (two outcomes: “bus” or “car”).

Note 1: The dependent variable can also be referred to as the “outcome”, “target” or “criterion” variable. It does not matter which of these you use, but we will continue to use “dependent variable” for consistency.

Note 2: We refer to the dependent variable as being a nominal variable with two “outcomes”, but it is also common to use the word “categories” (i.e., a variable such as “gender” would have two categories: “males” or “females”). Again, it does not matter which of these you use.

Assumption #2: You have one or more independent variables that are measured on either a continuous or nominal scale. Examples of continuous variables include height (measured in metres and centimetres), temperature (measured in °C), salary (measured in US dollars), revision time (measured in hours), intelligence (measured using IQ score), firm size (measured in terms of the number of employees), age (measured in years), reaction time (measured in milliseconds), grip strength (measured in kg), power output (measured in watts), test performance (measured from 0 to 100), sales (measured in number of transactions per month) and academic achievement (measured in terms of GMAT score). Examples of nominal variables include gender (e.g., two categories: male and female), ethnicity (e.g., three categories: Caucasian, African American and Hispanic) and profession (e.g., five categories: surgeon, doctor, nurse, dentist, therapist).

Note: The “categories” of the independent variable are also referred to as “groups” or “levels”, but the term “levels” is usually reserved for the categories of an ordinal variable (e.g., an ordinal variable such as “fitness level”, which has three levels: “low”, “moderate” and “high”). However, these three terms – “categories”, “groups” and “levels” – can be used interchangeably. We refer to them as categories in this guide.

Important: If one of your independent variables was measured at the ordinal level, it can still be entered in a binomial logistic regression, but it must be treated as either a continuous or nominal variable. It cannot be entered as an ordinal variable. Examples of ordinal variables include Likert items (e.g., a 7-point scale from strongly agree through to strongly disagree), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), customer liking a product (ranging from “Not very much”, to “It is OK”, to “Yes, a lot”), and so forth.
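
The two options for an ordinal independent variable can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data, the 1 to 3 scoring and the helper name dummy_code are all hypothetical, and SPSS Statistics handles this coding for you when you declare a variable as categorical.

```python
# Hypothetical ordinal variable: pre-exam stress level for five students.
stress = ["low", "high", "medium", "low", "high"]

# Option 1: treat the variable as continuous by scoring the ordered levels.
# Assumes equal spacing between levels, which is a modelling choice.
SCORES = {"low": 1, "medium": 2, "high": 3}
stress_continuous = [SCORES[level] for level in stress]


# Option 2: treat the variable as nominal using 0/1 indicator (dummy)
# variables, with one category held back as the reference.
def dummy_code(values, reference, categories):
    """Return one 0/1 column per non-reference category."""
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
        if category != reference
    }


stress_dummies = dummy_code(
    stress, reference="low", categories=["low", "medium", "high"]
)

print(stress_continuous)
print(stress_dummies)
```

Treating the variable as nominal uses one coefficient per non-reference category, whereas treating it as continuous uses a single coefficient but assumes the steps between levels are equal.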

Assumption #3: You should have independence of observations and the categories of the dichotomous dependent variable and all your nominal independent variables should be mutually exclusive and exhaustive.

Independence of observations means that there is no relationship between the observations in each category of the dependent variable or the observations in each category of any nominal independent variables. In addition, there is no relationship between the categories. Indeed, an important distinction is made in statistics when comparing values from either different individuals or from the same individuals.

To illustrate this, consider again the example from the Introduction where binomial logistic regression could be used to predict whether students will pass or fail an exam based on the amount of time they spend revising, whether English is their first language and their pre-exam stress levels. Here, your dichotomous dependent variable would be “exam performance”, which has two categories – “pass” and “fail” – and you would have three independent variables: the continuous variable, “time spent revising”, measured in hours, the dichotomous independent variable, “English as a first language”, which has two categories – “yes” and “no” – and the ordinal independent variable, “pre-exam stress levels”, which has three levels: “low stress”, “medium stress” and “high stress”.

In this scenario, having mutually exclusive and exhaustive categories means that a student could either “pass” or “fail” the exam. They could not pass and fail the exam. As such, the student has to be placed into one of the two categories of the dependent variable. The student cannot be placed into both categories. Similarly, take the dichotomous independent variable, “English as a first language”. The correct answer for the purposes of a binomial logistic regression is either “yes” or “no”. A student cannot be entered into both categories.

Independence of observations is largely a study design issue rather than something you can test for using SPSS Statistics, but it is an important assumption of binomial logistic regression. If there is a relationship between the categories of any variables or between the categories themselves, this means that the observations are related. Therefore, if your study fails this assumption, you will need to use another statistical test instead of binomial logistic regression; possibly linear mixed models or Generalized Estimating Equations (GEE) (you can use our Statistical Test Selector to find the appropriate statistical test).

Assumption #4: You should have a bare minimum of 15 cases per independent variable, although some recommend as many as 50 cases per independent variable. As with other multivariate techniques, such as multiple regression, there are a number of recommendations regarding minimum sample size. Binomial logistic regression relies on maximum likelihood estimation (MLE), and the reliability of the estimates declines significantly for combinations of categories that contain few cases.
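
The rule of thumb above is simple arithmetic, sketched below. The function names and the sample figures are hypothetical; the thresholds of 15 and 50 come from the recommendation stated above.

```python
def cases_per_variable(n_cases, n_independent_variables):
    """Average number of cases available per independent variable."""
    return n_cases / n_independent_variables


def meets_minimum(n_cases, n_independent_variables, minimum=15):
    """True if the sample satisfies the chosen cases-per-variable rule."""
    return cases_per_variable(n_cases, n_independent_variables) >= minimum


# Hypothetical study: 100 cases and 5 independent variables.
ratio = cases_per_variable(100, 5)
print(ratio)
print(meets_minimum(100, 5))              # against the bare minimum of 15
print(meets_minimum(100, 5, minimum=50))  # against the stricter figure of 50
```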

Assumptions #5, #6 and #7: As will be discussed further in our Assumptions I section, a binomial logistic regression must also meet three assumptions that relate to how your data fits the binomial logistic regression model in order to provide a valid result: (a) there should be a linear relationship between the continuous independent variables and the logit transformation of the dependent variable; (b) there should be no multicollinearity; and (c) there should be no significant outliers, leverage or influential points. Since these are assumptions that you can test using SPSS Statistics, we show you how to do this in the Assumptions I section later.

Assumption #5
There needs to be a linear relationship between the continuous independent variables and the logit transformation of the dependent variable.

The assumption of linearity in a binomial logistic regression requires that there is a linear relationship between the continuous independent variables, age, weight and VO2max, and the logit transformation of the dependent variable, heart_disease.

There are a number of methods to test for a linear relationship between the continuous independent variables and the logit of the dependent variable. In this guide, we use the Box-Tidwell approach, which adds interaction terms between the continuous independent variables and their natural logs to the regression equation. We will: (a) use the Binary Logistic procedure in SPSS Statistics to test this assumption; (b) interpret and report the results from this test; and (c) proceed with our analysis depending on whether we have met or violated this assumption.
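
The extra terms used by the Box-Tidwell approach are straightforward to construct: for each continuous independent variable x, the term is x multiplied by ln(x). The sketch below only builds that term; the helper name box_tidwell_term and the age values are hypothetical. The term would then be entered into the model alongside the original variable, and a statistically significant coefficient on it is usually read as evidence against linearity in the logit.

```python
import math


def box_tidwell_term(values):
    """Return x * ln(x) for each value.

    The natural log is undefined for zero or negative numbers, so the
    variable must be strictly positive (or shifted before use).
    """
    if any(value <= 0 for value in values):
        raise ValueError("Box-Tidwell terms need strictly positive values")
    return [value * math.log(value) for value in values]


# Hypothetical ages (in years) for three participants.
age = [45.0, 52.0, 61.0]
age_by_ln_age = box_tidwell_term(age)
print(age_by_ln_age)
```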

Assumption #6
Your data must not show multicollinearity

Multicollinearity occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a binomial logistic regression model.

You can detect multicollinearity through an inspection of correlation coefficients and Tolerance/VIF values, which will inform you whether your data meets or violates this assumption.
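
To show how these quantities relate, here is a minimal sketch for the simplest case of exactly two independent variables, where Tolerance is 1 minus the squared correlation between them and VIF is the reciprocal of Tolerance. With more than two independent variables, the squared correlation is replaced by the R Square from regressing each independent variable on all of the others, which this sketch does not do. The function names and the data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(variation_x * variation_y)


def tolerance_and_vif(x, y):
    """Tolerance and VIF for a model with exactly two independent variables.

    Perfectly correlated variables give a Tolerance of 0, so the VIF is
    undefined and this function raises ZeroDivisionError in that case.
    """
    r = pearson_r(x, y)
    tolerance = 1 - r ** 2
    return tolerance, 1 / tolerance


# Hypothetical measurements for five participants.
weight = [60.0, 72.0, 80.0, 95.0, 68.0]
vo2max = [50.0, 42.0, 40.0, 30.0, 47.0]
tolerance, vif = tolerance_and_vif(weight, vo2max)
print(pearson_r(weight, vo2max), tolerance, vif)
```

A low Tolerance (and correspondingly high VIF) indicates that one independent variable is largely predictable from the other.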

Note: We will be adding a section to the guide to show how to test for multicollinearity. If you would like to know when this becomes available, please contact us.

Assumption #7
There should be no significant outliers, high leverage points or highly influential points

Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a binomial logistic regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variables. This can change the output that SPSS Statistics produces and reduce the predictive accuracy of your results as well as the statistical significance. Fortunately, when using SPSS Statistics to run binomial logistic regression on your data, you can detect possible outliers, high leverage points and highly influential points.
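
One common way to look for poorly fitted cases is the standardized (Pearson) residual, which compares the observed 0/1 outcome with the predicted probability. The sketch below is an illustration only: the cutoff of 2.5 is an assumed value chosen for the example, the data are hypothetical, and this is not necessarily the exact statistic or threshold that SPSS Statistics uses in its output.

```python
import math


def pearson_residual(observed, predicted_probability):
    """Standardized residual for one case with a 0/1 observed outcome."""
    p = predicted_probability
    return (observed - p) / math.sqrt(p * (1 - p))


def flag_unusual_cases(observed, probabilities, cutoff=2.5):
    """Return the positions of cases whose residual exceeds the cutoff."""
    return [
        index
        for index, (y, p) in enumerate(zip(observed, probabilities))
        if abs(pearson_residual(y, p)) > cutoff
    ]


# Hypothetical observed outcomes and model-predicted probabilities.
observed = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.05, 0.5]
flagged = flag_unusual_cases(observed, probabilities)
print(flagged)
```

The third case is flagged because the event occurred even though the model gave it a probability of only 0.05.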

Interpreting Results

After running the binomial logistic regression procedures and testing that your data meets the assumptions of a binomial logistic regression in the previous sections, SPSS Statistics will have generated a number of tables that contain all the information you need to report the results of your binomial logistic regression. We show you how to interpret these results.

There are two main objectives that you can achieve with the output from a binomial logistic regression: (a) determine which of your independent variables (if any) have a statistically significant effect on your dependent variable; and (b) determine how well your binomial logistic regression model predicts the dependent variable. Both of these objectives will be answered in the following sections:

Data coding: You can start your analysis by inspecting your variables and data, including: (a) checking if any cases are missing and whether you have the number of cases you expect (the “Case Processing Summary” table); (b) making sure that the correct coding was used for the dependent variable (the “Dependent Variable Encoding” table); and (c) determining whether there are any categories amongst your categorical independent variables with very low counts – a situation that is undesirable for binomial logistic regression (the “Categorical Variables Codings” table).

Baseline analysis: Next, you can consult the “Classification Table”, “Variables in the Equation” and “Variables not in the Equation” tables. These all relate to the situation where no independent variables have been added to the model and the model just includes the constant. As such, you are interested in this information only as a comparison to the model with all the independent variables added. This Baseline analysis section provides a basis against which the main binomial logistic regression analysis with all independent variables added to the equation can be evaluated.
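
With only the constant in the model, every case is predicted to fall into the most frequent category of the dependent variable, so the baseline classification accuracy is simply the share of cases in that category. A minimal sketch with hypothetical counts:

```python
from collections import Counter


def baseline_accuracy(outcomes):
    """Most frequent category and the proportion of cases it contains."""
    counts = Counter(outcomes)
    category, count = counts.most_common(1)[0]
    return category, count / len(outcomes)


# Hypothetical sample: 65 cases without heart disease and 35 with it.
outcomes = ["no"] * 65 + ["yes"] * 35
category, accuracy = baseline_accuracy(outcomes)
print(category, accuracy)
```

The full model is worth having only if it classifies cases noticeably better than this figure.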

Binomial logistic regression results: In evaluating the main logistic regression results, you can start by determining the overall statistical significance of the model (namely, how well the model predicts categories compared to no independent variables). You can also assess the adequacy of the model by analyzing how poor the model is at predicting the categorical outcomes using the Hosmer and Lemeshow goodness of fit test. Next, you can consult the Cox & Snell R Square and Nagelkerke R Square values to understand how much variation in the dependent variable can be explained by the model (i.e., these are two methods of calculating the explained variation), but it is preferable to report the Nagelkerke R Square value. This is illustrated in the Variance explained section.
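
For readers who want to see how the two explained-variation measures relate, the sketch below computes them from the log-likelihood of the constant-only model and of the full model, using the standard definitions. The function names and the log-likelihood values are hypothetical. If your output reports a -2 Log likelihood value, divide it by -2 to obtain the log-likelihood used here.

```python
import math


def cox_snell_r2(ll_null, ll_model, n):
    """Cox & Snell R Square from two log-likelihoods and the sample size."""
    return 1 - math.exp(2 * (ll_null - ll_model) / n)


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R Square: Cox & Snell rescaled by its maximum value."""
    maximum = 1 - math.exp(2 * ll_null / n)
    return cox_snell_r2(ll_null, ll_model, n) / maximum


# Hypothetical log-likelihoods for a sample of 100 cases.
ll_null, ll_model, n = -60.0, -45.0, 100
print(cox_snell_r2(ll_null, ll_model, n))
print(nagelkerke_r2(ll_null, ll_model, n))
```

Cox & Snell R Square cannot reach 1 even for a perfect model; Nagelkerke R Square divides by that ceiling so the result spans the full 0 to 1 range, which is one reason it is often preferred for reporting.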

Category prediction: After determining model fit and explained variation, it is very common to use binomial logistic regression to predict whether cases can be correctly classified (i.e., predicted) from the independent variables. Logistic regression estimates the probability of an event (in this case, having heart disease) occurring. If the estimated probability of the event occurring is greater than or equal to 0.5 (better than even chance), SPSS Statistics classifies the event as occurring (e.g., heart disease being present). If the probability is less than 0.5, SPSS Statistics classifies the event as not occurring (e.g., no heart disease).
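
The classification rule described above can be sketched as follows. The function names, the probabilities and the observed outcomes are hypothetical; the 0.5 cutoff is the one stated above.

```python
def classify(probabilities, cutoff=0.5):
    """Code a case as 1 (event occurs) if its probability is >= cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]


def percentage_correct(observed, predicted):
    """Percentage of cases whose predicted category matches the observed."""
    matches = sum(1 for y, y_hat in zip(observed, predicted) if y == y_hat)
    return 100 * matches / len(observed)


# Hypothetical predicted probabilities of heart disease for four cases.
probabilities = [0.82, 0.50, 0.49, 0.10]
observed = [1, 0, 0, 0]
predicted = classify(probabilities)
print(predicted)
print(percentage_correct(observed, predicted))
```

Note that a probability of exactly 0.5 is classified as the event occurring, which matches the "greater than or equal to" wording of the rule.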

Variables in the equation: We can assess the contribution of each independent variable to the model and its statistical significance using the Variables in the Equation table. We will also be able to use the odds ratios of each of the independent variables (along with their confidence intervals) to understand the change in the odds for each one-unit increase in the independent variable. Using these odds ratios, we will be able to, for example, make statements such as: “the odds of having heart disease are 7.026 times greater for males as opposed to females”. You can make such predictions for categorical and continuous independent variables.
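
An odds ratio is the exponential of the corresponding regression coefficient, and an approximate 95% confidence interval can be obtained by exponentiating the coefficient plus or minus 1.96 standard errors. The sketch below illustrates this under stated assumptions: the coefficient is back-calculated from the 7.026 figure quoted above purely for illustration, the standard error of 0.8 is hypothetical, and the function name is our own.

```python
import math


def odds_ratio_with_ci(coefficient, standard_error, z=1.96):
    """Odds ratio and approximate confidence limits from a coefficient."""
    odds_ratio = math.exp(coefficient)
    lower = math.exp(coefficient - z * standard_error)
    upper = math.exp(coefficient + z * standard_error)
    return odds_ratio, lower, upper


# Illustrative coefficient chosen so that the odds ratio equals 7.026;
# the standard error is a made-up value, not a result from real data.
coefficient = math.log(7.026)
odds_ratio, lower, upper = odds_ratio_with_ci(coefficient, standard_error=0.8)
print(odds_ratio, lower, upper)
```

A confidence interval that excludes 1 corresponds to an independent variable whose effect on the odds is statistically significant at the matching level.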