Analysing Categorical Data Using Logistic Regression Models

In a recent post we introduced some basic techniques for summarising and analysing categorical survey data using diverging stacked bar charts, contingency tables and Pearson’s Chi-squared tests. Whilst these methods are a great way to start exploring your categorical data, to investigate them fully we can apply a more formal approach using generalised linear models.

When analysing a continuous response variable we would normally use a simple linear regression model to explore possible relationships with other explanatory variables. We might, for example, investigate the relationship between a response variable, such as a person’s weight, and explanatory variables such as their height and gender.

“Logistic regression and multinomial regression models are specifically designed for analysing binary and categorical response variables.”

When the response variable is binary or categorical a standard linear regression model can’t be used, but we can use logistic regression models instead. These alternative regression models are specifically designed for analysing binary (e.g., yes/no) or categorical (e.g., Full-time/Part-time/Retired/Unemployed) response variables. Similar to linear regression models, logistic regression models can accommodate continuous and/or categorical explanatory variables as well as interaction terms to investigate potential combined effects of the explanatory variables (see our recent blog on Key Driver Analysis for more information).

Binary logistic regression

Logistic regression models for binary response variables allow us to estimate the probability of the outcome (e.g., yes vs. no) based on the values of the explanatory variables. Rather than modelling this probability directly as a function of the explanatory variables, we apply the logit function, logit(p) = ln(p/(1-p)), where p is the probability of the outcome occurring. This gives the log odds of the outcome, which we then model as a linear combination of the explanatory variables. As with standard linear regression analyses, the model coefficients can then be interpreted to understand the direction and strength of the relationships between the explanatory variables and the response variable.
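To make the transformation concrete, here’s a quick illustration in R (the same package we use to fit the models below) of the logit and its inverse:

```r
# logit(p) = ln(p / (1 - p)): map a probability to the log-odds scale
p <- 0.8
log_odds <- log(p / (1 - p))   # = log(4), approximately 1.386
qlogis(p)                      # R's built-in logit gives the same value
plogis(log_odds)               # the inverse logit recovers p = 0.8
```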

Suppose, for example, that we are interested in how likely a student is to be offered a place on a post-graduate course. We consider the potential effects of the student’s mark on the course’s admissions exam (EXAM), their academic grading from their undergraduate degree (GRAD) and the prestige of their undergraduate institution (RANK, taking values from 1 to 4). We collect data from 400 students applying to graduate school and record whether or not they were admitted onto the course – so our response variable is binary (admit/not admit). (These data are available from the UCLA Institute for Digital Research and Education via the following link: https://stats.idre.ucla.edu/stat/data/binary.csv.)

Running the logistic regression model (for example, using the statistical software package R – see the sketch following the list below), we obtain p-values for each explanatory variable and find that all three are statistically significant (at the 5% significance level). So there’s evidence that each of these has an independent effect on the probability of a student being admitted (rather than just a difference observed due to chance). But what are these effects – are they positive or negative and how strong are they? We need to look at the coefficients estimated by the model to understand this and find, for example, that:

  • For every one-unit increase in EXAM, the log odds of admission (vs. non-admission) increase by 0.00226.
  • Attending an undergraduate institution with a rank of 2, compared to an institution with a rank of 1, changes the log odds of admission by -0.675.
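As a minimal sketch of how this model can be fitted in R (assuming the UCLA dataset linked above, whose columns are named admit, gre, gpa and rank, corresponding to our response variable, EXAM, GRAD and RANK):

```r
dat <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
dat$rank <- factor(dat$rank)  # treat institution rank (1-4) as categorical
fit <- glm(admit ~ gre + gpa + rank, data = dat, family = binomial)
summary(fit)                  # coefficients on the log-odds scale, with p-values
```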

We can also exponentiate the coefficients and interpret them as odds ratios. This is the most common way of measuring the association between each explanatory variable and the outcome when using logistic regression. For the undergraduate institution rank above, the odds ratio for “if Rank=2” represents the odds of admission for an institution with Rank=2 compared to the odds of admission for an institution with Rank=1. The estimated odds ratio is exp(-0.675) = 0.509, which means that the odds of admission having attended a Rank=2 institution are 0.509 times the odds having attended a Rank=1 institution (or, equivalently, 49% [= (1 - 0.509) x 100] lower). In other words, if the odds of a Rank=1 candidate being admitted are 1 to 10 (i.e., p=1/11 and 1-p=10/11), the odds of a Rank=2 candidate being admitted are about half as good, or about 1 to 20 (i.e., p=1/21 and 1-p=20/21). So, for every Rank=2 applicant who is admitted, twenty Rank=2 candidates will be rejected, but for every Rank=1 applicant who is admitted, only ten Rank=1 candidates will be rejected.
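Continuing the R sketch above, the odds ratios are obtained by exponentiating the fitted coefficients:

```r
exp(coef(fit))   # e.g. exp(-0.675) is approximately 0.509 for rank 2 vs. rank 1
```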

Odds ratios can also be provided for continuous variables, in which case the odds ratio summarises the change in the odds per unit increase in the explanatory variable. For example, looking at the effect of GRAD above, the odds ratio (exp(0.804) = 2.23) says how the odds change per grade point – i.e., the odds of admission are 2.23 times higher per additional point in this case. It’s important to note that, for continuous explanatory variables, the effect on the probability (as opposed to the odds) of the outcome is not constant across all values of the explanatory variable. Due to the logit transformation, the effect will be smaller for very low or very high values of the explanatory variable, and much larger for those in the middle.
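This sketch (again assuming the fitted model, fit, from above) illustrates the non-constant effect on the probability scale: predicted admission probabilities as GRAD varies, with EXAM held at its mean and RANK fixed at 1 (a hypothetical scenario grid, not new data):

```r
grid <- data.frame(gre  = mean(dat$gre),
                   gpa  = seq(2, 4, by = 0.5),
                   rank = factor(1, levels = 1:4))
predict(fit, newdata = grid, type = "response")  # probability step per grade point varies
```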

We can also calculate a confidence interval to capture our uncertainty in the odds ratio estimate, and we’ve put together an online odds ratio confidence interval calculator that you can use to do exactly this (you just need to enter your data from a contingency table). For the GRAD variable above, the 95% confidence interval for the odds ratio (estimated to be 2.23) is 1.17 to 4.32, so we’re 95% confident that this range covers the true odds ratio (if the study were repeated and the range calculated each time, we would expect the true value to lie within these ranges on 95% of occasions).
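In R, one way to compute these intervals directly from the fitted model is to exponentiate Wald confidence limits for the coefficients (a sketch; profile-likelihood intervals via confint() will differ slightly):

```r
exp(cbind(OR = coef(fit), confint.default(fit)))  # odds ratios with 95% Wald CIs
```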

A key advantage of this modelling approach is that we can analyse the data all in one go, rather than splitting them into subgroups and performing multiple tests (as in a CHAID analysis, for example), which, with the reduced sample sizes, will have less statistical power. See our recent blog for further information on the importance and effect of sample size. By including all of the potential explanatory variables in one model, we can see which make up the most informative combination of predictors for the outcome.

Categorical logistic regression

All of the above (binary logistic regression modelling) can be extended to categorical outcomes (e.g., blood type: A, B, AB or O) using multinomial logistic regression. The principles are very similar, the key difference being that one category of the response variable must be chosen as the reference category. Separate odds ratios are determined for each explanatory variable for every category of the response variable except the reference category. These odds ratios represent the change in the odds of the outcome being in a particular category, versus the reference category, for differing levels of the corresponding explanatory variable.
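A minimal multinomial sketch, using the nnet package and R’s built-in iris data as a stand-in categorical outcome (not the admissions example above):

```r
library(nnet)
iris$Species <- relevel(iris$Species, ref = "setosa")  # choose the reference category
mfit <- multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
exp(coef(mfit))  # one row of odds ratios per non-reference category vs. setosa
```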

There are also extensions of the logistic regression model for when the categorical outcome has a natural ordering (we call this ‘ordinal’ data, as opposed to ‘nominal’ data). For example, the outcome might be the response to a survey question where the answer could be “poor”, “average”, “good”, “very good” or “excellent”. In this case we use ordered logistic regression modelling, and we can explore whether the odds of being in a ‘higher’ category are associated with each of our explanatory variables.
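As a sketch, an ordered logistic regression can be fitted in R with MASS::polr – here using the housing data shipped with MASS, where residents’ satisfaction is recorded as Low < Medium < High:

```r
library(MASS)
ofit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing, Hess = TRUE)
summary(ofit)    # coefficients on the log-odds scale
exp(coef(ofit))  # odds ratios for being in a 'higher' satisfaction category
```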

What-if? scenarios

These logistic regression models can also be used to predict the probability of an outcome for particular cases. We can input the values of the explanatory variables (into the formula generated by the model) for a range of possible scenarios and obtain the predicted odds or probability of the outcome in each case.
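For example, a what-if sketch in R (assuming the fitted admissions model, fit, from earlier; the two applicants are hypothetical):

```r
scenarios <- data.frame(gre  = c(700, 550),
                        gpa  = c(3.8, 3.0),
                        rank = factor(c(1, 3), levels = 1:4))
predict(fit, newdata = scenarios, type = "response")  # predicted admission probabilities
```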

The model can be implemented within a tool, for example in Microsoft Excel or as a web app (see our recent post on Interacting with Your Data). This allows a range of predictions to be made and visualised easily. Confidence intervals can also be provided with each prediction to quantify the associated uncertainty – giving a range in which we are confident the true probability lies and allowing the user to consider best- and worst-case scenarios.
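One way to sketch these intervals in R is to compute approximate 95% limits on the log-odds scale and back-transform them (assuming fit and scenarios from the sketches above):

```r
pred <- predict(fit, newdata = scenarios, type = "link", se.fit = TRUE)
data.frame(prob  = plogis(pred$fit),                      # point predictions
           lower = plogis(pred$fit - 1.96 * pred$se.fit), # lower 95% limit
           upper = plogis(pred$fit + 1.96 * pred$se.fit)) # upper 95% limit
```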

Logistic regression models are a great tool for analysing binary and categorical data, allowing you to perform a contextual analysis to understand the relationships between the variables, test for differences, estimate effects, make predictions, and plan for future scenarios. For a real-world example of the value of logistic regression modelling, see our case study on developing a medical decision tool using binary logistic regression to help inform the assessment of whether to extubate intensive care patients.

Logistic regression models are also great tools for classification problems – take a look at our blog on Classifying Binary Outcomes to find out more.