The post Will it be Turkey this Christmas? appeared first on Select Statistical Consultants.
Today, it seems like everyone has turkey at Christmas, but what about the rest of the year? If we look at the data available from the Department for Environment, Food and Rural Affairs for the number of turkeys slaughtered per year, we can see a definite spike around December.
From Figure 1 we can see that over the last 20 or so years, nearly 1 million more turkeys have been slaughtered around December each year compared to May. The British people tend to enjoy turkey as a seasonal meat – we can see that turkey numbers increase throughout September, October and November before peaking in December. Think about all the Christmas-related food sold around this season – turkey and stuffing sandwiches are not only sold during December!
Since turkeys were originally an American import, how does the UK annual pattern compare to that of the US? Americans have less of a specific tradition around Christmas dinner – “Turkey day” for them refers to Thanksgiving, which takes place on the fourth Thursday of November.
Looking at data available from the United States Department of Agriculture we can see these trends reflected. Figure 2 below shows UK and US turkey production in pounds, estimated for the UK using an average turkey weight of 14 pounds, and scaled by the population of each country, so that the graph tells us how many pounds of turkey are produced each month per person.
There are two interesting things to note from this plot. Firstly, the smoothed average line shows an increase in the USA’s production of turkeys in October and November, in the lead-up to Thanksgiving, compared to the rest of the year. Interestingly, this is followed by a drop in December – the opposite of what happens in the UK. Secondly, there is much less variation in turkey production over the whole year compared to the UK, implying that Americans tend to eat turkey at a reasonably steady rate all year long, and not just as a special treat during the holidays.
We can also look at whether turkey consumption has changed over time by plotting the annual total of turkeys slaughtered in pounds for both countries over the last 24 years (Figure 3 below). It is noticeable from the two previous graphs that there is a lot of variation across years, particularly in the UK.
Since the beginning of the century, it appears that UK turkey consumption has been steadily decreasing – it nearly halved between 1995 and 2007 – while in the US there has been a slow increase. Could this be due to turkey being replaced by other meats, fish or vegetarian alternatives at Christmas? And what about you – what are you having for Christmas dinner?
The post Select’s Startistical Christmas Social appeared first on Select Statistical Consultants.
Nestled in a cosy restaurant in a picturesque village, we were treated to an expert workshop to create fairy-light-studded, willow star decorations. You could almost hear the cogs whirring as Select’s brains switched gear from the statistical to the startistical. The cake-fuelled tranquillity was occasionally pierced by flailing willow branches – but thankfully all survived unharmed.
We all had a fantastically relaxed afternoon, aided in no small part by generous hosting and delicious food. The delightful results hang proudly in the homes of Select, and are a lovely reminder of an afternoon very merrily spent.
We would like to heartily thank Victoria Westaway (http://www.victoriawestaway.co.uk/) for showing us her considerable skills and kindly complimenting our sub-professional attempts; and Vitamin Sea Restaurant (https://vitaminsea.saltydogdevon.com/) for their tasty food and charming hosting. We enthusiastically recommend both to anyone in the Exeter area.
The post Making the Most of Budgets for Public Health Interventions appeared first on Select Statistical Consultants.
The National Institute for Health and Care Excellence (NICE) is responsible for assessing new medicines, medical technologies and diagnostics to identify the most clinically and cost-effective treatments available. This helps to ensure that those products which offer the best value for patients are adopted for use by the NHS and in public health programmes implemented by local government. NICE therefore needs to evaluate the trade-off between how well a new treatment works and how much it will cost.
One method of working out how effective a treatment is, is to look at the average change in life expectancy for the person to whom it was given. However, this does not give the full picture; the quality of life of the patient should also be taken into account, i.e. factoring in their ability to carry out daily activities, freedom from pain and mental anguish. All of this information can be combined into one metric, frequently used in health economics, called a quality-adjusted life year (QALY). This is a measure of the life expectancy of a patient, weighted by a quality of life score (on a scale from 0 to 1) over each year. For example, one QALY could represent either 12 months at ‘perfect health’ (a quality of life score of 1), or 24 months at ‘50% health’ (a quality of life score of 0.5). These scores are routinely calculated from questionnaire responses where patients are asked to rate aspects of their quality of life including mobility, ability to self-care and anxiety/depression.
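As a minimal sketch of the arithmetic behind a QALY (not NICE’s full methodology – the function name and inputs here are just for illustration), the two examples above can be reproduced in a few lines of Python:

```python
def qaly(quality_scores, years_per_period=1.0):
    """Sum quality-of-life scores (each on a 0-1 scale), weighted by
    the length of the period each score covers, to give QALYs."""
    return sum(score * years_per_period for score in quality_scores)

# 12 months at 'perfect health' (a score of 1) is one QALY...
print(qaly([1.0]))        # 1.0
# ...as is 24 months at '50% health' (two one-year periods at 0.5)
print(qaly([0.5, 0.5]))   # 1.0
```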
Generally, QALYs are calculated using a naive method. This involves plotting quality of life scores over time and measuring the area under the curve (AUC) individually for each subject included in the study. This fragmented approach does not apply statistical modelling to bring the data together and make inferences across the whole dataset. This leads to limitations in two main areas:
These limitations associated with the standard approach to calculating QALYs motivate the use of a more rigorous statistical approach which will enable us to calculate more accurate estimates, more efficiently.
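To make the naive AUC method concrete, here is a minimal sketch (the measurement times and scores are invented): each subject’s quality-of-life curve is integrated separately with the trapezoidal rule, with no statistical model linking subjects.

```python
def qaly_auc(times, scores):
    """Naive per-subject QALY estimate: trapezoidal area under the
    quality-of-life curve (times in years, scores on a 0-1 scale)."""
    area = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        area += 0.5 * (scores[i - 1] + scores[i]) * width
    return area

# One subject measured at baseline, one year and two years
print(qaly_auc([0.0, 1.0, 2.0], [1.0, 0.8, 0.6]))  # approximately 1.6
```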
Joint longitudinal-survival modelling can be applied to data including life expectancy and quality of life information to help obtain improved QALY estimates. Joint modelling is a relatively recent statistical development that essentially combines two established types of model: mixed effects models and survival models. Mixed effects models are appropriate for analysing longitudinal data, accounting for repeated observations collected for the same subject over time. Survival models are designed to analyse time-to-event data, such as survival times. They account for censoring where, for example, we may only know that the time to the event is greater than the current number of days of follow-up. A joint longitudinal-survival model is designed to analyse datasets that include both of these types of data, allowing inferences to be made about survival and quality of life trends over time from a single model.
Data collected to calculate QALYs fit this scenario. For each patient, we have survival times with censoring information, and repeated measurements over time recording the quality of life scores the patients gave. We can combine both types of data into one joint model which can examine the association between patients’ survival and how good their quality of life is, as well as looking at overall trends in survival and quality of life and factors that can affect these two things separately. The joint model also reduces bias due to missing data, by sharing information across all of the subjects included in the study, rather than considering each patient individually (as with the AUC method).
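One way to picture this shared-information idea is a toy data-generating sketch (not the estimation procedure itself, and every parameter value below is invented): a single subject-level random effect drives both the quality-of-life trajectory and the survival time.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_subject(association=0.5, baseline_hazard=0.05):
    """Toy subject: one shared random effect (a 'frailty') links the
    longitudinal quality-of-life scores to the survival time."""
    frailty = rng.normal(0.0, 1.0)  # shared subject-level effect
    # Longitudinal part: yearly scores drifting down, lifted by frailty
    scores = np.clip(0.8 + 0.1 * frailty - 0.05 * np.arange(10)
                     + rng.normal(0.0, 0.05, 10), 0.0, 1.0)
    # Survival part: higher frailty (better health) lowers the hazard
    hazard = baseline_hazard * np.exp(-association * frailty)
    survival_time = rng.exponential(1.0 / hazard)
    censored = bool(survival_time > 10.0)  # ten years of follow-up
    return scores, min(survival_time, 10.0), censored

scores, time_observed, censored = simulate_subject()
print(len(scores), censored)
```

A fitted joint model works in the opposite direction: given many such subjects, it estimates the trajectory, the hazard and the association between them simultaneously.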
The final fitted model can then be used to estimate QALYs under different scenarios through simulation and taking expectations. For example, it might be used to compare the average QALY for a patient taking a new, experimental treatment with that for a patient taking the standard, currently available treatment. Combining this information with estimates of the costs of each treatment, these results can then be used to assess their cost-effectiveness and decide whether it is worth investing in a new treatment.
By applying a more rigorous statistical analysis, joint models can increase both the accuracy and the efficiency of calculating QALY estimates. This in turn can help to improve the service provided to the public by NICE and the NHS in two major ways.
Firstly, by accounting for the correlation between quality of life and life expectancy, joint modelling makes more efficient use of the available data. This may help to reduce the sample sizes required in studies contributing to life expectancy and quality of life estimates, thereby reducing costs and potentially shortening timelines. This will allow decisions to be made more quickly and cost-effectively.
Secondly, by reducing biases and therefore obtaining more accurate QALY estimates, NICE and the NHS can make better informed choices as to which treatments and technologies it would be most costeffective to spend taxpayers’ money on. This in turn will help improve overall resource management, maximising the health benefits provided for a given cost and leading to an improved service for the public.
The post Sample size calculation for complex designs appeared first on Select Statistical Consultants.
Controlled experimental trials are an essential part of the development of new production methods or treatments. Calculating the appropriate sample size to demonstrate the efficacy of a new production method or a new treatment is not always straightforward, as the practical aspects of the trial need to be considered.
For example, in an agricultural context, a trial might be set up in a production setting to measure performance indicators of farm animals (such as feed conversion ratio or daily weight gain) in order to compare a new formulation (which we will refer to as the treatment) with a control feed. These types of trials will often be constrained by both animal welfare considerations and ensuring that the trial conditions are as similar to real-life conditions as possible (e.g. minimum and maximum flock or herd sizes, limited numbers of pens within a barn, all-in all-out production systems, etc.).
These logistical constraints mean that it is generally unlikely that a standard experimental design, such as a fully randomised block design, can be used. Therefore, though estimates of the expected effects and of the sources and sizes of variation can be obtained from the literature or previous smaller-scale studies, there is generally no off-the-shelf sample size calculator or formula that can be applied.
Calculating the appropriate sample size for a trial is about getting the right balance between having a sufficient number of subjects to be statistically confident that the minimum desired effect size can be detected, and ensuring that the trial is logistically feasible (i.e. it does not need to be run for an excessively long time, or with far too many subjects).
In the absence of a “standard” sample size calculator, we can use a simulation-based approach. This consists of simulating a large number of datasets in which the number of animals (or individual plants or crops) and the number of blocks or other higher-level groupings (such as pens, herds, farms or fields) are varied across a range of values corresponding to likely practical designs.
As with any sample size calculation (or equivalently power analysis), this requires the use of sensible estimates of the expected effect size of the treatment compared to the control group as well as any other important contributors to the trial (e.g., the estimated variability between individual animals, or between groups of animals at the different levels of the design). These estimates can usually be obtained from previous studies, the literature or from expert panels.
The result is a number of simulations that each correspond to a specific sample size. For each simulation, we can fit a statistical model that, combined with repeated runs of the simulation, provides an estimate of the associated power to detect a statistically significant effect of the treatment of interest. The results of the simulations can then be visualised in the form of power curves where the estimated power to detect the desired effect is plotted against the range of sample sizes tested (an example of this is given in Figure 1).
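As a sketch of this simulation loop – for a hypothetical pen-randomised trial, analysed with a simple t-test on pen means rather than the fuller model a real trial would use, and with invented effect sizes and variances – the power estimation might look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def estimated_power(n_pens_per_arm, animals_per_pen, effect,
                    pen_sd=0.5, animal_sd=1.0, n_sims=500, alpha=0.05):
    """Simulate many trials and return the proportion in which the
    treatment effect is detected at the given significance level."""
    def simulate_arm(mean):
        # Pen-level random effects, then animals nested within pens
        pen_means = rng.normal(mean, pen_sd, n_pens_per_arm)
        animals = rng.normal(pen_means[:, None], animal_sd,
                             (n_pens_per_arm, animals_per_pen))
        return animals.mean(axis=1)  # analyse at the pen level

    detections = 0
    for _ in range(n_sims):
        control = simulate_arm(0.0)
        treated = simulate_arm(effect)
        if stats.ttest_ind(control, treated).pvalue < alpha:
            detections += 1
    return detections / n_sims

# A crude power curve: more pens per arm gives more power
for pens in (5, 10, 20):
    print(pens, estimated_power(pens, animals_per_pen=30, effect=0.6))
```

Plotting these estimates against sample size, for several candidate effect sizes, produces power curves of the kind shown in Figure 1.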
This type of simulation approach gives us added flexibility, as it also allows us to test a number of different scenarios. For example, for a range of possible sample sizes, we can investigate how varying the expected size of the treatment effect affects the estimated power to detect a significant effect (demonstrated by the different coloured lines in Figure 1).
The results of the simulations in Figure 1 indicate that as the sample size increases, so too does the estimated power to detect an effect. However, the power also varies considerably across the different effect sizes. For example, it is clear that whilst a sample size of 400 individual animals is more than adequate to detect a large effect size if it existed (in fact this effect could be detected with a power of over 80% with a sample size of only 120), this sample size would not be sufficient to detect a small effect size. For a medium effect size, we see that a sample size of 400 individual animals results in a power of 65%, which is equivalent to a 35% chance of failing to detect a statistically significant improvement if one were to exist. Therefore, even for a medium effect, it is likely that a larger sample size would be needed.
Note that this approach could also be carried out on other key variables (not just different possible effect sizes) such as the impact of different amounts of variability between individual subjects (or between groups of subjects) or any other confounding factor that should be controlled for during the analysis of the trial data.
Whichever final design you choose for your experiment, a simulationbased approach to calculating your sample size ensures that you will have the necessary power to detect a meaningful effect of your treatment with the level of confidence you require, whilst meeting the practical constraints of the trial.
Without having to run a number of different and potentially expensive trials, you can explore a range of scenarios, changing trial conditions or likely treatment effects, to understand their impact and thus inform the final design choice. This further enables you to be confident that your chosen design, including the sample size, will be appropriate to demonstrate and support the aims of your study. While we illustrated the approach in an agricultural experiment context, this type of simulation approach is well suited to any study with a complex design, including nested classifications, such as customer surveys or new stock management processes in the retail or leisure industry, or intervention studies in the education sector.
An additional advantage of using a simulation-based approach to calculate the sample size for your experiment is that the statistical model that will be used for the data analysis needs to be specified up front. This means that once the data have been collected at the end of the trial, the analysis can be done more quickly, as the appropriate statistical model has already been developed.
The post Measuring Human Behaviours and Traits with Item Response Theory appeared first on Select Statistical Consultants.
Item response theory (IRT) is a statistical modelling technique used in the field of measurement. IRT assumes that there is a single underlying, latent trait, and that this latent trait influences how people respond to questions. As IRT is often used in education, the latent trait is often referred to as ability. Item response theory models the probability of a person correctly answering a test question (an item), given their ability. After running an IRT model we can obtain estimates for each person’s ability, and also for each item’s difficulty (some items are easier or harder than others). We would expect someone with a lower ability measure to get the easier items correct but get fewer or none of the more difficult items correct, and someone with higher ability to get more of the difficult items correct. While we tend to still talk in terms of ability and item difficulty, IRT can be used to measure other character traits, not just academic ability, with items or questions that measure these.
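For example, in the simplest IRT model – the Rasch, or one-parameter logistic, model – the probability of a correct answer depends only on the gap between the person’s ability and the item’s difficulty (the ability and difficulty values below are invented for illustration):

```python
import math

def p_correct(ability, difficulty):
    """Rasch (1-parameter logistic) model: the probability that a
    person of the given ability answers an item of the given
    difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# An easy item (difficulty -1) versus a hard item (difficulty +2)
for ability in (-1.0, 0.0, 2.0):
    print(ability,
          round(p_correct(ability, -1.0), 2),   # easy item
          round(p_correct(ability, 2.0), 2))    # hard item
```

Higher ability raises the probability of success on every item, but the hard item always remains less likely to be answered correctly than the easy one.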
As such, item response theory can be used to analyse surveys, since a commonly used survey question format is the Likert scale, where respondents are asked how much they agree with a series of statements, with response options ranging from ‘strongly disagree’ to ‘strongly agree’.
An example is the set of statements below, which were asked of Year 8 to Year 10 pupils in the NHS’s National Study of Health and Wellbeing Survey of children and young people in 2010.
Questions of this kind ask respondents about a number of components that are closely related and which assess different aspects of a single underlying trait, in this case resilience. People with high levels of resilience will tend to agree and strongly agree with these statements, while people with low levels of resilience will tend to disagree.
Using IRT we can measure respondents’ resilience and also their propensity to endorse statements like those above.
IRT puts the thresholds (the boundaries between “strongly agree” and “agree”, between “agree” and “neither”, etc.) for all of the statements onto the same scale, i.e. in terms of the respondents’ levels of resilience in this example. The figure below shows the thresholds for the 5 statements above. The coloured bands show at what point on the scale each response (from “strongly disagree” to “strongly agree”) is most likely, depending on someone’s level of resilience. You can see that these don’t all occur in the same place: the point at which ‘agree’ becomes more likely than ‘neither’ for “I try to stay positive” is lower down the scale than for “I am good at solving problems in my life“.
While a chart like the diverging stacked bar chart is useful for comparing levels of agreement and disagreement (as demonstrated in this previous blog: Analysing Categorical Survey Data), it positions neutrality in the same place and symmetrically for all statements. By positioning the statements on the same scale, i.e. in terms of resilience in this case, we can compare how easy each statement is to endorse. Of the five statements, pupils who responded to this survey found “I am a very determined person” one of the easiest statements to endorse: the threshold between agree and strongly agree is 0.7, whereas the equivalent threshold for the statement “I am good at solving problems in my life” is 1.9. This second statement is more difficult to endorse; pupils need a higher degree of resilience (scores higher than 1.9) before “strongly agree” becomes the most likely response.
To illustrate this further, take two example pupils: one with a resilience score of −2.8, the other with a resilience score of 1.3. These are marked on the diagram below with arrows.
Each pupil’s most likely response is indicated by the region in which their arrow lies. The pupil with the lower resilience score of −2.8 is most likely to “disagree” with the first and last statements, “I can usually think of lots of ways to solve a problem” and “I am good at solving problems in my life“; they are most likely to answer “neither” to “I try to stay positive” and “I am a very determined person“; and they are most likely to “strongly disagree” with the fourth statement, “I really believe in myself“.
The pupil with the higher resilience score is most likely to “agree” with the first and last statements, “I can usually think of lots of ways to solve a problem” and “I am good at solving problems in my life“; and most likely to “strongly agree” with the other three statements.
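The mapping from a latent resilience score to a most likely response can be sketched as follows; note that the threshold values used here are hypothetical stand-ins for the per-statement thresholds estimated from the survey:

```python
def most_likely_response(score, thresholds):
    """Return the most likely response category for a latent trait
    score, given the ordered thresholds between adjacent categories."""
    categories = ["strongly disagree", "disagree", "neither",
                  "agree", "strongly agree"]
    index = sum(score > t for t in thresholds)
    return categories[index]

# Hypothetical thresholds for a single statement
thresholds = [-2.0, -0.5, 0.7, 1.9]
print(most_likely_response(1.3, thresholds))   # agree
print(most_likely_response(-2.8, thresholds))  # strongly disagree
```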
While using pupils’ IRT scores to understand their most likely responses to the statements is a useful exercise, it is mostly illustrative (it can be useful for reporting, for example). IRT is itself a statistical model (closely related to confirmatory factor analysis) and, provided this model fits the response data well, the resulting IRT scores provide robust, continuous measures that can be used in further analyses.
IRT scores could be used to compare latent traits like ability or resilience between, say, groups of pupils. This would allow comparisons to be made between those who have received an intervention and those who haven’t, to assess its effectiveness. These scores could also be used in further statistical modelling to, for example, explore relationships between these attributes and other characteristics or outcomes.
The post Sally is Awarded Graduate Statistician Status appeared first on Select Statistical Consultants.
We’re pleased to announce that Sally has gained the designation of Graduate Statistician (GradStat). GradStat status, granted by the Royal Statistical Society (RSS), is a professional award formally recognising a member’s statistical qualifications. Sally recently achieved a distinction in her MSc in Statistics from Lancaster University, a degree accredited by the RSS. Accreditation signifies that the course meets the Society’s academic standards, measured against their qualifications framework, ensuring that an appropriate breadth and level of statistical knowledge and skills has been demonstrated at graduate and Master’s level.
As a GradStat member, Sally is required to abide by the Society’s code of conduct and to adhere to their comprehensive CPD policy. On obtaining GradStat, Sally is also now eligible to work towards becoming a Chartered Statistician, the highest professional award for a statistician which is also granted by the RSS. This award requires at least five years’ postgraduate experience as a professional statistician, in addition to an approved degree and training.
The post Women in Maths appeared first on Select Statistical Consultants.
Lynsey was pleased to be invited to give a presentation on her current role and her career to date following her PhD. The afternoon session where Lynsey spoke consisted of three speakers who had all undertaken a PhD in mathematics or statistics, but had chosen different career paths. The other two speakers had studied pure and applied maths; one was now a lecturer at the university and the other an analyst at GCHQ.
“I was really pleased to be asked to present at this event as I’m always keen to promote and inspire mathematics and statistics, particularly to young females early in their career,” said Lynsey. “It was a fantastically organised event with a great attendance. I particularly enjoyed listening to the other speakers to hear about the different interesting and vibrant careers that they had embarked on following their PhDs. It’s clear just how wide and varied the opportunities are after studying mathematics and statistics!”
The post Visualising Refugee and Asylum-Seeker Data appeared first on Select Statistical Consultants.
Looking at the most recent mid-2017 estimates, we find that the population of refugees in the UK at that time consisted of 112,698 refugees from 114 different countries of origin. (To calculate these numbers, we first removed two origin categories from the data – those refugees categorised as ‘Stateless’ (1,841) and those with ‘Various/Unknown’ origin (6,775).) We then ranked the population of refugees in terms of their country of origin and plotted the top 20 countries in the map below (Figure 1), where the colour of the country reflects the percentage of the total number of refugees currently residing in the UK to have come from that country. We have also provided a table of the countries of origin and the number/percentage of refugees. This map highlights that the largest proportion of refugees are from the Middle East, but there is also a relatively high proportion from Africa (particularly Eritrea). [Note that we did not map all 114 countries given that the top 20 account for almost 90% of all refugees currently in the UK, meaning mapping all countries on one scale masked some of the more interesting patterns.]
Country of Origin  Number/ Proportion of Refugees  Country of Origin  Number/ Proportion of Refugees 
Iran  15,438 (13.7%)  Albania  2,350 (2.1%) 
Eritrea  13,886 (12.3%)  Nigeria  2,090 (1.9%) 
Afghanistan  9,938 (8.8%)  China  1,676 (1.5%) 
Syria  8,758 (7.8%)  Libya  1,546 (1.4%) 
Zimbabwe  8,513 (7.6%)  Gambia  1,320 (1.2%) 
Sudan  7,885 (7.0%)  Ethiopia  1,297 (1.2%) 
Pakistan  7,003 (6.2%)  Dem. Rep. of Congo  1,205 (1.1%) 
Sri Lanka  5,829 (5.2%)  Uganda  1,143 (1.0%) 
Somalia  5,328 (4.7%)  Bangladesh  1,105 (1.0%) 
Iraq  3,947 (3.5%)  Myanmar  896 (0.8%) 
We can examine how these numbers compare to different countries in Europe. For example, we have plotted the same map, but for the number of refugees in Germany in Figure 2. Germany has over seven times as many refugees compared to the UK, with 831,264 refugees originating from 132 different countries. Whilst Germany also has a large proportion from the Middle East (which is unsurprising due to the recent conflicts in these areas), there are fewer African countries in the top 20, and Russia and some of the Eastern European countries feature. Whilst Syria makes up 7.8% of all refugees in the UK, Syrian refugees account for over 50% of all refugees in Germany which may be a reflection of a difference in immigration and refugee policy between the two countries since the Syrian conflict.
Country of Origin  Number/ Proportion of Refugees  Country of Origin  Number/ Proportion of Refugees 
Syria  458,871 (55.2%)  Sri Lanka  3,854 (0.5%) 
Iraq  118,497 (14.3%)  Ethiopia  3,573 (0.4%) 
Afghanistan  82,233 (9.9%)  Azerbaijan  2,864 (0.3%) 
Eritrea  41,254 (5.0%)  Nigeria  2,098 (0.3%) 
Iran  32,920 (4.0%)  Armenia  1,663 (0.2%) 
Turkey  19,378 (2.3%)  China  1,527 (0.2%) 
Somalia  14,980 (1.8%)  Dem. Rep. of Congo  1,493 (0.2%) 
Serbia and Kosovo  9,235 (1.1%)  Egypt  1,493 (0.2%) 
Russia  5,995 (0.7%)  Bosnia and Herzegovina  1,453 (0.2%) 
Pakistan  5,773 (0.7%)  Vietnam  1,287 (0.2%) 
As well as looking at a snapshot in time, we can also look at how the numbers and spatial patterns of refugees have changed over time, since annual data are available from 1988. Below we have produced an animation of the top 20 countries of origin for refugees in the UK since 1988, where the colour of the country reflects the total number of refugees residing in the UK to have come from that country.
This animation highlights how the number of refugees in the UK has changed over time. It seems that there was a steady increase until around 2005, followed by a reduction to the present number. We can confirm this pattern by plotting a time series of the overall number of refugees to have entered the UK in Figure 3 (note that this time series is plotted from 1951; we have total refugee population numbers between 1951 and 1987, but these are not broken down by country of origin). The longer time series also highlights that the number of refugees in the UK steadily decreased from the 1950s until the 1980s before rising once more. This rise in numbers and the geographic pattern of where refugees come from are likely to coincide with a number of recent conflicts such as those in Bosnia, Iraq, Afghanistan and Syria. We can see, for example, that from the early 2000s the three countries from which large numbers of refugees originate are Iran, Afghanistan and Somalia. Other historical events may be reflected in these numbers; for example, from 1992 to 1997 the Russian Federation is in the top 20 countries (the Soviet Union broke up in 1991).
In addition to looking at refugees, we have also analysed the number of asylum-seekers present in the UK. At the start of 2017, the UK had 43,597 asylum-seekers from 109 different countries. Plotting the same maps as those we created for the refugee population in the UK, we found similar spatial patterns. A potentially more interesting set of data is the information that the UNHCR provides on the number of decisions made during the first half of 2017. The UK processed 45,191 asylum applications by mid-2017, of which 30% were recognised, 57% were rejected and the remainder were otherwise closed. In Figure 4 we map the proportion of decisions made by country and in Figure 5 we provide the success rate for each country. We can see in Figure 4 that the largest numbers of decisions were made for asylum-seekers that originated from the Middle East, India, China and North East Africa. Countries with particularly high success rates include Syria and Tajikistan (both 70% and above). Countries with low success rates, particularly given the number of asylum-seekers present in the UK, include India and China.
Finally, we have also produced a circular plot of the flow of asylum-seekers in 2017, also known as a chord diagram. In Figure 6 below we have plotted the population of 2017 asylum-seekers, grouping each country into larger geographic regions. This plot shows the numbers of estimated asylum-seekers moving between the different regions, which is quantified by the width of the flow at both the region of origin and the region of asylum (given in 1000s). The colour of the flow can be used to identify where asylum-seekers are moving from and to, with there being a larger gap between the flow and the region that asylum-seekers are leaving. We can use this plot to better understand the population size of asylum-seekers in each region and the composition of the asylum-seeker population in terms of where they have come from and where they have travelled to.
From this figure, we can conclude the following:
In this blog we have looked at different ways of visualising the populations of both refugees and asylum-seekers using the UNHCR annual statistics. We’ve used a combination of maps, time series plots and a chord diagram to demonstrate how many more insights you can gain from a dataset when you use innovative visualisations. Not only have we discovered how recent conflicts have impacted the population of refugees within the UK, but we’ve also been able to better understand which countries’ asylum-seekers are more likely to have a successful application, as well as provide a whole-world snapshot of asylum-seekers in one simple plot.
The post The Select Team is Growing! appeared first on Select Statistical Consultants.
Both have spent the summer working on their dissertations. Sally looked at data from an NHS clinical trial to explore a new approach for estimating quality-adjusted life years (a metric bringing together information on both a person’s life expectancy and their quality of life during that time), which would improve on the accuracy of the current industry-standard techniques. Louise focused on machine learning techniques, investigating whether they could more accurately predict the risk of a patient developing Type 2 Diabetes compared to the standard statistical approach, as well as examining the viability of using this approach in medical practice.
“We’re delighted to welcome both Sally and Louise to Select,” says Managing Director, Lynsey McColl. “Having both just finished their Master’s, they are clearly enthusiastic and excited about using their statistical know-how and coding expertise on real-world client problems. I’m sure they’ll be a great asset to our consulting team.”
]]>The post Cumulative Gains and Lift Curves: Measuring the Performance of a Marketing Campaign appeared first on Select Statistical Consultants.
What returns will I get from running my marketing campaign?
In this context, we want to understand what benefit the predictive model can offer in predicting which customers will be responders versus non-responders in a new campaign (compared to targeting them at random). This can be achieved by examining the cumulative gains and lift associated with the model, comparing its performance in targeting responders with how successful we would be without the added value offered by the model. We can also use the same information to help decide how many pieces of direct mail to send, balancing the marketing costs with the expected returns from the resulting sales. There is a cost associated with each customer that you mail and therefore you want to maximise the number of respondents that you acquire for the number of mailings you send.
In this blog, we describe the steps required to calculate the cumulative gains and lift associated with a predictive classification model.
Continuing with the direct marketing example, using the fitted model we can compare the observed outcomes from the historical marketing campaign, i.e., who responded and who did not, with the predicted probabilities of responding for each customer contacted in that campaign. (Note that, in practice, we would fit the model to a subset of our data and use this model to predict the probability of responding for each customer in a “holdout” sample to get a more accurate assessment of how the model would perform for new customers.)
We first sort the customers by their predicted probabilities, in decreasing order from highest (closest to one) to lowest (closest to zero). Splitting the customers into equally sized segments, we create groups containing the same numbers of customers, for example, 10 decile groups each containing 10% of the customer base. So, those customers who we predict are most likely to respond are in decile group 1, the next most likely in decile group 2, and so on. Examining each of the decile groups, we can produce a decile summary, as shown in Table 1, summarising the numbers and proportions of customers and responders in each decile.
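The steps above can be sketched in code. The snippet below is an illustrative implementation assuming a pandas DataFrame with hypothetical columns `predicted_prob` and `responded` (not the actual data behind Table 1); it sorts by predicted probability, cuts the customers into deciles, and computes the response rate, cumulative gains and lift per group:

```python
import numpy as np
import pandas as pd

def decile_summary(df, prob_col="predicted_prob", resp_col="responded"):
    """Build a Table 1-style decile summary from predicted probabilities."""
    # Sort customers from highest to lowest predicted probability of responding
    df = df.sort_values(prob_col, ascending=False).reset_index(drop=True)
    # Split the sorted customers into 10 equally sized groups (decile 1 = best prospects)
    df["decile"] = pd.qcut(np.arange(len(df)), 10, labels=range(1, 11))
    summary = df.groupby("decile", observed=True).agg(
        customers=(resp_col, "size"),
        responders=(resp_col, "sum"),
    )
    summary["response_rate"] = summary["responders"] / summary["customers"]
    summary["cum_pct_customers"] = summary["customers"].cumsum() / summary["customers"].sum()
    summary["cum_pct_responders"] = summary["responders"].cumsum() / summary["responders"].sum()
    # Lift = cumulative % of responders reached / cumulative % of customers contacted
    summary["lift"] = summary["cum_pct_responders"] / summary["cum_pct_customers"]
    return summary
```

By construction the lift in the final decile is exactly 1, since by then we have contacted 100% of customers and reached 100% of responders.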
The historical data may show that overall, and therefore when mailing the customer base at random, approximately 5% of customers respond (506 out of 10,000 customers). So, if you mail 1,000 customers you expect to see around 50 responders. But, if we look at the response rates achieved in each of the decile groups in Table 1, we see that the top groups have a higher response rate than this, they are our best prospects.
Decile Group  Predicted Probability Range  Number of Customers  Cumulative No. of Customers  Cumulative % of Customers  Responders  Response Rate  Cumulative No. of Responders  Cumulative % of Responders  Lift 
1  0.129-1.000  1,000  1,000  10.0%  143  14.3%  143  28.3%  2.83 
2  0.105-0.129  1,000  2,000  20.0%  118  11.8%  261  51.6%  2.58 
3  0.073-0.105  1,000  3,000  30.0%  96  9.6%  357  70.6%  2.35 
4  0.040-0.073  1,000  4,000  40.0%  51  5.1%  408  80.6%  2.02 
5  0.025-0.040  1,000  5,000  50.0%  32  3.2%  440  87.0%  1.74 
6  0.018-0.025  1,000  6,000  60.0%  19  1.9%  459  90.7%  1.51 
7  0.015-0.018  1,000  7,000  70.0%  17  1.7%  476  94.1%  1.34 
8  0.012-0.015  1,000  8,000  80.0%  14  1.4%  490  96.8%  1.21 
9  0.006-0.012  1,000  9,000  90.0%  11  1.1%  501  99.0%  1.10 
10  0.000-0.006  1,000  10,000  100.0%  5  0.5%  506  100.0%  1.00 
For example, we find that in decile group 1 the response rate was 14.3% (143 responders out of 1,000 customers), compared with the overall response rate of 5.1%. We can also visualise the results from the decile summary in a waterfall plot, as shown in Figure 1. This illustrates that the customers in decile groups 1, 2 and 3 have a higher response rate than the overall average when targeted using the predictive model.
From the decile summary, we can also calculate the cumulative gains provided by the model. We compare the cumulative percentage of customers who are responders with the cumulative percentage of customers contacted in the marketing campaign across the groups. This describes the ‘gain’ in targeting a given percentage of the total number of customers using the highest modelled probabilities of responding, rather than targeting them at random.
For example, the top 10% of customers with the highest predicted probabilities (decile 1), contain approximately 28.3% of the responders (143/506). So, rather than capturing 10% of the responders, we have found 28.3% of the responders having mailed only 10% of the customer base. Including a further 10% of customers (deciles 1 and 2), we find that the top 20% of customers contain approximately 51.6% of the responders. These figures can be displayed in a cumulative gains chart, as shown in Figure 2.
The dashed line in Figure 2 corresponds with “no gain”, i.e., what we would expect to achieve by contacting customers at random. The closer the cumulative gains line is to the top-left corner of the chart, the greater the gain: the higher the proportion of the responders that are reached for the lower proportion of customers contacted.
Depending on the costs associated with sending each piece of direct mail and the expected revenue from each responder, the cumulative gains chart can be used to decide upon the optimum number of customers to contact. There will likely be a tipping point at which we have reached a sufficiently high proportion of responders, and where the costs of contacting a greater proportion of customers are too great given the diminishing returns. This will generally correspond with a flattening-off of the cumulative gains curve, where further contacts (corresponding with additional deciles) are not expected to provide many additional responders. In practice, rather than grouping customers into deciles, a larger number of groups could be examined, allowing greater flexibility in the proportion of customers we might consider contacting.
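As a toy illustration of this cost/returns trade-off, we can pick the contact depth that maximises expected profit from the cumulative counts in a decile summary. The mailing cost (£1) and revenue per responder (£20) below are invented for illustration, not taken from the example above:

```python
def best_contact_depth(cum_customers, cum_responders, cost_per_mail, revenue_per_responder):
    """Choose how many decile groups to mail so that expected profit is maximised.

    cum_customers / cum_responders: cumulative counts per decile, best prospects first.
    """
    profits = [resp * revenue_per_responder - cust * cost_per_mail
               for cust, resp in zip(cum_customers, cum_responders)]
    best = max(range(len(profits)), key=profits.__getitem__)
    return best + 1, profits[best]  # (number of deciles to contact, expected profit)

# Cumulative counts taken from Table 1
cum_customers = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000]
cum_responders = [143, 261, 357, 408, 440, 459, 476, 490, 501, 506]
depth, profit = best_contact_depth(cum_customers, cum_responders, 1.0, 20.0)
```

Under these invented figures the profit peaks at four deciles (an expected £4,160), after which the diminishing responder counts no longer cover the extra mailing costs.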
We can also look at the lift achieved by targeting increasing percentages of the customer base, ordered by decreasing probability. The lift is simply the ratio of the percentage of responders reached to the percentage of customers contacted.
So, a lift of 1 is equivalent to no gain compared with contacting customers at random. Whereas a lift of 2, for example, corresponds with there being twice the number of responders reached compared with the number you’d expect by contacting the same number of customers at random. So, we may have only contacted 40% of the customers, but we may have reached 80% of the responders in the customer base. Therefore, we have doubled the number of responders reached by targeting this group compared with mailing a random sample of customers.
These figures can be displayed in a lift curve, as shown in Figure 3. Ideally, we want the lift curve to extend as high as possible into the top-left corner of the figure, indicating that we have a large lift associated with contacting a small proportion of customers.
In a previous blog post we discussed how ROC curves can be used in assessing how good a model is at classifying (i.e., predicting an outcome). As well as understanding the predictive accuracy of a model used for classification, it can also be helpful to understand what benefit is offered by the model compared with trying to identify an outcome without it.
Cumulative gains and lift curves are a simple and useful approach to understand what returns you are likely to get from running a marketing campaign and how many customers you should contact, based on targeting the most promising customers using a predictive model. These approaches could similarly be applied in the context of predicting which individuals will default on a personal loan in order to decide who could be offered a credit card, for example. In this case, the aim is to minimise the number of people likely to default on the loan, whilst maximising the number of credit cards offered to those who will not default. The predictive model in each case could be any appropriate statistical approach for generating a probability for a binary outcome, be that a logistic regression model, a random forest, or a neural network, for example.
]]>The post Debunking the myth of a North/South divide in GCSE performance appeared first on Select Statistical Consultants.
Interestingly, his analysis, conducted on three annual cohorts of pupils, finds the same results as our single-cohort, school-level analysis reported in our recent blog: that differences in GCSE performance are not driven by a North/South divide, and that the factors affecting performance are, in fact, multifaceted and complex.
The article advises against relying on high-level statistics alone, and highlights the importance of undertaking in-depth analyses. In our analyses of the available data, we fitted a statistical model and found deprivation to be a driver of performance; areas with high levels of deprivation tended to have lower GCSE performance. Interestingly, Stephen Gorard’s research has delved into this a little deeper and shows that it is not just whether or not pupils are eligible for free school meals that affects attainment, but that a more important factor is the length of time that pupils have faced disadvantage. The article says that the current measure of deprivation, whether a child is eligible for free school meals or not, does not capture enough of the aspects of socioeconomic deprivation or disadvantage.
Of course, in the education sector and other fields that use observational studies, whilst we include as many of the influential and relevant factors as possible in any analysis, we must always be aware that analyses are often limited by the factors you can include or, more importantly, those you can’t. Many analyses of student outcomes can’t, for example, take account of factors such as motivation, the effect of inspirational teachers, or home resources, since these are not simple to measure.
Given past headlines stating the existence of a North/South educational divide, how do we know that an analysis has been conducted appropriately, and whether or not to believe a headline? In our experience, clear and honest reporting is vital: detailing not only the results, but also the methods, any assumptions and limitations. Being clear about what is and isn’t included in your analyses, and what they do and don’t tell you, enables others to appropriately evaluate the evidence themselves.
]]>The post Analysing outcomes with multiple categories appeared first on Select Statistical Consultants.
Suppose, for example, that instead of modelling the odds of a student being offered a place on a course, we wish to understand the choice a student makes between different types of high-school programmes (e.g. an academic, general or vocational course). Here our response variable is still categorical, but now there are three possible outcomes (academic, general or vocational) rather than two. To explore this, we have data available on the academic choices made by 200 US students. (These data are available from the UCLA Institute for Digital Research and Education using the following link: https://stats.idre.ucla.edu/stat/data/hsbdemo.dta).
In addition to the actual programme choice made by the student, the dataset also contains information on other factors that could potentially influence their choices such as each student’s socioeconomic status, the type of school attended (public or private), gender and their prior reading, writing, maths and science scores.
For example, in Figure 1 below we plot the proportion of students that choose each programme by their socioeconomic status (classified as low, middle, high). This figure seems to indicate that low-status students are less likely to choose an academic programme compared to a general programme. We can also summarise this type of information using contingency tables and a Pearson’s chi-squared test, which are often used in the initial exploratory stage of an analysis (see our blog on analysing categorical survey data for more details). For example, a Pearson’s chi-squared test confirmed that there is evidence of a statistically significant association between socioeconomic status and the programme choice made by the students in our data (p-value = 0.002).
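A minimal sketch of such a chi-squared test of association, using SciPy. The 3×3 contingency table of counts below is illustrative, invented to mirror the pattern described above (200 students in total), not quoted from the data set:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows = socioeconomic status (low, middle, high),
# columns = programme choice (general, academic, vocational)
observed = np.array([[16, 19, 12],
                     [20, 44, 31],
                     [ 9, 42,  7]])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom ((rows - 1) * (cols - 1) = 4 here), and the expected counts under
# the null hypothesis of no association
chi2, p_value, dof, expected = chi2_contingency(observed)
```

A p-value below the chosen significance level (e.g. 0.05) indicates evidence of an association between status and programme choice; remember that this says nothing about other, possibly confounding, factors.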
While visualisations and simple hypothesis testing are a useful first step to understanding the data, they only look at the effect of one variable in isolation and therefore they do not account for other potential confounding factors such as prior maths or reading scores in this case. Not controlling for confounding factors can lead to incomplete or even wrong conclusions being drawn (for more information see our blog on Simpson’s paradox). We can account for confounding factors by using a more formal approach, in this case a multinomial logistic regression model.
In a binary logistic regression, we model the probability of the outcome happening (vs. not happening). When extending the approach to model an outcome with multiple categories, we jointly estimate the probabilities of each outcome happening versus a baseline or reference category (usually the most desirable or most common outcome).
While the choice of the reference category does not change the model (i.e. the drivers retained in the model) nor does it change the estimated probabilities of each outcome, it does change how the results are reported. Therefore, the choice of the reference category should be based on the question at hand.
Here we choose to use the “academic” programme as the reference category. When we fit the model, we get two sets of model coefficients for each explanatory variable as an output (see Table 1 below), one for each comparison with the reference category. One set estimates the change in the log odds of choosing a vocational course rather than an academic course, and the other of choosing a general course rather than an academic course.
For each variable (and levels of the variable), the table below provides the coefficient estimates for both comparisons of the outcome, along with the standard error and 95% confidence intervals. The p-values reported are for each variable across both comparisons, rather than for the individual comparisons.
Explanatory variable  General vs. Academic log odds: coefficient estimate (standard error)  95% Confidence Interval  Vocational vs. Academic log odds: coefficient estimate (standard error)  95% Confidence Interval  p-value
Intercept  -0.29 (0.38)  -1.03; 0.45  -1.12 (0.46)  -2.03; -0.20  
SES: Middle vs. Low  -0.32 (0.49)  -1.28; 0.63  0.86 (0.52)  -0.17; 1.89  0.030
SES: High vs. Low  -1.04 (0.56)  -2.14; 0.07  -0.33 (0.64)  -1.60; 0.93  0.030
Private vs. Public school  -0.61 (0.55)  -1.68; 0.47  -2.02 (0.81)  -3.60; -0.43  0.012
Reading score  -0.06 (0.03)  -0.11; -0.004  -0.07 (0.03)  -0.13; -0.01  0.027
Maths score  -0.11 (0.03)  -0.17; -0.04  -0.14 (0.04)  -0.21; -0.07  < 0.001
Science score  0.09 (0.03)  0.03; 0.15  0.04 (0.03)  -0.01; 0.10  0.004
We find that the socioeconomic status of students, the type of school, and their prior reading, maths and science scores are all statistically significant for one or both of the comparisons (at the 5% significance level), while there is no evidence that the choice of programme differs between boys and girls (which is why gender is not included in the table).
Note that the p-values reported are for each of the variables in the model across both comparisons. Furthermore, if the confidence interval for a variable does not include zero for a given comparison, this means that the variable has a statistically significant effect on those odds.
It is often (but not always) the case with a multinomial logistic regression, that one variable has a statistically significant effect on the odds for one comparison but not another, i.e. that variable might not be associated with how likely one outcome is compared to the reference, but it could be associated with how likely a different outcome is compared to the same reference.
For example, the coefficients for students attending a private school (as opposed to a public school) are negative both for the odds of choosing a general vs. an academic programme and for a vocational vs. an academic programme, but the coefficient is only statistically significant for the latter. For the general vs. academic comparison, the 95% confidence interval includes 0, meaning that there is no evidence to suggest that a student from a private school (as opposed to a public school) is more or less likely to choose a general programme over an academic one. By contrast, there is strong evidence to suggest that a student from a private school is less likely to choose a vocational course over an academic one.
The above example illustrates that rather than interpreting each set of model coefficients uniquely, both sets of coefficients need to be considered in parallel to draw meaningful conclusions.
Whilst the raw model coefficients presented above are useful to understand the general direction of the effects of the different factors, they are not always naturally interpretable because they are reported on the log-odds scale. In a future blog we will discuss the different ways we can present the results of a multinomial logistic regression model, such as converting the outputs to odds ratios (using a similar interpretation to the ones presented in our previous blog) or to predicted probabilities. These different outputs allow us to more naturally understand which outcome is more likely to happen or, in our example, which programme is more likely to be chosen by a given student.
]]>The post Seeing Statistics in Practice appeared first on Select Statistical Consultants.
Sarah was pleased to be invited to speak to the students about her day-to-day life as a statistical consultant, focussing not only on the statistical challenges she faces in her role but also the wider consultancy skills that are a crucial element of the job. Other speakers came from cyber security, actuarial science, and finance, and gave an insight into the differing statistical problems being tackled in their respective industries.
Sarah and the other speakers also had the opportunity to attend poster presentation sessions by the students on their dissertation projects. “Meeting the students and hearing about their work was really interesting”, said Sarah. “It was great to see the diversity of both the projects they are working on and the statistical approaches being applied. It’s clear that the course is equipping the students with the necessary skills and enthusiasm for tackling challenging statistical problems, which they can take forward in their future careers be that in industry or academia.”
]]>The post Trust in Numbers: a Pillar of Good Statistical Practice appeared first on Select Statistical Consultants.
In the newly termed “post-truth” society in which we live, numbers and scientific evidence can often be (mis)used to provide a certificate of credibility. Professor Spiegelhalter pointed out that the intentional falsification of numbers and scientific evidence is thankfully rare, and that the misuse of statistics often has more to do with attempts to make a story more appealing by using “high impact visual data representation”, simplifying the presentation of the results by removing any mention of uncertainty, or omitting any discussion of the limitations of the data, experiment or analyses.
While clarity and insight are key for the presentation of statistical results, this should not be at the expense of quality and transparency, if we as statisticians are to build the general public’s confidence in numbers. To help with this, the UK Statistics Authority has released a new Code of Practice, centred around three pillars: Trustworthiness, Quality and Value. Whilst organisations producing official statistics are required to adhere to this code, any organisation producing data and statistics is encouraged to consider committing to the three pillars.
Here at Select, our consultants, as Chartered Statisticians and professional members of the Royal Statistical Society (RSS), also abide by the RSS Code of Conduct, which is designed to ensure that professional statisticians provide the highest level of statistical service and advice. We do not compromise on Trustworthiness, Quality or Value to make a finding more insightful or to create a better story.
]]>The post Is there a North/South divide in GCSE performance? appeared first on Select Statistical Consultants.
In a previous blog we looked at the GCSE results of pupils in different regions of England and examined the current and historical differences in attainment between the North and the South.
Updated analysis of the 2017 results by School Dash shows that the pattern of attainment (pupils in the South tending to perform better than pupils in the North) is still present in the latest GCSE results published by the DfE.
Mapping the percentage of pupils achieving 5 or more A* to C grades at GCSE (including grade 4 or higher in English and Maths) in 2017 for each Local Authority (LA; see Figure 1) shows the same pattern of attainment as with the 2015 GCSE results in our previous blog; that higher performing local authorities tend to be those located in the South (although as discussed in our previous blog there are clearly regional differences).
Fitting a statistical model to this data showed that, on average, 63% of pupils in the South gained 5+ A*-C grades compared to 59% of pupils in the North; this result was statistically significant.
Comparing GCSE results by region alone is very simplistic. It is likely that factors other than region affect the educational attainment of pupils. The DfE and other government departments publish a wealth of data about schools and regions, so we combined the GCSE results at LA-level with data about pupils’ characteristics, teacher vacancies, and deprivation measures (averaged across each LA).
To include these variables in the analysis, we fit a statistical model to the GCSE results at LA-level and add them as explanatory variables (we also included North/South as an explanatory variable). The factors that were significantly associated with GCSE performance are shown in the chart below. In this model there ceased to be any real difference between pupils from the North and the South, any differences being accounted for by differences in background factors.
For each of these factors, Figure 2 below shows how the percentage of pupils achieving 5 or more good grades at GCSE deviates from the national average (61%). Also shown, for comparison, is the difference for pupils in the North compared to the South. Not only has the gap in performance between pupils in the North and South decreased (now less than 1 percentage point compared to the previous 4 percentage point difference), but this difference is not statistically significant.
Figure 2 shows us that the variables that are statistically significant include those that represent:
Clearly deprivation measures are associated with GCSE performance though their interpretation is complicated. GCSE performance tends to be lower in areas that are generally more deprived (measured by average IDACI), but this is offset somewhat in LAs with larger areas in the top 10% most deprived areas of the country. LAs with higher proportions of pupils eligible for FSM (also a measure of deprivation) tend to have lower GCSE performance.
LAs with higher proportions of pupils with EAL tend to have higher GCSE performance, while LAs with higher proportions of pupils with SEN support tend to have lower GCSE performance. Once these factors have been taken into account, there is no longer any real difference between the GCSE performance of LAs in the North and South of England.
This can be further illustrated by looking at the model residuals; these are the differences in GCSE performance that remain after taking account of differences that are due to levels of deprivation, and proportion of pupils with FSM, EAL and SEN. Figure 3 shows the model residuals for LAs in England. The figure illustrates that the regional differences dissipate; that LAs where performance is above average (indicated in shades of yellow to red) and LAs where performance is below average (indicated in shades of blue) are distributed across the country; there is no geographic pattern, confirming that the differences in performance are not driven by a North/South divide.
The model results highlight that deprivation is clearly an important factor associated with school performance. Mapping deprivation (in this case the average IDACI measure) for LAs illustrates the similarity to the pattern of GCSE performance.
Areas with higher GCSE performance in Figure 1 (shaded red) tend also to be the areas with lower deprivation in Figure 4 (paler shades), and areas with lower GCSE performance (e.g. cities such as Hull, Leicester, Derby, Nottingham, Stoke) have relatively high levels of deprivation (shaded darker purple). While there are areas in the North that are less deprived, e.g. North Yorkshire and the East Riding of Yorkshire, there are clusters of more deprived areas in other parts of Yorkshire, around the river Mersey, around Birmingham and in the North East.
It is noticeable that the ‘city effect’ seen in the previous blog is still observed here; the higher levels of deprivation and lower GCSE performance observed in a number of other cities are not reflected in LAs in central London. London has similar levels of deprivation to some of the areas in the North (the North East, the North West, Yorkshire and the West Midlands) and yet, after taking deprivation into account, the percentage of pupils gaining 5+ A*-Cs in LAs in London is between 5 and 7 percentage points higher than in other regions of the country.
The link between education outcomes and disadvantage is not a new discovery and has been explored and discussed by others previously. Why London seems to do relatively well is not established and there are many other examples of schools and pupils that overcome their disadvantages and do well, demonstrating that the link between education outcomes and deprivation can be broken. A report by the Northern Powerhouse (Educating the North: driving ambition across the Powerhouse) published in February this year highlights “the devastating consequences of disadvantage in the North” and calls for “the government, local authorities, businesses and others to invest in our children and young people, to ensure they have the future they deserve.”
The factors associated with GCSE performance are multifaceted and complex. Even in this simple example we have shown that to really begin to understand why there are variations in GCSE performance it is important to use a statistical model. Once other variables are included in the model the North/South divide disappears. However, this model is limited and could be improved. More of the differences in regional performance (more of the variation) could be explained by adding more variables; we have not included any information about pupils’ home background, for example. If data were available, as well as background factors, further refinement could be added by drilling down to the school level, or lower. While statistical models can usually be improved they are often a balance between detail and parsimony.
]]>The post Why Use a Complex Sample for Your Survey? appeared first on Select Statistical Consultants.
Most statistical analyses assume that the data collected are from a simple random sample (SRS) of the population of interest. So say, for example, that you were conducting a survey of employees in your workplace (this is the “population”); a simple random sample would be one where each of your colleagues in the office (the “sampling units”) was equally likely to be sampled. However, it’s not always possible or practical to take a simple random sample. Simple random sampling requires access to the whole population of interest (a “complete sampling frame” listing the sampling units), which may not be feasible for large populations. If sampling units are widely spread out geographically, for example, it might also be prohibitively expensive to access and sample across the whole area. Or, if some members of the population (e.g., of a particular demographic background) are relatively low in number, a simple random sample might not obtain enough (or any) of these individuals to reliably measure their responses. So, even if a complete sampling frame is available, it might be much cheaper or more efficient to use a complex sampling scheme instead of SRS, such as multi-stage sampling, clustering and/or stratification, for example.
With these approaches, members of the population don’t all have the same probability of being selected into the sample. Complex samples are most often used for surveys, especially large national or multi-national ones where simple random sampling is simply not practical. For example, suppose you were conducting a survey in a conflict-affected country and the target population was all adults aged over 18, totalling, say, 20 million individuals. You might be interested in how responses differ by occupation, but some categories (perhaps the self-employed) may only represent a small fraction of the population. You might therefore consider stratifying your sampling to ensure that sufficient responses were obtained to make reliable estimates in each occupation group. The country may also be split geographically into, say, 40 states. To travel to and interview people in each of these states would likely be unfeasible, so cluster sampling might be used so that only a subset of the states needed to be accessed.
Remember – complex samples require statistical methods that take the sampling design into account.
Complex samples may also be incorporated into the design of cross-sectional observational studies or even interventional studies (such as clinical trials). The key thing to remember is that when analysing data from a survey using complex sampling, the statistical methods that you use must take the sampling design into account.
So, what are the most common complex sampling approaches and why and when are they used? We focus here on cluster sampling and stratified sampling. We’ll also discuss sampling without replacement which should also be taken into account when analysing your data.
In cluster sampling, the population is split into similar groups of individuals (“clusters”) and then a sample of these clusters is taken (the clusters are the sampling units in this case) so that all of the elements in the selected clusters are included in the sample. Clustering is appropriate when we expect elements in different clusters to be relatively similar (“homogeneous”), i.e., each cluster is representative of the population.
For example, suppose we wanted to gather the opinions of school children in a particular county in England, say Somerset. It would be difficult and expensive to interview all schoolaged children in Somerset, so we take a sample of those children instead. However, taking a simple random sample of pupils in Somerset may mean that we still need to survey pupils in all, or a large proportion, of the schools in the county. It would be much cheaper to only survey the pupils in a subset of schools – so, we might cluster pupils according to their schools and then take a sample of the clusters (surveying all students within those selected clusters) to obtain a clustered sample of school children in the county.
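As a toy illustration, a one-stage cluster sample of this kind can be sketched in a few lines. (Python here purely for illustration; the sampling frame of 20 schools with 30 pupils each is entirely made up.)

```python
import random

def cluster_sample(population, n_clusters, seed=0):
    """One-stage cluster sample: select clusters at random, then keep
    every element within each selected cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(population), n_clusters)  # clusters are the sampling units
    return {c: list(population[c]) for c in chosen}

# Hypothetical sampling frame: pupils grouped by school.
schools = {f"school_{i}": [f"pupil_{i}_{j}" for j in range(30)]
           for i in range(20)}
sample = cluster_sample(schools, n_clusters=5)  # survey everyone in 5 schools
```

Only 5 of the 20 schools need to be visited, which is where the cost saving comes from.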
This method is most efficient when most of the variation in the population is within clusters, rather than between them (higher within-cluster correlation increases the variance compared to SRS). Cluster sampling is generally used to reduce costs, by reducing the number of clusters that we sample within whilst maintaining the sample efficiency.
Stratified sampling involves splitting the members of the population into subgroups (“strata”) before sampling, and then applying sampling (usually SRS) separately within each and every group (“stratum”). This is in contrast to cluster sampling where whole clusters are sampled, rather than samples of individuals being taken within each group (i.e., stratum), as illustrated in Figure 1. Stratified sampling can help to ensure that the sample collected is representative of the population, by guaranteeing that sufficient individuals from each subgroup (e.g., gender, or socioeconomic status) will be sampled. This is especially important if some strata only represent a small proportion of the overall population and if the survey responses are expected to differ across the subgroups. For example, responses to a survey might differ by nationality so if we were to miss some of the nationalities in our sample, our results might be biased.
Stratified sampling is appropriate when elements in different strata are relatively dissimilar (“heterogeneous”), whereas cluster sampling is most efficient when the majority of the variation in the population is within clusters.
Returning to the example of surveying school children in Somerset, suppose we wanted to estimate the proportion of pupils with different characteristics separately by school type (e.g., academy, faith school, voluntary aided school, etc.). In this case, we could use stratified sampling: schools would be split into strata by school type, and samples of pupils would then be taken within each stratum, ensuring that pupils from each school type are adequately represented in the sample. Contrast this with cluster sampling, where we would cluster pupils by school and then take a sample of the clusters, surveying all pupils within those selected.
Each stratum can be sampled in proportion to the relative size of that subgroup in the total population (“proportionate allocation”) to make the overall sample as representative as possible. Or, larger samples can be obtained in strata with greater variability to minimise the sampling variance (“optimum allocation”), improving the efficiency of the sample overall.
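The two allocation rules can be written down directly. A Python sketch with hypothetical stratum sizes and standard deviations (the optimum rule below is the standard Neyman allocation, where the stratum sample size is proportional to N_h × S_h):

```python
def proportionate_allocation(n, sizes):
    """Sample each stratum in proportion to its population size: n_h = n * N_h / N."""
    N = sum(sizes.values())
    return {h: round(n * N_h / N) for h, N_h in sizes.items()}

def optimum_allocation(n, sizes, sds):
    """Neyman allocation: n_h proportional to N_h * S_h, so more variable
    strata receive larger samples."""
    total = sum(sizes[h] * sds[h] for h in sizes)
    return {h: round(n * sizes[h] * sds[h] / total) for h in sizes}

sizes = {"academy": 6000, "faith": 3000, "voluntary_aided": 1000}
sds = {"academy": 1.0, "faith": 2.0, "voluntary_aided": 4.0}
print(proportionate_allocation(500, sizes))  # {'academy': 300, 'faith': 150, 'voluntary_aided': 50}
print(optimum_allocation(500, sizes, sds))   # the small but highly variable stratum gets a larger share
```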
It is also possible to combine stratified sampling with cluster sampling. For example, we might stratify schools by type and then take cluster samples of schools within each stratum. This is an example of a one-stage cluster sampling scheme, but further stages of sampling could also be included. In two-stage cluster sampling, for example, after taking the sample of clusters, a sample of elements within each selected cluster is then taken. So, we might only interview a sample of the pupils in each selected school.
After completing your survey, you might find that the sample you have taken is not representative of the population (for example, 40% of the population might be male, whereas only 20% of the sample is male, so males are “under-sampled”). Such differences can be due to non-response or incomplete coverage, which are an inevitable consequence of the fact that we cannot sample everyone in the population nor compel them to respond. If the sample is imbalanced with respect to key factors that are likely to affect the study/survey responses, this can bias the results. In this case post-stratification can be applied: sampling weights are calculated to adjust the sample data after it has been collected and ensure that the results are representative of the population. For more information on survey weighting and post-stratification, see our case study on the work we did recently with Sport Wales for their School Sport Survey.
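At its simplest, a post-stratification weight is a group’s population share divided by its sample share. A Python sketch using the hypothetical 40%/20% male figures above (real survey weighting, e.g. raking across several factors at once, is more involved):

```python
def poststratification_weights(pop_counts, sample_counts):
    """Weight for group g = (population share of g) / (sample share of g),
    so under-sampled groups are weighted up and over-sampled groups down."""
    N = sum(pop_counts.values())
    n = sum(sample_counts.values())
    return {g: (pop_counts[g] / N) / (sample_counts[g] / n) for g in pop_counts}

# Population is 40% male, but only 20% of respondents are male:
weights = poststratification_weights({"male": 40, "female": 60},
                                     {"male": 20, "female": 80})
print(weights)  # males weighted up (2.0), females weighted down (0.75)
```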
Suppose you were taking a sample of animals from the wild, in order to estimate their average weights, for example. Once one animal had been caught and measured, it would then be released back into the wild. It’s possible, in this case, that you might catch and measure the same animal more than once – we call this “sampling with replacement”. With replacement means that once an individual is selected to be in the sample, that individual is placed back in the population to potentially be sampled again. There are two ways to select a sample from the population – with replacement, as in this example, or without replacement. Without replacement means that once an individual is sampled, that individual cannot be sampled again; they are not placed back in the population. This will often occur when a sample is pre-selected from a sampling frame, i.e., a list of all those in the population who can be sampled.
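The distinction is easy to demonstrate with the standard library (a Python sketch; the population of 100 “individuals” is hypothetical):

```python
import random

rng = random.Random(42)
population = list(range(1, 101))  # a sampling frame of 100 individuals

# With replacement: the same individual can be drawn more than once.
with_replacement = [rng.choice(population) for _ in range(20)]

# Without replacement: each individual can appear at most once.
without_replacement = rng.sample(population, 20)

assert len(set(without_replacement)) == len(without_replacement)  # always distinct
```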
Many standard analysis techniques assume that the sample being analysed was taken with replacement or from an infinite population (when the population is infinite, or extremely large, there’s little difference between sampling with and without replacement). However, in practice, most simple random samples are actually taken without replacement from a finite population. In this case, the variability of our sample is actually less than expected, and we can apply a finite population correction to account for this greater efficiency in the sampling process. Each sampled individual is always unique and therefore provides ‘new’ information when sampling without replacement, whereas it’s possible when sampling with replacement to have ‘repeated’ information. When sampling without replacement from a finite population, it may even be possible to sample all individuals, in which case we’ll have no uncertainty in our estimates. The correction only has a noticeable effect when the sampling fraction, i.e., the proportion of the population sampled, is large. A good rule of thumb: if your sample makes up more than 5% of the population, you should apply the correction. A finite population correction (FPC) factor is calculated, which is then multiplied by the standard error of the estimate. We’ve recently released a series of sample size and confidence interval calculators, which include a finite population correction – for more details (including the formula for the FPC) see the calculators on the Resources section of our website.
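As a quick numerical illustration (Python; a common form of the correction factor is sqrt((N − n)/(N − 1)), though you should check the exact formula given with whichever calculator you use):

```python
import math

def fpc(N, n):
    """Finite population correction factor, sqrt((N - n) / (N - 1)).
    Multiply the usual standard error by this when sampling without
    replacement from a finite population of size N."""
    return math.sqrt((N - n) / (N - 1))

# A sample of 500 from a population of 2,000 is a 25% sampling fraction,
# well above the 5% rule of thumb:
print(round(fpc(2000, 500), 3))  # 0.866 -- standard errors shrink by ~13%
print(fpc(100, 100))             # 0.0 -- the whole population: no uncertainty
```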
The most important thing to understand about complex sampling is that a more sophisticated analysis is needed when analysing the data collected – standard approaches are not necessarily appropriate. We must take account of the sample design in order for our conclusions to be reliable, whether we are estimating a characteristic of the population or testing for effects, for example.
The usual standard errors, which assume a simple random sample taken with replacement, will be incorrect if a complex sample has been taken. For example, a sample collected using cluster sampling underestimates the true population variance, because responses within a cluster tend to be more similar to each other than those of randomly selected individuals across the population. When the standard errors are correctly adjusted to account for the complex sampling design, they are larger than those that would have been obtained assuming a simple random sample of the same size. Without correcting for these underestimates, we increase the risk of falsely declaring effects significant when they do not actually exist (“false positives”).
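A common way to quantify this inflation is the so-called design effect. A sketch under the usual simplifying assumption of equal cluster sizes, with illustrative numbers (the intra-cluster correlation of 0.05 is made up):

```python
import math

def design_effect(m, rho):
    """Approximate variance inflation for a cluster sample with equal
    cluster sizes m and intra-cluster correlation rho:
    DEFF = 1 + (m - 1) * rho."""
    return 1 + (m - 1) * rho

# 30 respondents per cluster and even a modest correlation of 0.05:
deff = design_effect(30, 0.05)
print(round(deff, 2))             # 2.45 -- variances ~2.45x those of an SRS
print(round(math.sqrt(deff), 2))  # 1.57 -- standard errors ~57% larger
```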
In the statistical software package SPSS, complex samples analysis plans can be generated which, when used alongside the corresponding Analyze>Complex Samples menu, ensure that the sample design is incorporated into the analysis. In R, the survey package similarly allows you to specify a complex survey design and carry out appropriate analyses taking the design into account. Other packages in R, such as the anesrake package are also useful for implementing survey weighting, for example.
Complex samples are a useful tool for creating more efficient (e.g., stratified sampling with optimum allocation) or cheaper (e.g., cluster sampling) sampling designs. However, it’s crucial when using a complex sample to account for the sampling design when analysing your data in order to ensure that the results are accurate and reliable. If you’re conducting a survey using complex sampling and need help with the survey design or analysis, contact us to find out how we can help.
The post Why Use a Complex Sample for Your Survey? appeared first on Select Statistical Consultants.
The post Select Welcomes Jo to the Team appeared first on Select Statistical Consultants.
]]>“We’re really pleased to welcome Jo to the consulting team” says Managing Director, Lynsey McColl. “Jo has considerable experience in education and we are excited to work with her in developing this sector further within the company. Much of Jo’s knowledge and skills are also highly transferable, such as her experience in the design and analysis of surveys, and I know she is very much looking forward to working on projects from a wide range of sectors.”
The post Analysing Categorical Data Using Logistic Regression Models appeared first on Select Statistical Consultants.
When analysing a continuous response variable we would normally use a simple linear regression model to explore possible relationships with other explanatory variables. We might, for example, investigate the relationship between a response variable, such as a person’s weight, and other explanatory variables such as their height and gender.
“Logistic regression and multinomial regression models are specifically designed for analysing binary and categorical response variables.”
When the response variable is binary or categorical a standard linear regression model can’t be used, but we can use logistic regression models instead. These alternative regression models are specifically designed for analysing binary (e.g., yes/no) or categorical (e.g., Full-time/Part-time/Retired/Unemployed) response variables. Similar to linear regression models, logistic regression models can accommodate continuous and/or categorical explanatory variables as well as interaction terms to investigate potential combined effects of the explanatory variables (see our recent blog on Key Driver Analysis for more information).
Logistic regression models for binary response variables allow us to estimate the probability of the outcome (e.g., yes vs. no), based on the values of the explanatory variables. We could simply model this probability directly as a function of the explanatory variables but, instead, we use the logit function, logit(p) = ln(p/(1 − p)), where p is the probability of the outcome occurring, in order to determine the corresponding log odds of the outcome, which we then model as a linear combination of the explanatory variables. As with standard linear regression analyses, the model coefficients can then be interpreted in order to understand the direction and strength of the relationships between the explanatory variables and the response variable.
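The logit and its inverse are simple to compute; a short, purely illustrative sketch in Python:

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Back-transform a log odds value to a probability."""
    return 1 / (1 + math.exp(-x))

print(logit(0.5))             # 0.0 -- a 50% probability is even odds
print(round(logit(0.75), 3))  # 1.099 -- odds of 3 to 1
print(inv_logit(logit(0.3)))  # recovers 0.3 (up to floating-point rounding)
```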
Suppose, for example, that we are interested in how likely a student is to be offered a place on a postgraduate course. We consider the potential effects of the student’s mark on the course’s admissions exam (EXAM), their academic grading from their undergraduate degree (GRAD) and the prestige of their undergraduate institution (RANK, taking values from 1 to 4). We collect data from 400 students applying to graduate school and record whether they were successful or not in being admitted onto the course – so our response variable is binary (admit/not admit). (These data are available from the UCLA Institute for Digital Research and Education using the following link: http://www.ats.ucla.edu/stat/data/binary.csv.)
Running the logistic regression model (for example, using the statistical software package R), we obtain p-values for each explanatory variable and find that all three are statistically significant (at the 5% significance level). So there’s evidence that each of these has an independent effect on the probability of a student being admitted (rather than just a difference observed due to chance). But what are these effects – are they positive or negative, and how strong are they? To understand this we need to look at the coefficients estimated by the model. We find, for example, that the coefficient for GRAD is positive (0.804), so higher undergraduate grades increase the odds of admission, while the coefficient for attending a Rank=2 institution rather than a Rank=1 institution is negative (−0.675), reducing the odds of admission.
We can also exponentiate the coefficients and interpret them as odds ratios. This is the most common way of measuring the association between each explanatory variable and the outcome when using logistic regression. For the undergraduate institution rank above, the odds ratio for “if Rank=2” represents the odds of admission for an institution with Rank=2 compared to the odds of admission for an institution with Rank=1. The estimated odds ratio is exp(−0.675) = 0.509, which means that the odds of admission having attended a Rank=2 institution are 0.509 times the odds for having attended a Rank=1 institution (or equivalently 49% [= (1 − 0.509) × 100] lower). In other words, if the odds of a Rank=1 candidate are 1 to 10 (i.e., p=1/11 and 1−p=10/11), the odds of a Rank=2 candidate being admitted are about half as good, or about 1 to 20 (i.e., p=1/21 and 1−p=20/21). So, for every Rank=2 applicant who is admitted, twenty Rank=2 candidates will be rejected, but for every Rank=1 applicant who is admitted, only ten Rank=1 candidates will be rejected.
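The arithmetic behind that interpretation is simple (Python; the coefficient of −0.675 is the one whose exponential gives the 0.509 quoted above):

```python
import math

coef_rank2 = -0.675                   # model coefficient for Rank=2 vs Rank=1
odds_ratio = math.exp(coef_rank2)
print(round(odds_ratio, 3))           # 0.509
print(round((1 - odds_ratio) * 100))  # 49 -- i.e., odds roughly 49% lower
```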
Odds ratios can also be provided for continuous variables and in this case the odds ratio summarises the change in the odds per unit increase in the explanatory variable. For example, looking at the effect of GRAD above, the odds ratio (exp(0.804) = 2.23) says how the odds change per grade point – i.e., 2.23 times higher per point in this case. It’s important to note that, for continuous explanatory variables, their effect on the probability (as opposed to the odds) of the outcome is not constant across all values of the explanatory variable. Due to the logit transformation, the effect will be smaller for very low or very high values of the explanatory variable, and much larger for those in the middle.
We can also calculate a confidence interval to capture our uncertainty in the odds ratio estimate and we’ve put together an online odds ratio confidence interval calculator that you can use to do exactly this (you just need to enter your data from a contingency table). For the GRAD variable above, the 95% confidence interval for the odds ratio (estimated to be 2.23) is 1.17 to 4.32, so we’re 95% confident that this range covers the true odds ratio (if the study was repeated and the range calculated each time, we would expect the true value to lie within these ranges on 95% of occasions).
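For data arranged in a 2×2 contingency table, the usual large-sample interval takes the log odds ratio plus or minus 1.96 standard errors, with SE(ln OR) = sqrt(1/a + 1/b + 1/c + 1/d). A Python sketch with hypothetical counts:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and ~95% CI from a 2x2 table [[a, b], [c, d]],
    using the large-sample interval on the log odds ratio."""
    or_est = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    lo = math.exp(math.log(or_est) - z * se)
    hi = math.exp(math.log(or_est) + z * se)
    return or_est, lo, hi

# Hypothetical counts: outcome yes/no (columns) by group (rows).
or_est, lo, hi = odds_ratio_ci(20, 80, 10, 90)
print(round(or_est, 2), round(lo, 2), round(hi, 2))  # 2.25 0.99 5.09
```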
A key advantage of this modelling approach is that we are able to analyse the data all-in-one rather than splitting the data into subgroups and performing multiple tests (using a CHAID analysis, for example) which, with a reduced sample size, will have less statistical power. See our recent blog for further information on the importance and effect of sample size. By including all of the potential explanatory variables in one model, we can see which make up the most informative combination of predictors for the outcome.
All of the above (binary logistic regression modelling) can be extended to categorical outcomes (e.g., blood type: A, B, AB or O) – using multinomial logistic regression. The principles are very similar, but with the key difference being that one category of the response variable must be chosen as the reference category. Separate odds ratios are determined for all explanatory variables for each category of the response variable, except for the reference category. The odds ratios then represent the change in odds of the outcome being a particular category versus the reference category, for differing factor levels of the corresponding explanatory variable.
There are also extensions to the logistic regression model when the categorical outcome has a natural ordering (we call this ‘ordinal’ data as opposed to ‘nominal’ data). For example, the outcome might be the response to a survey where the answer could be “poor”, “average”, “good”, “very good”, and “excellent”. In this case we use ordered logistic regression modelling and we can explore whether the odds of being in a ‘higher’ category is associated with each of our explanatory variables.
These logistic regression models can also be used to make predictions of the probability of an outcome for particular cases. We can input the values of the explanatory variables (into the formula generated by the model) for a range of possible scenarios and obtain the predicted odds or probability of the outcome in each case.
The model can be implemented within a tool, for example in Microsoft Excel or as a web app (see our recent post on Interacting with Your Data). This allows a range of predictions to be made and visualised easily. Prediction intervals can also be provided with each projection to quantify the associated uncertainty in the estimate – giving the range for which we are confident that the true probability will lie and allowing the user to consider best and worstcase scenarios.
Logistic regression models are a great tool for analysing binary and categorical data, allowing you to perform a contextual analysis to understand the relationships between the variables, test for differences, estimate effects, make predictions, and plan for future scenarios. For a real-world example of the value of logistic regression modelling, see our case study on developing a medical decision tool using binary logistic regression to help inform the assessment of whether to extubate intensive care patients.
Logistic regression models are also great tools for classification problems – take a look at our blog on Classifying Binary Outcomes to find out more.
]]>The post Camille is Awarded Chartered Statistician Status appeared first on Select Statistical Consultants.
We’re pleased to announce that the prestigious Chartered Statistician designation has been granted to Camille by the Royal Statistical Society, recognising her extensive training and experience as a professional statistician.
The Chartered Statistician (CStat) status provides formal recognition of an individual’s statistical qualifications, professional training and experience and is the highest professional award for a statistician. To qualify, the Royal Statistical Society (RSS) requires an approved degree together with postgraduate training and experience as a professional statistician for at least 5 years, or alternatively the ability to demonstrate breadth and depth of statistical knowledge. Camille’s award, gained through the competency-based route, recognises her 10 years’ professional experience in a statistical role at the Pirbright Institute for Animal Health, University of Bristol and now here at Select Statistical Services. She was also able to demonstrate a strong and consistent commitment to continued professional development (CPD), another key criterion considered by the RSS in making the award.
Chartered Statisticians are required to abide by the Society’s code of conduct, and to adhere to their comprehensive CPD policy. Each CStat is required to regularly revalidate their qualification to ensure that they continue to adhere to the RSS’s strict guidelines which are designed to ensure that Chartered Statisticians provide the highest level of professional service to their clients.
Guidance on how to apply for the CStat award is available on the RSS web site, but the Select team are also very happy to offer advice and guidance on how to develop and maintain a suitable CPD programme and to apply for the CStat award.
The post CHAID (Chi-square Automatic Interaction Detector) appeared first on Select Statistical Consultants.
In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “CHAID”, but take a look at our previous articles on Key Driver Analysis, Maximum Difference Scaling and Customer Segmentation, and look out for new articles on TURF and Brand Mapping, coming soon. If there are other terms that you’d like us to blog on, we’d love to hear from you so please do get in touch.
CHAID (Chi-square Automatic Interaction Detector) analysis is an algorithm used for discovering relationships between a categorical response variable and other categorical predictor variables. It is useful when looking for patterns in datasets with lots of categorical variables and is a convenient way of summarising the data, as the relationships can be easily visualised.
In practice, CHAID is often used in direct marketing to understand how different groups of customers might respond to a campaign based on their characteristics. So suppose, for example, that we run a marketing campaign and are interested in understanding what customer characteristics (e.g., gender, socioeconomic status, geographic location, etc.) are associated with the response rate achieved. We build a CHAID “tree” showing the effects of different customer characteristics on the likelihood of response.
At the first level (the “trunk”) we have all customers and the overall response rate for the marketing campaign was, say, 24.3%. As we progress down the tree to the first “branch”, we identify the factor that has the greatest impact on the likelihood of response, and our overall population is broken down into groups (“leaves”) based upon their differing values of this characteristic – Urban/Rural. We might find that rural customers have a response rate of only 18.6%, whereas urban customers have a response rate of 28.5%. We check to see if this difference is statistically significant and, if it is, we retain these as new leaves. At the next branch, for each of the new groups (Urban/Rural), we then consider whether they can be further split into subgroups so that there is a significant difference in the dependent variable (the response rate). Urban homeowners may have a much higher response rate (36.1%) compared with urban non-homeowners (22.7%), and rural full-time workers might have a higher response rate (24.0%) than rural part-time workers (17.8%) or the rural retired/unemployed (5.3%), for example. At each step every predictor variable is considered to see if splitting the sample based on this factor leads to a statistically significant relationship with the response variable. Where there might be more than two groupings for a predictor, merging of the categories is also considered to find the best discrimination. If a statistically significant difference is observed then the most significant factor is used to make a split, which becomes the next branch in the tree.
The process repeats to find the predictor variable on each leaf that is most significantly related to the response, branch by branch, until no further factors are found to have a statistically significant effect on the response (e.g., likelihood of responding to the marketing campaign). The results can be visualised with a so-called tree diagram – see below, for example. In this case, we can see that urban homeowners (36.1%) have the highest response rates, followed by rural full-time workers (24.0%) and that these are therefore the best groups of customers to target. On the other hand, the lowest response rates were observed for the rural, retired/unemployed, aged over 65 years (1.4%).
As indicated in the name, CHAID uses Pearson’s Chi-square tests of independence, which test for an association between two categorical variables. A statistically significant result indicates that the two variables are not independent, i.e., there is a relationship between them. (See our recent blog post “Depression in Men ‘Regularly Ignored’…” for an example looking at the relationship between perceived mental health disorders and gender.)
Chi-square tests are applied at each of the stages in building the CHAID tree, as described above, to ensure that each branch is associated with a statistically significant predictor of the response variable (e.g., response rate). Bonferroni corrections, or similar adjustments, are used to account for the multiple testing that takes place. When testing with a 5% significance level (i.e., considering a p-value of less than 0.05 to be statistically significant) we have a one in 20 chance of finding a false-positive result; concluding that there is a difference when in fact none exists (see this light-hearted cartoon for further discussion of multiple testing). The more tests that we do, the greater the chance we will find one of these false-positive results (inflating the so-called Type I error), so adjustments to the p-values are used to counter this, so that stronger evidence is required to indicate a significant result.
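For a single 2×2 split, the Pearson statistic can even be computed by hand. A Python sketch with hypothetical counts chosen to match the 28.5% and 18.6% urban/rural response rates of the earlier example (a real CHAID analysis would, of course, use a statistical package):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson Chi-square statistic for a 2x2 table [[a, b], [c, d]]
    (1 degree of freedom, no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Urban vs rural response to the campaign, 1,000 customers per group:
stat = chi_square_2x2(285, 715, 186, 814)  # 28.5% vs 18.6% responded
print(round(stat, 1))  # 27.2 -- well above the 3.84 critical value at the 5% level
# With, say, 10 candidate splits, a Bonferroni correction would instead
# test each split at 0.05 / 10 = 0.005.
```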
CHAID can also be extended to apply to the case where we have a continuous response variable, for example, sales recorded in £’s. However, in this case F-tests rather than Chi-square tests are used. Continuous predictor variables can also be incorporated by determining cut-offs to create ordinal groups of variables, based, for example, on particular percentiles of the variable. So, we might band incomes into four groups, based on its quartiles, such as ≤ £15,000; > £15,000 & ≤ £20,000; > £20,000 & ≤ £33,000; and > £33,000.
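Quartile banding of this kind is a one-liner in most statistical packages; a rough hand-rolled Python sketch (the positional quartile rule used here is deliberately simplistic, and the income values are hypothetical):

```python
def quartile_bands(values):
    """Assign each value an ordinal band 0-3 according to which quartile
    of the data it falls in (simple positional cut-offs)."""
    ordered = sorted(values)
    cuts = [ordered[int(len(ordered) * f)] for f in (0.25, 0.5, 0.75)]
    return [sum(v > cut for cut in cuts) for v in values]

incomes = [12000, 18000, 25000, 40000, 15000, 21000, 33000, 9000]
print(quartile_bands(incomes))  # [0, 1, 2, 3, 0, 1, 2, 0]
```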
Generally a large sample size is needed to perform a CHAID analysis. At each branch, as we split the total population, we reduce the number of observations available and with a small total sample size the individual groups can quickly become too small for reliable analysis.
When we are interested in identifying groups of customers for targeted marketing where we do not have a response variable on which to base the splits in our sample, we can use other market segmentation techniques such as cluster analysis (see our recent blog on Customer segmentation for further information).
CHAID is sometimes used as an exploratory method for predictive modelling. However, a more formal multiple logistic or multinomial regression model could be applied instead. These regression models are specifically designed for analysing binary (e.g., yes/no) or categorical response variables and can accommodate continuous and/or categorical predictor variables. Interaction terms could be included in the model to investigate the associations between predictors that are tested for in the CHAID algorithm, whilst allowing a wider range of possible model specifications which may well fit the data better. Another advantage of this modelling approach is that we are able to analyse the data all-in-one rather than splitting the data into subgroups and performing multiple tests. In particular, where a continuous response variable is of interest or there are a number of continuous predictors to consider, we would recommend performing a multiple regression analysis instead. See our recent blog post on Analysing Categorical Data Using Logistic Regression Models for further details of these more formal modelling approaches.