The post Cumulative Gains and Lift Curves: Measuring the Performance of a Marketing Campaign appeared first on Select Statistical Consultants.
What returns will I get from running my marketing campaign?
In this context, we want to understand what benefit the predictive model can offer in predicting which customers will be responders versus non-responders in a new campaign (compared to targeting them at random). This can be achieved by examining the cumulative gains and lift associated with the model, comparing its performance in targeting responders with how successful we would be without the added value offered by the model. We can also use the same information to help decide how many pieces of direct mail to send, balancing the marketing costs with the expected returns from the resulting sales. There is a cost associated with each customer that you mail and therefore you want to maximise the number of respondents that you acquire for the number of mailings you send.
In this blog, we describe the steps required to calculate the cumulative gains and lift associated with a predictive classification model.
Continuing with the direct marketing example, using the fitted model we can compare the observed outcomes from the historical marketing campaign, i.e., who responded and who did not, with the predicted probabilities of responding for each customer contacted in that campaign. (Note that, in practice, we would fit the model to a subset of our data and use this model to predict the probability of responding for each customer in a “holdout” sample to get a more accurate assessment of how the model would perform for new customers.)
We first sort the customers by their predicted probabilities, in decreasing order from highest (closest to one) to lowest (closest to zero). Splitting the customers into equally sized segments, we create groups containing the same numbers of customers, for example, 10 decile groups each containing 10% of the customer base. So, those customers who we predict are most likely to respond are in decile group 1, the next most likely in decile group 2, and so on. Examining each of the decile groups, we can produce a decile summary, as shown in Table 1, summarising the numbers and proportions of customers and responders in each decile.
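As an illustrative sketch of this step (using simulated predicted probabilities and outcomes rather than the campaign data described here, with hypothetical column names), the decile summary can be built as follows:

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the campaign's customers: a predicted probability
# of responding and an observed outcome for each customer.
rng = np.random.default_rng(42)
p_hat = rng.beta(1, 12, size=10_000)
responded = rng.random(10_000) < p_hat

# Sort by predicted probability, highest first, and split into 10 equally
# sized decile groups (decile 1 = the customers most likely to respond).
df = pd.DataFrame({"p_hat": p_hat, "responded": responded})
df = df.sort_values("p_hat", ascending=False).reset_index(drop=True)
df["decile"] = df.index // (len(df) // 10) + 1

# Decile summary: customers, responders and response rate per group
summary = df.groupby("decile").agg(
    customers=("responded", "size"),
    responders=("responded", "sum"),
)
summary["response_rate"] = summary["responders"] / summary["customers"]
print(summary)
```

Sorting before splitting guarantees that decile group 1 contains the 10% of customers with the highest modelled probabilities, mirroring the structure of Table 1.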
The historical data may show that overall, and therefore when mailing the customer base at random, approximately 5% of customers respond (506 out of 10,000 customers). So, if you mail 1,000 customers you expect to see around 50 responders. But, if we look at the response rates achieved in each of the decile groups in Table 1, we see that the top groups have a higher response rate than this, they are our best prospects.
| Decile Group | Predicted Probability Range | Number of Customers | Cumulative No. of Customers | Cumulative % of Customers | Responders | Response Rate | Cumulative No. of Responders | Cumulative % of Responders | Lift |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.129–1.000 | 1,000 | 1,000 | 10.0% | 143 | 14.3% | 143 | 28.3% | 2.83 |
| 2 | 0.105–0.129 | 1,000 | 2,000 | 20.0% | 118 | 11.8% | 261 | 51.6% | 2.58 |
| 3 | 0.073–0.105 | 1,000 | 3,000 | 30.0% | 96 | 9.6% | 357 | 70.6% | 2.35 |
| 4 | 0.040–0.073 | 1,000 | 4,000 | 40.0% | 51 | 5.1% | 408 | 80.6% | 2.02 |
| 5 | 0.025–0.040 | 1,000 | 5,000 | 50.0% | 32 | 3.2% | 440 | 87.0% | 1.74 |
| 6 | 0.018–0.025 | 1,000 | 6,000 | 60.0% | 19 | 1.9% | 459 | 90.7% | 1.51 |
| 7 | 0.015–0.018 | 1,000 | 7,000 | 70.0% | 17 | 1.7% | 476 | 94.1% | 1.34 |
| 8 | 0.012–0.015 | 1,000 | 8,000 | 80.0% | 14 | 1.4% | 490 | 96.8% | 1.21 |
| 9 | 0.006–0.012 | 1,000 | 9,000 | 90.0% | 11 | 1.1% | 501 | 99.0% | 1.10 |
| 10 | 0.000–0.006 | 1,000 | 10,000 | 100.0% | 5 | 0.5% | 506 | 100.0% | 1.00 |
For example, we find that in decile group 1 the response rate was 14.3% (there were 143 responders out of the 1,000 customers), compared with the overall response rate of 5.1%. We can also visualise the results from the decile summary in a waterfall plot, as shown in Figure 1. This illustrates that the customers in decile groups 1, 2 and 3 have a response rate above the overall average when targeted using the predictive model.
From the decile summary, we can also calculate the cumulative gains provided by the model. We compare the cumulative percentage of customers who are responders with the cumulative percentage of customers contacted in the marketing campaign across the groups. This describes the ‘gain’ in targeting a given percentage of the total number of customers using the highest modelled probabilities of responding, rather than targeting them at random.
For example, the top 10% of customers with the highest predicted probabilities (decile 1), contain approximately 28.3% of the responders (143/506). So, rather than capturing 10% of the responders, we have found 28.3% of the responders having mailed only 10% of the customer base. Including a further 10% of customers (deciles 1 and 2), we find that the top 20% of customers contain approximately 51.6% of the responders. These figures can be displayed in a cumulative gains chart, as shown in Figure 2.
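The cumulative gains quoted above can be reproduced directly from the responder counts in Table 1; a minimal sketch:

```python
# Responders per decile, copied from Table 1 (506 responders in total)
responders = [143, 118, 96, 51, 32, 19, 17, 14, 11, 5]
total = sum(responders)

# Cumulative gain = cumulative share of responders reached after each decile
gains = []
cum = 0
for r in responders:
    cum += r
    gains.append(cum / total)

for decile, g in enumerate(gains, start=1):
    print(f"top {decile * 10:3d}% of customers -> {g:.1%} of responders")
```

The first two values are 28.3% (143/506) and 51.6% (261/506), matching the figures in the text; plotting `gains` against the cumulative percentage of customers gives the cumulative gains chart of Figure 2.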
The dashed line in Figure 2 corresponds with “no gain”, i.e., what we would expect to achieve by contacting customers at random. The closer the cumulative gains line is to the top-left corner of the chart, the greater the gain: the higher the proportion of the responders that are reached for the lower proportion of customers contacted.
Depending on the costs associated with sending each piece of direct mail and the expected revenue from each responder, the cumulative gains chart can be used to decide upon the optimum number of customers to contact. There will likely be a tipping point at which we have reached a sufficiently high proportion of responders, and where the costs of contacting a greater proportion of customers are too great given the diminishing returns. This will generally correspond with a flattening-off of the cumulative gains curve, where further contacts (corresponding with additional deciles) are not expected to provide many additional responders. In practice, rather than grouping customers into deciles, a larger number of groups could be examined, allowing greater flexibility in the proportion of customers we might consider contacting.
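To make the tipping point concrete, here is a hypothetical profit calculation over the deciles of Table 1, assuming an illustrative £0.50 cost per mailing and £20 revenue per responder (both figures invented for the sketch, not taken from the campaign):

```python
# Cumulative responders per decile, from Table 1; the cost and revenue
# figures below are assumed values for illustration only.
cum_responders = [143, 261, 357, 408, 440, 459, 476, 490, 501, 506]
cost_per_mail = 0.50
revenue_per_responder = 20.0

profits = []
for decile, resp in enumerate(cum_responders, start=1):
    mailed = decile * 1000
    profits.append(resp * revenue_per_responder - mailed * cost_per_mail)
    print(f"mail top {mailed:6d} customers: expected profit £{profits[-1]:,.2f}")

best = profits.index(max(profits)) + 1
print(f"profit peaks when mailing the top {best} deciles")
```

Under these assumed figures the expected profit rises to a peak and then declines as the gains curve flattens off, which is exactly the trade-off the chart helps to locate.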
We can also look at the lift achieved by targeting increasing percentages of the customer base, ordered by decreasing probability. The lift is simply the ratio of the percentage of responders reached to the percentage of customers contacted.
So, a lift of 1 is equivalent to no gain compared with contacting customers at random. Whereas a lift of 2, for example, corresponds with there being twice the number of responders reached compared with the number you’d expect by contacting the same number of customers at random. So, we may have only contacted 40% of the customers, but we may have reached 80% of the responders in the customer base. Therefore, we have doubled the number of responders reached by targeting this group compared with mailing a random sample of customers.
These figures can be displayed in a lift curve, as shown in Figure 3. Ideally, we want the lift curve to extend as high as possible into the top-left corner of the figure, indicating that we have a large lift associated with contacting a small proportion of customers.
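The lift column of Table 1 follows directly from this definition; a short sketch using the same per-decile responder counts:

```python
# Lift = (cumulative % of responders reached) / (cumulative % of customers
# contacted), using the per-decile responder counts from Table 1.
responders = [143, 118, 96, 51, 32, 19, 17, 14, 11, 5]
total = sum(responders)

lifts = []
cum = 0
for decile, r in enumerate(responders, start=1):
    cum += r
    pct_customers = decile / 10          # each decile is 10% of the base
    lifts.append((cum / total) / pct_customers)
    print(f"top {decile * 10:3d}%: lift = {lifts[-1]:.2f}")
```

The first values are 2.83 and 2.58, matching Table 1, and the lift falls to 1.00 once the whole customer base is contacted.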
In a previous blog post we discussed how ROC curves can be used in assessing how good a model is at classifying (i.e., predicting an outcome). As well as understanding the predictive accuracy of a model used for classification, it can also be helpful to understand what benefit is offered by the model compared with trying to identify an outcome without it.
Cumulative gains and lift curves are a simple and useful approach to understand what returns you are likely to get from running a marketing campaign and how many customers you should contact, based on targeting the most promising customers using a predictive model. These approaches could similarly be applied in the context of predicting which individuals will default on a personal loan in order to decide who could be offered a credit card, for example. In this case, the aim is to minimise the number of people likely to default on the loan, whilst maximising the number of credit cards offered to those who will not default. The predictive model in each case could be any appropriate statistical approach for generating a probability for a binary outcome, be that a logistic regression model, a random forest, or a neural network, for example.
The post Debunking the myth of a North/South divide in GCSE performance appeared first on Select Statistical Consultants.
Interestingly, his analysis, conducted on three annual cohorts of pupils, finds the same results as our single-cohort school-level analysis reported in our recent blog: that differences in GCSE performance are not driven by a North/South divide and that the factors affecting performance are, in fact, multifaceted and complex.
The article advises against just using high-level statistics, and highlights the importance of undertaking in-depth analyses. In our analyses of the available data, we fitted a statistical model and found deprivation to be a driver of performance; areas with high levels of deprivation tended to have lower GCSE performance. Interestingly, Stephen Gorard’s research has delved into this a little deeper and shows that it is not just whether or not pupils are eligible for free school meals that affects attainment, but that a more important factor is the length of time that pupils have faced disadvantage. The article says that the current measure of deprivation, whether a child is eligible for free school meals or not, does not capture enough of the aspects of socioeconomic deprivation or disadvantage.
Of course, in the education sector and other fields that use observational studies, whilst we include as many of the influential and relevant factors as we can in any analysis, we must always be aware that analyses are often limited by the factors that can be included or, more importantly, those that can’t. Many analyses of student outcomes can’t, for example, take account of factors such as motivation, the effect of inspirational teachers, or home resources since these are not simple to measure.
Given past headlines stating the existence of a North/South educational divide, how do we know that an analysis has been conducted appropriately and whether or not to believe a headline? In our experience, clear and honest reporting is vital: detailing not only the results, but also the methods, any assumptions and limitations. Being clear about what is and isn’t included in your analyses, and what they do and don’t tell you, enables others to appropriately evaluate the evidence themselves.
The post Analysing outcomes with multiple categories appeared first on Select Statistical Consultants.
Suppose, for example, that instead of modelling the odds of a student being offered a place on a course, we wish to understand the choice a student makes between different types of high-school programmes (e.g. an academic, general or vocational course). Here our response variable is still categorical, but now there are three possible outcomes (academic, general or vocational) rather than two. To explore this, we have data available on the academic choices made by 200 US students. (These data are available from the UCLA Institute for Digital Research and Education using the following link: https://stats.idre.ucla.edu/stat/data/hsbdemo.dta).
In addition to the actual programme choice made by the student, the dataset also contains information on other factors that could potentially influence their choices such as each student’s socioeconomic status, the type of school attended (public or private), gender and their prior reading, writing, maths and science scores.
For example, in Figure 1 below we plot the proportion of students that choose each programme by their socioeconomic status (classified as low, middle, high). This figure seems to indicate that low income students are less likely to choose an academic programme compared to a general programme. We can also summarise this type of information using contingency tables and a Pearson’s chi-squared test, which are often used in the initial exploratory stage of an analysis (see our blog on analysing categorical survey data for more details). For example, a Pearson’s chi-squared test confirmed that there is evidence of a statistically significant association between socioeconomic status and the programme choice made by the students in our data (p-value = 0.002).
While visualisations and simple hypothesis testing are a useful first step to understanding the data, they only look at the effect of one variable in isolation and therefore they do not account for other potential confounding factors such as prior maths or reading scores in this case. Not controlling for confounding factors can lead to incomplete or even wrong conclusions being drawn (for more information see our blog on Simpson’s paradox). We can account for confounding factors by using a more formal approach, in this case a multinomial logistic regression model.
In a binary logistic regression, we model the probability of the outcome happening (vs. not happening). When extending the approach to model an outcome with multiple categories, we jointly estimate the probabilities of each outcome happening versus a baseline or reference category (usually the most desirable or most common outcome).
While the choice of the reference category does not change the model (i.e. the drivers retained in the model) nor does it change the estimated probabilities of each outcome, it does change how the results are reported. Therefore, the choice of the reference category should be based on the question at hand.
For instance, here we choose to use the “academic” programme as the reference category. When we fit the model, we get two sets of model coefficients for each explanatory variable as an output (see Table 1 below), one for each comparison to the reference category. One set estimates the changes in the log odds of choosing a vocational course rather than an academic course, and the other of choosing a general course rather than an academic course.
For each variable (and each level of the variable), the table below provides the coefficient estimates for both comparisons of the outcome, along with the standard error and 95% confidence intervals. The p-values reported are for each variable across both comparisons, rather than for the individual comparisons.
| Explanatory variable | General vs. Academic: log odds (SE) | General vs. Academic: 95% CI | Vocational vs. Academic: log odds (SE) | Vocational vs. Academic: 95% CI | p-value |
|---|---|---|---|---|---|
| Intercept | -0.29 (0.38) | -1.03; 0.45 | -1.12 (0.46) | -2.03; -0.20 | |
| SES: Middle vs. Low | -0.32 (0.49) | -1.28; 0.63 | 0.86 (0.52) | -0.17; 1.89 | 0.030 |
| SES: High vs. Low | -1.04 (0.56) | -2.14; 0.07 | -0.33 (0.64) | -1.60; 0.93 | 0.030 |
| Private vs. Public school | -0.61 (0.55) | -1.68; 0.47 | -2.02 (0.81) | -3.60; -0.43 | 0.012 |
| Reading score | -0.06 (0.03) | -0.11; -0.004 | -0.07 (0.03) | -0.13; -0.01 | 0.027 |
| Maths score | -0.11 (0.03) | -0.17; -0.04 | -0.14 (0.04) | -0.21; -0.07 | < 0.001 |
| Science score | 0.09 (0.03) | 0.03; 0.15 | 0.04 (0.03) | -0.01; 0.10 | 0.004 |
We find that the socioeconomic status of students, the type of school, and their prior reading, maths and science scores are all statistically significant for one or both of the comparisons (at the 5% significance level), while there is no evidence that the choice of programme differs between boys and girls (hence why gender is not included in the table).
Note that the p-values reported are for each of the variables in the model across both comparisons. Furthermore, if the confidence interval for a variable does not include zero for a given comparison, then that variable has a statistically significant effect on those odds.
It is often (but not always) the case with a multinomial logistic regression, that one variable has a statistically significant effect on the odds for one comparison but not another, i.e. that variable might not be associated with how likely one outcome is compared to the reference, but it could be associated with how likely a different outcome is compared to the same reference.
For example, the coefficients for students attending a private school as opposed to a public school are negative for both the odds of choosing a general vs. an academic programme and a vocational vs. an academic programme, but the coefficient is only statistically significant for the latter. For the comparison between the general and academic programmes, the 95% confidence interval includes 0, meaning that there is no evidence to suggest that a student from a private school (as opposed to a public school) is more or less likely to choose a general programme over an academic one. By contrast, there is strong evidence to suggest that a student from a private school (as opposed to a public school) is less likely to choose a vocational course over an academic one.
The above example illustrates that rather than interpreting each set of model coefficients uniquely, both sets of coefficients need to be considered in parallel to draw meaningful conclusions.
Whilst the raw model coefficients presented above are useful to understand the general direction of the effects of the different factors, they are not always naturally interpretable because they are reported on the log-odds scale. In a future blog we will discuss the different ways we can present the results of a multinomial logistic regression model such as converting the outputs to odds ratios (using a similar interpretation to the ones presented in our previous blog) or to predicted probabilities. These different outputs allow us to more naturally understand which outcome is more likely to happen or, in our example, which programme is more likely to be chosen by a given student.
The post Seeing Statistics in Practice appeared first on Select Statistical Consultants.
Sarah was pleased to be invited to speak to the students about her day-to-day life as a statistical consultant, focussing not only on the statistical challenges she faces in her role but also on the wider consultancy skills that are a crucial element of the job. Other speakers came from cyber security, actuarial science and finance, and gave an insight into the differing statistical problems being tackled in their respective industries.
Sarah and the other speakers also had the opportunity to attend poster presentation sessions by the students on their dissertation projects. “Meeting the students and hearing about their work was really interesting”, said Sarah. “It was great to see the diversity of both the projects they are working on and the statistical approaches being applied. It’s clear that the course is equipping the students with the necessary skills and enthusiasm for tackling challenging statistical problems, which they can take forward in their future careers be that in industry or academia.”
The post Trust in Numbers: a Pillar of Good Statistical Practice appeared first on Select Statistical Consultants.
In the newly termed “post-truth” society in which we live, numbers and scientific evidence can often be (mis)used to provide a certificate of credibility. Professor Spiegelhalter pointed out that the intentional falsification of numbers and scientific evidence is thankfully rare, and that the misuse of statistics often has more to do with attempts to make a story more appealing by using “high impact visual data representation”, simplifying the presentation of the results by removing any mention of uncertainty, or omitting any discussion of the limitations of the data, experiment or analyses.
While clarity and insight are key for the presentation of statistical results, this should not be at the expense of quality and transparency if we, as statisticians, are to build the general public’s confidence in numbers. To help with this, the UK Statistics Authority has released a new Code of Practice, centred around three pillars: Trustworthiness, Quality and Value. Whilst organisations producing official statistics are required to adhere to this code, any organisation producing data and statistics is encouraged to consider committing to the three pillars.
Here at Select, our consultants, as Chartered Statisticians and professional members of the Royal Statistical Society (RSS), also abide by the RSS Code of Conduct, which is designed to ensure that professional statisticians provide the highest level of statistical service and advice. We do not compromise on Trustworthiness, Quality or Value to make a finding more insightful or to create a better story.
The post Is there a North/South divide in GCSE performance? appeared first on Select Statistical Consultants.
In a previous blog we looked at the GCSE results of pupils in different regions of England and examined the current and historical differences in attainment between the North and the South.
Updated analysis of the 2017 results by School Dash shows that the pattern of attainment (pupils in the South tending to perform better than pupils in the North) is still present in the latest GCSE results published by the DfE.
Mapping the percentage of pupils achieving 5 or more A* to C grades at GCSE (including grade 4 or higher in English and Maths) in 2017 for each Local Authority (LA; see Figure 1) shows the same pattern of attainment as with the 2015 GCSE results in our previous blog; that higher performing local authorities tend to be those located in the South (although as discussed in our previous blog there are clearly regional differences).
Fitting a statistical model to this data showed that, on average, 63% of pupils in the South gained 5+ A*–C grades compared to 59% of pupils in the North; this result was statistically significant.
Comparing GCSE results by region is very simplistic. It is likely that factors other than region affect the educational attainment of pupils. The DfE and other government departments publish a wealth of data about schools and regions, so we combined the GCSE results at LA-level with data about their pupils’ characteristics, teacher vacancies, and deprivation measures (averaged across each LA).
To include these variables in the analysis, we fitted a statistical model to the GCSE results at LA-level and added them as explanatory variables (we also included North/South as an explanatory variable). The factors that were significantly associated with GCSE performance are shown in the chart below. In this model there ceased to be any real difference between pupils from the North and the South; any differences were accounted for by differences in background factors.
For each of these factors, figure 2 below shows how the percentage of pupils achieving 5 or more good grades at GCSE deviates from the national average (61%). Also shown, for comparison, is the difference for pupils in the North compared to the South. Not only has the gap in performance between pupils in the North and South decreased (now less than 1 percentage point compared to the previous 4 percentage point difference), but this difference is also not statistically significant.
Figure 2 shows us that the variables that are statistically significant include those that represent levels of deprivation, eligibility for free school meals (FSM), English as an additional language (EAL) and special educational needs (SEN) support.
Clearly deprivation measures are associated with GCSE performance though their interpretation is complicated. GCSE performance tends to be lower in areas that are generally more deprived (measured by average IDACI), but this is offset somewhat in LAs with larger areas in the top 10% most deprived areas of the country. LAs with higher proportions of pupils eligible for FSM (also a measure of deprivation) tend to have lower GCSE performance.
LAs with higher proportions of pupils with EAL tend to have higher GCSE performance, while LAs with higher proportions of pupils with SEN support tend to have lower levels of GCSE performance. Once these factors have been taken into account there is no longer any real difference between the GCSE performance of LAs in the North and South of England.
This can be further illustrated by looking at the model residuals; these are the differences in GCSE performance that remain after taking account of differences that are due to levels of deprivation, and proportion of pupils with FSM, EAL and SEN. Figure 3 shows the model residuals for LAs in England. The figure illustrates that the regional differences dissipate; that LAs where performance is above average (indicated in shades of yellow to red) and LAs where performance is below average (indicated in shades of blue) are distributed across the country; there is no geographic pattern, confirming that the differences in performance are not driven by a North/South divide.
The model results highlight that deprivation is clearly an important factor associated with school performance. Mapping deprivation (in this case the average IDACI measure) for LAs illustrates the similarity to the pattern of GCSE performance.
Areas with higher GCSE performance in Figure 1 (shaded red) tend also to be the areas with lower deprivation in Figure 4 (paler shades), and areas with lower GCSE performance (e.g. cities such as Hull, Leicester, Derby, Nottingham, Stoke) have relatively high levels of deprivation (shaded darker purple). While there are areas in the North that are less deprived, e.g. North Yorkshire and the East Riding of Yorkshire, there are clusters of more deprived areas in other parts of Yorkshire, around the river Mersey, around Birmingham and in the North East.
It is noticeable that the ‘city effect’ seen in the previous blog, is still observed here; that the higher levels of deprivation and lower GCSE performance observed in a number of other cities is not reflected in LAs in central London. London has similar levels of deprivation as some of the areas in the North (the North East, the North West, Yorkshire and the West Midlands) and yet, while taking deprivation into account, the percentage of pupils gaining 5+ A*–Cs in LAs in London is between 5 and 7 percentage points higher than other regions of the country.
The link between education outcomes and disadvantage is not a new discovery and has been explored and discussed by others previously. Why London seems to do relatively well is not established and there are many other examples of schools and pupils that overcome their disadvantages and do well, demonstrating that the link between education outcomes and deprivation can be broken. A report by the Northern Powerhouse (Educating the North: driving ambition across the Powerhouse) published in February this year highlights “the devastating consequences of disadvantage in the North” and calls for “the government, local authorities, businesses and others to invest in our children and young people, to ensure they have the future they deserve.”
The factors associated with GCSE performance are multifaceted and complex. Even in this simple example we have shown that to really begin to understand why there are variations in GCSE performance it is important to use a statistical model. Once other variables are included in the model the North/South divide disappears. However, this model is limited and could be improved. More of the differences in regional performance (more of the variation) could be explained by adding more variables; we have not included any information about pupils’ home background, for example. If data were available, as well as background factors, further refinement could be added by drilling down to the school level, or lower. While statistical models can usually be improved they are often a balance between detail and parsimony.
The post Why Use a Complex Sample for Your Survey? appeared first on Select Statistical Consultants.
Most statistical analyses assume that the data collected are from a simple random sample (SRS) of the population of interest. So say, for example, that you were conducting a survey of employees in your workplace (this is the “population”), a simple random sample would be where each of your colleagues in the office (or “sampling units”) were equally likely to be sampled. However, it’s not always possible or practical to take a simple random sample. Simple random sampling requires access to the whole population of interest (a “complete sampling frame” listing the sampling units) which may not be feasible for large populations. If sampling units are widely spread out geographically, for example, it might also be prohibitively expensive to access and sample across the whole area. Or, if some members of the population (e.g., of a particular demographic background) are relatively low in number, a simple random sample might not obtain enough (or any) of these individuals to reliably measure their responses. So, even if a complete sampling frame is available, it might be much cheaper or more efficient to use a complex sampling scheme instead of SRS, such as multi-stage sampling, clustering and/or stratification, for example.
With these approaches, members of the population don’t all have the same probability of being selected into the sample. Complex samples are most often used for surveys, especially large national or multinational ones where simple random sampling is simply not practical. For example, suppose you were conducting a survey in a conflict-affected country and the target population was all adults aged over 18, totalling, say, 20 million individuals. You might be interested in how responses differ by occupation, but some categories (perhaps self-employed) may only represent a small fraction of the population. You might therefore consider stratifying your sampling to ensure that sufficient responses were obtained to make reliable estimates in each occupation group. The country may also be split geographically into, say, 40 states. To travel to and interview people in each of these states would likely be unfeasible, so cluster sampling might be used so that only a subset of the states needed to be accessed.
Remember – complex samples require statistical methods that take the sampling design into account.
Complex samples may also be incorporated into the design of cross-sectional observational studies or even interventional studies (such as clinical trials). The key thing to remember is that when analysing data from a survey using complex sampling, the statistical methods that you use must take the sampling design into account.
So, what are the most common complex sampling approaches and why and when are they used? We focus here on cluster sampling and stratified sampling. We’ll also discuss sampling without replacement which should also be taken into account when analysing your data.
In cluster sampling, the population is split into similar groups of individuals (“clusters”) and then a sample of these clusters is taken (the clusters are the sampling units in this case) so that all of the elements in the selected clusters are included in the sample. Clustering is appropriate when we expect elements in different clusters to be relatively similar (“homogeneous”), i.e., each cluster is representative of the population.
For example, suppose we wanted to gather the opinions of school children in a particular county in England, say Somerset. It would be difficult and expensive to interview all schoolaged children in Somerset, so we take a sample of those children instead. However, taking a simple random sample of pupils in Somerset may mean that we still need to survey pupils in all, or a large proportion, of the schools in the county. It would be much cheaper to only survey the pupils in a subset of schools – so, we might cluster pupils according to their schools and then take a sample of the clusters (surveying all students within those selected clusters) to obtain a clustered sample of school children in the county.
This method is most efficient when most of the variation in the population is within clusters, rather than between them (higher within-cluster correlation increases the variance compared to SRS). Cluster sampling is generally used to reduce costs, by reducing the number of clusters that we sample within whilst maintaining the sample efficiency.
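As a rough illustration (not from the post), the variance inflation caused by clustering is often approximated by Kish’s design effect, deff = 1 + (m − 1)ρ, where m is the average cluster size and ρ is the intra-cluster correlation. A minimal Python sketch with illustrative numbers:

```python
# Approximate design effect for cluster sampling (Kish's formula).
# The cluster size and intra-cluster correlation below are illustrative.

def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Variance inflation of a cluster sample relative to SRS of the same size."""
    return 1 + (avg_cluster_size - 1) * icc

# e.g., surveying 30 pupils per school with an intra-cluster correlation of 0.05:
deff = design_effect(30, 0.05)   # 1 + 29 * 0.05 = 2.45
effective_n = 600 / deff         # a cluster sample of 600 behaves like ~245 under SRS
```

The higher the within-cluster correlation, the more each extra pupil from the same school duplicates information we already have, and the larger the design effect.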
Stratified sampling involves splitting the members of the population into subgroups (“strata”) before sampling, and then applying sampling (usually SRS) separately within each and every group (“stratum”). This is in contrast to cluster sampling where whole clusters are sampled, rather than samples of individuals being taken within each group (i.e., stratum), as illustrated in Figure 1. Stratified sampling can help to ensure that the sample collected is representative of the population, by guaranteeing that sufficient individuals from each subgroup (e.g., gender, or socioeconomic status) will be sampled. This is especially important if some strata only represent a small proportion of the overall population and if the survey responses are expected to differ across the subgroups. For example, responses to a survey might differ by nationality so if we were to miss some of the nationalities in our sample, our results might be biased.
Stratified sampling is appropriate when elements in different strata are relatively dissimilar (“heterogeneous”), whereas cluster sampling is most efficient when the majority of the variation in the population is within clusters.
Returning to the example of surveying school children in Somerset, we might want to estimate the proportion of pupils with different characteristics stratified (i.e., estimated separately) by school type (e.g., academy, faith school, voluntary aided school, etc.). In this case, we could use stratified sampling: schools would be split into strata by school type, and then samples of pupils would be taken within each stratum, ensuring that pupils from each school type were adequately represented in the sample. Contrast this with cluster sampling, where we would cluster pupils and then take a sample of the clusters (surveying all students within those selected clusters).
Each stratum can be sampled in proportion to the relative size of that subgroup in the total population (“proportionate allocation”) to make the overall sample as representative as possible. Or, larger samples can be obtained in strata with greater variability to minimise the sampling variance (“optimum allocation”), improving the efficiency of the sample overall.
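The two allocation rules can be sketched in a few lines of Python. The stratum sizes and standard deviations below are made-up illustrative values, and the function names are our own:

```python
# Proportionate allocation: sample each stratum in proportion to its size.
# Optimum (Neyman) allocation: weight each stratum by size * standard deviation,
# so more variable strata receive larger samples.

def proportionate_allocation(n, sizes):
    total = sum(sizes)
    return [n * size / total for size in sizes]

def neyman_allocation(n, sizes, sds):
    weights = [size * sd for size, sd in zip(sizes, sds)]
    total = sum(weights)
    return [n * w / total for w in weights]

sizes = [5000, 3000, 2000]   # stratum population sizes (illustrative)
sds = [2.0, 5.0, 10.0]       # within-stratum standard deviations (illustrative)

prop = proportionate_allocation(500, sizes)    # [250.0, 150.0, 100.0]
neyman = neyman_allocation(500, sizes, sds)    # smallest stratum gets the most,
                                               # because it is the most variable
```

Note how Neyman allocation gives the smallest stratum the largest sample here, purely because its within-stratum variability is highest.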
It is also possible to combine stratified sampling with cluster sampling. For example, we might stratify schools by type and then take cluster samples of schools within each stratum. This is an example of a one-stage cluster sampling scheme, but further stages of sampling could also be included. In two-stage cluster sampling, for example, after taking the sample of clusters, a sample of elements within each selected cluster is then taken. So, we might only interview a sample of the pupils in each selected school.
After completing your survey, you might find that the sample you have taken is not representative of the population (for example, 40% of the population might be male, whereas only 20% of the sample obtained is male, so males are “under-sampled”). Such differences can be due to non-response or incomplete coverage, which are an inevitable consequence of the fact that we cannot sample everyone in the population nor compel them to respond. If the sample is imbalanced with respect to key factors that are likely to affect the study/survey responses, then these imbalances can lead to biases in the results. In this case, post-stratification can be applied: sampling weights are calculated to adjust the sample data after it has been collected, ensuring that the results are representative of the population. For more information on survey weighting and post-stratification, see our case study on the work we did recently with Sport Wales for their School Sport Survey.
There are two ways to select a sample from a population – with replacement or without replacement. Suppose you were taking a sample of animals from the wild, in order to estimate their average weights, for example. Once one animal had been caught and measured, it would be released back into the wild, so it’s possible that you might catch and measure the same animal more than once – we call this “sampling with replacement”: once an individual is selected to be in the sample, that individual is placed back in the population and can potentially be sampled again. Without replacement means that once an individual is sampled, that individual cannot be sampled again; they are not placed back in the population. This will often occur when a sample is pre-selected from a sampling frame, i.e., a list of all those in the population who can be sampled.
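The distinction is easy to see with Python’s standard library: `random.sample` draws without replacement, while `random.choices` draws with replacement (the population and sample sizes below are arbitrary):

```python
import random

random.seed(1)
population = list(range(100))

# Without replacement: each individual can appear at most once.
without = random.sample(population, 10)

# With replacement: the same individual may be drawn more than once.
with_repl = random.choices(population, k=10)

# By construction, a sample taken without replacement has no duplicates.
assert len(set(without)) == len(without)
```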
Many standard analysis techniques assume that the sample being analysed was taken with replacement or drawn from an infinite population (when the population is infinite, or extremely large, there’s little difference between sampling with and without replacement). In practice, however, most simple random samples are actually taken without replacement from a finite population. In this case, each sampled individual is unique and therefore provides ‘new’ information, whereas sampling with replacement can yield ‘repeated’ information; the variability of our sample is therefore less than these standard techniques expect, and we can apply a finite population correction to account for this greater efficiency in the sampling process. Indeed, when sampling without replacement from a finite population, it may be possible to sample all individuals, in which case we’ll have no uncertainty in our estimates at all. The correction only has a noticeable effect when the sampling fraction, i.e., the proportion of the population sampled, is large – a good rule of thumb is that if your sample makes up more than 5% of the population, you should apply the correction. The finite population correction factor (FPC) is calculated and then multiplied by the standard error of the estimate. We’ve recently released a series of sample size and confidence interval calculators, which include a finite population correction – for more details (including the formula for the FPC) see the calculators on the Resources section of our website.
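As an illustration, one common form of the correction factor is sqrt((N − n)/(N − 1)), where N is the population size and n the sample size; for the exact formula used in our calculators, see the Resources page. A minimal sketch:

```python
import math

def fpc(population_size: int, sample_size: int) -> float:
    """Finite population correction factor, multiplied by the standard error.
    One common form: sqrt((N - n) / (N - 1))."""
    N, n = population_size, sample_size
    return math.sqrt((N - n) / (N - 1))

# Negligible when the sampling fraction is small...
small_fraction = fpc(1_000_000, 100)   # very close to 1: no real correction
# ...but substantial when we sample a large share of the population:
large_fraction = fpc(200, 100)         # ~0.71: standard errors shrink noticeably
```

With a 50% sampling fraction, the corrected standard error is only about 71% of the uncorrected one, reflecting the extra information in a sample that covers half the population.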
The most important thing to understand about complex sampling is that a more sophisticated analysis is needed when analysing the data collected – standard approaches are not necessarily appropriate. We must take account of the sample design in order for our conclusions to be reliable, whether we are estimating a characteristic of the population or testing for effects, for example.
The usual standard errors, which assume a simple random sample with replacement, will be incorrect if a complex sample has been taken. For example, the usual variance estimate from a sample collected using cluster sampling underestimates the true sampling variance, because we expect responses within a cluster to be more similar to each other than responses from randomly selected individuals across the population. When the standard errors are correctly adjusted to account for the complex sampling plan, we find that they are larger than those that would have been obtained assuming a simple random sample of the same size. Without correcting for these underestimates, we increase the risk of falsely declaring effects significant when they do not actually exist (“false positives”).
In the statistical software package SPSS, complex samples analysis plans can be generated which, when used alongside the corresponding Analyze>Complex Samples menu, ensure that the sample design is incorporated into the analysis. In R, the survey package similarly allows you to specify a complex survey design and carry out appropriate analyses taking the design into account. Other packages in R, such as the anesrake package, are also useful for implementing survey weighting, for example.
Complex samples are a useful tool for creating more efficient (e.g., stratified sampling with optimum allocation) or cheaper (e.g., cluster sampling) sampling designs. However, it’s crucial when using a complex sample to account for the sampling design when analysing your data in order to ensure that the results are accurate and reliable. If you’re conducting a survey using complex sampling and need help with the survey design or analysis, contact us to find out how we can help.
The post Why Use a Complex Sample for Your Survey? appeared first on Select Statistical Consultants.
]]>The post Select Welcomes Jo to the Team appeared first on Select Statistical Consultants.
]]>“We’re really pleased to welcome Jo to the consulting team” says Managing Director, Lynsey McColl. “Jo has considerable experience in education and we are excited to work with her in developing this sector further within the company. Much of Jo’s knowledge and skills are also highly transferable, such as her experience in the design and analysis of surveys, and I know she is very much looking forward to working on projects from a wide range of sectors.”
The post Select Welcomes Jo to the Team appeared first on Select Statistical Consultants.
]]>The post Analysing Categorical Data Using Logistic Regression Models appeared first on Select Statistical Consultants.
]]>When analysing a continuous response variable we would normally use a simple linear regression model to explore possible relationships with other explanatory variables. We might, for example, investigate the relationship between a response variable, such as a person’s weight, and other explanatory variables such as their height and gender.
“Logistic regression and multinomial regression models are specifically designed for analysing binary and categorical response variables.”
When the response variable is binary or categorical a standard linear regression model can’t be used, but we can use logistic regression models instead. These alternative regression models are specifically designed for analysing binary (e.g., yes/no) or categorical (e.g., Full-time/Part-time/Retired/Unemployed) response variables. Similar to linear regression models, logistic regression models can accommodate continuous and/or categorical explanatory variables as well as interaction terms to investigate potential combined effects of the explanatory variables (see our recent blog on Key Driver Analysis for more information).
Logistic regression models for binary response variables allow us to estimate the probability of the outcome (e.g., yes vs. no), based on the values of the explanatory variables. We could simply model this probability directly as a function of the explanatory variables but, instead, we use the logit function, logit(p) = ln(p/(1 − p)), where p is the probability of the outcome occurring, in order to determine the corresponding log odds of the outcome, which we then model as a linear combination of the explanatory variables. As with standard linear regression analyses, the model coefficients can then be interpreted in order to understand the direction and strength of the relationships between the explanatory variables and the response variable.
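The logit and its inverse can be written in a few lines of Python (purely illustrative):

```python
import math

def logit(p: float) -> float:
    """Log odds of a probability p: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(x: float) -> float:
    """Back-transform log odds to a probability (the logistic function)."""
    return 1 / (1 + math.exp(-x))

# A probability of 0.5 corresponds to even odds, i.e., log odds of 0:
assert abs(logit(0.5)) < 1e-12
# The two functions are inverses of each other:
assert abs(inv_logit(logit(0.8)) - 0.8) < 1e-12
```

Whatever linear combination of explanatory variables the model produces on the log-odds scale, the inverse logit maps it back to a valid probability between 0 and 1.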
Suppose, for example, that we are interested in how likely a student is to be offered a place on a postgraduate course. We consider the potential effects of the student’s mark on the course’s admissions exam (EXAM), their academic grading from their undergraduate degree (GRAD) and the prestige of their undergraduate institution (RANK, taking values from 1 to 4). We collect data from 400 students applying to graduate school and record whether they were successful or not in being admitted onto the course – so our response variable is binary (admit/not admit). (These data are available from the UCLA Institute for Digital Research and Education using the following link: http://www.ats.ucla.edu/stat/data/binary.csv.)
Running the logistic regression model (for example, using the statistical software package R), we obtain p-values for each explanatory variable and we find that all three explanatory variables are statistically significant (at the 5% significance level). So there’s evidence that each of these has an independent effect on the probability of a student being admitted (rather than just a difference observed due to chance). But what are these effects – are they positive or negative and how strong are they? We need to look at the coefficients estimated by the model in order to understand this.
We can also exponentiate the coefficients and interpret them as odds ratios. This is the most common way of measuring the association between each explanatory variable and the outcome when using logistic regression. For the undergraduate institution rank above, the odds ratio for “if Rank=2” represents the odds of admission for an institution with Rank=2 compared to the odds of admission for an institution with Rank=1. The estimated odds ratio is exp(−0.675) = 0.509, which means that the odds of admission having attended a Rank=2 institution are 0.509 times the odds for having attended a Rank=1 institution (or equivalently 49% [= (0.509 − 1) × 100] lower). In other words, if the odds of a Rank=1 candidate are 1 to 10 (i.e., p=1/11 and 1−p=10/11), the odds of a Rank=2 candidate being admitted are about half as good, or about 1 to 20 (i.e., p=1/21 and 1−p=20/21). So, for every Rank=2 applicant who is admitted, twenty Rank=2 candidates will be rejected, but for every Rank=1 applicant who is admitted, only ten Rank=1 candidates will be rejected.
Odds ratios can also be provided for continuous variables and in this case the odds ratio summarises the change in the odds per unit increase in the explanatory variable. For example, looking at the effect of GRAD above, the odds ratio (exp(0.804) = 2.23) says how the odds change per grade point – i.e., 2.23 times higher per point in this case. It’s important to note that, for continuous explanatory variables, their effect on the probability (as opposed to the odds) of the outcome is not constant across all values of the explanatory variable. Due to the logit transformation, the effect will be smaller for very low or very high values of the explanatory variable, and much larger for those in the middle.
We can also calculate a confidence interval to capture our uncertainty in the odds ratio estimate and we’ve put together an online odds ratio confidence interval calculator that you can use to do exactly this (you just need to enter your data from a contingency table). For the GRAD variable above, the 95% confidence interval for the odds ratio (estimated to be 2.23) is 1.17 to 4.32, so we’re 95% confident that this range covers the true odds ratio (if the study was repeated and the range calculated each time, we would expect the true value to lie within these ranges on 95% of occasions).
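The interval above can be reproduced approximately from the coefficient and its standard error, using exp(b ± 1.96 × SE). The coefficient (0.804) is quoted above; the standard error (~0.333) is an assumed value chosen so that the resulting interval roughly matches the one quoted:

```python
import math

# Odds ratio and approximate 95% CI for the GRAD coefficient from the post.
# coef is from the post; se (~0.333) is an assumed, back-calculated value.
coef, se = 0.804, 0.333
z = 1.96                              # normal quantile for a 95% interval

odds_ratio = math.exp(coef)           # ~2.23
ci_lower = math.exp(coef - z * se)    # ~1.17
ci_upper = math.exp(coef + z * se)    # ~4.3
```

The interval is built symmetrically on the log-odds scale and then exponentiated, which is why it is asymmetric around the odds ratio itself.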
A key advantage of this modelling approach is that we are able to analyse the data all in one go, rather than splitting the data into subgroups and performing multiple tests (using a CHAID analysis, for example) which, with a reduced sample size, will have less statistical power. See our recent blog for further information on the importance and effect of sample size. By including all of the potential explanatory variables in one model, we can see which make up the most informative combination of predictors for the outcome.
All of the above (binary logistic regression modelling) can be extended to categorical outcomes (e.g., blood type: A, B, AB or O) – using multinomial logistic regression. The principles are very similar, but with the key difference being that one category of the response variable must be chosen as the reference category. Separate odds ratios are determined for all explanatory variables for each category of the response variable, except for the reference category. The odds ratios then represent the change in odds of the outcome being a particular category versus the reference category, for differing factor levels of the corresponding explanatory variable.
There are also extensions to the logistic regression model when the categorical outcome has a natural ordering (we call this ‘ordinal’ data as opposed to ‘nominal’ data). For example, the outcome might be the response to a survey where the answer could be “poor”, “average”, “good”, “very good”, and “excellent”. In this case we use ordered logistic regression modelling and we can explore whether the odds of being in a ‘higher’ category is associated with each of our explanatory variables.
These logistic regression models can also be used to make predictions of the probability of an outcome for particular cases. We can input the values of the explanatory variables (into the formula generated by the model) for a range of possible scenarios and obtain the predicted odds or probability of the outcome in each case.
The model can be implemented within a tool, for example in Microsoft Excel or as a web app (see our recent post on Interacting with Your Data). This allows a range of predictions to be made and visualised easily. Prediction intervals can also be provided with each projection to quantify the associated uncertainty in the estimate – giving the range for which we are confident that the true probability will lie and allowing the user to consider best and worstcase scenarios.
Logistic regression models are a great tool for analysing binary and categorical data, allowing you to perform a contextual analysis to understand the relationships between the variables, test for differences, estimate effects, make predictions, and plan for future scenarios. For a real-world example of the value of logistic regression modelling, see our case study on developing a medical decision tool using binary logistic regression to help inform the assessment of whether to extubate intensive care patients.
Logistic regression models are also great tools for classification problems – take a look at our blog on Classifying Binary Outcomes to find out more.
The post Analysing Categorical Data Using Logistic Regression Models appeared first on Select Statistical Consultants.
]]>The post Camille is Awarded Chartered Statistician Status appeared first on Select Statistical Consultants.
]]>We’re pleased to announce that the prestigious Chartered Statistician designation has been granted to Camille by the Royal Statistical Society, recognising her extensive training and experience as a professional statistician.
The Chartered Statistician (CStat) status provides formal recognition of an individual’s statistical qualifications, professional training and experience and is the highest professional award for a statistician. To qualify, the Royal Statistical Society (RSS) requires an approved degree together with postgraduate training and experience as a professional statistician for at least 5 years, or alternatively the ability to demonstrate breadth and depth of statistical knowledge. Camille’s award, gained through the competency-based route, recognises her 10 years’ professional experience in a statistical role at the Pirbright Institute for Animal Health, University of Bristol and now here at Select Statistical Services. She was also able to demonstrate a strong and consistent commitment to continuing professional development (CPD), another key criterion considered by the RSS in making the award.
Chartered Statisticians are required to abide by the Society’s code of conduct, and to adhere to their comprehensive CPD policy. Each CStat is required to regularly revalidate their qualification to ensure that they continue to adhere to the RSS’s strict guidelines which are designed to ensure that Chartered Statisticians provide the highest level of professional service to their clients.
Guidance on how to apply for the CStat award is available on the RSS web site, but the Select team are also very happy to offer advice and guidance on how to develop and maintain a suitable CPD programme and to apply for the CStat award.
The post Camille is Awarded Chartered Statistician Status appeared first on Select Statistical Consultants.
The post CHAID (Chi-square Automatic Interaction Detector) appeared first on Select Statistical Consultants.
]]>In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “CHAID”, but take a look at our previous articles on Key Driver Analysis, Maximum Difference Scaling and Customer Segmentation, and look out for new articles on TURF and Brand Mapping, coming soon. If there are other terms that you’d like us to blog on, we’d love to hear from you so please do get in touch.
CHAID (Chi-square Automatic Interaction Detector) analysis is an algorithm used for discovering relationships between a categorical response variable and other categorical predictor variables. It is useful when looking for patterns in datasets with lots of categorical variables and is a convenient way of summarising the data as the relationships can be easily visualised.
In practice, CHAID is often used in direct marketing to understand how different groups of customers might respond to a campaign based on their characteristics. So suppose, for example, that we run a marketing campaign and are interested in understanding what customer characteristics (e.g., gender, socioeconomic status, geographic location, etc.) are associated with the response rate achieved. We build a CHAID “tree” showing the effects of different customer characteristics on the likelihood of response.
At the first level (the “trunk”) we have all customers and the overall response rate for the marketing campaign was, say, 24.3%. As we progress down the tree to the first “branch”, we identify the factor that has the greatest impact on the likelihood of response, and our overall population is broken down into groups (“leaves”) based upon their differing values of this characteristic – Urban/Rural. We might find that rural customers have a response rate of only 18.6%, whereas urban customers have a response rate of 28.5%. We check to see if this difference is statistically significant and, if it is, we retain these as new leaves. At the next branch, for each of the new groups (Urban/Rural), we then consider whether they can be further split into subgroups so that there is a significant difference in the dependent variable (the response rate). Urban homeowners may have a much higher response rate (36.1%) compared with urban nonhomeowners (22.7%), and rural fulltime workers might have a higher response rate (24.0%) than rural parttime workers (17.8%) or the rural retired/unemployed (5.3%), for example. At each step every predictor variable is considered to see if splitting the sample based on this factor leads to a statistically significant relationship with the response variable. Where there might be more than two groupings for a predictor, merging of the categories is also considered to find the best discrimination. If a statistically significant difference is observed then the most significant factor is used to make a split, which becomes the next branch in the tree.
The process repeats to find the predictor variable on each leaf that is most significantly related to the response, branch by branch, until no further factors are found to have a statistically significant effect on the response (e.g., likelihood of responding to the marketing campaign). The results can be visualised with a socalled tree diagram – see below, for example. In this case, we can see that urban homeowners (36.1%) have the highest response rates, followed by rural fulltime workers (24.0%) and that these are therefore the best groups of customers to target. On the other hand, the lowest response rates were observed for the rural, retired/unemployed, aged over 65 years (1.4%).
As indicated in the name, CHAID uses Pearson’s Chi-square tests of independence, which test for an association between two categorical variables. A statistically significant result indicates that the two variables are not independent, i.e., there is a relationship between them. (See our recent blog post “Depression in Men ‘Regularly Ignored’…” for an example looking at the relationship between perceived mental health disorders and gender.)
Chi-square tests are applied at each of the stages in building the CHAID tree, as described above, to ensure that each branch is associated with a statistically significant predictor of the response variable (e.g., response rate). Bonferroni corrections, or similar adjustments, are used to account for the multiple testing that takes place. When testing with a 5% significance level (i.e., considering a p-value of less than 0.05 to be statistically significant) we have a one in 20 chance of finding a false-positive result; concluding that there is a difference when in fact none exists (see this light-hearted cartoon for further discussion of multiple testing). The more tests that we do, the greater the chance we will find one of these false-positive results (inflating the so-called Type I error), so adjustments to the p-values are used to counter this, so that stronger evidence is required to indicate a significant result.
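A minimal sketch of the test behind each split – Pearson’s Chi-square statistic for a 2×2 table (response vs. a binary characteristic), plus a Bonferroni-adjusted significance threshold – is below. The counts and the number of candidate predictors are made up for illustration:

```python
# Pearson chi-square statistic for a 2x2 contingency table [[a, b], [c, d]],
# computed from scratch: sum of (observed - expected)^2 / expected.

def chi_square_2x2(a, b, c, d):
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Responders / non-responders by urban vs rural area (made-up counts):
stat = chi_square_2x2(285, 715, 186, 814)
# stat well above the 5% critical value (3.84 on 1 df) -> significant split.

# Bonferroni: with k candidate predictors tested, compare each p-value
# to 0.05 / k rather than 0.05.
k = 5
adjusted_alpha = 0.05 / k   # 0.01
```

In CHAID the most significant such predictor becomes the next branch; the Bonferroni division makes it harder for any single predictor to clear the bar when many are tried.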
CHAID can also be extended to apply to the case where we have a continuous response variable, for example, sales recorded in £’s. However, in this case F-tests rather than Chi-square tests are used. Continuous predictor variables can also be incorporated by determining cut-offs to create ordinal groups of variables, based, for example, on particular percentiles of the variable. So, we might band incomes into four groups, based on its quartiles, such as ≤ £15,000; > £15,000 & ≤ £20,000; > £20,000 & ≤ £33,000; and > £33,000.
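The income banding described above amounts to assigning each value to an ordinal group using fixed cut-offs (the thresholds echo the example; the function name is our own):

```python
# Band a continuous predictor into ordinal groups using quartile cut-offs.
# The thresholds mirror the income example in the text.

def quartile_band(income, q1=15_000, q2=20_000, q3=33_000):
    """Assign an income (in £) to one of four ordinal bands."""
    if income <= q1:
        return "<=15k"
    elif income <= q2:
        return "15k-20k"
    elif income <= q3:
        return "20k-33k"
    return ">33k"

bands = [quartile_band(x) for x in (12_000, 18_000, 25_000, 40_000)]
# ['<=15k', '15k-20k', '20k-33k', '>33k']
```

In a real analysis the cut-offs would be derived from the observed quartiles of the data rather than fixed in advance.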
Generally a large sample size is needed to perform a CHAID analysis. At each branch, as we split the total population, we reduce the number of observations available and with a small total sample size the individual groups can quickly become too small for reliable analysis.
When we are interested in identifying groups of customers for targeted marketing where we do not have a response variable on which to base the splits in our sample, we can use other market segmentation techniques such as cluster analysis (see our recent blog on Customer segmentation for further information).
CHAID is sometimes used as an exploratory method for predictive modelling. However, a more formal multiple logistic or multinomial regression model could be applied instead. These regression models are specifically designed for analysing binary (e.g., yes/no) or categorical response variables and can accommodate continuous and/or categorical predictor variables. Interaction terms could be included in the model to investigate the associations between predictors that are tested for in the CHAID algorithm, whilst allowing a wider range of possible model specifications which may well fit the data better. Another advantage of this modelling approach is that we are able to analyse the data all in one go rather than splitting the data into subgroups and performing multiple tests. In particular, where a continuous response variable is of interest or there are a number of continuous predictors to consider, we would recommend performing a multiple regression analysis instead. See our recent blog post on Analysing Categorical Data Using Logistic Regression Models for further details of these more formal modelling approaches.
The post CHAID (Chi-square Automatic Interaction Detector) appeared first on Select Statistical Consultants.
]]>The post Select Team Cooks their Way to Fine Dining appeared first on Select Statistical Consultants.
]]>On a cold and crisp November day, the whole of the Select team (including future colleague Jo Morrison) met up at Exeter Cookery School for a full day of developing new culinary skills and techniques.
The day included learning how to make pasta from scratch, using an insane number of free-range eggs. The smooth pasta dough was then turned into scrumptious spinach, ricotta and whole egg-yolk ravioli. Complete focus was required to close up each delicate raviolo, removing all air bubbles without breaking the yolk, which was sitting comfortably on top of the spinach and ricotta filling.
The ravioli were then cooked to be the centrepiece of a well-deserved lunch, comprising a bed of rocket salad with mustard dressing, some wild mushrooms sautéed in butter, grilled bacon and parmesan shavings. This was enjoyed by all present, following the opening of the first Christmas crackers of the season.
The afternoon was spent working with chocolate to make a chocolate delice, a yummy dessert with a crunchy nutty base, covered with a rich chocolate ganache and finished with a chocolate mirror glaze.
A lot of skills, bowls, pans and pots were involved, as well as some blow torches!
After a full-on day learning new skills away from desks and computers, we all went home with a bag of home-cooked goodies and some sore feet.
The post Select Team Cooks their Way to Fine Dining appeared first on Select Statistical Consultants.
]]>The post Customer Segmentation appeared first on Select Statistical Consultants.
]]>In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “Customer Segmentation”, but have a look at our other posts on Key Driver Analysis, Maximum Difference Scaling and CHAID, and watch out for new articles on TURF and Brand Mapping, amongst others, coming soon. If there are other terms that you’d like us to blog on, we’d love to hear from you so please do get in touch.
Customer segmentation (sometimes also referred to as market segmentation) breaks down large groups of current and/or potential customers in a given market into smaller groups that are “similar” in terms of their preferences or characteristics. This allows you to adopt a different marketing mix (e.g., combination of price, product, promotion, and place) for each segment of the market. The same methods can also be used to select and target the best prospects, identifying those customers with the highest likely lifetime value or conversion rate, for example.
“Target your best prospects.”
Segmentation can be based upon a variety of factors including demographics, geography and spending behaviours as well as perceived needs and values. Traditionally, segmentation has focussed on identifying customer groups based on core demographics and values. However, valuebased segmentation is now increasingly common. In this case we also group customers using variables that capture the revenue they generate, e.g., their lifetime value, and the costs of establishing and maintaining a relationship with them.
The segmentation process often begins by taking the most obvious market segments, such as male, female, teen and adult (so-called “a priori” segments) and breaking them up into smaller segments that are made up of actual or potential customers with specific shared characteristics. These characteristics are carefully selected as being those likely to affect customer behaviour and the segmentation process determines the relative importance of each in order to ensure that the final segmentation is of practical commercial value. See our recent blog post, “How Do Supermarkets Use Your Data?“, for a great example of the power of customer segmentation in creating accurate customer profiles to improve the targeting of products and services.
Customer segmentation can be used in both business to business (B2B) and business to consumer (B2C) sales and marketing. In the case of B2B, the “customers” that we are segmenting are businesses rather than individuals and so the characteristics on which we segment might differ, but the underlying statistical techniques used are just the same.
Data from an account or customer relationship management database are often used in customer segmentation as they provide a great resource of customer attributes. Additional data from other sources, including external databases, can also be used to supplement your own and allow you to consider potential as well as current customers.
A number of different statistical techniques can be used in performing customer segmentation. We discuss two of the most common methods (clustering and predictive modelling) below, but other classification techniques, such as random forests and mixture models (or latent class analysis) can also be used.
Clustering is a so-called “unsupervised” analysis that is designed to categorise observations (in this case customers) into a number of different groups (“clusters”), with the members of each group being relatively similar based on their values for a range of different factors. In each case, some form of distance measure is used to determine how close together or far apart different customers are based on their attributes.
There are many flavours of clustering methods depending upon how you measure the distance between points within and between clusters and also on how you explore the different groupings. For example, we can use Ward’s distance, which seeks to minimise the total variance between points within each cluster. Then, in order to construct the best clustering, we might use an iterative procedure starting with every point being assigned to its own cluster and then merge clusters successively so as to minimise the increase in Ward’s distance. The process continues until there’s just one cluster containing all the observations. A so-called dendrogram (see Figure 1, for example) can be produced that shows which clusters are merged at each step and the associated variance total, allowing us to select the most appropriate number of clusters.
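To make the merging procedure concrete, here is a minimal sketch in Python (the language and the tiny two-dimensional customer scores are invented for illustration only). It repeatedly merges the pair of clusters whose union least increases the total within-cluster sum of squares, which is the essence of Ward’s criterion, and records the merge costs that a dendrogram would display:

```python
from itertools import combinations

def ssq(cluster):
    """Within-cluster sum of squared distances to the cluster mean."""
    n = len(cluster)
    mean = [sum(p[d] for p in cluster) / n for d in range(len(cluster[0]))]
    return sum(sum((p[d] - mean[d]) ** 2 for d in range(len(p))) for p in cluster)

def ward_agglomerate(points):
    """Merge clusters pairwise, always choosing the merge that least
    increases the total within-cluster sum of squares (Ward's criterion).
    Returns the merge history as (cluster_a, cluster_b, cost) tuples,
    which is the information a dendrogram displays."""
    clusters = [[p] for p in points]
    history = []
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            merged = clusters[i] + clusters[j]
            # Increase in total within-cluster variance caused by this merge
            cost = ssq(merged) - ssq(clusters[i]) - ssq(clusters[j])
            if best is None or cost < best[0]:
                best = (cost, i, j)
        cost, i, j = best
        history.append((clusters[i], clusters[j], cost))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

# Two well-separated groups of customers (e.g., scores on two survey scales)
pts = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9)]
history = ward_agglomerate(pts)
# Cheap within-group merges come first; the final merge joining the two
# groups is by far the most expensive, suggesting two clusters here
print([round(cost, 3) for _, _, cost in history])
```

Reading the merge costs from smallest to largest is exactly how you would read a dendrogram: a large jump in cost suggests you have just merged two genuinely distinct segments.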
There is a subjective element to using these clustering techniques. Following the analysis, we would need to review the data and identify what the members of each cluster have in common in a meaningful and practical sense. Similarly, we can check that members of distinct groups differ in some obvious and relevant manner. This can be done by summarising the characteristics of each cluster and potentially visualising these summaries as a means of comparing them, e.g., using circle plots where the relative size of each circle corresponds with the relative magnitude of a given characteristic for each cluster compared with the overall average. This process can also help to determine how many segments are needed.
K-means clustering is probably the most popular clustering (or partitioning) method for customer segmentation and requires the analyst to prespecify the number of clusters required. The method works by assigning each observation to a cluster and then calculating the distance between each point in that cluster and the mean value of all the observations in that cluster. The points are assigned to the clusters so as to minimise the total (squared) distance between each observation and the corresponding mean. Figure 2 shows an example where a group of customers have been segmented based on their sensitivity to price and brand loyalty.
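As an illustration of the idea (not of any particular software implementation), here is a minimal version of the K-means assign-and-update loop in Python, applied to a handful of invented price-sensitivity and brand-loyalty scores:

```python
def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    centroids = points[:k]  # naive initialisation: first k points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # squared Euclidean distance from this point to each centroid
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            groups[d.index(min(d))].append(p)
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

# Invented (price sensitivity, brand loyalty) scores for six customers
pts = [(1, 9), (2, 8), (1, 8),   # e.g., "brand advocates"
       (9, 2), (8, 1), (9, 1)]   # e.g., "loyal to low cost"
centroids, groups = kmeans(pts, 2)
print(sorted(len(g) for g in groups))  # each segment contains 3 customers
```

In practice you would use a library implementation with better initialisation and multiple restarts, but the two alternating steps above are all that K-means is doing.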
There’s often a great deal of subjectivity associated with cluster analysis with the number of clusters being determined based upon the usability and usefulness of the corresponding groupings. The final clusters are often given names that summarise their key traits, such as “young, upwardly mobile” and “double income, no kids”. In the example in Figure 2, we have identified the three distinct clusters as “value conscious”, “brand advocates” and “loyal to low cost” customers.
In this example it’s very easy to identify the three groups “by eye” in Figure 2. However, if we wanted to include more customer attributes, we would have more than two dimensions and it would be much harder to identify the groups. We couldn’t split the three groups above by looking at either axis (i.e., attribute) on its own, in one dimension, as the groups overlap in terms of both their price sensitivity scores and brand loyalty scores. It’s only when we look at both scores together in two dimensions that the three groups can be easily identified. This idea of looking at multiple dimensions in combination is particularly relevant to higherdimensional data where simply looking at a 2 or 3D plot won’t necessarily help. This is where we need statistical methods such as cluster analysis to be able to effectively look at all dimensions at once.
Dimensionality reduction techniques, such as so-called principal component analysis (PCA) or factor analysis, can also help in visualising and understanding higher-dimensional data – we’ll blog about these techniques another time.
Predictive models are a useful alternative to clustering when we have a specific definition of a “good” customer, such as their lifetime value, on which to base the groupings. In this case, we can create a model (using linear or generalised linear regression, for example) to investigate the relationships between potential drivers and customer value. Those variables that are found to be statistically significant predictors of customer value can then be used to define our customer segments. See our recent blog post on “key driver analysis” for more information on this kind of predictive modelling.
A similar approach known as CHAID (Chi-squared Automatic Interaction Detector) analysis uses an algorithm for discovering relationships between a categorical response variable and other categorical predictor variables, and we’re planning to blog on this soon.
The post Customer Segmentation appeared first on Select Statistical Consultants.
In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “Maximum Difference Scaling”, but check out our other articles on Key Driver Analysis, Customer Segmentation and CHAID, and look out for new articles on TURF and Brand Mapping, amongst others, coming soon. If there are other terms that you’d like us to blog on, we’d love to hear from you so please do get in touch.
MaxDiff (Maximum Difference or Best-Worst Scaling) is a survey method in market research that was originally developed in the 1990s and is used to try to gain an understanding of consumers’ likes and dislikes. Respondents are usually asked to select the most and least important attributes from a subset of product features. The question is then repeated a number of times with the list of attributes varied so that the respondent selects the best and worst features from a number of subsets of product characteristics. The goal of the research is to rank the attributes in terms of their importance to customers on a common scale, so that comparisons and trade-offs between them can be made. See below for an example of a MaxDiff question looking at the attributes of a household appliance.
The method is easy for respondents to complete and forces them to make a discriminating choice amongst attributes. There is no opportunity for bias to occur due to differences in the use of rating scales (which is commonly seen across different countries and cultures) such as those that can occur with a five-point, non-comparative scale from “Not important” to “Extremely important”, for example. Furthermore, only two selections need to be made from each list, making it arguably more manageable/practical than the ranking of each item. When there are four attributes in the list, such as in the example above, we learn about five of the six pairwise comparisons between the items but from just two customer choices; it is only the comparison between the two attributes which are not selected that remains unknown. For example, from the response above we know that the attribute chosen as best is preferred to each of the other three, and that each attribute is preferred to the one chosen as worst; only the preference between the two unselected attributes is left unknown.
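To see why five of the six comparisons are learnt from a single best/worst choice, here is a small Python sketch; the four attribute names are invented placeholders, not the ones in the post’s figure:

```python
from itertools import combinations

def known_preferences(items, best, worst):
    """Pairwise preferences implied by one best/worst (MaxDiff) choice:
    the best item beats everything, everything beats the worst item;
    the pair of unselected items remains uncompared."""
    known = set()
    for a, b in combinations(items, 2):
        if best in (a, b) or worst in (a, b):
            winner = best if best in (a, b) else (a if b == worst else b)
            loser = b if winner == a else a
            known.add((winner, loser))
    return known

# Hypothetical four-attribute list for a household appliance question
items = ["price", "design", "capacity", "speed"]
prefs = known_preferences(items, best="design", worst="price")
print(len(prefs))  # 5 of the 6 pairwise comparisons are learnt
```

The only pair absent from the result is the one between the two unselected attributes, exactly as described above.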
Firstly, experimental design is required in MaxDiff to construct the lists of product characteristics to be chosen from, to determine the number and combinations of attributes per question and to determine the number of questions that each respondent must complete. These are chosen so as to get the best balance of attributes within each question, maximising the information obtained whilst minimising the burden to the respondents. Ideally, combinations are chosen so that each item is shown an equal number of times and pairs of items appear together an equal number of times. Most often, so-called balanced incomplete block (BIB), or partially balanced incomplete block (PBIB) designs are used. Take a look at our case study “Judging at the Big Bang Fair” for another example of the application of experimental design.
A number of different approaches are used by market researchers to analyse MaxDiff survey results.
A simple, so-called “Counts analysis” approach involves calculating the difference between the numbers of times each item is chosen as best and worst (termed the “count”) and then ranking the attributes based on these differences. This can be done at both the individual respondent level and also aggregated over all respondents. However, this method fails to take the experimental design of the survey into account and, for example, doesn’t use the information obtained when two items appear together in a list to distinguish between those with a tied count. Furthermore, if the experimental design was unbalanced, and so some items appeared more often than others, counts analysis will give biased estimates as items that appear more frequently will have had more opportunities to be chosen as best or worst.
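A counts analysis is straightforward to compute. The sketch below (in Python, with invented responses and attribute names) scores each attribute as the number of times it was chosen as best minus the number of times it was chosen as worst:

```python
from collections import Counter

def counts_analysis(responses):
    """Counts analysis: score = (times chosen best) - (times chosen worst),
    aggregated over all respondents and questions, then ranked."""
    best = Counter(r["best"] for r in responses)
    worst = Counter(r["worst"] for r in responses)
    items = set(best) | set(worst)
    scores = {i: best[i] - worst[i] for i in items}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical data: each dict is one question answered by one respondent
responses = [
    {"best": "design", "worst": "price"},
    {"best": "design", "worst": "speed"},
    {"best": "capacity", "worst": "price"},
    {"best": "speed", "worst": "price"},
]
ranking = counts_analysis(responses)
print(ranking)  # design ranks top (count 2), price bottom (count -3)
```

Note that this simple tally ignores which items appeared together in each list, which is precisely the weakness described above.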
Alternatively, random utility (or discrete choice) models, such as logistic regression models, are commonly applied to MaxDiff data. Logistic regression models are designed to predict the probability of a binary dependent variable (e.g., a yes/no response) via a linear combination of independent explanatory variables. The MaxDiff experiment, though it involves discrete choices, clearly does not fit directly into this design. A trick is therefore used to apply the methodology in this case.
The “trick” involves separating out the responses for each attribute in each list as a binary outcome (chosen or not chosen) for the dependent variable and then using dummy variables for the independent variables to indicate which attribute the response corresponds to and whether it was selected as best (+1) or worst (−1). The coefficients for each attribute from the fitted model are then directly compared to give a rank ordering for the attributes in terms of customer preference. They are often also transformed and interpreted as estimates of the relative probabilities of each item being chosen as the best. So, for the example above, we might find that the design/aesthetic of the kettle has the highest “share of preference” with approximately 40% chance of being selected as most important compared to the other attributes in the list.
There are a number of issues with this analysis approach. Most importantly, the assumption of responses being independent, which the logistic regression model relies upon (and in fact almost all statistical techniques do), is clearly violated as each choice will be affected by the attributes that were available to select in the current list, and best and worst choices will clearly be correlated. Therefore, the resulting parameter estimates will be biased and cannot be relied upon.
A more robust analysis that can be applied to MaxDiff involves applying a rank-ordered logistic regression or “exploded logit” model. This allows us to model the partial rankings obtained from the responses to the MaxDiff questions (see the bullet point list above, for example), whilst accounting for the ties. This approach does not violate the independence assumption like the tricked logistic regression model above and, as before, allows you to estimate the rank ordering of the attributes in terms of customer preference or to estimate probabilities of attributes being selected as the best.
Despite this approach being more statistically sound, there are still questions over the interpretability of the results. In particular, we are only assessing the relative importance/desirability of the attributes and so it is crucial to carefully consider the product features to be included upfront. The results also don’t indicate if any of the features are likely to actually impact customer behaviour, and furthermore customers’ responses (self-stated importance) won’t necessarily reflect what they actually want.
As the MaxDiff best and worst selections only depend upon the rank ordering of the attributes and their analysis simply provides estimates of the rank ordering of attributes, it may be simpler to directly ask consumers to rank the attributes in the first place. Although this approach is slightly more intensive for respondents, it is simply a case of repeatedly asking for the most important attribute from a decreasingly long list of items. This also simplifies the data collection process as we no longer need to generate experimental designs.
The rank-ordered logistic regression models described above are explicitly designed to analyse these sorts of data and allow us to estimate and test for differences among items in respondents’ preferences for them. It’s also simple to incorporate predictor variables accounting for respondents’ or items’ characteristics, or both, that allow us to investigate what characteristics affect the rankings.
The post Maximum Difference Scaling (MaxDiff) appeared first on Select Statistical Consultants.
The day was split into two with a morning training session introducing the attendees to R and R Shiny. R Shiny is a fantastic package from RStudio that makes it incredibly easy to build interactive web applications and is increasingly being used by businesses to interact with their data (you can see some example apps here). The afternoon was dedicated to presentations from statisticians in a variety of sectors and industries including Government, Academia, Medical and Finance. We were also lucky enough to hear plenary talks from both Professor Jane Hutton and Professor David Hand.
Lynsey’s presentation focussed on her own career path from undergraduate student to MD of Select with details of the sorts of interesting projects we do here at Select (such as understanding customer retention and modelling problem debt) and skills that you might need if you were interested in becoming a consultant. Lynsey said of the event “I had a really interesting afternoon at the YSS Showcase. It was great to be able to impart some advice on how aspiring statistical consultants can start on their career path and I also really enjoyed hearing the other presentations. It’s amazing the variety of careers that you can have as a statistician and it really highlights the career benefits of this fascinating discipline.”
The post Select at the YSS Showcase appeared first on Select Statistical Consultants.
The Advisory Committee meets once a year to advise the School of Maths and Statistics on the design and conduct of its MSc course, with particular reference to the current and developing needs of commercial organisations such as Select. During the meeting, the committee review student feedback forms on the various modules that make up the course, read through some of the most recent dissertations and have lunch with current students to hear firsthand what they thought of the course. “We had some really interesting and useful discussions with both the students and course leaders throughout the day” said Lynsey after the day. “Overall the feedback from the students was exceptionally positive, which is great to hear as I know from firsthand experience how useful the course is to employers”. One area that was particularly interesting to think about was the expanding field of data science and what sort of experience and qualifications future Data Scientists might need.
This year Lynsey was also invited to speak to the School of Maths Early Career group on her career to date and, more generally, on the role of a statistical consultant. The Early Career group is made up of both the School’s PhD students and their Research Associates and aims to invite speakers from a variety of different industries. During her talk, Lynsey gave some highlights of the sorts of interesting client projects that Select undertake and discussed what sort of skills are required to be a consultant (both statistical and nonstatistical). The session was well received with lots of lively discussion and questions at the end.
The post Working with the Sheffield Master’s Programme appeared first on Select Statistical Consultants.
In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “key driver analysis”, but take a look at our other posts on MaxDiff, Customer Segmentation and CHAID, and look out for new articles on TURF and Brand Mapping, amongst others, coming soon.
It’s important to identify and understand the drivers of key business outcomes, such as customer satisfaction or loyalty, in order to improve processes and maximise performance and profitability. You might want to understand, for example, which aspects of your service influence how likely a customer will be to recommend you to others. A so-called key driver analysis can be used to address this sort of question.
A key driver analysis investigates the relationships between potential drivers and customer behaviour, such as the likelihood of a positive recommendation, overall satisfaction, or propensity to buy a product. This is often done using data collected from a questionnaire, which might ask for a customer’s demographics, their level of satisfaction with various aspects of your company’s services (e.g., whether it was value for money, or whether the customer services department was helpful) as well as their likelihood of recommending your company to others (see below).
Correlations between the scores for the customer behaviour of interest (likelihood of recommendation) versus those for the potential drivers may then be calculated to see whether there is evidence of a relationship between them. If there is a positive correlation between satisfaction with the customer services department and the likelihood of recommending the company to others, for example, then satisfaction with customer services is said to drive recommendations in a positive direction. Drivers can also be associated with customer behaviour changing in a negative direction.
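As a simple illustration, the following Python snippet computes the Pearson correlation between two invented sets of survey scores (the variable names and data are hypothetical):

```python
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation between two score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical 1-10 scores: satisfaction with customer services
# versus likelihood of recommending the company
service = [3, 5, 6, 7, 8, 9]
recommend = [2, 4, 5, 6, 9, 9]
r = pearson(service, recommend)
print(round(r, 2))  # strongly positive: service satisfaction and
                    # recommendation move together
```

A correlation close to +1, as here, is what is meant by a driver acting “in a positive direction”; a value close to −1 would indicate a driver acting in a negative direction.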
A key driver analysis is often performed using multiple linear regression to model the primary outcome as a linear combination of the potential drivers. Those drivers that are found to have a statistically significant effect are considered to be key drivers of the outcome and their model coefficients can be interpreted to understand the direction and strength of the relationships between the drivers and the outcome variable.
A key driver analysis can help you to understand what drives customer behaviour.
By including all of the potential drivers in one model, we can see which make up the most informative combination of drivers for the outcome. The model may also be used to make “What If?” predictions of the outcome for customers with specific values of each of the drivers (these may include the gender and agegroup of a customer, for example).
Where there are linear relationships (correlations) between two or more of the potential drivers, this can lead to difficulty in the interpretation of the model coefficients – so-called multicollinearity. This can occur where two of the potential drivers are capturing similar information, for example, a questionnaire might ask whether the staff were friendly, and also whether they were helpful, which we would expect to be highly related.
There are various statistical approaches that can be used to deal with multicollinearity, including the use of principal component analysis to reduce the number of potential drivers to a set of linearly uncorrelated variables. These analyses that take account of multicollinearity are often called ‘true driver analyses’. It is important to note, however, that it is only possible to establish an association between each driver and the outcome with a correlation or regression analysis; it is not possible to establish causation.
With a ‘key driver analysis’, statistical modelling can be used to quantify the relationships between multiple variables. This can help you to understand what drives customer behaviour and ultimately how to improve your performance.
The post Key Driver Analysis appeared first on Select Statistical Consultants.
“Camille’s appointment is part of an ambitious plan to expand the consulting team over the next couple of years.” says Managing Director, Lynsey McColl, “With her experience and expertise across such a wide range of areas, Camille will help us continue to deliver the highest quality work for our clients as we pursue our plans to grow the business. We’re really pleased to welcome Camille to the team and are looking forward to working with her.”
The post Select Welcomes Camille to the Team appeared first on Select Statistical Consultants.
“Market Basket Analysis allows retailers to identify relationships between the products that people buy.”
Retailers can use the insights gained from MBA in a number of ways, including cross-selling, promotions and targeted marketing campaigns.
Given how popular and valuable MBA is, we thought we’d produce the following stepbystep guide describing how it works and how you could go about undertaking your own Market Basket Analysis.
To carry out an MBA you’ll first need a data set of transactions. Each transaction represents a group of items or products that have been bought together and is often referred to as an “itemset”. For example, one itemset might be: {pencil, paper, staples, rubber}, in which case all of these items have been bought in a single transaction.
In an MBA, the transactions are analysed to identify rules of association. For example, one rule could be: {pencil, paper} => {rubber}. This means that if a customer has a transaction that contains a pencil and paper, then they are likely to be interested in also buying a rubber.
Before acting on a rule, a retailer needs to know whether there is sufficient evidence to suggest that it will result in a beneficial outcome. We therefore measure the strength of a rule by calculating the following three metrics (note other metrics are available, but these are the three most commonly used):
Support: the percentage of transactions that contain all of the items in an itemset (e.g., pencil, paper and rubber). The higher the support the more frequently the itemset occurs. Rules with a high support are preferred since they are likely to be applicable to a large number of future transactions.
Confidence: the probability that a transaction that contains the items on the left hand side of the rule (in our example, pencil and paper) also contains the item on the right hand side (a rubber). The higher the confidence, the greater the likelihood that the item on the right hand side will be purchased or, in other words, the greater the return rate you can expect for a given rule.
Lift: the probability of all of the items in a rule occurring together (otherwise known as the support) divided by the product of the probabilities of the items on the left and right hand side occurring as if there was no association between them. For example, if pencil, paper and rubber occurred together in 2.5% of all transactions, pencil and paper in 10% of transactions and rubber in 8% of transactions, then the lift would be: 0.025/(0.1*0.08) = 3.125. A lift of more than 1 suggests that the presence of pencil and paper increases the probability that a rubber will also occur in the transaction. Overall, lift summarises the strength of association between the products on the left and right hand side of the rule; the larger the lift the greater the link between the two products.
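These three metrics can be computed directly from a set of transactions. The following Python sketch (the post’s own code uses R, but the arithmetic is language-agnostic) reproduces the worked figures above using an invented collection of 1,000 baskets:

```python
def rule_metrics(transactions, lhs, rhs):
    """Support, confidence and lift for the rule lhs => rhs."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(lhs <= t for t in transactions)    # baskets containing lhs
    n_rhs = sum(rhs <= t for t in transactions)    # baskets containing rhs
    n_both = sum((lhs | rhs) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_lhs
    lift = support / ((n_lhs / n) * (n_rhs / n))
    return support, confidence, lift

# Invented data matching the worked figures in the text: out of 1,000
# baskets, {pencil, paper, rubber} occurs in 25 (2.5%), {pencil, paper}
# in 100 (10%) and {rubber} in 80 (8%)
baskets = (
    [frozenset({"pencil", "paper", "rubber"})] * 25
    + [frozenset({"pencil", "paper"})] * 75   # 75 + 25 = 100 with pencil & paper
    + [frozenset({"rubber"})] * 55            # 55 + 25 = 80 with a rubber
    + [frozenset({"stapler"})] * 845
)
support, confidence, lift = rule_metrics(baskets, {"pencil", "paper"}, {"rubber"})
print(support, confidence, round(lift, 3))  # 0.025 0.25 3.125
```

The lift of 3.125 matches the hand calculation in the text: 0.025/(0.1 × 0.08).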
To perform a Market Basket Analysis and identify potential rules, a data mining algorithm called the ‘Apriori algorithm’ is commonly used, which works in two steps: first, it identifies all of the itemsets whose support exceeds a user-specified minimum support threshold (the “frequent” itemsets); second, it generates rules from these frequent itemsets, retaining only those whose confidence exceeds a minimum confidence threshold.
The thresholds at which to set the support and confidence are user-specified and are likely to vary between transaction data sets. R does have default values, but we recommend that you experiment with these to see how they affect the number of rules returned (more on this below). Finally, although the Apriori algorithm does not use lift to establish rules, you’ll see in the following that we use lift when exploring the rules that the algorithm returns.
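As a rough illustration of how the algorithm works, here is a brute-force sketch in Python on invented baskets; a real implementation, such as the one in the arules package used below, prunes candidate itemsets far more cleverly and is what you should use in practice:

```python
from itertools import combinations

def apriori(transactions, min_support, min_confidence):
    """Step 1: find itemsets whose support meets min_support.
    Step 2: from each frequent itemset, keep rules lhs => rhs whose
    confidence meets min_confidence. (Brute-force sketch only: it
    enumerates every candidate itemset rather than pruning level
    by level as the real algorithm does.)"""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    support = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            s = sum(set(cand) <= t for t in transactions) / n
            if s >= min_support:
                support[frozenset(cand)] = s
    rules = []
    for itemset, s in support.items():
        if len(itemset) < 2:
            continue
        for k in range(1, len(itemset)):
            for lhs in combinations(itemset, k):
                lhs = frozenset(lhs)
                conf = s / support[lhs]  # subsets of frequent sets are frequent
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), s, conf))
    return rules

baskets = [{"pencil", "paper", "rubber"}, {"pencil", "paper"},
           {"pencil", "paper", "rubber"}, {"rubber", "stapler"}]
rules = apriori(baskets, min_support=0.5, min_confidence=0.8)
for lhs, rhs, s, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(s, 2), round(conf, 2))
```

Raising either threshold shrinks the set of rules returned, which is exactly the behaviour we exploit below when tuning the thresholds on the grocery data.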
To demonstrate how to carry out an MBA we’ve chosen to use R and, in particular, the arules package. For those that are interested we’ve included the R code that we used at the end of this blog.
Here, we follow the same example used in the arulesViz Vignette and use a data set of grocery sales that contains 9,835 individual transactions with 169 items. The first thing we do is have a look at the items in the transactions and, in particular, plot the relative frequency of the 25 most frequent items in Figure 1. This is equivalent to the support of these items where each itemset contains only the single item. This bar plot illustrates the groceries that are frequently bought at this store, and it is notable that the support of even the most frequent items is relatively low (for example, the most frequent item occurs in only around 2.5% of transactions). We use these insights to inform the minimum threshold when running the Apriori algorithm; for example, we know that in order for the algorithm to return a reasonable number of rules we’ll need to set the support threshold at well below 0.025.
By setting a support threshold of 0.001 and confidence of 0.5, we can run the Apriori algorithm and obtain a set of 5,668 results. These threshold values are chosen so that the number of rules returned is high, but this number would reduce if we increased either threshold. We would recommend experimenting with these thresholds to obtain the most appropriate values. Whilst there are too many rules to be able to look at them all individually, we can look at the five rules with the largest lift:
| Rule | Support | Confidence | Lift |
|------|---------|------------|------|
| {instant food products, soda} => {hamburger meat} | 0.001 | 0.632 | 19.00 |
| {soda, popcorn} => {salty snacks} | 0.001 | 0.632 | 16.70 |
| {flour, baking powder} => {sugar} | 0.001 | 0.556 | 16.41 |
| {ham, processed cheese} => {white bread} | 0.002 | 0.633 | 15.05 |
| {whole milk, instant food products} => {hamburger meat} | 0.002 | 0.500 | 15.04 |
These rules seem to make intuitive sense. For example, the first rule might represent the sort of items purchased for a BBQ, the second for a movie night and the third for baking.
Rather than using the thresholds to reduce the rules down to a smaller set, it is usual for a larger set of rules to be returned so that there is a greater chance of generating relevant rules. Alternatively, we can use visualisation techniques to inspect the set of rules returned and identify those that are likely to be useful.
Using the arulesViz package, we plot the rules by confidence, support and lift in Figure 2. This plot illustrates the relationship between the different metrics. It has been shown that the optimal rules are those that lie on what’s known as the “support-confidence boundary”. Essentially, these are the rules that lie on the right hand border of the plot where either support, confidence or both are maximised. The plot function in the arulesViz package has a useful interactive function that allows you to select individual rules (by clicking on the associated data point), which means the rules on the border can be easily identified.
There are lots of other plots available to visualise the rules, but one other figure that we would recommend exploring is the graph-based visualisation (see Figure 3) of the top ten rules in terms of lift (you can include more than ten, but these types of graphs can easily get cluttered). In this graph the items grouped around a circle represent an itemset and the arrows indicate the relationship in rules. For example, one rule is that the purchase of sugar is associated with purchases of flour and baking powder. The size of the circle represents the level of confidence associated with the rule and the colour the level of lift (the larger the circle and the darker the grey the better).
Market Basket Analysis is a useful tool for retailers who want to better understand the relationships between the products that people buy. There are many tools that can be applied when carrying out MBA and the trickiest aspects to the analysis are setting the confidence and support thresholds in the Apriori algorithm and identifying which rules are worth pursuing. Typically the latter is done by measuring the rules in terms of metrics that summarise how interesting they are, using visualisation techniques and also more formal multivariate statistics. Ultimately the key to MBA is to extract value from your transaction data by building up an understanding of the needs of your consumers. This type of information is invaluable if you are interested in marketing activities such as crossselling or targeted campaigns.
If you’d like to find out more about how to analyse your transaction data, please contact us and we’d be happy to help.
library("arules")
library("arulesViz")
#Load data set:
data("Groceries")
summary(Groceries)
#Look at data:
inspect(Groceries[1])
LIST(Groceries)[1]
#Calculate rules using apriori algorithm and specifying support and confidence thresholds:
rules = apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
#Inspect the top 5 rules in terms of lift:
inspect(head(sort(rules, by ="lift"),5))
#Plot a frequency plot:
itemFrequencyPlot(Groceries, topN = 25)
#Scatter plot of rules:
library("RColorBrewer")
plot(rules,control=list(col=brewer.pal(11,"Spectral")),main="")
#Rules with high lift typically have low support.
#The most interesting rules reside on the support/confidence border which can be clearly seen in this plot.
#Plot graph-based visualisation:
subrules2 <- head(sort(rules, by="lift"), 10)
plot(subrules2, method="graph",control=list(type="items",main=""))
The post Market Basket Analysis: Understanding Customer Behaviour appeared first on Select Statistical Consultants.
The post Fraud: A Crime for Middle England appeared first on Select Statistical Consultants.
Cases of fraud have long been included in official police recorded crime data, but these figures provide at best a limited picture of the extent of the problem. Many cases of fraud go unreported to the police – for example, when a bank reimburses money stolen via credit card fraud, the consumer rarely goes to the trouble of going to the police. Plus, the reliability of police recorded crime numbers across the board has been called into question in a 2014 report from the UK Statistics Authority, citing regional variation in crime recording practices and insufficient data quality assurance procedures.
To attempt to fill the gap between the police recorded fraud figures and reality, the ONS recently introduced new questions about fraud and cyber crime into the Crime Survey for England and Wales (CSEW). The CSEW is a rolling, victim-based survey that asks samples of people about their experience of crime over the last year. The fraud and cyber crime questions have only appeared in the last six months, and as such these results are still considered by the ONS to be “experimental” statistics. However, a picture is already emerging suggesting that the demographics of the people affected by fraud are unlike those of most other crimes.
For example, while the proportion of adults experiencing violent crime and personal theft declines fairly steadily as they get older, the prevalence of both fraud and computer misuse crimes peaks in the 45–54 age group. Another contrast is seen in household income, with those who earn more experiencing slightly less violence but higher rates of theft and fraud than low earners.
Where you live also has an effect on the prevalence of different types of crime. Crime has traditionally been seen primarily as a problem for city dwellers, and indeed more adults experience violence and theft in urban areas than in rural locations. But for fraud the situation is reversed, with adults in rural areas being around 10% more likely to have been defrauded in the last year than urban dwellers. Computer misuse crimes are equally prevalent in urban and rural areas.
All this suggests that fraud is a crime affecting middle-aged, middle-class people, and this is borne out when we look at the data grouped by the ONS’s population characteristics classification. Those most likely to experience fraud are “rural residents”, “urbanites” and “suburbanites” – groups that tend to be UK-born, own their own home, and have professional jobs or be retired. These are also the groups with the lowest prevalence of violence and theft.
While these new statistics from the CSEW point towards some interesting differences between fraud and other types of crime, we conclude with a word of caution. The numbers presented here are estimates extrapolated from a survey of a sample of the population. The CSEW sample is quite large (around 35,000 people), but as mentioned above the fraud and cyber crime questions have only recently been added, and the sample size for them is much smaller (around 9,000 people). The smaller the sample, the lower the precision of the estimate, and the problem is compounded when we split the sample up to look at subgroups of the population. For example, there is an odd spike in computer misuse crime in the “cosmopolitans” population class, a group characterised by young single adults, often students, with high ethnic integration. While cyber crime could genuinely be a particular problem within this group, a closer look reveals that this estimate is based on just 326 survey responses, suggesting that it needs to be treated with a degree of caution. Ideally, all of these estimates would come with an indication of their uncertainty, such as a confidence interval, but unfortunately the ONS does not provide this, nor does it provide enough information about its methodology for us to calculate it ourselves.
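To illustrate the scale of that uncertainty, here is a minimal sketch in base R of a 95% confidence interval for a prevalence estimated from 326 responses. The 10% prevalence figure is assumed purely for illustration (it is not taken from the ONS tables), and a simple-random-sample interval like this would understate the true uncertainty, because the CSEW uses a complex clustered and stratified design.

```r
# Illustrative only: the 10% prevalence is an assumed figure, not an
# ONS estimate, and this treats the subsample as a simple random sample.
n <- 326                      # survey responses in the subgroup
p_hat <- 0.10                 # assumed prevalence estimate
x <- round(n * p_hat)         # implied number of victims in the sample
ci <- prop.test(x, n)$conf.int
round(ci, 3)                  # roughly 0.07 to 0.14
```

Even under these generous assumptions the interval spans around seven percentage points, which illustrates why estimates built on a few hundred responses, such as the “cosmopolitans” spike, warrant caution.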
References
The Crime in England and Wales statistical bulletin for the year ending March 2016
Experimental tables providing estimates on fraud and computer misuse
An overview of fraud statistics for the year ending March 2016