
Sarah was pleased to be invited to speak to the students about her day-to-day life as a statistical consultant, focussing not only on the statistical challenges she faces in her role but also on the wider consultancy skills that are a crucial element of the job. Other speakers came from cyber security, actuarial science and finance, and gave an insight into the differing statistical problems being tackled in their respective industries.

Sarah and the other speakers also had the opportunity to attend poster presentation sessions by the students on their dissertation projects. “Meeting the students and hearing about their work was really interesting”, said Sarah. “It was great to see the diversity of both the projects they are working on and the statistical approaches being applied. It’s clear that the course is equipping the students with the necessary skills and enthusiasm for tackling challenging statistical problems, which they can take forward in their future careers, be that in industry or academia.”

The post Seeing Statistics in Practice appeared first on Select Statistical Consultants.

In the newly termed “post-truth” society in which we live, numbers and scientific evidence can often be (mis-)used to provide a certificate of credibility. Professor Spiegelhalter pointed out that the intentional falsification of numbers and scientific evidence is thankfully rare, and often the mis-use of statistics has more to do with attempts to make a story more appealing by using “high impact visual data representation”, simplifying the presentation of the results by removing any mention of uncertainty, or omitting any discussion of the limitations of the data, experiment or analyses.

While clarity and insight are key for the presentation of statistical results, this should not be at the expense of quality and transparency if we as statisticians are to build the general public’s confidence in numbers. To help with this, the UK Statistics Authority has released a new Code of Practice, centred around three pillars: Trustworthiness, Quality and Value. Whilst organisations producing official statistics are required to abide by this code, organisations producing data and statistics more generally are also encouraged to consider committing to the three pillars.

Here at Select, our consultants, as Chartered Statisticians and professional members of the Royal Statistical Society (RSS), also abide by the RSS Code of Conduct, which is designed to ensure that professional statisticians provide the highest level of statistical service and advice. We do not compromise on Trustworthiness, Quality or Value to make a finding more insightful or to create a better story.

The post Trust in Numbers: a Pillar of Good Statistical Practice appeared first on Select Statistical Consultants.

In a previous blog we looked at the GCSE results of pupils in different regions of England and examined the current and historical differences in attainment between the North and the South.

Updated analysis of the 2017 results by School Dash shows that the pattern of attainment (pupils in the South tending to perform better than pupils in the North) is still present in the latest GCSE results published by the DfE.

Mapping the percentage of pupils achieving 5 or more A* to C grades at GCSE (including grade 4 or higher in English and Maths) in 2017 for each Local Authority (LA; see Figure 1) shows the same pattern of attainment as the 2015 GCSE results in our previous blog: higher-performing local authorities tend to be those located in the South (although, as discussed in our previous blog, there are clearly regional differences).

Fitting a statistical model to this data showed that, on average, 63% of pupils in the South gained 5+ A*-C grades compared to 59% of pupils in the North; this result was statistically significant.

Comparing GCSE results by region is very simplistic. It is likely that factors other than region affect the educational attainment of pupils. The DfE and other government departments publish a wealth of data about schools and regions, so we combined the GCSE results at LA-level with data about their pupils’ characteristics, teacher vacancies, and deprivation measures (averaged across each LA).

To include these variables in the analysis, we fitted a statistical model to the GCSE results at LA-level and added them as explanatory variables (we also included North/South as an explanatory variable). The factors that were significantly associated with GCSE performance are shown in the chart below. In this model there ceased to be any real difference between pupils from the North and the South, with any differences being accounted for by differences in background factors.
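
As a rough illustration of the kind of model described here, the sketch below fits a linear regression to LA-level data in R. All column names and values are hypothetical placeholders, not the actual DfE data.

```r
# A minimal sketch of the LA-level model described above (simulated, placeholder data)
set.seed(1)
la_data <- data.frame(
  pct_5plus_AC = rnorm(150, 61, 5),                      # % achieving 5+ A*-C (incl. English & Maths)
  north_south  = sample(c("North", "South"), 150, TRUE),
  avg_idaci    = runif(150, 0.1, 0.4),                   # average income deprivation (IDACI)
  pct_fsm      = runif(150, 5, 30),                      # % eligible for free school meals
  pct_eal      = runif(150, 0, 40),                      # % with English as an additional language
  pct_sen      = runif(150, 5, 20)                       # % with SEN support
)

model <- lm(pct_5plus_AC ~ north_south + avg_idaci + pct_fsm + pct_eal + pct_sen,
            data = la_data)
summary(model)   # with the background factors included, is north_south still significant?
```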

For each of these factors, Figure 2 below shows how the percentage of pupils achieving 5 or more good grades at GCSE deviates from the national average (61%). Also shown, for comparison, is the difference for pupils in the North compared to the South. Not only has the gap in performance between pupils in the North and South decreased (now less than 1 percentage point compared to the previous 4 percentage point difference), but this difference is also no longer statistically significant.

Figure 2 shows us that the variables that are statistically significant include those that represent:

- deprivation measures, namely the average IDACI (income deprivation affecting children index) and the proportion of the LA that is in the 10% most deprived areas in the country;
- pupils’ characteristics, namely the percentage of pupils eligible for free school meals (FSM), with English as an additional language (EAL) and with special educational needs (SEN) support.

Clearly deprivation measures are associated with GCSE performance though their interpretation is complicated. GCSE performance tends to be lower in areas that are generally more deprived (measured by average IDACI), but this is offset somewhat in LAs with larger areas in the top 10% most deprived areas of the country. LAs with higher proportions of pupils eligible for FSM (also a measure of deprivation) tend to have lower GCSE performance.

LAs with higher proportions of pupils with EAL tend to have higher GCSE performance, while LAs with higher proportions of pupils with SEN support tend to have lower levels of GCSE performance. Once these factors have been taken into account there is no longer any real difference between the GCSE performance of LAs in the North and South of England.

This can be further illustrated by looking at the model residuals; these are the differences in GCSE performance that remain after taking account of differences due to levels of deprivation and the proportions of pupils with FSM, EAL and SEN. Figure 3 shows the model residuals for LAs in England. The figure illustrates that the regional differences dissipate: LAs where performance is above average (indicated in shades of yellow to red) and LAs where performance is below average (indicated in shades of blue) are distributed across the country. There is no geographic pattern, confirming that the differences in performance are not driven by a North/South divide.

The model results highlight that deprivation is clearly an important factor associated with school performance. Mapping deprivation (in this case the average IDACI measure) for LAs illustrates the similarity to the pattern of GCSE performance.

Areas with higher GCSE performance in Figure 1 (shaded red) tend also to be the areas with lower deprivation in Figure 4 (paler shades), and areas with lower GCSE performance (e.g. cities such as Hull, Leicester, Derby, Nottingham, Stoke) have relatively high levels of deprivation (shaded darker purple). While there are areas in the North that are less deprived, e.g. North Yorkshire and the East Riding of Yorkshire, there are clusters of more deprived areas in other parts of Yorkshire, around the river Mersey, around Birmingham and in the North East.

It is noticeable that the ‘city effect’ seen in the previous blog is still observed here: the higher levels of deprivation and lower GCSE performance observed in a number of other cities are not reflected in LAs in central London. London has similar levels of deprivation to some of the areas in the North (the North East, the North West, Yorkshire and the West Midlands) and yet, after taking deprivation into account, the percentage of pupils gaining 5+ A*-Cs in LAs in London is between 5 and 7 percentage points higher than in other regions of the country.

The link between education outcomes and disadvantage is not a new discovery and has been explored and discussed by others previously. Why London seems to do relatively well is not established and there are many other examples of schools and pupils that overcome their disadvantages and do well, demonstrating that the link between education outcomes and deprivation can be broken. A report by the Northern Powerhouse (Educating the North: driving ambition across the Powerhouse) published in February this year highlights “the devastating consequences of disadvantage in the North” and calls for “the government, local authorities, businesses and others to invest in our children and young people, to ensure they have the future they deserve.”

The factors associated with GCSE performance are multifaceted and complex. Even in this simple example we have shown that to really begin to understand why there are variations in GCSE performance it is important to use a statistical model. Once other variables are included in the model the North/South divide disappears. However, this model is limited and could be improved. More of the differences in regional performance (more of the variation) could be explained by adding more variables; we have not included any information about pupils’ home background, for example. If data were available, as well as background factors, further refinement could be added by drilling down to the school level, or lower. While statistical models can usually be improved they are often a balance between detail and parsimony.

The post Is there a North/South divide in GCSE performance? appeared first on Select Statistical Consultants.

Most statistical analyses assume that the data collected are from a *simple random sample (SRS)* of the population of interest. Say, for example, that you were conducting a survey of employees in your workplace (this is the “population”); a simple random sample would be one where each of your colleagues in the office (the “sampling units”) was equally likely to be sampled. However, it’s not always possible or practical to take a simple random sample. Simple random sampling requires access to the whole population of interest (a “complete sampling frame” listing the sampling units), which may not be feasible for large populations. If sampling units are widely spread out geographically, for example, it might also be prohibitively expensive to access and sample across the whole area. Or, if some members of the population (e.g., of a particular demographic background) are relatively low in number, a simple random sample might not obtain enough (or any) of these individuals to reliably measure their responses. So, even if a complete sampling frame is available, it might be much cheaper or more efficient to use a *complex sampling scheme* instead of SRS, such as multi-stage sampling, clustering and/or stratification.

With these approaches, members of the population don’t all have the same probability of being selected into the sample. Complex samples are most often used for surveys, especially large national or multinational ones where simple random sampling is simply not practical. For example, suppose you were conducting a survey in a conflict-affected country and the target population was all adults aged over 18, totalling, say, 20 million individuals. You might be interested in how responses differ by occupation, but some categories (perhaps self-employed) may only represent a small fraction of the population. You might therefore consider stratifying your sampling to ensure that sufficient responses were obtained to make reliable estimates in each occupation group. The country may also be split geographically into, say, 40 states. To travel to and interview people in each of these states would likely be unfeasible, so cluster sampling might be used so that only a subset of the states needed to be accessed.

Remember – complex samples require statistical methods that take the sampling design into account.

Complex samples may also be incorporated into the design of cross-sectional observational studies or even interventional studies (such as clinical trials). The key thing to remember is that when analysing data from a survey using complex sampling, the statistical methods that you use must take the sampling design into account.

So, what are the most common complex sampling approaches and why and when are they used? We focus here on cluster sampling and stratified sampling. We’ll also discuss sampling without replacement which should also be taken into account when analysing your data.

In cluster sampling, the population is split into similar groups of individuals (“clusters”) and then a sample of these clusters is taken (the clusters are the sampling units in this case) so that all of the elements in the selected clusters are included in the sample. Clustering is appropriate when we expect elements in different clusters to be relatively similar (“homogeneous”), i.e., each cluster is representative of the population.

For example, suppose we wanted to gather the opinions of school children in a particular county in England, say Somerset. It would be difficult and expensive to interview all school-aged children in Somerset, so we take a sample of those children instead. However, taking a simple random sample of pupils in Somerset may mean that we still need to survey pupils in all, or a large proportion, of the schools in the county. It would be much cheaper to only survey the pupils in a subset of schools – so, we might cluster pupils according to their schools and then take a sample of the clusters (surveying all students within those selected clusters) to obtain a clustered sample of school children in the county.

This method is most efficient when most of the variation in the population is within clusters, rather than between them (higher within-cluster correlation increases the variance compared to SRS). Cluster sampling is generally used to reduce costs, by reducing the number of clusters that we sample within whilst maintaining the sample efficiency.
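
As a minimal sketch of one-stage cluster sampling, the R code below draws a sample of whole schools from a simulated pupil list and keeps every pupil in the selected schools; the data and numbers are invented purely for illustration.

```r
# One-stage cluster sampling on a hypothetical pupil-level data frame
set.seed(42)
pupils <- data.frame(
  pupil_id = 1:5000,
  school   = sample(paste0("school_", 1:50), 5000, replace = TRUE)
)

# Sample 10 whole schools (the clusters), then keep every pupil in those schools
sampled_schools <- sample(unique(pupils$school), size = 10)
cluster_sample  <- pupils[pupils$school %in% sampled_schools, ]

nrow(cluster_sample)   # achieved sample size depends on the sizes of the sampled schools
```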

Stratified sampling involves splitting the members of the population into subgroups (“strata”) before sampling, and then applying sampling (usually SRS) separately within each and every group (“stratum”). This is in contrast to cluster sampling where whole clusters are sampled, rather than samples of individuals being taken within each group (i.e., stratum), as illustrated in Figure 1. Stratified sampling can help to ensure that the sample collected is representative of the population, by guaranteeing that sufficient individuals from each sub-group (e.g., gender, or socioeconomic status) will be sampled. This is especially important if some strata only represent a small proportion of the overall population and if the survey responses are expected to differ across the subgroups. For example, responses to a survey might differ by nationality so if we were to miss some of the nationalities in our sample, our results might be biased.

Stratified sampling is appropriate when elements in different clusters are relatively dissimilar (“heterogeneous”), whereas cluster sampling is most efficient when the majority of the variation in the population is within clusters.

Returning to the example of surveying school children in Somerset, we might want to estimate the proportion of pupils with different characteristics stratified (i.e., estimated separately) by school type (e.g., academy, faith school, voluntary aided school, etc.). In this case, we could use stratified sampling to ensure that pupils from different school types are adequately represented in our sample. Schools would be split into strata (e.g., by school type: academy, faith school, voluntary aided school, etc.), and then samples of pupils would be taken within each stratum, to ensure that pupils from each school type were adequately represented in the sample. Contrast this with cluster sampling where we would cluster pupils and then take a sample of the clusters (surveying all students within those selected clusters).

Each stratum can be sampled in proportion to the relative size of that sub-group in the total population (“proportionate allocation”) to make the overall sample as representative as possible. Or, larger samples can be obtained in strata with greater variability to minimise the sampling variance (“optimum allocation”), improving the efficiency of the sample overall.
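
The snippet below sketches how proportionate and optimum (Neyman) allocation could be calculated in R for three hypothetical strata; the stratum sizes and standard deviations are invented for illustration.

```r
# Proportionate vs. optimum (Neyman) allocation across strata (hypothetical figures)
N_h <- c(academy = 12000, faith = 4000, voluntary_aided = 2000)  # stratum population sizes
S_h <- c(academy = 8,     faith = 12,   voluntary_aided = 15)    # stratum standard deviations
n   <- 600                                                       # total sample size

prop_alloc   <- round(n * N_h / sum(N_h))              # in proportion to stratum size
neyman_alloc <- round(n * N_h * S_h / sum(N_h * S_h))  # larger samples where variability is higher

prop_alloc
neyman_alloc
```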

It is also possible to combine stratified sampling with clustered sampling. For example, we might stratify schools by type and then take cluster samples of schools within each stratum. This is an example of a one-stage cluster sampling scheme, but further stages of sampling could also be included. In two-stage cluster sampling, for example, after taking the sample of clusters, a sample of elements within each selected cluster is then taken. So, we might only interview a sample of the pupils in each selected school.

After completing your survey, you might find that the sample you have taken is not representative of the population (for example, 40% of the population might be male, whereas in the sample obtained only 20% might be males and so males are “under-sampled”). In this case *post-stratification* can be applied. Such differences can be due to non-response or incomplete coverage, which are an inevitable consequence of the fact that we cannot sample everyone in the population nor compel them to respond. If the sample is imbalanced with respect to key factors that are likely to affect the study/survey responses, then this imbalance can lead to biases in the results. Sampling weights can be calculated to post-stratify the sample (to adjust the sample data after it has been collected) to ensure that the results are representative of the population. For more information on survey weighting and post-stratification, see our case study on the work we did recently with Sport Wales for their School Sport Survey.
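
As an illustration of post-stratification, the sketch below uses the R survey package to re-weight a simulated sample in which males are under-represented, so that estimates reflect a known population gender split; the data and population totals are hypothetical.

```r
# Post-stratification with the survey package (simulated sample, hypothetical population totals)
library(survey)

set.seed(7)
samp <- data.frame(
  gender   = c(rep("Male", 40), rep("Female", 160)),   # males under-sampled (20% vs. 40% in the population)
  response = rbinom(200, 1, 0.5)
)

design <- svydesign(ids = ~1, weights = rep(1, nrow(samp)), data = samp)

# Known population composition: 40% male, 60% female out of, say, 10,000 people
pop <- data.frame(gender = c("Male", "Female"), Freq = c(4000, 6000))

ps_design <- postStratify(design, strata = ~gender, population = pop)
svymean(~response, ps_design)   # estimate re-weighted to the population gender split
```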

Suppose you were taking a sample of animals from the wild, in order to estimate their average weights, for example. Once one animal had been caught and measured, it would then be released back into the wild. It’s possible, in this case, that you might catch and measure the same animal more than once – we call this “sampling with replacement”. *With replacement* means that once an individual is selected to be in the sample, that individual is placed back in the population to potentially be sampled again. There are two ways to select a sample from the population – with replacement, as in this example, or without replacement. *Without replacement* means that once an individual is sampled, that individual cannot be sampled again; they are not placed back in the population. This will often occur when a sample is preselected from a sampling frame, i.e., a list of all those in the population who can be sampled.

Many standard analysis techniques assume that the sample being analysed was obtained from a sample taken with replacement or from an infinite population (when the population is infinite, or extremely large, there’s little difference between sampling with and without replacement). However, in practice, most simple random samples are actually taken without replacement from a finite population. In this case, the variability of our sample is actually less than expected, and we can therefore apply a *finite population correction* to account for this greater efficiency in the sampling process. Each sampled individual is always unique and therefore provides ‘new’ information when sampling without replacement, whereas it’s possible when sampling with replacement to have ‘repeated’ information. When sampling without replacement from a finite population, it may be possible to sample all individuals, in which case we’ll have no uncertainty in our estimates. The correction only has a noticeable effect when the sampling fraction, i.e., the proportion of the population sampled, is large. A good rule of thumb is that if your sample makes up more than 5% of the population, you should apply the correction. A finite population correction factor (FPC) is calculated, which is then multiplied by the standard error of the estimate. We’ve recently released a series of sample size and confidence interval calculators, which include a finite population correction – for more details (including the formula for the FPC) see the calculators on the Resources section of our website.
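
The following sketch shows one common form of the finite population correction applied to the standard error of a sample mean, using simulated data and made-up population and sample sizes.

```r
# Finite population correction applied to the standard error of a mean (simulated data)
set.seed(11)
N <- 2000                             # population size (hypothetical)
n <- 400                              # sample size: 20% of the population, so the FPC matters
x <- rnorm(n, mean = 50, sd = 10)     # sampled measurements

se_srs <- sd(x) / sqrt(n)             # usual standard error, assuming an (effectively) infinite population
fpc    <- sqrt((N - n) / (N - 1))     # a common form of the finite population correction factor
se_fpc <- se_srs * fpc                # corrected (smaller) standard error

c(uncorrected = se_srs, corrected = se_fpc)
```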

The most important thing to understand about complex sampling is that a more sophisticated analysis is needed when analysing the data collected – standard approaches are not necessarily appropriate. We must take account of the sample design in order for our conclusions to be reliable, whether we are estimating a characteristic of the population or testing for effects, for example.

The usual standard errors, assuming a simple random sample with replacement, will be incorrect if a complex sample has been taken. For example, a sample that is collected using cluster sampling underestimates the true population variance. Adjusting the standard errors to account for the complex sampling plan, we find that they are larger, if correctly estimated, than those that would have been obtained assuming a simple random sample of the same size. This is because we might expect responses within a cluster to be more similar to each other than those for randomly selected individuals across the population. Without correcting for these under-estimates, we increase the risk of falsely determining significant effects when they do not actually exist (“false positives”).

In the statistical software package SPSS, complex samples analysis plans can be generated which, when used alongside the corresponding *Analyze>Complex Samples* menu, ensure that the sample design is incorporated into the analysis. In R, the survey package similarly allows you to specify a complex survey design and carry out appropriate analyses taking the design into account. Other packages in R, such as the anesrake package, are also useful for implementing survey weighting, for example.
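
For example, a stratified, clustered design might be declared to the R survey package along the following lines; the data frame and column names here are simulated placeholders rather than output from a real survey.

```r
# Declaring a stratified, clustered design to the survey package (simulated, placeholder data)
library(survey)

set.seed(1)
survey_data <- data.frame(
  stratum    = rep(c("academy", "faith"), each = 100),  # schools stratified by type
  cluster_id = rep(1:20, each = 10),                    # pupils clustered within 20 schools
  wt         = runif(200, 5, 15),                       # sampling weights
  score      = rnorm(200, 60, 10)
)

des <- svydesign(ids = ~cluster_id, strata = ~stratum, weights = ~wt,
                 data = survey_data, nest = TRUE)

svymean(~score, des)   # design-based mean and standard error, typically larger than the naive SRS standard error
```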

Complex samples are a useful tool for creating more efficient (e.g., stratified sampling with optimum allocation) or cheaper (e.g., cluster sampling) sampling designs. However, it’s crucial when using a complex sample to account for the sampling design when analysing your data in order to ensure that the results are accurate and reliable. If you’re conducting a survey using complex sampling and need help with the survey design or analysis, contact us to find out how we can help.

- The Myth of Random Sampling
- Eurostat survey sampling reference guidelines (ISSN 1977-0375, Eurostat Methodologies and Working Papers)
- Analysis of Complex Sample Survey Data in SAS

The post Why Use a Complex Sample for Your Survey? appeared first on Select Statistical Consultants.

“We’re really pleased to welcome Jo to the consulting team” says Managing Director, Lynsey McColl. “Jo has considerable experience in education and we are excited to work with her in developing this sector further within the company. Much of Jo’s knowledge and skills are also highly transferable, such as her experience in the design and analysis of surveys, and I know she is very much looking forward to working on projects from a wide range of sectors.”

The post Select Welcomes Jo to the Team appeared first on Select Statistical Consultants.

When analysing a continuous response variable we would normally use a simple linear regression model to explore possible relationships with other explanatory variables. We might, for example, investigate the relationship between a response variable, such as a person’s weight, and other explanatory variables such as their height and gender.

“Logistic regression and multinomial regression models are specifically designed for analysing binary and categorical response variables.”

When the response variable is binary or categorical a standard linear regression model can’t be used, but we can use logistic regression models instead. These alternative regression models are specifically designed for analysing binary (e.g., yes/no) or categorical (e.g., Full-time/Part-time/Retired/Unemployed) response variables. Similar to linear regression models, logistic regression models can accommodate continuous and/or categorical explanatory variables as well as interaction terms to investigate potential combined effects of the explanatory variables (see our recent blog on Key Driver Analysis for more information).

Logistic regression models for binary response variables allow us to estimate the probability of the outcome (e.g., yes vs. no), based on the values of the explanatory variables. We could simply model this probability directly as a function of the explanatory variables but, instead, we use the logit function, logit(*p*) = ln(*p*/(1-*p*)), where *p* is the probability of the outcome occurring, in order to determine the corresponding log odds of the outcome which we then model as a linear combination of the explanatory variables. As with standard linear regression analyses, the model coefficients can then be interpreted in order to understand the direction and strength of the relationships between the explanatory variables and the response variable.

Suppose, for example, that we are interested in how likely a student is to be offered a place on a post-graduate course. We consider the potential effects of the student’s mark on the course’s admissions exam (EXAM), their academic grading from their undergraduate degree (GRAD) and the prestige of their undergraduate institution (RANK, taking values from 1 to 4). We collect data from 400 students applying to graduate school and record whether they were successful or not in being admitted onto the course – so our response variable is binary (admit/not admit). (These data are available from the UCLA Institute for Digital Research and Education using the following link: http://www.ats.ucla.edu/stat/data/binary.csv.)
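
Fitting the logistic regression model for this example can be sketched in R as follows; it assumes the UCLA file has its usual columns (admit, gre, gpa and rank, corresponding to the admission outcome, EXAM, GRAD and RANK in the text).

```r
# Minimal sketch of the binary logistic regression described in this example
# (assumes the file has columns admit, gre, gpa and rank)
admissions <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
admissions$rank <- factor(admissions$rank)   # treat institution rank as categorical

fit <- glm(admit ~ gre + gpa + rank, data = admissions, family = binomial)
summary(fit)   # coefficients on the log-odds scale, with p-values for each explanatory variable
```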

Running the logistic regression model (for example, using the statistical software package R), we obtain p-values for each explanatory variable and we find that all three explanatory variables are statistically significant (at the 5% significance level). So there’s evidence that each of these has an independent effect on the probability of a student being admitted (rather than just a difference observed due to chance). But what are these effects – are they positive or negative and how strong are they? We need to look at the coefficients estimated by the model in order to understand this and find, for example, that:

- For every one unit change in EXAM the log odds of admission (vs. non-admission) increases by 0.00226.
- Attending an undergraduate institution with a rank of 2, compared to an institution with a rank of 1, changes the log odds of admission by -0.675.

We can also exponentiate the coefficients and interpret them as *odds ratios*. This is the most common way of measuring the association between each explanatory variable and the outcome when using logistic regression. For the undergraduate institution rank above, the odds ratio for “if Rank=2” represents the odds of admission for an institution with Rank=2 compared to the odds of admission for an institution with Rank=1. The estimated odds ratio is exp(-0.675) = 0.509, which means that the odds of admission having attended a Rank=2 institution are 0.509 times the odds for having attended a Rank=1 institution (or equivalently 49% [= (1 − 0.509) × 100] lower). In other words, if the odds of a Rank=1 candidate are 1 to 10 (i.e., *p*=1/11 and 1-*p*=10/11), the odds of a Rank=2 candidate being admitted are about half as good, or about 1 to 20 (i.e., *p*=1/21 and 1-*p*=20/21). So, for every Rank=2 applicant who is admitted, twenty Rank=2 candidates will be rejected, but for every Rank=1 applicant who is admitted, only ten Rank=1 candidates will be rejected.

Odds ratios can also be provided for continuous variables and in this case the odds ratio summarises the change in the odds per unit increase in the explanatory variable. For example, looking at the effect of GRAD above, the odds ratio (exp(0.804) = 2.23) says how the odds change per grade point – i.e., 2.23 times higher per point in this case. It’s important to note that, for continuous explanatory variables, their effect on the *probability* (as opposed to the odds) of the outcome is not constant across all values of the explanatory variable. Due to the logit transformation, the effect will be smaller for very low or very high values of the explanatory variable, and much larger for those in the middle.

We can also calculate a confidence interval to capture our uncertainty in the odds ratio estimate and we’ve put together an online odds ratio confidence interval calculator that you can use to do exactly this (you just need to enter your data from a contingency table). For the GRAD variable above, the 95% confidence interval for the odds ratio (estimated to be 2.23) is 1.17 to 4.32, so we’re 95% confident that this range covers the true odds ratio (if the study was repeated and the range calculated each time, we would expect the true value to lie within these ranges on 95% of occasions).
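
Continuing the sketch above, odds ratios and their 95% confidence intervals can be read straight off the fitted model object.

```r
# Odds ratios and 95% confidence intervals from the model fitted earlier
exp(coef(fit))      # e.g., the rank2 row gives the admission odds ratio vs. a Rank=1 institution
exp(confint(fit))   # confidence intervals transformed onto the odds-ratio scale
                    # (the gpa row corresponds to the GRAD odds ratio discussed above)
```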

A key advantage of this modelling approach is that we are able to analyse the data all-in-one rather than splitting the data into subgroups and performing multiple tests (using a CHAID analysis, for example) which, with a reduced sample size, will have less statistical power. See our recent blog for further information on the importance and effect of sample size. By including all of the potential explanatory variables in one model, we can see which make up the most informative combination of predictors for the outcome.

All of the above (binary logistic regression modelling) can be extended to categorical outcomes (e.g., blood type: A, B, AB or O) – using multinomial logistic regression. The principles are very similar, but with the key difference being that one category of the response variable must be chosen as the reference category. Separate odds ratios are determined for all explanatory variables for each category of the response variable, except for the reference category. The odds ratios then represent the change in odds of the outcome being a particular category versus the reference category, for differing factor levels of the corresponding explanatory variable.

There are also extensions to the logistic regression model when the categorical outcome has a natural ordering (we call this ‘ordinal’ data as opposed to ‘nominal’ data). For example, the outcome might be the response to a survey where the answer could be “poor”, “average”, “good”, “very good”, and “excellent”. In this case we use ordered logistic regression modelling and we can explore whether the odds of being in a ‘higher’ category is associated with each of our explanatory variables.
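
As a rough sketch of these extensions, the nnet and MASS packages in R provide multinomial and ordered logistic regression respectively; the data frame below is simulated purely for illustration.

```r
# Multinomial and ordinal extensions (simulated, placeholder data)
library(nnet)   # multinomial logistic regression
library(MASS)   # ordered logistic regression

set.seed(1)
dat <- data.frame(
  outcome = sample(c("A", "B", "AB", "O"), 300, replace = TRUE),           # nominal categories
  rating  = factor(sample(c("poor", "average", "good"), 300, replace = TRUE),
                   levels = c("poor", "average", "good"), ordered = TRUE), # ordered categories
  x1 = rnorm(300), x2 = rnorm(300)
)

multi_fit <- multinom(outcome ~ x1 + x2, data = dat)         # one set of coefficients per non-reference category
ord_fit   <- polr(rating ~ x1 + x2, data = dat, Hess = TRUE) # proportional-odds (ordered logistic) model

exp(coef(ord_fit))   # odds ratios for being in a 'higher' category
```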

These logistic regression models can also be used to make predictions of the probability of an outcome for particular cases. We can input the values of the explanatory variables (into the formula generated by the model) for a range of possible scenarios and obtain the predicted odds or probability of the outcome in each case.
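
For example, using the admissions model sketched earlier, predicted probabilities for a few hypothetical applicants could be obtained as follows.

```r
# Predicted admission probabilities for hypothetical applicants, using the fit from the sketch above
new_students <- data.frame(gre  = c(500, 700),
                           gpa  = c(3.0, 3.8),
                           rank = factor(c(2, 1), levels = 1:4))

predict(fit, newdata = new_students, type = "response")   # predicted probabilities of admission
```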

The model can be implemented within a tool, for example in Microsoft Excel or as a web app (see our recent post on Interacting with Your Data). This allows a range of predictions to be made and visualised easily. Prediction intervals can also be provided with each projection to quantify the associated uncertainty in the estimate – giving the range for which we are confident that the true probability will lie and allowing the user to consider best- and worst-case scenarios.

Logistic regression models are a great tool for analysing binary and categorical data, allowing you to perform a contextual analysis to understand the relationships between the variables, test for differences, estimate effects, make predictions, and plan for future scenarios. For a real-world example of the value of logistic regression modelling, see our case study on developing a medical decision tool using binary logistic regression to help inform the assessment of whether to extubate intensive care patients.

Logistic regression models are also great tools for classification problems – take a look at our blog on Classifying Binary Outcomes to find out more.

The post Analysing Categorical Data Using Logistic Regression Models appeared first on Select Statistical Consultants.

We’re pleased to announce that the prestigious Chartered Statistician designation has been granted to Camille by the Royal Statistical Society, recognising her extensive training and experience as a professional statistician.

The Chartered Statistician (CStat) status provides formal recognition of an individual’s statistical qualifications, professional training and experience and is the highest professional award for a statistician. To qualify, the Royal Statistical Society (RSS) requires an approved degree together with post-graduate training and experience as a professional statistician for at least 5 years, or alternatively the ability to demonstrate breadth and depth of statistical knowledge. Camille’s award, gained through the competency-based route, recognises her 10 years’ professional experience in a statistical role at the Pirbright Institute for Animal Health, University of Bristol and now here at Select Statistical Services. She was also able to demonstrate a strong and consistent commitment to continuing professional development (CPD), another key criterion considered by the RSS in making the award.

Chartered Statisticians are required to abide by the Society’s code of conduct, and to adhere to their comprehensive CPD policy. Each CStat is required to regularly revalidate their qualification to ensure that they continue to adhere to the RSS’s strict guidelines which are designed to ensure that Chartered Statisticians provide the highest level of professional service to their clients.

Guidance on how to apply for the CStat award is available on the RSS web site, but the Select team are also very happy to offer advice and guidance on how to develop and maintain a suitable CPD programme and to apply for the CStat award.

The post Camille is Awarded Chartered Statistician Status appeared first on Select Statistical Consultants.

In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “CHAID”, but take a look at our previous articles on Key Driver Analysis, Maximum Difference Scaling and Customer Segmentation, and look out for new articles on *TURF* and *Brand Mapping*, coming soon. If there are other terms that you’d like us to blog on, we’d love to hear from you so please do get in touch.

CHAID (**Ch**i-square **A**utomatic **I**nteraction **D**etector) analysis is an algorithm used for discovering relationships between a categorical response variable and other categorical predictor variables. It is useful when looking for patterns in datasets with lots of categorical variables and is a convenient way of summarising the data as the relationships can be easily visualised.

In practice, CHAID is often used in direct marketing to understand how different groups of customers might respond to a campaign based on their characteristics. So suppose, for example, that we run a marketing campaign and are interested in understanding what customer characteristics (e.g., gender, socio-economic status, geographic location, etc.) are associated with the response rate achieved. We build a CHAID “tree” showing the effects of different customer characteristics on the likelihood of response.

At the first level (the “trunk”) we have all customers and the overall response rate for the marketing campaign was, say, 24.3%. As we progress down the tree to the first “branch”, we identify the factor that has the greatest impact on the likelihood of response, and our overall population is broken down into groups (“leaves”) based upon their differing values of this characteristic – Urban/Rural. We might find that rural customers have a response rate of only 18.6%, whereas urban customers have a response rate of 28.5%. We check to see if this difference is statistically significant and, if it is, we retain these as new leaves. At the next branch, for each of the new groups (Urban/Rural), we then consider whether they can be further split into subgroups so that there is a significant difference in the dependent variable (the response rate). Urban homeowners may have a much higher response rate (36.1%) compared with urban non-homeowners (22.7%), and rural full-time workers might have a higher response rate (24.0%) than rural part-time workers (17.8%) or the rural retired/unemployed (5.3%), for example. At each step every predictor variable is considered to see if splitting the sample based on this factor leads to a statistically significant relationship with the response variable. Where there might be more than two groupings for a predictor, merging of the categories is also considered to find the best discrimination. If a statistically significant difference is observed then the most significant factor is used to make a split, which becomes the next branch in the tree.

The process repeats to find the predictor variable on each leaf that is most significantly related to the response, branch by branch, until no further factors are found to have a statistically significant effect on the response (e.g., likelihood of responding to the marketing campaign). The results can be visualised with a so-called tree diagram – see below, for example. In this case, we can see that urban homeowners (36.1%) have the highest response rates, followed by rural full-time workers (24.0%) and that these are therefore the best groups of customers to target. On the other hand, the lowest response rates were observed for the rural, retired/unemployed, aged over 65 years (1.4%).

As indicated in the name, CHAID uses Pearson’s Chi-square tests of independence, which test for an association between two categorical variables. A statistically significant result indicates that the two variables are not independent, i.e., there is a relationship between them. (See our recent blog post “Depression in Men ‘Regularly Ignored’…” for an example looking at the relationship between perceived mental health disorders and gender.)

Chi-square tests are applied at each of the stages in building the CHAID tree, as described above, to ensure that each branch is associated with a statistically significant predictor of the response variable (e.g., response rate). Bonferroni corrections, or similar adjustments, are used to account for the multiple testing that takes place. When testing with a 5% significance level (i.e., considering a p-value of less than 0.05 to be statistically significant) we have a one in 20 chance of finding a false-positive result; concluding that there is a difference when in fact none exists (see this light-hearted cartoon for further discussion of multiple testing). The more tests that we do, the greater the chance we will find one of these false-positive results (inflating the so-called Type I error), so adjustments to the p-values are used to counter this, so that stronger evidence is required to indicate a significant result.
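
The sketch below illustrates the kind of test applied at a single candidate split, together with a Bonferroni adjustment across several candidate predictors; the counts are invented to match the illustrative response rates above and the extra p-values are made up.

```r
# A single CHAID-style split test (hypothetical counts)
# Rows: Urban / Rural; columns: responded / did not respond to the campaign
tab <- matrix(c(285, 715,     # urban:  28.5% response
                186, 814),    # rural:  18.6% response
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Urban", "Rural"), c("Responded", "No response")))

chisq.test(tab)                      # Pearson's chi-square test of independence

# With many candidate splits, adjust the p-values for multiple testing
p_values <- c(0.012, 0.048, 0.300)   # p-values from several candidate predictors (illustrative)
p.adjust(p_values, method = "bonferroni")
```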

CHAID can also be extended to apply to the case where we have a continuous response variable, for example, sales recorded in £’s. However, in this case F-tests rather than Chi-square tests are used. Continuous predictor variables can also be incorporated by determining cut-offs to create ordinal groups of variables, based, for example, on particular percentiles of the variable. So, we might band incomes into four groups, based on its quartiles, such as ≤ £15,000; > £15,000 & ≤ £20,000; > £20,000 & ≤ £33,000; and > £33,000.
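
Banding a continuous predictor into quartile groups, as described above, might look like this in R (with simulated incomes).

```r
# Banding a continuous predictor (income) into quartile groups for use in CHAID (simulated data)
set.seed(2)
income <- rlnorm(1000, meanlog = 10, sdlog = 0.5)

income_band <- cut(income,
                   breaks = quantile(income, probs = c(0, 0.25, 0.5, 0.75, 1)),
                   include.lowest = TRUE,
                   labels = c("Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"))

table(income_band)   # roughly 250 observations per band
```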

Generally a large sample size is needed to perform a CHAID analysis. At each branch, as we split the total population, we reduce the number of observations available and with a small total sample size the individual groups can quickly become too small for reliable analysis.

When we are interested in identifying groups of customers for targeted marketing where we do not have a response variable on which to base the splits in our sample, we can use other market segmentation techniques such as cluster analysis (see our recent blog on Customer segmentation for further information).

CHAID is sometimes used as an exploratory method for predictive modelling. However, a more formal multiple logistic or multinomial regression model could be applied instead. These regression models are specifically designed for analysing binary (e.g., yes/no) or categorical response variables and can accommodate continuous and/or categorical predictor variables. Interaction terms could be included in the model to investigate the associations between predictors that are tested for in the CHAID algorithm, whilst allowing a wider range of possible model specifications which may well fit the data better. Another advantage of this modelling approach is that we are able to analyse the data all-in-one rather than splitting the data into subgroups and performing multiple tests. In particular, where a continuous response variable is of interest or there are a number of continuous predictors to consider, we would recommend performing a multiple regression analysis instead. See our recent blog post on Analysing Categorical Data Using Logistic Regression Models for further details of these more formal modelling approaches.

The post CHAID (Chi-square Automatic Interaction Detector) appeared first on Select Statistical Consultants.

On a cold and crisp November day, the whole of the Select team (including future colleague Jo Morrison) met up at Exeter cookery school for a full day of developing new culinary skills and techniques.

The day included learning how to make pasta from scratch, using an insane number of free-range eggs. The smooth pasta dough was then turned into scrumptious spinach, ricotta and whole egg-yolk ravioli. Complete focus was required to close up each delicate raviolo, removing all air bubbles but without breaking the yolk which was sitting comfortably on top of the spinach and ricotta filling.

The ravioli were then cooked to be the centre-piece of a well-deserved lunch, comprising a bed of rocket salad with mustard dressing, some wild mushrooms sautéed in butter, grilled bacon and parmesan shavings. This was enjoyed by all present, following the opening of the first Christmas crackers of the season.

The afternoon was spent working with chocolate to make a chocolate delice, a yummy dessert with a crunchy nutty base, covered with a rich chocolate ganache and finished by a chocolate mirror glaze.

A lot of skills, bowls, pans and pots were involved, as well as some blow torches!

After a full-on day learning new skills away from desks and computers, we all went home with a bag of home-cooked goodies and some sore feet.

The post Select Team Cooks their Way to Fine Dining appeared first on Select Statistical Consultants.

In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “Customer Segmentation”, but have a look at our other posts on Key Driver Analysis, Maximum Difference Scaling and CHAID, and watch out for new articles on *TURF* and *Brand Mapping*, amongst others, coming soon. If there are other terms that you’d like us to blog on, we’d love to hear from you so please do get in touch.

Customer segmentation (sometimes also referred to as market segmentation) breaks down large groups of current and/or potential customers in a given market into smaller groups that are “similar” in terms of their preferences or characteristics. This allows you to adopt a different marketing mix (e.g., combination of price, product, promotion, and place) for each segment of the market. The same methods can also be used to select and target the best prospects, identifying those customers with the highest likely lifetime value or conversion rate, for example.

“Target your best prospects.”

Segmentation can be based upon a variety of factors including demographics, geography and spending behaviours as well as perceived needs and values. Traditionally, segmentation has focussed on identifying customer groups based on core demographics and values. However, value-based segmentation is now increasingly common. In this case we also group customers using variables that capture the revenue they generate, e.g., their lifetime value, and the costs of establishing and maintaining a relationship with them.

The segmentation process often begins by taking the most obvious market segments, such as male, female, teen and adult (so-called “a priori” segments) and breaking them up into smaller segments that are made up of actual or potential customers with specific shared characteristics. These characteristics are carefully selected as being those likely to affect customer behaviour and the segmentation process determines the relative importance of each in order to ensure that the final segmentation is of practical commercial value. See our recent blog post, “How Do Supermarkets Use Your Data?“, for a great example of the power of customer segmentation in creating accurate customer profiles to improve the targeting of products and services.

Customer segmentation can be used in both business to business (B2B) and business to consumer (B2C) sales and marketing. In the case of B2B, the “customers” that we are segmenting are businesses rather than individuals and so the characteristics on which we segment might differ, but the underlying statistical techniques used are just the same.

Data from an account or customer relationship management database are often used in customer segmentation as they provide a great resource of customer attributes. Additional data from other sources, including external databases, can also be used to supplement your own and allow you to consider potential as well as current customers.

A number of different statistical techniques can be used in performing customer segmentation. We discuss two of the most common methods (clustering and predictive modelling) below, but other classification techniques, such as random forests and mixture models (or latent class analysis) can also be used.

Clustering is a so-called “unsupervised” analysis that is designed to categorise observations (in this case customers) into a number of different groups (“clusters”), with each being relatively similar based on their values for a range of different factors. In each case, some form of distance measure is used to determine how close together or far apart different customers are based on their attributes.

There are many flavours of clustering methods depending upon how you measure the distance between points within and between clusters and also on how you explore the different groupings. For example, we can use Ward’s distance which seeks to minimise the total variance between points within each cluster. Then, in order to construct the best clustering, we might use an iterative procedure starting with every point being assigned to its own cluster and then merge clusters successively so as to minimise the increase in the Ward’s distance. The process continues until there’s just one cluster containing all the observations. A so-called dendrogram (see Figure 1 for example) can be produced that shows which clusters are merged at each step and the associated variance total, allowing us to select the most appropriate number of clusters.
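
A minimal sketch of this hierarchical (Ward) clustering approach in R, using simulated customer attributes, is shown below; cutting the resulting dendrogram gives candidate segments.

```r
# Hierarchical (Ward) clustering and a dendrogram (simulated customer attributes)
set.seed(3)
customers <- data.frame(price_sensitivity = rnorm(60), brand_loyalty = rnorm(60))

d  <- dist(scale(customers))           # Euclidean distances on standardised attributes
hc <- hclust(d, method = "ward.D2")    # Ward's method: merge the pair of clusters that adds the least within-cluster variance

plot(hc)                          # the dendrogram, used to choose a sensible number of clusters
segments_3 <- cutree(hc, k = 3)   # cut the tree into, say, three customer segments
table(segments_3)                 # segment sizes
```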

There is a subjective element to using these clustering techniques. Following the analysis, we would need to review the data and identify what the members of each cluster have in common in a meaningful and practical sense. Similarly, we can check that members of distinct groups differ in some obvious and relevant manner. This can be done by summarising the characteristics of each cluster and potentially visualising these summaries as a means of comparing them, e.g., using circle plots where the relative size of each circle corresponds with the relative magnitude of a given characteristic for each cluster compared with the overall average. This process can also help to determine how many segments are needed.

K-means clustering is probably the most popular clustering (or partitioning) method for customer segmentation and requires the analyst to pre-specify the number of clusters required. The method works by assigning each observation to a cluster and then calculating the distance between each point in that cluster and the mean value of all the observations in that cluster. The points are assigned to the clusters so as to minimise the total (squared) distance between each observation and the corresponding mean. Figure 2 shows an example where a group of customers have been segmented based on their sensitivity to price and brand loyalty.
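
The sketch below applies k-means to two simulated attribute scores, loosely mimicking the Figure 2 example; the data are invented and the three underlying groups are built in for illustration.

```r
# K-means segmentation on two attributes (simulated scores with three built-in groups)
set.seed(4)
scores <- data.frame(price_sensitivity = c(rnorm(30, 2), rnorm(30, 7), rnorm(30, 5)),
                     brand_loyalty     = c(rnorm(30, 6), rnorm(30, 7), rnorm(30, 2)))

km <- kmeans(scores, centers = 3, nstart = 25)   # pre-specify three clusters; nstart guards against poor local optima

km$centers                                 # average price sensitivity and loyalty within each segment
table(km$cluster)                          # segment sizes
plot(scores, col = km$cluster, pch = 19)   # quick visual check of the three groups
```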

There’s often a great deal of subjectivity associated with cluster analysis with the number of clusters being determined based upon the usability and usefulness of the corresponding groupings. The final clusters are often given names that summarise their key traits, such as “young, upwardly mobile” and “double income, no kids”. In the example in Figure 2, we have identified the three distinct clusters as “value conscious”, “brand advocates” and “loyal to low cost” customers.

In this example it’s very easy to identify the three groups “by eye” in Figure 2. However, if we wanted to include more customer attributes, we would have more than two dimensions and it would be much harder to identify the groups. We couldn’t split the three groups above by looking at either axis (i.e., attribute) on its own, in one dimension, as the groups overlap in terms of both their price sensitivity scores and brand loyalty scores. It’s only when we look at both scores together in two dimensions that the three groups can be easily identified. This idea of looking at multiple dimensions in combination is particularly relevant to higher-dimensional data where simply looking at a 2 or 3D plot won’t necessarily help. This is where we need statistical methods such as cluster analysis to be able to effectively look at all dimensions at once.

Dimensionality reduction techniques, such as so-called principal component analysis (PCA) or factor analysis, can also help in visualising and understanding higher dimensional data – we’ll blog about these techniques another time.

Predictive models are a useful alternative to clustering when we have a specific definition of a “good” customer, such as their lifetime value, on which to base the groupings. In this case, we can create a model (using linear or generalised linear regression, for example) to investigate the relationships between potential drivers and customer value. Those variables that are found to be statistically significant predictors of customer value can then be used to define our customer segments. See our recent blog post on “key driver analysis” for more information on this kind of predictive modelling.

A similar approach known as CHAID (Chi-squared Automatic Interaction Detector) analysis uses an algorithm for discovering relationships between a categorical response variable and other categorical predictor variables, and we’re planning to blog on this soon.

The post Customer Segmentation appeared first on Select Statistical Consultants.

In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “Maximum Difference Scaling”, but check out our other articles on Key Driver Analysis, Customer Segmentation and CHAID, and look out for new articles on *TURF* and *Brand Mapping*, amongst others, coming soon. If there are other terms that you’d like us to blog on, we’d love to hear from you so please do get in touch.

MaxDiff (Maximum Difference or Best-Worst Scaling) is a survey method in market research that was originally developed in the 1990s and is used to try to gain an understanding of consumers’ likes and dislikes. Respondents are usually asked to select the most and least important attributes from a subset of product features. The question is then repeated a number of times with the list of attributes varied so that the respondent selects the best and worst features from a number of subsets of product characteristics. The goal of the research is to rank the attributes in terms of their importance to customers on a common scale, so that comparisons and trade-offs between them can be made. See below for an example of a MaxDiff question looking at the attributes of a household appliance.

The method is easy for respondents to complete and forces them to make a discriminating choice amongst attributes. There is no opportunity for bias to occur due to differences in the use of rating scales (which is commonly seen across different countries and cultures) such as those that can occur with a five-point, non-comparative scale from “Not important” to “Extremely important”, for example. Furthermore, only two selections need to be made from each list, making it arguably more manageable/practical than the ranking of each item. When there are four attributes in the list, such as in the example above, we learn about five of the six pairwise comparisons between the items from just two customer choices; it is only the comparison between the two attributes which are not selected that remains unknown. For example, from the response above we know that:

- Safety is more important than Design/aesthetic
- Safety is more important than Speed of boiling
- Safety is more important than Capacity
- Design/aesthetic is more important than Capacity
- Speed of boiling is more important than Capacity
- Design/aesthetic vs. Speed of boiling is unknown

Firstly, experimental design is required in MaxDiff to construct the lists of product characteristics to be chosen from, to determine the number and combinations of attributes per question and to determine the number of questions that each respondent must complete. These are chosen so as to get the best balance of attributes within each question, maximising the information obtained whilst minimising the burden to the respondents. Ideally, combinations are chosen so that each item is shown an equal number of times and pairs of items appear together an equal number of times. Most often, so-called balanced incomplete block (BIB), or partially balanced incomplete block (P-BIB) designs are used. Take a look at our case study “Judging at the Big Bang Fair” for another example of the application of experimental design.

A number of different approaches are used by market researchers to analyse MaxDiff survey results.

A simple, so-called “Counts analysis” approach involves calculating the difference between the numbers of times each item is chosen as best and worst (termed the “count”) and then ranking the attributes based on these differences. This can be done at both the individual respondent level and also aggregated over all respondents. However, this method fails to take the experimental design of the survey into account and, for example, doesn’t use the information obtained when two items appear together in a list to distinguish between those with a tied count. Furthermore, if the experimental design was unbalanced, and so some items appeared more often than others, counts analysis will give biased estimates as items that appear more frequently will have had more opportunities to be chosen as best or worst.
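
As an illustration of the mechanics, here is a minimal counts analysis sketch in base R on a tiny, made-up set of responses (the attribute names and choices are purely illustrative):

```r
# One row per attribute shown to a respondent in a question, with indicators
# for whether it was picked as best or worst (made-up data)
responses <- data.frame(
  attribute = rep(c("Safety", "Design/aesthetic", "Speed of boiling", "Capacity"), 2),
  best      = c(1, 0, 0, 0,   0, 1, 0, 0),
  worst     = c(0, 0, 0, 1,   0, 0, 0, 1)
)

# Count = (times chosen as best) - (times chosen as worst), per attribute
counts <- aggregate(cbind(best, worst) ~ attribute, data = responses, FUN = sum)
counts$count <- counts$best - counts$worst

# Rank the attributes from most to least preferred
counts[order(-counts$count), ]
```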

Alternatively, random utility (or discrete choice) models, such as logistic regression models, are commonly applied to MaxDiff data. Logistic regression models are designed to predict the probability of a binary dependent variable (e.g., a yes/no response) via a linear combination of independent explanatory variables. The MaxDiff experiment, though it involves discrete choices, clearly does not fit this design, so a “trick” is used to apply the methodology in this case.

The “trick” involves separating out the responses for each attribute in each list as a binary outcome (chosen or not chosen) for the dependent variable and then using dummy variables for the independent variables to indicate which attribute the response corresponds to and whether it was selected as best (1) or worst (-1). The coefficients for each attribute from the fitted model are then directly compared to give a rank ordering for the attributes in terms of customer preference. They are often also transformed and interpreted as estimates of the relative probabilities of each item being chosen as the best. So, for the example above, we might find that the design/aesthetic of the kettle has the highest “share of preference” with approximately 40% chance of being selected as most important compared to the other attributes in the list.
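
To make the dummy-coding concrete, here is a minimal sketch in base R of one way the stacked data set might be constructed for a single, hypothetical question; the column names, data and final model call are illustrative rather than a prescribed implementation.

```r
# Each question contributes two "tasks" (a best choice and a worst choice),
# and every attribute shown becomes one row per task (made-up data)
maxdiff_long <- data.frame(
  task      = rep(c("best", "worst"), each = 4),
  attribute = rep(c("Safety", "Design", "Speed", "Capacity"), times = 2),
  chosen    = c(1, 0, 0, 0,   # best task: Safety selected
                0, 0, 0, 1)   # worst task: Capacity selected
)

# Attribute dummy variables, coded +1 in the best task and -1 in the worst task
task_sign <- ifelse(maxdiff_long$task == "best", 1, -1)
X <- model.matrix(~ attribute - 1, data = maxdiff_long) * task_sign
cbind(maxdiff_long, X)

# On a full data set (all respondents and questions) the logistic regression
# would then be fitted along the lines of:
#   fit <- glm(maxdiff_long$chosen ~ X - 1, family = binomial)
#   coef(fit)   # larger coefficients indicate more preferred attributes
```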

There are a number of issues with this analysis approach. Most importantly, the assumption of independent responses, which the logistic regression model relies upon (as, in fact, do almost all statistical techniques), is clearly violated: each choice will be affected by the attributes that were available to select in the current list, and best and worst choices will clearly be correlated. Therefore, the resulting parameter estimates will be biased and cannot be relied upon.

A more robust analysis that can be applied to MaxDiff involves applying a rank-ordered logistic regression or “exploded logit” model. This allows us to model the partial rankings obtained from the responses to the MaxDiff questions (see the bullet point list above, for example), whilst accounting for the ties. This approach does not violate the independence assumption in the way that the logistic regression “trick” above does and, as before, allows you to estimate the rank ordering of the attributes in terms of customer preference or to estimate probabilities of attributes being selected as the best.

Despite this approach being more statistically sound, there are still questions over the interpretability of the results. In particular, we are only assessing the relative importance/desirability of the attributes and so it is crucial to carefully consider the product features to be included upfront. The results also don’t indicate if any of the features are likely to actually impact customer behaviour, and furthermore customers’ responses (self-stated importance) won’t necessarily reflect what they actually want.

As the MaxDiff best and worst selections only depend upon the rank ordering of the attributes and their analysis simply provides estimates of the rank ordering of attributes, it may be simpler to directly ask consumers to rank the attributes in the first place. Although this approach is slightly more intensive for respondents, it is simply a case of repeatedly asking for the most important attribute from a decreasingly long list of items. This also simplifies the data collection process as we no longer need to generate experimental designs.

The rank-ordered logistic regression models described above are explicitly designed to analyse these sorts of data and allow us to estimate and test for differences among items in respondents’ preferences for them. It’s also simple to incorporate predictor variables accounting for respondents’ or items’ characteristics, or both, that allow us to investigate what characteristics affect the rankings.

**Further Reading**

- What’s Your Preference? Asking survey respondents about their preferences creates new scaling decisions (Steve Cohen & Bryan Orme)
- The MaxDiff Killer: Rank-Ordered Logit Models
- Why doesn’t R have a MaxDiff package?

The post Maximum Difference Scaling (MaxDiff) appeared first on Select Statistical Consultants.

]]>The post Select at the YSS Showcase appeared first on Select Statistical Consultants.

]]>The day was split into two with a morning training session introducing the attendees to R and R Shiny. R Shiny is a fantastic package from RStudio that makes it incredibly easy to build interactive web applications and is increasingly being used by businesses to interact with their data (you can see some example apps here). The afternoon was dedicated to presentations from statisticians in a variety of sectors and industries including Government, Academia, Medical and Finance. We were also lucky enough to hear plenary talks from both Professor Jane Hutton and Professor David Hand.

Lynsey’s presentation focussed on her own career path from undergraduate student to MD of Select with details of the sorts of interesting projects we do here at Select (such as understanding customer retention and modelling problem debt) and skills that you might need if you were interested in becoming a consultant. Lynsey said of the event “I had a really interesting afternoon at the YSS Showcase. It was great to be able to impart some advice on how aspiring statistical consultants can start on their career path and I also really enjoyed hearing the other presentations. It’s amazing the variety of careers that you can have as a statistician and it really highlights the career benefits of this fascinating discipline.”

The post Select at the YSS Showcase appeared first on Select Statistical Consultants.

]]>The post Working with the Sheffield Master’s Programme appeared first on Select Statistical Consultants.

The Advisory Committee meets once a year to advise the School of Maths and Statistics on the design and conduct of its MSc course, with particular reference to the current and developing needs of commercial organisations such as Select. During the meeting, the committee review student feedback forms on the various modules that make up the course, read through some of the most recent dissertations and have lunch with current students to hear first-hand what they thought of the course. “We had some really interesting and useful discussions with both the students and course leaders throughout the day,” said Lynsey afterwards. “Overall the feedback from the students was exceptionally positive, which is great to hear as I know from first-hand experience how useful the course is to employers”. One area that was particularly interesting to think about was the expanding field of data science and what sort of experience and qualifications future Data Scientists might need.

This year Lynsey was also invited to speak to the School of Maths Early Career group on her career to date and, more generally, on the role of a statistical consultant. The Early Career group is made up of both the School’s PhD students and their Research Associates and aims to invite speakers from a variety of different industries. During her talk, Lynsey gave some highlights of the sorts of interesting client projects that Select undertake and discussed what sort of skills are required to be a consultant (both statistical and non-statistical). The session was well received with lots of lively discussion and questions at the end.

The post Working with the Sheffield Master’s Programme appeared first on Select Statistical Consultants.

]]>The post Key Driver Analysis appeared first on Select Statistical Consultants.

]]>In our Market Research terminology blog series, we discuss a number of common terms used in market research analysis and explain what they are used for and how they relate to established statistical techniques. Here we discuss “key driver analysis”, but take a look at our other posts on MaxDiff, Customer Segmentation and CHAID, and look out for new articles on *TURF* and *Brand Mapping*, amongst others, coming soon.

It’s important to identify and understand the drivers of key business outcomes, such as customer satisfaction or loyalty, in order to improve processes and maximise performance and profitability. You might want to understand, for example, which aspects of your service influence how likely a customer will be to recommend you to others. A so-called **key driver analysis** can be used to address this sort of question.

A key driver analysis investigates the relationships between potential drivers and customer behaviour such as the likelihood of a positive recommendation, overall satisfaction, or propensity to buy a product. This is often done using data collected from a questionnaire, which might ask for a customer’s demographics, their level of satisfaction with various aspects of your company’s services (e.g., whether it was value for money, or whether the customer services department was helpful) as well as their likelihood of recommending your company to others (see below).

Correlations between the scores for the customer behaviour of interest (likelihood of recommendation) and those for the potential drivers may then be calculated to see whether there is evidence of a relationship between them. If there is a positive correlation between satisfaction with the customer services department and the likelihood of recommending the company to others, for example, then satisfaction with customer services is said to drive recommendations in a positive direction. Drivers can also be associated with customer behaviour changing in a negative direction.

A key driver analysis is often performed using multiple linear regression to model the primary outcome as a linear combination of the potential drivers. Those drivers that are found to have a statistically significant effect are considered to be *key drivers* of the outcome and their model coefficients can be interpreted to understand the direction and strength of the relationships between the drivers and the outcome variable.
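
As a simple illustration, the sketch below simulates a small (entirely artificial) survey data set in R and fits a multiple linear regression; the variable names are hypothetical and this is just one way such an analysis might be set up.

```r
# Simulated survey data (illustrative only)
set.seed(1)
survey <- data.frame(
  value_for_money  = sample(1:5, 200, replace = TRUE),
  customer_service = sample(1:5, 200, replace = TRUE),
  delivery_speed   = sample(1:5, 200, replace = TRUE)
)
survey$recommend <- with(survey, 2 + 0.8 * value_for_money +
                                 1.2 * customer_service + rnorm(200))

# Simple pairwise correlations between each potential driver and the outcome
cor(survey)["recommend", ]

# Multiple linear regression: statistically significant coefficients identify
# the key drivers and indicate the direction and strength of their effects
fit <- lm(recommend ~ value_for_money + customer_service + delivery_speed,
          data = survey)
summary(fit)
```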

A key driver analysis can help you to understand what drives customer behaviour.

By including all of the potential drivers in one model, we can see which make up the most informative combination of drivers for the outcome. The model may also be used to make “What If?” predictions of the outcome for customers with specific values of each of the drivers (these may include the gender and age-group of a customer, for example).

Where there are linear relationships (correlations) between two or more of the potential drivers, this can lead to difficulty in the interpretation of the model coefficients – so-called multicollinearity. This can occur where two of the potential drivers are capturing similar information: for example, a questionnaire might ask whether the staff were friendly and also whether they were helpful, which we would expect to be highly related.

There are various statistical approaches that can be used to deal with multicollinearity, including the use of principal component analysis to reduce the number of potential drivers to a set of linearly uncorrelated variables. These analyses that take account of multicollinearity are often called ‘true driver analyses’. It is important to note, however, that a correlation or regression analysis can only establish an association between each driver and the outcome; it cannot establish causation.
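
The sketch below illustrates, on simulated data, two of the approaches mentioned above: checking variance inflation factors (using the car package, which we assume is installed) and replacing correlated drivers with principal components via prcomp. It is intended as a sketch of the mechanics rather than a recommended workflow.

```r
# Simulated data: 'friendly' and 'helpful' are built to be highly related,
# mimicking two questionnaire items that capture similar information
set.seed(2)
n <- 200
friendly  <- sample(1:5, n, replace = TRUE)
helpful   <- pmin(5, pmax(1, friendly + sample(-1:1, n, replace = TRUE)))
value     <- sample(1:5, n, replace = TRUE)
recommend <- 2 + 0.9 * helpful + 0.5 * value + rnorm(n)
dat <- data.frame(recommend, friendly, helpful, value)

fit <- lm(recommend ~ friendly + helpful + value, data = dat)

# Variance inflation factors (car package): large values flag collinear drivers
library(car)
vif(fit)

# Principal component analysis replaces the correlated drivers with
# uncorrelated components, which can then be used as explanatory variables
pca <- prcomp(dat[, c("friendly", "helpful", "value")], scale. = TRUE)
summary(pca)
fit_pca <- lm(recommend ~ pca$x, data = dat)
summary(fit_pca)
```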

With a ‘key driver analysis’, statistical modelling can be used to quantify the relationships between multiple variables. This can help you to understand what drives customer behaviour and ultimately how to improve your performance.

The post Key Driver Analysis appeared first on Select Statistical Consultants.

]]>The post Select Welcomes Camille to the Team appeared first on Select Statistical Consultants.

]]>“Camille’s appointment is part of an ambitious plan to expand the consulting team over the next couple of years.” says Managing Director, Lynsey McColl, “With her experience and expertise across such a wide range of areas, Camille will help us continue to deliver the highest quality work for our clients as we pursue our plans to grow the business. We’re really pleased to welcome Camille to the team and are looking forward to working with her.”

The post Select Welcomes Camille to the Team appeared first on Select Statistical Consultants.

]]>The post Market Basket Analysis: Understanding Customer Behaviour appeared first on Select Statistical Consultants.

]]>“Market Basket Analysis allows retailers to identify relationships between the products that people buy.”

Retailers can use the insights gained from MBA in a number of ways, including:

- Grouping products that co-occur in the design of a store’s layout to increase the chance of cross-selling;
- Driving online recommendation engines (“customers who purchased this product also viewed this product”); and
- Targeting marketing campaigns by sending out promotional coupons to customers for products related to items they recently purchased.

Given how popular and valuable MBA is, we thought we’d produce the following step-by-step guide describing how it works and how you could go about undertaking your own Market Basket Analysis.

To carry out an MBA you’ll first need a data set of transactions. Each transaction represents a group of items or products that have been bought together, often referred to as an “itemset”. For example, one itemset might be: {pencil, paper, staples, rubber} in which case all of these items have been bought in a single transaction.

In an MBA, the transactions are analysed to identify rules of association. For example, one rule could be: {pencil, paper} => {rubber}. This means that if a customer has a transaction that contains a pencil and paper, then they are likely to be interested in also buying a rubber.

Before acting on a rule, a retailer needs to know whether there is sufficient evidence to suggest that it will result in a beneficial outcome. We therefore measure the strength of a rule by calculating the following three metrics (note other metrics are available, but these are the three most commonly used):

**Support:** the percentage of transactions that contain all of the items in an itemset (e.g., pencil, paper and rubber). The higher the support the more frequently the itemset occurs. Rules with a high support are preferred since they are likely to be applicable to a large number of future transactions.

**Confidence:** the probability that a transaction that contains the items on the left hand side of the rule (in our example, pencil and paper) also contains the item on the right hand side (a rubber). The higher the confidence, the greater the likelihood that the item on the right hand side will be purchased or, in other words, the greater the return rate you can expect for a given rule.

**Lift:** the probability of all of the items in a rule occurring together (otherwise known as the support) divided by the product of the probabilities of the items on the left and right hand side occurring as if there was no association between them. For example, if pencil, paper and rubber occurred together in 2.5% of all transactions, pencil and paper in 10% of transactions and rubber in 8% of transactions, then the lift would be: 0.025/(0.1*0.08) = 3.125. A lift of more than 1 suggests that the presence of pencil and paper increases the probability that a rubber will also occur in the transaction. Overall, lift summarises the strength of association between the products on the left and right hand side of the rule; the larger the lift the greater the link between the two products.
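
For completeness, the short R snippet below simply reproduces the arithmetic from the stationery example above:

```r
p_both <- 0.025   # support of {pencil, paper, rubber}
p_lhs  <- 0.10    # support of {pencil, paper}
p_rhs  <- 0.08    # support of {rubber}

confidence <- p_both / p_lhs            # P(rubber | pencil & paper) = 0.25
lift       <- p_both / (p_lhs * p_rhs)  # 0.025 / (0.1 * 0.08) = 3.125
c(confidence = confidence, lift = lift)
```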

To perform a Market Basket Analysis and identify potential rules, a data mining algorithm called the ‘Apriori algorithm’ is commonly used, which works in two steps:

- Systematically identify itemsets that occur frequently in the data set with a support greater than a pre-specified threshold.
- Calculate the confidence of all possible rules given the frequent itemsets and keep only those with a confidence greater than a pre-specified threshold.

The thresholds at which to set the support and confidence are user-specified and are likely to vary between transaction data sets. R does have default values, but we recommend that you experiment with these to see how they affect the number of rules returned (more on this below). Finally, although the Apriori algorithm does not use lift to establish rules, you’ll see in the following that we use lift when exploring the rules that the algorithm returns.

To demonstrate how to carry out an MBA we’ve chosen to use R and, in particular, the **arules** package. For those that are interested we’ve included the R code that we used at the end of this blog.

Here, we follow the same example used in the arulesViz Vignette and use a data set of grocery sales that contains 9,835 individual transactions with 169 items. The first thing we do is have a look at the items in the transactions and, in particular, plot the relative frequency of the 25 most frequent items in Figure 1. This is equivalent to the support of these items where each itemset contains only the single item. This bar plot illustrates the groceries that are frequently bought at this store, and it is notable that the support of even the most frequent items is relatively low (for example, the most frequent item occurs in only around 2.5% of transactions). We use these insights to inform the minimum threshold when running the Apriori algorithm; for example, we know that in order for the algorithm to return a reasonable number of rules we’ll need to set the support threshold at well below 0.025.

By setting a support threshold of 0.001 and confidence of 0.5, we can run the Apriori algorithm and obtain a set of 5,668 rules. These threshold values are chosen so that the number of rules returned is high, but this number would reduce if we increased either threshold. We would recommend experimenting with these thresholds to obtain the most appropriate values. Whilst there are too many rules to be able to look at them all individually, we can look at the five rules with the largest lift:

| Rule | Support | Confidence | Lift |
|------|---------|------------|------|
| {instant food products, soda} => {hamburger meat} | 0.001 | 0.632 | 19.00 |
| {soda, popcorn} => {salty snacks} | 0.001 | 0.632 | 16.70 |
| {flour, baking powder} => {sugar} | 0.001 | 0.556 | 16.41 |
| {ham, processed cheese} => {white bread} | 0.002 | 0.633 | 15.05 |
| {whole milk, instant food products} => {hamburger meat} | 0.002 | 0.500 | 15.04 |

These rules seem to make intuitive sense. For example, the first rule might represent the sort of items purchased for a BBQ, the second for a movie night and the third for baking.

Rather than using the thresholds to reduce the rules down to a smaller set, it is usual for a larger set of rules to be returned so that there is a greater chance of generating relevant rules. Alternatively, we can use visualisation techniques to inspect the set of rules returned and identify those that are likely to be useful.

Using the **arulesViz** package, we plot the rules by confidence, support and lift in Figure 2. This plot illustrates the relationship between the different metrics. It has been shown that the optimal rules are those that lie on what’s known as the “support-confidence boundary”. Essentially, these are the rules that lie on the right hand border of the plot where either support, confidence or both are maximised. The plot function in the arulesViz package has a useful interactive function that allows you to select individual rules (by clicking on the associated data point), which means the rules on the border can be easily identified.

There are lots of other plots available to visualise the rules, but one other figure that we would recommend exploring is the graph-based visualisation (see Figure 3) of the top ten rules in terms of lift (you can include more than ten, but these types of graphs can easily get cluttered). In this graph the items grouped around a circle represent an itemset and the arrows indicate the relationship in rules. For example, one rule is that the purchase of sugar is associated with purchases of flour and baking powder. The size of the circle represents the level of confidence associated with the rule and the colour the level of lift (the larger the circle and the darker the grey the better).

Market Basket Analysis is a useful tool for retailers who want to better understand the relationships between the products that people buy. There are many tools that can be applied when carrying out MBA and the trickiest aspects to the analysis are setting the confidence and support thresholds in the Apriori algorithm and identifying which rules are worth pursuing. Typically the latter is done by measuring the rules in terms of metrics that summarise how interesting they are, using visualisation techniques and also more formal multivariate statistics. Ultimately the key to MBA is to extract value from your transaction data by building up an understanding of the needs of your consumers. This type of information is invaluable if you are interested in marketing activities such as cross-selling or targeted campaigns.

If you’d like to find out more about how to analyse your transaction data, please contact us and we’d be happy to help.

`library("arules")`

`library("arulesViz")`

*#Load data set:*

`data("Groceries")`

`summary(Groceries)`

*#Look at data:*

`inspect(Groceries[1])`

`LIST(Groceries)[1]`

*#Calculate rules using apriori algorithm and specifying support and confidence thresholds:*

`rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))`

*#Inspect the top 5 rules in terms of lift:*

`inspect(head(sort(rules, by ="lift"),5))`

*#Plot a frequency plot:*

`itemFrequencyPlot(Groceries, topN = 25)`

*#Scatter plot of rules:*

`library("RColorBrewer")`

`plot(rules,control=list(col=brewer.pal(11,"Spectral")),main="")`

*#Rules with high lift typically have low support.*

*#The most interesting rules reside on the support/confidence border which can be clearly seen in this plot.*


*#Plot graph-based visualisation:*

`subrules2 <- head(sort(rules, by="lift"), 10)`

`plot(subrules2, method="graph",control=list(type="items",main=""))`

The post Market Basket Analysis: Understanding Customer Behaviour appeared first on Select Statistical Consultants.

]]>The post Fraud: A Crime for Middle England appeared first on Select Statistical Consultants.

]]>Cases of fraud have long been included in official police recorded crime data, but these figures provide at best a limited picture of the extent of the problem. Many cases of fraud go unreported to the police – for example, when a bank reimburses money stolen via credit card fraud, the consumer rarely goes to the trouble of going to the police. Plus, the reliability of police recorded crime numbers across the board has been called into question in a 2014 report from the UK Statistics Authority, citing regional variation in crime recording practices and insufficient data quality assurance procedures.

To attempt to fill the gap between the police recorded fraud figures and reality, the ONS recently introduced new questions into the CSEW asking about fraud and cyber crime. The CSEW is a rolling, victim-based study that asks samples of people about their experience of crime over the last year. The fraud and cyber crime questions have only appeared in the last six months and as such these results are still considered by the ONS to be “experimental” statistics. However, already a picture is emerging suggesting that the demographics of the people affected by fraud are unlike those of most other crimes.

For example, while the proportion of adults experiencing violent crime and personal theft declines fairly steadily as they get older, the prevalence of both fraud and computer misuse crimes peaks in the 45-54 age group. Another contrast is seen in household income, with those who earn more experiencing a little less violence but higher rates of theft and fraud than low earners.

Where you live also has an effect on the prevalence of different types of crime. Crime has traditionally been seen primarily as a problem for city dwellers, and indeed more adults experience violence and theft in urban areas than in rural locations. But for fraud the situation is reversed, with adults in rural areas being around 10% more likely to have been defrauded in the last year than urban dwellers. Computer misuse crimes are equally prevalent in urban and rural areas.

All this suggests that fraud is a crime affecting middle-aged, middle class people, and this is borne out when we look at the data grouped by the ONS’s population characteristics classification. Those most likely to experience fraud are “rural residents”, “urbanites” and “suburbanites” – groups that tend to be UK born, own their own home, and have professional jobs or be retired. These are also the groups with the lowest prevalence of violence and theft.

While these new statistics from the CSEW point towards some interesting differences between fraud and other types of crime, we conclude with a word of caution. The numbers presented here are estimates extrapolated from a survey of a sample of the population. The CSEW sample is quite large (around 35,000 people), but as mentioned above the fraud and cyber crime questions have only recently been added and the sample size for them is much smaller (about 9,000 people). The smaller the sample the lower the precision of the estimate, and the problem is compounded when we split the sample up to look at subgroups of the population. For example, there is an odd spike in computer misuse crime in the “cosmopolitans” population class, a group characterised by young single adults, often students, with high ethnic integration. While cyber crime could genuinely be a particular problem within this group, a closer look reveals that this estimate is based on just 326 survey responses, suggesting that it needs to be treated with a degree of caution. Ideally all these estimates would come with an indication of their uncertainty, such as a confidence interval, but unfortunately the ONS does not provide this, nor does it provide enough information about its methodology for us to calculate it ourselves.

**References**

The Crime in England and Wales statistical bulletin for the year ending March 2016

Experimental tables providing estimates on fraud and computer misuse

An overview of fraud statistics for the year ending March 2016

The post Fraud: A Crime for Middle England appeared first on Select Statistical Consultants.

]]>The post Changes to the Select Team! appeared first on Select Statistical Consultants.

]]>The aim of this restructure is to allow us to continue to deliver high quality statistical analyses for our current clients, whilst exploring the unique opportunities available as more and more companies become aware of the potential in their data. Lynsey said of her role, “I’m very excited to take on this role from Steve, who has done such a great job at guiding the Company in its first five years. I believe that Select’s potential is huge and I’m looking forward to working with all of the team to continue to grow and deliver interesting, varied and insightful consultancy work for our clients.”

As part of the recent reshuffle, Sarah has also been promoted to Senior Statistical Consultant to reflect the expertise that she has and the invaluable contribution that she makes to the company. Congratulations Sarah!

Finally, we’re excited to announce that we’ll be recruiting over the coming months so watch this space for adverts. It’s likely that we’ll be looking to take on two new Consultants to expand our team further.

The post Changes to the Select Team! appeared first on Select Statistical Consultants.

]]>The post Assessing and Improving Probability Prediction Models appeared first on Select Statistical Consultants.

]]>The *calibration* of a model refers to how close its predictions are to the observed outcomes in a sample of test cases. When the outcome is a continuous variable that has been modelled using linear regression, say, then the comparison of predictions and observations is straightforward. For example, predicted values can be plotted against observed values to see how well they match and then numerical measures of performance can be constructed using the errors (i.e. observed values minus the predictions). Common performance measures include the mean squared error and the mean absolute error for example. However, when the outcome is binary the situation is more complex because the predictions are probabilities of an event occurring (i.e., a number between 0 and 1), whilst the observations are binary (e.g., “yes” or “no”) and so one cannot be subtracted from the other to obtain a meaningful difference. The binary outcome could be encoded as a numeric variable – e.g. 0 for “no” and 1 for “yes” – but, even then, a plot of observations against predictions will be largely meaningless.

To get around the problem of comparing binary outcomes with their probability predictions, we can use a method known as binning, as follows. We put the predictions into a number of bins so that all of the predictions in a given bin have similar probability values. We then compare the average predicted probability within the bin with the observed frequency of events for the corresponding observations. If the model is well calibrated (i.e. the probabilities are good predictions of what actually happened) then these two quantities should be close to each other. This can be visualised by plotting the observed frequencies against the average bin probabilities and, if the model is well calibrated, then the points on this plot should lie close to the 45 degree line through the origin.

A good way of choosing the bins is to use equally-spaced quantiles of the predictions, so that the bins each contain equal numbers of observations. For example, if there are to be 10 bins then we would use the deciles of predictions, so that the smallest 10% of predictions go into the first bin, the next 10% of predictions into the next bin, and so on. The number of bins to use is a subjective choice, striking a balance between showing sufficient detail and having plenty of observations in each bin.
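
As an illustration of the binning mechanics (using simulated data rather than the admissions example, so the “model” here is well calibrated by construction), a calibration table based on deciles of the predictions might be built in R along the following lines:

```r
# Simulated predictions and outcomes (illustrative only)
set.seed(3)
n     <- 1000
p_hat <- runif(n)             # predicted probabilities from some model
y     <- rbinom(n, 1, p_hat)  # observed binary outcomes

# Decile bins of the predictions
bins <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)),
            include.lowest = TRUE)

calib <- data.frame(
  mean_prediction    = tapply(p_hat, bins, mean),
  observed_frequency = tapply(y, bins, mean),
  n                  = as.vector(table(bins))
)
calib

# Calibration diagram: well-calibrated points lie close to the 45 degree line
plot(calib$mean_prediction, calib$observed_frequency,
     xlab = "Mean predicted probability (bin)",
     ylab = "Observed event frequency (bin)")
abline(0, 1)
```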

To illustrate this idea, consider the example introduced in our categorical data analysis blog where we predict the probability of a student being offered a place on a post-graduate course. We start by fitting a logistic regression model in which the student’s mark on the course’s admissions exam and their academic grading from their undergraduate degree are explanatory variables. Figure 1 shows a calibration diagram for this model using seven bins. We see that the points are curved away from the 45 degree line, indicating some degree of miscalibration. For example, the rightmost point is below the 45 degree line, which means that in this bin the event occurs less frequently than predicted i.e., the average probability for the bin is 0.49 while the observed frequency is 0.43. On the other hand, several of the points in the middle lie above the line indicating that when the model suggests a probability of between 30% and 40%, this is likely to be an under-estimate of the true probability.

The calibration diagram tells us that our model is miscalibrated, but what can we do about it? Miscalibration might mean that there is something missing from the model (e.g., an important explanatory variable has been left out). Alternatively, it might mean that the mathematical form of the model is a poor approximation to reality (e.g., an explanatory variable that has gone in as a linear effect in fact acts non-linearly).

However, sometimes we are limited by the data that are available, we find that non-linear transformations don’t help, and there is little we can do to improve the model itself. Fortunately though there is an alternative option for solving calibration problems: recalibration. In short, recalibration means adjusting the points on the calibration diagram so that they are closer to the 45 degree line.

In binary regression models, recalibration is usually done by fitting a new binary regression model that uses the original observations as the outcome and some transformation of the predicted probabilities as the explanatory variable. The best transformation to use is a subjective choice informed by the shape of the calibration diagram. It is generally possible to find a (often very complicated) transformation that moves the points on the calibration to lie exactly on the 45 degree line, making the calibration perfect, but such a choice would usually result in *overfitting* i.e., it would make the model fit very well to the data to hand at the expense of the performance of future predictions. To avoid overfitting, we follow the principle of parsimony and look for the simplest transformation that we believe adequately describes the relationship seen in the calibration diagram.

In the example above, the points in the calibration diagram follow a roughly curved path and so we might choose a quadratic transformation – one of the simplest non-linear transformations. Figure 2 shows the fitted curve on the calibration diagram. The curve is used to replace the original predictions to create new probabilities that have better calibration. For example, the curve maps an original prediction of 0.5 to a new value of 0.45. Once we have our calibrated probabilities, we can then repeat the original binning exercise with these new probabilities and produce a new calibration plot (see Figure 3) to see how well the calibrated probabilities perform.
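
A minimal sketch of this recalibration step in R, using the same simulated outcomes and predictions as in the binning sketch above (the quadratic form is simply the illustrative choice discussed in the text), might look like this:

```r
# Simulated outcomes and original predictions, as in the binning sketch above
set.seed(3)
n     <- 1000
p_hat <- runif(n)
y     <- rbinom(n, 1, p_hat)

# Recalibration model: the observed outcome regressed on a quadratic
# transformation of the original predicted probabilities
recal   <- glm(y ~ p_hat + I(p_hat^2), family = binomial)
p_recal <- predict(recal, type = "response")   # recalibrated probabilities

# The original binning exercise would then be repeated with p_recal in place of p_hat
```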

The recalibrated points in Figure 3 generally lie closer to the 45 degree line than those in Figure 1, but it is often useful to quantify the improvement to see how much better the calibrated predictions are relative to those produced directly from the model. A common measure for this is the weighted mean squared difference between the observations (averaged within the bins) and the probabilities (also averaged within the bins), where the weights are given by the number of data points within the corresponding bin. With this measure, a smaller value indicates better calibrated predictions. In terms of the calibration diagram, it is equivalent to the (weighted) mean squared vertical distance from the points on the diagram to the 45 degree line. This is illustrated in Figure 4, in which the red lines show the distances that would be squared and averaged to calculate the calibration. In the example above, recalibration reduced the calibration measure from 0.0017 to 0.0009 (i.e. a reduction of about 50%).
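
The calibration measure described above can be written as a small helper function; the sketch below is illustrative (the function name is ours) and assumes continuous predicted probabilities so that the quantile-based bins are well defined.

```r
# Weighted mean squared difference between bin-averaged observations and
# bin-averaged predictions; smaller values indicate better calibration
calibration_measure <- function(p, y, n_bins = 10) {
  bins  <- cut(p, breaks = quantile(p, probs = seq(0, 1, length.out = n_bins + 1)),
               include.lowest = TRUE)
  p_bar <- tapply(p, bins, mean)   # average prediction within each bin
  o_bar <- tapply(y, bins, mean)   # observed frequency within each bin
  w     <- as.vector(table(bins))  # number of observations per bin (weights)
  sum(w * (o_bar - p_bar)^2) / sum(w)
}

# e.g. compare the original and recalibrated predictions from the sketches above:
# calibration_measure(p_hat, y)
# calibration_measure(p_recal, y)
```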

Given that we can recalibrate to improve the model’s predictions, why shouldn’t we just repeat the process over and over until we get perfect predictions? This comes back to the problem of overfitting: each iteration of the recalibration process might improve the predictions specifically for the test cases, but making them perfect would likely worsen the performance of future predictions. We therefore usually follow a parsimonious approach and recalibrate just once.

It is worth noting that calibration is just one aspect of model performance and it is not in itself enough to identify whether or not a set of predictions are any good. In the above example, overall 32% of applicants are admitted onto the post-graduate course. If we had a model that always predicted that the probability of admission is 0.32, then in terms of calibration it would be perfect (because we would just have one non-zero bin within which the observed frequency matches the predictions), but it would be practically useless as it would not make any distinction between students. This suggests that calibration may need to be supplemented with other measures of performance in order to get an overall picture of how good the predictions are. Examples of other measures of the performance of probability predictions include resolution and the Brier score, which we will explore in a future blog.

We mentioned above that, ideally, calibration problems are solved by improving the original model. Recalibration can be a good fix when data are limited, but if more or better data are available, then improving the model will often lead to better all-round performance.

In the postgraduate admissions example there is, in fact, another variable available ranking the prestige of their undergraduate institution on a scale of 1-4. When we put this into the model as an extra explanatory variable, we find that the calibration diagram improves considerably (see Figure 5) and the improvement in the calibration measure is similar to that obtained from the recalibration performed above (a reduction of about 50%). The advantage of adding the extra variable over recalibrating is that it results in a more realistic model with improved interpretability. For example, the estimated coefficient of the new variable tells us that students from a higher ranked undergraduate institution are more likely to receive an offer, and a more detailed interpretation of the model could quantify this effect. There are also statistical arguments for why we should prefer model improvement over recalibration. For example, some other measures of performance (such as the resolution) are unaffected by recalibration and can only be improved by improving the model. We will expand upon these ideas in a future blog.

To sum up, calibration diagrams are a useful diagnostic tool for binary outcome regression models (e.g. logistic regression). They can be used to visualise calibration, which is an important aspect of model performance. Miscalibration indicates that something is wrong with the model, and the first step should always be to try to find the problem and fix it. However, when the data limits what can be done, recalibration is a useful tool for getting a working model that produces better calibrated predictions.

Calibration is just one example of a way to measure the performance of probability predictions. In a recent blog we also introduced another example: the ROC curve, which can be used to measure the classification performance of a binary regression model (i.e., its ability to successfully assign a specific outcome to each case, for example in order to make a decision). Other examples include resolution and the Brier score, which will be covered in a future blog. A whole literature exists about the many other methods for assessing prediction performance – see for example this book for a good introduction to the subject.

The post Assessing and Improving Probability Prediction Models appeared first on Select Statistical Consultants.

]]>The post How Are EU Migrants Represented Across the UK Workforce? appeared first on Select Statistical Consultants.

]]>Each month the Office for National Statistics (ONS) releases data on the UK Employment and Labour Market. These data include quarterly results from the Labour Force Survey (LFS), a survey of the employment circumstances of the UK population. It is the largest household survey in the UK and provides official measures of employment and unemployment. We’ll explore these data below and the picture they can give us of the UK workforce in the context of our membership in the EU.

Despite EU nationals making up only a third of all migrants in the UK, the ONS LFS data show that EU nationals constitute 64.3% of the migrant workforce.

Based on the most recent ONS release on 18th May 2016, for January to March 2016 there were estimated to be 31.5 million people aged 16 or over in employment in the UK. As shown in Figure 1, of these workers (excluding 4.6 thousand with an unstated nationality), approximately 89.4% were UK nationals and 10.6% non-UK nationals; 6.8% being EU nationals (from one of the 27 EU member states excluding the UK) and 3.8% being from the rest of the World.

The proportion of all people working in the UK accounted for by UK nationals has decreased by just over 7 percentage points over the last 20 years, from 96.5% in the first quarter of 1997 to approximately 89.4% in the same quarter of 2016. This is accounted for by a rise in the proportion of the UK workforce made up of non-UK EU nationals, from 1.7% to 6.8% over the same period, and a smaller increase for non-EU nationals, from 1.9% to 3.8%. Hence, the increase in the proportion of non-UK nationals working in the UK mainly reflects the admission of several new member states to the EU over this period.

Looking at the numbers of people working in the UK by nationality (shown in the left-hand plot in Figure 2), we can see that there has been a steady increase in the number of non-UK nationals from EU countries over the last decade, from approximately 758,000 for January to March 2006 to 2.15 million for the same period in 2016. In contrast, since early 2009, the number of non-UK nationals from outside the EU working in the UK has been broadly consistent, at an estimated 1.25 million for January to March 2009 and 1.19 million for the same period in 2016.

If we consider the estimates for people born abroad, the picture is somewhat different (see right-hand plot in Figure 2). The numbers of people working in the UK who were born outside the UK are generally higher because the estimates for people born abroad working in the UK include some UK nationals. There is a larger discrepancy for non-EU nationals as foreign residents who are non-EU nationals are likely to have a greater incentive to apply for citizenship of the UK than those who are EU citizens and therefore already benefit from rights more comparable to those of UK nationals.

Some press agencies have used these ONS figures on the numbers of people in UK employment over time to make comment on the proportion of new jobs that have gone to non-UK nationals in different quarters. However, as this BuzzFeed article discusses, this could be misleading as these numbers don’t tell us how many new jobs there were nor what proportion have been filled by UK and non-UK workers. All we know is the net change in the number of people in employment, which will include some who have dropped out of work and some who have gained employment – see this article from independent, non-partisan, fact-checking charity Full Fact for a detailed explanation as to why proportions of net changes don’t generally make sense. The ONS provides a statement to this effect with each release of these data but, despite this, various major news networks have reported the figures incorrectly.

Working age, non-UK EU nationals have a higher employment rate than both non-EU nationals and UK nationals, at 78.0% compared with 61.7% and 74.4%, respectively.

If we consider the levels of employment as a proportion of the total number of UK and non-UK nationals of working age in the UK, we can also understand if and how the rates of employment might differ. These data are visualised in Figure 3, and we can see that for January to March 2016 (the most recent data available) the employment rate was highest for non-UK EU nationals at an estimated 78.0%, compared with 74.4% for UK nationals and a somewhat lower rate of 61.7% for non-EU nationals. The EU rate overtook the UK rate in January to March 2006 and has consistently remained higher in the following decade.

Eurostat also publish quarterly and annual data from the European Union Labour Force survey (EU-LFS), which brings together European household sample survey data on labour participation collated from national surveys. Using the most recently released, detailed results (from 26th April 2016), we can explore the employment rate further and, in particular, see how the rates break down by gender.

We can compute the employment rate as the ratio between the employed population and total population in the relevant age/gender group, excluding those with unreported citizenship. Looking at the last quarter of 2015, it appears that the lower UK employment rate for non-EU nationals is largely driven by a lower activity rate of foreign women from non-EU member states. The female employment rate for non-EU nationals was 51.5%, approximately 17 percentage points lower than for UK nationals (68.9%) and just under 22 percentage points lower than for non-UK EU nationals (73.0%). Whereas for men, the difference was less marked with non-EU nationals estimated to have an employment rate (72.6%) just under 6 percentage points lower than UK nationals (78.3%) and 12 percentage points lower than for non-UK EU nationals (84.6%).

In 2008, Eurostat carried out some more detailed research on the labour market situation of migrants in the EU, which was released as one of a series of ad-hoc modules covering a variety of topics. In their report (see page 82), Eurostat described how the economic activity rates of women were found to be lower when there were dependent children in the household. Across the EU, similar activity rates were recorded for foreign women and female citizens of their residing country without children. However, the activity rate dropped for foreign women with one dependent child in the household but remained relatively stable for female citizens of their residing country. This effect was found to be stronger for non-EU citizens. The activity rate gap also widened with the number of dependent children in the household. This may help to explain the differences in the female employment rates discussed above.

Similar to UK nationals, Public admin., education and health is the largest sector in which EU14 nationals work. In contrast, for workers from the newer A10 member states, Distribution, hotels and restaurants is the most popular sector, and they have the highest rate of working in Manufacturing.

Roughly 10.6% of the UK workforce is made up of non-UK nationals, but what sorts of jobs are filled by migrants compared with UK nationals?

Ad-hoc, user-requested data available from the ONS (based on 2014 Annual Population Survey datasets) show how the UK workforce is broken down by industry sector for UK and non-UK nationals (see Figure 5). The data for non-UK nationals are available split by the older, EU14 member states (Austria, Belgium, Denmark, Finland, France, Germany, Greece, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain and Sweden), the newer, A10 member states (Czech Republic, Estonia, Hungary, Latvia, Lithuania, Poland, Slovak Republic, Slovenia, Bulgaria and Romania) and rest of the World countries.

Public administration, education and health is the largest sector for UK nationals (30.8%), EU14 nationals (27.6%) and rest of the World nationals (28.1%), but only the fifth largest for A10 nationals (11.1%). There is clearly variability in the sectors in which EU nationals from different countries work within the UK. For A10 countries, the largest industry sector is Distribution, hotels and restaurants (27.6%), followed by Manufacturing (19.3%). The percentage of UK nationals in employment in Manufacturing in the UK is much lower (9.6%), similarly for EU14 nationals (7.7%) and rest of the World nationals (6.7%). Of the different nationalities, EU14 nationals have the highest proportion of their UK workforce in Banking and finance (23.4%), which is their second largest sector, compared with citizens from the UK (16.6%), A10 (13.1%) and rest of the World countries (18.5%).

Non-UK EU nationals make up at least 4.1% of FTE workers & 8.6% of FTE doctors in the NHS Hospital & Community Health Service.

The potential effect of leaving the EU on the NHS workforce has been a concern raised by those campaigning to stay in the EU. Here we look at the data available from the Health and Social Care Information Centre (HSCIC) on the nationality of the workforce in the NHS to help understand what the possible impact could be.

Figure 6 below shows the full time equivalent (FTE) number of workers in NHS Hospital & Community Health Service (HCHS) jobs for September of 2009 to 2014.

The total appears to have remained relatively stable at around 1.05 million FTEs over this time. Within this, there appears to have been an increase in the number and proportion of FTEs filled by UK nationals, from 731 thousand (69.8%) in 2009 to 847 thousand (79.6%) in 2014; and a corresponding increase for non-UK EU nationals, from 25.5 thousand (2.4%) to 43.9 thousand (4.1%). However, it is important to recognise that there are many NHS staff records, upon which these figures are based, that do not contain nationality data, either because someone chose not to specify or was not asked their nationality. There has been a notable decrease in the number of records for which the nationality was unknown, from 217.0 thousand (20.7%) to 105.7 thousand (9.9%), and this has an impact on the interpretation of these figures. It is not possible to say whether the unknowns are made up of similar proportions of the different nationalities nor whether the change in the proportion of unknowns is driven in equal measure by changes in reporting across the constituent nationalities. Therefore, the increases in the FTE numbers for UK nationals and non-UK EU nationals could simply be due to the increase in the completion of the nationality field over time in the administrative system from which this information has been derived.

We can say, however, that the majority of FTEs, at least 79.6% (as of September 2014), are filled by UK nationals. As discussed above, these data are only for NHS workers in the HCHS; the nationalities of GPs and other primary care staff, for example, are not recorded.

NHS HCHS jobs include non-medical staff, such as infrastructure support roles. If we look at the number of FTEs only for doctors (including locums), the picture is somewhat different (see Figure 7).

Whereas the total FTE figures remained flat, there has been an increase (+8.7%) in the number of FTE doctors in the NHS HCHS between September 2009 and 2014, from 98.1 thousand to 106.6 thousand, respectively. At least 69.3% of FTE doctors are UK nationals (as of September 2014), however, a higher proportion of these roles are filled by non-UK nationals compared with all NHS HCHS roles. At least 8.6% of FTE doctors are EU nationals and 15.5% non-EU nationals. This is due to a higher proportion of non-UK national FTEs in NHS HCHS roles being doctors (21.0% for EU nationals; 24.4% for non-EU nationals), compared with the percentage for UK nationals (8.7%), as shown in Figure 8 below.

Despite EU migrants making up only approximately a third of all migrants in the UK (as discussed in our recent post: EU Freedom of Movement), the ONS LFS data show that EU nationals constitute 64.3% of the migrant workforce in the UK. The lower UK employment rate for non-EU nationals appears to be largely driven by a lower activity rate of foreign women from non-EU member states, which is not the case for non-UK EU nationals.

Similar to UK nationals, Public administration, education and health is the largest sector in which non-UK EU nationals in the UK from the EU14, original member states work. In contrast, workers from the newer, A10 member states are more likely to work in Distribution, hotels and restaurants as the most popular sector, and have a higher rate of working in Manufacturing compared to UK, EU14 and non-EU nationals. Focussing on the health sector in more detail, data from the HSCIC reveal that non-UK EU nationals make up at least 4.1% of the FTE number of workers in the NHS Hospital & Community Health Service. However, a higher proportion of non-UK national FTEs in NHS HCHS roles are doctors compared with the percentage for UK nationals, and at least 8.6% of FTE doctors in the NHS HCHS are EU nationals.

These figures could be used by those who are campaigning to leave the EU who argue that Britons are losing out to foreign workers taking their jobs. However, on the other side of the debate, concerns have been raised as to whether EU citizens working in the UK would meet work visa rules in the case of the UK leaving the EU, which might leave sectors such as manufacturing or the NHS understaffed and have a negative impact on the UK economy.

Aside from the data reported here, there are a number of wider issues that should be considered in the context of the UK workforce and our membership in the EU. The Government, which is campaigning for the UK to remain in the EU, argues that being in the EU makes it more attractive to invest in the UK, leading to more jobs. Furthermore, a substantial number of employment rights (such as the legal limit on the number of hours employees can be required to work, and prevention of employers discriminating against workers who are disabled, for example) have their roots in EU legislation and there is the potential for many of these rights to be lost should the UK leave the EU (depending on negotiations on the UK’s subsequent relationship with the EU). However, leave campaigners also argue that being bound by EU employment rights legislation is damaging to the UK economy, imposing substantial costs on employers, and that there are already many employment-related issues that are not subject to EU legislation (e.g., the National Minimum Wage is a home-grown policy).

As with many of the issues surrounding the EU referendum, we cannot say with certainty what the impact on the UK workforce will be if the UK left the EU. Forecasting migration and employment is very difficult and the policies that would follow a vote to leave the EU are unknown in advance.

The post How Are EU Migrants Represented Across the UK Workforce? appeared first on Select Statistical Consultants.

]]>