# Glossary

We’re always careful to make sure that we explain any technical terms that we use, and here are some of the terms that you might see in our reports or which might come up when discussing your project with us.

If you have any suggestions for statistical terms that we could add to our glossary, please let us know.

a

## Acceptance Sampling

Acceptance Sampling is a quality control technique that is commonly applied to manufacturing processes. The aim is to decide whether to accept or reject a lot based on information about the lot quality from a sample. To determine a sampling plan, we need four pieces of information:

• The producer's quality level (PQL): the proportion of defective items in a lot that the producer considers acceptable.
• Producer's risk (PR): the probability that a high-quality lot is rejected (typically 5-10%).
• The consumer's quality level (CQL): the proportion of defective items in a lot that the consumer considers unacceptable.
• Consumer's risk (CR): the probability that a low-quality lot is accepted (again, typically 5-10%).

Once values have been assigned to these four quantities, we can calculate the required sample size and the acceptance number (the maximum number of defective items allowed in the sample for the lot to be accepted).
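As a rough illustration of that calculation, the sketch below searches for the smallest sample size and acceptance number that satisfy both risks. It assumes the number of defective items in a sample follows a binomial distribution, and the function names and the example values are hypothetical rather than taken from any particular standard.

```python
from math import comb


def accept_prob(n, c, p):
    """Probability of at most c defectives in a sample of n (binomial model)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def sampling_plan(pql, pr, cql, cr, max_n=2000):
    """Return the smallest (sample size, acceptance number) meeting both risks.

    Returns None if no plan is found with a sample size up to max_n.
    """
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            # Find the smallest c that protects the producer for this n.
            if accept_prob(n, c, pql) >= 1 - pr:
                # A larger c would only make a poor lot easier to accept,
                # so this is the only c worth checking for the consumer.
                if accept_prob(n, c, cql) <= cr:
                    return n, c
                break
    return None


# Illustrative inputs: PQL 1%, PR 5%, CQL 5%, CR 10%.
plan = sampling_plan(pql=0.01, pr=0.05, cql=0.05, cr=0.10)
print(plan)
```

A lot is then accepted if the number of defective items found in the sample does not exceed the acceptance number.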

## Akaike’s Information Criterion (AIC)

Similar to the Bayesian Information Criterion (BIC), Akaike's Information Criterion is commonly used to discriminate between statistical models. Though it does not itself provide an absolute measure of how well a model fits the data, it can be used to compare the relative ability of two or more models to do so. Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. AIC not only rewards goodness of fit via a likelihood term, but also incorporates a penalty that is an increasing function of the number of parameters in the model. This penalty discourages overfitting and means that, amongst a group of similarly performing models, the simplest will be selected. The penalty term is smaller in AIC than in BIC for all but very small samples. Information criteria are often used in stepwise regression.

## Alternative Hypothesis

The alternative hypothesis is the statement that a statistical hypothesis test is seeking evidence for. If the test establishes that there is sufficient evidence to reject the null hypothesis, then it does so in favour of the alternative hypothesis. The form of the alternative hypothesis dictates whether the test is one- or two-sided (or tailed). For example, given a null hypothesis that the average number of sales is the same for the current webpage design and a new design, a two-sided alternative hypothesis would state that the average numbers of sales are not equal, whilst a one-sided alternative would state that the average number of sales for the new design is greater than (or, alternatively, less than) that for the original. In the one-sided case we are concerned with only one tail of the sampling distribution, rather than two. The alternative hypothesis is usually denoted as HA.

## Autocorrelation

Autocorrelation describes the similarity between observations as a function of how close together those observations are, be that in time or space. A variable is autocorrelated if observations closer together are more similar (i.e., there is greater correlation between them) than observations further apart.
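For observations equally spaced in time, one common way to quantify this is the sample autocorrelation at a given lag. The sketch below is a minimal version of that estimator; the function name and the example series are illustrative assumptions.

```python
def autocorrelation(x, lag):
    """Sample autocorrelation of an equally spaced series at the given lag."""
    n = len(x)
    mean = sum(x) / n
    # Total variation about the mean (the lag-0 term).
    denom = sum((v - mean) ** 2 for v in x)
    # Co-variation between each observation and the one `lag` steps later.
    num = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return num / denom


series = list(range(10))  # a steadily increasing (hypothetical) series
print(autocorrelation(series, 1), autocorrelation(series, 5))
```

For a trending series like this one, the value at lag 1 is larger than at lag 5, reflecting that nearby observations are more alike than distant ones.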

b

## Bayesian Information Criterion (BIC)

Similar to Akaike's Information Criterion (AIC), the Bayesian Information Criterion is commonly used to discriminate between statistical models. Though it does not itself provide an absolute measure of how well a model fits the data, it can be used to compare the relative ability of two or more models to do so. Given a set of candidate models for the data, the preferred model is the one with the minimum BIC value. BIC not only rewards goodness of fit via a likelihood term, but also incorporates a penalty that is an increasing function of the number of parameters in the model. This penalty discourages overfitting and means that, amongst a group of similarly performing models, the simplest will be selected. The penalty term is larger in BIC than in AIC for all but very small samples.
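To make the likelihood and penalty terms concrete, the sketch below uses the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of observations and L the maximised likelihood. The log-likelihood values in the example are made up for illustration.

```python
from math import log


def aic(log_likelihood, k):
    """Akaike's Information Criterion: penalty of 2 per parameter."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: penalty of ln(n) per parameter."""
    return k * log(n) - 2 * log_likelihood


# Two hypothetical models fitted to the same 100 observations.
simple = {"log_likelihood": -102.0, "k": 2}
complex_ = {"log_likelihood": -100.0, "k": 5}

for name, m in (("simple", simple), ("complex", complex_)):
    print(name, aic(m["log_likelihood"], m["k"]), bic(m["log_likelihood"], m["k"], 100))
```

Because ln(n) exceeds 2 once n is 8 or more, the BIC penalty per parameter is the larger of the two in most practical settings.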

## Bias

A statistic is biased if it is calculated in such a way that it is systematically different from the population parameter of interest. Bias refers to how far, on average, the statistic lies from the parameter it is estimating, that is, the systematic error which arises when estimating a quantity. See the following blog post for further discussion of bias and accuracy: Better a Dead Clock Than a Slow One?

## Box Plot

Box plots are a useful tool for visualising the distribution of a data set, its central value, its variability (or spread) and the presence of outliers. The box spans the lower and upper quartiles, and therefore the middle 50% of the data, and the horizontal line within the box is the median. The whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range of the box, and data points outside this range are marked as dots.
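The quantities drawn in a box plot can be computed directly. The sketch below uses `statistics.quantiles` from the Python standard library; note that different software packages use slightly different conventions for calculating quartiles, so the exact values may differ a little from those shown by a plotting tool. The function name and data are illustrative.

```python
from statistics import quantiles


def box_plot_stats(data):
    """Quartiles, whisker ends and outliers using the 1.5 x IQR rule."""
    q1, median, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats)
```

In this example the value 100 falls beyond the upper fence and would be drawn as a separate dot, while the upper whisker stops at 9.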

c

## Categorical Variable

A categorical variable is one that takes on a limited, fixed number of possible values. Examples of values that could be represented by a categorical variable include gender, blood type or the county that one lives in. Care needs to be taken when analysing a categorical variable: for example, the central tendency cannot in general be measured by the mean or median, but by the mode. Typically, data that are categorical are summarised in a contingency table.

## Coefficient of Variation (CV)

The coefficient of variation of a sample of data is the standard deviation divided by the arithmetic mean. It is a measure of the variability of the data relative to the average value.

## Confidence Interval

When reporting a point estimate from a sample of data (for example, estimating the mean of your population), it is usually a good idea to provide a confidence interval that quantifies the uncertainty associated with the estimate. The interval is calculated from the sample of data and is the range of values in which we estimate the true population value to lie, given our level of confidence expressed as a percentage (e.g., a 95% confidence interval).

## Confidence Level

The confidence level conveys the amount of uncertainty associated with an estimate. It is the chance that the confidence interval will contain the true value that you are trying to estimate. In other words, if a study were repeated many times and a 95% confidence interval calculated each time, you would expect the true value to lie within these ranges on 95% of occasions. For a given margin of error, a higher confidence level requires a larger sample size.
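As an illustration of how the confidence level affects the interval, the sketch below calculates a confidence interval for a mean using a Normal approximation, which is an assumption that is reasonable for large samples (for small samples a t-distribution would usually be used instead). The data are simulated and the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(data, confidence=0.95):
    """Approximate confidence interval for the mean (Normal approximation)."""
    centre = mean(data)
    standard_error = stdev(data) / sqrt(len(data))
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * standard_error, centre + z * standard_error


rng = random.Random(0)  # fixed seed so the simulated sample is reproducible
sample = [rng.gauss(50, 10) for _ in range(200)]

print(mean_confidence_interval(sample, 0.95))
print(mean_confidence_interval(sample, 0.99))
```

The 99% interval is wider than the 95% interval calculated from the same data, which is the price paid for the higher level of confidence.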

## Cox Proportional-Hazards Regression

A Cox proportional-hazards regression is used in a survival analysis to assess the effects of one, or a set of, explanatory variables on the risk of the outcome event (e.g., death). A logarithmic transformation is used to express the hazard of the event as a linear function of the explanatory variables. The hazard function is the instantaneous event rate at any one time. Cox proportional-hazards modelling, as the name indicates, depends upon the assumption that the survival distributions for the different levels of the explanatory variables have hazard functions that are proportional over time (i.e., constant relative hazard). This means that the effect of each explanatory variable on the risk of the event must be constant over time. Tests of the assumption of proportional hazards can be conducted based on the model residuals (e.g., the scaled Schoenfeld residuals).

## Critical Region

The critical region of a statistical hypothesis test is the set of all values of the test statistic for which the null hypothesis is rejected in favour of the alternative hypothesis. It is set to be the region where the probability of a value of the test statistic arising, given the null hypothesis, is less than the chosen significance level.

e

## Effect Size

The effect size is the estimated difference or change between the groups being compared, as observed in our sample. For a specified power, detecting a smaller effect size requires a larger sample size.

f

## Finite Population Correction

When calculating the standard error of an estimate, the population from which the sample is drawn is usually assumed to be infinite. This assumption is only reasonable if the sample size is much smaller than the population size. If, however, the sampling fraction (the sample size relative to the population size) is greater than 5%, then the standard error is too large, and so understates the precision of the estimate, unless a finite population correction is applied. The finite population correction accounts for the added precision gained by sampling a larger percentage of the population.
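A commonly used form of the correction multiplies the standard error by the square root of (N - n) / (N - 1), where N is the population size and n the sample size. The sketch below applies that form to the standard error of a mean; the function names and figures are illustrative assumptions.

```python
from math import sqrt


def finite_population_correction(population_size, sample_size):
    """Multiplier applied to the standard error when sampling without replacement."""
    return sqrt((population_size - sample_size) / (population_size - 1))


def corrected_standard_error(sd, population_size, sample_size):
    """Standard error of a mean with the finite population correction applied."""
    uncorrected = sd / sqrt(sample_size)
    return uncorrected * finite_population_correction(population_size, sample_size)


# Hypothetical survey: 400 responses from a population of 2,000 (20% sampled).
print(10 / sqrt(400))                           # uncorrected standard error
print(corrected_standard_error(10, 2000, 400))  # smaller once corrected
```

When the sample is a tiny fraction of the population the multiplier is very close to 1 and the correction makes little difference; when the whole population is sampled it is 0, since there is no sampling uncertainty left.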

g

## Generalised Linear Model

Generalised linear modelling is a flexible generalisation of ordinary linear regression that allows for response variables that have error distributions other than a Normal (Gaussian) distribution. This allows for modelling of binary rather than continuous variable outcomes, for example, via logistic regression.

## Gini Coefficient

The Gini coefficient of a sample of data is equal to half the relative mean difference.
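The sketch below builds the Gini coefficient up from the mean difference and the relative mean difference, which are defined elsewhere in this glossary. It assumes the mean difference is averaged over every ordered pair of values, including each value paired with itself; some definitions exclude those self-pairs, which gives slightly different results for small samples. The function names and data are illustrative.

```python
def mean_difference(data):
    """Average absolute difference over all ordered pairs of values."""
    n = len(data)
    return sum(abs(a - b) for a in data for b in data) / n**2


def relative_mean_difference(data):
    """Mean difference divided by the arithmetic mean."""
    return mean_difference(data) / (sum(data) / len(data))


def gini(data):
    """Gini coefficient: half the relative mean difference."""
    return relative_mean_difference(data) / 2


print(gini([2, 2, 2]))     # everyone has the same amount
print(gini([0, 0, 0, 1]))  # one person holds everything
```

Under this convention, perfectly equal values give a coefficient of 0, and the second example (one value holding the entire total) gives 0.75.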

## Graph Theory

Graph theory is the study of points and lines. In particular, it involves the ways in which sets of points, called nodes or vertices, can be connected by lines or arcs, called links or edges. Graph theory has proven useful in conservation efforts where a node can represent regions where certain species exist (or where the habitat would be suitable) and the links represent migration paths or movement between those regions.

h

## Histogram

A histogram is a graph that displays the shape of the distribution of a set of data and is often used as an estimate of the probability distribution of a continuous variable. The range of the data is divided into bins, and the number or proportion of the observations falling into each bin is plotted as adjacent rectangles. The area of each rectangle is proportional to the frequency of the observations in the bin. A histogram can also be normalised so that it displays relative frequencies and the total area of the rectangles equals 1. The choice of the bin width used is usually subjective; if the width is too small the histogram can be too ragged, but if it is too wide the shape can be oversmoothed and obscured.

## Hypothesis Testing

Hypothesis testing is a method of statistical inference that allows us to formally test between two competing hypotheses using data from a scientific study. Conventionally, a hypothesis test is carried out to assess whether there is sufficient evidence from the data collected to reject the null hypothesis in favour of an alternative hypothesis. A test statistic is calculated from the data collected and compared to values we might expect to see if the null hypothesis is true. If the test statistic is beyond the range of values we might expect to see, this suggests that the null hypothesis is false and should be rejected. This is carried out formally by calculating the p-value, which is the probability of observing a test statistic at least as extreme as the one obtained if the null hypothesis were true, and so indicates how extreme or unusual the test statistic is. If the p-value is less than a pre-determined value (known as the significance level) then the result is said to be statistically significant (i.e., it would be very unlikely to have occurred by chance alone) and the null hypothesis is rejected. We’ve put together some free, online statistical calculators to help you carry out some statistical calculations of your own, including a two-sample t-test which is a hypothesis test to compare two means.

i

## Independent and Identically Distributed (IID)

It is very common for statistical methods to assume that the data being analysed are IID (independent and identically distributed). Two events are "independent" if knowing one has happened tells us nothing about whether the other happens. In other words, the value of one observation does not influence the value of another (and so the observations are uncorrelated). The assumption that the data are "identically distributed" requires that the observations are generated from the same underlying probability distribution, i.e., that there is homogeneity in the observations and that the statistical properties of any one part of the sample of data are the same as for any other.

## Interquartile Range (IQR)

The interquartile range of a sample of data is the range of the middle half of the ordered values. It is a measure of "dispersion", i.e., spread.

k

## Key Driver Analysis

A key driver analysis investigates the relationships between potential drivers and customer behaviour such as the likelihood of a positive recommendation, overall satisfaction, or propensity to buy a product. See our recent blog post for more information.

l

## Likert Scale

The Likert scale is commonly used when collecting responses from questionnaires or surveys that rely on opinion (for example, a customer satisfaction survey). A Likert scale measures agreement or disagreement using a symmetric or balanced scale. It typically has an odd number of categories so that the middle value can represent a neutral response. A typical Likert scale is:

1. Strongly agree
2. Agree
3. Neither agree nor disagree
4. Disagree
5. Strongly disagree

Measuring responses using a Likert scale results in data that are categorical and, more specifically, ordinal.

## Log-Rank Test

A Log-Rank test is used in a survival analysis to compare the risk of the event (e.g., death) across the levels of a categorical explanatory variable. The Log-Rank test is a non-parametric statistical test that is used to compare survival distributions. It makes use of the full survival data and takes account of right censoring (i.e., where it is only known that the time of the event is greater than the length of follow-up at the last assessment point).

## Logistic Transformation

The logistic transformation (logit) of a probability, p, is its log-odds, i.e., the logarithm of the odds p/(1-p). This transforms a variable constrained between 0 and 1 to a continuous variable between plus and minus infinity.
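The transformation and its inverse are short enough to write out directly. The sketch below is a minimal illustration using only the Python standard library; the function names are our own.

```python
from math import exp, log


def logit(p):
    """Log-odds of a probability p, where 0 < p < 1."""
    return log(p / (1 - p))


def inverse_logit(x):
    """Map a value on the log-odds scale back to a probability."""
    return 1 / (1 + exp(-x))


for p in (0.1, 0.5, 0.9):
    print(p, logit(p), inverse_logit(logit(p)))
```

A probability of 0.5 corresponds to a logit of 0, probabilities below 0.5 give negative values and those above 0.5 give positive values.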

m

## Margin of Error

The margin of error is the level of precision you require. It is the range in which the value that you are trying to measure is estimated to be and is often expressed in percentage points (e.g., ±2%). A narrower margin of error requires a larger sample size.
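To show how the margin of error drives the sample size, the sketch below uses a common formula for estimating a proportion from a simple random sample of a large population, n = z² p (1 - p) / e², where e is the margin of error and z the critical value for the chosen confidence level. Taking p = 0.5 is a conservative assumption, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p=0.5):
    """Approximate sample size needed to estimate a proportion within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)


print(sample_size_for_proportion(0.05))  # +/- 5 percentage points
print(sample_size_for_proportion(0.02))  # +/- 2 percentage points
```

Under these assumptions a margin of ±5% needs 385 responses while ±2% needs 2401, so tightening the margin of error quickly increases the sample size required.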

## Maximum Entropy

The principle of maximum entropy states that, for a set of data, the probability distribution which best represents the current state of knowledge is the one that is maximally non-committal with regard to missing information. So, given observed data, we find the probability distribution that best describes these data, that is closest to a uniform distribution (equal likelihood across the range of possible values).

## Mean (Arithmetic)

The arithmetic mean (often simply called the mean) of a sample of data is the sum of the values divided by the number of values in the sample (the sample size). It is a summary statistic that measures the central tendency of the data, i.e., the "average" value. It is more sensitive to extreme values (outliers) than other measures of central tendency.

## Mean Difference

The mean difference of a sample of data is the arithmetic mean of the absolute differences between all pairs of values. It is a measure of "dispersion", i.e., spread, that gives less weight to larger differences compared with the standard deviation.

## Median

The median of a sample of data is the middle value of the ordered data. It is a summary statistic that measures the central tendency of the data, i.e., the "average" value.  It is less sensitive to extreme values (outliers) compared to the arithmetic mean.

## Mode

The mode of a sample of data is the most commonly occurring value. It is a summary statistic that measures the central tendency of the data, i.e., the "average" value.

## Multicollinearity

Multicollinearity occurs where two or more predictor variables in a statistical model are highly correlated, meaning that one can nearly be linearly predicted from the others. In this situation the coefficient estimates for the variables may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole; it only affects calculations regarding individual predictors. That is, a model with correlated predictors can indicate how well the entire set of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to the others.

n

## Nominal Variable

A nominal variable is a type of categorical data where the categories or groupings can only be differentiated by their labels or names (for example, gender or blood type). This is opposed to an ordinal variable where the labels can also be used to sort or rank the variable.

## Null Hypothesis

When carrying out a statistical hypothesis test, the null hypothesis is the default position and is assumed to be true unless the data provide sufficient evidence against it. Examples include the statement that there is no difference in the mean effect between two treatments, or that the probability of tossing a coin and obtaining a head is 0.5. The null hypothesis always relates to the statement that is being tested and is usually denoted as H0.

o

## Odds

If the probability of an event is p, then the odds of that event are p/(1-p).

## Odds Ratio

An odds ratio is a measure of association between an exposure and an outcome, for example, the association between patients receiving a treatment and being healthy. The value represents the odds of an outcome given a particular exposure compared to the odds of the outcome occurring in the absence of the exposure. For example, an odds ratio of 2 means that the odds of a patient being healthy having received the treatment are twice those of a patient who has not received the treatment.
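For a simple two-by-two table of counts, the odds ratio can be calculated as below. The counts are invented purely for illustration and the function name is our own.

```python
def odds_ratio(exposed_with, exposed_without, unexposed_with, unexposed_without):
    """Odds of the outcome among the exposed divided by the odds among the unexposed."""
    odds_exposed = exposed_with / exposed_without
    odds_unexposed = unexposed_with / unexposed_without
    return odds_exposed / odds_unexposed


# Hypothetical trial: 40 of 50 treated patients are healthy (odds 40:10),
# compared with 20 of 30 untreated patients (odds 20:10).
print(odds_ratio(40, 10, 20, 10))
```

Here the odds of being healthy are 4 for treated patients and 2 for untreated patients, giving an odds ratio of 2.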

## Ordinal Variable

An ordinal variable is a type of categorical data where the observations can be ranked or have a rating scale, for example, 1st, 2nd and 3rd or low, medium and high. This is opposed to a nominal variable where the categories reflect labels that cannot be meaningfully ordered. Responses from a questionnaire are often ordinal variables due to the scales that are set for question responses (for example, the response to a question may be to agree, neither agree nor disagree or disagree). A common ordinal scale used in questionnaires is the Likert scale.

p

## P-value

The p-value is associated with a statistical test and is the probability of a value of the test statistic arising which is as or more extreme than that observed, if the null hypothesis were true.  The null hypothesis is often rejected when the p-value is less than a certain significance level (often 0.05).  This indicates that the observed result would be highly unlikely under the null hypothesis.
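As a worked illustration, consider testing the null hypothesis that a coin is fair (probability of a head is 0.5). The sketch below computes an exact two-sided p-value from the binomial distribution, treating as "at least as extreme" every outcome that is no more probable than the one observed. That definition of extremeness is one common convention rather than the only one, and the function name and numbers are hypothetical.

```python
from math import comb


def binomial_two_sided_p(heads, tosses, p_null=0.5):
    """Exact two-sided p-value for the number of heads under the null hypothesis."""
    probs = [
        comb(tosses, k) * p_null**k * (1 - p_null) ** (tosses - k)
        for k in range(tosses + 1)
    ]
    observed = probs[heads]
    # Add up every outcome that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(pr for pr in probs if pr <= observed * (1 + 1e-9))
    return min(1.0, total)


print(binomial_two_sided_p(60, 100))  # 60 heads in 100 tosses
print(binomial_two_sided_p(50, 100))  # exactly what a fair coin would predict
```

With 60 heads in 100 tosses the p-value is a little above 0.05, so the null hypothesis of a fair coin would not be rejected at the 5% significance level.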

## Polychoric Correlation

For pairs of ordered categorical (ordinal) variables, the polychoric correlation is an estimate of the correlation between their continuous underlying latent variables.

## Power

Statistical power is the probability of finding a statistically significant result for a statistical test, given that there is an underlying effect in the population. A greater power requires a larger sample size.
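The relationship between power, effect size and sample size can be sketched for the simple case of comparing two group means. The example below assumes a two-sided test, equal group sizes, a known common standard deviation and a Normal approximation; in practice a t-distribution-based calculation gives slightly larger numbers. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(effect_size, sd, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a difference in two means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile corresponding to the target power
    return ceil(2 * ((z_alpha + z_beta) * sd / effect_size) ** 2)


print(sample_size_per_group(effect_size=0.5, sd=1.0))              # 80% power
print(sample_size_per_group(effect_size=0.5, sd=1.0, power=0.90))  # 90% power
```

Under these assumptions, detecting a difference of half a standard deviation with 80% power needs 63 observations per group, and asking for 90% power increases that number.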

## Probability Density

The probability density of a random variable is a function that describes the relative likelihood for the variable to take a given value.

r

## Range

The range of a sample of data is the maximum value minus the minimum value.  It is a measure of "dispersion", i.e., spread.

## Receiver Operating Characteristic (ROC) Curve

A Receiver Operating Characteristic (ROC) curve is commonly used to define a threshold for a test that is a binary classifier (e.g., positive/negative outcome). To create the ROC curve a variety of threshold values are explored and, for each threshold, the sensitivity (the true positive rate) and the specificity (the true negative rate) of the test are calculated. The sensitivity is then plotted on the y-axis and 1 - specificity (equivalent to the false positive rate) on the x-axis. An 'optimal' threshold can be chosen as the point on the ROC curve with the smallest Euclidean distance to the top left-hand corner, which balances minimising the false positive rate against maximising the true positive rate. Note that there are other measures that can be used to calculate the 'optimal' threshold that, for example, prioritise maximising the true positive rate over minimising the false positive rate.
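The threshold search described above can be sketched in a few lines. The example below assumes that a case is classified as positive when its score is at or above the threshold, and the scores and labels are invented for illustration.

```python
from math import hypot


def roc_points(scores, labels):
    """Return (threshold, false positive rate, true positive rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores), reverse=True):
        # Classify as positive when the score is at or above the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


def closest_to_corner(points):
    """Pick the point nearest to the top left-hand corner (FPR 0, TPR 1)."""
    return min(points, key=lambda pt: hypot(pt[1], 1 - pt[2]))


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]  # 1 = positive, 0 = negative
print(closest_to_corner(roc_points(scores, labels)))
```

For these data the chosen threshold is 0.4, which captures all of the positive cases at the cost of one false positive out of four negatives.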

## Regularisation

Regularisation methods in statistical modelling help to prevent the fitted model from describing random noise in the data instead of the underlying relationships. This can occur when a model is excessively complex, i.e., has too many parameters compared with the number of data points. Regularisation is usually applied by the inclusion of a penalty term in the modelling.

## Relative Mean Difference

The relative mean difference of a sample of data is the mean difference divided by the arithmetic mean.

## ROC Area Under the Curve (AUC)

The ROC area under the curve statistic is most often used for model comparison. An AUC estimate can be interpreted as the probability that the binary classifier considered in the ROC curve will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example.
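That probabilistic interpretation gives a direct way to calculate the AUC: compare every positive example with every negative example and count how often the positive one scores higher. The sketch below does exactly that, counting ties as half; the data are invented and the function name is our own.

```python
def auc(scores, labels):
    """AUC as the proportion of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive example scored higher
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


example_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
example_labels = [1, 1, 0, 1, 1, 0, 0, 0]
print(auc(example_scores, example_labels))
```

In this example 14 of the 16 positive/negative pairs are ranked correctly, giving an AUC of 0.875. A classifier that ranks at random would be expected to score around 0.5.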

s

## Sensitivity Testing

Sensitivity testing is the study of how the uncertainty in the output of a statistical model can be apportioned to different sources of uncertainty in its inputs. This can be useful in testing the robustness of the model in the presence of uncertainty and in identifying model inputs that cause considerable uncertainty in the output.

## Significance Level

The significance level is the probability threshold below which the null hypothesis is rejected in favour of the alternative hypothesis. It is typically set at either 5% or 1% and is used to define the critical region of the statistical hypothesis test. If the p-value is less than the significance level, the null hypothesis is usually rejected and the result considered to be statistically significant. The significance level is also known as the alpha (α) level and also describes the rate of type I errors.

## Standard Deviation

The standard deviation of a sample of data is the square root of the arithmetic mean of the squared differences of the values from their mean value. It is a measure of "dispersion", i.e., spread, with low values indicating that the data tend to be close to the arithmetic mean and high values indicating that the data are more variable and tend to be spread out further away from the arithmetic mean. The standard deviation can also be defined as the square root of the variance. The standard deviation usefully has the same units as the data and as the arithmetic mean.
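The sketch below follows that definition step by step and compares the result with Python's `statistics` module. Note that the definition above divides by the number of values, which corresponds to `pstdev`; the `stdev` function instead divides by one fewer than the number of values, a widely used adjustment when estimating the spread of a population from a sample. The data are made up for illustration.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]

centre = mean(data)
# Variance: arithmetic mean of the squared differences from the mean.
variance = sum((x - centre) ** 2 for x in data) / len(data)
# Standard deviation: square root of the variance.
sd = sqrt(variance)

print(centre, variance, sd)
print(pstdev(data))  # same definition as above (divides by n)
print(stdev(data))   # divides by n - 1, so slightly larger
```

For these values the mean is 5, the variance is 4 and the standard deviation is 2, in the same units as the data.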

## Standard Error

The standard error of an estimate (such as a sample mean) is a measure of its statistical precision, i.e., how closely the sample estimate is likely to represent the population value. It is equal to the standard deviation of the sampling distribution of the estimate, i.e., the theoretical distribution of such estimates across a large number of repeated samples.

## Statistical Significance

Statistical significance is an integral part of statistical hypothesis testing, used to assess whether an observed effect could plausibly be due to chance alone. A statistically significant result is found when the probability of observing an effect at least as large, if chance alone were operating, is low (commonly a cut-off of less than 5% is used, i.e., a 5% significance level).

## Stepwise Regression

Stepwise regression is a method for exploring statistical models within a nested structure. Akaike's Information Criterion (AIC) is commonly used to discriminate between the models. Starting with the model containing all independent variables, the AIC of each sub-model containing all variables except one is computed. The algorithm then selects the model with the lowest AIC. At the next step all models obtained by deleting a single independent variable, or by adding back in one that had previously been deleted, are evaluated and, again, the model with the lowest AIC value is selected. The algorithm continues until adding or deleting an independent variable from the model produces no improvement in the AIC. Stepwise regression is routinely used for variable selection problems where the best combination of independent variables is required.

## Survival Analysis

A survival analysis is used to analyse data where the response variable is the time until the occurrence of an event of interest. Survival analysis is often used in medical statistics, where the event of interest might be death or occurrence/recurrence of a disease, for example.  Techniques for a survival analysis include Kaplan-Meier survival curves, Log-Rank tests, Cox proportional hazards regression, etc. See our Retention Modelling case study for an example of a survival analysis.

t

## Type I and II Errors

A type I error occurs when the null hypothesis is incorrectly rejected in a statistical test, i.e., it is rejected when it is in fact true. The probability of a type I error occurring is known as α and is equivalent to the significance level of the test. So, for example, if the significance level is 5% and the null hypothesis is true, a wrong decision will be made 1 in 20 times. A type II error occurs when the null hypothesis is incorrectly retained, i.e., it is not rejected when it is in fact false. The probability of a type II error occurring is known as β and is related to the statistical power of a hypothesis test (which equals 1-β). For a given sample size, a trade-off is required between the two types of errors, since reducing the risk of a type I error increases the risk of a type II error. Since a type I error is generally assumed to be more serious, the general convention is to allow a 5% chance of a type I error and a 20% chance of a type II error.

v

## Variance

The variance of a sample of data is the arithmetic mean of the squared differences of the values from their mean value. It is a measure of "dispersion", i.e., spread, with low values indicating that the data tend to be close to the arithmetic mean and high values indicating that the data are more variable and tend to be spread out further away from the arithmetic mean. The variance can also be defined as the square of the standard deviation.