As some readers may be aware, there is currently much debate among professional statisticians on the topic of ‘statistical significance’ and the use of p-values – earlier this year the journal The American Statistician released a whole issue on the subject (Statistical Inference in the 21st Century: A World Beyond p < 0.05). A growing number of statisticians argue that statistical significance testing and reliance on p-values are endemic in statistical practice, and that this is having negative consequences for scientific research. The implications of this debate reach beyond the statistical community: anyone who relies on statistics to help inform decision-making should be aware of the possible dangers of the naïve use of statistical significance testing. At Select, we agree that there are risks in the over-use of and over-reliance on ‘statistical significance’ (as highlighted in our previous note on statistical tests), and in this blog we explain why.
Suppose we want to know whether a die is biased. Since there is no way to deduce the die’s fairness with certainty from physical tests, we might turn to statistical methods to infer whether or not it is fair. We roll the die a few times and check the results. Say 40% of the rolls land on a 5. Does this mean the die is biased? Answering this question is not straightforward, because no matter how many times we roll the die we will never know with certainty whether it is biased – we might just be seeing an unusual pattern of rolls for a fair die. Nevertheless, our assessment of the die should surely differ if we roll it 500 times rather than just 5 times. How do we navigate between unjustified certainty and falsely modest claims of ignorance?
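To make this concrete, here is a minimal sketch of the calculation (the helper function and the choice of numbers are our own illustration, not a prescribed method). It computes the exact probability that a fair die, for which each face has probability 1/6, produces at least 40% fives by chance:

```python
from math import comb

def upper_tail_binomial(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability that a fair die (P(five) = 1/6) lands on 5
# at least 40% of the time, purely by chance:
p_small = upper_tail_binomial(5, 2, 1/6)      # 2 or more fives in 5 rolls
p_large = upper_tail_binomial(500, 200, 1/6)  # 200 or more fives in 500 rolls

print(f"{p_small:.3f}")   # roughly 0.196 – this easily happens by chance
print(f"{p_large:.2e}")   # vanishingly small – strong evidence of bias
```

The same observed rate of 40% fives is entirely unremarkable after 5 rolls, yet overwhelming after 500 – which is exactly why the number of rolls matters.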
A wrong approach to the dilemma, but one that is conceptually similar to determining statistical significance, would be to set a given number of rolls as the threshold for certainty. Let us say 500 rolls is the cut-off. So, if 40% of the die’s rolls land on a 5 after 500 rolls we determine that the die is biased, but if 40% of the rolls land on a 5 after only 499 rolls we conclude that we have no basis on which to claim the die is biased. Dismissing the evidence if it falls short of a given threshold, and ignoring the room for error once it reaches the threshold, makes little statistical or logical sense. However, a statistical significance test in which a p-value below 0.05 is treated as ‘king’ and declared ‘statistically significant’ falls into exactly this trap – it misguidedly interprets the crossing of an essentially arbitrary threshold as something approaching stone-cold proof.
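The absurdity of the cut-off is easy to see numerically. In this hypothetical sketch (using an exact binomial tail calculation of our own), the evidence from 499 rolls is virtually indistinguishable from the evidence from 500 rolls, yet the cut-off rule treats them as opposites:

```python
from math import comb

def upper_tail_binomial(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of a fair die producing at least 200 fives in 499 vs 500 rolls:
p_499 = upper_tail_binomial(499, 200, 1/6)
p_500 = upper_tail_binomial(500, 200, 1/6)

# Both probabilities are astronomically small and of the same order of
# magnitude, yet a hard 500-roll cut-off would call one of them proof
# of bias and the other no evidence at all.
print(p_499, p_500)
```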
The imposed diametric choice between certainty and cluelessness, reflected in the binary ‘statistically significant’ or ‘not statistically significant’ paradigm, makes little sense because it misrepresents the nature of statistical evidence. Statistical evidence grows gradually stronger as we gather more data – or, in our example above, as we roll the die more times. Assuming we are sensible, this means that our confidence in our hypothesis should grow gradually stronger as we roll more times, all other things being equal. There is no statistical reason to collapse this continuum into a ‘no evidence’ versus ‘complete evidence’ dichotomy.
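The continuum is easy to demonstrate. In this sketch (again using an exact binomial tail calculation of our own devising), we hold the observed rate of fives fixed at 40% and vary only the number of rolls – the evidence against fairness strengthens smoothly, with no sudden jump at any threshold:

```python
from math import comb

def upper_tail_binomial(n, k, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Fix the observed rate of fives at 40% and vary the number of rolls:
# the tail probability shrinks steadily as the data accumulate.
for n in (5, 20, 50, 100):
    fives = (2 * n) // 5  # 40% of n, in exact integer arithmetic
    print(n, upper_tail_binomial(n, fives, 1/6))
```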
The argument that the dichotomy makes practical sense might proceed as follows: “Decisions must be one or the other. There is no room for degrees when faced with a choice, so we should set standards for evidence that mirror this.” There is an important question to be asked about the amount of evidence that should weigh for or against a decision; but to blur this question with the definition of the evidence itself is misguided. There is logical room to say that we do not know for certain that something is the case but that, on balance, the decision to act as if the hypothesis is true is justified. Wrong decisions are more likely to be made if we treat something as certain that is not.
At Select, we use statistics to give the best possible view of the inferential evidence in the data. This means embracing the uncertainty of statistical results and giving a range of likely scenarios which the data support to varying degrees. The decision-maker can then combine this evidence with the other factors that inevitably weigh into a decision, such as expert knowledge, economic constraints, and the utilities of each of the possible outcomes. When statistical significance takes too prominent a place in the statistical results and p-values are over-interpreted, there is more danger of data making decisions rather than informing them. We prefer to think of the valuable perspective data provide as one voice among many justifying a decision.
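As one illustration of reporting a range rather than a verdict, here is a rough sketch using the textbook normal-approximation interval for a proportion (the function name and the 95% level are our choices for the example, not a recommendation of this particular interval):

```python
from math import sqrt

def approx_interval(successes, n, z=1.96):
    """Rough 95% normal-approximation interval for a proportion."""
    p_hat = successes / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

# 40% fives observed: with 5 rolls the data support a wide range of
# underlying rates; with 500 rolls the plausible range is far narrower.
print(approx_interval(2, 5))
print(approx_interval(200, 500))
```

Reporting such a range hands the decision-maker the full picture – how strong the evidence is, and how much room for error remains – rather than a bare verdict.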
Professional statisticians across the world are recognising, and revolting against, the damage that a refusal to accept uncertainty can cause. Prominent in the discussion is a desire to move away from narrow-minded statistics, where decisions are automated based on arbitrary thresholds, towards a more thoughtful and holistic approach where data are used to inform decisions, not to make them. The great thing about embracing uncertainty is that once you start to think in these terms, more possibilities present themselves. Maybe you can implement a probabilistic strategy rather than a dichotomous one. Maybe there are more, higher-quality data you can collect to get closer to the truth. Maybe the data throw up a relationship that surprises you and that you decide to investigate further. Statistics and data are about much more than testing hypotheses, after all.