Post-Hoc Tests in Statistical Analysis

Published: March 20, 2023

Elliot McClenaghan

In this article, we review the function of post-hoc tests in statistical analysis, how to interpret them and when to use them (and not use them).

What are post-hoc tests?

Post-hoc testing is carried out after a statistical analysis in which you have performed multiple significance tests; ‘post-hoc’ comes from the Latin for “after this”. Post-hoc analysis is a way to adjust or ‘reinterpret’ results to account for the compounding uncertainty and risk of type I error (more on that below) that is inherent in performing multiple statistical tests. You may also see post-hoc tests referred to as multiple comparison tests (MCTs).

Significance testing

First, it may be helpful to recap what we mean by a significance test in statistics, and then explore how performing multiple tests may lead to spurious conclusions. Significance or hypothesis testing can be done with many types of data and in many different situations. The first step in performing one is to define a “null hypothesis”. We then calculate a p-value as a way of quantifying the strength of evidence against this null hypothesis. The p-value is the probability of observing a result as extreme as, or more extreme than, the one you have observed if the null hypothesis were true. Note that this is not the probability that the result occurred due to chance, nor the probability that the null hypothesis is true; it describes how compatible the observed data are with the null hypothesis. The smaller the p-value, the stronger the evidence against the null hypothesis.

For example, we may want to investigate whether mean systolic blood pressure differs between two groups of patients. We would test the null hypothesis (often written as H₀) that the mean systolic blood pressure is equal in the two populations from which the groups were sampled (no difference between the groups). We then calculate a test statistic from the observed data, compare it with the known theoretical distribution of that test statistic under the null hypothesis, and obtain and interpret the p-value, which gives us an idea of the strength of evidence against the null hypothesis.
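To make the blood pressure example concrete, here is a minimal sketch of a significance test for a difference in means. It uses a permutation test rather than a t-test so that it runs with the Python standard library alone; the function name `permutation_p_value` and the blood pressure readings are invented for illustration and do not come from any real study.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Under the null hypothesis the group labels are exchangeable, so we
    reshuffle the pooled values and count how often the shuffled difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimated p-value above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical systolic blood pressure readings (mmHg), for illustration only.
group_a = [118, 125, 130, 121, 135, 128, 122, 131]
group_b = [132, 140, 138, 129, 145, 136, 141, 134]

p_value = permutation_p_value(group_a, group_b)
print(f"p = {p_value:.4f}")
```

A small p-value here indicates that a difference as large as the observed one would rarely arise if the group labels were unrelated to blood pressure.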

Correct interpretation of p-values can be a tricky business. While p-values exist on a continuum between 0 and 1, it is common to use an arbitrary cut-off value of 0.05 to represent a “statistically significant” result. The 0.05 significance level (or α level) can be useful for other purposes such as calculating the required sample size for a study.

What do post-hoc tests tell you?

Interpretation of multiple p-values becomes even trickier, and this is the stage at which some researchers make use of post-hoc testing. If we test a null hypothesis that is in fact true, using the 0.05 significance level, there is a 0.95 probability of coming to the correct conclusion by not rejecting the null hypothesis. If we test two independent null hypotheses that are both true, the probability of correctly not rejecting either of them is 0.95 × 0.95 ≈ 0.90, so the probability of at least one false positive rises to about 0.10. Therefore, the more significance tests we perform together, the higher the compounding risk of mistakenly rejecting a null hypothesis that is in fact true (this is called a type I error or false positive; see Table 1). In other words, if we go on testing over and over, we will eventually find a “significant” result, which is why care must be taken when interpreting a p-value in the context of multiple tests. Indeed, at the 0.05 significance level, we would expect about one in every 20 tests of true null hypotheses to give a significant result by chance alone.

Post-hoc analysis, such as the Bonferroni test for multiple comparisons, aims to rebalance this compounding risk by adjusting the p-values, or equivalently the significance threshold, to reflect the risk of type I error. When several groups are being compared, the Bonferroni test is in essence a series of t-tests performed on pairs of groups, each judged against a stricter threshold.
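The compounding risk described above can be computed directly. The sketch below assumes that all tests are independent and that every null hypothesis is true; the function name `family_wise_error_rate` is our own label for the probability of at least one false positive.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one type I error across n independent tests
    of true null hypotheses, each carried out at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 2, 5, 10, 20):
    rate = family_wise_error_rate(n_tests)
    print(f"{n_tests:>2} tests: P(at least one false positive) = {rate:.3f}")
```

With 20 tests the probability of at least one false positive is already well over one half, even though each individual test uses the 0.05 level.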

| | Test rejects null hypothesis | Test fails to reject null hypothesis |
|---|---|---|
| Null hypothesis is true | Type I error (false positive) | Correct decision (no difference) |
| Null hypothesis is false | Correct decision (true difference) | Type II error (false negative) |

Table 1: Summary of the four possible outcomes for a hypothesis test of a difference.

Other common post-hoc tests include the following:

Tukey’s test – a common post-hoc test that compares every pair of group means while controlling the overall risk of type I error. It calculates Tukey’s Honestly Significant Difference (HSD) and reports an estimate of the difference between each pair of groups along with a confidence interval.
Scheffé’s test – a test that also adjusts the test statistics for comparisons between groups and calculates a 95% confidence interval around each difference, but in a more conservative way than Tukey’s test.

Less common post-hoc tests exist for various situations, summaries of which can be found here. These tests tend to give similar results and simply approach post-hoc analysis in different ways.

The Bonferroni test

The Bonferroni correction is calculated by taking the significance level at which you are conducting your hypothesis tests (usually α = 0.05) and dividing it by the number of separate tests performed. For example, if a researcher is investigating the difference between two treatments in 10 subgroups of patients (so n = 10 separate significance tests), the Bonferroni-corrected significance threshold is α/n = 0.05/10 = 0.005.

Hence, if any of the significance tests gave a p-value of <0.005, we would conclude that the test was significant at the overall 0.05 significance level and that there was evidence for a difference between the two treatments in that subgroup. Equivalently, each p-value can be multiplied by n (capped at 1) and compared with 0.05.
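The calculation above can be sketched in a few lines. The function name `bonferroni` and the ten subgroup p-values are hypothetical, chosen only to mirror the example of 10 subgroups tested at α = 0.05.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply a Bonferroni correction to a list of p-values.

    Returns the corrected threshold (alpha / n), the adjusted p-values
    (p * n, capped at 1) and a per-test flag for significance.
    """
    n = len(p_values)
    threshold = alpha / n
    adjusted = [min(1.0, p * n) for p in p_values]
    significant = [p < threshold for p in p_values]
    return threshold, adjusted, significant


# Hypothetical p-values from 10 subgroup comparisons, for illustration only.
subgroup_p_values = [0.001, 0.004, 0.012, 0.03, 0.048, 0.2, 0.35, 0.5, 0.72, 0.9]

threshold, adjusted, significant = bonferroni(subgroup_p_values)
print(f"Corrected threshold: {threshold:.4f}")
for p, p_adj, flag in zip(subgroup_p_values, adjusted, significant):
    print(f"p = {p:.3f}  adjusted = {p_adj:.3f}  significant = {flag}")
```

In this invented example, five of the ten p-values fall below 0.05, but only two remain significant once the threshold is corrected to 0.005.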

When not to use post-hoc tests?

As with many statistical procedures, there are disadvantages and sometimes even controversy attached to the use of post-hoc testing. Some statisticians prefer not to use post-hoc tests such as the Bonferroni test for three main reasons. First, adjusting for type I error inflates the risk of type II error (not rejecting the null hypothesis when it is in fact false). Second, the adjustment implies that a comparison should be interpreted differently according to how many other tests are performed. Third, post-hoc testing is often relied upon in the absence of a focused research question and approach to hypothesis testing.
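The trade-off between type I and type II error can be illustrated with a simplified calculation. The sketch below assumes a two-sided z-test with a known standard error, which is a simplification rather than the setting of any particular study; the function name `two_sided_z_power` and the chosen effect size of 2.8 standard errors are assumptions made for illustration.

```python
from statistics import NormalDist


def two_sided_z_power(effect_in_se, alpha):
    """Power of a two-sided z-test when the true difference equals
    `effect_in_se` standard errors and the test uses level `alpha`."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(critical - effect_in_se)) + z.cdf(-critical - effect_in_se)


effect = 2.8  # assumed true difference, in units of its standard error
power_unadjusted = two_sided_z_power(effect, alpha=0.05)
power_bonferroni = two_sided_z_power(effect, alpha=0.05 / 10)

print(f"Type II error at alpha = 0.05:  {1 - power_unadjusted:.2f}")
print(f"Type II error at alpha = 0.005: {1 - power_bonferroni:.2f}")
```

Under these assumptions, a real difference that would be detected about four times in five at the 0.05 level is detected only about half the time once the threshold is corrected for 10 tests.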

Instead, it is suggested that a study should be designed to be specific about which subgroup differences or hypotheses are of interest before an analysis is performed, so that conclusions are led by causal frameworks and prior knowledge rather than by the data and chance alone. An example of this in practice is the preregistration of a clinical trial, in which the researchers record and justify their hypotheses and study design before analysis takes place. With careful study design, analysis planning and interpretation of findings, many statisticians and analysts avoid post-hoc testing without forgoing methodological rigor.

Moreover, since the aim of post-hoc testing is to reinterpret or set a new criterion for reaching a ‘statistically significant’ finding, some argue that halting the use of post-hoc testing is compatible with the movement away from the concept of statistical significance more generally. P-values can mislead researchers, and have been shown to do so, and an over-reliance on the somewhat arbitrary threshold for a statistically significant result (<0.05) often ignores the context – such as statistical assumptions, data quality, prior studies in the area and underlying mechanisms – in which these findings are reached.
