Multiple testing problem | Acknowledgments: didactical materials from www.serjeonstats.com and https://bookdown.org/mike/data_analysis/
2025-11-12
“… the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.” – Wikipedia (https://en.wikipedia.org/wiki/Data_dredging)
“P-hacking … is also known as data-dredging, snooping, fishing, significance-chasing and double-dipping. ‘P-hacking,’ says Simonsohn, ‘is trying multiple things until you get the desired result’, even unconsciously.” – Nuzzo, “Scientific method: Statistical errors”, Nature
When conducting multiple hypothesis tests simultaneously, we increase the probability of false positives (Type I errors).
Suppose we perform \(n\) independent hypothesis tests, each at significance level \(\alpha\).
The probability of making at least one Type I error (false positive) is:
\[P(\text{at least one false positive}) = 1 - (1 - \alpha)^n\]
This probability is known as the family-wise error rate (FWER).
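Although the formula looks innocuous, the FWER grows quickly with \(n\). As a minimal R sketch (the values of \(\alpha\) and \(n\) below are chosen purely for illustration), we can evaluate it for a few test counts:

```r
# FWER = probability of at least one false positive among n independent
# tests, each performed at significance level alpha
alpha <- 0.05
n <- c(1, 5, 10, 20, 100)
fwer <- 1 - (1 - alpha)^n
round(fwer, 3)
# With 20 tests, the chance of at least one Type I error is already ~0.64
```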
There are two groups of approaches to deal with this:
- controlling the family-wise error rate (FWER)
- controlling the false discovery rate (FDR)
FWER is a metric that quantifies the risk caused by multiple testing. It is the probability of making one or more false discoveries (Type I errors) when performing multiple hypothesis tests.
Large circles: sets of tests corrected for family-wise error. Small squares are null effects, small circles are real effects. Filled circles and squares represent rejections of the null hypothesis (“positives”). Two studies (filled large circles) produced one or more Type I errors (filled blue squares). Thus, \(FWER=2/9=0.22\).
Simply put, \(FDR = FP / (FP + TP)\), where \(FP\): false positive; \(TP\): true positive.
These are the same ‘data’ depicted in the first figure, but highlighting the information relevant to FDR. We have a total of 24 null rejections/significant effects (filled squares and circles), of which 3 are Type I errors (filled blue squares), so \(FDR = 3/24 = 0.125\).
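To see the \(FDR = FP/(FP + TP)\) idea in action, here is a rough simulation sketch. The mixture of 900 true nulls and 100 real effects, and the Beta distribution used for the non-null p-values, are assumptions made only for illustration; this is not the data behind the figures.

```r
# Simulate 1000 tests: 900 true nulls and 100 real effects,
# all tested at alpha = 0.05 with no multiplicity correction
set.seed(1)
is_null <- c(rep(TRUE, 900), rep(FALSE, 100))
# p-values: uniform under the null, concentrated near 0 for real effects
p <- ifelse(is_null, runif(1000), rbeta(1000, 1, 30))

significant <- p < 0.05
fp <- sum(significant & is_null)   # false positives
tp <- sum(significant & !is_null)  # true positives
fp / (fp + tp)                     # empirical FDR among the "discoveries"
```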
Controlling FWER: ensures that no more than \(100\times \alpha\%\) of your studies will produce any (\(\geq 1\)) Type I errors. One way (and the most common) to do this is the Bonferroni correction, which we saw last time.
Controlling FDR: ensures that, on average across your statistical tests, at most \(100\times \alpha\%\) of your significant results are really Type I errors. One way to correct for FDR (and the most common) is the Benjamini–Hochberg procedure (BH).
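Both corrections are available in R through p.adjust() (described again below). A minimal sketch, using a small set of made-up p-values purely for illustration:

```r
# Hypothetical p-values from six independent tests
p <- c(0.001, 0.008, 0.02, 0.04, 0.06, 0.20)

p.adjust(p, method = "bonferroni")  # FWER control (conservative)
p.adjust(p, method = "BH")          # FDR control (Benjamini-Hochberg)

# Compare the adjusted p-values to alpha = 0.05: BH typically lets
# more tests survive than Bonferroni does
```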
Use FDR correction when:
- You perform many hypothesis tests (e.g., genomics).
- You want to balance false positives and false negatives.
- Bonferroni is too strict, leading to low power.
Do NOT use FDR correction if:
- Strict control of any false positives is required (e.g., drug approval studies).
- There are only a few tests, in which case Bonferroni is appropriate.
The p.adjust() function from base R performs both of these corrections, as well as several others, selected via its method argument.

Q: Can P values be negative? No. P values are fractions, so they are always between 0.0 and 1.0.
Q: Can a P value equal 1.0? A P value would equal 1.0 only in the rare case in which the treatment effect in your sample precisely equals the one defined by the null hypothesis. When a computer program reports that the P value is 1.0000, it often means that the P value is greater than 0.9999.
Q: Should P values be reported as fractions or percentages? By tradition, P values are always presented as fractions and never as percentages.
Q: Is a one-tailed P value always equal to half the two-tailed P value? Not always. Some sampling distributions are asymmetrical. Even if the distribution is symmetrical (as most are), the one-tailed P value is only equal to half the two-tailed value if you correctly predicted the direction of the difference (correlation, association, etc.) in advance. If the effect actually went in the opposite direction to your prediction, the one-tailed P will be >0.5 and greater than the two-tailed P value.
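A quick R sketch of this relationship, using simulated data in which the true difference goes in the predicted direction (the group names and effect sizes are assumptions for illustration only):

```r
set.seed(2)
x <- rnorm(30, mean = 1)  # group predicted in advance to have the larger mean
y <- rnorm(30, mean = 0)

t.test(x, y, alternative = "two.sided")$p.value  # two-tailed
t.test(x, y, alternative = "greater")$p.value    # one-tailed, predicted direction (~half the two-tailed value)
t.test(x, y, alternative = "less")$p.value       # one-tailed, wrong direction (> 0.5)
```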
A reasonable definition of the P value is that it measures the strength of evidence against the null hypothesis. However, unless statistical power is very high (>90%), the P value does not do this reliably.
| Approach | Questions Answered |
|---|---|
| Confidence Interval | What range of true effect is consistent with the data? |
| P-value | What is the strength of the evidence that the true effect is not zero? |
| Hypothesis Test | Is there enough evidence to make a decision based on a conclusion that the effect is not zero? |
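For instance, a single one-sample t-test in R reports all three pieces of information. The data below are simulated only to illustrate the output (the variable name measurements is hypothetical):

```r
set.seed(3)
measurements <- rnorm(25, mean = 0.4, sd = 1)  # hypothetical sample

result <- t.test(measurements, mu = 0)
result$conf.int        # confidence interval: range of true effects consistent with the data
result$p.value         # strength of evidence that the true effect is not zero
result$p.value < 0.05  # the hypothesis-test decision at alpha = 0.05
```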
P-values are often misunderstood because they answer a question that you never thought to ask.
In many situations, you know before collecting any data that the null hypothesis is almost certainly false. The difference or correlation or association in the overall population might be trivial, but it almost certainly isn’t zero.
The null hypothesis is usually that in the populations being studied there is zero difference between the means, zero correlation, etc.
People rarely conduct experiments or studies in which it is even conceivable that the null hypothesis is true.
Clinicians and scientists find it strange to calculate the probability of obtaining results that weren’t actually obtained. The math of theoretical probability distributions is beyond the easy comprehension of most scientists.

“If all else fails use significance at the \(\alpha=0.05\) level and hope no one notices”. https://xkcd.com/1478/
“The value for which P=0.05, or 1 in 20 … it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not…”
“If one in twenty does not seem high enough odds, we may … draw the line at one in fifty …, or one in a hundred…. Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level.”
“It is usual and convenient for experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results.” – R.A. Fisher (statistician/eugenicist/geneticist)
OK, I am exaggerating, but you get the point.
1) Absolutists
“Every time you say ‘trending towards significance’, a statistician somewhere trips and falls down.” – someone on twitter
- Consider nuanced statements about significance (e.g., “nearly statistically significant”) a sin
- Reject things like declaring a P-value much smaller than 0.05 “highly significant”
Strengths:
Weaknesses:
2) Continualists
- Consider the job of the P-value to be expressing the strength of evidence against the null hypothesis
- Reasoning: P-values are continuous quantities that are prone to sampling error. If the strength of evidence is continuous (rather than binary), our inferential conclusions should be too.
- Report P-values (usually with CIs of the sample statistic) and let the reader interpret them based on their level of skepticism.
Strengths:
- Sees the P-value as subject to sampling error (which it is!)
- Recognizes and distinguishes between results that provide weak evidence, moderate evidence, and strong evidence against the null.
Weaknesses:
As always, there is a trade-off. Just don’t change strategies after you run your tests.
B215: Biostatistics with R