
Statistical Power | Acknowledgments: Y. Brandvain’s code snippets
2025-11-10
Which of the choices below is an appropriate alternative hypothesis for a null hypothesis which reads: “the mean systolic blood pressure for men is 120”?
A. The mean systolic blood pressure for men differs from 120.
B. The mean systolic blood pressure for men differs from that of women.
C. The mean systolic blood pressure for men is less than 120.
D. The mean systolic blood pressure for men is more than 120.
The P-value is best described as the probability that
A. We would observe the data we do if the null hypothesis is false.
B. We would observe the data we do if the null hypothesis is true.
C. The null hypothesis is false.
D. The null hypothesis is true.
State \(H_0\) and \(H_A\).
Calculate a test statistic.
Generate the null distribution.
Find critical value at specified \(\alpha\), and the p-value.
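A minimal sketch of these four steps in R, using a hypothetical coin-flipping example (the data and all numbers are invented for illustration):

```r
# Step 1: H0: p = 0.5 (fair coin) vs. HA: p != 0.5
# Step 2: the test statistic is the observed number of heads.
obs_heads <- 16   # hypothetical: 16 heads in 20 flips
n_flips   <- 20

# Step 3: generate the null distribution by simulation.
set.seed(1)
null_dist <- rbinom(n = 10000, size = n_flips, prob = 0.5)

# Step 4: two-tailed p-value -- how often is a simulated count at least
# as far from the expected value (10) as the observed count?
mean(abs(null_dist - n_flips / 2) >= abs(obs_heads - n_flips / 2))
```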
The Scream, by Edvard Munch - National Museum of Art, Architecture and Design, Public Domain
The p-value is the probability that a sample from the null model would be as or more extreme than our sample. It quantifies how surprised you should be by your data if the null were true.
More formal definitions:
The strength of evidence in the data against the null hypothesis
The long-run frequency of getting the same result or one more extreme if the null hypothesis is true.
p-value = the probability of obtaining a test statistic at least as extreme as the one from the data at hand, assuming the null hypothesis is true (and the other assumptions of the test hold).
Notice that this is a conditional probability, \(P(\text{observed effect} \mid \text{null hypothesis is true})\): the probability that something happens, given that various other conditions hold. One common misunderstanding is to neglect some or all of those conditions.
Evaluating where the test statistic lies on a sampling distribution built from the null model.
We generate this null sampling distribution by:
Simulation
Permutation (shuffling; see the sketch after this list)
Mathematical results developed by professional statisticians. These are built from the logic of probability theory.
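A minimal permutation sketch, assuming a hypothetical two-group dataset (the group labels and values are invented for illustration):

```r
# Hypothetical data: a numeric outcome measured in two groups.
set.seed(2)
values <- c(rnorm(15, mean = 10), rnorm(15, mean = 12))
groups <- rep(c("A", "B"), each = 15)

# Observed test statistic: difference in group means.
obs_diff <- diff(tapply(values, groups, mean))

# Null distribution by shuffling group labels, which breaks any
# association between group and outcome.
perm_diffs <- replicate(10000, {
  diff(tapply(values, sample(groups), mean))
})

# Two-tailed p-value: proportion of shuffles at least as extreme.
mean(abs(perm_diffs) >= abs(obs_diff))
```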
We would not be surprised if the sample below came from the null model.
We would be very surprised if the sample below came from the null model.
(Figures: one-tailed vs. two-tailed null distributions)
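In R, the `alternative` argument of the built-in tests switches between these; a quick sketch with made-up counts:

```r
# Hypothetical: 9 successes in 12 trials under H0: p = 0.5.
binom.test(9, 12, p = 0.5, alternative = "two.sided")  # two-tailed
binom.test(9, 12, p = 0.5, alternative = "greater")    # one-tailed (upper)
```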
The ASA (American Statistical Association) statement on P-values:
P-values can indicate how incompatible the data are with a specified statistical model (aka, the null hypothesis).
P-values DO NOT measure the probability that the studied hypothesis is true, NOR the probability that the data were produced by random chance alone.
P-values DO NOT measure the size of an effect or the importance of a result. (You can have tiny effects with highly significant P-values, and vice versa.)
Scientific conclusions & business or policy decisions should not be based only on whether a P-value passes a specific threshold.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Proper inference requires full reporting and transparency.
The significance level, \(\alpha\), is the probability used as the criterion for rejecting the null hypothesis.
The significance level \(\alpha\) is the probability of making the wrong decision when the null hypothesis is true.
If the P-value is \(\leq \alpha\), we reject the null hypothesis, and say the result is “statistically significant”
The value of the test statistic required to achieve \(P \leq \alpha\) is called the “critical value”
Steps 4-5: Find the critical value & P-value
Two-tailed P-value: \(P(X \leq 4)+ P(X \geq 16)\)
\(\approx 0.0059 + 0.0059 = 0.0118\)

\(P = 0.0118\), so this result is unlikely under the null.
We reject \(H_0\) at the \(\alpha = 0.05\) significance threshold.
We conclude that red shirts perform better than we can reasonably expect by chance.
But we recognize that, given many tries, a pattern this extreme (or even more extreme) can occur without any dependence between victory and shirt color.
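These numbers can be checked in R. This sketch assumes the counts behind the example are 16 wins in 20 contests under a fair null, i.e. \(X \sim \text{Binomial}(20, 0.5)\), which reproduces the p-value above:

```r
# Null model: X ~ Binomial(20, 0.5); observed: 16 wins.
pbinom(4, size = 20, prob = 0.5) +         # P(X <= 4)
  (1 - pbinom(15, size = 20, prob = 0.5))  # P(X >= 16)
#> [1] 0.01181793

# Because the binomial is discrete, the two-tailed rejection region at
# alpha = 0.05 is X <= 5 or X >= 15 (each tail holds about 0.021):
pbinom(5, 20, 0.5)        # ~0.021
1 - pbinom(14, 20, 0.5)   # ~0.021

# binom.test() packages the whole calculation:
binom.test(16, 20, p = 0.5)
```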
Even if you do everything right, hypothesis testing can get it wrong.
Hypothesis testing can go wrong in two ways: a Type I error (rejecting \(H_0\) when it is true; a false positive) or a Type II error (failing to reject \(H_0\) when it is false; a false negative).
Type I and type II errors are theoretical concepts. When you are analyzing a sample you don’t know whether you are making this kind of error or not. The only exception is when you’re simulating a scenario and checking the performance of a test on simulated data.
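For instance, a minimal simulation of the Type I error rate (the sample sizes and number of replicates below are arbitrary choices):

```r
# Simulate many datasets where H0 is TRUE (both groups share a mean of 0),
# run a t-test on each, and count how often we falsely reject at alpha = 0.05.
set.seed(3)
p_vals <- replicate(10000, {
  x <- rnorm(20)  # group 1, true mean 0
  y <- rnorm(20)  # group 2, true mean 0, so H0 holds
  t.test(x, y)$p.value
})
mean(p_vals < 0.05)  # should be close to the nominal rate, 0.05
```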
What improves statistical power?
Statistical power depends on the exact relationship between several features of the experiment: the significance threshold (\(\alpha\)), the size of the expected effect, the variation present in the population (variance), the alternative hypothesis (one- or two-sided), the nature of the test (paired or unpaired), and the sample size.
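Base R's `power.t.test()` makes these trade-offs concrete; the effect sizes and sample sizes below are arbitrary illustrations:

```r
# Power of a two-sample t-test for a hypothetical effect of 0.5 (sd = 1):
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power

# Larger samples or bigger effects raise power:
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 20, delta = 1.0, sd = 1, sig.level = 0.05)$power

# Or solve for the per-group n needed to reach 80% power:
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n
```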
“Imperfectly understood CIs are more useful and less dangerous than incorrectly understood P values.” – Hoenig and Heisey (2001) The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19-24.
If the parameter value stated in the null hypothesis falls outside of the 95% confidence interval you estimated, it is almost certain your P-value will be significant at \(\alpha = 0.05\).

If the parameter value stated in the null hypothesis falls inside of the 95% confidence interval you estimated, it is almost certain your P-value will be nonsignificant at \(\alpha = 0.05\).
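A quick check of this duality in R, using simulated data (the numbers are arbitrary; for the one-sample t-test the correspondence is exact):

```r
set.seed(4)
x <- rnorm(30, mean = 0.6, sd = 1)  # hypothetical sample

fit <- t.test(x, mu = 0)  # H0: mu = 0
fit$conf.int              # if 0 lies outside this 95% CI ...
fit$p.value               # ... the p-value is below 0.05, and vice versa
```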

Both use sample data to make inferences about the underlying population
CI
Estimation is quantitative: it puts a bound on the likely values of a parameter and gives the size of the effect
More revealing about the magnitude of the parameter or treatment effects
Does everything hypothesis testing can do
And yet they are far less used in biological research
Hypothesis Testing
Hypothesis testing is more qualitative: does the estimated parameter differ from a null model? Is there any effect at all (yes or no)?
Widely used in biology
Decide (yes/no) whether sufficient evidence has been presented to support a scientific claim
They work very well together: providing P-values along with point estimates and CIs is currently the best way to go.
Correlation \(\neq\) causation
Confounding variables: an unmeasured variable that may cause both X and Y
Observations vs. Experiments:
Observed correlations are intriguing, and can generate plausible and important hypotheses.
Correlations in treatment (x) and outcome (y) in well-controlled experimental manipulations more strongly imply causation because we can (try to) control for confounding variables.
|  | Important | Not important |
|---|---|---|
| Significant | Polio vaccine reduces incidence of polio | Things you don't care about, or already well-known things |
| Not significant | Suggestive evidence in a small study, leading to future work; OR no support, in a large study, for a thing thought to matter | Studies with small sample size and high P-value; OR things you don't care about |
In a study, the authors searched a health database of 10 million residents of Ontario, Canada, recording each person's reason for admission to hospital and their astrological sign. They then asked whether people with certain astrological signs are more likely to be admitted to the hospital for certain conditions.
Result: 72 diseases occurred more frequently in people with one astrological sign than in people with all other astrological signs, and this difference was statistically significant.
WOW!!!
Whenever you perform a hypothesis test, there is a chance of committing a type I error (rejecting the null hypothesis when it is actually true).
For ONE hypothesis test, the Type I error rate (false positive rate) is simply \(\alpha\), the significance level!
When we conduct multiple hypothesis tests at once, we have to deal with something known as a family-wise error rate, which is the probability that at least one of the tests produces a false positive:
\[\text{Family-wise error rate} = 1-(1-\alpha)^n\]
where \(\alpha\) is the significance level for a single hypothesis test and \(n\) is the total number of tests.
Astrology example:
For one hypothesis test: \(\text{Family-wise error rate}= 1- (1- 0.05)^1=0.05\)
For 5 hypothesis tests: \(\text{Family-wise error rate}= 1- (1- 0.05)^5=0.23\)
For 2,796 hypothesis tests: \(\text{Family-wise error rate}= 1- (1- 0.05)^{2796}\approx 1\)
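The same arithmetic in R (2,796 is the number of tests reported in the study above):

```r
# Family-wise error rate for n independent tests at significance level alpha:
fwer <- function(n, alpha = 0.05) 1 - (1 - alpha)^n
fwer(1)     # 0.05
fwer(5)     # ~0.23
fwer(2796)  # effectively 1
```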
One simple fix is the Bonferroni correction: divide the significance level by the number of tests. \[\alpha_{new}=\frac{\alpha_{orig}}{n}\]
Using the previous example, to keep the family-wise probability of making a Type I error at 5% in such a study, our new significance level would have to be:
Remember: there were 2,796 hypothesis tests in the study, and they used a criterion of \(\alpha=0.05\).
\(\alpha_{new}=\frac{0.05}{2796}=1.79\times 10^{-5}\)
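In practice, R's built-in `p.adjust()` applies this correction directly (the raw p-values below are made up):

```r
# Hypothetical raw p-values from four tests:
p_raw <- c(0.001, 0.012, 0.03, 0.2)

# Bonferroni-adjusted p-values (each raw p multiplied by n, capped at 1);
# comparing these to alpha is equivalent to comparing raw p to alpha / n.
p.adjust(p_raw, method = "bonferroni")
#> [1] 0.004 0.048 0.120 0.800
```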
B215: Biostatistics with R