Confidence Intervals
2025-11-03
Error bars
Confidence intervals
Pseudoreplication
Estimate the standard error of the mean for a sample of size \(n=16\) that has \(s=25\):
\(SEM=\frac{s}{\sqrt{n}}=\frac{25}{\sqrt{16}}=6.25\)
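The same arithmetic can be checked in a few lines of code. This is a minimal sketch in Python; the function name `standard_error_of_mean` is made up for illustration and is not part of the course material.

```python
import math

def standard_error_of_mean(s: float, n: int) -> float:
    """Return SEM = s / sqrt(n) for sample standard deviation s and sample size n."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return s / math.sqrt(n)

# Values from the exercise above: s = 25, n = 16
sem = standard_error_of_mean(s=25, n=16)
print(sem)  # 6.25
```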
| Statistic | Parameter (population) | Point estimate (sample) |
|---|---|---|
| Mean | \(\mu\) | \(\bar x, \bar y, \bar X\), etc |
| Proportion | \(p\) | \(\hat{p}\) |
| Standard deviation | \(\sigma\) | \(s\) |
What are they?
Lines that extend from a point estimate to reflect the degree of uncertainty of the parameter being estimated
We’ve seen them several times in this lecture!

What do they measure?
Unfortunately, it depends on what one wants to convey:
Disclaimer: this is a general statistical principle, and in this course we will abide by it. Standard deviation is a measure of spread/variability, not of uncertainty. You may learn elsewhere that some fields tolerate this more than others.
If ≤ 100 data points: just plot them all in a strip chart (remember to add jitter and transparency)
Otherwise, to avoid clutter: violin plot, CDF, boxplot

Use CI (easier to interpret, wider) or SEM (narrower, harder to interpret) to compare:
Example: showing that the second plague (Yersinia pestis) pandemic shaped the gene pool of certain populations
A very asymmetrical distribution. Remade from Colquhoun (2003). The number of citations (X-axis) received (within five years of publication) by 500 papers randomly chosen from the journal Nature. The bin width is 20 citations. The first bar shows that 74 papers received between zero and 20 citations, the second bar shows that 80 papers received between 21 and 40 citations, and so on. Example from: Intuitive Biostatistics
If you were only told that the mean and SD are 101 and 43, or only saw plot A below, you would imagine something like B, but C and D have these same summaries!
The mean and SD can be misleading. All four data sets in the graph have approximately the same mean (101) and SD (43). If you were only told those two values or only saw the bar graph (A), you’d probably imagine that the data look like Data set B. Data sets C and D have very different distributions, but the same mean, SD, and SEM as Data set B. Example from: Intuitive Biostatistics
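To see how differently shaped data can share the same summaries, here is a rough Python sketch. It does not reproduce the data sets from Intuitive Biostatistics: the three shapes, the sample size of 200 and the helper `rescale` are all assumptions chosen for illustration. Each simulated data set is shifted and stretched so that its mean is 101 and its sample SD is 43.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def rescale(values, target_mean, target_sd):
    """Shift and stretch values to the target mean and sample SD (up to rounding error)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [target_mean + target_sd * (v - m) / s for v in values]

n = 200  # hypothetical number of observations per data set
shapes = {
    "symmetric": [random.gauss(0, 1) for _ in range(n)],
    "right-skewed": [random.expovariate(1.0) for _ in range(n)],
    "bimodal": [random.gauss(-2, 0.3) if i % 2 else random.gauss(2, 0.3)
                for i in range(n)],
}

# Force every data set to mean 101 and SD 43, the values quoted in the figure
datasets = {name: rescale(vals, 101, 43) for name, vals in shapes.items()}

for name, vals in datasets.items():
    # Mean and SD agree across data sets; the median exposes the different shapes
    print(name,
          round(statistics.mean(vals), 1),
          round(statistics.stdev(vals), 1),
          round(statistics.median(vals), 1))
```

The mean and SD columns are identical by construction, so only a plot of the raw data (or another summary such as the median) reveals the difference.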
How will your reader know if it’s the SD, SEM, CI, range, or something else?
Source: Mentioned in Intuitive Biostatistics as Frazer et al. (2006), a study of the bladder-relaxing effects of norepinephrine in old rats.
Loosely, attempt to define a range of numbers that might include the parameter of interest
Also a way of quantifying uncertainty
A “\(1-\alpha\)” confidence interval is an interval \((v_1, v_2)\), where \(v_1\) and \(v_2\) are instances of random variables satisfying \[P(v_1 < \theta < v_2) \geq 1 - \alpha\]
If \(\alpha=0.05\), we have a \(95\%\) CI.
If \(\alpha=0.01\), we have a \(99\%\) CI, etc.
95% and 99% are different confidence levels.
E.g. for a confidence interval spanning values from 0.1 to 0.5 (inclusive):
Preferable:
Using the parameter (e.g. if we’re estimating the mean): \(0.1<\mu<0.5\)
\(\left[0.1,0.5\right]\)
\(0.3\pm 0.2\)
Fine:
0.1 to 0.5
Do not use: 0.1-0.5 or 0.1–0.5
Why? This can be confusing for negative numbers
✅ Correct ✅
~95% of 95% confidence intervals calculated from samples include the population mean.
We are 95% confident that the population mean lies within the 95% confidence interval.
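😔 Incorrect
There is a 95% probability that the population mean lies within a particular 95% confidence interval.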
Why? The CI either contains the parameter or it does not contain it. The probability is associated with the process that generated the interval. And if we repeat this process many times, 95% of all intervals should in fact contain the true value of the parameter.
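The "repeat the process many times" idea can be simulated. The sketch below, in Python, is only an illustration under stated assumptions: the population is normal with a made-up mean of 10 and SD of 3, each sample has \(n=25\), and the interval is \(\bar X \pm t \times SEM\) with the two-sided 95% critical value of Student's t for 24 degrees of freedom (about 2.064) hard-coded rather than looked up.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

MU, SIGMA = 10.0, 3.0   # hypothetical population mean and SD
N = 25                  # sample size per replicate
T_CRIT = 2.064          # two-sided 95% critical value of Student's t, df = N - 1 = 24
REPLICATES = 2000

def ci_95(sample):
    """Return (lower, upper) of a 95% CI for the mean, using the t critical value above."""
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - T_CRIT * sem, mean + T_CRIT * sem

hits = 0
for _ in range(REPLICATES):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    lower, upper = ci_95(sample)
    if lower < MU < upper:  # each interval either contains the parameter or it does not
        hits += 1

coverage = hits / REPLICATES
print(f"Fraction of intervals containing the population mean: {coverage:.3f}")
```

The printed fraction should land close to 0.95. Any single interval either contains the population mean or misses it; the 95% describes the procedure, not one interval.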
The width of the CI is approximately proportional to the reciprocal of the square root of the sample size
\[CI\propto \frac{1}{\sqrt{n}}\]
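A quick numerical check of this scaling, as a Python sketch. It assumes the rough rule \(\bar X \pm 2\times SEM\), so the width is \(4\times SEM\), and it holds \(s=25\) fixed. In real data \(s\) and the critical value also change a little with \(n\), which is why the relationship is only approximate. The function name `rough_ci_width` is made up for illustration.

```python
import math

def rough_ci_width(s: float, n: int) -> float:
    """Width of the rough 95% CI: (mean + 2*SEM) - (mean - 2*SEM) = 4 * s / sqrt(n)."""
    return 2 * (2 * s / math.sqrt(n))

# Quadrupling the sample size halves the width
for n in (16, 64, 256):
    print(n, rough_ci_width(25, n))
# 16 25.0
# 64 12.5
# 256 6.25
```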
Two ways, mainly:
For all other cases:
Assumptions
If and only if:
THEN …
If assumptions are met, then…
a rough estimate of the 95% CI for the mean is:
Lower 95% CI: \(\bar X - 2\times SEM_{\bar X}\)
Upper 95% CI: \(\bar X + 2\times SEM_{\bar X}\)
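As a worked example of the two lines above, here is a short Python sketch. The sample mean of 100 is a made-up value; \(s=25\) and \(n=16\) are taken from the SEM exercise earlier. The function name `rough_ci_95` is also just for illustration.

```python
import math

def rough_ci_95(mean: float, s: float, n: int) -> tuple[float, float]:
    """Rough 95% CI for the mean: mean ± 2 * SEM, with SEM = s / sqrt(n)."""
    sem = s / math.sqrt(n)
    return mean - 2 * sem, mean + 2 * sem

# Hypothetical sample mean of 100, with s = 25 and n = 16 (so SEM = 6.25)
lower, upper = rough_ci_95(mean=100, s=25, n=16)
print(lower, upper)  # 87.5 112.5
```

Keep in mind that the factor 2 is a rule of thumb. For small samples the exact multiplier from the t distribution is larger, so this rough interval is somewhat too narrow.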
100 replicates of \(n=5\) gene samples drawn from all human genes
Another 100 replicates of \(n=5\)
From: makeameme.org
B215: Biostatistics with R