2025-10-29
Two key goals of inferential statistics:
Estimate the characteristics of a population of interest based on a limited sample.
Test hypotheses.
Q: If sampling error is unavoidable, how can I trust an estimate? What is its precision? Why measure anything anyway? HELP!!!!
A: If we know something about how the sampling process affects the estimates we get, that informs our confidence in those estimates and the significance level in hypothesis testing.
Q: Why do we need this?
A:
To learn about the whole population.
Test claims or hypotheses.
Make predictions with statistical models.
Calculate confidence intervals and p-values to measure uncertainty.
Q: How??
A: yes, young padawans. We use the sampling distribution – I know, panic not
Q: But why do we need the sampling distribution?
A:
Let’s start from the very beginning. Consider the roll of a six-sided die.
This probability distribution shows all possible outcomes of a single die roll and how likely they are. I.e., in the long term, if you roll the die MANY times, you would expect equal proportions of each result.
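The long-run claim can be checked with a small simulation. This is only a sketch in Python using the standard library; the function name `roll_proportions` and the seed are made up for illustration.

```python
import random
from collections import Counter

def roll_proportions(n_rolls, seed=1):
    """Roll a fair six-sided die n_rolls times; return the proportion of each face."""
    rng = random.Random(seed)  # fixed, arbitrary seed so the run is reproducible
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

# Few rolls: proportions are uneven. Many rolls: each one settles near 1/6 (about 0.167).
for n in (60, 60_000):
    props = roll_proportions(n)
    print(n, {face: round(p, 3) for face, p in props.items()})
```

With only 60 rolls the proportions typically wander away from 1/6; with 60,000 they should all sit close to it.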
The sampling distribution can help us here.
The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size \(n\) .
It may be considered as the distribution of the statistic for all possible samples from the same population of a given sample size.
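For the die, "all possible samples" can be listed exhaustively when \(n\) is small, so the sampling distribution of the mean can be computed exactly rather than simulated. A minimal sketch in Python, assuming a fair die and \(n=2\) rolls per sample; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def exact_sampling_distribution_of_mean(n):
    """Enumerate all 6**n equally likely samples of n die rolls; return P(mean = value)."""
    samples = list(product(range(1, 7), repeat=n))
    counts = Counter(Fraction(sum(s), n) for s in samples)
    return {mean: Fraction(c, len(samples)) for mean, c in sorted(counts.items())}

dist = exact_sampling_distribution_of_mean(2)
for mean, prob in dist.items():
    print(float(mean), prob)
```

Although each single face is equally likely, the sample means are not: a mean of 3.5 is the most likely value (probability 1/6), while means of 1 and 6 are the rarest (1/36 each).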
One possible outcome leads to this histogram:
Population distribution \(\neq\) Sample distribution \(\neq\) SamplING distribution
In our previous example, we were interested in the lengths of human genes.
Say we estimate the mean length from one sample of 20 genes.
In this case, the statistic is the sample mean, so what we want is the samplING distribution of mean gene lengths.
We can start building this by collecting the individual samPLE means we calculated before: \(\bar{X}=\{3110.3,2412,2911.45,...\}\), and making a histogram of \(\bar{X}\)s.
If instead of taking 3 independent samples of 20 genes we take, say, 100,000, the histogram of their means shows the sampling distribution.
Grand mean of \(\bar{X}\) approaches population mean \(\mu\)
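This behaviour can be illustrated by simulation. The sketch below is in Python and uses a HYPOTHETICAL right-skewed population of 20,000 values as a stand-in for gene lengths; it is not the real gene data, and the lognormal parameters, the seed and the function name `sample_means` are assumptions made only for illustration.

```python
import random
import statistics

rng = random.Random(42)  # arbitrary seed for reproducibility

# HYPOTHETICAL population: 20,000 right-skewed "gene lengths" (not the real data).
population = [rng.lognormvariate(7.5, 0.8) for _ in range(20_000)]
mu = statistics.fmean(population)  # population mean (the parameter)

def sample_means(population, n, n_samples, rng):
    """Draw n_samples random samples of size n and return the mean of each one."""
    return [statistics.fmean(rng.sample(population, n)) for _ in range(n_samples)]

# The grand mean of the sample means gets closer to mu as we collect more samples.
for n_samples in (3, 100, 10_000):
    means = sample_means(population, 20, n_samples, rng)
    print(n_samples, round(statistics.fmean(means), 1), "vs mu =", round(mu, 1))
```

A histogram of `means` from the last iteration would approximate the sampling distribution of the mean for samples of 20.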
Compare:
Parameter:
\(\sigma_{\bar{X}}=\frac{\sigma}{\sqrt{n}}=\frac{2833.3}{\sqrt{20}}\approx 633.5\)
Estimates:
\(s_{\bar{X}}=543\)
estimated from the sampling distribution
\(SEM_{\bar{X}}=\frac{1517.998}{\sqrt{20}}\approx339.4\)
estimated from the first sample of 20 genes
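The three quantities compared above can be computed side by side in a simulation. Again a sketch in Python on a HYPOTHETICAL skewed population (not the gene data, so the numbers will not match the ones above); the variable names are made up for illustration.

```python
import math
import random
import statistics

rng = random.Random(7)  # arbitrary seed for reproducibility
n = 20

# HYPOTHETICAL population standing in for gene lengths (not the real data).
population = [rng.lognormvariate(7.5, 0.8) for _ in range(20_000)]

# Parameter: true standard error of the mean, sigma / sqrt(n).
sigma_xbar = statistics.pstdev(population) / math.sqrt(n)

# Estimate 1: standard deviation of many simulated sample means.
means = [statistics.fmean(rng.sample(population, n)) for _ in range(10_000)]
s_xbar = statistics.stdev(means)

# Estimate 2: SEM from a single sample of n, s / sqrt(n).
first_sample = rng.sample(population, n)
sem = statistics.stdev(first_sample) / math.sqrt(n)

print("sigma / sqrt(n):      ", round(sigma_xbar, 1))
print("SD of sample means:   ", round(s_xbar, 1))
print("SEM from one sample:  ", round(sem, 1))
```

With many simulated samples, the standard deviation of the sample means should land close to \(\sigma/\sqrt{n}\). The SEM from a single sample of 20 can be noticeably off in either direction, because it relies on that one sample's \(s\).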
\(\bar{X}\pm SE\)
In our previous example, our first sample had \(\bar{X}=2533.9\) and \(s_{X}=1517.998\), so \(SEM_{\bar{X}}=\frac{1517.998}{\sqrt{20}}\approx339.4\)
So we would report \(2533.9 \pm339.4 \text{ (SEM)}\). Later we will also see how the \(SEM\) is used in plots (error bars).
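The arithmetic can be wrapped in two small helpers. This is a sketch in Python; the function names `sem_from_sd` and `mean_and_sem` are hypothetical, and the four-value toy sample is invented purely to show the call.

```python
import math
import statistics

def sem_from_sd(s, n):
    """Standard error of the mean from a sample standard deviation s and sample size n."""
    return s / math.sqrt(n)

def mean_and_sem(values):
    """Return (sample mean, SEM) computed from raw observations."""
    return statistics.fmean(values), sem_from_sd(statistics.stdev(values), len(values))

# Using the summary statistics of the first sample of 20 genes:
print(f"2533.9 ± {sem_from_sd(1517.998, 20):.1f} (SEM)")  # prints: 2533.9 ± 339.4 (SEM)

# Using raw data (toy values, invented for illustration only):
toy_mean, toy_sem = mean_and_sem([2, 4, 4, 6])
print(round(toy_mean, 2), round(toy_sem, 2))
```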
B215: Biostatistics with R