1.1. Statistics and Samples

Bárbara D. Bitarello

2025-09-03

Overview

  • Goals of statistics
  • Populations and Samples
  • Statistics and parameters
  • Estimate errors due to sampling
  • Types of experiments
  • Types of variables

Learning Goals

  • Understand the major goal of statistics
  • Distinguish between a sample and a population
  • Distinguish between an estimate and a parameter
  • Identify why estimates from samples may deviate from parameters of populations
  • Identify the properties of a good sample
  • Be able to detect the differences between observational and experimental studies

What is statistics?

Tweet by user @kareem_carr: “In my opinion, the core of statistics is critical thinking with numbers. Some disciplines teach you how to think with numbers and others teach you critical thinking but statistics is where you learn to put the two together and this skill is essential for life in the 21st century.” (January 12, 2021)

Tweet from someone I don’t know, but very on point, so thank you!

What is statistics?

venn diagram shows statistics as the intersection of critical thinking and thinking with numbers

What is BIOstatistics?

venn diagram shows biostatistics as the intersection of critical thinking, biological data, and thinking with numbers

Why should you care about statistics?

  • Biology (ecology, genetics, immunology, microbiology, …)
  • Biomedical sciences
  • Public health
  • Data science
  • Economics, Psychology, Social Science

Why should anyone care about statistics?

  • Good science!
  • Critical evaluation of “scientific evidence”
  • Statistics and probability are not intuitive
  • We tend to jump to conclusions and we are very often wrong
  • Transferable skills

This is my adorable dog, Cashew

Challenge: data deluge

Cover of the magazine “The economist”. The heading reads: “The data deluge - And how to handle it: A 14-page special report”. Under the heading is a drawing of a person holding an umbrella in a rainstorm, but the rain is all 1s and 0s. The umbrella top is open but upside-down, collecting water which the person is using to water a plant from the handle of the umbrella.

Cover from “The Economist” (Feb 27-March 10, 2010)

Challenge: understanding

Comic from xkcd.com: lists p-values ranging from 0.001 to >=0.1 on the left, and on the right offers an interpretation.

From: https://xkcd.com/1478/

Goals

Goal: learn about the world

Meme from meme generator reads: Question: “What do we want?”, Response: “Learn about the world!!!”, Question: “Can we look at the entire world?”, Response: “No!!!”.

From: imgflip.com/memegenerator

In practice, then…

  • Challenge: We can’t look at the whole world.

  • Solution: Take a sample and generalize outward.

Great but…

  • New Challenge: Samples deviate from populations by chance (sampling error) or because of unrepresentative sampling (sampling bias).

  • Solution:

    • Estimate the values of important parameters

    • Have measures of uncertainty for estimates

    • Test hypotheses about those parameters
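
A minimal sketch of the first two steps in this solution (estimating a parameter and attaching a measure of uncertainty), written in Python. The counts are hypothetical, and the interval uses the normal approximation for a proportion, which is only one of several possible choices and is rough for small samples.

```python
import math

def estimate_proportion(successes, n):
    """Return (estimate, standard_error, (low, high)) for a sample proportion."""
    p_hat = successes / n
    # Standard error of a proportion: how much the estimate is expected
    # to vary from one random sample of size n to the next.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    # Approximate 95% interval: estimate plus or minus 1.96 standard errors
    # (normal approximation, an assumption of this sketch).
    low = max(0.0, p_hat - 1.96 * se)
    high = min(1.0, p_hat + 1.96 * se)
    return p_hat, se, (low, high)

# Hypothetical sample: 12 individuals with the trait among 40 observed.
p_hat, se, ci = estimate_proportion(12, 40)
print(f"estimate = {p_hat:.3f}, SE = {se:.3f}, "
      f"approx. 95% interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval is wide because the sample is small: with the same observed proportion, a larger sample would give a smaller standard error.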

Let’s try again

  • Statistics is a quantitative technology for empirical science.
  • A logic and methodology for the measurement of uncertainty and for an examination of that uncertainty.
  • The key word here is uncertainty. Statistics becomes necessary when observations are variable.

What about biostats?

  • What is the motivating biological question?
  • What experiments can be done and/or data can be collected to address this question?
  • Do results support an interesting conclusion?
  • What are the shortcomings/limitations of statistical models and causal frameworks in the analysis?
  • How do I best communicate my results (including estimates, visualizations, conclusions, and caveats)?

The Central Obsession

  • Question: How do we make inferences about the WORLD from our finite observations?
  • Answer: Make models to account for the process of sampling and the associated hazards.

Important distinctions

  • Populations vs. samples
  • Parameters vs. estimates

Populations

In Biology: a collection of interbreeding individuals of the same species that live in sufficient proximity that most mates are drawn from this collection of individuals. This mostly applies to animals and, to some extent, plants.

In Statistics: the entire collection of individual units that a researcher is interested in. E.g. all women born in the US between 1990 and 2000; all polar bears currently living in zoos; users of a certain social network in a certain age group; etc.

Statistics & parameters

A parameter is some property of the world, i.e., the “truth”

A population of starfish

  • Parameters describe populations
  • E.g. proportion of pink starfish among all starfish of a given species in a certain location

A sample of starfish

  • Estimates (statistics) approximate parameters as inferred from samples
  • We estimate the proportion of pink starfish from this sample and extrapolate it to the population as an approximation.
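
The distinction between a parameter and an estimate can be sketched with a small simulation. Everything here is hypothetical: a made-up population of 10,000 starfish in which 30% are pink, and simple random samples of 25. In real work the parameter is unknown; the simulation fixes it only so the estimates can be compared against it.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 starfish, 30% of them pink.
population = ["pink"] * 3000 + ["orange"] * 7000

# The parameter: the true proportion of pink starfish in the population.
parameter = population.count("pink") / len(population)

def sample_estimate(pop, n):
    """Draw a simple random sample of size n and return the proportion of pink."""
    sample = random.sample(pop, n)  # sampling without replacement
    return sample.count("pink") / n

# Five independent samples give five different estimates of one parameter.
estimates = [sample_estimate(population, 25) for _ in range(5)]
print("parameter:", parameter)
print("estimates from five random samples:", estimates)
```

The estimates differ from each other and from the parameter purely by chance. That chance variation is sampling error.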

In summary

Parameters and populations

  • Parameters describe Populations
  • Because we can’t sample an entire population, we usually don’t know parameters.

Estimates and samples

  • But we can get a good sense of the parameters from estimates we make from samples.
  • Estimates approximate parameters as inferred from Samples

Sampling: What could go wrong?

Meet sampling error and sampling bias

Sampling bias

  • If you collected these against a dark background without careful procedures…
  • This sample is biased. It contains a higher proportion of orange stars than the population from which it was taken. Therefore, it is not a representative sample of the underlying population.

Sampling bias

Systematic difference between parameters & estimates.
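
A hedged sketch of what "systematic" means here, again with a made-up starfish population. It assumes, purely for illustration, that an orange star is twice as likely to be noticed and collected as a pink one. Averaging over many samples removes most of the chance variation, so what remains in the biased procedure is the bias itself.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical population: 1,000 starfish, 30% pink (the parameter).
population = ["pink"] * 300 + ["orange"] * 700

# Assumption of this sketch: orange stars are twice as likely to be collected.
weights = [2.0 if colour == "orange" else 1.0 for colour in population]

def random_sample_estimate(n):
    """Proportion of pink in a simple random sample (every star equally likely)."""
    sample = random.sample(population, n)
    return sample.count("pink") / n

def biased_sample_estimate(n):
    """Proportion of pink when collection favours orange stars.

    random.choices draws with replacement, a simplification accepted here.
    """
    sample = random.choices(population, weights=weights, k=n)
    return sample.count("pink") / n

# Average each procedure over many repeated samples of 50 stars.
random_mean = statistics.mean(random_sample_estimate(50) for _ in range(2000))
biased_mean = statistics.mean(biased_sample_estimate(50) for _ in range(2000))

print(f"parameter:                  0.300")
print(f"mean of random estimates:   {random_mean:.3f}")
print(f"mean of biased estimates:   {biased_mean:.3f}")
```

Individual random samples still miss the parameter, but their misses go in both directions and average out. The biased procedure misses in the same direction every time, and a larger sample would not fix it.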

The 1936 Election

Republican Alf Landon

Democrat Franklin Roosevelt

1936 Literary Digest poll

2.4 million responses to 10 million questionnaires, sent to people from telephone books and club lists.

1936 Literary Digest poll

Figure made in R with code borrowed from Y. Brandvain

Election: Roosevelt won in a landslide

This plot was made in R

Sampling Bias in the 1936 Polls

  • The questionnaire was more likely to reach wealthier people (who could afford phones and club memberships) than those with fewer means.
  • Voting and party preference are correlated with wealth.
  • Poorer people (underrepresented in the poll) supported Roosevelt, carrying him to victory.
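
The mechanism in these three points can be worked through as arithmetic. All numbers below are invented for illustration and are not historical figures: the share of wealthier voters, the support for a candidate in each group, and the chance that the questionnaire reaches each group are assumptions of this sketch.

```python
# Hypothetical inputs (NOT historical data).
share_wealthy = 0.25      # fraction of voters who are wealthier
support_wealthy = 0.35    # candidate's support among wealthier voters
support_poorer = 0.65     # candidate's support among poorer voters
reach_wealthy = 0.40      # chance a wealthier voter ends up in the poll
reach_poorer = 0.10       # chance a poorer voter ends up in the poll

share_poorer = 1 - share_wealthy

# True support in the population: groups weighted by their real sizes.
true_support = share_wealthy * support_wealthy + share_poorer * support_poorer

# Expected support in the poll: groups weighted by size times reach,
# so the easier-to-reach group counts for more than its real share.
polled_wealthy = share_wealthy * reach_wealthy
polled_poorer = share_poorer * reach_poorer
poll_support = (
    polled_wealthy * support_wealthy + polled_poorer * support_poorer
) / (polled_wealthy + polled_poorer)

print(f"true support:          {true_support:.3f}")
print(f"expected poll support: {poll_support:.3f}")
```

With these made-up inputs the candidate has majority support in the population but appears to be losing in the poll, and collecting more responses under the same procedure would not change that expected value.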

That’s all for today

Forrest says "And that's all I wanted to say about that"

From: makeameme.org