last updated: 2021-10-11

Sampling and distributions

A curve has been found representing the frequency distribution of standard deviations of samples drawn from a normal population.

-Gosset. 1908, Biometrika 6:25.

What you will learn

 

  • Use of the histogram
  • Gaussian ain’t normal
  • Poisson
  • Binomial
  • Diagnosing the distribution
  • Practice exercises

Use of the histogram

# Let's simulate some fake weight data for 10,000 cats
set.seed(42)
cats <- rnorm(n = 10000, mean = 4, sd = 0.5)

Use of the histogram

  • Bars are counts of observations
  • ‘Bins’ are non-overlapping intervals
  • The shape is diagnostic of the distribution
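
A minimal sketch of drawing the histogram for the simulated cat weights. The `breaks = 30` value, the title and the axis label are illustrative assumptions, not part of the original slides. `hist()` returns the bin edges and counts invisibly, so the bullets above can be checked directly.

```r
# Re-create the simulated cat weights so this block is self-contained
set.seed(42)
cats <- rnorm(n = 10000, mean = 4, sd = 0.5)

# Draw the histogram; 'breaks' is only a suggestion that R rounds to "pretty" values
h <- hist(cats,
          breaks = 30,
          main = "Simulated cat weights",
          xlab = "Weight")

# Bars are counts of observations: every cat lands in exactly one bin
sum(h$counts) == length(cats)

# Bins are non-overlapping: the bin edges are strictly increasing
all(diff(h$breaks) > 0)
```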

Gaussian ain’t normal

Things you can measure with continuous precision

  • The Gaussian is sometimes referred to as the ‘normal’ distribution
  • That name implies it is typical
  • The Gaussian ain’t necessarily typical!
  • Described by mean and std. dev. (the Gaussian parameters)
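
As a rough illustration of what "described by mean and std. dev." means, the sketch below estimates both parameters from simulated data and overlays the matching Gaussian curve on a density-scaled histogram. The data and the names `m` and `s` are assumptions made for this example.

```r
# Simulated Gaussian data (illustrative values)
set.seed(42)
cats <- rnorm(n = 10000, mean = 4, sd = 0.5)

# The two Gaussian parameters, estimated from the sample
m <- mean(cats)
s <- sd(cats)

# Histogram on the density scale so the curve and the bars are comparable
hist(cats, freq = FALSE, breaks = 30,
     main = "Histogram with fitted Gaussian", xlab = "Weight")
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)

# For Gaussian data, roughly 68% of observations fall within 1 std. dev. of the mean
prop.within.1sd <- mean(abs(cats - m) <= s)
prop.within.1sd
```

If the curve tracks the bars closely, the two numbers `m` and `s` summarise the data well; if it does not, the Gaussian is probably the wrong description.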

Gaussian ain’t normal

Mean

# Data
myvar <- c(1,4,8,3,5,3,8,4,5,6)

# Mean the "hard" way
(myvar.mean <- sum(myvar)/length(myvar))
## [1] 4.7
# Mean the easy way
mean(myvar)
## [1] 4.7

Gaussian ain’t normal

Standard Deviation

# Variance the "hard" way
# (NB this is the sample variance, dividing by n - 1)
sum((myvar - myvar.mean)^2) / (length(myvar) - 1)
## [1] 4.9
# Variance the easy way 
var(myvar)
## [1] 4.9
# Std dev the "hard" way (square root of the variance)
sqrt(var(myvar))
## [1] 2.213594
# Std dev the easy way
sd(myvar)
## [1] 2.213594

Poisson

Counts of rare events (like deaths from being kicked by a horse in the Prussian army…)

 

  • Usually low mean value
  • Described by a single parameter \(\lambda\)
  • \(\lambda\) is both the mean and the variance (so the std. dev. is \(\sqrt{\lambda}\))
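
For a Poisson distribution the mean and the variance both equal \(\lambda\), so the standard deviation is \(\sqrt{\lambda}\). A quick simulation can check this. The large sample size and the name `counts` are choices made for this sketch only.

```r
# Simulate many Poisson counts so the sample estimates settle near the true values
set.seed(42)
lambda <- 3
counts <- rpois(n = 100000, lambda = lambda)

mean(counts)   # close to lambda
var(counts)    # also close to lambda
sd(counts)     # close to sqrt(lambda), not lambda
sqrt(lambda)
```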

Poisson

set.seed(42)
mypois <- rpois(n = 100, lambda = 3)
hist(mypois,
     main = "Ewes with triplets",
     xlab = "Count of Triplets")

Binomial

Counts of events with exactly two outcomes, one of which might be a “success” (like ‘heads’ or ‘tails’, live or die, diseased or healthy, etc.)

 

  • Sometimes we are interested in the probability of success
  • Described by 2 parameters: the probability of success \(p\) and the number of trials \(n\)
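
A minimal sketch of simulating binomial data, in the same style as the Poisson example. The values (10 trials, success probability 0.3) and the plot labels are made up for illustration. Note that in `rbinom()` the argument `n` is the number of simulated observations, while `size` is the number of trials per observation.

```r
# Two binomial parameters (illustrative values)
n.trials  <- 10    # number of trials per observation
p.success <- 0.3   # probability of success on each trial

set.seed(42)
mybinom <- rbinom(n = 100, size = n.trials, prob = p.success)

hist(mybinom,
     main = "Successes out of 10 trials",
     xlab = "Count of successes")

# The expected number of successes is (number of trials) * (probability of success)
mean(mybinom)
n.trials * p.success
```

Unlike a Poisson count, a binomial count is bounded: it can never exceed the number of trials.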

Diagnosing the distribution

 

A very common task faced when handling data is “diagnosing the distribution”. Just like a human doctor diagnosing an ailment, you examine the evidence, consider the alternatives, judge the context, and take a guess.

Diagnosing the distribution

 

  • Form an expectation based on the type of data
  • Graph the data and look
  • Compare the data with the expected theoretical distribution and with several other known distributions
  • Try a transformation (e.g. to ‘coerce’ the data to Gaussian)
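
One possible way to work through these steps is sketched below, using a normal Q-Q plot as the "graph and compare" step and a log transformation as the "coerce" step. The right-skewed data are simulated with `rlnorm()` purely for illustration; with real data the appropriate transformation is a judgment call.

```r
# Simulated right-skewed data (log-normal), standing in for real measurements
set.seed(42)
skewed <- rlnorm(n = 1000, meanlog = 0, sdlog = 1)

# Compare against the Gaussian before and after a log transformation
old.par <- par(mfrow = c(1, 2))
qqnorm(skewed, main = "Raw data")
qqline(skewed)
qqnorm(log(skewed), main = "Log-transformed")
qqline(log(skewed))
par(old.par)

# A simple numeric hint of skew: the mean sits well above the median in the raw data
mean(skewed) - median(skewed)
mean(log(skewed)) - median(log(skewed))
```

Points that follow the reference line suggest the Gaussian is a reasonable description; strong curvature, as in the raw panel, suggests it is not.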

Live coding