last updated: 2021-10-02

Summarizing data

 

The first thing you do with a big “pig” of a data set?

Weigh the pig.

What you will learn

 

  • Null Hypothesis Testing concept
  • Using summary() to describe data
  • Exploratory graphs & data
  • Exploratory Data Analysis (EDA) concept
  • Statistical analysis plan
  • Practice exercises

Null Hypothesis Testing concept

 

Null Hypothesis Testing: Kind of a big deal

  • Start with a question (hypothesis)
  • Translate into a statistical hypothesis
  • (regular hypothesis != statistical hypothesis)
  • Design data collection to represent The Population of interest
  • Use an objective test

Null Hypothesis Testing concept

 

Your question (hypothesis) is stated in plain language

E.g. The number of pollinator species is greater in hedgerows when pesticides are not used nearby

Note this is phrased as a CLAIM

Null Hypothesis Testing concept

 

Your statistical hypothesis is formal and philosophical (and is typically not explicitly stated)

There are 2 parts

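Null Hypothesis (technically the one you test): The mean species count of pollinators IS NOT different between hedgerows with and without pesticide use nearby

Alternative Hypothesis: The mean species count of pollinators IS different between hedgerows with and without pesticide use nearby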

Null Hypothesis Testing concept

 

The statistical hypothesis is framed using a SPECIFIC statistical test

We say we “test the null hypothesis” by computing a p-value

The p-value is the probability of seeing data at least as extreme as ours if the Null is actually true

We reject the Null when the p-value falls below alpha, the maximum risk we accept of wrongly rejecting a true Null (usually 0.05)
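
A minimal sketch of the whole procedure in R, using simulated pollinator species counts rather than real field data. The group means, the sample size, the object names (`counts`, `test`, `alpha`) and the choice of Welch's t-test are all assumptions made for illustration.

```r
# Simulated (NOT real) species counts for 20 hedgerows per group
set.seed(42)
counts <- data.frame(
  pesticide = rep(c("no", "yes"), each = 20),
  species   = c(rpois(20, lambda = 12), rpois(20, lambda = 9))
)

# Null: mean species count is the same in both groups
# Alternative: mean species count differs between the groups
test <- t.test(species ~ pesticide, data = counts)  # Welch t-test by default

alpha <- 0.05          # maximum accepted risk of rejecting a true Null
test$p.value           # the p-value computed by the test
test$p.value < alpha   # TRUE means we reject the Null
```

The decision rule is fixed before looking at the result: alpha is chosen first, then the p-value is compared against it.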

Using summary()

Chick weights data

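library(openxlsx) # assumed source of read.xlsx(); the xlsx package's version also needs a sheet argument
chicks <- read.xlsx('data/2.1-chickwts.xlsx')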
head(chicks)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean

Using summary()

Chick weights data

str(chicks)
## 'data.frame':    71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : chr  "horsebean" "horsebean" "horsebean" "horsebean" ...

Using summary()

Summarize the whole data object

  • Just a way to peek at the data
  • Summarizing subsets may be desirable

summary(chicks)
##      weight          feed          
##  Min.   :108.0   Length:71         
##  1st Qu.:204.5   Class :character  
##  Median :258.0   Mode  :character  
##  Mean   :261.3                     
##  3rd Qu.:323.5                     
##  Max.   :423.0
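
For a character column, summary() only reports the length and class. A possible next step, sketched here, is to convert the grouping variable to a factor so that summary() counts the chicks per feed. The sketch uses R's built-in `chickwts` data, which is assumed (not confirmed) to hold the same values as the spreadsheet.

```r
# Built-in copy of the data; assumed to match data/2.1-chickwts.xlsx
chicks <- datasets::chickwts
chicks$feed <- as.character(chicks$feed)  # mimic the chr column shown by str()

# Convert the grouping variable to a factor
chicks$feed <- factor(chicks$feed)

# summary() now reports the number of chicks per feed
summary(chicks)
```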

Using summary()

Subsetting and slicing

summary(object = chicks$weight[which(chicks$feed == "casein")])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   216.0   277.2   342.0   323.6   370.8   404.0
summary(object = chicks$weight[which(chicks$feed == "horsebean")])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   108.0   137.0   151.5   160.2   176.2   227.0
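
Repeating the subsetting call for every feed gets tedious. One alternative, shown as a sketch, is tapply(), which applies summary() to each group in one call. The built-in `chickwts` data stands in for the spreadsheet here (an assumption), and `by_feed` is a name made up for the example.

```r
# Built-in copy of the data; assumed to match data/2.1-chickwts.xlsx
chicks <- datasets::chickwts

# Apply summary() to the weights of each feed group
by_feed <- tapply(chicks$weight, chicks$feed, summary)

# Pull out a single group by name
by_feed[["casein"]]
by_feed[["horsebean"]]
```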

Using summary()

Do not forget aggregate()

aggregate(x = chicks$weight, by = list(feed = chicks$feed), 
          FUN = function(x){ c(mean = mean(x), 
                               sd = sd(x),  
                               SEM = sd(x)/sqrt(length(x)))})
##        feed    x.mean      x.sd     x.SEM
## 1    casein 323.58333  64.43384  18.60045
## 2 horsebean 160.20000  38.62584  12.21456
## 3   linseed 218.75000  52.23570  15.07915
## 4  meatmeal 276.90909  64.90062  19.56827
## 5   soybean 246.42857  54.12907  14.46660
## 6 sunflower 328.91667  48.83638  14.09785
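
The same table can be built with the formula interface of aggregate(), which some find easier to read. One detail worth knowing: when FUN returns several values, the result holds them as a single matrix column, so the sketch below flattens it with do.call(). The built-in `chickwts` data and the names `stats_by_feed` and `flat` are assumptions for illustration.

```r
# Built-in copy of the data; assumed to match data/2.1-chickwts.xlsx
chicks <- datasets::chickwts

# Formula interface: summarize weight within each level of feed
stats_by_feed <- aggregate(weight ~ feed, data = chicks,
                           FUN = function(w) c(mean = mean(w),
                                               sd   = sd(w),
                                               SEM  = sd(w) / sqrt(length(w))))

# stats_by_feed has 2 columns: feed, plus a 3-column matrix named weight
ncol(stats_by_feed)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, stats_by_feed)
names(flat)
```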

Using summary()

 

  • summary() can be useful
  • Explore and understand variable type and tendency
  • Used for “knowing your data”
  • Looking for unusual values and errors
  • None of this is mandatory, but can be useful
  • Take care to keep a tidy script…
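
A few quick checks for unusual values and errors, sketched with the built-in `chickwts` data (assumed to match the spreadsheet). The particular checks and the name `out` are suggestions, not a fixed recipe.

```r
# Built-in copy of the data; assumed to match data/2.1-chickwts.xlsx
chicks <- datasets::chickwts

# Missing values per column
colSums(is.na(chicks))

# Group labels and sizes; typos in labels show up as extra groups
table(chicks$feed, useNA = "ifany")

# Impossible values (e.g. negative weights) stand out in the range
range(chicks$weight)

# Values beyond 1.5 * IQR from the hinges (the box plot rule)
out <- boxplot.stats(chicks$weight)$out
out
```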

Exploratory graphs & data

 

There are a few classic graph types we always use

  • Histogram hist(): distribution of numeric var
  • Box plot boxplot(): central tendency of numeric ~ factor
  • Scatterplot plot(): y ~ x of 2 numeric vars

Exploratory graphs & data

Histogram hist(): distribution of numeric var

hist(chicks$weight)
abline(v = mean(chicks$weight), 
       col = 'red', lty = 2, lwd = 2) # vanity

Exploratory graphs & data

Box plot boxplot(): central tendency of numeric ~ factor

# Keep only the rows for two feeds
select <- which(chicks$feed=='meatmeal' | chicks$feed=='horsebean')
# feed is on the x-axis and weight on the y-axis
boxplot(weight ~ feed, data = chicks[select, ], 
        xlab = 'Feed', ylab = 'Weight (g)')
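
To compare all feeds at once, it can help to order the boxes by their median. A sketch using reorder(), with the built-in `chickwts` data assumed to match the spreadsheet; `bp` is a name made up for the example.

```r
# Built-in copy of the data; assumed to match data/2.1-chickwts.xlsx
chicks <- datasets::chickwts

# Order the feed levels by median weight, then plot every group
bp <- boxplot(weight ~ reorder(feed, weight, FUN = median), data = chicks,
              xlab = 'Feed', ylab = 'Weight (g)')

# boxplot() also returns the plotted statistics invisibly
bp$names   # feeds from lowest to highest median
```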

Exploratory graphs & data

Scatterplot plot(): y ~ x of 2 numeric vars

x <- rnorm(10); y <- rnorm(10)
plot(y ~ x,
     col = 'red', pch = 16) # vanity

(EDA) Exploratory Data Analysis

EDA:

  • Begins every analysis in some form
  • Informal, may be haphazard
  • Testing assumptions (e.g. distributions)
  • Information for the analyst, not others
  • May or may not require reproducibility
  • Prior to “formal analysis”
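
As one example of informally testing a distributional assumption, the sketch below checks whether chick weights look roughly normal. It uses the built-in `chickwts` data (assumed to match the spreadsheet); the choice of a Q-Q plot and the Shapiro-Wilk test is only one of several reasonable options, and `sw` is a made-up name.

```r
# Built-in copy of the data; assumed to match data/2.1-chickwts.xlsx
chicks <- datasets::chickwts

# Graphical check: points close to the line suggest approximate normality
qqnorm(chicks$weight)
qqline(chicks$weight, col = 'red')

# Numerical check: Shapiro-Wilk test (Null: the data are normally distributed)
sw <- shapiro.test(chicks$weight)
sw$p.value
```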

(EDA) Exploratory Data Analysis

Formal analysis

  • Designed to generate EVIDENCE for CLAIMS
  • Strictly reproducible
  • Generate INFORMATION for others, not just the analyst
  • Usually contains graphical and statistical components

Statistical analysis plan

Best practice:

  • Hypothesis, objective, claim
  • Specify statistical model
  • Data collection methods, sampling
  • Identify effect size
  • Justify sample size
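
Effect size and sample size are linked through a power calculation. A minimal sketch with power.t.test() from base R's stats package follows. The numbers (a 50 g difference, a 60 g standard deviation, 80% power) are hypothetical planning values, not estimates from any study, and `plan` is a made-up name.

```r
# Hypothetical planning values for a two-group comparison
plan <- power.t.test(delta = 50,        # smallest difference worth detecting (g)
                     sd = 60,           # assumed within-group standard deviation (g)
                     sig.level = 0.05,  # alpha
                     power = 0.80)      # chance of detecting delta if it is real

# Sample size per group, rounded up to whole individuals
ceiling(plan$n)
```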

Live coding
