R Bootcamp 2.1 Question & Explore

last updated: 2021-10-02

Summarizing data

The first thing you do with a big “pig” of a data set?

Weigh the pig.

What you will learn

Null Hypothesis Testing concept
Using summary() to describe data
Exploratory graphs & data
Exploratory Data Analysis (EDA) concept
Statistical analysis plan
Practice exercises

Null Hypothesis Testing concept

Null Hypothesis Testing: Kind of a big deal

Start with a question (hypothesis)
Translate into a statistical hypothesis
(regular hypothesis != statistical hypothesis)
Design data collection to represent The Population of interest
Use an objective test

Null Hypothesis Testing concept

Your question (hypothesis) is stated in plain language

E.g. The number of pollinator species is greater in hedgerows when pesticides are not used nearby

Note this is phrased as a CLAIM

Null Hypothesis Testing concept

Your statistical hypothesis is formal and philosophical (and is typically not explicitly stated)

There are 2 parts

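Null Hypothesis (technically the one you test): The mean species count of pollinators IS NOT different between hedgerows with and without pesticide use nearby

Alternative Hypothesis: The mean species count of pollinators IS different between hedgerows with and without pesticide use nearby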

Null Hypothesis Testing concept

The statistical hypothesis is framed using a SPECIFIC statistical test

We say we “test the null hypothesis” by computing a p-value

The p-value is the probability of seeing data at least as extreme as ours if the Null were true

We compare the p-value to the maximum risk we accept of wrongly rejecting a true Null (alpha, usually 0.05)
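The comparison of a p-value against a threshold can be sketched in a few lines. This is only an illustration: the pollinator counts are not available here, so it assumes R's built-in `chickwts` data as a stand-in and assumes a Welch two-sample t-test is the chosen test. The names `alpha`, `p_value` and `reject_null` are hypothetical.

```r
# Assumption: the built-in `chickwts` data stand in for the pollinator
# counts, which are not part of this material.
casein    <- chickwts$weight[chickwts$feed == "casein"]
horsebean <- chickwts$weight[chickwts$feed == "horsebean"]

# Null hypothesis: the two group means are NOT different.
# t.test() on two vectors runs a Welch two-sample t-test by default.
result <- t.test(casein, horsebean)

alpha       <- 0.05               # threshold chosen before looking at the data
p_value     <- result$p.value     # p-value computed by the test
reject_null <- p_value < alpha    # TRUE means we reject the null

print(result)
print(reject_null)
```

The threshold is set before the test is run, so the decision rule does not depend on the result.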

Using `summary()`

Chick weights data

library(openxlsx) # assumed source of read.xlsx(); this call signature matches openxlsx
chicks <- read.xlsx('data/2.1-chickwts.xlsx')
head(chicks)

##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean

Using `summary()`

Chick weights data

str(chicks)

## 'data.frame':    71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : chr  "horsebean" "horsebean" "horsebean" "horsebean" ...

Using `summary()`

Summarize the whole data object

Just a way to peek at the data
Summarizing subsets may be desirable

summary(chicks)

##      weight          feed          
##  Min.   :108.0   Length:71         
##  1st Qu.:204.5   Class :character  
##  Median :258.0   Mode  :character  
##  Mean   :261.3                     
##  3rd Qu.:323.5                     
##  Max.   :423.0
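Because feed is stored as character, summary() reports only its length and class. A minimal sketch of one way to get counts per feed type is to convert the column to a factor first. To stay self-contained, it assumes the built-in `chickwts` data match the spreadsheet and rebuilds `chicks` from them; `feed_counts` is a hypothetical name.

```r
# Assumption: built-in `chickwts` mirrors the spreadsheet; feed is stored as
# character here to match what str(chicks) reports.
chicks <- data.frame(weight = chickwts$weight,
                     feed   = as.character(chickwts$feed),
                     stringsAsFactors = FALSE)

chicks$feed <- factor(chicks$feed)  # treat feed as a categorical variable
summary(chicks)                     # feed column now shows a count per level

feed_counts <- table(chicks$feed)   # the same counts as a standalone table
print(feed_counts)
```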

Using `summary()`

Subsetting and slicing

summary(object = chicks$weight[which(chicks$feed == "casein")])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   216.0   277.2   342.0   323.6   370.8   404.0

summary(object = chicks$weight[which(chicks$feed == "horsebean")])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   108.0   137.0   151.5   160.2   176.2   227.0

Using `summary()`

Do not forget aggregate()

aggregate(x = chicks$weight, by = list(feed = chicks$feed), 
          FUN = function(x){ c(mean = mean(x), 
                               sd = sd(x),  
                               SEM = sd(x)/sqrt(length(x)))})

##        feed    x.mean      x.sd     x.SEM
## 1    casein 323.58333  64.43384  18.60045
## 2 horsebean 160.20000  38.62584  12.21456
## 3   linseed 218.75000  52.23570  15.07915
## 4  meatmeal 276.90909  64.90062  19.56827
## 5   soybean 246.42857  54.12907  14.46660
## 6 sunflower 328.91667  48.83638  14.09785
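One wrinkle with this aggregate() call: because FUN returns a vector, the result holds a single matrix column named x, even though it prints like three separate columns. A possible way to flatten it into ordinary columns is sketched below, again assuming the built-in `chickwts` data as a stand-in for the spreadsheet; `agg` and `flat` are hypothetical names.

```r
# Assumption: built-in `chickwts` mirrors the spreadsheet used in the slides.
chicks <- data.frame(weight = chickwts$weight,
                     feed   = as.character(chickwts$feed))

agg <- aggregate(x = chicks$weight, by = list(feed = chicks$feed),
                 FUN = function(x) c(mean = mean(x),
                                     sd   = sd(x),
                                     SEM  = sd(x) / sqrt(length(x))))

dim(agg)      # 2 columns only: `feed` and the matrix column `x`

# Expand the matrix column into separate numeric columns
flat <- do.call(data.frame, agg)
names(flat)   # feed, x.mean, x.sd, x.SEM
```

After flattening, a column such as `flat$x.mean` can be used directly for sorting or plotting.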

Using `summary()`

summary() can be useful
Explore and understand variable type and tendency
Used for “knowing your data”
Looking for unusual values and errors
None of this is mandatory, but it often catches problems early
Take care to keep a tidy script…
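A few quick checks for unusual values and errors might look like the sketch below. It is one possible set of checks, not a required routine, and it assumes the built-in `chickwts` data as a stand-in for the spreadsheet. All object names are hypothetical.

```r
# Assumption: built-in `chickwts` mirrors the spreadsheet used in the slides.
chicks <- data.frame(weight = chickwts$weight,
                     feed   = as.character(chickwts$feed))

n_missing    <- colSums(is.na(chicks))              # missing values per column
feed_levels  <- table(chicks$feed, useNA = "ifany") # typos show up as extra levels
weight_range <- range(chicks$weight)                # impossible values show up here
outliers     <- boxplot.stats(chicks$weight)$out    # points beyond the boxplot whiskers

print(n_missing)
print(feed_levels)
print(weight_range)
print(outliers)
```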

Exploratory graphs & data

There are a few classic graph types we always use

Histogram hist(): distribution of numeric var
Box plot boxplot(): central tendency of numeric ~ factor
Scatterplot plot(): y ~ x of 2 numeric vars

Exploratory graphs & data

Histogram hist(): distribution of numeric var

hist(chicks$weight)
abline(v = mean(chicks$weight), 
       col = 'red', lty = 2, lwd = 2) # vanity

Exploratory graphs & data

Box plot boxplot(): central tendency of numeric ~ factor

select <- which(chicks$feed=='meatmeal' | chicks$feed=='horsebean')
boxplot(weight ~ feed, data = chicks[select, ], 
        xlab = 'Feed', ylab = 'Weight (g)') # x axis is the factor, y axis is the numeric

Exploratory graphs & data

Scatterplot plot(): y ~ x of 2 numeric vars

x <- rnorm(10); y <- rnorm(10)
plot(y ~ x,
     col = 'red', pch = 16) # vanity

(EDA) Exploratory Data Analysis

EDA:

Begins every analysis in some form
Informal, may be haphazard
Testing assumptions (e.g. distributions)
Information for the analyst, not others
May or may not require reproducibility
Prior to “formal analysis”
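As an illustration of testing a distributional assumption during EDA, the sketch below checks whether the residuals from a simple model look roughly Gaussian. The choice of lm(), a Q-Q plot and shapiro.test() is an assumption made for this example, and it uses the built-in `chickwts` data as a stand-in for the spreadsheet.

```r
# Assumption: built-in `chickwts` mirrors the spreadsheet used in the slides.
fit <- lm(weight ~ feed, data = chickwts)  # group means model
res <- residuals(fit)                      # what is left after removing group means

# Graphical check: points should fall close to the line if residuals are Gaussian
qqnorm(res)
qqline(res, col = 'red', lty = 2)

# Formal check: the null hypothesis is that the residuals ARE Gaussian
sw <- shapiro.test(res)
print(sw)
```

The plot is usually the more informative of the two, since the formal test is sensitive to sample size.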

(EDA) Exploratory Data Analysis

Formal analysis

Designed to generate EVIDENCE for CLAIMS
Strictly reproducible
Generate INFORMATION for others
Usually contains graphical and statistical components

Statistical analysis plan

Best practice:

Hypothesis, objective, claim
Specify statistical model
Data collection methods, sampling
Identify effect size
Justify sample size
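The last two items can be made concrete with a power calculation. The sketch below uses power.t.test() from base R's stats package. The effect size (50 g), standard deviation (60 g) and target power (0.80) are hypothetical values chosen only for illustration, and a two-sample t-test is assumed as the planned analysis.

```r
# All numbers below are hypothetical planning values, not results.
plan <- power.t.test(delta       = 50,    # smallest difference in means worth detecting
                     sd          = 60,    # assumed within-group standard deviation
                     sig.level   = 0.05,  # maximum accepted false positive rate
                     power       = 0.80,  # desired chance of detecting `delta`
                     type        = "two.sample",
                     alternative = "two.sided")

n_per_group <- ceiling(plan$n)  # round up to whole individuals per group
print(plan)
print(n_per_group)
```

Leaving `n` unspecified asks the function to solve for the sample size given the other inputs.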

Live coding
