The first thing you do with a big “pig” of a data set?
Weigh the pig.
last updated: 2021-10-02
summary() to describe data
Null Hypothesis Testing: Kind of a big deal
Your question (hypothesis) is stated in plain language
E.g. The number of pollinator species is greater in hedgerows when pesticides are not used nearby
Note this is phrased as a CLAIM
Your statistical hypothesis is formal and philosophical (and is typically not explicitly stated)
There are 2 parts
Null Hypothesis (technically the one you test): The mean species count of pollinators IS NOT different between hedgerows with and without pesticide use nearby
Alternative Hypothesis: The mean species count of pollinators IS different between them
The statistical hypothesis is framed using a SPECIFIC statistical test
We say we “test the null hypothesis” by computing a p-value
The p-value is the probability of seeing data at least as extreme as ours if the Null is true
We reject the Null when the p-value falls below alpha, the maximum risk of a false positive we are willing to accept (usually 0.05)
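The test-and-compare step can be sketched in R. This is a minimal illustration, not the analysis from these slides: it assumes R's built-in chickwts data set (datasets package) as a stand-in for real data, picks a Welch two-sample t-test as the specific test, and uses illustrative names (two_feeds, alpha, p_value, reject_null).

```r
# Built-in chick weights data (datasets package); an assumed stand-in
data(chickwts)

# Keep two feed groups so a two-sample test applies
two_feeds <- droplevels(subset(chickwts, feed %in% c("casein", "horsebean")))

# Null: mean weight does NOT differ between the two feeds
alpha <- 0.05                                      # maximum false-positive risk we accept
result <- t.test(weight ~ feed, data = two_feeds)  # Welch two-sample t-test
p_value <- result$p.value

# Reject the Null only if the p-value is below alpha
reject_null <- p_value < alpha
print(p_value)
print(reject_null)
```

The choice of test frames the statistical hypothesis: a different test (for example a rank-based one) would state the Null differently.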
summary()
Chick weights data
library(openxlsx) # assumed source of read.xlsx(); the library() call is not shown
chicks <- read.xlsx('data/2.1-chickwts.xlsx')
head(chicks)
##   weight      feed
## 1    179 horsebean
## 2    160 horsebean
## 3    136 horsebean
## 4    227 horsebean
## 5    217 horsebean
## 6    168 horsebean
str(chicks)
## 'data.frame':	71 obs. of  2 variables:
##  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
##  $ feed  : chr  "horsebean" "horsebean" "horsebean" "horsebean" ...
Summarize the whole data object
summary(chicks)
##      weight          feed
##  Min.   :108.0   Length:71
##  1st Qu.:204.5   Class :character
##  Median :258.0   Mode  :character
##  Mean   :261.3
##  3rd Qu.:323.5
##  Max.   :423.0
Subsetting and slicing
summary(object = chicks$weight[which(chicks$feed == "casein")])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   216.0   277.2   342.0   323.6   370.8   404.0
summary(object = chicks$weight[which(chicks$feed == "horsebean")])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   108.0   137.0   151.5   160.2   176.2   227.0
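Subsetting one feed at a time gets repetitive with six feeds. A minimal sketch of summarizing every group in one pass, assuming R's built-in chickwts data set (datasets package) matches the spreadsheet read into chicks; the names weights_by_feed and summary_by_feed are illustrative:

```r
# Built-in copy of the chick weights data (assumed equivalent to 'chicks')
data(chickwts)

# Split the weights into one numeric vector per feed type
weights_by_feed <- split(chickwts$weight, chickwts$feed)

# Apply summary() to each group; the result is a named list
summary_by_feed <- lapply(weights_by_feed, summary)

# Pull out single groups by name
summary_by_feed[["casein"]]
summary_by_feed[["horsebean"]]
```

Each list element holds the same six statistics that summary() prints for a single subset.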
Do not forget aggregate()
aggregate(x = chicks$weight,
          by = list(feed = chicks$feed),
          FUN = function(x) {
            c(mean = mean(x),
              sd = sd(x),
              SEM = sd(x) / sqrt(length(x)))
          })
##        feed    x.mean      x.sd     x.SEM
## 1    casein 323.58333  64.43384  18.60045
## 2 horsebean 160.20000  38.62584  12.21456
## 3   linseed 218.75000  52.23570  15.07915
## 4  meatmeal 276.90909  64.90062  19.56827
## 5   soybean 246.42857  54.12907  14.46660
## 6 sunflower 328.91667  48.83638  14.09785
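One thing to watch when FUN returns several values: aggregate() stores them as a single matrix column, even though the printout looks like separate columns. A hedged sketch of flattening the result, assuming the built-in chickwts data set (datasets package) as a stand-in for chicks and using the formula interface; stats_by_feed and flat are illustrative names:

```r
# Built-in copy of the chick weights data (assumed equivalent to 'chicks')
data(chickwts)

# Mean, standard deviation and standard error of the mean per feed
stats_by_feed <- aggregate(
  weight ~ feed,
  data = chickwts,
  FUN = function(w) {
    c(mean = mean(w), sd = sd(w), SEM = sd(w) / sqrt(length(w)))
  }
)

# Only two columns exist: 'feed' and a 3-column matrix called 'weight'
ncol(stats_by_feed)

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, stats_by_feed)
names(flat)  # "feed" "weight.mean" "weight.sd" "weight.SEM"
```

After flattening, individual statistics can be reached with the usual `$` syntax, for example flat$weight.SEM.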
summary() can be useful
There are a few classic graph types we always use
hist(): distribution of numeric var
boxplot(): central tendency of numeric ~ factor
plot(): y ~ x of 2 numeric vars
Histogram hist(): distribution of numeric var
hist(chicks$weight)
abline(v = mean(chicks$weight), col = 'red', lty = 2, lwd = 2) # vanity
Box plot boxplot(): central tendency of numeric ~ factor
select <- which(chicks$feed == 'meatmeal' | chicks$feed == 'horsebean')
# feed is on the x-axis and weight on the y-axis, so label them accordingly
boxplot(weight ~ feed, data = chicks[select, ],
        xlab = 'Feed', ylab = 'Weight (g)')
Scatterplot plot(): y ~ x of 2 numeric vars
x <- rnorm(10); y <- rnorm(10)
plot(y ~ x, col = 'red', pch = 16) # vanity
EDA:
Formal analysis
Best practice: