last updated: 2021-10-19

Correlation

Ice cream sales and forest fires are correlated because both occur more often in the summer heat. But, correlation does not imply causation.

-Nate Silver

What you will learn

  • The question of correlation
  • Data and assumptions
  • Graphing
  • Tests and alternatives
  • Practice exercises

The question of correlation

  • Measure of association
  • 2 (or more) numeric variables
  • General correlation: a relationship
  • Specific correlation: correlation coefficient

Graph a correlation with a scatterplot

Given some data:
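
The veg and arth vectors used in the examples are not defined in this text. As a minimal sketch, assuming nothing about the real measurements, stand-in data can be simulated so that the later commands run. The seed, sample size, units, and coefficients are all hypothetical, so the numbers produced will not match the output shown on these slides.

```r
# Hypothetical stand-in data: NOT the original veg/arth measurements
set.seed(42)                                  # arbitrary seed, for reproducibility only
n    <- 99                                    # assumed sample size (gives df = n - 2 = 97)
veg  <- rnorm(n, mean = 50, sd = 10)          # vegetation biomass, made-up units
arth <- 5 + 0.3 * veg + rnorm(n, sd = 4)      # arthropod abundance: linear in veg plus noise

head(data.frame(veg, arth))                   # inspect the first few rows
```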

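# Scatterplot of arthropod abundance against vegetation biomass
# (veg and arth are numeric vectors of equal length)
plot(x = veg, y = arth,
     xlab = 'Vegetation biomass', ylab = 'Arthropod abundance',
     main = 'A positive correlation',
     pch = 16, col = 'blue')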

Data and assumptions

The correlation coefficient is the covariance of 2 numeric variables divided by the product of their standard deviations

-1 ≤ r ≤ 1
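
Written out, the verbal definition above corresponds to the following formula. The symbols are notation chosen here, not taken from the slides: x and y are the two variables, s_x and s_y their sample standard deviations, and n the number of paired observations. The (n - 1) terms in the covariance and the standard deviations cancel, which gives the second form.

```latex
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```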

# The 'hard' way
# (sample) covariance
cov_veg_arth <- sum( (veg-mean(veg))*(arth-mean(arth))) / 
                      (length(veg) - 1 )

# r
(r_arth_veg <- cov_veg_arth / (sd(veg) * sd(arth)))
## [1] 0.6056694

Assumptions:

  • linear relationship between variables

  • Gaussian distribution for each variable
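
One possible way to check these assumptions informally is sketched below. It uses simulated stand-in data (hypothetical, not the original measurements) so that it runs on its own. The choice of checks (a scatterplot with a straight-line fit and a smoother, Q-Q plots, and the Shapiro-Wilk test) is an assumption about what is useful here, not a prescription from the slides.

```r
# Hypothetical stand-in data, simulated only so this sketch is self-contained
set.seed(1)
veg  <- rnorm(99, mean = 50, sd = 10)
arth <- 5 + 0.3 * veg + rnorm(99, sd = 4)

# Linearity: compare a straight-line fit with a flexible smoother
plot(veg, arth, pch = 16,
     xlab = 'Vegetation biomass', ylab = 'Arthropod abundance')
abline(lm(arth ~ veg), col = 'blue')            # straight line
lines(lowess(veg, arth), col = 'red', lty = 2)  # smoother; should track the line

# Gaussian distribution: Q-Q plots for each variable
op <- par(mfrow = c(1, 2))
qqnorm(veg, main = 'veg');   qqline(veg)
qqnorm(arth, main = 'arth'); qqline(arth)
par(op)

# Shapiro-Wilk test for each variable (null: the data are Gaussian)
sw_veg  <- shapiro.test(veg)
sw_arth <- shapiro.test(arth)
sw_veg
sw_arth
```

If the smoother bends clearly away from the straight line, or the Q-Q points stray far from the reference line, a rank-based alternative may be the safer choice.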

# The 'easy' way
cor(veg, arth)
## [1] 0.6056694

Graphing

A range of correlation magnitudes and signs
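
The figure for this slide is not reproduced in this text. As a rough sketch, a similar set of panels can be simulated. The target values, sample size, and seed below are arbitrary choices. The construction y = r * x + sqrt(1 - r^2) * e, with x and e independent standard normal draws, gives a population correlation of r; each sample value will differ somewhat from its target.

```r
set.seed(123)                                   # arbitrary seed
n        <- 500                                 # points per panel (assumed)
target_r <- c(-0.9, -0.5, 0, 0.3, 0.6, 0.9)     # hypothetical target correlations

op <- par(mfrow = c(2, 3))                      # 2 x 3 grid of panels
sample_r <- sapply(target_r, function(r) {
  x <- rnorm(n)
  y <- r * x + sqrt(1 - r^2) * rnorm(n)         # population correlation of x and y is r
  plot(x, y, pch = 16, col = 'blue',
       main = paste('r =', round(cor(x, y), 2)))
  cor(x, y)                                     # return the sample correlation
})
par(op)

round(sample_r, 2)                              # sample correlations, one per panel
```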

The pairs() function is useful for exploratory data analysis (EDA)

data(iris)
# pairs plot
pairs(iris[ , 1:4], pch = 16, 
      col = iris$Species) # Set color to species...
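
As a numerical companion to the pairs plot, cor() also accepts a data frame of numeric columns and returns the full correlation matrix. This is a small sketch using the same built-in iris data; rounding to two decimals is just a display choice.

```r
data(iris)

# Pairwise Pearson correlations among the four numeric columns
cor_mat <- cor(iris[ , 1:4])
round(cor_mat, 2)
```

Each off-diagonal entry corresponds to one panel of the pairs plot, so the matrix and the plot can be read side by side.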

Tests and alternatives

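  • Significance test (null hypothesis: true correlation = 0)
  • cor.test() function
  • Pearson correlation is the traditional choice, but it comes with assumptions (linear relationship, Gaussian variables)
  • If those assumptions are not met, the Spearman rank correlation is an alternative, though generally a less powerful one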
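
A brief sketch of the Spearman alternative, using simulated data in which the relationship is monotonic but clearly not linear. The data, seed, and variable names x and y are hypothetical. Both cor() and cor.test() take a method argument; 'pearson' is the default.

```r
set.seed(7)                            # arbitrary seed
x <- runif(60, min = 0, max = 5)
y <- exp(x) + rnorm(60, sd = 5)        # increases with x, but strongly curved

cor(x, y, method = 'pearson')          # measures linear association
cor(x, y, method = 'spearman')         # measures monotonic association, based on ranks

sp_test <- cor.test(x, y, method = 'spearman')
sp_test                                # the estimate is reported as rho, not r
```

With curved data like these, the rank-based coefficient will typically be the larger of the two, because ranks ignore the shape of the curve and respond only to its direction.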

cor.test(veg, arth)
## 
##  Pearson's product-moment correlation
## 
## data:  veg and arth
## t = 7.4966, df = 97, p-value = 3.101e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4637006 0.7173146
## sample estimates:
##       cor 
## 0.6056694

Reporting results (NEVER PASTE RAW OUTPUT)

We found a significant positive correlation between vegetation biomass and arthropod abundance (Pearson’s r = 0.61, df = 97, P < 0.0001).
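
Rather than copying numbers by hand, the values for such a sentence can be pulled from the object that cor.test() returns. The sketch below uses simulated stand-in data (hypothetical, so the values will differ from those above). The sprintf() format and the P < 0.0001 cutoff are one possible convention, not a required one.

```r
# Hypothetical stand-in data, simulated only so this sketch is self-contained
set.seed(42)
veg  <- rnorm(99, mean = 50, sd = 10)
arth <- 5 + 0.3 * veg + rnorm(99, sd = 4)

ct <- cor.test(veg, arth)

# Components of the test result
ct$estimate     # correlation coefficient r
ct$parameter    # degrees of freedom (n - 2)
ct$p.value      # P value
ct$conf.int     # 95 percent confidence interval for r

# Assemble a reporting string from those components
report <- sprintf("Pearson's r = %.2f, df = %d, P %s",
                  ct$estimate,
                  as.integer(ct$parameter),
                  if (ct$p.value < 0.0001) '< 0.0001'
                  else paste('=', signif(ct$p.value, 2)))
report
```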

Live coding