last updated: 2021-10-26

Regression

Whenever the correlation between two scores is imperfect, there will be regression to the mean.

-Daniel Kahneman


Linear regression is a foundational tool for scientists

  • Used to quantify the relationship between two variables (on its own it does not demonstrate causation)

  • Specific numerical predictions

  • Foundation for other linear models…

What you will learn

 

  • The question of simple linear regression
  • Data and assumptions
  • Graphing
  • Tests and alternatives
  • Practice exercises

The question of simple linear regression

 

“Does X explain significant variation in Y”

The main question we tend to ask is whether the slope of the regression (\(\beta\)) is different from zero

\(Y = \alpha + \beta X + \epsilon\)
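To make the model concrete, here is a minimal sketch that simulates data from \(Y = \alpha + \beta X + \epsilon\) and recovers the coefficients with lm(). The true values (alpha = 1, beta = 2), the sample size, and all object names are illustrative assumptions and have nothing to do with the fish data.

```r
# Simulate data from Y = alpha + beta * X + epsilon
# (all numeric values here are illustrative assumptions)
set.seed(42)
n     <- 200
alpha <- 1                               # assumed true intercept
beta  <- 2                               # assumed true slope
x     <- runif(n, min = 0, max = 10)
eps   <- rnorm(n, mean = 0, sd = 1)      # Gaussian error term
y     <- alpha + beta * x + eps

# Fit the simple linear regression and pull out the estimates
fit_sim <- lm(y ~ x)
est     <- coef(fit_sim)
print(est)

# P-value for the test that the slope differs from zero
slope_p <- summary(fit_sim)$coefficients["x", "Pr(>|t|)"]
print(slope_p)
```

The estimates should land close to the assumed values, and the slope test asks exactly the question posed above: is \(\beta\) different from zero?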

Data and assumptions

Formal assumptions

Data and assumptions

Informal assumptions (the ones we have responsibility to evaluate)

  • (Relationship is linear)

  • (Numeric continuous data)

  • The residuals are “Gaussian”

  • Homoscedasticity

  • Independence of observations

Data and assumptions

Kaggle Fish market data

Variable, type, description

Species, character, fish species

Weight, numeric, weight in grams

Length1, numeric, vertical length in cm

Length2, numeric, diagonal length in cm

Length3, numeric, cross length in cm

Height, numeric, height in cm

Width, numeric, diagonal width in cm

https://www.kaggle.com/aungpyaeap/fish-market

Data and assumptions

library(openxlsx)
dat <- read.xlsx('data/2.4-fish.xlsx')
head(dat)
##   Species Weight Length1 Length2 Length3  Height  Width
## 1   Bream    242    23.2    25.4    30.0 11.5200 4.0200
## 2   Bream    290    24.0    26.3    31.2 12.4800 4.3056
## 3   Bream    340    23.9    26.5    31.1 12.3778 4.6961
## 4   Bream    363    26.3    29.0    33.5 12.7300 4.4555
## 5   Bream    430    26.5    29.0    34.0 12.4440 5.1340
## 6   Bream    450    26.8    29.7    34.7 13.6024 4.9274

Data and assumptions

Slice out perch

dat$Species == 'Perch'
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE

Data and assumptions

Slice out perch

perch <- dat[dat$Species == "Perch" , ]
head(perch)
##    Species Weight Length1 Length2 Length3 Height  Width
## 73   Perch    5.9     7.5     8.4     8.8 2.1120 1.4080
## 74   Perch   32.0    12.5    13.7    14.7 3.5280 1.9992
## 75   Perch   40.0    13.8    15.0    16.0 3.8240 2.4320
## 76   Perch   51.5    15.0    16.2    17.2 4.5924 2.6316
## 77   Perch   70.0    15.7    17.4    18.5 4.5880 2.9415
## 78   Perch  100.0    16.2    18.0    19.2 5.2224 3.3216
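The slicing step works in two stages: the comparison produces one TRUE/FALSE per row, and that logical vector is then used as the row index. The sketch below shows this on a tiny data frame built from a few rows of the printed output; the object names (fish_small, is_perch, perch_small, perch_subset) are hypothetical.

```r
# Small illustrative data frame built from rows shown in the printed output
fish_small <- data.frame(
  Species = c("Bream", "Bream", "Perch", "Perch", "Perch"),
  Weight  = c(242, 290, 5.9, 32.0, 40.0),
  Width   = c(4.0200, 4.3056, 1.4080, 1.9992, 2.4320),
  stringsAsFactors = FALSE
)

# Step 1: the comparison returns one TRUE/FALSE per row
is_perch <- fish_small$Species == "Perch"
print(is_perch)

# Step 2: use the logical vector as the row index;
# leaving the column slot empty keeps every column
perch_small <- fish_small[is_perch, ]

# subset() is an equivalent alternative that can read more clearly
perch_subset <- subset(fish_small, Species == "Perch")
print(perch_small)
```

Note that the original row numbers are kept as row names, which is why the perch rows in the full data start at 73.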

Graphing regression

# literally the least you can do
plot(y = perch$Height, x = perch$Width)
lm0_perch <- lm(Height ~ Width, data = perch)
abline(lm0_perch)

Graphing regression

plot(y = perch$Height, x = perch$Width,
     ylab = "Height (cm)", xlab = "Width (cm)",
     main = "My perch regression plot",
     pch = 20, col = "blue", cex = 1)

abline(lm0_perch)
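One possible next step for the plot is a confidence band around the fitted line. The sketch below uses predict() with interval = "confidence". To stay self-contained it refits the model on only the first six perch rows printed earlier, so its numbers will differ from the full lm0_perch fit; the names perch6, fit6, grid and band are hypothetical.

```r
# First six perch rows, copied from the head(perch) output
perch6 <- data.frame(
  Height = c(2.1120, 3.5280, 3.8240, 4.5924, 4.5880, 5.2224),
  Width  = c(1.4080, 1.9992, 2.4320, 2.6316, 2.9415, 3.3216)
)
fit6 <- lm(Height ~ Width, data = perch6)

# Predict along an evenly spaced grid of widths, with a 95% confidence interval
grid <- data.frame(Width = seq(min(perch6$Width), max(perch6$Width),
                               length.out = 50))
band <- predict(fit6, newdata = grid, interval = "confidence", level = 0.95)

plot(Height ~ Width, data = perch6,
     ylab = "Height (cm)", xlab = "Width (cm)",
     pch = 20, col = "blue")
abline(fit6)
lines(grid$Width, band[, "lwr"], lty = 2)  # lower confidence limit
lines(grid$Width, band[, "upr"], lty = 2)  # upper confidence limit
```

With the full data, the same pattern should work by swapping in perch and lm0_perch.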

Graphing regression

Residuals Gaussian?

hist(residuals(lm0_perch))

Graphing regression

Residuals Gaussian?

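# which = 1 plots residuals against fitted values (linearity and homoscedasticity);
# which = 2 would give the normal Q-Q plot of the residuals
plot(x = lm0_perch, which = 1)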
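A normal Q-Q plot is another common way to eyeball whether residuals look Gaussian. The sketch below uses simulated stand-in data (an assumption, chosen only so the code runs on its own); with the real data the same calls would take lm0_perch.

```r
# Simulated stand-in data (an assumption); use lm0_perch on the real data
set.seed(1)
x   <- runif(56, min = 1, max = 8)
y   <- 0.3 + 1.6 * x + rnorm(56, sd = 0.5)
fit <- lm(y ~ x)

res <- residuals(fit)
qqnorm(res)   # sample quantiles of the residuals vs. theoretical Gaussian quantiles
qqline(res)   # reference line through the first and third quartiles

# The built-in lm diagnostic draws a similar plot from standardized residuals
plot(fit, which = 2)
```

Points that track the reference line suggest roughly Gaussian residuals; strong curvature or heavy tails suggest otherwise.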

Graphing regression

Sometimes a statistical test may be used to diagnose the distribution

shapiro.test(residuals(lm0_perch))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lm0_perch)
## W = 0.96783, p-value = 0.1397

Graphing regression

We found no evidence the residuals deviated from the Gaussian expectation (Shapiro-Wilk: W = 0.97, n = 56, P = 0.14)

Is this the same as evidence the residuals “are Gaussian”? No: failing to reject the null hypothesis is not evidence that the null hypothesis is true.
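A small simulation can illustrate why a non-significant Shapiro-Wilk test is weak evidence of normality. The sketch below repeatedly draws small samples from a uniform distribution (clearly not Gaussian) and counts how often the test fails to reject. The sample size, number of simulations and choice of distribution are illustrative assumptions.

```r
set.seed(123)
n_sims <- 1000
n_obs  <- 20   # small sample size, chosen for illustration

# Each sample comes from a uniform distribution, which is not Gaussian
p_values <- replicate(n_sims, shapiro.test(runif(n_obs))$p.value)

# Proportion of samples where the test fails to reject normality at 0.05
prop_not_rejected <- mean(p_values > 0.05)
print(prop_not_rejected)
```

With small samples the test has limited power, so many non-Gaussian samples typically pass it. This is one reason to look at the plots as well as the P-value.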

Tests and alternatives

summary(lm0_perch)
## 
## Call:
## lm(formula = Height ~ Width, data = perch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.23570 -0.28886 -0.02948  0.27910  1.55439 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.29630    0.20543   1.442    0.155    
## Width        1.59419    0.04059  39.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5342 on 54 degrees of freedom
## Multiple R-squared:  0.9662, Adjusted R-squared:  0.9656 
## F-statistic:  1543 on 1 and 54 DF,  p-value: < 2.2e-16
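The numbers in this table are linked by simple arithmetic, which the sketch below reproduces from the printed values alone (so small rounding differences are expected). The 95% confidence interval for the slope is an addition not shown in the output; the object names are hypothetical.

```r
# Values copied from the printed summary output
est_slope <- 1.59419
se_slope  <- 0.04059
df_resid  <- 54

t_value <- est_slope / se_slope                   # t statistic for H0: slope = 0
p_value <- 2 * pt(-abs(t_value), df = df_resid)   # two-sided P-value
f_value <- t_value^2                              # with one predictor, F = t^2
r2      <- f_value / (f_value + df_resid)         # R-squared from F and residual df

# 95% confidence interval for the slope
ci_slope <- est_slope + c(-1, 1) * qt(0.975, df = df_resid) * se_slope

print(c(t = t_value, F = f_value, R2 = r2))
print(p_value)
print(ci_slope)
```

On the fitted model itself, coef(summary(lm0_perch)) and confint(lm0_perch) return the same quantities without retyping them.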

Tests and alternatives

Let’s report this:

\(Height = 0.30 + 1.59 \times Width\)

We found a significant linear relationship for Width predicting Height in perch (regression: F = 1543, R-squared = 0.97, df = 1,54, P < 0.0001)

NB this is a test of whether the regression slope coefficient (1.59) is different from zero
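The fitted coefficients from the Height ~ Width model can also be used for specific numerical predictions. A minimal sketch follows, using the printed estimates; the helper predict_height() and the example widths are hypothetical.

```r
# Coefficients copied from the summary output (model: Height ~ Width)
intercept <- 0.29630
slope     <- 1.59419

# Hypothetical helper: predicted height (cm) for a given width (cm)
predict_height <- function(width_cm) {
  intercept + slope * width_cm
}

# Hypothetical widths, inside the range seen in the first perch rows
widths    <- c(2, 3)
predicted <- predict_height(widths)
print(predicted)

# Each extra cm of width adds `slope` cm of predicted height
print(diff(predicted))
```

With the model object available, predict(lm0_perch, newdata = data.frame(Width = c(2, 3))) should give the same predictions at full precision.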

Live coding