Whenever the correlation between two scores is imperfect, there will be regression to the mean.
-Francis Galton
last updated: 2021-10-26
Linear regression is a foundational tool for scientists
Used to quantify relationships between variables (causal claims also need an appropriate study design)
Specific numerical predictions
Foundation for other linear models…
“Does X explain significant variation in Y?”
The main question we tend to ask is whether the slope of the regression differs from zero
\(Y = \alpha + \beta X + \epsilon\)
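The model above can be made concrete with a small simulation. This is a minimal sketch on simulated data, not the fish data; the parameter values (`alpha = 2`, `beta = 0.5`, residual SD of 1) and the object names are made up for illustration.

```r
# Simulate data from Y = alpha + beta * X + epsilon (all values are made up)
set.seed(42)
n     <- 200
alpha <- 2      # assumed intercept
beta  <- 0.5    # assumed slope
x     <- runif(n, min = 0, max = 10)
eps   <- rnorm(n, mean = 0, sd = 1)   # Gaussian residual error
y     <- alpha + beta * x + eps

sim     <- data.frame(x = x, y = y)
fit_sim <- lm(y ~ x, data = sim)

# The estimates should land near the assumed alpha and beta
est <- coef(fit_sim)
print(est)

# A specific numerical prediction for a new X value
pred <- predict(fit_sim, newdata = data.frame(x = 5))
print(pred)
```

Because the data were generated with a known intercept and slope, the fitted coefficients can be compared directly against the truth, which is not possible with real data.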
Formal assumptions
Informal assumptions (the ones we have responsibility to evaluate)
(Relationship is linear)
(Numeric continuous data)
The residuals are “Gaussian”
Homoscedasticity
Independence of observations
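To see what a violated assumption can look like, here is a sketch of heteroscedasticity on simulated data. The values are made up and the error SD is deliberately set to grow with `x`; none of this comes from the fish data.

```r
# Simulated example of heteroscedasticity (made-up values, not the fish data)
set.seed(1)
n <- 200
x <- runif(n, min = 1, max = 10)
y <- 1 + 2 * x + rnorm(n, mean = 0, sd = 0.3 * x)  # error SD grows with x

fit_het <- lm(y ~ x)
res     <- residuals(fit_het)

# Compare the residual spread for small versus large x
sd_low  <- sd(res[x <  median(x)])
sd_high <- sd(res[x >= median(x)])
print(c(sd_low = sd_low, sd_high = sd_high))

# Residuals vs fitted values: a funnel shape suggests unequal variance
plot(fit_het, which = 1)
```

Under homoscedasticity the two standard deviations would be similar and the residual plot would show an even band around zero.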
Kaggle Fish market data
Species, character, fish spp
Weight, numeric, weight in grams
Length1, numeric, vertical length in cm
Length2, numeric, diagonal length in cm
Length3, numeric, cross length in cm
Height, numeric, height in cm
Width, numeric, diagonal width in cm
library(openxlsx)
dat <- read.xlsx('data/2.4-fish.xlsx')
head(dat)
##   Species Weight Length1 Length2 Length3  Height  Width
## 1   Bream    242    23.2    25.4    30.0 11.5200 4.0200
## 2   Bream    290    24.0    26.3    31.2 12.4800 4.3056
## 3   Bream    340    23.9    26.5    31.1 12.3778 4.6961
## 4   Bream    363    26.3    29.0    33.5 12.7300 4.4555
## 5   Bream    430    26.5    29.0    34.0 12.4440 5.1340
## 6   Bream    450    26.8    29.7    34.7 13.6024 4.9274
Slice out perch
dat$Species == 'Perch'
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [85]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [109]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [121]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE
Slice out perch
perch <- dat[dat$Species == "Perch", ]
head(perch)
##    Species Weight Length1 Length2 Length3 Height  Width
## 73   Perch    5.9     7.5     8.4     8.8 2.1120 1.4080
## 74   Perch   32.0    12.5    13.7    14.7 3.5280 1.9992
## 75   Perch   40.0    13.8    15.0    16.0 3.8240 2.4320
## 76   Perch   51.5    15.0    16.2    17.2 4.5924 2.6316
## 77   Perch   70.0    15.7    17.4    18.5 4.5880 2.9415
## 78   Perch  100.0    16.2    18.0    19.2 5.2224 3.3216
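The slicing step works in two stages: build a logical vector, then use it in the row position of `[ , ]`. A minimal sketch on a toy data frame follows; the four rows and their values are hypothetical and only there to keep the example self-contained.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  Species = c("Bream", "Perch", "Perch", "Pike"),
  Weight  = c(250, 40, 70, 300)
)

# Step 1: a logical vector, TRUE where the row is a perch
is_perch <- toy$Species == "Perch"

# Step 2: use it in the row position; leaving the column position
# empty keeps all columns
toy_perch <- toy[is_perch, ]

# subset() is a base R shortcut that selects the same rows
toy_perch2 <- subset(toy, Species == "Perch")

print(toy_perch)
```

Note that the original row numbers are kept in the result, which is why the perch rows in the fish data start at 73 rather than 1.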
# literally the least you can do
plot(y = perch$Height, x = perch$Width)
lm0_perch <- lm(Height ~ Width, data = perch)
abline(lm0_perch)
plot(y = perch$Height, x = perch$Width,
     ylab = "Height (cm)", xlab = "Width (cm)",
     main = "My perch regression plot",
     pch = 20, col = "blue", cex = 1)
abline(lm0_perch)
Residuals Gaussian?
hist(residuals(lm0_perch))
Residuals Gaussian?
plot(x = lm0_perch, which = 1)
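In `plot()` for an `lm` object, `which = 1` draws residuals against fitted values, which speaks mostly to linearity and equal variance. A normal Q-Q plot (`which = 2`) gives a more direct look at the Gaussian question. The sketch below uses a stand-in model fitted to simulated data so that it runs on its own; the data values and the names `demo` and `fit_demo` are assumptions, and with the fish data `lm0_perch` would take the place of `fit_demo`.

```r
# Stand-in model on simulated data (made-up values, not the fish data)
set.seed(7)
demo <- data.frame(width = runif(56, min = 1, max = 8))
demo$height <- 0.3 + 1.6 * demo$width + rnorm(56, sd = 0.5)
fit_demo <- lm(height ~ width, data = demo)

# Normal Q-Q plot of the standardized residuals
plot(fit_demo, which = 2)

# The same idea built by hand from the raw residuals
res <- residuals(fit_demo)
qqnorm(res)
qqline(res)
```

Points that follow the reference line closely are consistent with Gaussian residuals; systematic curvature at the ends points to skew or heavy tails.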
Sometimes a statistical test may be used to diagnose the distribution
shapiro.test(residuals(lm0_perch))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lm0_perch)
## W = 0.96783, p-value = 0.1397
We found no evidence the residuals deviated from the Gaussian expectation (Shapiro-Wilk: W = 0.97, n = 56, P = 0.14)
Is this the same as evidence the residuals “are Gaussian”? (no)
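One way to see why "no evidence of deviation" is not "evidence of Gaussian" is to run the test on data that are known to be non-Gaussian. The simulation below is a sketch; the exponential distribution, the two sample sizes, and the number of runs are arbitrary choices made for illustration.

```r
# How often does Shapiro-Wilk flag clearly skewed (exponential) data?
set.seed(123)
n_sims <- 500

reject_rate <- function(n) {
  # P values from n_sims independent samples of size n
  p_values <- replicate(n_sims, shapiro.test(rexp(n))$p.value)
  mean(p_values < 0.05)   # proportion of runs with P < 0.05
}

rate_small <- reject_rate(10)    # small samples
rate_large <- reject_rate(100)   # larger samples
print(c(n10 = rate_small, n100 = rate_large))
```

The rejection rate rises with sample size, so a non-significant result from a small sample can simply reflect low power rather than Gaussian residuals.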
summary(lm0_perch)
## 
## Call:
## lm(formula = Height ~ Width, data = perch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.23570 -0.28886 -0.02948  0.27910  1.55439
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.29630    0.20543   1.442    0.155
## Width        1.59419    0.04059  39.276   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5342 on 54 degrees of freedom
## Multiple R-squared:  0.9662, Adjusted R-squared:  0.9656
## F-statistic:  1543 on 1 and 54 DF,  p-value: < 2.2e-16
Let’s report this:
\(Height = 0.30 + 1.59 \times Width\)
We found a significant linear relationship between width and height in perch (regression: R-squared = 0.97, F = 1543, df = 1, 54, P < 0.0001)
NB this is a test of whether the regression slope coefficient (1.59) is different to zero
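The slope test can be reproduced by hand from the numbers in the `summary()` output. This sketch copies the rounded estimate, standard error, and residual degrees of freedom from that output, so the results match the printed values only up to rounding; the variable names are made up for illustration.

```r
# Values copied from the summary() output for lm0_perch
estimate  <- 1.59419   # slope for Width
std_error <- 0.04059   # its standard error
df_resid  <- 54        # residual degrees of freedom

# t statistic for the null hypothesis that the slope is zero
t_value <- estimate / std_error

# Two-sided P value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# With a single predictor, the overall F statistic equals t squared
f_value <- t_value^2

print(c(t = t_value, F = f_value, P = p_value))
```

This is why, in a regression with one predictor, the P value on the slope row and the P value on the F-statistic line are the same test.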