cc Licensed under the Creative Commons attribution-noncommercial license. Please share & remix noncommercially, mentioning its origin.

library(ggplot2); theme_set(theme_bw())
library(ggExtra)
library(cowplot)
library(dotwhisker)

Linear models

Linear models: assumptions

Linear models: math

\[ z = a + bx + cy + \epsilon \] or (more predictor variables) \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \epsilon \] or (more flexible distribution syntax) \[ y \sim \textrm{Normal}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots, \sigma^2) \] or (more complex sets of predictors) \[ \begin{split} \boldsymbol{\mu} & = \mathbf{X} \mathbf{\beta} \\ y_i & \sim \textrm{Normal}(\mu_i,\sigma^2) \end{split} \]

what does “linear” mean?

marginal vs. conditional distributions

example

MPG vs displacement for cars

cars_lm <- lm(log10(mpg)~log10(disp),mtcars)

(We’ll come back to how to judge this later)

categorical predictors

dd <- data.frame(flavour=rep(c("chocolate","vanilla"),c(2,3)))
print(dd)
##     flavour
## 1 chocolate
## 2 chocolate
## 3   vanilla
## 4   vanilla
## 5   vanilla
model.matrix(~flavour,dd)
##   (Intercept) flavourvanilla
## 1           1              0
## 2           1              0
## 3           1              1
## 4           1              1
## 5           1              1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$flavour
## [1] "contr.treatment"

R formulas

Formulas, continued

If confused, (1) try to write out the equation; (2) model.matrix()

Contrasts

too many ways to change contrasts (globally via options(); as attribute of factor; contrasts argument in lm())

Example 1 (treatment contrasts)

Data on ant colonies from Gotelli and Ellison (2004):

ants <- data.frame(
    place=rep(c("field","forest"),c(6,4)),
    colonies=c(12, 9, 12, 10,
               9, 6, 4, 6, 7, 10))
aggregate(colonies~place,data=ants,FUN=mean)
##    place colonies
## 1  field 9.666667
## 2 forest 6.750000
pr <- function(m) printCoefmat(coef(summary(m)),digits=3,signif.stars=FALSE)
pr(lm1 <- lm(colonies~place,data=ants))
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    9.667      0.958   10.09    8e-06
## placeforest   -2.917      1.515   -1.92     0.09

Ants: sum-to-zero contrasts

pr(lm2 <- update(lm1, contrasts=list(place=contr.sum)))
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    8.208      0.758   10.83  4.7e-06
## place1         1.458      0.758    1.92     0.09
data(lizards,package="brglm")

Interactions: example

Centering (Schielzeth 2010)

Scaling (Schielzeth 2010)

mtcars_big <- lm(mpg~.,data=mtcars)
mtcars_big_sc <- lm(mpg~.,data=as.data.frame(scale(mtcars)))
dwfun <- function(.,title) { dwplot(.,order_vars=names(sort(coef(.))))+
                                 geom_vline(xintercept=0,linetype=2)+
                                 ggtitle(title) }
plot_grid(dwfun(mtcars_big,"unscaled"),dwfun(mtcars_big_sc,"scaled"))

LM diagnostics

LM diagnostics

par(mfrow=c(2,2))
plot(cars_lm)

a bad model (Tiwari et al. 2006)

original data/fit …

Diagnostics

testing hypotheses and interpreting results

From LM to GLM

Why GLMs?

  • assumptions of LMs do break down sometimes
  • count data: discrete, non-negative
  • proportion data: discrete counts, \(0 \le x \le N\)

  • hard to transform to Normal
  • linear model doesn’t make sense

GLMs in action

  • vast majority of GLMs
    • logistic regression (binary/Bernoulli data)
    • Poisson regression (count data)
  • lots of GLM theory carries over from LMs
    • formulas
    • parameter interpretation (partly)
    • diagnostics (partly)

Most GLMs are logistic

Family

  • family: what kind of data do I have?
    • from first principles: family specifies the relationship between the mean and variance
    • binomial: proportions, out of a total number of counts; includes binary (Bernoulli) (“logistic regression”)
    • Poisson (independent counts, no set maximum, or far from the maximum)
    • other (Normal ("gaussian"), Gamma)
  • default family for glm is Gaussian

family definitions

  • link function plus variance function
  • typical defaults
    • Poisson: log (exponential)
    • binomial: logit/log-odds (logistic)

logit link/logistic function

  • qlogis() function (plogis() is logistic/inverse-link)
  • log-odds (\(\log(p/(1-p))\))
  • most natural scale for probability calculations
  • interpretation depends on base probability
    • small probability: like log (proportional)
    • large probability: like log(1-p)
    • intermediate (\(0.3 <p <0.7\)): effect \(\approx \beta/4\)

binomial models

  • for Poisson, Bernoulli responses we only need one piece of information
  • how do we specify denominator (\(N\) in \(k/N\))?
  • traditional R: response is two-column matrix cbind(k,N-k) not cbind(k,N)
  • also allowed: response is proportion (\(k/N\)), also specify weights=N
  • if equal for all cases and specified on the fly need to replicate:
    glm(p~...,data,weights=rep(N,nrow(data)))

diagnostics

  • a little harder than linear models: plot is still somewhat useful
  • binary data especially hard (e.g. arm::binnedplot)
  • goodness of fit tests, \(R^2\) etc. hard (can always compute cor(observed,predict(model, type="response")))
  • residuals are Pearson residuals by default (\((\textrm{obs}-\textrm{exp})/V(\textrm{exp})\)); predicted values are on the effect scale (e.g. log/logit) by default (use type="response" to get data-scale predictions)
  • also see DHARMa package

overdispersion

  • too much variance
  • more detail later
  • should have residual df \(\approx\) residual deviance

back-transformation

  • confidence intervals are symmetric on link scale
  • can back-transform estimates and CIs for log
  • logit is hard (must pick a reference level)
  • don’t back-transform standard errors!

estimation

  • iteratively re-weighted least-squares
  • usually Just Works

inference

like LMs, but:

  • one-parameter tests are usually \(Z\) rather than \(t\)
  • CIs based on standard errors are approximate (Wald)
  • confint.glm() computes likelihood profile CIs

Common(est?) glm() problems

  • binomial/Poisson models with non-integer data
  • failing to specify family (\(\to\) linear model); using glm() for linear models (unnecessary)
  • predictions on effect scale
  • using \((k,N)\) rather than \((k,N-k)\) in binomial models
  • back-transforming SEs rather than CIs

  • neglecting overdispersion
  • Poisson for underdispersed responses
  • equating negative binomial with binomial rather than Poisson (
  • worrying about overdispersion unnecessarily (binary/Gamma)
  • ignoring random effects

Example

AIDS (Australia: Dobson & Barnett)

aids <- read.csv("../data/aids.csv")
aids <- transform(aids, date=year+(quarter-1)/4)
print(gg0 <- ggplot(aids,aes(date,cases))+geom_point())

Easy GLMs with ggplot

print(gg1 <- gg0 + geom_smooth(method="glm",colour="red",
          method.args=list(family="quasipoisson")))

Equivalent code

g1 <- glm(cases~date,aids,family=quasipoisson(link="log"))
summary(g1)
## 
## Call:
## glm(formula = cases ~ date, family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.7046  -0.7978   0.1218   0.6849   2.1217  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.023e+03  6.806e+01  -15.03 1.25e-11 ***
## date         5.168e-01  3.425e-02   15.09 1.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.354647)
## 
##     Null deviance: 677.26  on 19  degrees of freedom
## Residual deviance:  53.02  on 18  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Diagnostics (plot(g1))

acf(residuals(g1)) ## check autocorrelation

ggplot: check out quadratic model

print(gg2 <- gg1+geom_smooth(method="glm",formula=y~poly(x,2),
            method.args=list(family="quasipoisson")))

on log scale

print(gg2+scale_y_log10())

improved model

g2 <- update(g1,.~poly(date,2))
summary(g2)
## 
## Call:
## glm(formula = cases ~ poly(date, 2), family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3290  -0.9071  -0.0761   0.8985   2.3209  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.86859    0.05004  77.311  < 2e-16 ***
## poly(date, 2)1  3.82934    0.25162  15.219 2.46e-11 ***
## poly(date, 2)2 -0.68335    0.19716  -3.466  0.00295 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.657309)
## 
##     Null deviance: 677.264  on 19  degrees of freedom
## Residual deviance:  31.992  on 17  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
anova(g1,g2,test="F") ## for quasi-models specifically
## Analysis of Deviance Table
## 
## Model 1: cases ~ date
## Model 2: cases ~ poly(date, 2)
##   Resid. Df Resid. Dev Df Deviance      F   Pr(>F)   
## 1        18     53.020                               
## 2        17     31.992  1   21.028 12.688 0.002399 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

new diagnostics

autocorrelation function

acf(residuals(g2)) ## check autocorrelation

References

Gotelli, Nicholas J., and Aaron M. Ellison. 2004. A Primer of Ecological Statistics. Sunderland, MA: Sinauer.

Schielzeth, Holger. 2010. “Simple Means to Improve the Interpretability of Regression Coefficients.” Methods in Ecology and Evolution 1: 103–13. doi:10.1111/j.2041-210X.2010.00012.x.

Tiwari, Manjula, Karen A. Bjorndal, Alan B. Bolten, and Benjamin M. Bolker. 2006. “Evaluation of Density-Dependent Processes and Green Turtle Chelonia Mydas Hatchling Production at Tortuguero, Costa Rica.” Marine Ecology Progress Series 326: 283–93.

Wilkinson, G. N., and C. E. Rogers. 1973. “Symbolic Description of Factorial Models for Analysis of Variance.” Applied Statistics 22 (3): 392–99. doi:10.2307/2346786.