cc Licensed under the Creative Commons attribution-noncommercial license. Please share & remix noncommercially, mentioning its origin.

library(ggplot2); theme_set(theme_bw())
library(ggExtra)
library(cowplot)
library(dotwhisker)

From Linear to generalized linear models

Why GLMs?

  • assumptions of linear models do break down sometimes

  • count data: discrete, non-negative

  • proportion data: discrete counts, \(0 \le x \le N\)

  • hard to transform to Normal

  • linear model doesn’t make sense

GLMs in action

  • vast majority of GLMs
    • logistic regression (binary/Bernoulli data)
    • Poisson regression (count data)
  • lots of GLM theory carries over from LMs
    • formulas
    • parameter interpretation (partly)
    • diagnostics (partly)

Most GLMs are logistic

Family

  • family: what kind of data do I have?
    • from first principles: family specifies the relationship between the mean and variance
    • binomial: proportions, out of a total number of counts; includes binary (Bernoulli) (“logistic regression”)
    • Poisson (independent counts, no maximum, or far from the maximum)
    • other (Normal ("gaussian"), Gamma)
  • default family for glm is Gaussian

family definitions

  • link function plus variance function
  • typical defaults
    • Poisson: log (exponential)
    • binomial: logit/log-odds (logistic)
    • Gamma: should probably use link="log" rather than inverse (default)

logit link/logistic function

  • qlogis() function (plogis() is logistic/inverse-link)
  • log-odds (\(\log(p/(1-p))\))
  • most natural scale for probability calculations
  • interpretation depends on base probability
    • small probability: like log (proportional)
    • large probability: like log(1-p)
    • intermediate (\(0.3 <p <0.7\)): effect \(\approx \beta/4\)

binomial models

  • for Poisson, Bernoulli responses we only need one piece of information
  • how do we specify denominator (\(N\) in \(k/N\))?
    • traditional R: response is two-column matrix of successes and failures [cbind(k,N-k) not cbind(k,N)]
    • also allowed: response is proportion (\(k/N\)), also specify weights=N
    • if equal for all cases and specified on the fly need to replicate:
      glm(p~...,data,weights=rep(N,nrow(data)))

diagnostics

  • harder than linear models: plot is still somewhat useful
  • binary data especially hard (e.g. arm::binnedplot)
  • goodness of fit tests, \(R^2\) etc. hard (can always compute cor(observed,predict(model, type="response")))
  • residuals are Pearson residuals by default (\((\textrm{obs}-\textrm{exp})/V(\textrm{exp})\)); predicted values are on the effect scale (e.g. log/logit) by default (use type="response" to get data-scale predictions)
  • also see DHARMa package

what to do about problems

  • consider problems in order: bias > heterosced. > outliers > distribution > overdisp.
  • bias: add covariates? polynomial or spline (GAM) predictors? change link function?
  • heteroscedasticity: change variance relationship? (NB1 vs NB2)
  • outliers: drop outliers? robust methods?

overdispersion

  • too much variance (after fixing other problems)
  • should have residual df \(\approx\) residual deviance
  • slightly better test: sum of squares of Pearson residuals \(\sim \chi^2\)
  • aods3::gof()
  • overdispersion < 1.05 (small); >5 maybe something wrong?

overdispersion: solutions

  • quasilikelihood: adjust standard errors, CIs, p-values
    • Wald tests only (see below)
  • overdispersed distributions (e.g. negative binomial, beta-binomial)
  • observation-level random effects (later)

back-transformation

  • confidence intervals are symmetric on link scale
  • can back-transform estimates and CIs for log
  • logit is hard (must pick a reference level)
  • don’t back-transform standard errors!

estimation

  • iteratively re-weighted least-squares
  • usually Just Works

inference

like LMs, but:

  • one-parameter tests are usually \(Z\) rather than \(t\)
  • CIs based on standard errors are approximate (Wald)
  • confint.glm() computes likelihood profile CIs

Common(est?) glm() problems

  • binomial/Poisson models with non-integer data
  • failing to specify family (default Gaussian: \(\to\) linear model); using glm() for linear models (unnecessary)
  • predictions on effect scale
  • using \((k,N)\) rather than \((k,N-k)\) with family=binomial
  • back-transforming SEs rather than CIs
  • neglecting overdispersion
  • Poisson for underdispersed responses
  • equating negative binomial with binomial rather than Poisson
  • worrying about overdispersion unnecessarily (binary/Gamma)
  • ignoring random effects

Example

AIDS (Australia: Dobson & Barnett)

aids <- read.csv("../../data/aids.csv")
## set up time variable
aids <- transform(aids, date=year+(quarter-1)/4)
print(gg0 <- ggplot(aids,aes(date,cases))+geom_point())

Easy GLMs with ggplot

print(gg1 <- gg0 + geom_smooth(method="glm",colour="red",
          method.args=list(family="quasipoisson")))
## `geom_smooth()` using formula = 'y ~ x'

Equivalent code

g1 <- glm(cases~date,aids,family=quasipoisson(link="log"))
summary(g1)
## 
## Call:
## glm(formula = cases ~ date, family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.023e+03  6.806e+01  -15.03 1.25e-11 ***
## date         5.168e-01  3.425e-02   15.09 1.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.354647)
## 
##     Null deviance: 677.26  on 19  degrees of freedom
## Residual deviance:  53.02  on 18  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Note NA value for AIC …

Diagnostics (plot(g1))

acf(residuals(g1)) ## check autocorrelation

Diagnostics: the Big Lie

  • null-hypothesis testing of assumptions is iffy
    • never reject for small data sets
    • always reject for large data sets
  • but how do you know how large a problem to worry about?
  • gold standard: make your model more complex to show you didn’t need to worry in the first place

ggplot: check out quadratic model

print(gg2 <- gg1+geom_smooth(method="glm",formula=y~poly(x,2),
            method.args=list(family="quasipoisson")))
## `geom_smooth()` using formula = 'y ~ x'

on log scale

print(gg2+scale_y_log10())
## `geom_smooth()` using formula = 'y ~ x'

improved model

g2 <- update(g1,.~poly(date,2))
summary(g2)
## 
## Call:
## glm(formula = cases ~ poly(date, 2), family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.86859    0.05004  77.311  < 2e-16 ***
## poly(date, 2)1  3.82934    0.25162  15.219 2.46e-11 ***
## poly(date, 2)2 -0.68335    0.19716  -3.466  0.00295 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.657309)
## 
##     Null deviance: 677.264  on 19  degrees of freedom
## Residual deviance:  31.992  on 17  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
anova(g1,g2,test="F") ## for quasi-models specifically
## Analysis of Deviance Table
## 
## Model 1: cases ~ date
## Model 2: cases ~ poly(date, 2)
##   Resid. Df Resid. Dev Df Deviance      F   Pr(>F)   
## 1        18     53.020                               
## 2        17     31.992  1   21.028 12.688 0.002399 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

new diagnostics

autocorrelation function

acf(residuals(g2)) ## check autocorrelation

Also see nlme::ACF for ACFs of grouped data …

exercise

  • pick a data set (from the course data sets if you like; OK to ignore grouping/clustering for now)
  • decide on a family and set of predictors
  • decide on predictors
  • fit a model
  • look at diagnostics
  • fix the problems?

What did you learn? Where did you get stuck?

References