Licensed under the Creative Commons attribution-noncommercial license. Please share & remix noncommercially, mentioning its origin.

library(ggplot2); theme_set(theme_bw())
library(ggExtra)
library(cowplot)
library(dotwhisker)

From Linear to generalized linear models

Why GLMs?

assumptions of linear models do break down sometimes
count data: discrete, non-negative
proportion data: discrete counts, \(0 \le x \le N\)
hard to transform to Normal
linear model doesn’t make sense

GLMs in action

vast majority of GLMs
- logistic regression (binary/Bernoulli data)
- Poisson regression (count data)
lots of GLM theory carries over from LMs
- formulas
- parameter interpretation (partly)
- diagnostics (partly)

Most GLMs are logistic

Family

family: what kind of data do I have?
- from first principles: family specifies the relationship between the mean and variance
- binomial: proportions, out of a total number of counts; includes binary (Bernoulli) (“logistic regression”)
- Poisson (independent counts, no maximum, or far from the maximum)
- other (Normal ("gaussian"), Gamma)
default family for glm is Gaussian

link functions

these transform prediction, not response
e.g. rather than \(\log(\mu) = \beta_0+\beta_1 x\), use \(\mu = \exp(\beta_0+\beta_1 x)\)
in this case log is the link function, exp is the inverse link function
extreme observations don’t cause problems (usually)

family definitions

link function plus variance function
typical defaults
- Poisson: log (exponential)
- binomial: logit/log-odds (logistic)
- Gamma: should probably use link="log" rather than inverse (default)

log link

proportional scaling of effects
small values of coefficients (\(<0.1\)) \(\approx\) proportionality
otherwise change per unit is \(\exp(\beta)\)
large parameter values (\(>10\)) mean some kind of trouble

logit link/logistic function

qlogis() function (plogis() is logistic/inverse-link)
log-odds (\(\log(p/(1-p))\))
most natural scale for probability calculations
interpretation depends on base probability
- small probability: like log (proportional)
- large probability: like log(1-p)
- intermediate (\(0.3 <p <0.7\)): effect \(\approx \beta/4\)

binomial models

for Poisson, Bernoulli responses we only need one piece of information
how do we specify denominator (\(N\) in \(k/N\))?
- traditional R: response is two-column matrix of successes and failures [cbind(k,N-k) not cbind(k,N)]
- also allowed: response is proportion (\(k/N\)), also specify weights=N
- if equal for all cases and specified on the fly need to replicate:
  glm(p~...,data,weights=rep(N,nrow(data)))

diagnostics

harder than linear models: plot is still somewhat useful
binary data especially hard (e.g. arm::binnedplot)
goodness of fit tests, \(R^2\) etc. hard (can always compute cor(observed,predict(model, type="response")))
residuals are Pearson residuals by default (\((\textrm{obs}-\textrm{exp})/V(\textrm{exp})\)); predicted values are on the effect scale (e.g. log/logit) by default (use type="response" to get data-scale predictions)
also see DHARMa package

what to do about problems

consider problems in order: bias > heterosced. > outliers > distribution > overdisp.
bias: add covariates? polynomial or spline (GAM) predictors? change link function?
heteroscedasticity: change variance relationship? (NB1 vs NB2)
outliers: drop outliers? robust methods?

overdispersion

too much variance (after fixing other problems)
should have residual df \(\approx\) residual deviance
slightly better test: sum of squares of Pearson residuals \(\sim \chi^2\)
aods3::gof()
overdispersion < 1.05 (small); >5 maybe something wrong?

overdispersion: solutions

quasilikelihood: adjust standard errors, CIs, p-values
- Wald tests only (see below)
overdispersed distributions (e.g. negative binomial, beta-binomial)
observation-level random effects (later)

back-transformation

confidence intervals are symmetric on link scale
can back-transform estimates and CIs for log
logit is hard (must pick a reference level)
don’t back-transform standard errors!

estimation

iteratively re-weighted least-squares
usually Just Works

inference

like LMs, but:

one-parameter tests are usually \(Z\) rather than \(t\)
CIs based on standard errors are approximate (Wald)
confint.glm() computes likelihood profile CIs

Common(est?) `glm()` problems

binomial/Poisson models with non-integer data
failing to specify family (default Gaussian: \(\to\) linear model); using glm() for linear models (unnecessary)
predictions on effect scale
using \((k,N)\) rather than \((k,N-k)\) with family=binomial
back-transforming SEs rather than CIs
neglecting overdispersion
Poisson for underdispersed responses
equating negative binomial with binomial rather than Poisson
worrying about overdispersion unnecessarily (binary/Gamma)
ignoring random effects

Example

AIDS (Australia: Dobson & Barnett)

aids <- read.csv("../../data/aids.csv")
## set up time variable
aids <- transform(aids, date=year+(quarter-1)/4)
print(gg0 <- ggplot(aids,aes(date,cases))+geom_point())

Easy GLMs with ggplot

print(gg1 <- gg0 + geom_smooth(method="glm",colour="red",
          method.args=list(family="quasipoisson")))

## `geom_smooth()` using formula = 'y ~ x'

Equivalent code

g1 <- glm(cases~date,aids,family=quasipoisson(link="log"))
summary(g1)

## 
## Call:
## glm(formula = cases ~ date, family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.023e+03  6.806e+01  -15.03 1.25e-11 ***
## date         5.168e-01  3.425e-02   15.09 1.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.354647)
## 
##     Null deviance: 677.26  on 19  degrees of freedom
## Residual deviance:  53.02  on 18  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Note NA value for AIC …

Diagnostics (`plot(g1)`)

acf(residuals(g1)) ## check autocorrelation

Diagnostics: the Big Lie

null-hypothesis testing of assumptions is iffy
- never reject for small data sets
- always reject for large data sets
but how do you know how large a problem to worry about?
gold standard: make your model more complex to show you didn’t need to worry in the first place

ggplot: check out quadratic model

print(gg2 <- gg1+geom_smooth(method="glm",formula=y~poly(x,2),
            method.args=list(family="quasipoisson")))

## `geom_smooth()` using formula = 'y ~ x'

on log scale

print(gg2+scale_y_log10())

## `geom_smooth()` using formula = 'y ~ x'

improved model

g2 <- update(g1,.~poly(date,2))
summary(g2)

## 
## Call:
## glm(formula = cases ~ poly(date, 2), family = quasipoisson(link = "log"), 
##     data = aids)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.86859    0.05004  77.311  < 2e-16 ***
## poly(date, 2)1  3.82934    0.25162  15.219 2.46e-11 ***
## poly(date, 2)2 -0.68335    0.19716  -3.466  0.00295 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.657309)
## 
##     Null deviance: 677.264  on 19  degrees of freedom
## Residual deviance:  31.992  on 17  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

anova(g1,g2,test="F") ## for quasi-models specifically

## Analysis of Deviance Table
## 
## Model 1: cases ~ date
## Model 2: cases ~ poly(date, 2)
##   Resid. Df Resid. Dev Df Deviance      F   Pr(>F)   
## 1        18     53.020                               
## 2        17     31.992  1   21.028 12.688 0.002399 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

new diagnostics

autocorrelation function

acf(residuals(g2)) ## check autocorrelation

Also see nlme::ACF for ACFs of grouped data …

exercise

pick a data set (from the course data sets if you like; OK to ignore grouping/clustering for now)
decide on a family and set of predictors
decide on predictors
fit a model
look at diagnostics
fix the problems?

What did you learn? Where did you get stuck?

basic generalized linear models

Ben Bolker

From Linear to generalized linear models

Why GLMs?

GLMs in action

Most GLMs are logistic

Family

link functions

family definitions

log link

logit link/logistic function

binomial models

diagnostics

what to do about problems

overdispersion

overdispersion: solutions

back-transformation

estimation

inference

Common(est?) `glm()` problems

Example

AIDS (Australia: Dobson & Barnett)

Easy GLMs with ggplot

Equivalent code

Diagnostics (`plot(g1)`)

Diagnostics: the Big Lie

ggplot: check out quadratic model

on log scale

improved model

new diagnostics

autocorrelation function

exercise

References

basic generalized linear models

Ben Bolker

From Linear to generalized linear models

Why GLMs?

GLMs in action

Most GLMs are logistic

Family

link functions

family definitions

log link

logit link/logistic function

binomial models

diagnostics

what to do about problems

overdispersion

overdispersion: solutions

back-transformation

estimation

inference

Common(est?) glm() problems

Example

AIDS (Australia: Dobson & Barnett)

Easy GLMs with ggplot

Equivalent code

Diagnostics (plot(g1))

Diagnostics: the Big Lie

ggplot: check out quadratic model

on log scale

improved model

new diagnostics

autocorrelation function

exercise

References

Common(est?) `glm()` problems

Diagnostics (`plot(g1)`)