A very common situation in ecology (and elsewhere) is a
binary-outcome survival model where individuals, each measured once,
differ in their exposure, typically because they have been observed for
different lengths of time (this is different from typical survival
analysis, where we know either the exact moment that someone died,
or that they were still alive when we stopped observing them). The
classic statistical approach to this problem is to use a
complementary log-log (“cloglog”) link (Heisey, Shaffer, and White 2007). The
cloglog
function is \(C(\mu)=\log(-\log(1-\mu))\); its inverse is
\(\mu=C^{-1}(\eta) =
1-\exp(-\exp(\eta))\). Thus if we expect mortality \(\mu_0\) over a period \(\Delta t=1\), so that the linear predictor is \(\eta=C(\mu_0)\), then \[
C^{-1}(\eta+\log \Delta t)=1-\exp(-\exp(\eta) \cdot \Delta t).
\] Since \(\exp(\eta)=-\log(1-\mu_0)\), this equals \(1-(1-\mu_0)^{\Delta t}\), which is what we want.
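A quick numerical check of the offset algebra (values chosen arbitrarily for illustration):
## per-day mortality 0.2, observed over 3 days of exposure
mu0 <- 0.2; dt <- 3
eta <- log(-log(1 - mu0))        ## cloglog link applied to mu0
1 - exp(-exp(eta + log(dt)))     ## inverse cloglog with offset log(dt): 0.488
1 - (1 - mu0)^dt                 ## direct calculation: also 0.488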
The function \(\exp(-\exp(x))\) is called the Gompertz function (the inverse-cloglog function \(1-\exp(-\exp(x))\) is the CDF of the minimum extreme-value distribution), so fitting a model with this inverse-link function (i.e. fitting a cloglog link to the survival, rather than the mortality, probability) is called a gompit (or extreme-value) regression.
The bottom line is that if we change the form of the dependence of survival on covariates from logistic to complementary log-log, we can sneak the exposure variable in via an offset.
To use this approach in R, use family=binomial(link="cloglog") and add a term of the form offset(log(A)) to the model formula (some modeling functions take offset as a separate argument).
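A minimal sketch of this approach (a hypothetical data frame dd with binary outcome y, covariate x, and exposure time A; the names are illustrative only):
## cloglog link with a log-exposure offset (offset coefficient fixed at 1)
m <- glm(y ~ x + offset(log(A)),
         family = binomial(link = "cloglog"), data = dd)
## some modeling functions instead take offset = log(A) as a separate argument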
If instead of using log(A) as an offset you allow it to be a covariate (predictor, independent variable, whatever you want to call it), that implies a log-hazard that increases or decreases linearly with log-exposure → a hazard that is a power-law function of exposure → a Weibull distribution for the survival times (when the coefficient is 1, i.e. if we use an offset, this reduces to exponentially distributed survival times).
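To spell out the Weibull connection (a short derivation following the algebra above, with \(\beta\) denoting the coefficient of \(\log \Delta t\)): \[
C^{-1}(\eta + \beta \log \Delta t) = 1 - \exp\left(-\exp(\eta)\, \Delta t^{\beta}\right),
\] so the implied survival function \(\exp(-\exp(\eta)\,\Delta t^{\beta})\) is Weibull in \(\Delta t\), collapsing to the exponential case when \(\beta = 1\).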
Suppose we don’t want to do this, and instead want to stick with the logistic form, but want to have the link function be \[ \mu = \left( \text{Logistic}(\eta) \right)^e, \] where \(e\) is the exposure (typically, but not necessarily, an integer). I will call this the power-logistic link approach; it is popular in analyses of nest survival (Shaffer 2004). The code below explores both the complementary log-log and power-logistic approaches. The complementary log-log approach is in some ways more computationally convenient than the power-logistic (it can be used in any GLM-based framework that provides a cloglog link option for binomial models). However, the inverse-cloglog function increases extremely quickly (relative to the inverse-logit/logistic) for \(x>0\):
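A quick base-R sketch of this comparison (an illustration, not the original figure):
## compare the two inverse-link functions over the same range
curve(plogis(x), from = -3, to = 3, xlab = expression(eta), ylab = "probability")
curve(1 - exp(-exp(x)), add = TRUE, lty = 2)  ## inverse cloglog saturates much faster for x > 0
legend("topleft", lty = 1:2, legend = c("inverse-logit", "inverse-cloglog"))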
The fundamental difference between the power-logistic model and the cloglog/log-hazard models is in the effects of covariates on the baseline (i.e., per-unit-time) mortality probability. The exponential decrease in survival with time (or \(\exp(-t^\gamma)\) for the Weibull case) is a useful, parsimonious model for time-dependence, but there’s no intrinsic reason that effects of covariates on daily survival/mortality probabilities should also follow a hazard model …
I get e-mail about this document a few times a year, implying that people are still using the power-logistic approach. I would be interested in hearing arguments or explanations from biostatisticians or ecologists about why the power-logistic might be preferred: is it just historical, does it fit the data better, or does it have advantages I haven’t thought of?
Some example data on red-winged blackbird nest survivorship in Brant County, Ontario from Reta Meng (part of her undergraduate thesis with Dr. Pat Chow-Fraser of McMaster University):
pkgs <- sort(c("lme4", "RTMB", "bbmle", "glmmTMB",
"tidyverse", "patchwork",
"emmeans", "visreg", "sjPlot", "effects",
"broom","broom.mixed"))
invisible(sapply(pkgs,library,character.only=TRUE))
theme_set(theme_bw())
require(knitr)
opts_chunk$set(tidy=FALSE,fig.width=5,fig.height=5)
Package versions:
## bbmle broom broom.mixed effects emmeans glmmTMB
## 1.0.26 1.0.8 0.2.9.7 4.2.2 1.11.1 1.1.11.9000
## lme4 patchwork RTMB sjPlot tidyverse visreg
## 1.1.37.9000 1.3.0 1.7 2.8.17 2.0.0 2.7.0
Parameters:
Exposure = exposure days
Surv = survival of the nest contents (1 = nest has contents, 0 = nest empty)
Survive = outcome of the nest (1 = successful, 0 = failure)
NestHeight = height of nest
AverageSPL = sound pressure level near the nest
DistRoad = distance from nest to road
WaterDepth = water depth surrounding the nest
dat <- read.csv("brant_survive.csv")
## orig_names <- scan("brant_survive.csv",what=character(),nlines=1,sep=",")
dat2 <- subset(dat,
select=c(Nest.ID,Exposure,Date,Surv,
NestHeight,AverageSPL,DistEdge,
DistWater,DistRoad))
dat2S <- subset(dat2,Exposure>0)
Download CSV (or get data from GitHub)
Total samples: 174 observations, 36 nests.
Marginal plots of survival prob vs predictors:
mdat <- pivot_longer(dat2, -(1:4), names_to = "variable")
ggplot(mdat,aes(x=value,y=Surv))+
geom_point(alpha=0.3, aes(size=Exposure))+
geom_smooth(method="gam", method.args = list(family = "binomial"),
formula = y ~ s(x, k = 12)) +
facet_wrap(~variable,scale="free_x")+
coord_cartesian(ylim=c(-0.05,1.05))+xlab("")+ylab("Survival")
In what follows I’m only going to model the effect of nest height on survivorship, but the example should generalize to any number of continuous or categorical predictors …
bbmle
The bbmle package offers a formula interface that lets us do general nonlinear MLE fitting reasonably conveniently.
library(bbmle)
m_mle2 <- mle2(Surv~dbinom(plogis(mu)^Exposure,size=1),
parameters=list(mu~NestHeight),
start=list(mu=2),data=dat2S)
summary(m_mle2)
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = Surv ~ dbinom(plogis(mu)^Exposure, size = 1),
## start = list(mu = 2), data = dat2S, parameters = list(mu ~
## NestHeight))
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## mu.(Intercept) 1.81505 1.09315 1.6604 0.09684 .
## mu.NestHeight 1.09549 0.96283 1.1378 0.25521
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 136.0476
glm
Generalized linear modeling code can be hacked to fit the power-logistic link, by providing a custom link function. SAS and R code to do this is available from a ‘wayback machine’ archive of Terry Shaffer’s USGS web page.
This is an updated version of Mark Herzog’s R code (it should run in “recent” R versions, e.g. at least >= 3.5.0 [??]).
library(MASS)
logexp <- function(exposure = 1) {
## hack to help with visualization, post-prediction etc etc
## FIXME: can we do more to make sure an exposure arg is not
## accidentally masked by a global ..exposure?
get_exposure <- function() {
if (exists("..exposure", env=.GlobalEnv))
return(get("..exposure", envir=.GlobalEnv))
exposure
}
linkfun <- function(mu) qlogis(mu^(1/get_exposure()))
## FIXME: is there some trick we can play here to allow
## evaluation in the context of the 'data' argument?
linkinv <- function(eta) plogis(eta)^get_exposure()
logit_mu_eta <- function(eta) {
ifelse(abs(eta)>30,.Machine$double.eps,
exp(eta)/(1+exp(eta))^2)
}
mu.eta <- function(eta) {
get_exposure() * plogis(eta)^(get_exposure()-1) *
logit_mu_eta(eta)
}
valideta <- function(eta) TRUE
link <- paste("logexp(", deparse(substitute(exposure)), ")",
sep="")
structure(list(linkfun = linkfun, linkinv = linkinv,
mu.eta = mu.eta, valideta = valideta,
name = link),
class = "link-glm")
}
This is basically a modified version of the logit link function produced by binomial(). The original version used .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats"), but I decided it was safer (if marginally slower) to write a pure-R version of C_logit_mu_eta and include it in the function.
Now use it:
m_glm <- glm(Surv~NestHeight,
family=binomial(link=logexp(dat2S$Exposure)),
data=dat2S,start=c(1,0))
summary(m_glm)
##
## Call:
## glm(formula = Surv ~ NestHeight, family = binomial(link = logexp(dat2S$Exposure)),
## data = dat2S, start = c(1, 0))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8150 1.1141 1.629 0.103
## NestHeight 1.0955 0.9791 1.119 0.263
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 122.65 on 167 degrees of freedom
## Residual deviance: 136.05 on 166 degrees of freedom
## AIC: 140.05
##
## Number of Fisher Scoring iterations: 5
The way glm is set up, the exposure variable can’t be drawn from inside the data argument, which makes it harder to use methods such as predict on the results of the fit. Setting ..exposure globally provides a hack around this (see below).
RTMB
RTMB is a general framework for writing likelihood functions (including those with random effects) that can be fitted efficiently. Setting up the model is slightly more complicated in RTMB (below I’ve made it slightly more complex than absolutely necessary, for generality). (More info on RTMB here and here.)
nllfun <- function(par) {
getAll(par, tmbdata) ## 'attach' parameter components and data
eta <- X %*% beta ## compute predicted log-odds of mortality
prob0 <- plogis(eta) ## probability per day
prob <- prob0^Exposure ## overall probability
REPORT(prob) ## for extracting computed info
ADREPORT(eta)
Surv <- OBS(Surv) ## enable simulation from the model
-sum(dbinom(Surv, prob, size = 1, log = TRUE)) ## compute negative log-likelihood
}
par0 <- list(beta = c(0,0)) ## starting parameter values
X <- model.matrix(~ 1 + NestHeight, data = dat2S) ## model for linear predictor (could add other preds etc)
tmbdata <- c(dat2S, list(X=X)) ## encapsulate data needed by the model
nll0 <- nllfun(par0) ## test base-R function
tmbfun <- MakeADFun(nllfun, par0, silent = TRUE) ## construct TMB object
class(tmbfun) <- "TMB" ## for later convenience (broom.mixed)
stopifnot(all.equal(tmbfun$fn(), nll0)) ## test TMB function
tmbfit <- with(tmbfun, nlminb(par, fn, gr)) ## fit model
Model summary:
glance(tmbfun)
## # A tibble: 1 × 4
## df logLik AIC BIC
## <int> <dbl> <dbl> <dbl>
## 1 2 -68.0 140. NA
tidy(tmbfun, conf.int = TRUE) |>
## unfortunately RTMB doesn't carry parameter names along
dplyr::mutate(term = colnames(X))
## type term estimate std.error conf.low conf.high
## 1 fixed (Intercept) 1.815019 1.0931417 -0.3274994 3.957537
## 2 fixed NestHeight 1.095502 0.9628243 -0.7915985 2.982603
In principle, we should also be able to implement this model with random effects (i.e., allowing for variation in baseline survival among nests) in either lme4 or in RTMB. (Doing it in bbmle is essentially impossible.)
Random effect of Nest.ID:
library("lme4")
m_glmer <- glmer(Surv~NestHeight+(1|Nest.ID),
family=binomial(link=logexp(dat2S$Exposure)),
data=dat2S,start=list(theta=1,fixef=c(1,0)))
summary(m_glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logexp(dat2S$Exposure) )
## Formula: Surv ~ NestHeight + (1 | Nest.ID)
## Data: dat2S
##
## AIC BIC logLik -2*log(L) df.resid
## 138.5 147.9 -66.2 132.5 165
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.0725 0.1819 0.2670 0.3420 0.7222
##
## Random effects:
## Groups Name Variance Std.Dev.
## Nest.ID (Intercept) 1.229 1.108
## Number of obs: 168, groups: Nest.ID, 30
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.541 1.539 1.001 0.317
## NestHeight 1.436 1.341 1.071 0.284
##
## Correlation of Fixed Effects:
## (Intr)
## NestHeight -0.974
This “works” (but I don’t know if it’s correct).
We can use RTMB’s built-in machinery to add a random effect.
nllfun_RE <- function(par) {
getAll(par, tmbdata_RE)
## must come *before* we try to use b ...
nllpen <- -sum(dnorm(b, 0, exp(lognestSD), log = TRUE))
eta <- drop(X %*% beta + Z %*% b) ## now with random (nest) effects included
prob0 <- 1/(1+exp(-eta)) ## plogis(eta) makes simulation machinery unhappy?
prob <- prob0^Exposure
REPORT(prob)
ADREPORT(eta)
Surv <- OBS(Surv)
nll <- -sum(dbinom(Surv, prob, size = 1, log = TRUE))
nll + nllpen
}
Z <- t(Matrix::fac2sparse(dat2S$Nest.ID))
tmbdata_RE <- c(tmbdata, list(Z=Z))
par0_RE <- list(
## beta=unname(tmbfit$par),
beta = unname(fixef(m_glmer)),
b = rep(0, ncol(Z)),
lognestSD = 0)
## nllfun(par0_RE)
tmbfun_RE <- MakeADFun(nllfun_RE, par0_RE, silent = TRUE, random = "b")
class(tmbfun_RE) <- "TMB"
tmbfit_RE <- with(tmbfun_RE, nlminb(par, fn, gr))
Easy to simulate from the model, either with existing parameters or with modified params:
summary(tmbfun_RE$simulate()$prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3159 0.7924 0.8946 0.8463 0.9515 0.9950
pp <- tmbfun_RE$env$last.par
pp[1] <- -5 ## decrease intercept drastically
summary(tmbfun_RE$simulate(par = pp)$prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 7.750e-07 1.831e-04 1.190e-02 3.942e-03 2.293e-01
Model comparison:
tt0 <- tidy(tmbfun_RE, conf.int = TRUE) |>
dplyr::mutate(term = c(colnames(X), "lognestSD"),
model = "RTMB")
tt1 <-tidy(m_glmer, conf.int = TRUE) |>
dplyr::mutate(term = ifelse(term == "sd__(Intercept)", "lognestSD", term),
estimate = ifelse(term == "lognestSD", log(estimate), estimate),
model = "glmer")
tt_comb <- bind_rows(tt0, tt1) |> dplyr::select(model, term, estimate, lwr = conf.low, upr = conf.high)
ggplot(tt_comb, aes(estimate, term, colour = model)) +
geom_pointrange(aes(xmin = lwr, xmax = upr),
position = position_dodge(width = 0.1))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).
If we use cloglog we need to make sure the offset is handled properly by whatever prediction/plotting machinery we are using downstream; if we use link=logexp we need to set a ..exposure variable in the global environment (if we are plotting on the response (probability) scale) to allow the hack we defined above to work. Make sure to remove ..exposure afterwards to avoid confusion …
..exposure <- mean(dat2S$Exposure)
visreg(m_glmer, "NestHeight", scale="response")
rm(..exposure) ## clean up!
..exposure <- mean(dat2S$Exposure)
sjPlot::plot_model(m_glmer,type="pred")
rm(..exposure) ## clean up!
plot(emmeans(m_glmer, ~NestHeight, at=list(NestHeight=1:10))) ## "works"
..exposure <- mean(dat2S$Exposure)
plot(allEffects(m_glmer, type="response"))
rm(..exposure)
Is one or the other objectively better (in terms of AIC)?
m_glmer_cloglog <- update(m_glmer,
Surv ~ . + offset(log(Exposure)),
family = binomial(link = "cloglog"),
start = NULL,
control = glmerControl(optimizer = "bobyqa"))
m_glmmTMB_cloglog <- glmmTMB(
formula = formula(m_glmer_cloglog),
family = binomial(link = "cloglog"),
data = dat2S)
mod_list <- list(glmer_cloglog = m_glmer_cloglog,
glmmTMB_cloglog = m_glmmTMB_cloglog,
glmer_powlogit = m_glmer,
RTMB_powlogit = tmbfun_RE)
purrr::map_dfr(mod_list, glance, .id = "model") |>
mutate(NLL = max(logLik) - logLik) |>
dplyr::select(model, NLL) |>
arrange(NLL)
## # A tibble: 4 × 2
## model NLL
## <chr> <dbl>
## 1 glmmTMB_cloglog 0
## 2 glmer_cloglog 0.308
## 3 RTMB_powlogit 10.5
## 4 glmer_powlogit 10.5
Looks like glmmTMB finds a slightly better solution than glmer … and cloglog is much better than power-logistic. Surprising that we can tell … maybe not reliable for this small a data set?
Compare predictions?
Most nests ‘die’ (are predated) only once, but nest 2AF has 3 cases with Surv == 0 (is this a typo?):
with(subset(dat2S, Surv == 0),
table(Nest.ID) |> table())
##
## 1 3
## 17 1
subset(dat2S, Nest.ID == "2AF")
## Nest.ID Exposure Date Surv NestHeight AverageSPL DistEdge DistWater
## 24 2AF 3 10/05/2012 0 1 57.08 2.35 7.02
## 25 2AF 4 14/05/2012 0 1 57.08 2.35 7.02
## 26 2AF 2 16/05/2012 0 1 57.08 2.35 7.02
## DistRoad
## 24 20.62
## 25 20.62
## 26 20.62
If every nest either survives the whole time or is predated once and then is gone (i.e. not observed again), it might not be quite right to treat each observation separately. Adding a random effect of individual might help a little, but might be papering over the cracks; instead, we ought to model survival times with a (possibly censored) geometric or exponential distribution.
Check geometric/exponential equivalence:
dgeom(2,0.5)
## [1] 0.125
diff(pexp(2:3,-log(0.5)))
## [1] 0.125
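(This works because the geometric probability \(P(X=2)=(1-p)^2 p = 0.125\) for \(p=0.5\) equals the probability that an exponential waiting time with rate \(-\log(1-p)=\log 2\) falls in the interval \((2,3]\): \(e^{-2\log 2}-e^{-3\log 2}=0.25-0.125=0.125\).)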
library("plyr")
dat_agg <- ddply(dat2S,"Nest.ID",
summarize,Exposure=sum(Exposure),
Surv=tail(Surv,1),
NestHeight=NestHeight[1],AverageSPL=AverageSPL[1],
DistEdge=DistEdge[1],
DistWater=DistWater[1],DistRoad=DistRoad[1])
dcgeom <- function(x,prob,cens,log=FALSE) {
ifelse(!cens,dgeom(x,prob,log=log),
pgeom(x,prob,lower.tail=FALSE,log.p=log))
}
dcexp <- function(x,rate,cens,log=FALSE) {
  ## censored exponential: density for observed failures, upper tail for censored obs
  ifelse(!cens,dexp(x,rate,log=log),
         pexp(x,rate,lower.tail=FALSE,log.p=log))
}
with(dat_agg,-sum(dcgeom(Exposure,prob=0.5,cens=Surv,log=TRUE)))
mle2(Exposure~dcgeom(prob=plogis(mu),cens=Surv),
parameters=list(mu~NestHeight),
start=list(mu=0),data=dat_agg)
To do:
fixme: use exponential instead of geometric distribution (see the sketch below)
loglog link for modeling mortality? e.g. https://github.com/trobinj/trtools/blob/master/R/loglog.R (although it doesn’t have the ‘clamping’ stuff that the built-in cloglog has …)
tmbstan solutions?
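A minimal sketch of the exponential version (using the dcexp helper defined above and the same formula-interface pattern as the geometric fit; untested, and the parameter name lograte is illustrative):
## censored-exponential fit: log(rate) modeled as a linear function of nest height
m_exp <- mle2(Exposure ~ dcexp(rate = exp(lograte), cens = Surv),
              parameters = list(lograte ~ NestHeight),
              start = list(lograte = 0), data = dat_agg)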