A very common situation in ecology (and elsewhere) is a
binary-outcome survival model where individuals, each measured once,
differ in their exposure, typically because they have been observed for
different lengths of time (this is different from typical survival
analysis, where we know either the exact moment that someone died,
or that they were still alive when we stopped observing them). The
classic statistical approach to this problem is to use a
complementary log-log (“cloglog”) link (Heisey, Shaffer, and White 2007). The
cloglog
function is \(C(\mu)=\log(-\log(1-\mu))\); its inverse is
\(\mu=C^{-1}(\eta) =
1-\exp(-\exp(\eta))\). Thus if we expect mortality \(\mu_0\) over a period \(\Delta t=1\), so that the linear predictor is \(\eta=C(\mu_0)\), then \[
C^{-1}(\eta+\log \Delta t)=1-\exp(-\exp(\eta) \cdot \Delta t).
\] Since \(\exp(\eta)=-\log(1-\mu_0)\), this equals \(1-(1-\mu_0)^{\Delta t}\), which is what we want.
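A quick numerical check of the offset algebra (values chosen arbitrarily for illustration):
## per-day mortality 0.2, observed over 3 days of exposure
mu0 <- 0.2; dt <- 3
eta <- log(-log(1 - mu0))        ## cloglog link applied to mu0
1 - exp(-exp(eta + log(dt)))     ## inverse cloglog with offset log(dt): 0.488
1 - (1 - mu0)^dt                 ## direct calculation: also 0.488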
The function \(\exp(-\exp(x))\) is called the Gompertz function (the inverse-cloglog function \(1-\exp(-\exp(x))\) is the CDF of the minimum extreme-value distribution), so fitting a model with this inverse-link function (i.e. fitting a cloglog link to the survival, rather than the mortality, probability) is called a gompit (or extreme-value) regression.
The bottom line is that if we change the form of the dependence of survival on covariates from logistic to complementary log-log, we can sneak the exposure variable in via an offset.
To use this approach in R, use family=binomial(link="cloglog") and add a term of the form offset(log(A)) to the model formula (some modeling functions take offset as a separate argument).
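A minimal sketch of this approach (a hypothetical data frame dd with binary outcome y, covariate x, and exposure time A; the names are illustrative only):
## cloglog link with a log-exposure offset (offset coefficient fixed at 1)
m <- glm(y ~ x + offset(log(A)),
         family = binomial(link = "cloglog"), data = dd)
## some modeling functions instead take offset = log(A) as a separate argument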
If instead of using log(A) as an offset you allow it to be a covariate (predictor, independent variable, whatever you want to call it), that implies a log-hazard that increases or decreases linearly with log-exposure → a hazard that is a power-law function of exposure → a Weibull distribution for the survival times (when the coefficient is 1, i.e. if we use an offset, this reduces to exponentially distributed survival times).
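To spell out the Weibull connection (a short derivation following the algebra above, with \(\beta\) denoting the coefficient of \(\log \Delta t\)): \[
C^{-1}(\eta + \beta \log \Delta t) = 1 - \exp\left(-\exp(\eta)\, \Delta t^{\beta}\right),
\] so the implied survival function \(\exp(-\exp(\eta)\,\Delta t^{\beta})\) is Weibull in \(\Delta t\), collapsing to the exponential case when \(\beta = 1\).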
Suppose we don’t want to do this, and instead want to stick with the logistic form, but want to have the link function be \[ \mu = \left( \text{Logistic}(\eta) \right)^e, \] where \(e\) is the exposure (typically, but not necessarily, an integer). I will call this the power-logistic link approach; it is popular in analyses of nest survival (Shaffer 2004). The code below explores both the complementary log-log and power-logistic approaches. The complementary log-log approach is in some ways more computationally convenient than the power-logistic (it can be used in any GLM-based framework that provides a cloglog link option for binomial models). However, the inverse-cloglog function increases extremely quickly (relative to the inverse-logit/logistic) for \(x>0\):
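A quick base-R sketch of this comparison (an illustration, not the original figure):
## compare the two inverse-link functions over the same range
curve(plogis(x), from = -3, to = 3, xlab = expression(eta), ylab = "probability")
curve(1 - exp(-exp(x)), add = TRUE, lty = 2)  ## inverse cloglog saturates much faster for x > 0
legend("topleft", lty = 1:2, legend = c("inverse-logit", "inverse-cloglog"))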
The fundamental difference between the power-logistic model and the cloglog/log-hazard models is in the effects of covariates on the baseline (i.e., per-unit-time) mortality probability. The exponential decrease in survival with time (or \(\exp(-t^\gamma)\) for the Weibull case) is a useful, parsimonious model for time-dependence, but there’s no intrinsic reason that effects of covariates on daily survival/mortality probabilities should also follow a hazard model …
I get e-mail about this document a few times a year, implying that people are still using the power-logistic approach. I would be interested in hearing arguments or explanations from biostatisticians or ecologists about why the power-logistic might be preferred: is it just historical, does it fit the data better, or does it have advantages I haven’t thought of?
Some example data on red-winged blackbird nest survivorship in Brant County, Ontario from Reta Meng (part of her undergraduate thesis with Dr. Pat Chow-Fraser of McMaster University):
pkgs <- sort(c("lme4", "RTMB", "bbmle", "glmmTMB",
"tidyverse", "patchwork",
"emmeans", "visreg", "sjPlot", "effects",
"broom","broom.mixed"))
invisible(sapply(pkgs,library,character.only=TRUE))
theme_set(theme_bw())
require(knitr)
opts_chunk$set(tidy=FALSE,fig.width=5,fig.height=5)
Package versions:
## bbmle broom broom.mixed effects emmeans glmmTMB
## 1.0.26 1.0.8 0.2.9.7 4.2.2 1.11.1 1.1.11.9000
## lme4 patchwork RTMB sjPlot tidyverse visreg
## 1.1.37.9000 1.3.0 1.7 2.8.17 2.0.0 2.7.0
Parameters:
Exposure = exposure days
Surv = survival of the nest contents (1 = nest has contents, 0 = nest empty)
Survive = outcome of the nest (1 = successful, 0 = failure)
NestHeight = height of nest
AverageSPL = sound pressure level near the nest
DistRoad = distance from nest to road
WaterDepth = water depth surrounding the nest
dat <- read.csv("brant_survive.csv")
## orig_names <- scan("brant_survive.csv",what=character(),nlines=1,sep=",")
dat2 <- subset(dat,
select=c(Nest.ID,Exposure,Date,Surv,
NestHeight,AverageSPL,DistEdge,
DistWater,DistRoad))
dat2S <- subset(dat2,Exposure>0)
Download CSV (or get data from GitHub)
Total samples: 174 observations, 36 nests.
Marginal plots of survival prob vs predictors:
mdat <- pivot_longer(dat2, -(1:4), names_to = "variable")
ggplot(mdat,aes(x=value,y=Surv))+
geom_point(alpha=0.3, aes(size=Exposure))+
geom_smooth(method="gam", method.args = list(family = "binomial"),
formula = y ~ s(x, k = 12)) +
facet_wrap(~variable,scale="free_x")+
coord_cartesian(ylim=c(-0.05,1.05))+xlab("")+ylab("Survival")
In what follows I’m only going to model the effect of nest height on survivorship, but the example should generalize to any number of continuous or categorical predictors …
bbmle
The bbmle package offers a formula interface that lets us do general nonlinear MLE fitting reasonably conveniently.
library(bbmle)
m_mle2 <- mle2(Surv~dbinom(plogis(mu)^Exposure,size=1),
parameters=list(mu~NestHeight),
start=list(mu=2),data=dat2S)
summary(m_mle2)
## Maximum likelihood estimation
##
## Call:
## mle2(minuslogl = Surv ~ dbinom(plogis(mu)^Exposure, size = 1),
## start = list(mu = 2), data = dat2S, parameters = list(mu ~
## NestHeight))
##
## Coefficients:
## Estimate Std. Error z value Pr(z)
## mu.(Intercept) 1.81505 1.09315 1.6604 0.09684 .
## mu.NestHeight 1.09549 0.96283 1.1378 0.25521
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## -2 log L: 136.0476
glm
Generalized linear modeling code can be hacked to fit the power-logistic link, by providing a custom link function. SAS and R code to do this is available from a ‘wayback machine’ archive of Terry Shaffer’s USGS web page.
This is an updated version of Mark Herzog’s R code (it should run in “recent” R versions, e.g. at least >= 3.5.0 [??]).
library(MASS)
logexp <- function(exposure = 1) {
## hack to help with visualization, post-prediction etc etc
## FIXME: can we do more to make sure an exposure arg is not
## accidentally masked by a global ..exposure?
get_exposure <- function() {
if (exists("..exposure", env=.GlobalEnv))
return(get("..exposure", envir=.GlobalEnv))
exposure
}
linkfun <- function(mu) qlogis(mu^(1/get_exposure()))
## FIXME: is there some trick we can play here to allow
## evaluation in the context of the 'data' argument?
linkinv <- function(eta) plogis(eta)^get_exposure()
logit_mu_eta <- function(eta) {
ifelse(abs(eta)>30,.Machine$double.eps,
exp(eta)/(1+exp(eta))^2)
}
mu.eta <- function(eta) {
get_exposure() * plogis(eta)^(get_exposure()-1) *
logit_mu_eta(eta)
}
valideta <- function(eta) TRUE
link <- paste("logexp(", deparse(substitute(exposure)), ")",
sep="")
structure(list(linkfun = linkfun, linkinv = linkinv,
mu.eta = mu.eta, valideta = valideta,
name = link),
class = "link-glm")
}
This is basically a modified version of the logit link function produced by binomial(). The original version used .Call(stats:::C_logit_mu_eta, eta, PACKAGE = "stats"), but I decided it was safer (if marginally slower) to write a pure-R version of C_logit_mu_eta and include it in the function.
Now use it:
m_glm <- glm(Surv~NestHeight,
family=binomial(link=logexp(dat2S$Exposure)),
data=dat2S,start=c(1,0))
summary(m_glm)
##
## Call:
## glm(formula = Surv ~ NestHeight, family = binomial(link = logexp(dat2S$Exposure)),
## data = dat2S, start = c(1, 0))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.8150 1.1141 1.629 0.103
## NestHeight 1.0955 0.9791 1.119 0.263
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 122.65 on 167 degrees of freedom
## Residual deviance: 136.05 on 166 degrees of freedom
## AIC: 140.05
##
## Number of Fisher Scoring iterations: 5
The way glm is set up, the exposure variable can’t be drawn from inside the data argument, which makes it harder to use methods such as predict on the results of the fit. Setting ..exposure globally provides a hack around this (see below).
RTMB
RTMB is a general framework for writing likelihood functions (including those with random effects) that can be fitted efficiently. Setting up the model is slightly more complicated in RTMB (below I’ve made it slightly more complex than absolutely necessary, for generality). (More info on RTMB here and here.)
nllfun <- function(par) {
getAll(par, tmbdata) ## 'attach' parameter components and data
eta <- X %*% beta ## compute predicted log-odds of mortality
prob0 <- plogis(eta) ## probability per day
prob <- prob0^Exposure ## overall probability
REPORT(prob) ## for extracting computed info
ADREPORT(eta)
Surv <- OBS(Surv) ## enable simulation from the model
-sum(dbinom(Surv, prob, size = 1, log = TRUE)) ## compute negative log-likelihood
}
par0 <- list(beta = c(0,0)) ## starting parameter values
X <- model.matrix(~ 1 + NestHeight, data = dat2S) ## model for linear predictor (could add other preds etc)
tmbdata <- c(dat2S, list(X=X)) ## encapsulate data needed by the model
nll0 <- nllfun(par0) ## test base-R function
tmbfun <- MakeADFun(nllfun, par0, silent = TRUE) ## construct TMB object
class(tmbfun) <- "TMB" ## for later convenience (broom.mixed)
stopifnot(all.equal(tmbfun$fn(), nll0)) ## test TMB function
tmbfit <- with(tmbfun, nlminb(par, fn, gr)) ## fit model
Model summary:
glance(tmbfun)
## # A tibble: 1 × 4
## df logLik AIC BIC
## <int> <dbl> <dbl> <dbl>
## 1 2 -68.0 140. NA
tidy(tmbfun, conf.int = TRUE) |>
## unfortunately RTMB doesn't carry parameter names along
dplyr::mutate(term = colnames(X))
## type term estimate std.error conf.low conf.high
## 1 fixed (Intercept) 1.815019 1.0931417 -0.3274994 3.957537
## 2 fixed NestHeight 1.095502 0.9628243 -0.7915985 2.982603
In principle, we should also be able to implement this model with random effects (i.e., allowing for variation in baseline survival among nests) in either lme4 or in RTMB. (Doing it in bbmle is essentially impossible.)
Random effect of Nest.ID:
library("lme4")
m_glmer <- glmer(Surv~NestHeight+(1|Nest.ID),
family=binomial(link=logexp(dat2S$Exposure)),
data=dat2S,start=list(theta=1,fixef=c(1,0)))
summary(m_glmer)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logexp(dat2S$Exposure) )
## Formula: Surv ~ NestHeight + (1 | Nest.ID)
## Data: dat2S
##
## AIC BIC logLik -2*log(L) df.resid
## 138.5 147.9 -66.2 132.5 165
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.0725 0.1819 0.2670 0.3420 0.7222
##
## Random effects:
## Groups Name Variance Std.Dev.
## Nest.ID (Intercept) 1.229 1.108
## Number of obs: 168, groups: Nest.ID, 30
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.541 1.539 1.001 0.317
## NestHeight 1.436 1.341 1.071 0.284
##
## Correlation of Fixed Effects:
## (Intr)
## NestHeight -0.974
This “works” (but I don’t know if it’s correct).
We can use RTMB’s built-in machinery to add a random effect.
nllfun_RE <- function(par) {
getAll(par, tmbdata_RE)
## must come *before* we try to use b ...
nllpen <- -sum(dnorm(b, 0, exp(lognestSD), log = TRUE))
eta <- drop(X %*% beta + Z %*% b) ## now with random (nest) effects included
prob0 <- 1/(1+exp(-eta)) ## plogis(eta) makes simulation machinery unhappy?
prob <- prob0^Exposure
REPORT(prob)
ADREPORT(eta)
Surv <- OBS(Surv)
nll <- -sum(dbinom(Surv, prob, size = 1, log = TRUE))
nll + nllpen
}
Z <- t(Matrix::fac2sparse(dat2S$Nest.ID))
tmbdata_RE <- c(tmbdata, list(Z=Z))
par0_RE <- list(
## beta=unname(tmbfit$par),
beta = unname(fixef(m_glmer)),
b = rep(0, ncol(Z)),
lognestSD = 0)
## nllfun(par0_RE)
tmbfun_RE <- MakeADFun(nllfun_RE, par0_RE, silent = TRUE, random = "b")
class(tmbfun_RE) <- "TMB"
tmbfit_RE <- with(tmbfun_RE, nlminb(par, fn, gr))
Easy to simulate from the model, either with existing parameters or with modified params:
summary(tmbfun_RE$simulate()$prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3159 0.7924 0.8946 0.8463 0.9515 0.9950
pp <- tmbfun_RE$env$last.par
pp[1] <- -5 ## decrease intercept drastically
summary(tmbfun_RE$simulate(par = pp)$prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 7.750e-07 1.831e-04 1.190e-02 3.942e-03 2.293e-01
Model comparison:
tt0 <- tidy(tmbfun_RE, conf.int = TRUE) |>
dplyr::mutate(term = c(colnames(X), "lognestSD"),
model = "RTMB")
tt1 <-tidy(m_glmer, conf.int = TRUE) |>
dplyr::mutate(term = ifelse(term == "sd__(Intercept)", "lognestSD", term),
estimate = ifelse(term == "lognestSD", log(estimate), estimate),
model = "glmer")
tt_comb <- bind_rows(tt0, tt1) |> dplyr::select(model, term, estimate, lwr = conf.low, upr = conf.high)
ggplot(tt_comb, aes(estimate, term, colour = model)) +
geom_pointrange(aes(xmin = lwr, xmax = upr),
position = position_dodge(width = 0.1))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_segment()`).
If we use cloglog we need to make sure the offset is handled properly by whatever prediction/plotting machinery we are using downstream; if we use link=logexp we need to set a ..exposure variable in the global environment (if we are plotting on the response (probability) scale) to allow the hack we defined above to work. Make sure to remove ..exposure afterwards to avoid confusion …
..exposure <- mean(dat2S$Exposure)
visreg(m_glmer, "NestHeight", scale="response")
rm(..exposure) ## clean up!
..exposure <- mean(dat2S$Exposure)
sjPlot::plot_model(m_glmer,type="pred")
rm(..exposure) ## clean up!
plot(emmeans(m_glmer, ~NestHeight, at=list(NestHeight=1:10))) ## "works"
..exposure <- mean(dat2S$Exposure)
plot(allEffects(m_glmer, type="response"))
rm(..exposure)
Is one or the other objectively better (in terms of AIC)?
m_glmer_cloglog <- update(m_glmer,
Surv ~ . + offset(log(Exposure)),
family = binomial(link = "cloglog"),
start = NULL,
control = glmerControl(optimizer = "bobyqa"))
m_glmmTMB_cloglog <- glmmTMB(
formula = formula(m_glmer_cloglog),
family = binomial(link = "cloglog"),
data = dat2S)
mod_list <- list(glmer_cloglog = m_glmer_cloglog,
glmmTMB_cloglog = m_glmmTMB_cloglog,
glmer_powlogit = m_glmer,
RTMB_powlogit = tmbfun_RE)
purrr::map_dfr(mod_list, glance, .id = "model") |>
mutate(NLL = max(logLik) - logLik) |>
dplyr::select(model, NLL) |>
arrange(NLL)
## # A tibble: 4 × 2
## model NLL
## <chr> <dbl>
## 1 glmmTMB_cloglog 0
## 2 glmer_cloglog 0.308
## 3 RTMB_powlogit 10.5
## 4 glmer_powlogit 10.5
Looks like glmmTMB finds a slightly better solution than glmer … and cloglog is much better than power-logistic. Surprising that we can tell … maybe not reliable for this small a data set?
Compare predictions?
Most nests ‘die’ (are predated) only once, but nest 2AF has 3 cases with Surv == 0 (is this a typo?):
with(subset(dat2S, Surv == 0),
table(Nest.ID) |> table())
##
## 1 3
## 17 1
subset(dat2S, Nest.ID == "2AF")
## Nest.ID Exposure Date Surv NestHeight AverageSPL DistEdge DistWater
## 24 2AF 3 10/05/2012 0 1 57.08 2.35 7.02
## 25 2AF 4 14/05/2012 0 1 57.08 2.35 7.02
## 26 2AF 2 16/05/2012 0 1 57.08 2.35 7.02
## DistRoad
## 24 20.62
## 25 20.62
## 26 20.62
If every nest either survives the whole time or is predated once and then is gone (i.e. not observed again), it might not be quite right to treat each observation separately. Adding a random effect of individual might help a little, but might be papering over the cracks; instead, we ought to model survival times with a (possibly censored) geometric or exponential distribution.
Check geometric/exponential equivalence:
dgeom(2,0.5)
## [1] 0.125
diff(pexp(2:3,-log(0.5)))
## [1] 0.125
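(This works because the geometric probability \(P(X=2)=(1-p)^2 p = 0.125\) for \(p=0.5\) equals the probability that an exponential waiting time with rate \(-\log(1-p)=\log 2\) falls in the interval \((2,3]\): \(e^{-2\log 2}-e^{-3\log 2}=0.25-0.125=0.125\).)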
library("plyr")
dat_agg <- ddply(dat2S,"Nest.ID",
summarize,Exposure=sum(Exposure),
Surv=tail(Surv,1),
NestHeight=NestHeight[1],AverageSPL=AverageSPL[1],
DistEdge=DistEdge[1],
DistWater=DistWater[1],DistRoad=DistRoad[1])
dcgeom <- function(x,prob,cens,log=FALSE) {
ifelse(!cens,dgeom(x,prob,log=log),
pgeom(x,prob,lower.tail=FALSE,log.p=log))
}
dcexp <- function(x,rate,cens,log=FALSE) {
  ## censored exponential: density for observed failures, upper tail for censored obs
  ifelse(!cens,dexp(x,rate,log=log),
         pexp(x,rate,lower.tail=FALSE,log.p=log))
}
with(dat_agg,-sum(dcgeom(Exposure,prob=0.5,cens=Surv,log=TRUE)))
mle2(Exposure~dcgeom(prob=plogis(mu),cens=Surv),
parameters=list(mu~NestHeight),
start=list(mu=0),data=dat_agg)
To do:
fixme: use exponential instead of geometric distribution (see the sketch below)
loglog link for modeling mortality? e.g. https://github.com/trobinj/trtools/blob/master/R/loglog.R (although it doesn’t have the ‘clamping’ stuff that the built-in cloglog has …)
tmbstan solutions?
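A minimal sketch of the exponential version (using the dcexp helper defined above and the same formula-interface pattern as the geometric fit; untested, and the parameter name lograte is illustrative):
## censored-exponential fit: log(rate) modeled as a linear function of nest height
m_exp <- mle2(Exposure ~ dcexp(rate = exp(lograte), cens = Surv),
              parameters = list(lograte ~ NestHeight),
              start = list(lograte = 0), data = dat_agg)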