overview

obligatory Tolstoy quote

  • “all happy families are alike; each unhappy family is unhappy in its own way”
  • … but there are certainly recurring categories of problems

categories of warnings

  • messages vs warnings vs errors
  • convergence (numerical optimization questionable)
  • singular fits (model is degenerate)
    • other forms of degeneracy (complete separation etc.)
  • model diagnostics issues (model is inappropriate)

messages vs warnings vs errors

  • message = informational
  • warning = potentially serious problem. Don’t ignore it unless you understand it; if you do understand it, eliminate the warning as far upstream as possible (e.g. by changing options/settings; suppressWarnings() only if there is no other choice)
  • error = so serious that R/the package author can’t or won’t give you an answer (it may be possible to override or work around it …)
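A minimal sketch of the three levels, including silencing a warning as a last resort:

log(-1)                    ## warning: "NaNs produced" (potentially serious)
suppressWarnings(log(-1))  ## last resort: still NaN, warning silenced
message("just informational")
try(stop("no answer returned"))  ## error (wrapped in try() so the session continues)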

load packages

library(tidyverse)    ## data manipulation and plotting
library(broom.mixed)  ## tidy() methods for mixed models
library(lme4)         ## lmer(), glmer(), allFit(), ...
library(blme)         ## blmer()/bglmer(): regularized (MAP) mixed models
library(glmmTMB)      ## GLMMs via Template Model Builder
library(Matrix)       ## sparse-matrix infrastructure used by lme4

convergence problems

non-positive-definite Hessians

  • what does this mean?
  • modern mixed-model software does some form of non-linear optimization
    • sometimes just the covariance parameters, sometimes fixed + covariance parameters
    • (random-effects parameters are profiled out or estimated in an internal loop)
  • at the optimum we should have a gradient of zero and a positive definite Hessian

Hessian

\[ \left( \begin{array}{cccc} \frac{\partial^2 L}{\partial \beta_{1}^2} & \frac{\partial^2 L}{\partial \beta_{1} \partial \beta_{2}} & \frac{\partial^2 L}{\partial \beta_{1} \partial \beta_{3}} & \ldots \\ \frac{\partial^2 L}{\partial \beta_{2} \partial \beta_{1}} & \frac{\partial^2 L}{\partial \beta_{2}^2} & \frac{\partial^2 L}{\partial \beta_{2} \partial \beta_{3}} & \ldots \\ \frac{\partial^2 L}{\partial \beta_{3} \partial \beta_{1}} & \frac{\partial^2 L}{\partial \beta_{3} \partial \beta_{2}} & \frac{\partial^2 L}{\partial \beta_{3}^2} & \ldots \\ \vdots & \vdots & \vdots & \ddots \end{array} \right) \]

  • eigenvalues of the Hessian give the curvature along the principal directions (= the eigenvectors)
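Concretely, writing the eigendecomposition of the Hessian

\[ H = Q \Lambda Q^\top, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p), \]

the columns of \(Q\) are the principal directions and the \(\lambda_i\) the corresponding curvatures: at a proper optimum all \(\lambda_i > 0\), while a non-positive-definite Hessian has some \(\lambda_i \le 0\) (a flat or downward-curving direction).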

solutions

  • double-check the model definition and data
    • one random effect per observation (when the conditional distribution has a scale parameter)
    • all observations in a cluster have the same response value (e.g. here)
    • etc.
  • scale and center predictor variables (improves stability)
  • general advice in Bolker et al. (2013) [1] may be useful
  • extract the Hessian and see which eigenvector(s) are associated with the minimal eigenvalue(s) (sketch below)
    • extracting the Hessian often requires some digging, e.g. model@optinfo$derivs (lme4) or model$sdr$cov.fixed (glmmTMB)
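For example, with an lme4 fit (a sketch; model is a placeholder for your fitted merMod object):

hess <- model@optinfo$derivs$Hessian  ## Hessian of the deviance at the optimum
ee <- eigen(hess)
ee$values                             ## near-zero or negative values flag trouble
ee$vectors[, which.min(ee$values)]    ## which parameters load on the flattest direction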

more examples

  • many GLMM parameters have a bounded domain
    • variances/standard deviations/dispersion parameters (0 to \(\infty\))
    • probabilities (0 to 1)
  • the easiest way to handle bounded optimization is to fit a transformed, unconstrained version of the parameter
    • log (variance, SD, ratios thereof, odds ratios)
    • logit/log-odds (probabilities)
  • problems when the estimate is actually on the boundary of the original domain (transformed parameter \(\to \pm \infty\))
  • different packages handle this in different ways
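The boundary of the original domain maps to \(\pm\infty\) on the unconstrained scale:

log(c(1e-8, 1, 1e8))            ## variances/SDs: log maps (0, Inf) to (-Inf, Inf)
qlogis(c(1e-8, 0.5, 1 - 1e-8))  ## probabilities: logit maps (0, 1) to (-Inf, Inf)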

example: negative binomial dispersion

  • the negative binomial can only model overdispersion, not equi- or underdispersion
  • example:
set.seed(102)
dd <- data.frame(x = rpois(1000, lambda = 2))  ## equidispersed (Poisson) data
m1 <- MASS::glm.nb(x ~ 1, data = dd)           ## NB fit: theta heads toward infinity
environment(m1$family$variance)$.Theta         ## dig the estimated theta out of the family object
  • solutions
    • ignore it
    • revert to Poisson model
    • use a quasi-likelihood model (fit Poisson, then adjust SEs etc.)
    • use a distribution that allows underdispersion: glmmTMB’s genpois or compois families, or an ordinal model (sketch below)
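A sketch, refitting the equidispersed data above with glmmTMB’s generalized Poisson family:

m2 <- glmmTMB(x ~ 1, data = dd, family = genpois)
sigma(m2)  ## dispersion parameter; values < 1 indicate underdispersion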

example: non-zero-inflated models

  • zero-inflation models fit a finite mixture of “structural” and “sampling” zeros
  • “many zeros does not mean zero inflation”2
  • if the true zero-inflation probability is 0, its log-odds \(\to -\infty\)
  • similar to the NB dispersion case
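A sketch (reusing the simulated Poisson data dd from above, which has no structural zeros):

m_zi <- glmmTMB(x ~ 1, ziformula = ~ 1, data = dd, family = poisson)
fixef(m_zi)$zi  ## logit of the zero-inflation probability: large and negative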

lme4 convergence warnings

a specific (problematic!) case

  • an extra gradient/Hessian check on top of the optimizer’s own convergence checks
  • many false positives

diagnosing convergence warnings

  • allFit
  • are parameters close enough?
  • what is the distribution of negative log-likelihoods?
  • which parameters are unstable?
library(lme4)
source("R/conv_ex.R")
model1 <- lmer(eval ~ 1 + group * (emint_n + grade_n) + (1 + grade_n + emint_n | class), data = dd)
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00229598 (tol = 0.002, component 1)

Scaling the predictors (automatic scaling is on lme4’s to-do list) …

## center and scale the numeric (double) columns
double_cols <- sapply(dd, is.double)
dd_sc <- dd
dd_sc[double_cols] <- sapply(dd[double_cols], \(x) drop(scale(x)))
model2 <- update(model1, data = dd_sc)
library(broom.mixed) ## for tidy()
library(ggplot2)
source("R/allFit_utils.R")
## Loading required package: colorspace
aa <- allFit(model1)
## bobyqa : [OK]
## Nelder_Mead : [OK]
## nlminbwrap :
## boundary (singular) fit: see help('isSingular')
## [OK]
## nmkbw :
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## unable to evaluate scaled gradient
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge: degenerate Hessian with 2 negative eigenvalues
## [OK]
## optimx.L-BFGS-B : [OK]
## nloptwrap.NLOPT_LN_NELDERMEAD : [OK]
## nloptwrap.NLOPT_LN_BOBYQA :
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00229598 (tol = 0.002, component 1)
## [OK]
tt <- tidy(aa)

Check log-likelihoods: do most or all optimizers agree?

glance_allfit_NLL(aa)
## # A tibble: 7 × 2
##   optimizer                          NLL_rel
##   <chr>                                <dbl>
## 1 bobyqa                        0           
## 2 nloptwrap.NLOPT_LN_NELDERMEAD 0.0000000207
## 3 Nelder_Mead                   0.0000000284
## 4 optimx.L-BFGS-B               0.000000209 
## 5 nloptwrap.NLOPT_LN_BOBYQA     0.000000591 
## 6 nlminbwrap                    1.14        
## 7 nmkbw                         2.50

Run convergence check by hand:

checkConv(model1@optinfo$derivs, model1@optinfo$val, lmerControl()$checkConv, model1@lower)
## Warning in checkConv(model1@optinfo$derivs, model1@optinfo$val,
## lmerControl()$checkConv, : Model failed to converge with max|grad| = 0.00229598
## (tol = 0.002, component 1)
## $code
## [1] -1
## 
## $messages
## [1] "Model failed to converge with max|grad| = 0.00229598 (tol = 0.002, component 1)"

(in this case it is the scaled gradient that fails the test).

Check fixed-effect estimates across optimizers:

plot_allfit(aa) + theme_bw()

Check random effect components (may or may not be of interest):

plot_allfit(aa, keep_effects = "ran_pars")
## Warning: Removed 28 rows containing missing values or values outside the scale range
## (`geom_segment()`).
## Warning: Removed 21 rows containing missing values or values outside the scale range
## (`geom_segment()`).

singular fits

  • best-fit value of a variance can be 0
  • e.g. the expected variance among group means in a one-way ANOVA is \(\sigma^2_b + \sigma^2_w/n\); equating it to the sample variance of the group means gives a method-of-moments estimate of \(\sigma^2_b\) that can be < 0 … [3], [4]
  • lme4 handles this a little more gracefully than glmmTMB; a simulation of boundary fits:
simfun <- function() {
    ## simulate a small one-way layout with weak among-group variation (theta = 0.1)
    dd <- expand.grid(f = factor(1:3), rep = 1:10)
    dd$y <- suppressMessages(simulate(~ 1 + (1 | f),
                     newdata = dd,
                     newparams = list(beta = 0, theta = 0.1, sigma = 1),
                     family = gaussian))[[1]]
    ## refit, suppressing derivative calculations and singular-fit messages
    m <- lmer(y ~ 1 + (1 | f), data = dd,
              control = lmerControl(calc.derivs = FALSE, check.conv.singular = "ignore"))
    getME(m, "theta")  ## scaled among-group SD; exactly 0 for a singular fit
}
simfun()
## f.(Intercept) 
##             0
set.seed(101)
thvec <- replicate(500, simfun())
hist(thvec, breaks = 50)

library(lme4)
data(Orthodont,package="nlme")
Orthodont$nsex <- as.numeric(Orthodont$Sex=="Male")
Orthodont$nsexage <- with(Orthodont, nsex*age)
m1 <- lmer(distance ~ age + (age|Subject) + (0+nsex|Subject),
           start = c(1.8,-0.1,0.14, 0.5),
           ##(0 + nsexage|Subject),
           data=Orthodont)
## boundary (singular) fit: see help('isSingular')
VarCorr(m1)
##  Groups    Name        Std.Dev.   Corr  
##  Subject   (Intercept) 2.3270e+00       
##            age         2.2643e-01 -0.609
##  Subject.1 nsex        6.9976e-05       
##  Residual              1.3100e+00
rePCA(m1)
## $Subject
## Standard deviations (1, .., p=3):
## [1] 1.7794421974 0.1368062328 0.0000534153
## 
## Rotation (n x k) = (3 x 3):
##             [,1]       [,2] [,3]
## [1,] -0.99822608 0.05953739    0
## [2,]  0.05953739 0.99822608    0
## [3,]  0.00000000 0.00000000    1
## 
## attr(,"class")
## [1] "prcomplist"
library(nlme)
## 
## Attaching package: 'nlme'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Orthodont
## The following object is masked from 'package:lme4':
## 
##     lmList
## The following object is masked from 'package:dplyr':
## 
##     collapse
## alternative: model sex-specific residual variance instead of a (0 + nsex | Subject) term
m2 <- lme(distance ~ age, random = ~ age | Subject,
    data = Orthodont,
    weights = varIdent(form = ~ 1 | Sex))
library(glmmTMB)
## the same idea in glmmTMB, via a dispersion model
m3 <- glmmTMB(distance ~ age + (age | Subject),
              dispformula = ~ Sex,
              REML = TRUE,
              data = Orthodont)
exp(fixef(m3)$disp)
## (Intercept)   SexFemale 
##    1.645261    0.404098
logLik(m3)
## 'log Lik.' -210.8233 (df=7)

model simplification

  • drop terms, e.g. the correlation between slopes and intercepts [5]
  • (should probably center continuous covariates first)
  • Matuschek et al. [6] say you should be dropping terms anyway
  • drop correlations (e.g. (f || g) in lme4 or diag(f | g) in glmmTMB); sketch below
  • reduced-rank models (glmmTMB fits factor-analytic models, e.g. rr(f1*f2|g, d = 5); or the gllvm package)
  • buildmer package (??)
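For example, dropping the intercept–slope correlation in the Orthodont model via lme4’s double-bar syntax (a sketch; (age || Subject) expands to (1 | Subject) + (0 + age | Subject)):

m1_nc <- lmer(distance ~ age + (age || Subject), data = Orthodont)
VarCorr(m1_nc)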

model simplification [7]

include_graphics("pix/uriarte.png")

regularization

  • blme: assign priors, do maximum a posteriori estimation
    • default for RE terms: SD \(\sim\) Gamma(shape = 2.5, rate = 0) [8]
  • priors in glmmTMB (new)
  • priors in Bayesian pkgs (of course) MCMCglmm, brms, rstanarm
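A sketch: refit the singular Orthodont model with blme’s blmer(), whose default Gamma prior on random-effect SDs pulls the estimates off the boundary:

m1_bl <- blmer(distance ~ age + (age | Subject) + (0 + nsex | Subject),
               data = Orthodont)
VarCorr(m1_bl)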

complete separation (GLM(M)-specific)

  • Bernoulli data where some predictor combination perfectly separates 0s from 1s; count data with zero/positive separation
  • solutions: Firth (penalized) regression, or adding priors
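A toy illustration of complete separation in a plain GLM (the slope estimate diverges and glm() warns that fitted probabilities of 0 or 1 occurred):

xx <- 1:10
yy <- as.numeric(xx > 5)  ## x > 5 perfectly separates the 0s from the 1s
coef(glm(yy ~ xx, family = binomial))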

example (from glmmTMB vignette)

load("data/culcita.RData")
cdatx <- culcita_dat[-20,]
form <- predation~ttt+(1|block)
cmod_glmer <- glmer(form, data=cdatx, family=binomial)
cmod_glmmTMB <- glmmTMB(form, data=cdatx, family=binomial)
cmod_bglmer <- bglmer(form, data=cdatx, family=binomial,
                    fixef.prior = normal(cov = diag(9,4)))
cprior <- data.frame(prior = rep("normal(0,3)", 2),
                     class = rep("beta", 2),
                     coef = c("(Intercept)", ""))
cmod_glmmTMB_p <- update(cmod_glmmTMB, priors = cprior)
tt <- (tibble::lst(cmod_glmer, cmod_glmmTMB, cmod_bglmer, cmod_glmmTMB_p)
    |> purrr::map_dfr(~ tidy(., effects = "fixed"), .id = "model")
)
ggplot(tt, aes(x = estimate, y = term, colour = model)) +
       geom_pointrange(aes(xmin = estimate - 2*std.error,
                           xmax = estimate + 2*std.error),
                       position = position_dodge(width = 0.5))

1. Bolker, B. M. et al. Strategies for fitting nonlinear ecological models in R, AD Model Builder, and BUGS. Methods in Ecology and Evolution 4, 501–512 (2013).
2.
3. Pryseley, A., Tchonlafi, C., Verbeke, G. & Molenberghs, G. Estimating negative variance components from Gaussian and non-Gaussian data: A mixed models approach. Computational Statistics & Data Analysis 55, 1071–1085 (2011).
4.
5. Barr, D. J., Levy, R., Scheepers, C. & Tily, H. J. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68, 255–278 (2013).
6. Matuschek, H., Kliegl, R., Vasishth, S., Baayen, H. & Bates, D. Balancing type I error and power in linear mixed models. Journal of Memory and Language 94, 305–315 (2017).
7. Uriarte, M. & Yackulic, C. B. Preaching to the unconverted. Ecological Applications 19, 592–596 (2009).
8. Chung, Y., Rabe-Hesketh, S., Dorie, V., Gelman, A. & Liu, J. A nondegenerate penalized likelihood estimator for variance parameters in multilevel models. Psychometrika 1–25 (2013). doi:10.1007/s11336-013-9328-2.