choosing random effects terms in mixed models

Author

Ben Bolker

Published

November 7, 2024

Source file is here. Also see Emi Tanaka’s approach to the example below, mostly with AS-REML.

An R-centric discussion of what to do to manage the complexity of mixed models, specifically their random effect components.

General framework

maximal model

The maximal model (Barr et al. 2013) is the model that includes all random effect terms that would be identifiable, given the experimental (observational) experimental design, given arbitrarily large samples (both of groups and of observations within groups), what I would call satisfying ``strong identifiability’’.

all categorical variables in the data with exchangeable levels are treated as random effects (Yarkoni 2022)
all terms in the fixed-effect model are included as varying terms where possible, i.e. within every grouping variable where the values of the fixed-effect term vary within (and not just between) groups
for models where the conditional distribution has an estimated scale parameter (e.g. Gaussian, Gamma) and a varying term with \(n\) parameters (e.g. \(n=2\) for a random-slopes model), there must be at least \(n+1\) observations per group in order for the scale parameter (residual variance in the case of a Gaussian) to be identifiable, e.g. see “starlings” example here. For models with fixed scale/dispersion (e.g. Poisson, Gamma) the limitation is \(n\) observations per group (unless an observation-level random effect is added, in which case we again need \(n+1\)). For models with estimated dispersion rather than scale (e.g. negative binomial), the model is theoretically identifiable with only \(n\) observations per group, but it probably won’t work.

model constraint strategies

When we constrain the covariance matrix (i.e. any model for specific than “general positive semi-definite”), the model results and definition depend on the contrasts used. For factor terms, it might make the most sense to specify sum-to-zero contrasts unless there is a clear reference/control level against which all other levels should be compared; for continuous terms, the predictors should probably be zero centered (or at least zeroed at some sensible reference level; see the end of section 2.2 in Bates, Mächler, et al. (2015)).

compound-symmetric models

For factor models with many levels, compound-symmetric models are a good simplification approach. This restricts the correlations among levels to be identical for all pairs of levels (i.e., a correlation matrix with all off-diagonal elements equal to the same constant \(\rho\)).

For example, working with the built-in cbpp data set where period is a four-level factor, the covariance matrices implied by (period|herd) vs. (1|period/herd) are:

\[ (\textrm{intercept}, \textrm{slope}) = \textrm{MVN}\left(\boldsymbol 0, \left[ \begin{array}{cccc} \sigma^2_{\{h|1\}} & . & . & . \\ \sigma_{\{h|1\},\{h|p_{21}\}} & \sigma^2_{\{h|p_{21}\}} & . & . \\ \sigma_{\{h|1\}, \{h|p_{31}\}} & \sigma_{\{h|p_{21}\},\{h|p_{31}\}} & \sigma^2_{\{h|p_{31}\}} & . \\ \sigma_{\{h|1\} ,\{h|p_{41}\}} & \sigma_{\{h|p_{21}\},\{h|p_{41}\}} & \sigma_{\{h|p_{31}\},\{h|p_{41}\}} & \sigma^2_{\{h|p_{41}\}} \end{array} \right] \right) \] (=\((n(n+1))/2 = (4\times 5)/2 = 10\) parameters) vs. \[ \left[ \begin{array}{cccc} \sigma^2 & . & . & . \\ \rho \sigma^2 & \sigma^2 & . & . \\ \rho \sigma^2 & \rho \sigma^2 & \sigma^2 & . \\ \rho \sigma^2 & \rho \sigma^2 & \rho \sigma^2 & \sigma^2 \\ \end{array} \right] \] where \(\sigma^2 = \sigma^2_{\{b|1\}}+\sigma^2_{\{herd:period|1\}}\), \(\rho = \sigma^2_{\{b|1\}}/\sigma^2\) (=2 parameters; \(\rho\) must be >0)

The shortcut of using a nested model works in most mixed model frameworks (almost any platform that allows for multiple random effects will allow them to be nested), but is restricted to positive compound-symmetric models (i.e. \(\rho>0\)). Some packages allow general compound symmetric matrices (i.e. \(-1 < \rho < 1\)), e.g. using pdCompSymm in nlme, cs() in glmmTMB, cosy in brms (apparently MCMCglmm doesn’t have a CS struture?) …
We also need to distinguish between heterogeneous compound symmetry (i.e., variances are the same for every level; \(n+1\) parameters for an \(n \times n\) covariance matrix) and homogeneous compound symmetry (all variances are the same, 2 parameters regardless of the size of the cov matrix). The nesting trick described above only handles homogeneous CS. cs() in glmmTMB does heterogeneous CS, homogeneous CS could be handled by using the map argument to fix all of the standard deviations to be the same (or a homcs() structure could be added if there’s enough demand for it).

independence models

This is probably the best known approach to model simplification (e.g. as recommended by Barr et al. (2013)). It’s actually a special case of CS, with \(\rho=0\). Also called “diagonal” for obvious (?) reasons.

in lme4 (or glmmTMB or brms), use || in place of |; glmmTMB allows diag() for heterogenous diagonal/independence models (and homdiag() for homogeneous); MCMCglmm uses idv() (homogeneous) and idh() (heterogeneous).
|| (still) doesn’t work in lme4 for factor-valued terms; use afex::mixed or glmmTMB, or expand the model in dummies ((0 + dummy(f, level1)|g) + (0 + dummy(f, level2)|g) + ...; kind of a pain for large numbers of factors, although you could construct it via string processing if you wanted)
see general comments above about the dependence of the constrained model on the zero point of the numeric terms

dropping terms

Especially for vector-valued random effects consisting of variation in multiple numeric predictors, it may make sense to drop the random effect term for terms with very small variance, or terms that are of less interest. For example, you might reduce the random-effect term (1 + fire + NPP | biome) to (1 + fire | biome) by dropping the variation in NPP across biomes.

reduced-rank models

Factor-analytic or reduced-rank models are a more flexible approach to simplifying complex covariance functions, by doing something analogous to computing a principal components analysis on the covariance matrix and keeping a subset of the terms. In glmmTMB you can do this with the rr() covariance structure; see the relevant section of the covariance structure vignette, or this draft by McGillycuddy et al., in press in J. Stat. Software.

parameter constraint strategies

Instead of changing the covariance model to reduce its dimensionality, you can add a Bayesian prior (or a regularization term, if you’re not Bayesian) to make the model better behaved and/or keep it from being singular.

If you want to apply a minimally informative prior that will prevent singularity, Chung et al. (2013) show that an improper Gamma(shape=2, rate=0) is the weakest Gamma prior for the standard deviation that will prevent a posterior mode at zero. (The mode will be “approximately one standard error away from zero when the maximum likelihood estimate is at zero”.) In practice this might be implemented using a very small rate parameter or very large shape parameter. A Wishart prior with \(\nu = 1 + d\) and infinite scale might (?) have similar properties for vector-valued random effects/covariance matrices larger than 1×1. The blme package implements this approach (although with a default shape of 2.5 rather than 2); it can also be implemented in glmmTMB (see the priors vignette), and of course in any of the full-Bayesian packages (brms, rstanarm, MCMCglmm, etc.)
If you’re a real Bayesian or are otherwise comfortable with stronger priors, you can choose from a variety of other priors (e.g. half-Normal or half-\(t\) or Gamma or log-Normal priors for standard deviations, Wishart or inverse-Wishart distributions for full covariance matrices, LKJ priors for correlations) to help out your model. (LKJ priors are a nice generalization of the diagonal covariance structures above, as an LKJ prior with \(\eta>1\) pushes the correlation matrix to be closer to diagonal without constraining it to be exactly diagonal …)
a real Bayesian would also say you should not choose priors just to make your model behave better (e.g. eliminate divergences in a Hamiltonian Monte Carlo run). Instead, you should use prior predictive simulations to tune the priors to limit the random-effects structures to those that produce reasonable output (and then cross your fingers that this also resolves your computational problems).

multiple RE terms

Given all these options, any combination of which might apply to different RE terms in a model with more than one random effect, how should one choose among them?

if a model seemed to be sampling or converging to an optimum reasonably well but the fit was singular (i.e. on the boundary of the space of allowed covariance estimates), and the results were sufficiently robust that you could get estimates and standard errors/CIs for all of the values of interest), you might choose to proceed with the singular model. Singular fits are theoretically valid; they just make a lot of downstream procedures, such as those based on calculating curvature of the log-likelihood surface, harder. It might also be reasonable to suppose that numerical procedures could be less reliable in this case.
Barr et al. (2013) suggest “keep[ing] it maximal”, i.e. reducing the model only where necessary to deal convergence issues (they don’t actually mention singular fits, nor say much about the issues with convergence checks in lme4: > In cases of nonconvergence, simplification of the random effects structure proceeded as follows. For between-items designs, the by-subjects random slope was dropped. For within-items designs, statistics from the partially converged model were inspected, and the slope associated with smaller variance was dropped It doesn’t seem that they considered models with multiple random effects terms (where one wouldn’t necessarily know which term to try simplifying first …)
Matuschek et al. (2017) prefer a model selection approach; they lay out a stepwise approach based on likelihood ratio tests with a relaxed threshold (\(\alpha = 0.2\)) for retaining terms. Based on a model with two random-slopes terms, they suggest a sequence¹

full model
independent slopes and intercepts for both terms [\(\rho = 0\)]
drop the random slope for one term [item-specific random slopes]
drop the random slope for the other term [subject-specific random slopes]
drop both random slopes They also consider AIC-based selection.

Moritz, Batllori, and Bolker (2023) used a more complex model than either of the references above: the among-group variation of an intercept and four numeric predictors (i.e., a 5×5 covariance matrix), applied at three different grouping levels. They consider the possibilities of (1) a full (unconstrained) covariance matrix, (2) a diagonal/independent covariance, or (3) an intercept-only random effect, for each grouping variable, i.e. a total of \(3^3 = 81\) possible models. They fitted all 81 models and chose the model with the best AIC among only the models with non-singular fits
Singmann and Kellen (2019) doesn’t present any methods beyond those suggested above, but the section on “Specifying the Random Effects Structure” has a clear description
several R packages have implemented model selection machinery for the random effects (most of the options, e.g. MuMIn, glmulti, glmmLasso, do model selection only on the fixed effects component)
- The buildmer package is probably the fanciest: it
[f]inds the largest possible regression model that will still converge for various types of regression analyses … and then optionally performs stepwise elimination similar to the forward and backward effect-selection methods in SAS, based on the change in log-likelihood or its significance, Akaike’s Information Criterion, the Bayesian Information Criterion, the explained deviance, or the F-test of the change in R². Its allowed steps appear to include both including/excluding particular terms from the RE term for a particular grouping variable, as well as including or excluding complete terms (i.e. adding or dropping a grouping variable)
- step.lmerModLmerTest() from lmerTest does backward stepwise selection (among random-effects terms, i.e. not considering simplification of individual terms as discussed above) based on likelihood ratio tests
- ffRanefLMER.fnc() from LMERConvenienceFunctions does forward selection (the description of the ran.effects argument that specifies the possible random effects is complicated …)
- the asremlPlus package has a lot of machinery, and some detailed examples, for model building and selection using AS-REML

Example

setup

library(lme4)
library(Matrix)
library(glmmTMB)
## For rlkj_corr_cholesky; LKJ simulation also available in `rethinking` package
## (GitHub only: remotes::install_github("rmcelreath/rethinking"))
library(nimble)
library(blme)
conflicted::conflict_prefer("simulate", "stats")

For this example we’ll think about a forest restoration experiment with four plots; 5 restoration treatments that are applied in each plot (a randomized complete block design); and 6 replicates within each block/treatment.

library(lme4)
set.seed(101)
dd <- expand.grid(plot = factor(1:4),
                  ttt = factor(LETTERS[1:5]),
                  rep = 1:6)
## sample a random 5x5 correlation matrix
cchol <- nimble::rlkj_corr_cholesky(n = 1, eta = 1.2, p = 5) *
    ## scale by random SDs
    rlnorm(5, meanlog = 0, sdlog = 0.2)
theta <- t(cchol)[lower.tri(cchol, diag = TRUE)]
dd$y <- suppressMessages(
    simulate( ~ ttt + (ttt|plot),
             family = gaussian,
             newdata = dd,
             newparams = list(beta = seq(1, by = -0.5, length.out = 5),
                              theta = theta,
                              sigma = 1)))[[1]]

maximal model

We would like to be able to fit ~ ttt + (ttt|plot), but that’s very unlikely to actually work:

m1 <- lmer(y ~ ttt + (ttt|plot), data  = dd)

boundary (singular) fit: see help('isSingular')

isSingular(m1)

[1] TRUE

getME(m1, "theta")[getME(m1, "lower") ==0]

plot.(Intercept)        plot.tttB        plot.tttC        plot.tttD 
       0.9465067        1.6882164        0.6333647        0.0000000 
       plot.tttE 
       0.0000000

Seems pretty clear here that the real rank of the estimate is 3 …

diagnostics

Investigate eigenvectors of the estimated \(\Sigma\); we can see again that the covariance matrix has 2 trivial (\(\approx 0\)) eigenvalues, but not much else; the rotation matrix doesn’t give us any particularly useful clues about how to simplify the model (maybe a heatmap with column (eigenvector) order held fixed and reordering and hierarchical clustering done on the rows would be helpful for identifying clusters? Ideally, the clustering would be done based on shared similarity based on absolute values (i.e., we might want to identify linear combinations/contrasts among terms that would allow us to simplify the model … surely someone has done something like this?) [rePCA() definitely could use some additions to make it more useful, e.g. a plot method … even the original paper (Bates, Kliegl, et al. 2015) doesn’t offer much guidance …]

(r <- rePCA(m1))

$plot
Standard deviations (1, .., p=5):
[1] 2.124898e+00 1.259986e+00 1.015020e+00 1.794922e-08 0.000000e+00

Rotation (n x k) = (5 x 5):
           [,1]        [,2]       [,3]       [,4]          [,5]
[1,] -0.2084363 -0.35887528  0.6933225 -0.2152625  5.483875e-01
[2,]  0.8630464  0.30871339  0.2515409  0.1513137  2.714376e-01
[3,]  0.1172202 -0.58554419  0.1806518  0.7293481 -2.807378e-01
[4,]  0.1177251  0.06948193  0.5379321 -0.3810027 -7.394451e-01
[5,]  0.4290752 -0.65438127 -0.3660944 -0.5036412 -1.942890e-16

attr(,"class")
[1] "prcomplist"

image(Matrix(r$plot$rotation))

constraining the model

m1_homcs_lmer <- lmer(y ~ ttt + (1|plot/ttt), data  = dd)
m1_homcs_glmmTMB <- glmmTMB(y ~ ttt + (1|plot/ttt), data  = dd,
                            REML = TRUE)
m1_hetcs_glmmTMB <- glmmTMB(y ~ ttt + cs(ttt|plot), data  = dd,
                            REML = TRUE)

independence models

m1_diag_glmmTMB <- glmmTMB(y ~ ttt + diag(ttt|plot), data  = dd,
                           REML = TRUE)
## also try afex::mixed?

reduced-rank models

m1_rr1_glmmTMB <- glmmTMB(y ~ ttt + rr(ttt|plot, d=1), data  = dd,
                           REML = TRUE)
m1_diag_rr1_glmmTMB <- glmmTMB(y ~ ttt +
                                   diag(ttt|plot) + 
                                   rr(ttt|plot, d=1), data  = dd,
                               REML = TRUE)
m1_diag_rr2_glmmTMB <- glmmTMB(y ~ ttt +
                                   rr(ttt|plot, d=2), data  = dd,
                           REML = TRUE)

parameter constraints/regularization

Constraining just the standard deviations to be positive doesn’t help us: the 5×5 correlation matrix is only rank-4 … a singularity-avoiding prior for the correlation matrix isn’t available yet.

gprior <- data.frame(prior = "gamma(1e8, 2.5)",
                     class = "ranef",
                     coef = "")
m1_us_prior_glmmTMB <- glmmTMB(y ~ ttt +
                                   (ttt|plot),
                               data  = dd,
                               prior = gprior,
                               REML = TRUE)

Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')

Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; false convergence (8). See vignette('troubleshooting'),
help('diagnose')

cormat <- cov2cor(VarCorr(m1_us_prior_glmmTMB)$cond[[1]])
print(eigen(cormat)$value, digits = 3)

[1] 3.81e+00 8.32e-01 2.62e-01 9.17e-02 4.93e-11

AIC comparison

The more complex models really aren’t doing well in terms of AIC in this case …

all_fits <- ls(pattern = "^m1_")
all_fit_list <- mget(all_fits)
bbmle::AICtab(all_fit_list)

                    dAIC df
m1_homcs_glmmTMB     0.0 8 
m1_homcs_lmer        0.0 8 
m1_diag_glmmTMB      1.9 11
m1_hetcs_glmmTMB     3.6 12
m1_diag_rr1_glmmTMB  9.0 16
m1_diag_rr2_glmmTMB 14.0 15
m1_rr1_glmmTMB      29.4 11
m1_us_prior_glmmTMB   NA 21

to do

provide examples of all (!?) of these approaches. In addition to the simulated example I started with (a single RE with a multi-level factor), might need to include examples with (1) multiple continuous predictors and (2) multiple RE terms in order to illustrate everything …
fancy ideas; RJMCMC for Bayesian models? Hierarchical clustering to lump/reduce factor levels? Priors on distribution of RR variances (e.g. geometric)?

references

Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78. https://doi.org/10.1016/j.jml.2012.11.001.

Bates, Douglas, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015. “Parsimonious Mixed Models.” arXiv:1506.04967 [Stat], June. http://arxiv.org/abs/1506.04967.

Bates, Douglas, Martin Mächler, Benjamin M. Bolker, and Steven C. Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika, 1–25. https://doi.org/10.1007/s11336-013-9328-2.

Matuschek, Hannes, Reinhold Kliegl, Shravan Vasishth, Harald Baayen, and Douglas Bates. 2017. “Balancing Type I Error and Power in Linear Mixed Models.” Journal of Memory and Language 94: 305–15. https://doi.org/10.1016/j.jml.2017.01.001.

Moritz, Max A., Enric Batllori, and Benjamin M. Bolker. 2023. “The Role of Fire in Terrestrial Vertebrate Richness Patterns.” Ecology Letters 26 (4): 563–74. https://doi.org/10.1111/ele.14177.

Singmann, Henrik, and David Kellen. 2019. “An Introduction to Mixed Models for Experimental Psychology.” In New Methods in Cognitive Psychology, edited by Daniel Spieler and Eric Schumacher, 1st ed., 4–31. Routledge. https://doi.org/10.4324/9780429318405-2.

Yarkoni, Tal. 2022. “The Generalizability Crisis.” Behavioral and Brain Sciences 45 (January): e1. https://doi.org/10.1017/S0140525X20001685.

Footnotes

I don’t understand how we choose between models 3 and 4 in this list; they are both nested in model 2 and contain model 5, but neither is nested in the other …↩︎