Introduction

This is an informal FAQ list for the r-sig-mixed-models mailing list.

The most commonly used functions for mixed modeling in R are lme4::lmer()/glmer() and nlme::lme(); other packages discussed in this document include glmmTMB, glmmADMB, MCMCglmm, MASS::glmmPQL, and brms.

Other sources of help

  • the mailing list is r-sig-mixed-models@r-project.org
    • sign up here
    • archives here
    • or Google search with the tag site:https://stat.ethz.ch/pipermail/r-sig-mixed-models/
  • The source code of this document is available on GitHub; the rendered (HTML) version lives on GitHub pages.
  • Searching on StackOverflow with the [r] [mixed-models] tags, or on CrossValidated with the [mixed-model] tag may be helpful (these sites also have an [lme4] tag).

DISCLAIMERS:

  • (G)LMMs are hard - harder than you may think based on what you may have learned in your second statistics class, which probably focused on picking the appropriate sums of squares terms and degrees of freedom for the numerator and denominator of an \(F\) test. ‘Modern’ mixed model approaches, although more powerful (they can handle more complex designs, lack of balance, crossed random factors, some kinds of non-Normally distributed responses, etc.), also require a new set of conceptual tools. In order to use these tools you should have at least a general acquaintance with classical mixed-model experimental designs but you should also, probably, read something about modern mixed model approaches. Littell et al. (2006) and Pinheiro and Bates (2000) are two places to start, although Pinheiro and Bates is probably more useful if you want to use R. Other useful references include Gelman and Hill (2006) (focused on Bayesian methods) and Zuur et al. (2009). If you are going to use generalized linear mixed models, you should understand generalized linear models (Dobson and Barnett (2008), Faraway (2006), and McCullagh and Nelder (1989) are standard references; the last is the canonical reference, but also the most challenging).
  • All of the issues that arise with regular linear or generalized-linear modeling (e.g.: inadequacy of p-values alone for thorough statistical analysis; need to understand how models are parameterized; need to understand the principle of marginality and how interactions can be treated; dangers of overfitting, which are not mitigated by stepwise procedures; the non-existence of free lunches) also apply, and can apply more severely, to mixed models.
  • When SAS (or Stata, or Genstat/AS-REML or …) and R differ in their answers, R may not be wrong. Both SAS and R may be ‘right’ but proceeding in a different way/answering different questions/using a different philosophical approach (or both may be wrong …)
  • The advice in this FAQ comes with absolutely no warranty of any sort.

Model definition

Model specification

The following formula extensions for specifying random-effects structures in R are used by

  • lme4
  • nlme (nested effects only, although crossed effects can be specified with more work)
  • glmmADMB and glmmTMB

MCMCglmm uses a different specification, inherited from AS-REML.

(Modified from Robin Jeffries, UCLA:)

  • (1|group): random group intercept
  • (x|group) = (1+x|group): random slope of x within group with correlated intercept
  • (0+x|group) = (-1+x|group): random slope of x within group; no variation in intercept
  • (1|group) + (0+x|group): uncorrelated random intercept and random slope within group
  • (1|site/block) = (1|site) + (1|site:block): intercept varying among sites and among blocks within sites (nested random effects)
  • site + (1|site:block): fixed effect of sites plus random variation in intercept among blocks within sites
  • (x|site/block) = (x|site) + (x|site:block) = (1+x|site) + (1+x|site:block): slope and intercept varying among sites and among blocks within sites
  • (x1|site) + (x2|block): two different effects, varying at different levels
  • x*site + (x|site:block): fixed effect variation of slope and intercept varying among sites and random variation of slope and intercept among blocks within sites
  • (1|group1) + (1|group2): intercept varying among crossed random effects (e.g. site, year)
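
For example (a minimal sketch; y, x, site, block, and the data frame dd are hypothetical names), the nested specification from the table above can be written in either of two equivalent ways:

library(lme4)
fit_a <- lmer(y ~ x + (1|site/block), data = dd)
## expands to the same model as
fit_b <- lmer(y ~ x + (1|site) + (1|site:block), data = dd)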

Another possibly useful treatment

Should I treat factor xxx as fixed or random?

This is in general a far more difficult question than it seems on the surface. There are many competing philosophies and definitions. For example, from Gelman (2005):

Before discussing the technical issues, we briefly review what is meant by fixed and random effects. It turns out that different—in fact, incompatible—definitions are used in different contexts. [See also Kreft and de Leeuw (1998), Section 1.3.3, for a discussion of the multiplicity of definitions of fixed and random effects and coefficients, and Robinson (1998) for a historical overview.] Here we outline five definitions that we have seen:

  1. Fixed effects are constant across individuals, and random effects vary. For example, in a growth study, a model with random intercepts \(\alpha_i\) and fixed slope \(\beta\) corresponds to parallel lines for different individuals \(i\), or the model \(y_{it} = \alpha_i + \beta t\). Kreft and de Leeuw [(1998), page 12] thus distinguish between fixed and random coefficients.
  2. Effects are fixed if they are interesting in themselves or random if there is interest in the underlying population. Searle, Casella and McCulloch [(1992), Section 1.4] explore this distinction in depth.
  3. “When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small (i.e., negligible) part of the population the corresponding variable is random” [Green and Tukey (1960)].
  4. “If an effect is assumed to be a realized value of a random variable, it is called a random effect” [LaMotte (1983)].
  5. Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage [“linear unbiased prediction” in the terminology of Robinson (1991)]. This definition is standard in the multilevel modeling literature [see, e.g., Snijders and Bosker (1999), Section 4.2] and in econometrics.

Another useful comment (via Kevin Wright) reinforcing the idea that “random vs. fixed” is not a simple, cut-and-dried decision: from Schabenberger and Pierce (2001), p. 627:

Before proceeding further with random field linear models we need to remind the reader of the adage that one modeler’s random effect is another modeler’s fixed effect.

Clark and Linzer (2015) address this question from a mostly econometric perspective, focusing mostly on practical variance/bias/RMSE criteria.

One point of particular relevance to ‘modern’ mixed model estimation (rather than ‘classical’ method-of-moments estimation) is that, for practical purposes, there must be a reasonable number of random-effects levels (e.g. blocks) – more than 5 or 6 at a minimum. This is not surprising if you consider that random effects estimation is trying to estimate an among-block variance. For example, from Crawley (2002) p. 670:

Are there enough levels of the factor in the data on which to base an estimate of the variance of the population of effects? No, means [you should probably treat the variable as] fixed effects.

Some researchers (who treat fixed vs random as a philosophical rather than a pragmatic decision) object to this approach.

Treating factors with small numbers of levels as random will in the best case lead to very small and/or imprecise estimates of random effects; in the worst case it will lead to various numerical difficulties such as lack of convergence, zero variance estimates, etc. (A small simulation exercise shows that at least the estimates of the standard deviation are downwardly biased in this case; it’s not clear whether/how this bias would affect the point estimates of fixed effects or their estimated confidence intervals.) In the classical method-of-moments approach these problems may not arise (because the sums of squares are always well defined as long as there are at least two units), but the underlying problems of lack of power are there nevertheless.

Also see this thread on the r-sig-mixed-models mailing list.

Nested or crossed?

  • Relatively few mixed effect modeling packages can handle crossed random effects, i.e. those where one level of a random effect can appear in conjunction with more than one level of another effect. (This definition is confusing, and I would happily accept a better one.) A classic example is crossed temporal and spatial effects. If there is random variation among temporal blocks (e.g. years) and random variation among spatial blocks (e.g. sites), and if there is a consistent year effect across sites and vice versa, then the random effects should be treated as crossed.
  • lme4 does handle crossed random effects efficiently. If you need to deal with crossed REs in conjunction with some of the features that nlme offers (e.g. heteroscedasticity of residuals via weights/varStruct, or correlation of residuals via correlation/corStruct), see p. 163ff of Pinheiro and Bates (2000) (section 4.2.2: Google books link)
  • I rarely find it useful to think of fixed effects as “nested” (although others disagree); if for example treatments A and B are only measured in block 1, and treatments C and D are only measured in block 2, one still assumes (because they are fixed effects) that each treatment would have the same effect if applied in the other block. (One might like to estimate treatment-by-block interactions, but in this case the experimental design doesn’t allow it; one would have to have multiple treatments measured within each block, although not necessarily all treatments in every block.) One would code this analysis as response~treatment+(1|block) in lme4. Also, in the case of fixed effects, crossed and nested specifications change the parameterization of the model, but not anything else (e.g. the number of parameters estimated, log-likelihood, model predictions are all identical). That is, in R’s model.matrix function (which implements a version of Wilkinson-Rogers notation) a*b and a/b (which expand to 1+a+b+a:b and 1+a+a:b respectively) give model matrices with the same number of columns.
  • Whether you explicitly specify a random effect as nested or not depends (in part) on the way the levels of the random effects are coded. If the ‘lower-level’ random effect is coded with unique levels, then the two syntaxes (1|a/b) (or (1|a)+(1|a:b)) and (1|a)+(1|b) are equivalent. If the lower-level random effect has the same labels within each larger group (e.g. blocks 1, 2, 3, 4 within sites A, B, and C) then the explicit nesting (1|a/b) is required. It seems to be considered best practice to code the nested level uniquely (e.g. A1, A2, …, B1, B2, …) so that confusion between nested and crossed effects is less likely.
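
For example (a minimal sketch; site, block, and dd are hypothetical names), unique labels for a nested factor can be constructed with interaction():

dd$block_u <- interaction(dd$site, dd$block, drop = TRUE)
## (1|site) + (1|block_u) is now equivalent to (1|site/block)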

Model extensions (overdispersion, zero-inflation, …)

Overdispersion

Testing for overdispersion/computing overdispersion factor

  • with the usual caveats, plus a few extras – counting degrees of freedom, etc. – the usual procedure of calculating the sum of squared Pearson residuals and comparing it to the residual degrees of freedom should give at least a crude idea of overdispersion. The following attempt counts each variance or covariance parameter as one model degree of freedom and presents the sum of squared Pearson residuals, the ratio of (SSQ residuals/rdf), the residual df, and the \(p\)-value based on the (approximately!!) appropriate \(\chi^2\) distribution. Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter. Even in the GLM case, the expected deviance per point equaling 1 is only true as the distribution of individual deviates approaches normality, i.e. the usual \(\lambda>5\) rules of thumb for Poisson values and \(\textrm{min}(Np, N(1-p)) > 5\) for binomial values (e.g. see Venables and Ripley (2002), p. 209). (And that’s without the extra complexities due to GLMM, i.e. the “effective” residual df should be large enough to make the sums of squares converge on a \(\chi^2\) distribution …)

The following function works for (at least) lme4 and glmmADMB

overdisp_fun <- function(model) {
  ## number of variance parameters in 
  ##   an n-by-n variance-covariance matrix
  vpars <- function(m) {
    nrow(m)*(nrow(m)+1)/2
  }
  model.df <- sum(sapply(VarCorr(model),vpars))+length(fixef(model))
  rdf <- nrow(model.frame(model))-model.df
  rp <- residuals(model,type="pearson")
  Pearson.chisq <- sum(rp^2)
  prat <- Pearson.chisq/rdf
  pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
  c(chisq=Pearson.chisq,ratio=prat,rdf=rdf,p=pval)
}

Example:

library(lme4)
set.seed(101)  
d <- data.frame(y=rpois(1000,lambda=3),x=runif(1000),
                f=factor(sample(1:10,size=1000,replace=TRUE)))
m1 <- glmer(y~x+(1|f),data=d,family=poisson)
overdisp_fun(m1)
##        chisq        ratio          rdf            p 
## 1026.7791432    1.0298687  997.0000000    0.2497584
library(glmmADMB)  ## 0.7.7
m2 <- glmmadmb(y~x+(1|f),data=d,family="poisson")
overdisp_fun(m2)
##        chisq        ratio          rdf            p 
## 1026.7585031    1.0298480  997.0000000    0.2499024

The gof function in the aods3 package works for recent versions of the lme4 package – it provides essentially the same set of information.

Fitting models with overdispersion?

  • quasilikelihood estimation: MASS::glmmPQL. Quasi- families were deemed unreliable in lme4, and are no longer available. (Part of the problem was questionable numerical results in some cases; the other problem was that DB felt that he did not have a sufficiently good understanding of the theoretical framework that would explain what the algorithm was actually estimating in this case.) geepack::geeglm may be workable (haven’t tried it)
  • individual-level random effects (MCMCglmm or recent versions of lme4, 0.999375-34 or later) [or WinBUGS or ADMB or …]. If you want a citation for this approach, try Elston et al. (2001), who cite Lawson et al. (1999); apparently there is also an example in section 10.5 of Maindonald and Braun (2010), and (according to an R-sig-mixed-models post) this is also discussed by Rabe-Hesketh and Skrondal (2008). Also see Browne et al. (2005) for an example in the binomial context (i.e. logit-normal-binomial rather than lognormal-Poisson). Agresti’s excellent book (Agresti 2002) also discusses this (section 13.5), referring back to Breslow (1984) and Hinde (1982). [Notes: (a) I haven’t checked all these references myself, (b) I can’t find the reference any more, but I have seen it stated that individual-level random effect estimation is probably dodgy for PQL approaches as used in Elston et al 2001]
  • alternative distributions
  • Poisson-lognormal model (see above, “individual-level random effects”)
  • negative binomial
    • lme4::glmer.nb() should fit a negative binomial, although it is somewhat slow and fragile compared to some of the other methods suggested here.
    • glmmADMB (R-forge) and glmmTMB (Github) will fit two parameterizations of the negative binomial: family=“nbinom” gives the classic parameterization with var/mean=(1+mean/k) (“NB2” in Hardin and Hilbe’s terminology) while family=“nbinom1” gives a parameterization with var/mean=phi*mean (“NB1” to Hardin and Hilbe). The latter might also be called a “quasi-Poisson” parameterization because it matches the mean-variance relationship assumed by quasi-Poisson models.
  • gamlss.mx::gamlssNP
  • WinBUGS/JAGS (via R2WinBUGS/rjags)
  • AD Model Builder (possibly via R2ADMB)
  • gnlmm in the repeated package (off-CRAN)
  • ASREML

Negative binomial models in glmmADMB or glmmTMB and lognormal-Poisson models in glmer (or MCMCglmm) are probably the best quick alternatives. If you need to explore alternatives (different variance-mean relationships, different distributions), then ADMB, TMB, WinBUGS, Stan are the most flexible alternatives.
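
For example, using the simulated data d from the overdispersion example above (a minimal sketch; the nbinom1/nbinom2 family names follow glmmTMB conventions and should be checked against your installed version):

library(glmmTMB)
## NB2: var = mu*(1+mu/k) ("classic"); NB1: var = phi*mu ("quasi-Poisson-like")
m_nb2 <- glmmTMB(y ~ x + (1|f), data = d, family = nbinom2)
m_nb1 <- glmmTMB(y ~ x + (1|f), data = d, family = nbinom1)
## lognormal-Poisson via an observation-level random effect in lme4
d$obs <- factor(seq(nrow(d)))
m_olre <- glmer(y ~ x + (1|f) + (1|obs), data = d, family = poisson)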

Gamma GLMMs

While one (well, OK I) would naively think that GLMMs with Gamma distributions would be just as easy (or hard) as any other sort of GLMMs, it seems that they are in fact harder to implement (or at least not yet thoroughly/stably working in lme4): basic simulated examples of Gamma GLMMs can fail in lme4 even when analogous Poisson, binomial, etc. examples work fine. Possible solutions (see the sketch below):

  • the default inverse link seems particularly problematic; try other links (especially family=Gamma(link="log")) if that is possible/makes sense
  • consider whether a lognormal model (i.e. a regular LMM on logged data) would work/make sense

  • glmmPQL
  • glmmADMB
  • other packages (WinBUGS) etc. ?
  • gamlssNP?
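
A minimal sketch of the suggestions above (y is a strictly positive response; y, x, g, and dd are hypothetical names):

library(lme4)
## Gamma GLMM with a log link (often more stable than the default inverse link)
m_gamma <- glmer(y ~ x + (1|g), data = dd, family = Gamma(link = "log"))
## or a lognormal model, i.e. an ordinary LMM on the log scale
m_lnorm <- lmer(log(y) ~ x + (1|g), data = dd)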

Zero-inflation

Count data

  • MCMCglmm handles zero-truncated, zero-inflated, and zero-altered models, although specifying the models is a little bit tricky: see Sections 5.3 to 5.5 of the CourseNotes vignette
  • glmmADMB handles
    • zero-inflated models (with a single zero-inflation parameter – i.e., the level of zero-inflation is assumed constant across the whole data set)
    • truncated Poisson and negative binomial distributions (which allows two-stage fitting of hurdle models)
  • glmmTMB handles a variety of Z-I and Z-T models (allows covariates, and random effects, in the zero-alteration model)
  • brms does too
  • Gavin Simpson has a detailed writeup showing that mgcv::gam() can do simple mixed models with zero-inflation, and comparing mgcv with glmmTMB results
  • gamlssNP in the gamlss.mx package should handle zero-inflation, and the gamlss.tr package should handle truncated (i.e. hurdle) models – but I haven’t tried them
  • roll-your-own: ADMB/R2admb, WinBUGS/R2WinBUGS, TMB, Stan, …
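
For example, a minimal sketch of zero-inflated Poisson fits in glmmTMB (y, x, g, and dd are hypothetical names):

library(glmmTMB)
## single zero-inflation parameter, constant across the data set
zi1 <- glmmTMB(y ~ x + (1|g), ziformula = ~1, family = poisson, data = dd)
## zero-inflation probability modeled as a function of x
zi2 <- glmmTMB(y ~ x + (1|g), ziformula = ~x, family = poisson, data = dd)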

Continuous data

Continuous data are a special case where the mixture model may make less sense.

  • The simplest solution is to fit a Bernoulli model to the zero/non-zero data, then a continuous model (lognormal? Gamma?) for the non-zero values. If you want to fit a truncated continuous distribution to the non-zero values, that gets harder …
  • The cplm package handles ‘Tweedie compound Poisson linear models’, which in a particular range of parameters allows for skewed continuous responses with a spike at zero
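
A minimal sketch of the two-stage approach described above (y, x, g, and dd are hypothetical names):

library(lme4)
dd$nonzero <- as.numeric(dd$y > 0)
## stage 1: Bernoulli model for zero vs. non-zero
m_zero <- glmer(nonzero ~ x + (1|g), family = binomial, data = dd)
## stage 2: continuous model (here lognormal) for the non-zero values
m_pos <- lmer(log(y) ~ x + (1|g), data = subset(dd, y > 0))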

Tests for zero-inflation

  • you can use a likelihood ratio test between the regular and zero-inflated version of the model, but be aware of boundary issues (search “boundary” elsewhere on this page …) – the null value (no zero inflation) is on the boundary of the feasible space
  • you can use AIC or variations, with the same caveats
  • you can use Vuong’s test, which is often recommended for testing zero-inflation in GLMs, because under some circumstances the various model flavors under consideration (hurdle vs zero-inflated vs “vanilla”) are not nested. Vuong’s test is implemented (and referenced) in the pscl package, and should be feasible to implement for GLMMs, but I don’t know of an implementation. Someone should let me (BMB) know if they find one.
  • two untested but reasonable approaches:
    • use a ‘simulate’ method, if one exists, to construct a simulated distribution of the proportion of zeros in your dataset, and compare it to the observed proportion
    • compute the probability of a zero for each observation; on the basis of Bernoulli trials, compute the expected number of zeros and its confidence interval, and compare with the observed number
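
A minimal sketch of the simulation-based check (assuming a fitted model fit with a working simulate() method and the original response vector y):

obs_zeros <- sum(y == 0)
## number of zeros in each of 1000 data sets simulated from the fitted model
sim_zeros <- sapply(simulate(fit, nsim = 1000), function(s) sum(s == 0))
## where does the observed count fall in the simulated distribution?
mean(sim_zeros >= obs_zeros)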

Spatial and temporal correlation models, heteroscedasticity (“R-side” models)

In nlme these so-called R-side (R for “residual”) structures are accessible via the weights/VarStruct (heteroscedasticity) and correlation/corStruct (spatial or temporal correlation) arguments and data structures. This extension is a bit harder than it might seem. In LMMs it is a natural extension to allow the residual error terms to be components of a single multivariate normal draw; if that MVN distribution is uncorrelated and homoscedastic (i.e. proportional to an identity matrix) we get the classic model, but we can in principle allow it to be correlated and/or heteroscedastic.

It is not too hard to define marginal correlation structures that don’t make sense. One class of reasonably sensible models is to always assume an observation-level random effect (as MCMCglmm does for computational reasons) and to allow that random effect to be MVN on the link scale (so that the full model is lognormal-Poisson, logit-normal binomial, etc., depending on the link function and family).

For example, a relatively simple Poisson model with spatially correlated errors might look like this:

\[ \begin{split} \eta & \sim \textrm{MVN}(a + b x, \Sigma) \\ \Sigma_{ij} & = \sigma^2 \exp(-d_{ij}/s) \\ y_i & \sim \textrm{Poisson}(\lambda=\exp(\eta_i)) \end{split} \]

That is, the marginal distributions of the response values are Poisson-lognormal, but on the link (log) scale the latent Normal variables underlying the response are multivariate normal, with a variance-covariance matrix described by an exponential spatial correlation function with scale parameter \(s\).

How can one achieve this?

  • These types of models are not implemented in lme4, for either LMMs or GLMMs; they are fairly low priority, and it is hard to see how they could be implemented for GLMMs (the equivalent for LMMs is tedious but should be straightforward to implement).
  • For LMMs, you can use the spatial/temporal correlation structures that are built into (n)lme, which include basic geostatistical (space) and ARMA-type (time) models (see the sketch at the end of this list).
library(sos)
findFn("corStruct")

finds additional possibilities in the ramps (extended geostatistical) and ape (phylogenetic) packages.

  • You can use these structures in GLMMs via MASS::glmmPQL (see Dormann et al.)
  • geepack::geeglm
  • geoR, geoRglm (power tools); these are mostly designed for fitting spatial random field GLMMs via MCMC – not sure that they do random effects other than the spatial random effect
  • R-INLA (super-power tool)
  • it is possible to use AD Model Builder to fit spatial GLMMs, as shown in these AD Model Builder examples; this capability is not in the glmmADMB package (and may not be for a while!), but it would be possible to run AD Model Builder via the R2admb package (this requires installing – and learning! – ADMB)
  • geoBUGS, the geostatistical/spatial correlation module for WinBUGS, is another alternative (but again requires going outside of R)
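
For example, a minimal sketch of a spatially correlated LMM in nlme (y, x, site, and the coordinates xcoord and ycoord in a data frame dd are hypothetical names):

library(nlme)
fit_sp <- lme(y ~ x, random = ~ 1 | site,
              correlation = corExp(form = ~ xcoord + ycoord | site),
              data = dd)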

Estimation

What methods are available to fit (estimate) GLMMs?

(adapted from Bolker et al TREE 2009)

  • Penalized quasi-likelihood
    • Advantages: flexible, widely implemented
    • Disadvantages: likelihood inference may be inappropriate; biased for large variance or small means
    • Packages: PROC GLIMMIX (SAS), GLMM (GenStat), glmmPQL (R:MASS), ASREML-R
  • Laplace approximation
    • Advantages: more accurate than PQL
    • Disadvantages: slower and less flexible than PQL
    • Packages: glmer (R:lme4, lme4a), glmm.admb (R:glmmADMB), INLA, glmmTMB, AD Model Builder, HLM
  • Gauss-Hermite quadrature
    • Advantages: more accurate than Laplace
    • Disadvantages: slower than Laplace; limited to 2-3 random effects
    • Packages: PROC NLMIXED (SAS), glmer (R:lme4, lme4a), glmmML (R:glmmML), xtlogit (Stata)
  • Markov chain Monte Carlo
    • Advantages: highly flexible, arbitrary number of random effects; accurate
    • Disadvantages: slow, technically challenging, Bayesian framework
    • Packages: MCMCglmm (R:MCMCglmm), rstanarm (R), brms (R), MCMCpack (R), WinBUGS/OpenBUGS (R interface: BRugs/R2WinBUGS), JAGS (R interface: rjags/R2jags), AD Model Builder (R interface: R2admb), glmm.admb (post hoc MCMC after Laplace fit) (R:glmmADMB)

Troubleshooting

  • double-check the model specification and the data for mistakes
  • center and scale continuous predictor variables (e.g. with scale())
  • try all available optimizers (e.g. several different implementations of BOBYQA and Nelder-Mead, L-BFGS-B from optim, nlminb(), …). While this will of course be slow for large fits, we consider it the gold standard; if all optimizers converge to values that are practically equivalent (it’s up to the user to decide what “practically equivalent” means for their case), then we would consider the model fit to be good enough. For example:
source(system.file("utils", "allFit.R", package="lme4"))
modelfit.all <- allFit(model)
ss <- summary(modelfit.all)

Convergence warnings

Most of the current advice about troubleshooting lme4 convergence problems can be found in the help page ?convergence. That page explains that the convergence tests in the current version of lme4 (1.1-11, February 2016) generate lots of false positives. We are considering raising the gradient warning threshold to 0.01 in future releases of lme4. In addition to the general troubleshooting tips above:

  • double-check the Hessian calculation with the more expensive Richardson extrapolation method (see examples)
  • restart the fit from the apparent optimum, or from a point perturbed slightly away from the optimum (getME(model,c("theta","beta")) should retrieve the parameters in a form suitable to be used as the start parameter)
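
A minimal sketch of restarting from the apparent optimum (model is a fitted lmer object; for glmer fits use getME(model, c("theta","fixef")) so that the start list has the component names that glmer expects):

pars <- getME(model, "theta")
model2 <- update(model, start = pars)
## or perturb the starting values slightly, e.g.
## model2 <- update(model, start = pars * 1.01)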

Singular models: random effect variances estimated as zero, or correlations estimated as +/- 1

It is very common for overfitted mixed models to result in singular fits. Technically, singularity means that some of the \(\boldsymbol \theta\) (variance-covariance Cholesky decomposition) parameters corresponding to diagonal elements of the Cholesky factor are exactly zero, which is the edge of the feasible space, or equivalently that the variance-covariance matrix has some zero eigenvalues (i.e. is positive semidefinite rather than positive definite), or (almost equivalently) that some of the variances are estimated as zero or some of the correlations are estimated as +/-1. This commonly occurs in two scenarios:

  • small numbers of random-effect levels (e.g. <5), as illustrated in these simulations and discussed (in a somewhat different, Bayesian context) by Gelman (2006).
  • complex random-effects models, e.g. models of the form (f|g) where f is a categorical variable with a relatively large number of levels, or models with several different random-slopes terms.

  • When using lme4, singularity is most obviously detectable in the output of summary.merMod() or VarCorr.merMod() when a variance is estimated as 0 (or very small, i.e. orders of magnitude smaller than other variance components) or when a correlation is estimated as exactly \(\pm 1\). However, as pointed out by D. Bates et al. (2015), singularities in larger variance-covariance matrices can be hard to detect: checking for small values among the diagonal elements of the Cholesky factor is a good start.

theta <- getME(model,"theta")
## diagonal elements are identifiable because they are fitted
##  with a lower bound of zero ...
diag.element <- getME(model,"lower")==0
any(theta[diag.element]<1e-5)
  • In MCMCglmm, singular or near-singular models will provoke an error and a requirement to specify a stronger prior.

At present there are a variety of strong opinions about how to resolve such problems. Briefly:

  • Barr et al. (2013) suggest always starting with the maximal model (i.e. the most complex random-effects structure that is theoretically identifiable given the experimental design) and then dropping terms when singularity or non-convergence occurs (please see the paper for detailed recommendations …)
  • Matuschek et al. (2017) and D. Bates et al. (2015) strongly disagree, suggesting that models should be simplified a priori whenever possible; they also provide tools for diagnosing and mitigating singularity.
  • One alternative (suggested by Robert LaBudde) for the small-numbers-of-levels scenario is to “fit the model with the random factor as a fixed effect, get the level coefficients in the sum to zero form, and then compute the standard deviation of the coefficients.” This is appropriate for users who are (a) primarily interested in measuring variation (i.e. the random effects are not just nuisance parameters, and the variability [rather than the estimated values for each level] is of scientific interest), (b) unable or unwilling to use other approaches (e.g. MCMC with half-Cauchy priors in WinBUGS), (c) unable or unwilling to collect more data. For the simplest case (balanced, orthogonal, nested designs with normal errors) these estimates of standard deviations should equal the classical method-of-moments estimates.
  • Bayesian approaches allow the user to specify an informative prior that avoids singularity.
    • The blme package (Chung et al. 2013) provides a wrapper for the lme4 machinery that adds a particular form of weak prior to get an approximate Bayesian maximum a posteriori (MAP) estimate that avoids singularity (see the sketch after this list).
    • The MCMCglmm package allows for priors on the variance-covariance matrix
    • The rstanarm and brms packages provide wrappers for the Stan Hamiltonian MCMC engine that fit GLMMs via lme4 syntax, again allowing a variety of priors to be set.
  • If a variance component is zero, dropping it from the model will have no effect on any of the estimated quantities (although it will affect the AIC, as the variance parameter is counted even though it has no effect). Pasch, Bolker, and Phelps (2013) gives one example where random effects were dropped because the variance components were consistently estimated as zero. Conversely, if one chooses for philosophical grounds to retain these parameters, it won’t change any of the answers.
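
A minimal sketch of the blme approach mentioned above (y, x, g, and dd are hypothetical names; the default covariance prior is a weakly informative Wishart prior that pulls the estimate away from the boundary):

library(blme)
m_blme <- blmer(y ~ x + (x|g), data = dd, cov.prior = wishart)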

Setting residual variances to zero

For some problems it would be convenient to be able to set the residual variance term to zero. This is difficult in lme4, because the model is parameterized internally in such a way that the residual variance is profiled out (i.e., calculated directly from a residual deviance term) and the random-effects variances are scaled by the residual variance.

It is possible to use blmer to fix the residual variance to a very small (albeit not exactly zero) value.

Other problems

  • When using lme4 to fit GLMMs with link functions that do not automatically constrain the response to the allowable range of the distributional family (e.g. binomial models with a log link, where the estimated probability can be >1, or Gamma models with an inverse link, where the estimated mean can be negative), it is not unusual to get the error

    PIRLS step-halvings failed to reduce deviance in pwrssUpdate

    This occurs because lme4 doesn’t do anything to constrain the predicted values, so NaN values pop up, which aren’t handled gracefully. If possible, switch the link function to one that constrains the response (e.g. logit link for binomial or log link for Gamma).

Errors (old)

Most or all of the errors listed here are obsolete or refer to several-years-old versions of lme4. This section is being retained for now for historical purposes, but will eventually be dropped (relying on Github to keep it in the historical record).

  • As pointed out by several users (here, here, and here, for example), the AICs computed for lmer models in summary and anova are different; summary uses the REML specification as specified when fitting the model, while anova always uses REML=FALSE (to safeguard users against incorrectly using restricted MLs to compare models with different fixed effect components). (This behavior is slightly overzealous since users might conceivably be using anova to compare models with different random effects [although this also subject to boundary effects as described elsewhere in this document …])
  • Error: length(f1) == length(f2) is not TRUE, typically accompanied by a warning that numerical expression has xxx elements: only the first used: make sure that your grouping variables are all factors.
  • mer_finalize(ans) : gr cannot be computed at initial par (65) warning, followed by Error in asMethod(object) : matrix is not symmetric [1,2], or Error in mer_finalize(ans) : Downdated X'X is not positive definite, 1. or Error ... rank of X = <m> < ncol(X) = <n>:
  • especially if this happens on the first step (the number at the end of the error message is “1”), you may have incorporated some perfectly collinear terms in your design matrix (see this Stack Exchange question for further information). To confirm that this is the problem, try a glm or lm fit (i.e. dropping the random effects) and see if some of the parameters are set to NA. (Whether linear modeling software can handle rank-deficient cases depends on the details of the computational linear algebra routines used: so far, those used in lme4 have traded the robustness of being able to handle rank-deficiency for speed). One example of a workaround (constructing the design matrix by hand, or in glm, and then dropping the collinear terms) is shown here.
  • Alternatively, you may just have mostly collinear (rather than perfectly collinear) terms. Try centering and scaling your data (usually a good idea anyway, and centering removes the correlation between the intercept and the other predictors). Also check the correlation of the predictors, and if they are strongly correlated try some of the standard tactics – discard predictors, use principal components analysis, …
  • Error in UseMethod("ranef") : no applicable method for 'ranef' applied to an object of class "mer": this is usually caused by package conflicts, especially with the nlme package. Try detach("package:nlme") (the glmmADMB package may also cause such errors).
  • Other warnings about “false convergence” often indicates model mis-specification, or insufficient data for estimation; they do not necessarily indicate an error, but results that should be scrutinized particularly carefully. Replicate the analysis in another package, if possible; make sure that the results are consistent with those of a simplified version of the model. Another approach is to use a Bayesian model with stabilizing priors (see MCMCglmm, or brglm)
  • function 'cholmod_start' not provided by package 'Matrix': this error, and errors like it, are due to the lme4 and Matrix packages being out of sync. Versions that do seem to work together (all source installs on Ubuntu 10.04 with R 2.13.0):
    • CRAN: lme4 0.999375-39 with Matrix 0.999375-50
    • R-forge: lme4 0.999375-40 with Matrix 0.9996875-0
    • SVN: lme4 0.999375-40 (r1322) with Matrix 0.9996875-0 (r2678)

REML for GLMMs

While restricted maximum likelihood (REML) procedures (Wikipedia) are well established for linear mixed models, it is less clear how one should define and compute the equivalent criteria (integrating out the effects of fixed parameters) for GLMMs. Millar (2011) and Berger, Liseo, and Wolpert (1999) are possible starting points in the peer-reviewed literature, and there are mailing-list discussions of these issues here and here.

Note that in the current version of lme4, glmer silently ignores the REML specification (!)

Model diagnostics

Inference and confidence intervals

Testing hypotheses

What are the p-values listed by summary(glmerfit) etc.? Are they reliable?

By default, in keeping with the tradition in analysis of generalized linear models, lme4 and similar packages display the Wald Z-statistics for each parameter in the model summary. These have one big advantage: they’re convenient to compute. However, they are asymptotic approximations, assuming both that (1) the sampling distributions of the parameters are multivariate normal (or equivalently that the log-likelihood surface is quadratic) and that (2) the sampling distribution of the log-likelihood is (proportional to) \(\chi^2\). The second approximation is discussed further under “Degrees of freedom”. The first assumption usually requires an even greater leap of faith, and is known to cause problems in some contexts (for binomial models failures of this assumption are called the Hauck-Donner effect), especially with extreme-valued parameters.

Methods for testing single parameters

From worst to best:

  • Wald \(Z\)-tests
  • For balanced, nested LMMs where degrees of freedom can be computed according to classical rules: Wald \(t\)-tests
  • Likelihood ratio test, either by setting up the model so that the parameter can be isolated/dropped (via anova or drop1), or via computing likelihood profiles
  • Markov chain Monte Carlo (MCMC) or parametric bootstrap confidence intervals

Tests of effects (i.e. testing that several parameters are simultaneously zero)

From worst to best:

  • Wald chi-square tests (e.g. car::Anova)
  • Likelihood ratio test (via anova or drop1)
  • For balanced, nested LMMs where df can be computed: conditional F-tests
  • For LMMs: conditional F-tests with df correction (e.g. Kenward-Roger in pbkrtest package). (Stroup (2014) states on the basis of (unpresented) simulation results that K-R actually works reasonably well for GLMMs. However, at present the code in KRmodcomp only handles LMMs.)
  • MCMC, parametric bootstrap, or nonparametric bootstrap comparisons (nonparametric bootstrapping must be implemented carefully to account for grouping factors)
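
For example, a minimal sketch comparing the Wald and likelihood-ratio approaches (full and reduced are hypothetical fitted (g)lmer models, with reduced nested in full):

anova(full, reduced)          ## likelihood ratio test (LMMs are refitted with ML)
drop1(full, test = "Chisq")   ## LRTs for each droppable fixed-effect term
car::Anova(full)              ## Wald chi-square tests (requires the car package)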

Is the likelihood ratio test reliable for mixed models?

  • It depends.
  • Not for fixed effects in finite-size cases (see Pinheiro and Bates (2000)): may depend on ‘denominator degrees of freedom’ (number of groups) and/or total number of samples - total number of parameters
  • Conditional F-tests are preferred for LMMs, if denominator degrees of freedom are known

Why doesn’t lme4 display denominator degrees of freedom/p values? What other options do I have?

There is an R FAQ entry on this topic, which links to a mailing list post by Doug Bates (there is also a voluminous mailing list thread reproduced on the R wiki). The bottom line is

  • In general it is not clear that the null distribution of the computed ratio of sums of squares is really an F distribution, for any choice of denominator degrees of freedom. While this is true for special cases that correspond to classical experimental designs (nested, split-plot, randomized block, etc.), it is apparently not true for more complex designs (unbalanced, GLMMs, temporal or spatial correlation, etc.).
  • For each simple degrees-of-freedom recipe that has been suggested (trace of the hat matrix, etc.) there seems to be at least one fairly simple counterexample where the recipe fails badly.
  • Other df approximation schemes that have been suggested (Satterthwaite, Kenward-Roger, etc.) would apparently be fairly hard to implement in lme4/nlme, both because of a difference in notational framework and because naive approaches would be computationally difficult in the case of large data sets. (The Kenward-Roger approach has now been implemented in the pbkrtest package (as KRmodcomp): although it was derived for LMMs, Stroup (2014) states on the basis of (unpresented) simulation results that it actually works reasonably well for GLMMs. However, at present the code in KRmodcomp only handles LMMs.)
  • Note that there are several different issues at play in finite-size (small-sample) adjustments, which apply slightly differently to LMMs and GLMMs.
  • When the responses are normally distributed and the design is balanced, nested etc. (i.e. the classical LMM situation), the scaled deviances and differences in deviances are exactly F-distributed and looking at the experimental design (i.e., which treatments vary/are replicated at which levels) tells us what the relevant degrees of freedom are.
  • When the data are not classical (crossed, unbalanced, R-side effects), we might still guess that the deviances etc. are approximately F-distributed but that we don’t know the real degrees of freedom – this is what the Satterthwaite, Kenward-Roger, Fai-Cornelius, etc. approximations are supposed to do.
  • When the responses are not normally distributed (as in GLMs and GLMMs), and when the scale parameter is not estimated (as in standard Poisson- and binomial-response models), then the deviance differences are only asymptotically F- or chi-square-distributed (i.e. not for our real, finite-size samples). In standard GLM practice, we usually ignore this problem; there is some literature on finite-size corrections for GLMs under the rubrics of “Bartlett corrections” and “higher order asymptotics” (see McCullagh and Nelder (1989), Cordeiro, Paula, and Botter (1994), Cordeiro and Ferrari (1998) and the cond package (on CRAN) [which works with GLMs, not GLMMs]), but it’s rarely used. (The bias correction/Firth approach implemented in the brglm package attempts to address the problem of finite-size bias, not finite-size non-chi-squaredness of the deviance differences.)
    • When the scale parameter in a GLM is estimated rather than fixed (as in Gamma or quasi-likelihood models), it is sometimes recommended to use an \(F\) test to account for the uncertainty of the scale parameter (e.g. Venables and Ripley (2002) recommend anova(...,test="F") for quasi-likelihood models)
    • Combining these issues, one has to look pretty hard for information on small-sample or finite-size corrections for GLMMs: Feng, Braun, and McCulloch (2004) and Bell and Grunwald (2010) look like good starting points, but it’s not at all trivial.
  • Because the primary authors of lme4 are not convinced of the utility of the general approach of testing with reference to an approximate null distribution, and because of the overhead of anyone else digging into the code to enable the relevant functionality (as a patch or an add-on), this situation is unlikely to change in the future.

Df alternatives:

  • use MASS::glmmPQL (uses old nlme rules approximately equivalent to SAS ‘inner-outer’ rules) for GLMMs, or (n)lme for LMMs
  • Guess the denominator df from standard rules (for standard designs) and apply them to t or F tests
  • Run the model in lme (if possible) and use the denominator df reported there (which follow a simple ‘inner-outer’ rule which should correspond to the canonical answer for simple/orthogonal designs), applied to t or F tests. For the explicit specification of the rules that lme uses, see page 91 of Pinheiro and Bates (this page was previously available on Google Books, but the link is no longer useful, so here are the relevant paragraphs):

These conditional tests for fixed-effects terms require denominator degrees of freedom. In the case of the conditional \(F\)-tests, the numerator degrees of freedom are also required, being determined by the term itself. The denominator degrees of freedom are determined by the grouping level at which the term is estimated. A term is called inner relative to a factor if its value can change within a given level of the grouping factor. A term is outer to a grouping factor if its value does not change within levels of the grouping factor. A term is said to be estimated at level \(i\), if it is inner to the \(i-1\)st grouping factor and outer to the \(i\)th grouping factor. For example, the term Machine in the fm2Machine model is outer to Machine %in% Worker and inner to Worker, so it is estimated at level 2 (Machine %in% Worker). If a term is inner to all \(Q\) grouping factors in a model, it is estimated at the level of the within-group errors, which we denote as the \(Q+1\)st level.

The intercept, which is the parameter corresponding to the column of all 1’s in the model matrices \(X_i\), is treated differently from all the other parameters, when it is present. As a parameter it is regarded as being estimated at level 0 because it is outer to all the grouping factors. However, its denominator degrees of freedom are calculated as if it were estimated at level \(Q+1\). This is because the intercept is the one parameter that pools information from all the observations at a level even when the corresponding column in \(X_i\) doesn’t change with the level.

Letting \(m_i\) denote the total number of groups in level \(i\) (with the convention that \(m_0=1\) when the fixed effects model includes an intercept and 0 otherwise, and \(m_{Q+1}=N\)) and \(p_i\) denote the sum of the degrees of freedom corresponding to the terms estimated at level \(i\), the \(i\)th level denominator degrees of freedom is defined as

\[ \mathrm{denDF}_i = m_i - (m_{i-1} + p_i), i = 1, \dots, Q \]

This definition coincides with the classical decomposition of degrees of freedom in balanced, multilevel ANOVA designs and gives a reasonable approximation for more general mixed-effects models.

  • use SAS, Genstat (AS-REML), Stata?
  • Assume infinite denominator df (i.e. \(Z\)/\(\chi^2\) test rather than \(t\)/\(F\)) if the number of groups is large. Various rules of thumb for how large is “approximately infinite” have been proposed, including >45, or (in Angrist and Pischke’s Mostly Harmless Econometrics) 42, in homage to Douglas Adams.

Testing significance of random effects

  • the most common way to do this is to use a likelihood ratio test, i.e. fit the full and reduced models (the reduced model is the model with the focal variance(s) set to zero). For example:
library(lme4)
m2 <- lmer(Reaction~Days+(1|Subject)+(0+Days|Subject),sleepstudy,REML=FALSE)
m1 <- update(m2,.~Days+(1|Subject))
m0 <- lm(Reaction~Days,sleepstudy)
anova(m2,m1,m0) ## two sequential tests
## Data: sleepstudy
## Models:
## m0: Reaction ~ Days
## m1: Reaction ~ Days + (1 | Subject)
## m2: Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
##    Df    AIC    BIC  logLik deviance   Chisq Chi Df Pr(>Chisq)    
## m0  3 1906.3 1915.9 -950.15   1900.3                              
## m1  4 1802.1 1814.8 -897.04   1794.1 106.214      1  < 2.2e-16 ***
## m2  5 1762.0 1778.0 -876.00   1752.0  42.075      1  8.782e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With recent versions of lme4, goodness-of-fit (deviance) can be compared between (g)lmer and (g)lm models, although anova() must be called with the mixed ((g)lmer) model listed first. Keep in mind that LRT-based null hypothesis tests are conservative when the null value (such as \(\sigma^2=0\)) is on the boundary of the feasible space; in the simplest case (single random effect variance), the p-value is approximately twice as large as it should be (Pinheiro and Bates 2000).

  • Consider not testing the significance of random effects. If the random effect is part of the experimental design, this procedure may be considered ‘sacrificial pseudoreplication’ (Hurlbert 1984). Using stepwise approaches to eliminate non-significant terms in order to squeeze more significance out of the remaining terms is dangerous in any case.
  • consider using the RLRsim package, which has a fast implementation of simulation-based tests of null hypotheses about zero variances, for simple tests. (However, it only applies to lmer models, and is a bit tricky to use for more complex models.)
library(RLRsim)
## compare m0 and m1
exactLRT(m1,m0)
## 
##  simulated finite sample distribution of LRT. (p-value based on
##  10000 simulated values)
## 
## data:  
## LRT = 106.21, p-value < 2.2e-16
## compare m1 and m2
mA <- update(m2,REML=TRUE)
m0 <- update(mA, . ~ . - (0 + Days|Subject))
m.slope  <- update(mA, . ~ . - (1|Subject))
exactRLRT(m0=m0,m=m.slope,mA=mA)
## 
##  simulated finite sample distribution of RLRT.
##  
##  (p-value based on 10000 simulated values)
## 
## data:  
## RLRT = 42.796, p-value < 2.2e-16
  • Parametric bootstrap: fit the reduced model, then repeatedly simulate from it and compute the differences between the deviance of the reduced and the full model for each simulated data set. Compare this null distribution to the observed deviance difference. This procedure is implemented in the pbkrtest package.
(pb <- pbkrtest::PBmodcomp(m2,m1,seed=101))
## Parametric bootstrap test; time: 43.15 sec; samples: 1000 extremes: 0;
## Requested samples: 1000 Used samples: 575 Extremes: 0
## large : Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
## small : Reaction ~ Days + (1 | Subject)
##          stat df   p.value    
## LRT    42.075  1 8.782e-11 ***
## PBtest 42.075     0.001736 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Standard errors of variance estimates

  • Paraphrasing Doug Bates: the sampling distribution of variance estimates is in general strongly asymmetric: the standard error may be a poor characterization of the uncertainty.
  • lme4 allows for computing likelihood profiles of variances and computing confidence intervals on their basis; these likelihood profile confidence intervals are subject to the usual caveats about the LRT with finite sample sizes.
  • Using an MCMC-based approach (the simplest/most canned is probably to use the MCMCglmm package, although its model specifications are not identical to those of lme4) will provide posterior distributions of the variance parameters: quantiles or credible intervals (HPDinterval() in the coda package) will characterize the uncertainty.
  • (don’t say we didn’t warn you …) [n]lme fits contain an element called apVar which contains the approximate variance-covariance matrix (derived from the Hessian, the matrix of (numerically approximated) second derivatives of the likelihood (REML?) at the maximum (restricted?) likelihood values): you can derive the standard errors from this list element via sqrt(diag(lme.obj$apVar)). For whatever it’s worth, though, these estimates might not match the estimates that SAS gives which are supposedly derived in the same way.
  • it’s not a full solution, but there is some more information here. I have some delta-method computations there that are off by a factor of 2 for the residual standard deviation, as well as some computations based on reparameterizing the deviance function.
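
For example, likelihood-profile confidence intervals for the variance parameters can be computed with confint() (a minimal sketch, using the sleepstudy fit m2 from the previous section; profiling can be slow):

confint(m2, parm = "theta_", oldNames = FALSE)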

P-values: MCMC and parametric bootstrap

Abandoning the approximate \(F\)/\(t\)-statistic route, one ends up with the more general problem of estimating \(p\)-values. There is a wider range of options here, although many of them are computationally intensive …

Markov chain Monte Carlo sampling:

  • pseudo-Bayesian: post-hoc sampling, typically (1) assuming flat priors and (2) starting from the MLE, possibly using the approximate variance-covariance estimate to choose a candidate distribution
    • via mcmcsamp (if available for your problem: i.e. LMMs with simple random effects – not GLMMs or complex random effects)
    • via pvals.fnc in the languageR package, a wrapper for mcmcsamp
    • in AD Model Builder, possibly via the glmmADMB package (use the mcmc=TRUE option) or the R2admb package (write your own model definition in AD Model Builder), or outside of R
    • via the sim function from the arm package (simulates the posterior only for the beta (fixed-effect) coefficients; not yet working with development lme4; would like a better formal description of the algorithm …?)
  • fully Bayesian approaches
    • via the MCMCglmm package
    • glmmBUGS (a WinBUGS wrapper/R interface)
    • JAGS/WinBUGS/OpenBUGS etc., via the rjags/r2jags/R2WinBUGS/BRugs packages

Status of mcmcsamp

mcmcsamp is a function for lme4 that is supposed to sample from the posterior distribution of the parameters, based on flat/improper priors for the parameters [ed: I believe, but am not sure, that these priors are flat on the scale of the theta (Cholesky-factor) parameters]. At present, in the CRAN version (lme4 0.999999-0) and the R-forge “stable” version (lme4.0 0.999999-1), this covers only linear mixed models with uncorrelated random effects.

As has been discussed in a variety of places (e.g. on r-sig-mixed-models, and on the r-forge bug tracker), it is challenging to come up with a sampler that accounts properly for the possibility that the posterior distributions for some of the variance components may be mixtures of point masses at zero and continuous distributions. Naive samplers are likely to get stuck at or near zero. Doug Bates has always been a bit unsure that mcmcsamp is really performing as intended, even in the limited cases it now handles.

Given this uncertainty about how even the basic version works, the lme4 developers have been reluctant to make the effort to extend it to GLMMs or more complex LMMs, or to implement it for the development version of lme4 … so unless something miraculous happens, it will not be implemented for the new version of lme4. As always, users are encouraged to write and share their own code that implements these capabilities …

Parametric bootstrap

The idea here is that in order to do inference on the effect of (a) predictor(s), you (1) fit the reduced model (without the predictors) to the data; (2) many times, (2a) simulate data from the reduced model, (2b) fit both the reduced and the full model to the simulated (null) data, and (2c) compute some statistic(s) [e.g. the t-statistic of the focal parameter, or the log-likelihood or deviance difference between the models]; (3) compare the observed value of the statistic from fitting your full model to the data to the null distribution generated in step 2.

  • PBmodcomp in the pbkrtest package
  • see the example in help("simulate-mer") in the lme4 package to roll your own, using a combination of simulate() and refit() (a minimal sketch is given below)
  • bootMer in lme4 version >1.0.0
  • a presentation at UseR! 2009 (abstract, slides) went into detail about a proposed bootMer package and suggested it could work for GLMMs too – but it does not seem to be active.
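
A minimal sketch of rolling your own parametric bootstrap along these lines, using the sleepstudy fits m1 (reduced) and m2 (full) from the “Testing significance of random effects” section above:

obs_dev <- 2 * (c(logLik(m2)) - c(logLik(m1)))
nulldev <- replicate(200, {
    ysim <- simulate(m1)[[1]]   ## simulate a response from the reduced model
    2 * (c(logLik(refit(m2, ysim))) - c(logLik(refit(m1, ysim))))
})
mean(nulldev >= obs_dev)        ## approximate p-value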

Predictions and/or confidence (or prediction) intervals on predictions

Note that none of the following approaches takes the uncertainty of the random effects parameters into account …

The general recipe for computing predictions from a linear or generalized linear model is to

  • figure out the model matrix \(X\) corresponding to the new data;
  • matrix-multiply \(X\) by the parameter vector \(\beta\) to get the predictions (or linear predictor in the case of GLM(M)s);
  • extract the variance-covariance matrix of the parameters \(V\)
  • compute \(X V X^{\prime}\) to get the variance-covariance matrix of the predictions;
  • extract the diagonal of this matrix to get variances of predictions;
  • if computing prediction rather than confidence intervals, add the residual variance;
  • take the square-root of the variances to get the standard deviations (errors) of the predictions;
  • compute confidence intervals based on a Normal approximation;
  • for GL(M)Ms, run the confidence interval boundaries (not the standard errors) through the inverse-link function.

lme

library(nlme) 
fm1 <- lme(distance ~ age*Sex, random = ~ 1 + age | Subject,
           data = Orthodont) 
plot(Orthodont,asp="fill") ## plot responses by individual

## note that expand.grid() orders factor levels by *order of
## appearance* -- must match levels(Orthodont$Sex)
newdat <- expand.grid(age=c(8,10,12,14), Sex=c("Female","Male")) 
newdat$pred <- predict(fm1, newdat, level = 0)

## [-2] drops response from formula
Designmat <- model.matrix(formula(fm1)[-2], newdat)
predvar <- diag(Designmat %*% vcov(fm1) %*% t(Designmat)) 
newdat$SE <- sqrt(predvar) 
newdat$SE2 <- sqrt(predvar+fm1$sigma^2)

library(ggplot2) 
pd <- position_dodge(width=0.4) 
g0 <- ggplot(newdat,aes(x=age,y=pred,colour=Sex))+ 
   geom_point(position=pd)
cmult <- 2  ## could use 1.96 instead
g0 + geom_linerange(aes(ymin=pred-cmult*SE,ymax=pred+cmult*SE), position=pd)

## prediction intervals 
g0 + geom_linerange(aes(ymin=pred-cmult*SE2,ymax=pred+cmult*SE2), position=pd)
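
The same recipe applied to an lme4 (merMod) fit, as a minimal sketch (fm and newdat are hypothetical names; predictions are conditional on the random effects being set to zero, i.e. re.form = NA):

library(lme4)
newdat$pred <- predict(fm, newdata = newdat, re.form = NA)
mm <- model.matrix(terms(fm), newdat)               ## fixed-effects model matrix
predvar <- diag(mm %*% tcrossprod(vcov(fm), mm))
newdat$SE  <- sqrt(predvar)                         ## for confidence intervals
newdat$SE2 <- sqrt(predvar + sigma(fm)^2)           ## for prediction intervals (LMMs)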