- Introduction
- References
- Model definition
- Model extensions
- Estimation
- Model diagnostics
- Inference and confidence intervals
  - Testing hypotheses
    - What are the p-values listed by `summary(glmerfit)` etc.? Are they reliable?
    - Methods for testing single parameters
    - Tests of effects (i.e. testing that several parameters are simultaneously zero)
    - Is the likelihood ratio test reliable for mixed models?
    - Why doesn’t `lme4` display denominator degrees of freedom/p values? What other options do I have?
    - Testing significance of random effects
    - Standard errors of variance estimates
    - P-values: MCMC and parametric bootstrap
  - Predictions and/or confidence (or prediction) intervals on predictions
  - Confidence intervals on conditional means/BLUPs/random effects
  - Power analysis
- Model selection and averaging
- Model summaries (goodness-of-fit, decomposition of variance, etc.)
- Miscellaneous/procedural
- Mixed modeling packages
- Bibliography

This is an informal FAQ list for the `r-sig-mixed-models` mailing list.

The most commonly used functions for mixed modeling in R are

- *linear mixed models*: `aov()`, `nlme::lme`^{1}, `lme4::lmer`; `brms::brm`
- *generalized linear mixed models* (GLMMs)
  - frequentist: `MASS::glmmPQL`, `lme4::glmer`; `glmmTMB`
  - Bayesian: `MCMCglmm::MCMCglmm`; `brms::brm`
- *nonlinear mixed models*: `nlme::nlme`, `lme4::nlmer`; `brms::brm`
- *GNLMMs*: `brms::brm`

Another quick-and-dirty way to search for mixed-model related packages on CRAN:

`grep("l.?m[me][^t]",rownames(available.packages()),value=TRUE)`

```
## [1] "blmeco" "buildmer" "cellVolumeDist" "climextRemes"
## [5] "curtailment" "glmertree" "glmm.hp" "glmmEP"
## [9] "glmmfields" "glmmLasso" "glmmML" "glmmPen"
## [13] "glmmrBase" "glmmrOptim" "glmmSeq" "glmmTMB"
## [17] "jlmerclusterperm" "lamme" "lme4" "lmeInfo"
## [21] "lmeresampler" "lmerPerm" "lmerTest" "lmeSplines"
## [25] "lmmot" "lmmpar" "lrmest" "lsmeans"
## [29] "mailmerge" "mlmm.gwas" "mvglmmRank" "nlmeU"
## [33] "nlmeVPC" "palmerpenguins" "SherlockHolmes" "tglkmeans"
## [37] "trouBBlme4SolveR" "vagalumeR" "vglmer"
```

There are some false positives here (e.g. `palmerpenguins`); see here if you’re interested in “regex golf”.

- the mailing list is `r-sig-mixed-models@r-project.org`
- The source code of this document is available on GitHub; the rendered (HTML) version lives on GitHub pages.
- Searching on StackOverflow with the [r] [mixed-models] tags, or on CrossValidated with the [mixed-model] tag may be helpful (these sites also have an `[lme4]` tag).

**DISCLAIMERS:**

- (G)LMMs are hard - harder than you may think based on what you may have learned in your second statistics class, which probably focused on picking the appropriate sums of squares terms and degrees of freedom for the numerator and denominator of an \(F\) test. ‘Modern’ mixed model approaches, although more powerful (they can handle more complex designs, lack of balance, crossed random factors, some kinds of non-Normally distributed responses, etc.), also require a new set of conceptual tools. In order to use these tools you should have at least a general acquaintance with classical mixed-model experimental designs but you should also, probably, read something about modern mixed model approaches. Littell et al. (2006) and Pinheiro and Bates (2000) are two places to start, although Pinheiro and Bates is probably more useful if you want to use R. Other useful references include Gelman and Hill (2006) (focused on Bayesian methods) and Zuur et al. (2009b). If you are going to use generalized linear mixed models, you should understand generalized linear models (Dobson and Barnett (2008), Faraway (2006), and McCullagh and Nelder (1989) are standard references; the last is the canonical reference, but also the most challenging).
- All of the issues that arise with regular linear or generalized-linear modeling (e.g.: inadequacy of p-values alone for thorough statistical analysis; need to understand how models are parameterized; need to understand the principle of marginality and how interactions can be treated; dangers of overfitting, which are not mitigated by stepwise procedures; the non-existence of free lunches) also apply, and can apply more severely, to mixed models.
- When SAS (or Stata, or Genstat/AS-REML or …) and R differ in their answers, R may not be wrong. Both SAS and R may be ‘right’ but proceeding in a different way/answering different questions/using a different philosophical approach (or both may be wrong …)
- The advice in this FAQ comes with **absolutely no warranty of any sort**.

- UCLA IDRE statistical consulting
- Barr (2020) Chapters 5-8

- Pinheiro and Bates (2000): LMM only.
- Zuur et al. (2009b): Focused on ecology.
- Gelman and Hill (2006): LMM and GLMM; Bayesian; examples from social science. Intermediate mathematics.
- (Rethinking)

The following formula extensions for specifying random-effects structures in R are used by

- `lme4`
- `nlme` (nested effects only, although crossed effects can be specified with more work)
- `glmmADMB` and `glmmTMB`

`MCMCglmm` uses a different specification, inherited from AS-REML.

(Modified from Robin Jeffries, UCLA:)

formula | meaning |
---|---|
`(1\|group)` | random group intercept |
`(x\|group)` = `(1+x\|group)` | random slope of x within group with correlated intercept |
`(0+x\|group)` = `(-1+x\|group)` | random slope of x within group: no variation in intercept |
`(1\|group) + (0+x\|group)` | uncorrelated random intercept and random slope within group |
`(1\|site/block)` = `(1\|site)+(1\|site:block)` | intercept varying among sites and among blocks within sites (nested random effects) |
`site+(1\|site:block)` | fixed effect of sites plus random variation in intercept among blocks within sites |
`(x\|site/block)` = `(x\|site)+(x\|site:block)` = `(1 + x\|site)+(1+x\|site:block)` | slope and intercept varying among sites and among blocks within sites |
`(x1\|site)+(x2\|block)` | two different effects, varying at different levels |
`x*site+(x\|site:block)` | fixed effect variation of slope and intercept varying among sites and random variation of slope and intercept among blocks within sites |
`(1\|group1)+(1\|group2)` | intercept varying among crossed random effects (e.g. site, year) |

Or in a little more detail:

equation | formula |
---|---|
\(β_0 + β_{1}X_{i} + e_{si}\) | n/a (Not a mixed-effects model) |
\((β_0 + b_{S,0s}) + β_{1}X_i + e_{si}\) | `~ X + (1\|Subject)` |
\((β_0 + b_{S,0s}) + (β_{1} + b_{S,1s}) X_i + e_{si}\) | `~ X + (1 + X\|Subject)` |
\((β_0 + b_{S,0s} + b_{I,0i}) + (β_{1} + b_{S,1s}) X_i + e_{si}\) | `~ X + (1 + X\|Subject) + (1\|Item)` |
As above, but \(b_{S,0s}\), \(b_{S,1s}\) independent | `~ X + (1\|Subject) + (0 + X\|Subject) + (1\|Item)` |
\((β_0 + b_{S,0s} + b_{I,0i}) + β_{1}X_i + e_{si}\) | `~ X + (1\|Subject) + (1\|Item)` |
\((β_0 + b_{I,0i}) + (β_{1} + b_{S,1s})X_i + e_{si}\) | `~ X + (0 + X\|Subject) + (1\|Item)` |

Modified from: http://stats.stackexchange.com/questions/13166/rs-lmer-cheat-sheet?lq=1 (Livius)
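
As a quick check of the nested-effects rows in the formula table, the sketch below fits the same model with the shorthand `(1|site/block)` and with the expanded form `(1|site)+(1|site:block)`, then compares the log-likelihoods. The data are simulated, and the names `d_nest`, `m_slash` and `m_expanded` are made up for this illustration.

```r
library(lme4)

set.seed(101)
## hypothetical design: 6 sites, blocks labelled 1-4 within each site, 5 obs per block
d_nest <- expand.grid(site = factor(LETTERS[1:6]), block = factor(1:4), rep = 1:5)
site_eff  <- rnorm(6)                          # among-site variation
block_eff <- rnorm(24, sd = 0.5)               # among-block (within-site) variation
sb <- interaction(d_nest$site, d_nest$block)   # unique site-by-block combinations
d_nest$y <- site_eff[as.integer(d_nest$site)] +
    block_eff[as.integer(sb)] + rnorm(nrow(d_nest))

m_slash    <- lmer(y ~ 1 + (1 | site/block), data = d_nest)
m_expanded <- lmer(y ~ 1 + (1 | site) + (1 | site:block), data = d_nest)

## the two specifications describe the same model, so the fits should agree
all.equal(as.numeric(logLik(m_slash)), as.numeric(logLik(m_expanded)))
```

Replacing `(1 | site:block)` with `(1 | block)` in this data set would instead treat block 1 at site A and block 1 at site B as the same block, which is a different (crossed) model.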

The **magic** development version of the equatiomatic package can handle mixed models (`remotes::install_github("datalorax/equatiomatic")`), e.g.

```
library(lme4)
library(equatiomatic)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
equatiomatic::extract_eq(fm1)
```

\[ \begin{aligned} \operatorname{Reaction}_{i} &\sim N \left(\alpha_{j[i]} + \beta_{1j[i]}(\operatorname{Days}), \sigma^2 \right) \\ \left( \begin{array}{c} \begin{aligned} &\alpha_{j} \\ &\beta_{1j} \end{aligned} \end{array} \right) &\sim N \left( \left( \begin{array}{c} \begin{aligned} &\mu_{\alpha_{j}} \\ &\mu_{\beta_{1j}} \end{aligned} \end{array} \right) , \left( \begin{array}{cc} \sigma^2_{\alpha_{j}} & \rho_{\alpha_{j}\beta_{1j}} \\ \rho_{\beta_{1j}\alpha_{j}} & \sigma^2_{\beta_{1j}} \end{array} \right) \right) \text{, for Subject j = 1,} \dots \text{,J} \end{aligned} \]

It doesn’t handle GLMMs (yet), but you could fit two fake models — one LMM like your GLMM but with a Gaussian response, and one GLM with the same family/link function as your GLMM but without the random effects — and put the pieces together.

More possibly useful links:

- Rense Nieuwenhuis’s blogpost/lesson on lme4 model specification
- CrossValidated’s lmer cheat sheet
- Kristoffer Magnusson’s Using R and lme/lmer to fit different two- and three-level longitudinal models

This is in general a far more difficult question than it seems on the surface. There are many competing philosophies and definitions. For example, from Gelman (2005):

> Before discussing the technical issues, we briefly review what is meant by fixed and random effects. It turns out that different—in fact, incompatible—definitions are used in different contexts. [See also Kreft and de Leeuw (1998), Section 1.3.3, for a discussion of the multiplicity of definitions of fixed and random effects and coefficients, and Robinson (1998) for a historical overview.] Here we outline five definitions that we have seen:
>
> 1. Fixed effects are constant across individuals, and random effects vary. For example, in a growth study, a model with random intercepts \(\alpha_i\) and fixed slope \(\beta\) corresponds to parallel lines for different individuals \(i\), or the model \(y_{it} = \alpha_i + \beta t\). Kreft and de Leeuw [(1998), page 12] thus distinguish between fixed and random coefficients.
> 2. Effects are fixed if they are interesting in themselves or random if there is interest in the underlying population. Searle, Casella and McCulloch [(1992), Section 1.4] explore this distinction in depth.
> 3. “When a sample exhausts the population, the corresponding variable is fixed; when the sample is a small (i.e., negligible) part of the population the corresponding variable is random” [Green and Tukey (1960)].
> 4. “If an effect is assumed to be a realized value of a random variable, it is called a random effect” [LaMotte (1983)].
> 5. Fixed effects are estimated using least squares (or, more generally, maximum likelihood) and random effects are estimated with shrinkage [“linear unbiased prediction” in the terminology of Robinson (1991)]. This definition is standard in the multilevel modeling literature [see, e.g., Snijders and Bosker (1999), Section 4.2] and in econometrics.

Another useful comment (via Kevin Wright) reinforcing the idea that “random vs. fixed” is not a simple, cut-and-dried decision: from Schabenberger and Pierce (2001), p. 627:

> Before proceeding further with random field linear models we need to remind the reader of the adage that one modeler’s random effect is another modeler’s fixed effect.

Clark and Linzer (2015) address this question from a mostly econometric perspective, focusing mostly on practical variance/bias/RMSE criteria.

One point of particular relevance to ‘modern’ mixed model estimation (rather than ‘classical’ method-of-moments estimation) is that, for practical purposes, there must be a reasonable number of random-effects levels (e.g. blocks) – more than 5 or 6 at a minimum. This is not surprising if you consider that random effects estimation is trying to estimate an among-block variance. For example, from Crawley (2002) p. 670:

> Are there enough levels of the factor in the data on which to base an estimate of the variance of the population of effects? No, means [you should probably treat the variable as] fixed effects.

Some researchers (who treat fixed vs random as a philosophical rather than a pragmatic decision) object to this approach.

Also see a very thoughtful chapter in Hodges (2016).

Treating factors with small numbers of levels as random will in the best case lead to very small and/or imprecise estimates of random effects; in the worst case it will lead to various numerical difficulties such as lack of convergence, zero variance estimates, etc. (A small simulation exercise shows that at least the estimates of the standard deviation are downwardly biased in this case; it’s not clear whether/how this bias would affect the point estimates of fixed effects or their estimated confidence intervals.) In the classical method-of-moments approach these problems may not arise (because the sums of squares are always well defined as long as there are at least two units), but the underlying problems of lack of power are there nevertheless.
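
A simulation of this kind might look like the following sketch. It is not the simulation referred to in the text: it uses `nlme::lme` with its default REML fit purely for convenience, and the group sizes, true standard deviations, number of replicates, and the helper name `sim_sd` are all arbitrary choices made for illustration.

```r
library(nlme)

set.seed(101)
## fit a random-intercept model to simulated data and return the
## estimated among-group standard deviation (true value: sd_group)
sim_sd <- function(n_groups, n_per = 10, sd_group = 1, sd_resid = 1) {
    g <- factor(rep(seq_len(n_groups), each = n_per))
    y <- rnorm(n_groups, sd = sd_group)[as.integer(g)] +
        rnorm(n_groups * n_per, sd = sd_resid)
    dat <- data.frame(y = y, g = g)
    fit <- tryCatch(lme(y ~ 1, random = ~ 1 | g, data = dat),
                    error = function(e) NULL)
    if (is.null(fit)) return(NA_real_)   # record failed fits as missing
    as.numeric(VarCorr(fit)["(Intercept)", "StdDev"])
}

est3  <- replicate(200, sim_sd(3))    # very few levels
est30 <- replicate(200, sim_sd(30))   # a more comfortable number of levels

## compare the average estimate and its spread with the true value of 1
rbind(levels_3  = c(mean = mean(est3,  na.rm = TRUE), sd = sd(est3,  na.rm = TRUE)),
      levels_30 = c(mean = mean(est30, na.rm = TRUE), sd = sd(est30, na.rm = TRUE)))
```

Looking at `hist(est3)` as well as the summary is worthwhile, since with very few levels a noticeable fraction of the estimates can pile up near zero.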

Thierry Onkelinx has a blog post with some simulations on the impact of the number of levels and concludes with a few recommendations for the number of levels of the grouping variable \(n_s\):

> - get \(n_s > 1000\) levels when an accurate estimate of the random effect variance is crucial. E.g. when a single number will be use for power calculations.
> - get \(n_s > 100\) levels when a reasonable estimate of the random effect variance is sufficient. E.g. power calculations with sensitivity analysis of the random effect variance.
> - get \(n_s > 20\) levels for an experimental study
> - in case \(10 < n_s <20\) you should validate the model very cautious before using the output
> - in case \(n_s < 10\) it is safer to use the variable as a fixed effect.

Oberpriller, Leite, and Pichler (2021) also performed a simulation study and found that while the estimates are similar whether a variable with a small number of levels is treated as fixed or random, there was an impact on Type 1 and Type 2 error rates. They also found that the precise random effects structure (e.g., inclusion of random slopes) had a large impact on these properties.

Also see this thread on the r-sig-mixed-models mailing list and this question on CrossValidated.

- Relatively few mixed effect modeling packages can handle crossed random effects, i.e. those where one level of a random effect can appear in conjunction with more than one level of another effect. (This definition is confusing, and I would happily accept a better one.) A classic example is crossed temporal and spatial effects. If there is random variation among temporal blocks (e.g. years) *and* random variation among spatial blocks (e.g. sites), *and* if there is a consistent year effect across sites and *vice versa*, then the random effects should be treated as crossed.
  - `lme4` does handle crossed effects, efficiently.
  - If you need to deal with crossed REs in conjunction with some of the features that `nlme` offers (e.g. heteroscedasticity of residuals via `weights`/`varStruct`, correlation of residuals via `correlation`/`corStruct`), or if you want to use crossed REs with the `gamlss` package, see p. 163ff of Pinheiro and Bates (2000) (section 4.2.2: Google books link). I give a worked example here. As far as I can tell, a couple of hacks are necessary to get this to work: (1) the data must be expressed as a `groupedData` object (at least, I haven’t managed to get it to work in any other way); (2) the crossed effects must be *nested within another grouping factor* - in the example here I define a dummy group, which is awkward (it makes the variance component for this group and the residual variance jointly unidentifiable), but otherwise seems to work OK.
- I rarely find it useful to think of fixed effects as “nested”
(although others disagree); if for example treatments A and B are only
measured in block 1, and treatments C and D are only measured in block
2, one still assumes (because they are fixed effects) that each
treatment would have the same effect if applied in the other block. (One
might like to estimate treatment-by-block interactions, but in this case
the experimental design doesn’t allow it; one would have to have
multiple treatments measured within each block, although not necessarily
all treatments in every block.) One would code this analysis as
`response~treatment+(1|block)` in `lme4`. Also, in the case of fixed effects, crossed and nested specifications change the parameterization of the model, but not anything else (e.g. the number of parameters estimated, log-likelihood, model predictions are all identical). That is, in R’s `model.matrix` function (which implements a version of Wilkinson-Rogers notation) `a*b` and `a/b` (which expand to `1+a+b+a:b` and `1+a+a:b` respectively) give model matrices with the same number of columns.
- Whether you explicitly specify a random effect as nested or not
depends (in part) on the way the levels of the random effects are coded.
If the ‘lower-level’ random effect is coded with unique levels, then the
two syntaxes
`(1|a/b)` (or `(1|a)+(1|a:b)`) and `(1|a)+(1|b)` are equivalent. If the lower-level random effect has the same labels within each larger group (e.g. blocks 1, 2, 3, 4 within sites A, B, and C) then the explicit nesting `(1|a/b)` is required. It seems to be considered best practice to code the nested level uniquely (e.g. A1, A2, …, B1, B2, …) so that confusion between nested and crossed effects is less likely.
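
Both points can be checked in base R. The sketch below uses a made-up balanced design (the data frame `d_ab` and its column names are hypothetical): it compares the `a*b` and `a/b` model matrices, then shows one way to recode a lower-level factor with unique labels.

```r
## hypothetical design: blocks labelled 1-4 within each of sites A-C
d_ab <- expand.grid(a = factor(c("A", "B", "C")), b = factor(1:4), rep = 1:2)

## fixed effects: crossed and nested codings give the same number of columns
X_cross <- model.matrix(~ a * b, data = d_ab)
X_nest  <- model.matrix(~ a / b, data = d_ab)
c(crossed = ncol(X_cross), nested = ncol(X_nest))

## ... and span the same column space (rank of the combined matrix is unchanged)
qr(cbind(X_cross, X_nest))$rank

## random effects: recode the lower level with unique labels (A.1, B.1, ...)
d_ab$b_unique <- interaction(d_ab$a, d_ab$b, drop = TRUE)
c(reused_labels = nlevels(d_ab$b), unique_labels = nlevels(d_ab$b_unique))
```

Here both model matrices have 12 columns, and the recoded factor has 12 levels rather than 4, so `(1|a) + (1|b_unique)` could no longer be mistaken for a crossed specification.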

- with the usual caveats, plus a few extras – counting degrees of
freedom, etc. – the usual procedure of calculating the sum of squared
Pearson residuals and comparing it to the residual degrees of freedom
should give at least a crude idea of overdispersion. The following
attempt counts each variance or covariance parameter as one model degree
of freedom and presents the sum of squared Pearson residuals, the ratio
of (SSQ residuals/rdf), the residual df, and the \(p\)-value based on the (approximately!!)
appropriate \(\chi^2\) distribution.
**Do PLEASE note the usual, and extra, caveats noted here: this is an APPROXIMATE estimate of an overdispersion parameter**. Even in the GLM case, the expected deviance per point equaling 1 is only true as the distribution of individual deviates approaches normality, i.e. the usual \(\lambda>5\) rules of thumb for Poisson values and \(\textrm{min}(Np, N(1-p)) > 5\) for binomial values (e.g. see Venables and Ripley (2002), p. 208-209). (And that’s without the extra complexities due to GLMM, i.e. the “effective” residual df should be large enough to make the sums of squares converge on a \(\chi^2\) distribution …)
- Remember that (1) overdispersion is irrelevant for models that estimate a scale parameter (i.e. almost anything but Poisson or binomial: Gaussian, Gamma, negative binomial …) and (2) overdispersion is not estimable (and hence practically irrelevant) for Bernoulli models (= binary data = binomial with \(N=1\)).
- The recipes below may need adjustment for some of the more complex model types allowed by `glmmTMB` (e.g. zero-inflation/variable dispersion), where it’s less clear what to measure to estimate overdispersion.
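
The caveat about small counts can be seen in a plain (non-mixed) GLM. The following sketch simulates data that are exactly Poisson, with a small and a large mean, and computes deviance-based and Pearson-based dispersion ratios for an intercept-only fit. The sample size, the two means, and the helper name `disp_ratios` are arbitrary choices for illustration.

```r
set.seed(101)
n <- 1e4

## dispersion ratios (statistic / residual df) for an intercept-only Poisson GLM
disp_ratios <- function(lambda) {
    y <- rpois(n, lambda)
    fit <- glm(y ~ 1, family = poisson)
    c(deviance = deviance(fit) / df.residual(fit),
      pearson  = sum(residuals(fit, type = "pearson")^2) / df.residual(fit))
}

ratios <- rbind(small_mean = disp_ratios(0.2),
                large_mean = disp_ratios(10))
ratios
```

With these settings the deviance-based ratio for the small-mean data falls well below 1 even though there is no under- or overdispersion in the simulated data, while the Pearson-based ratio stays close to 1 in both cases.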

The following function should work for a variety of model types (at least `glmmADMB`, `glmmTMB`, `lme4`, …).

```
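overdisp_fun <- function(model) {
    ## residual df: number of observations minus number of estimated parameters
    rdf <- df.residual(model)
    rp <- residuals(model, type="pearson")
    ## sum of squared Pearson residuals
    Pearson.chisq <- sum(rp^2)
    ## estimated dispersion ratio (approximately 1 if there is no overdispersion)
    prat <- Pearson.chisq/rdf
    ## upper-tail p-value from the (approximate) chi-squared distribution
    pval <- pchisq(Pearson.chisq, df=rdf, lower.tail=FALSE)
    c(chisq=Pearson.chisq, ratio=prat, rdf=rdf, p=pval)
}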
```

Example:

```
library(lme4)
library(glmmTMB)
```

```
set.seed(101)
d <- data.frame(x=runif(1000),
f=factor(sample(1:10,size=1000,replace=TRUE)))
suppressMessages(d$y <- simulate(~x+(1|f), family=poisson,
newdata=d,
newparams=list(theta=1,beta=c(0,2)))[[1]])
m1 <- glmer(y~x+(1|f),data=d,family=poisson)
overdisp_fun(m1)
```

```
## chisq ratio rdf p
## 1035.9966326 1.0391140 997.0000000 0.1902294
```

```
m2 <- glmmTMB(y~x+(1|f),data=d,family="poisson")
overdisp_fun(m2)
```

```
## chisq ratio rdf p
## 1035.9961394 1.0391135 997.0000000 0.1902323
```

The `gof` function in the `aods3` package provides similar functionality (it reports both deviance- and \(\chi^2\)-based estimates of overdispersion and tests).

- quasilikelihood estimation: `MASS::glmmPQL`. Quasi- was deemed unreliable in `lme4`, and is no longer available. (Part of the problem was questionable numerical results in some cases; the other problem was that DB felt that he did not have a sufficiently good understanding of the theoretical framework that would explain what the algorithm was actually estimating in this case.) `geepack::geeglm` may be workable (haven’t tried it).

If you really want quasi-likelihood analysis for `glmer` fits, you can do it yourself by adjusting the coefficient table - i.e., by multiplying the standard error by the square root of the dispersion factor^{2} and recomputing the \(Z\)- and \(p\)-values accordingly, as follows:

```
## extract summary table; you may also be able to do this via
## broom::tidy or broom.mixed::tidy
quasi_table <- function(model,ctab=coef(summary(model)),
phi=overdisp_fun(model)["ratio"]) {
qctab <- within(as.data.frame(ctab),
{ `Std. Error` <- `Std. Error`*sqrt(phi)
`z value` <- Estimate/`Std. Error`
`Pr(>|z|)` <- 2*pnorm(abs(`z value`), lower.tail=FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(m1),digits=3)
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2277 0.2700 0.84 0.4
## x 2.0640 0.0528 39.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```
## to use this with glmmTMB, we need to separate out the
## conditional component of the summary
printCoefmat(quasi_table(m2,
ctab=coef(summary(m2))[["cond"]]),
digits=3)
```

```
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2277 0.2700 0.84 0.4
## x 2.0640 0.0528 39.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Another version, this one tidyverse-centric:

```
library(broom.mixed)
library(dplyr)
tidy_quasi <- function(model, phi=overdisp_fun(model)["ratio"],
conf.level=0.95) {
tt <- (tidy(model, effects="fixed")
%>% mutate(std.error=std.error*sqrt(phi),
statistic=estimate/std.error,
p.value=2*pnorm(abs(statistic), lower.tail=FALSE))
)
return(tt)
}
tidy_quasi(m1)
```

```
## # A tibble: 2 × 6
## effect term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 fixed (Intercept) 0.228 0.270 0.843 0.399
## 2 fixed x 2.06 0.0528 39.1 0
```

`tidy_quasi(m2)`

```
## # A tibble: 2 × 7
## effect component term estimate std.error statistic p.value
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 fixed cond (Intercept) 0.228 0.270 0.843 0.399
## 2 fixed cond x 2.06 0.0528 39.1 0
```

These functions make some simplifying assumptions: (1) this overdispersion computation is approximate

(In this case using quasi-likelihood doesn’t make much difference, since the data we simulated in the first place were Poisson.) Keep in mind that once you switch to quasi-likelihood you will either have to eschew inferential methods such as the likelihood ratio test, profile confidence intervals, AIC, etc., or make more heroic assumptions to compute “quasi-” analogs of all of the above (such as QAIC).

- observation-level random effects (OLRE: this approach should work in most packages). If you want a citation for this approach, try Elston et al. (2001), who cite Lawson et al. (1999); apparently there is also an example in section 10.5 of Maindonald and Braun (2010), and (according to an R-sig-mixed-models post) this is also discussed by Rabe-Hesketh and Skrondal (2008). Also see Browne et al. (2005) for an example in the binomial context (i.e. logit-normal-binomial rather than lognormal-Poisson). Agresti’s excellent book (Agresti 2002) also discusses this (section 13.5), referring back to Breslow (1984) and Hinde (1982). [**Notes**: (a) I haven’t checked all these references myself, (b) I can’t find the reference any more, but I have seen it stated that observation-level random effect estimation is probably dodgy for PQL approaches as used in Elston et al. (2001)]
- alternative distributions
  - Poisson-lognormal model for counts or binomial-logit-Normal model for proportions (see above, “observation-level random effects”)
  - negative binomial for counts or beta-binomial for proportions
    - `lme4::glmer.nb()` should fit a negative binomial, although it is somewhat slow and fragile compared to some of the other methods suggested here. `lme4` cannot fit beta-binomial models (these cannot be formulated as a part of the exponential family of distributions)
    - `glmmTMB` will fit two parameterizations of the negative binomial: `family="nbinom2"` gives the classic parameterization with \(\sigma^2=\mu(1+\mu/k)\) (“NB2” in Hardin and Hilbe’s terminology) while `family="nbinom1"` gives a parameterization with \(\sigma^2=\phi \mu\), \(\phi>1\) (“NB1” to Hardin and Hilbe). The latter might also be called a “quasi-Poisson” parameterization because it matches the mean-variance relationship assumed by quasi-Poisson models, i.e. the variance is strictly proportional to the mean (although the proportionality constant must be >1, a limitation that does not apply to quasi-likelihood approaches). (`glmmADMB` will also fit these models, with `family="nbinom"` for NB2, but is deprecated in favour of `glmmTMB`.) `glmmTMB` also allows beta-binomial models (Harrison (2015) suggests comparing beta-binomial with OLRE models to assess reliability)
    - the `brms` package has a `negbinomial` family (no beta-binomial, but it does have a wide range of other families)

- other packages/approaches (less widely used, or requiring a bit more effort)

Negative binomial models in `glmmTMB` and lognormal-Poisson models in `glmer` (or `MCMCglmm`) are probably the best quick alternatives for overdispersed count data. If you need to explore alternatives (different variance-mean relationships, different distributions), then `ADMB`, `TMB`, `WinBUGS`, `Stan`, `NIMBLE` are the most flexible alternatives.
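
For the lognormal-Poisson route, an observation-level random effect is simply a grouping factor with one level per row. The following is a minimal sketch on simulated data; the data frame `d_od`, the effect sizes, and the model names are hypothetical and chosen only to produce visibly overdispersed counts.

```r
library(lme4)

set.seed(101)
## simulate overdispersed counts: Poisson with extra lognormal noise per observation
d_od <- data.frame(x = runif(500),
                   f = factor(sample(1:10, size = 500, replace = TRUE)))
group_eff <- rnorm(10, sd = 0.5)
eta <- 0.5 + d_od$x + group_eff[as.integer(d_od$f)] + rnorm(500, sd = 0.7)
d_od$y <- rpois(500, lambda = exp(eta))

## observation-level random effect: one factor level per row
d_od$obs <- factor(seq_len(nrow(d_od)))

m_pois <- glmer(y ~ x + (1 | f), data = d_od, family = poisson)
m_olre <- glmer(y ~ x + (1 | f) + (1 | obs), data = d_od, family = poisson)

VarCorr(m_olre)      # the 'obs' term absorbs the extra-Poisson variation
AIC(m_pois, m_olre)  # the OLRE model should fit these data much better
```

The same `obs` factor can be passed to other packages that accept `lme4`-style random-effect terms.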

Underdispersion (much *less* variability than expected) is a
less common problem than overdispersion.

- mild underdispersion is sometimes ignored, since it tends in general to lead to conservative rather than anti-conservative results
- quasi-likelihood (and the quasi-hack listed above) can handle under- as well as overdispersion
- some other solutions exist, but are less widely implemented
  - for distributions with a small range (e.g. litter sizes of large mammals), one can treat responses as ordinal (e.g. using the `ordinal` package, or `MCMCglmm` or `brms` for Bayesian solutions)
  - the COM-Poisson distribution and generalized Poisson distributions, implemented in `glmmTMB`, can handle underdispersion (J. Hilbe recommends the latter in this CrossValidated answer). (`VGAM` has a generalized Poisson distribution, but doesn’t handle random effects.)

While one (well, OK I) would naively think that GLMMs with Gamma distributions would be just as easy (or hard) as any other sort of GLMMs, it seems that they are in fact harder to implement. Basic simulated examples of Gamma GLMMs can fail in `lme4` even when analogous problems with Poisson, binomial, etc. distributions work. Solutions:

- the default inverse link seems particularly problematic; try other links (especially `family=Gamma(link="log")`) if that is possible/makes sense
- consider whether a lognormal model (i.e. a regular LMM on logged data) would work/makes sense.
- Lo and Andrews (2015) argue that the Gamma family with an *identity* link is superior to lognormal models for reaction-time data. I (BMB) don’t find their argument particularly convincing, but lots of people want to do this. Unfortunately this is technically challenging (see here), because it is likely that some “illegal” values (predicted responses \(\le 0\)) will occur while fitting the model, even if the final fitted model makes no impossible predictions. Thus something has to be done to make the model-fitting machinery tolerant of such values (i.e. returning `NA` for these model evaluations, or clamping illegal values to the constrained space with an appropriate smooth penalty function).

Gamma models can be fitted by a wide variety of platforms (`lme4::glmer`, `MASS::glmmPQL`, `glmmADMB`, `glmmTMB`, `MixedModels.jl`, `MCMCglmm`, `brms` …); not sure about others.
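
To make the log-link and lognormal suggestions concrete, here is a small sketch on simulated data. It uses `MASS::glmmPQL` and `nlme::lme` only because both ship with standard R installations; the same formulas could be handed to the other platforms listed here. The data frame `dd`, the parameter values, and the model names are made up for the example.

```r
library(MASS)   # glmmPQL
library(nlme)   # lme, fixef

set.seed(101)
n_group <- 30
n_per <- 30
dd <- data.frame(g = factor(rep(seq_len(n_group), each = n_per)),
                 x = runif(n_group * n_per))
b <- rnorm(n_group, sd = 0.5)                        # random intercepts
mu <- exp(1 + 0.5 * dd$x + b[as.integer(dd$g)])      # log link, true slope 0.5
dd$y <- rgamma(nrow(dd), shape = 5, rate = 5 / mu)   # Gamma response with mean mu

## Gamma GLMM with a log link (fitted here by penalized quasi-likelihood)
m_gamma <- glmmPQL(y ~ x, random = ~ 1 | g, family = Gamma(link = "log"),
                   data = dd, verbose = FALSE)

## lognormal alternative: an ordinary LMM on the logged response
m_lnorm <- lme(log(y) ~ x, random = ~ 1 | g, data = dd)

rbind(gamma_log = fixef(m_gamma), lognormal = fixef(m_lnorm))
```

The slopes from the two fits should be similar; the intercepts are not directly comparable, because the mean of `log(y)` is not the log of the mean of `y`.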

Proportion data where the denominator (e.g. maximum possible number of successes for a given observation) is not known can be modeled using a Beta distribution. Smithson and Verkuilen (2006) is a good introduction for non-statisticians (*not* in the mixed-model case), and the `betareg` package (Cribari-Neto and Zeileis 2009) handles *non*-mixed Beta regressions. The `glmmTMB` and `brms` packages handle Beta mixed models (`brms` also handles zero-inflated and zero-one inflated models).

See e.g. Martin et al. (2005) or Warton (2005) (“many zeros does not mean zero inflation”) or Zuur et al. (2009a) for general information on zero-inflation.

- `MCMCglmm` handles zero-truncated, zero-inflated, and zero-altered models, although specifying the models is a little bit tricky: see Sections 5.3 to 5.5 of the CourseNotes vignette
- `glmmADMB` handles
  - zero-inflated models (with a single zero-inflation parameter – i.e., the level of zero-inflation is assumed constant across the whole data set)
  - truncated Poisson and negative binomial distributions (which allows two-stage fitting of hurdle models)
- `glmmTMB` handles a variety of Z-I and Z-T models (allows covariates, and random effects, in the zero-alteration model)
- `brms` does too
- so does `GLMMadaptive`
- Gavin Simpson has a detailed writeup showing that `mgcv::gam()` can do simple mixed models (Poisson, not NB) with zero-inflation, and comparing `mgcv` with `glmmTMB` results
- `gamlssNP` in the `gamlss.mx` package should handle zero-inflation, and the `gamlss.tr` package should handle truncated (i.e. hurdle) models – but I haven’t tried them
- roll-your-own: ADMB/R2admb, WinBUGS/R2WinBUGS, TMB, Stan, …

Continuous data are a special case where the mixture model for
zero-inflated data is less relevant, because observations that are
exactly zero occur with *probability* (but not probability
density) zero. There are two cases of interest:

In this case zero is a problematic observation for the distribution; it’s either impossible or infinitely (locally) likely. Some examples:

- Gamma distribution: probability density at zero is infinite (if shape<1) or zero (if shape>1); it’s finite only for an exponential distribution (shape==1)
- Lognormal distribution: the probability density at zero is zero.
- Beta distribution: the probability densities at 0 and 1 are zero (if the corresponding shape parameter is >1) or infinite (if shape<1)

The best solution depends very much on the data-generating mechanism.

- If the bad (0/1) values are generated by rounding (e.g. proportions that are too close to the boundaries are reported as being on the boundaries), the simplest solution is to “squeeze” these in slightly, e.g. \(y \to (y + a)/(1 + 2a)\) for some sensible value of \(a\) (Smithson and Verkuilen 2006)
- If you think that zero values are generated by a separate process, the simplest solution is to fit a Bernoulli model to the zero/non-zero data, then a *conditional* continuous model for the non-zero values; this is effectively a *hurdle model*.
- You might have *censored* data where all values below a certain limit (e.g. a detection limit) are recorded as zero. The `lmec` package handles *linear* mixed models; `brms` and `GLMMadaptive` both provide support for censored data in mixed models.
- The `cplm` and `glmmTMB` packages handle ‘Tweedie compound Poisson linear models’, which in a particular range of parameters allows for skewed continuous responses with a spike at zero
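
As an illustration of the “squeeze” idea from the first item, the sketch below maps proportions in \([0,1]\) strictly inside the interval using \(y \to (y + a)/(1 + 2a)\). The function name `squeeze` and the choice of \(a\) are assumptions for this example; with \(a = 1/(2(n-1))\) the result coincides with the \((y(n-1) + 1/2)/n\) form usually attributed to Smithson and Verkuilen (2006).

```r
## shrink values in [0, 1] towards 0.5 so that exact 0s and 1s move off the boundary
squeeze <- function(y, a) (y + a) / (1 + 2 * a)

y <- c(0, 0.25, 0.5, 1)    # hypothetical proportions, including boundary values
n <- 50                    # sample size used to pick the amount of squeezing
a <- 1 / (2 * (n - 1))

y_sq <- squeeze(y, a)
y_sq
range(y_sq)                # strictly inside (0, 1)

## same transformation written in the (y * (n - 1) + 0.5) / n form
all.equal(y_sq, (y * (n - 1) + 0.5) / n)
```

Because the fitted coefficients can be sensitive to the amount of squeezing, it is worth refitting with a couple of different values of `a` to see how much the results move.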

In this case (e.g. a spike of zeros in the center of an otherwise continuous distribution), the hurdle model probably makes the most sense.

- you can use a likelihood ratio test between the regular and zero-inflated version of the model, but be aware of boundary issues (search “boundary” elsewhere on this page …) – the null value (no zero inflation) is on the boundary of the feasible space
- you can use AIC or variations, with the same caveats
- you can use Vuong’s test, which is often recommended for testing zero-inflation in GLMs, because under some circumstances the various model flavors under consideration (hurdle vs zero-inflated vs “vanilla”) are not nested. Vuong’s test is implemented (and referenced) in the `pscl` package, but not for (G)LMMs. However, the `nonnest2` package provides an example (in conjunction with the `merDeriv` package) for using its `vuongtest` function with `merMod` objects. (May also work with `glmmTMB`, haven’t tried it …)
- two untested but reasonable approaches:
  - use a `simulate()` method if it exists to construct a simulated distribution of the proportion of zeros expected overall from your model, and compare it to the observed proportion of zeros in the data set
  - compute the probability of a zero for each observation. On the basis of (conditionally) independent Bernoulli trials, compute the expected number of zeros and the confidence intervals, then compare it with the observed number.
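The two “untested but reasonable” checks can be sketched as follows. To keep the example self-contained it uses a plain Poisson `glm()` on made-up data; the same pattern should carry over to any fitted model class that has a `simulate()` method, although that is an assumption to verify for your own model. The data frame `dd` and all parameter values are hypothetical.

```r
set.seed(101)
# Hypothetical example data: Poisson counts with one predictor
dd <- data.frame(x = rnorm(200))
dd$y <- rpois(200, lambda = exp(0.5 + 0.5 * dd$x))
fit <- glm(y ~ x, family = poisson, data = dd)

# Approach 1: simulate new responses from the fitted model and compare
# the simulated proportions of zeros with the observed proportion
obs_zero <- mean(dd$y == 0)
sims <- simulate(fit, nsim = 1000)        # data frame, one column per simulation
sim_zero <- colMeans(sims == 0)
p_excess <- mean(sim_zero >= obs_zero)    # share of simulations with at least as many zeros

# Approach 2: per-observation probability of a zero under the fitted model
p0 <- dpois(0, lambda = fitted(fit))
expected_zeros <- sum(p0)
sd_zeros <- sqrt(sum(p0 * (1 - p0)))      # sd of a sum of independent Bernoulli trials
c(observed = sum(dd$y == 0), expected = expected_zeros, sd = sd_zeros)
```

A small value of `p_excess`, or an observed count far above `expected_zeros` relative to `sd_zeros`, would suggest more zeros than the fitted model can account for.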

In `nlme` these so-called **R-side** (R for “residual”) structures are accessible via the `weights`/`VarStruct` (heteroscedasticity) and `correlation`/`corStruct` (spatial or temporal correlation) arguments and data structures. This extension is a bit harder than it might seem. In LMMs it is a natural extension to allow the residual error terms to be components of a single multivariate normal draw; if that MVN distribution is uncorrelated and homoscedastic (i.e. proportional to an identity matrix) we get the classic model, but we can in principle allow it to be correlated and/or heteroscedastic.

It is not too hard to define marginal correlation structures that don’t make sense. One class of reasonably sensible models is to always assume an observation-level random effect (as MCMCglmm does for computational reasons) and to allow that random effect to be MVN on the link scale (so that the full model is lognormal-Poisson, logit-normal binomial, etc., depending on the link function and family).

For example, a relatively simple Poisson model with spatially correlated errors might look like this:

\[ \begin{split} \eta & \sim \textrm{MVN}(a + b x, \Sigma) \\ \Sigma_{ij} & = \sigma^2 \exp(-d_{ij}/s) \\ y_i & \sim \textrm{Poisson}(\lambda=\exp(\eta_i)) \end{split} \]

That is, the marginal distributions of the response values are
Poisson-lognormal, but on the link (log) scale the latent Normal
variables underlying the response are *multivariate* normal, with
a variance-covariance matrix described by an exponential spatial
correlation function with scale parameter \(s\).
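To make the model concrete, here is a minimal base-R sketch that simulates one data set from it. The site coordinates, the covariate, and every parameter value are made up for illustration; only the structure (exponential covariance on the log scale, Poisson response) follows the equations above.

```r
set.seed(1)
n <- 50
coords <- cbind(runif(n), runif(n))      # hypothetical site locations
x <- rnorm(n)                            # hypothetical covariate
a <- 0.5; b <- 0.3                       # intercept and slope (illustrative values)
sigma2 <- 0.5; s <- 0.2                  # variance and spatial scale (illustrative values)

d <- as.matrix(dist(coords))             # pairwise distances d_ij
Sigma <- sigma2 * exp(-d / s)            # exponential covariance function

# Draw eta ~ MVN(a + b*x, Sigma) using the Cholesky factor of Sigma
L <- t(chol(Sigma))
eta <- drop(a + b * x + L %*% rnorm(n))
y <- rpois(n, lambda = exp(eta))         # Poisson response on the log link
```

Simulating from the model like this is also a convenient way to test whether a given fitting tool can recover known values of \(\sigma^2\) and \(s\).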

How can one achieve this?

- These types of models are not implemented in `lme4`, for either LMMs or GLMMs; they are fairly low priority, and it is hard to see how they could be implemented for GLMMs (the equivalent for LMMs is tedious but should be straightforward to implement).
- For LMMs, you can use the spatial/temporal correlation structures that are built into `(n)lme`, which include basic geostatistical (space) and ARMA-type (time) models.

```
library(sos)
findFn("corStruct")
```

finds additional possibilities in the `ramps` (extended geostatistical) and `ape` (phylogenetic) packages.

- You can use these structures in GLMMs via `MASS::glmmPQL` (see Dormann et al.)
- `geepack::geeglm`
- `geoR`, `geoRglm` (power tools); these are mostly designed for fitting spatial random field GLMMs via MCMC – not sure that they do random effects other than the spatial random effect
- R-INLA (super-power tool)
- it is possible to use AD Model Builder to fit spatial GLMMs, as shown in these AD Model Builder examples; this capability is not in the `glmmADMB` package (and may not be for a while!), but it would be possible to run AD Model Builder via the `R2admb` package (requires installing – and learning! – ADMB)
- geoBUGS, the geostatistical/spatial correlation module for WinBUGS, is another alternative (but again requires going outside of R)

*Complete separation* occurs in a binary-response model when
there is some linear combination of the parameters that perfectly
separates failures from successes - for example, when all of the
observations are zero for some particular combination of categories. The
symptoms of this problem are unrealistically large parameter estimates;
ridiculously large Wald standard errors (the *Hauck-Donner
effect*); and various warnings.

In particular, binomial `glmer()` models with complete separation can lead to “Downdated VtV is not positive definite” (e.g. see here) or “PIRLS step-halvings failed to reduce deviance in pwrssUpdate” errors (e.g. see here). Roughly speaking, the complete separation is likely to appear even if one considers only the fixed effects part of the model (counterarguments or counterexamples welcome!), suggesting two quick-and-dirty diagnostic methods. If `fixed_form` is the formula including only the fixed effects:

- `summary(g1 <- glm(fixed_form, family=binomial, data=...))` will show one or more of the following symptoms:
  - warnings that `glm.fit: fitted probabilities numerically 0 or 1 occurred`
  - parameter estimates of large magnitude (e.g. `any(abs(g1$coefficients)>8)`, assuming that predictors are either categorical or scaled to have standard deviations of \(\approx 1\))
  - extremely large Wald standard errors, and large p-values (*Hauck-Donner effect*)
- the `detectseparation` package has a method for detecting complete separation: `library("detectseparation"); update(g1,method="detect_separation")`. This should say whether complete separation occurs, and in which (combinations of) variables, e.g.

```
Separation: TRUE
Existence of maximum likelihood estimates
(Intercept) height
Inf Inf
0: finite value, Inf: infinity, -Inf: -infinity
```
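The first diagnostic (fitting the fixed-effects part with `glm()`) can be tried on a toy data set where one predictor separates the outcomes perfectly. This is only a sketch: the data frame `dd` and the helper vector `warn_msgs` are made up for illustration, and the threshold of 8 is the rule of thumb quoted above.

```r
# Toy data in which x perfectly separates failures (0) from successes (1)
dd <- data.frame(x = 1:6, y = c(0, 0, 0, 1, 1, 1))

# Collect the warnings emitted while fitting, instead of printing them
warn_msgs <- character(0)
g1 <- withCallingHandlers(
  glm(y ~ x, family = binomial, data = dd),
  warning = function(w) {
    warn_msgs <<- c(warn_msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

warn_msgs                            # fitting warnings, if any
any(abs(coef(g1)) > 8)               # large-magnitude parameter estimates
coef(summary(g1))[, "Std. Error"]    # Wald standard errors
```

Capturing the warnings in a vector makes it easy to run the same check over many candidate models in a loop.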

If complete separation is occurring between categories of a single
categorical fixed-effect predictor with a large number of levels, one
option would be to treat this fixed effect as a random effect, which
will allow some degree of shrinkage to the mean. (It might be reasonable
to specify the variance of this term *a priori* to a large value
[minimal shrinkage], rather than trying to estimate it from the
data.)

(**TODO**: worked example)

The general approach to handling complete separation in logistic regression is called *penalized regression*; it’s available in the `brglm`, `brglm2`, `logistf`, and `rms` packages. However, these packages don’t handle mixed models, so the best available *general* approach is to use a Bayesian method that allows you to set a prior on the fixed effects, e.g. a Gaussian with standard deviation of 3; this can be done in any of the Bayesian GLMM packages (e.g. `blme`, `MCMCglmm`, `brms`, …). (See supplementary material for Fox et al. 2016 for a worked example.)

I’m not aware of easy ways to fit mixed models with non-Gaussian random effects distributions in R (i.e., convenient, flexible, well-tested implementations). McCulloch and Neuhaus (2011) discuss when this misspecification may be important. This presentation discusses various approaches to solving the problem (e.g. using a Gamma rather than a Normal distribution of REs in log-link models). The `spaMM` package implements H-likelihood models (Lee, Nelder, and Pawitan 2017), and claims to allow a range of random-effects distributions (perhaps not well tested though …)

In principle you can implement any random-effects distribution you want in a fully capable Bayesian modeling language (e.g. JAGS/Stan/PyMC/etc.); see e.g. this StackOverflow answer, which uses the `rethinking` package’s interface to Stan.

(adapted from Bolker et al TREE 2009)

Method | Advantages | Disadvantages | Packages |
---|---|---|---|
Penalized quasi-likelihood | Flexible, widely implemented | Likelihood inference may be inappropriate; biased for large variance or small means | PROC GLIMMIX (SAS), GLMM (GenStat), glmmPQL (R:MASS), ASREML-R |
Laplace approximation | More accurate than PQL | Slower and less flexible than PQL | glmer (R:lme4,lme4a), glmm.admb (R:glmmADMB), INLA, glmmTMB, AD Model Builder, HLM |
Gauss-Hermite quadrature | More accurate than Laplace | Slower than Laplace; limited to 2‑3 random effects | PROC NLMIXED (SAS), glmer (R:lme4, lme4a), glmmML (R:glmmML), xtlogit (Stata) |
Markov chain Monte Carlo | Highly flexible, arbitrary number of random effects; accurate | Slow, technically challenging, Bayesian framework | MCMCglmm (R:MCMCglmm), rstanarm (R), brms (R), MCMCpack (R), WinBUGS/OpenBUGS (R interface: BRugs/R2WinBUGS), JAGS (R interface: rjags/R2jags), AD Model Builder (R interface: R2admb), glmm.admb (post hoc MCMC after Laplace fit) (R:glmmADMB) |

- double-check the model specification and the data for mistakes
- center and scale continuous predictor variables (e.g. with `scale()`)
- try all available optimizers (e.g. several different implementations of BOBYQA and Nelder-Mead, L-BFGS-B from `optim`, `nlminb()`, …). While this will of course be slow for large fits, we consider it the gold standard; if all optimizers converge to values that are practically equivalent (it’s up to the user to decide what “practically equivalent” means for their case), then we would consider the model fit to be good enough. For example:

```
modelfit.all <- lme4::allFit(model)
ss <- summary(modelfit.all)
```

Most of the current advice about troubleshooting `lme4` convergence problems can be found in the help page `?convergence`. That page explains that the convergence tests in the current version of `lme4` (1.1-11, February 2016) generate lots of false positives. We are considering raising the gradient warning threshold to 0.01 in future releases of `lme4`. In addition to the general troubleshooting tips above:

- double-check the Hessian calculation with the more expensive Richardson extrapolation method (see examples)
- restart the fit from the apparent optimum, or from a point perturbed slightly away from the optimum (`getME(model,c("theta","beta"))` should retrieve the parameters in a form suitable to be used as the `start` parameter)
- a common error is to specify an offset to a log-link model as a raw searching-effort value, i.e. `offset(effort)` rather than `offset(log(effort))`. While the intention is to fit a model where \(\textrm{counts} \propto \textrm{effort}\), specifying `offset(effort)` leads to a model where \(\textrm{counts} \propto \exp(\textrm{effort})\) instead; `exp(effort)` is often a huge (and model-destabilizing) number.
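The offset point is easy to check by simulation. The sketch below uses a plain Poisson `glm()` on made-up data (the variable names and parameter values are hypothetical) to show that `offset(log(effort))` recovers the simulated coefficients, and how large `exp(effort)` becomes when the log is omitted.

```r
set.seed(42)
n <- 500
effort <- runif(n, 1, 20)                # hypothetical searching effort
x <- rnorm(n)
counts <- rpois(n, lambda = effort * exp(-1 + 0.5 * x))
dd <- data.frame(counts, effort, x)

# log(effort) enters the linear predictor with its coefficient fixed at 1,
# so the expected count is proportional to effort
fit_ok <- glm(counts ~ x + offset(log(effort)), family = poisson, data = dd)
coef(fit_ok)                             # compare with the simulated values (-1, 0.5)

# The multiplier implied by offset(effort) would instead be exp(effort)
range(effort)
range(exp(effort))
```

The same formula syntax for offsets is used by `glmer()` and `glmmTMB()`, so the check carries over directly to mixed models.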

It is very common for overfitted mixed models to result in singular
fits. Technically, singularity means that the random effects
variance-covariance matrix is of *less than full rank*. There are
various ways to describe this, from more to less technical:

some of the eigenvalues of the covariance matrix are zero, or effectively zero;

some combinations of the elements of the random-effects vector are perfectly multicollinear;

some linear combinations of elements of the random-effects vector have zero variance;

an \(n \times n\) covariance matrix corresponds to an \(n\)-dimensional ellipsoid where the lengths of the major axes are proportional to the eigenvalues; the ellipsoid is “flat” in some directions, e.g. an ellipse has collapsed to a line segment

In simple cases where a random effect term is represented by a single variance (*scalar* random effects), this is reflected in a variance estimate that is zero or near zero. Functions such as `nlme::lme()` or `glmmTMB()` that estimate variances on the log scale will often *not* report a singular fit, but will instead return a very small value (1e-6 or less) for the random-effects variance; on the log scale, this will correspond to a parameter estimate that is a large negative number, usually accompanied by warnings about non-positive-definite Hessians or (in the case of `lme()`) ridiculously large Wald confidence intervals returned by `intervals()`.

In the case of a two-dimensional random effect (such as a random-slopes model), this typically corresponds to a perfect (+/- 1) correlation between the slope and intercept.

In higher-dimensional random effects (such as the random effect of a categorical variable with more than two levels, or a random-slopes model with more than one covariate), it’s pretty much impossible to see at a glance that the covariance matrix is singular. Extracting the RE covariance matrix and computing its eigenvalues (this is what `rePCA` in the `lme4` package does) will tell you. In the particular case of `lme4`, singularity is detectable by seeing if any of the elements of the \(\boldsymbol \theta\) (variance-covariance Cholesky decomposition) vector corresponding to diagonal elements are (near) zero; this is what `?isSingular` does.
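The eigenvalue description can be illustrated without fitting any model. The sketch below builds a hypothetical 2×2 intercept/slope covariance matrix with a correlation of exactly -1 and shows that one eigenvalue is numerically zero; the standard deviations and the `1e-8` cutoff are arbitrary illustrative choices.

```r
# Hypothetical random intercept/slope covariance matrix with correlation exactly -1
sd_int <- 2; sd_slope <- 0.5; rho <- -1
V <- matrix(c(sd_int^2,                rho * sd_int * sd_slope,
              rho * sd_int * sd_slope, sd_slope^2), nrow = 2)

ev <- eigen(V, symmetric = TRUE)$values
ev                            # one eigenvalue is (numerically) zero
min(ev) / max(ev) < 1e-8      # flags the matrix as effectively rank-deficient
```

For a fitted model, the same check can be applied to each estimated random-effects covariance matrix.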

Singular fits commonly occur in two scenarios:

small numbers of random-effect levels (e.g. <5), as illustrated in these simulations and discussed (in a somewhat different, Bayesian context) by Gelman (2006).

complex random-effects models, e.g. models of the form `(f|g)` where `f` is a categorical variable with a relatively large number of levels, or models with several different random-slopes terms.

In `MCMCglmm`, singular or near-singular fits will provoke an error and a requirement to specify a stronger prior.

At present there are a variety of strong opinions about how to resolve such problems, which are sometimes conflated with the general problem of how to decide on the appropriate complexity of the random-effects component of a model. Briefly:

- If a variance component is zero, dropping it from the model will have no effect on any of the estimated quantities (although it will affect the AIC, as the variance parameter is counted even though it has no effect). Pasch, Bolker, and Phelps (2013) gives one example where random effects were dropped because the variance components were consistently estimated as zero. Conversely, if one chooses on philosophical grounds to retain these parameters, it won’t change any of the answers.
- Barr et al. (2013) suggest always starting with the maximal model (i.e. the most complex random-effects component of the model that is *theoretically* identifiable given the experimental design) and then dropping terms when singularity or non-convergence occurs (please see the paper for detailed recommendations …)
- Matuschek et al. (2017) and Bates, Kliegl, et al. (2015) disagree, suggesting that models should be simplified *a priori* whenever possible. In particular, they suggest \(p\)-value-based stepwise reduction of the random effects model using a loose \(p\)-value criterion (e.g. \(\alpha_{\text{LRT}} = 0.2\)). They also provide tools for diagnosing and mitigating singularity.
- One alternative (suggested by Robert LaBudde) for the small-numbers-of-levels scenario is to “fit the model with the random factor as a fixed effect, get the level coefficients in the sum to zero form, and then compute the standard deviation of the coefficients.” This is appropriate for users who are (a) primarily interested in measuring variation (i.e. the random effects are not just nuisance parameters, and the variability [rather than the estimated values for each level] is of scientific interest), (b) unable or unwilling to use other approaches (e.g. MCMC with half-Cauchy priors in WinBUGS), (c) unable or unwilling to collect more data. For the simplest case (balanced, orthogonal, nested designs with normal errors) these estimates of standard deviations should equal the classical method-of-moments estimates.
- Bayesian approaches allow the user to specify an informative prior that avoids singularity.
  - The `blme` package (Chung et al. 2013) provides a wrapper for the `lme4` machinery that adds a particular form of weak prior to get an approximate Bayesian maximum *a posteriori* estimate that avoids singularity.
  - The `MCMCglmm` package allows for priors on the variance-covariance matrix
  - The `rstanarm` and `brms` packages provide wrappers for the Stan Hamiltonian MCMC engine that fit GLMMs via `lme4` syntax, again allowing a variety of priors to be set.
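The fixed-effect alternative attributed to Robert LaBudde can be sketched in base R as follows. The data are simulated and all names and values are hypothetical; the key steps are fitting the grouping factor with sum-to-zero contrasts and recovering the coefficient for the last level, which `contr.sum` does not report directly.

```r
set.seed(7)
# Hypothetical balanced data: 4 groups, 10 observations each
k <- 4; n_per <- 10
g <- factor(rep(seq_len(k), each = n_per))
group_eff <- rnorm(k, sd = 2)
y <- 10 + group_eff[as.integer(g)] + rnorm(k * n_per, sd = 1)

# Fit the grouping factor as a fixed effect with sum-to-zero contrasts
fit <- lm(y ~ g, contrasts = list(g = "contr.sum"))

# contr.sum reports k - 1 deviations; the last one is minus their sum
dev <- coef(fit)[-1]
level_coefs <- c(dev, -sum(dev))
sd(level_coefs)               # standard deviation of the level coefficients
```

In this balanced case the level coefficients are just the group means minus the grand mean, which gives a quick sanity check on the contrast coding.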

For some problems it would be convenient to be able to set the residual variance term to zero, or a fixed value. This is difficult in `lme4`, because the model is parameterized internally in such a way that the residual variance is profiled out (i.e., calculated directly from a residual deviance term) and the random-effects variances are scaled by the residual variance.

Searching the r-sig-mixed-models list for “fix residual variance”

- This is done in the `metafor` package, for meta-analytic models
- You can use the `blme` package to fix the residual variance: from Vincent Dorie,

```
library(blme)
blmer(formula = y ~ 1 + (1 | group), weights = V,
resid.prior = point(1.0), cov.prior = NULL)
```

This sets the residual variance to 1.0. You *cannot* use this to make it exactly zero, but you can make it very small (and experiment with setting it to different small values, e.g. 0.001 vs 0.0001, to see how sensitive the results are).

- Similarly, you can fix the residual variance to a small positive value in `[n]lme` via the `control` argument (Heisterkamp et al. 2017):

```
nlme::lme(Reaction~Days,random=~1|Subject,
data=lme4::sleepstudy,
control=list(sigma=1e-8))
```

- the `glmmTMB` package can set the residual variance to (approximately) zero, by specifying `dispformula = ~0` (in fact the value can be set via `glmmTMBControl(zerodisp_val=...)`; the default value is `log(sqrt(.Machine$double.eps))`)
- There is an rrBlupMethod6 package on CRAN (“Re-parametrization of mixed model formulation to allow for a fixed residual variance when using RR-BLUP for genom[e]wide estimation of marker effects”), but it seems fairly special-purpose.
- it might be possible *in principle* to adapt `lme4`’s internal `devfun2()` function (used in the likelihood profiling computation for LMMs), which uses a specified value of the residual standard deviation in computing likelihood, but as Bates, Mächler, et al. (2015) say:

The resulting function is not useful for general nonlinear optimization — one can easily wander into parameter regimes corresponding to infeasible (non-positive semidefinite) variance-covariance matrices — but it serves for likelihood profiling, where one focal parameter is varied at a time and the optimization over the other parameters is likely to start close to an optimum.

`lme4` error messages

Most of the following error messages are relatively unusual, and happen mostly with complex/large/unstable models. There is often no simple fix; the standard suggestions for troubleshooting are (1) try rescaling and/or centering predictors; (2) see if a simpler model can be made to work; (3) look for severe lack of balance and/or complete separation in the data set.

`PIRLS step-halvings failed to reduce deviance in pwrssUpdate`

- this can also occur due to complete or quasi-complete separation (see Penalization/handling complete separation)
- When using `lme4` to fit GLMMs with link functions that do not automatically constrain the response to the allowable range of the distributional family (e.g. binomial models with a log link, where the estimated probability can be >1, or Gamma models with an inverse link, where the estimated mean can be negative), it is not unusual to get this error. This occurs because `lme4` doesn’t do anything to constrain the predicted values, so `NaN` values pop up, which aren’t handled gracefully. If possible, switch to a link function that constrains the response (e.g. logit link for binomial or log link for Gamma).
- otherwise this message often occurs when there is something else wrong with the model or data, e.g.
  - a model fitted to underdispersed data includes both a negative binomial response and observation-level random effects
  - negative response values for a link function that doesn’t allow them

`Downdated VtV is not positive definite`: no specific advice, see general suggestions above

`convergence code 3 from bobyqa: bobyqa -- a trust region step failed to reduce q`: again no specific advice about fixing this, although there is a useful discussion of the meaning of the error message on CrossValidated

- While restricted maximum likelihood (REML) procedures (Wikipedia) are well established for linear mixed models, it is less clear how one should define and compute the equivalent criteria (integrating out the effects of fixed parameters) for GLMMs. Millar (2011) and Berger, Liseo, and Wolpert (1999) are possible starting points in the peer-reviewed literature, and there are mailing-list discussions of these issues here and here.
- Attempting to use `REML=TRUE` with `glmer` will produce the warning `extra argument(s) ‘REML’ disregarded`
- `glmmTMB` allows `REML=TRUE` for GLMMs (it uses the Laplace approximation to integrate over the fixed effect parameters), since version 0.2.2

What are the p-values listed by `summary(glmerfit)` etc.? Are they reliable?

By default, in keeping with the tradition in analysis of generalized linear models, `lme4` and similar packages display the Wald Z-statistics for each parameter in the model summary. These have one big advantage: they’re convenient to compute. However, they are asymptotic approximations, assuming both that (1) the sampling distributions of the parameters are multivariate normal (or equivalently that the log-likelihood surface is quadratic) and that (2) the sampling distribution of the log-likelihood is (proportional to) \(\chi^2\). The second approximation is discussed further under “Degrees of freedom”. The first assumption usually requires an even greater leap of faith, and is known to cause problems in some contexts (for binomial models failures of this assumption are called the *Hauck-Donner effect*), especially with extreme-valued parameters.
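The Wald Z-statistic is just the estimate divided by its standard error, referred to a standard normal distribution. The sketch below reproduces that arithmetic by hand for a plain Poisson `glm()` on made-up data, so that it runs without extra packages; applying the same steps to a `glmer` fit via `fixef()` and `vcov()` is an assumption about how one would adapt it, not something shown here.

```r
set.seed(3)
# Hypothetical data for a Poisson GLM
dd <- data.frame(x = rnorm(100))
dd$y <- rpois(100, lambda = exp(0.2 + 0.4 * dd$x))
fit <- glm(y ~ x, family = poisson, data = dd)

est <- coef(fit)
se <- sqrt(diag(vcov(fit)))                  # Wald standard errors
z <- est / se                                # Wald Z-statistics
p <- 2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-values, standard normal reference
cbind(estimate = est, se = se, z = z, p = p)
```

Seeing how little goes into these numbers makes it clearer why they inherit all the weaknesses of the quadratic approximation to the log-likelihood.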

From worst to best:

- Wald \(Z\)-tests
- **For balanced, nested LMMs** where degrees of freedom can be computed according to classical rules: Wald \(t\)-tests
- Likelihood ratio test, either by setting up the model so that the parameter can be isolated/dropped (via `anova` or `drop1`), or via computing likelihood profiles
- Markov chain Monte Carlo (MCMC) or parametric bootstrap confidence intervals

From worst to best:

- Wald chi-square tests (e.g. `car::Anova`)
- Likelihood ratio test (via `anova` or `drop1`)
- **For balanced, nested LMMs** where df can be computed: conditional F-tests
- **For LMMs**: conditional F-tests with df correction (e.g. Kenward-Roger in `pbkrtest` package: see notes on K-R etc. below)
- MCMC or parametric, or nonparametric, bootstrap comparisons (nonparametric bootstrapping must be implemented carefully to account for grouping factors)

- It depends.
- Not for fixed effects in finite-size cases (see Pinheiro and Bates (2000)): may depend on ‘denominator degrees of freedom’ (number of groups) and/or total number of samples - total number of parameters
- Conditional F-tests are preferred for LMMs, **if** denominator degrees of freedom are known

Why doesn’t `lme4` display denominator degrees of freedom/p values? What other options do I have?

There is an R FAQ entry on this topic, which links to a mailing list post by Doug Bates (there is also a voluminous mailing list thread reproduced on the R wiki). The bottom line is

- For special cases that correspond to classical experimental designs (i.e. balanced designs that are nested, split-plot, randomized block, etc.) … we can show that the null distributions of particular ratios of sums of squares follow an \(F\) distribution with known numerator and denominator degrees of freedom (and hence the sampling distributions of particular contrasts are t-distributed with known df). In more complicated situations (unbalanced, GLMMs, crossed random effects, models with temporal or spatial correlation, etc.) it is not in general clear that the null distribution of the computed ratio of sums of squares is really an F distribution, for *any* choice of denominator degrees of freedom.
- For each simple degrees-of-freedom recipe that has been suggested (trace of the hat matrix, etc.) there seems to be at least one fairly simple counterexample where the recipe fails badly (e.g. see this r-help thread from September 2006).
- When the responses are normally distributed and the design is balanced, nested etc. (i.e. the classical LMM situation), the scaled deviances and differences in deviances are exactly \(F\)-distributed and looking at the experimental design (i.e., which treatments vary/are replicated at which levels) tells us what the relevant degrees of freedom are (see “df alternatives” below)
- Two approaches to approximating df (Satterthwaite and Kenward-Roger) have been implemented in R, Satterthwaite in `lmerTest` and Kenward-Roger in `pbkrtest` (as `KRmodcomp`) (various packages such as `lmerTest`, `emmeans`, `car`, etc., import `pbkrtest::get_Lb_ddf`). K-R is probably the most reliable option (Schaalje, McBride, and Fellingham 2002), although it may be prohibitively computationally expensive for large data sets.

K-R was derived for LMMs (and for REML?) in particular; it isn’t clear how it would apply to GLMMs. Walter W. Stroup (2014) states (referencing W. W. Stroup (2013)) that K-R actually works reasonably well for GLMMs (K-R is not implemented in R for GLMMs; Stroup suggests that a pseudo-likelihood (Wolfinger and O’Connell 1993) approach is necessary in order to implement K-R for GLMMs):

Notice the non-integer values of the denominator df. They, and the \(F\) and \(p\) values, reflect the procedure developed by Kenward and Roger (2009) to account for the effect of the covariance structure on degrees of freedom and standard errors. Although the Kenward–Roger adjustment was derived for the LMM with normally distributed data and is an ad hoc procedure for GLMMs with non-normal data, informal simulation studies consistently have suggested that the adjustment is accurate. The Kenward-Roger adjustment requires that the SAS GLIMMIX default computing algorithm, pseudo-likelihood, be used rather than the Laplace algorithm used to obtain AICC statistics. Stroup (2013b) found that for binomial and Poisson GLMMs, pseudo-likelihood with the Kenward–Roger adjustment yields better Type I error control than Laplace while preserving the GLMM’s advantage with respect to power and accuracy in estimating treatment means.

- There are several different issues at play in finite-size (small-sample) adjustments, which apply slightly differently to LMMs and GLMMs.
  - When the data don’t fit into the classical framework (crossed, unbalanced, R-side effects), we might still guess that the deviances etc. are approximately F-distributed but that we don’t know the real degrees of freedom – this is what the Satterthwaite, Kenward-Roger, Fai-Cornelius, etc. approximations are supposed to do.
  - When the responses are not normally distributed (as in GLMs and GLMMs), and when the scale parameter is not estimated (as in standard Poisson- and binomial-response models), then the deviance differences are only asymptotically F- or chi-square-distributed (i.e. not for our real, finite-size samples). In standard GLM practice, we usually ignore this problem; there is some literature on finite-size corrections for GLMs under the rubrics of “Bartlett corrections” and “higher order asymptotics” (see McCullagh and Nelder (1989), Cordeiro, Paula, and Botter (1994), Cordeiro and Ferrari (1998) and the `cond` package (on CRAN) [which works with GLMs, not GLMMs]), but it’s rarely used. (The bias correction/Firth approach implemented in the `brglm` package attempts to address the problem of finite-size bias, not finite-size non-chi-squaredness of the deviance differences.)
  - When the scale parameter in a GLM is estimated rather than fixed (as in Gamma or quasi-likelihood models), it is sometimes recommended to use an \(F\) test to account for the uncertainty of the scale parameter (e.g. Venables and Ripley (2002) recommend `anova(...,test="F")` for quasi-likelihood models)
  - Combining these issues, one has to look pretty hard for information on small-sample or finite-size corrections for GLMMs: Feng, Braun, and McCulloch (2004) and Bell and Grunwald (2010) look like good starting points, but it’s not at all trivial.

- use `MASS::glmmPQL` (uses old `nlme` rules approximately equivalent to SAS ‘inner-outer’/‘within-between’ rules) for GLMMs, or `(n)lme` for LMMs
- Guess the denominator df from standard rules (for standard designs, e.g. see Gotelli and Ellison (2004)) and apply them to \(t\) or \(F\) tests
- Run the model in `lme` (if possible) and use the denominator df reported there (which follow a simple ‘inner-outer’ rule which should correspond to the canonical answer for simple/orthogonal designs), applied to \(t\) or \(F\) tests. For the explicit specification of the rules that `lme` uses, see page 91 of Pinheiro and Bates (*this page was previously available on Google Books, but the link is no longer useful, so here are the relevant paragraphs*):

These conditional tests for fixed-effects terms require denominator degrees of freedom. In the case of the conditional \(F\)-tests, the numerator degrees of freedom are also required, being determined by the term itself. The denominator degrees of freedom are determined by the grouping level at which the term is estimated. A term is called inner relative to a factor if its value can change within a given level of the grouping factor. A term is outer to a grouping factor if its value does not change within levels of the grouping factor. A term is said to be estimated at level \(i\), if it is inner to the \(i-1\)st grouping factor and outer to the \(i\)th grouping factor. For example, the term `Machine` in the `fm2Machine` model is outer to `Machine %in% Worker` and inner to `Worker`, so it is estimated at level 2 (`Machine %in% Worker`). If a term is inner to all \(Q\) grouping factors in a model, it is estimated at the level of the within-group errors, which we denote as the \(Q+1\)st level.

The intercept, which is the parameter corresponding to the column of all 1’s in the model matrices \(X_i\), is treated differently from all the other parameters, when it is present. As a parameter it is regarded as being estimated at level 0 because it is outer to all the grouping factors. However, its denominator degrees of freedom are calculated as if it were estimated at level \(Q+1\). This is because the intercept is the one parameter that pools information from all the observations at a level even when the corresponding column in \(X_i\) doesn’t change with the level.

Letting \(m_i\) denote the total number of groups in level \(i\) (with the convention that \(m_0=1\) when the fixed effects model includes an intercept and 0 otherwise, and \(m_{Q+1}=N\)) and \(p_i\) denote the sum of the degrees of freedom corresponding to the terms estimated at level \(i\), the \(i\)th level denominator degrees of freedom is defined as

\[ \mathrm{denDF}_i = m_i - (m_{i-1} + p_i), i = 1, \dots, Q \]

This definition coincides with the classical decomposition of degrees of freedom in balanced, multilevel ANOVA designs and gives a reasonable approximation for more general mixed-effects models.
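As a worked instance of the formula, consider the `Orthodont` data used in the examples that follow (108 observations on 27 subjects, so \(Q=1\)) with a fixed effect of `age`, which varies within subjects. This sketch assumes that the same expression is also applied at the within-group level \(Q+1\) with \(m_{Q+1}=N\); that assumption is what reproduces the value of 80 that `lme` reports for this model.

```r
# Orthodont: N = 108 observations on 27 subjects, one grouping level (Q = 1)
m <- c(1, 27, 108)   # m_0 (intercept present), m_1 (subjects), m_{Q+1} = N
p <- c(0, 1)         # df of terms estimated at levels 1 and 2 (age is at level 2)

i <- seq_along(p)
denDF <- m[i + 1] - (m[i] + p[i])   # denDF_i = m_i - (m_{i-1} + p_i)
names(denDF) <- paste0("level_", i)
denDF
```

A between-subject covariate such as `Sex` would instead be estimated at level 1, with correspondingly fewer denominator degrees of freedom.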

Note that the implementation used in `lme` **gets the wrong answer for random-slopes models**:

```
library(nlme)
lmeDF <- function(formula=distance~age,random=~1|Subject) {
mod <- lme(formula,random,data=Orthodont)
aa <- anova(mod)
return(setNames(aa[,"denDF"],rownames(aa)))
}
lmeDF()
```

```
## (Intercept) age
## 80 80
```

`lmeDF(random=~age|Subject) ## wrong!`

```
## (Intercept) age
## 80 80
```

I (BB) have re-implemented this algorithm in a way that does slightly better for random-slopes models (but may still get confused!), see here.

```
source("R/calcDenDF.R")
calcDenDF(~age,"Subject",nlme::Orthodont)
```

```
## (Intercept) age
## 80 80
```

`calcDenDF(~age,data=nlme::Orthodont,random=~1|Subject)`

```
## (Intercept) age
## 80 80
```

`calcDenDF(~age,data=nlme::Orthodont,random=~age|Subject) ## off by 1`

```
## (Intercept) age
## 81 25
```

- use SAS, Genstat (AS-REML), Stata?
- Assume infinite denominator df (i.e. \(Z\)/\(\chi^2\) test rather than \(t\)/\(F\)) if the number of groups is large (>45?). Various rules of thumb for how large is “approximately infinite” have been posed, including 42 (in Angrist and Pischke 2009, in homage to Douglas Adams).
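To get a feel for what “approximately infinite” means, one can compare critical values of the \(t\) distribution with the Normal value as the denominator df grows. This is only a sketch in base R; the particular df values are arbitrary choices for illustration.

```r
## Two-sided 95% critical values: t with increasing denominator df vs. Normal
df <- c(5, 10, 25, 42, 100, 1000)
crit <- data.frame(
  df     = df,
  t_crit = qt(0.975, df = df),
  z_crit = qnorm(0.975)
)

## relative excess of the t critical value over the Normal one
crit$rel_excess <- crit$t_crit / crit$z_crit - 1
print(crit, digits = 4)
```

By 42 df the \(t\) critical value exceeds the Normal one by only about 3%, which gives some sense of why rules of thumb in this range are popular.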

- the most common way to do this is to use a likelihood ratio test, i.e. fit the full and reduced models (the reduced model is the model with the focal variance(s) set to zero). For example:

```
library(lme4)
m2 <- lmer(Reaction~Days+(1|Subject)+(0+Days|Subject),sleepstudy,REML=FALSE)
m1 <- update(m2,.~Days+(1|Subject))
m0 <- lm(Reaction~Days,sleepstudy)
anova(m2,m1,m0) ## two sequential tests
```

```
## Data: sleepstudy
## Models:
## m0: Reaction ~ Days
## m1: Reaction ~ Days + (1 | Subject)
## m2: Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m0 3 1906.3 1915.9 -950.15 1900.3
## m1 4 1802.1 1814.8 -897.04 1794.1 106.214 1 < 2.2e-16 ***
## m2 5 1762.0 1778.0 -876.00 1752.0 42.075 1 8.782e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

With recent versions of `lme4`, goodness-of-fit (deviance) can be compared between `(g)lmer` and `(g)lm` models, although `anova()` must be called with the mixed (`(g)lmer`) model listed first. Keep in mind that LRT-based null hypothesis tests are conservative when the null value (such as \(\sigma^2=0\)) is on the boundary of the feasible space (Self and Liang 1987; Stram and Lee 1994; Goldman and Whelan 2000); in the simplest case (single random effect variance), the p-value is approximately twice as large as it should be (Pinheiro and Bates 2000).
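For a single variance tested against zero, a common approximation treats the null distribution of the LRT statistic as a 50:50 mixture of \(\chi^2_0\) (a point mass at zero) and \(\chi^2_1\), which amounts to halving the naive p-value. A minimal sketch, re-using the rounded `Chisq` value for the `m1` vs. `m2` comparison printed above (the variable names are ours):

```r
## LRT statistic for m2 vs. m1 (one variance parameter), copied from the anova table
lrt_stat <- 42.075

## naive p-value: chi-squared with 1 df, ignores the boundary
p_naive <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

## boundary-corrected p-value: 0.5 * P(chisq_0 >= x) + 0.5 * P(chisq_1 >= x);
## the chisq_0 term is zero for any x > 0
p_boundary <- 0.5 * p_naive

c(naive = p_naive, boundary = p_boundary)
```

This shortcut applies only to the simplest case. When several variance or covariance parameters are tested at once the mixture is more complicated, and a simulation-based test is probably the safer choice.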

- Consider *not* testing the significance of random effects. If the random effect is part of the experimental design, this procedure may be considered ‘sacrificial pseudoreplication’ (Hurlbert 1984). Using stepwise approaches to eliminate non-significant terms in order to squeeze more significance out of the remaining terms is dangerous in any case.
- consider using the `RLRsim` package, which has a fast implementation of simulation-based tests of null hypotheses about zero variances, for simple tests. (However, it only applies to `lmer` models, and is a bit tricky to use for more complex models.)

```
library(RLRsim)
## compare m0 and m1
exactLRT(m1,m0)
```

```
##
## simulated finite sample distribution of LRT. (p-value based on 10000
## simulated values)
##
## data:
## LRT = 106.21, p-value < 2.2e-16
```

```
## compare m1 and m2
mA <- update(m2,REML=TRUE)
m0B <- update(mA, . ~ . - (0 + Days|Subject))
m.slope <- update(mA, . ~ . - (1|Subject))
exactRLRT(m0=m0B,m=m.slope,mA=mA)
```

```
##
## simulated finite sample distribution of RLRT.
##
## (p-value based on 10000 simulated values)
##
## data:
## RLRT = 42.796, p-value < 2.2e-16
```

- Parametric bootstrap: fit the reduced model, then repeatedly simulate from it and compute the differences between the deviance of the reduced and the full model for each simulated data set. Compare this null distribution to the observed deviance difference. This procedure is implemented in the `pbkrtest` package (messages and warnings suppressed).

`(pb <- pbkrtest::PBmodcomp(m2,m1,seed=101))`

```
## Bootstrap test; time: 15.32 sec; samples: 1000; extremes: 0;
## Requested samples: 1000 Used samples: 501 Extremes: 0
## large : Reaction ~ Days + (1 | Subject) + (0 + Days | Subject)
## Reaction ~ Days + (1 | Subject)
## stat df p.value
## LRT 42.075 1 8.782e-11 ***
## PBtest 42.075 0.001992 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

- Paraphrasing Doug Bates: the sampling distribution of variance estimates is in general strongly asymmetric: the standard error may be a poor characterization of the uncertainty.
- `lme4` allows for computing likelihood profiles of variances and computing confidence intervals on their basis; these likelihood profile confidence intervals are subject to the usual caveats about the LRT with finite sample sizes.
- Using an MCMC-based approach (the simplest/most canned is probably to use the `MCMCglmm` package, although its model specifications are not identical to those of lme4) will provide posterior distributions of the variance parameters: quantiles or credible intervals (`HPDinterval()` in the `coda` package) will characterize the uncertainty.
- (don’t say we didn’t warn you …) `[n]lme` fits contain an element called `apVar` which contains the approximate variance-covariance matrix (derived from the Hessian, the matrix of (numerically approximated) second derivatives of the likelihood (REML?) at the maximum (restricted?) likelihood values): you can derive the standard errors from this list element via `sqrt(diag(lme.obj$apVar))`. For whatever it’s worth, though, these estimates might not match the estimates that SAS gives which are supposedly derived in the same way.
- it’s not a full solution, but there is some more information here. I have some delta-method computations there that are off by a factor of 2 for the residual standard deviation, as well as some computations based on reparameterizing the deviance function.
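As a sketch of the `apVar` route (the same warning applies): as far as we understand, the matrix is stored on the transformed scale that `lme` uses internally (log standard deviations in this simple model), so the standard errors are on that scale too. The object names `fm_ri` and `se_trans` are ours; `intervals()` reports approximate confidence intervals for the variance components.

```r
library(nlme)

## random-intercept model for the Orthodont data
fm_ri <- lme(distance ~ age, random = ~ 1 | Subject, data = Orthodont)

## approximate standard errors of the variance parameters,
## on the (transformed) scale on which apVar is stored
se_trans <- sqrt(diag(fm_ri$apVar))
se_trans

## approximate confidence intervals for the variance components
intervals(fm_ri, which = "var-cov")
```

If the Hessian is not positive definite, `apVar` holds an error message instead of a matrix, so it is worth checking `is.matrix(fm_ri$apVar)` before using it.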

Abandoning the approximate \(F\)/\(t\)-statistic route, one ends up with the more general problem of estimating \(p\)-values. There is a wider range of options here, although many of them are computationally intensive …

- pseudo-Bayesian: post-hoc sampling, typically (1) assuming flat priors and (2) starting from the MLE, possibly using the approximate variance-covariance estimate to choose a candidate distribution
    - via `mcmcsamp` (if available for your problem: i.e. LMMs with simple random effects – not GLMMs or complex random effects)
    - via `pvals.fnc` in the `languageR` package, a wrapper for `mcmcsamp`
    - in AD Model Builder, possibly via the `glmmADMB` package (use the `mcmc=TRUE` option) or the `R2admb` package (write your own model definition in AD Model Builder), or outside of R
    - via the `sim` function from the `arm` package (simulates the posterior only for the beta (fixed-effect) coefficients; not yet working with development lme4; would like a better formal description of the algorithm …?)
- fully Bayesian approaches
    - via the `MCMCglmm` package
    - `glmmBUGS` (a WinBUGS wrapper/R interface)
    - JAGS/WinBUGS/OpenBUGS etc., via the `rjags`/`r2jags`/`R2WinBUGS`/`BRugs` packages

`mcmcsamp` is a function for lme4 that is supposed to sample from the posterior distribution of the parameters, based on flat/improper priors for the parameters [ed: I believe, but am not sure, that these priors are flat **on the scale of the theta (Cholesky-factor) parameters**]. At present, in the CRAN version (lme4 0.999999-0) and the R-forge “stable” version (lme4.0 0.999999-1), this covers only linear mixed models with uncorrelated random effects.

As has been discussed in a variety of places (e.g. on r-sig-mixed-models, and on the r-forge bug tracker), it is challenging to come up with a sampler that accounts properly for the possibility that the posterior distributions for some of the variance components may be mixtures of point masses at zero and continuous distributions. Naive samplers are likely to get stuck at or near zero. Doug Bates has always been a bit unsure that `mcmcsamp` is really performing as intended, even in the limited cases it now handles.

Given this uncertainty about how even the basic version works, the `lme4` developers have been reluctant to make the effort to extend it to GLMMs or more complex LMMs, or to implement it for the development version of lme4 … so unless something miraculous happens, it will not be implemented for the new version of `lme4`. As always, users are encouraged to write and share their own code that implements these capabilities …

The idea here is that in order to do inference on the effect of (a) predictor(s), you (1) fit the reduced model (without the predictors) to the data; (2) many times, (2a) simulate data from the reduced model; (2b) fit both the reduced and the full model to the simulated (null) data; (2c) compute some statistic(s) [e.g. t-statistic of the focal parameter, or the log-likelihood or deviance difference between the models]; (3) compare the observed values of the statistic from fitting your full model to the data to the null distribution generated in step 2.

- `PBmodcomp` in the `pbkrtest` package
- see the example in `help("simulate-mer")` in the `lme4` package to roll your own, using a combination of `simulate()` and `refit()`.
- `bootMer` in `lme4` version >1.0.0
- a presentation at UseR! 2009 (abstract, slides) went into detail about a proposed `bootMer` package and suggested it could work for GLMMs too – but it does not seem to be active.
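The recipe can be rolled by hand with `simulate()` and `refit()`. The sketch below tests the fixed effect of `Days` in the `sleepstudy` data; the model names, the number of simulations (`nsim = 100`, far too few for real use) and the seed are arbitrary choices made here for illustration.

```r
library(lme4)

## (1) full and reduced models, fitted by ML so that likelihoods are comparable
m_full <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy, REML = FALSE)
m_red  <- lmer(Reaction ~ 1 + (1 | Subject), sleepstudy, REML = FALSE)

## observed statistic: twice the log-likelihood difference
lrt_obs <- as.numeric(2 * (logLik(m_full) - logLik(m_red)))

## (2a) simulate responses from the reduced (null) model
nsim <- 100
ysim <- simulate(m_red, nsim = nsim, seed = 101)

## (2b, 2c) refit both models to each simulated response, compute the statistic
lrt_null <- vapply(ysim, function(y) {
  as.numeric(2 * (logLik(refit(m_full, newresp = y)) -
                  logLik(refit(m_red,  newresp = y))))
}, numeric(1))

## (3) p-value: share of null statistics at least as large as the observed one
p_boot <- (1 + sum(lrt_null >= lrt_obs)) / (nsim + 1)
p_boot
```

Adding 1 to the numerator and denominator counts the observed statistic as one draw from the null, so the estimated p-value is never exactly zero. With `nsim` simulations the smallest attainable value is `1/(nsim + 1)`.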

Note that none of the following approaches takes the uncertainty of the random effects parameters into account … if you want to take RE parameter uncertainty into account, a Bayesian approach is probably the easiest way to do it.

The general recipe for computing predictions from a linear or generalized linear model is to

- figure out the model matrix \(X\) corresponding to the new data;
- matrix-multiply \(X\) by the parameter vector \(\beta\) to get the predictions (or linear predictor in the case of GLM(M)s);
- extract the variance-covariance matrix of the parameters \(V\);
- compute \(X V X^{\prime}\) to get the variance-covariance matrix of the predictions;
- extract the diagonal of this matrix to get variances of predictions;
- if computing prediction rather than confidence intervals, add the residual variance;
- take the square-root of the variances to get the standard deviations (errors) of the predictions;
- compute confidence intervals based on a Normal approximation;
- for GL(M)Ms, run the confidence interval boundaries (not the standard errors) through the inverse-link function.
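For an ordinary GLM every step of this recipe is available in base R, which makes for a compact sketch. The example below uses the built-in `warpbreaks` data and a Poisson model chosen purely for illustration; the residual-variance step is skipped because only confidence intervals are computed.

```r
## Poisson GLM on a built-in data set
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

## new data: all combinations of the two factors
newdat <- expand.grid(wool = levels(warpbreaks$wool),
                      tension = levels(warpbreaks$tension))

X   <- model.matrix(~ wool + tension, data = newdat)  # model matrix for new data
eta <- drop(X %*% coef(fit))                          # linear predictor
V   <- vcov(fit)                                      # vcov of the parameters
se  <- sqrt(diag(X %*% V %*% t(X)))                   # SEs on the link scale

## Normal-approximation 95% CI on the link scale, then back-transform the bounds
z <- qnorm(0.975)
linkinv <- family(fit)$linkinv
pred <- data.frame(newdat,
                   fit = linkinv(eta),
                   lwr = linkinv(eta - z * se),
                   upr = linkinv(eta + z * se))
pred
```

The link-scale standard errors can be checked against `predict(fit, newdata = newdat, se.fit = TRUE)$se.fit`. For a mixed model the same steps apply, with the fixed-effect coefficients and their variance-covariance matrix in place of `coef(fit)` and `vcov(fit)`.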

```
library(nlme)
fm1 <- lme(distance ~ age*Sex, random = ~ 1 + age | Subject,
data = Orthodont)
plot(Orthodont,asp="fill") ## plot responses by individual
```