Goals
- power analysis (can be done with packages such as simr, etc.)
- testing methods
- constructing reproducible examples
- testing effects of model misspecification
- benchmarking
Simulating covariates
- useful functions: expand.grid, rep, rnorm (and other random-deviate generators), sample, gl
- can use covariate structure from an existing data set
- see also lme4::mkDataTemplate
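For example, a minimal sketch combining these functions to build a covariate frame (the object and variable names here are illustrative, not from the notes):
set.seed(101)
dd_cov <- expand.grid(f = factor(1:10), rep = 1:10)   # 10 groups x 10 replicates
dd_cov$x <- rnorm(nrow(dd_cov))                       # continuous covariate
dd_cov$trt <- sample(c("control", "treat"), nrow(dd_cov), replace = TRUE)
head(dd_cov)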
by hand
- straightforward to simulate simple cases by hand, e.g. the following (a quick refitting check appears after this list):
set.seed(101)
## 100 observations: continuous covariate x, grouping factor f with 10 levels
dd <- data.frame(x = rnorm(100), f = factor(rep(1:10, 10)))
X <- model.matrix(~x, data = dd)   # fixed-effect design matrix
beta <- c(1, 2)                    # intercept and slope
b <- rnorm(10)                     # random intercepts for the 10 groups (SD = 1)
mu <- X %*% beta + b[dd$f]         # linear predictor
dd$y <- rnorm(nrow(dd), mean = mu, sd = 1)   # Gaussian response, residual SD = 1
- transparent
- gets harder as the models get complicated (vector-valued REs,
correlation structures, etc.)
- using package functions to simulate minimizes mismatches between the simulation and the model that will actually be fitted
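As a quick sanity check on the by-hand example, refitting the simulated data (assuming lme4 is installed) should recover fixed effects near c(1, 2), a random-intercept SD near 1, and a residual SD near 1:
library(lme4)
fit <- lmer(y ~ x + (1 | f), data = dd)
fixef(fit)      # should be close to c(1, 2)
VarCorr(fit)    # RE SD and residual SD should both be near 1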
simulate functions
- many modeling tools have a simulate function that simulates new responses from a fitted model object
- some (lme4, at least; ?which others?) have newdata or newparams arguments (see the sketch after this list)
- de novo simulation: simulate in lme4, simulate_new in glmmTMB
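A minimal sketch of simulating from a fitted lme4 model, reusing the fit object from the refitting check above (lme4 documents both the newdata and newparams arguments; the parameter values here are illustrative):
## simulate new responses from the fitted model itself
s1 <- simulate(fit, nsim = 5)
## override (some of) the parameters: here, new fixed effects
s2 <- simulate(fit, nsim = 5, newparams = list(beta = c(0, 3)))
## simulate responses for a new covariate frame (levels of f present in the fit)
dd_new <- data.frame(x = runif(20), f = factor(rep(1:5, 4)))
s3 <- simulate(fit, nsim = 5, newdata = dd_new)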
de novo specification
- formula: specified as in the package (one-sided, i.e. no response variable)
- newdata: covariate/grouping-variable information
- family: as in the package
- newparams: a list of parameter vectors
parameters for lme4::simulate
- beta: fixed-effect parameters
- theta: random-effect covariance parameters (scaled Cholesky factor; for a scalar RE this is just the RE SD divided by the residual SD). Note that lme4 reorders the random effects in decreasing order of the number of clusters …
- sigma: residual SD (scale/shape parameter for some families)
- see lme4::mkParsTemplate
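Putting the pieces together, a de novo simulation sketch following the pattern documented in ?lme4::simulate.merMod (parameter values are illustrative):
library(lme4)
set.seed(101)
dd_sim <- expand.grid(f = factor(1:10), rep = 1:10)
dd_sim$x <- rnorm(nrow(dd_sim))
y_sim <- simulate(~ x + (1 | f),
                  newdata = dd_sim,
                  family = gaussian,
                  newparams = list(beta = c(1, 2),   # intercept, slope
                                   theta = 1,        # RE SD / residual SD
                                   sigma = 1))       # residual SD
dd_sim$y <- y_sim[[1]]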
parameters for glmmTMB::simulate_new
- beta: fixed-effect parameters (conditional model)
- betazi, betadisp: fixed-effect parameters for the zero-inflation (logit scale) and dispersion (log scale) models
- theta: for each RE term, a vector of log-SDs followed by correlation parameters (see the covariance structure vignette for more details)
- thetazi: ditto, for the zero-inflation model
- psi: shape parameters for Tweedie, Student-t, etc.
- show_pars = TRUE will show the structure of the parameter vector (at least the length of each component)
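A corresponding sketch for glmmTMB (assumes a glmmTMB version recent enough to provide simulate_new; check the parameter structure with show_pars = TRUE before trusting the component names and lengths used here; parameter values are illustrative):
library(glmmTMB)
set.seed(101)
dd_sim <- expand.grid(f = factor(1:10), rep = 1:10)
dd_sim$x <- rnorm(nrow(dd_sim))
## inspect the required parameter structure first
simulate_new(~ x + (1 | f), newdata = dd_sim, family = poisson,
             show_pars = TRUE)
## then simulate: Poisson, so only beta and theta are needed
y_sim <- simulate_new(~ x + (1 | f),
                      newdata = dd_sim,
                      family = poisson,
                      newparams = list(beta = c(0, 0.5),   # log scale
                                       theta = log(0.5)))  # log-SD of intercept RE
dd_sim$y <- y_sim[[1]]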
parameterization of “unstructured” covariance matrices
- positive semi-definiteness (all eigenvalues of a symmetric matrix
\(\ge 0\)) is an awkward
constraint
- lots of different options
- most use Cholesky factorization: \(\Sigma = \Lambda \Lambda^\top\). For
uniqueness, constrain the diagonal elements \(\Lambda_{ii} \geq 0\).
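For example (base R; the matrix is illustrative), chol() returns the upper-triangular factor, so \(\Lambda\) is its transpose:
Sigma <- matrix(c(4.0, 1.2, 0.6,
                  1.2, 1.0, 0.3,
                  0.6, 0.3, 9.0), 3, 3)
Lambda <- t(chol(Sigma))                 # lower-triangular Cholesky factor
all.equal(Lambda %*% t(Lambda), Sigma)   # TRUE: Sigma = Lambda Lambda^T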
lme4 parameterization
- lme4 uses a scaled Cholesky factorization, i.e. \(\Sigma = \sigma^2_{\textrm{resid}} \Lambda \Lambda^\top\)
- singular iff any \(\Lambda_{ii} =
0\). In practice we report singularity if \(\Lambda_{ii} < 10^{-4}\)
- use “box-constrained” nonlinear optimization to set lower bounds on
diagonal elements of \(\Lambda\)
- translation isn’t so nice, e.g. for a 3×3 matrix we get \[
\Lambda \Lambda^\top =
\left(
\begin{array}{ccc}
\theta_1 & 0 & 0 \\
\theta_2 & \theta_4 & 0 \\
\theta_3 & \theta_5 & \theta_6
\end{array}
\right)
\left(
\begin{array}{ccc}
\theta_1 & \theta_2 & \theta_3 \\
0 & \theta_4 & \theta_5 \\
0 & 0 & \theta_6
\end{array}
\right)
=
\left(
\begin{array}{ccc}
\theta_1^2 & \theta_1 \theta_2 & \theta_1 \theta_3 \\
\theta_1 \theta_2 & \theta_2^2 + \theta_4^2 & \theta_2
\theta_3 + \theta_4 \theta_5 \\
\theta_1 \theta_3 & \theta_2 \theta_3 + \theta_4 \theta_5 &
\theta_3^2 + \theta_5^2 + \theta_6^2
\end{array}
\right)
\] so e.g. constraining \(\sigma^2_3 =
\theta_3^2 + \theta_5^2 + \theta_6^2\) to a particular value is a
nuisance.
- little-used/poorly documented conversion functions at
?lme4::vcconv
- e.g.
## matrix with SDs on the diagonal and correlations off the diagonal
corrsd <- matrix(c(2.0, 0.4, 0.1,
                   0.4, 1.0, 0.3,
                   0.1, 0.3, 3.0),
                 3, 3)
## extract the lower triangle (including diagonal) in column-major order
get_lower_tri <- function(x) x[lower.tri(x, diag = TRUE)]
## SD/correlation matrix -> covariance -> lower Cholesky factor -> theta vector
## (residual SD taken as 1, so the scaled and unscaled factors coincide)
corrsd |> lme4:::sdcor2cov() |> chol() |> t() |> get_lower_tri()
## [1] 2.0000000 0.4000000 0.3000000 0.9165151 0.8510498 2.8610687
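A quick check (illustrative) that rebuilding \(\Lambda\) from this theta vector recovers the covariance matrix:
theta <- c(2.0000000, 0.4000000, 0.3000000, 0.9165151, 0.8510498, 2.8610687)
Lambda <- matrix(0, 3, 3)
Lambda[lower.tri(Lambda, diag = TRUE)] <- theta   # fill lower triangle column-wise
Lambda %*% t(Lambda)   # approximately equal to lme4:::sdcor2cov(corrsd)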
glmmTMB parameterization
- first separates the covariance matrix into variances and correlations: \(\Sigma = \textrm{Diag}(S) \, C \, \textrm{Diag}(S)\)
- \(S\) is the vector of standard
deviations (parameterized as log SDs to ensure positivity)
- \(C\) is the correlation matrix, parameterized as \[
D^{-1/2} L L^\top D^{-1/2}
\] where \(L\) is lower triangular with unit diagonal (the free parameters are its strictly lower-triangular entries) and \(D = \textrm{diag}(L L^\top)\)
- this is just to make sure we get a positive-definite matrix with 1s on the diagonal …
- For a single correlation parameter \(\theta_0\), this works out to \(\rho = \theta_0/\sqrt{1+\theta_0^2}\) (a small numerical check is shown after this list)
- Advantages: easier to constrain/put priors on SDs
separately. Correlation matrix not much harder to deal with.
- Disadvantages: log-SD parameterization is more
awkward for singular fits
- for structured covariance matrices, see covariance
structure vignette
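A small numerical check of that mapping, plus the theta vector for a hypothetical 2x2 unstructured term with SDs (1, 2) and correlation 0.7 (values are illustrative):
rho_from_theta <- function(theta0) theta0 / sqrt(1 + theta0^2)
theta_from_rho <- function(rho) rho / sqrt(1 - rho^2)   # inverse mapping
rho_from_theta(theta_from_rho(0.7))   # round-trip: returns 0.7
## log-SDs first, then the correlation parameter
theta <- c(log(1), log(2), theta_from_rho(0.7))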
GLMM family parameterizations
- for conditional distributions with dispersion/shape parameters, need
to know the parameterization being used
- e.g. negative binomial, Tweedie, etc. (typically log-shape or
log-dispersion)
- zero-inflation parameters on logit scale
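For example (base R; the values are illustrative), converting link-scale parameters back to the natural scale:
exp(0.5)       # log-dispersion 0.5 -> dispersion ~ 1.65
plogis(-1)     # logit-scale zero-inflation parameter -1 -> probability ~ 0.27
qlogis(0.25)   # desired zero-inflation probability 0.25 -> logit scale ~ -1.10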