Goals
- power analysis (can be done with packages such as simr, etc.)
- testing methods
- constructing reproducible examples
- testing effects of model misspecification
- benchmarking
Simulating covariates
- useful functions: expand.grid, rep, rnorm (and other random-deviate generators), sample, gl
- can use covariate structure from an existing data set
- see also lme4::mkDataTemplate
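For example, a minimal sketch combining these functions to build a covariate frame (the object and variable names here are illustrative, not from the notes):
set.seed(101)
dd_cov <- expand.grid(f = factor(1:10), rep = 1:10)   # 10 groups x 10 replicates
dd_cov$x <- rnorm(nrow(dd_cov))                       # continuous covariate
dd_cov$trt <- sample(c("control", "treat"), nrow(dd_cov), replace = TRUE)
head(dd_cov)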
by hand
- straightforward to simulate simple cases by hand, e.g. the following (a quick refitting check appears after this list):
set.seed(101)
## 100 observations: continuous covariate x, grouping factor f with 10 levels
dd <- data.frame(x = rnorm(100), f = factor(rep(1:10, 10)))
X <- model.matrix(~x, data = dd)   # fixed-effect design matrix
beta <- c(1, 2)                    # intercept and slope
b <- rnorm(10)                     # random intercepts for the 10 groups (SD = 1)
mu <- X %*% beta + b[dd$f]         # linear predictor
dd$y <- rnorm(nrow(dd), mean = mu, sd = 1)   # Gaussian response, residual SD = 1
- transparent
- gets harder as the models get complicated (vector-valued REs,
correlation structures, etc.)
- using package functions to simulate minimizes mismatches between the simulation and the model that will actually be fitted
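As a quick sanity check on the by-hand example, refitting the simulated data (assuming lme4 is installed) should recover fixed effects near c(1, 2), a random-intercept SD near 1, and a residual SD near 1:
library(lme4)
fit <- lmer(y ~ x + (1 | f), data = dd)
fixef(fit)      # should be close to c(1, 2)
VarCorr(fit)    # RE SD and residual SD should both be near 1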
simulate functions
- many modeling tools have a simulate function that simulates new responses from a fitted model object
- some (lme4, at least; ?which others?) have newdata or newparams arguments (see the sketch after this list)
- de novo simulation: simulate in lme4, simulate_new in glmmTMB
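A minimal sketch of simulating from a fitted lme4 model, reusing the fit object from the refitting check above (lme4 documents both the newdata and newparams arguments; the parameter values here are illustrative):
## simulate new responses from the fitted model itself
s1 <- simulate(fit, nsim = 5)
## override (some of) the parameters: here, new fixed effects
s2 <- simulate(fit, nsim = 5, newparams = list(beta = c(0, 3)))
## simulate responses for a new covariate frame (levels of f present in the fit)
dd_new <- data.frame(x = runif(20), f = factor(rep(1:5, 4)))
s3 <- simulate(fit, nsim = 5, newdata = dd_new)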
de novo specification
- formula: specified as in the package (one-sided, i.e. no response variable)
- newdata: covariate/grouping-variable information
- family: as in the package
- newparams: a list of parameter vectors
parameters for lme4::simulate
- beta: fixed-effect parameters
- theta: random-effect covariance parameters (scaled Cholesky factor; for a scalar RE this is just the RE SD divided by the residual SD). Note that lme4 reorders the random effects in decreasing order of the number of clusters …
- sigma: residual SD (scale/shape parameter for some families)
- see lme4::mkParsTemplate
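Putting the pieces together, a de novo simulation sketch following the pattern documented in ?lme4::simulate.merMod (parameter values are illustrative):
library(lme4)
set.seed(101)
dd_sim <- expand.grid(f = factor(1:10), rep = 1:10)
dd_sim$x <- rnorm(nrow(dd_sim))
y_sim <- simulate(~ x + (1 | f),
                  newdata = dd_sim,
                  family = gaussian,
                  newparams = list(beta = c(1, 2),   # intercept, slope
                                   theta = 1,        # RE SD / residual SD
                                   sigma = 1))       # residual SD
dd_sim$y <- y_sim[[1]]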
parameters for glmmTMB::simulate_new
- beta: fixed-effect parameters (conditional model)
- betazi, betadisp: fixed-effect parameters for the zero-inflation (logit scale) and dispersion (log scale) models
- theta: for each RE term, a vector of log-SDs followed by correlation parameters (see the covariance structure vignette for more details)
- thetazi: ditto, for the zero-inflation model
- psi: shape parameters for Tweedie, Student-t, etc.
- show_pars = TRUE will show the structure of the parameter vector (at least the length of each component)
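A corresponding sketch for glmmTMB (assumes a glmmTMB version recent enough to provide simulate_new; check the parameter structure with show_pars = TRUE before trusting the component names and lengths used here; parameter values are illustrative):
library(glmmTMB)
set.seed(101)
dd_sim <- expand.grid(f = factor(1:10), rep = 1:10)
dd_sim$x <- rnorm(nrow(dd_sim))
## inspect the required parameter structure first
simulate_new(~ x + (1 | f), newdata = dd_sim, family = poisson,
             show_pars = TRUE)
## then simulate: Poisson, so only beta and theta are needed
y_sim <- simulate_new(~ x + (1 | f),
                      newdata = dd_sim,
                      family = poisson,
                      newparams = list(beta = c(0, 0.5),   # log scale
                                       theta = log(0.5)))  # log-SD of intercept RE
dd_sim$y <- y_sim[[1]]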
parameterization of “unstructured” covariance matrices
- positive semi-definiteness (all eigenvalues of a symmetric matrix
\(\ge 0\)) is an awkward
constraint
- lots of different options
- most use Cholesky factorization: \(\Sigma = \Lambda \Lambda^\top\). For
uniqueness, constrain the diagonal elements \(\Lambda_{ii} \geq 0\).
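For example (base R; the matrix is illustrative), chol() returns the upper-triangular factor, so \(\Lambda\) is its transpose:
Sigma <- matrix(c(4.0, 1.2, 0.6,
                  1.2, 1.0, 0.3,
                  0.6, 0.3, 9.0), 3, 3)
Lambda <- t(chol(Sigma))                 # lower-triangular Cholesky factor
all.equal(Lambda %*% t(Lambda), Sigma)   # TRUE: Sigma = Lambda Lambda^T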
lme4 parameterization
- lme4 uses a scaled Cholesky factorization, i.e. \(\Sigma = \sigma^2_{\textrm{resid}} \Lambda \Lambda^\top\)
- singular iff any \(\Lambda_{ii} =
0\). In practice we report singularity if \(\Lambda_{ii} < 10^{-4}\)
- use “box-constrained” nonlinear optimization to set lower bounds on
diagonal elements of \(\Lambda\)
- translation isn’t so nice, e.g. for a 3×3 matrix we get \[
\Lambda \Lambda^\top =
\left(
\begin{array}{ccc}
\theta_1 & 0 & 0 \\
\theta_2 & \theta_4 & 0 \\
\theta_3 & \theta_5 & \theta_6
\end{array}
\right)
\left(
\begin{array}{ccc}
\theta_1 & \theta_2 & \theta_3 \\
0 & \theta_4 & \theta_5 \\
0 & 0 & \theta_6
\end{array}
\right)
=
\left(
\begin{array}{ccc}
\theta_1^2 & \theta_1 \theta_2 & \theta_1 \theta_3 \\
\theta_1 \theta_2 & \theta_2^2 + \theta_4^2 & \theta_2
\theta_3 + \theta_4 \theta_5 \\
\theta_1 \theta_3 & \theta_2 \theta_3 + \theta_4 \theta_5 &
\theta_3^2 + \theta_5^2 + \theta_6^2
\end{array}
\right)
\] so e.g. constraining \(\sigma^2_3 =
\theta_3^2 + \theta_5^2 + \theta_6^2\) to a particular value is a
nuisance.
- little-used/poorly documented conversion functions at
?lme4::vcconv
- e.g.
## matrix with SDs on the diagonal and correlations off the diagonal
corrsd <- matrix(c(2.0, 0.4, 0.1,
                   0.4, 1.0, 0.3,
                   0.1, 0.3, 3.0),
                 3, 3)
## extract the lower triangle (including diagonal) in column-major order
get_lower_tri <- function(x) x[lower.tri(x, diag = TRUE)]
## SD/correlation matrix -> covariance -> lower Cholesky factor -> theta vector
## (residual SD taken as 1, so the scaled and unscaled factors coincide)
corrsd |> lme4:::sdcor2cov() |> chol() |> t() |> get_lower_tri()
## [1] 2.0000000 0.4000000 0.3000000 0.9165151 0.8510498 2.8610687
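A quick check (illustrative) that rebuilding \(\Lambda\) from this theta vector recovers the covariance matrix:
theta <- c(2.0000000, 0.4000000, 0.3000000, 0.9165151, 0.8510498, 2.8610687)
Lambda <- matrix(0, 3, 3)
Lambda[lower.tri(Lambda, diag = TRUE)] <- theta   # fill lower triangle column-wise
Lambda %*% t(Lambda)   # approximately equal to lme4:::sdcor2cov(corrsd)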
glmmTMB parameterization
- first separates the covariance matrix into variances and correlations: \(\Sigma = \textrm{Diag}(S) \, C \, \textrm{Diag}(S)\)
- \(S\) is the vector of standard
deviations (parameterized as log SDs to ensure positivity)
- \(C\) is the correlation matrix, parameterized as \[
D^{-1/2} L L^\top D^{-1/2}
\] where \(L\) is lower triangular with unit diagonal (the free parameters are its strictly lower-triangular entries) and \(D = \textrm{diag}(L L^\top)\)
- this is just to make sure we get a positive-definite matrix with 1s on the diagonal …
- For a single correlation parameter \(\theta_0\), this works out to \(\rho = \theta_0/\sqrt{1+\theta_0^2}\) (a small numerical check is shown after this list)
- Advantages: easier to constrain/put priors on SDs
separately. Correlation matrix not much harder to deal with.
- Disadvantages: log-SD parameterization is more
awkward for singular fits
- for structured covariance matrices, see covariance
structure vignette
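A small numerical check of that mapping, plus the theta vector for a hypothetical 2x2 unstructured term with SDs (1, 2) and correlation 0.7 (values are illustrative):
rho_from_theta <- function(theta0) theta0 / sqrt(1 + theta0^2)
theta_from_rho <- function(rho) rho / sqrt(1 - rho^2)   # inverse mapping
rho_from_theta(theta_from_rho(0.7))   # round-trip: returns 0.7
## log-SDs first, then the correlation parameter
theta <- c(log(1), log(2), theta_from_rho(0.7))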
GLMM family parameterizations
- for conditional distributions with dispersion/shape parameters, need
to know the parameterization being used
- e.g. negative binomial, Tweedie, etc. (typically log-shape or
log-dispersion)
- zero-inflation parameters on logit scale
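For example (base R; the values are illustrative), converting link-scale parameters back to the natural scale:
exp(0.5)       # log-dispersion 0.5 -> dispersion ~ 1.65
plogis(-1)     # logit-scale zero-inflation parameter -1 -> probability ~ 0.27
qlogis(0.25)   # desired zero-inflation probability 0.25 -> logit scale ~ -1.10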