14 May 2024

Definitions and scope

GLMMs: the biggest picture

  • what is the scope of this talk?
  • linear fixed effects (possibly with a link function)
  • linear random effects/latent variables (ditto)
  • known clusters with independent sets of latent variables
  • multivariate Normal (Gaussian) distribution of latent variables

in other words

\[ \begin{split} \underbrace{Y_i}_{\text{response}} & \sim \overbrace{\text{Distr}}^{\substack{\text{conditional} \\ \text{distribution}}}(\underbrace{g^{-1}(\eta_i)}_{\substack{\text{inverse} \\ \text{link} \\ \text{function}}},\underbrace{\phi}_{\substack{\text{scale} \\ \text{parameter}}}) \\ \underbrace{\boldsymbol \eta}_{\substack{\text{linear} \\ \text{predictor}}} & = \underbrace{\boldsymbol X\boldsymbol \beta}_{\substack{\text{fixed} \\ \text{effects}}} + \underbrace{\boldsymbol Z\boldsymbol b}_{\substack{\text{random} \\ \text{effects}}} \\ \underbrace{\boldsymbol b}_{\substack{\text{conditional} \\ \text{modes}}} & \sim \text{MVN}(\boldsymbol 0, \underbrace{\Sigma(\boldsymbol \theta)}_{\substack{\text{covariance} \\ \text{matrix}}}) \end{split} \]

… possibly also allow a zero-inflation/hurdle component and/or a dispersion (scale) model, e.g. \(\phi = \exp(\boldsymbol X_d \boldsymbol \beta_d \, [+ \boldsymbol Z_d \boldsymbol b_d?])\) (sketched in glmmTMB syntax below)
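
A minimal sketch of this structure in R, in glmmTMB syntax (the data frame dd and the variables y, x, g here are hypothetical):

    library(glmmTMB)
    fit <- glmmTMB(
        y ~ x + (1 | g),       ## eta = X beta + Z b (conditional model)
        ziformula   = ~ x,     ## optional zero-inflation component
        dispformula = ~ x,     ## dispersion model: phi = exp(X_d beta_d)
        family = nbinom2,      ## negative binomial conditional distribution
        data = dd
    )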

Past

What is a routine statistical analysis?

depends on:

  • data size
  • computation time/cost, memory requirements
  • availability of reliable software
  • complexity of the statistical model
  • ANOVA → linear models → GLMs → mixed models → ???

Commoditization

  • “goods … that are distinguishable in terms of attributes (uniqueness or brand) end up becoming simple commodities in the eyes of the market or consumers” (Wikipedia)
  • basic components become part of a stack/utilities for higher-level features
  • operating system → linear algebra packages (BLAS, LAPACK) → regression tools → generalized linear models → mixed models → … ?
  • build on top of well-tested frameworks

Democratization

  • only works if users are sufficiently (statistically) educated
  • ¿can we afford enough statistical consultants?
  • Efron: “recommending that scientists use [X] is like giving the neighborhood kids the key to your F-16” (Gelman 2008)
    (X = “Bayes’ Theorem” here but could be “mixed models”?)
  • make easy things easy and hard things possible
    (and warn about inadvisable things?)
  • importance of sensible defaults

Domain-specific languages

  • limited set of requirements
  • R (“a free software environment”): also SAS …
  • Wilkinson-Rogers-Bates formulas (Wilkinson et al. 1973)
    • initial target: designed experiments
    • GENOTYPE*POL(SITE, 1,SITEMEAN)*POLND(DENSITY)
    • GENSTAT, expanded in S/R (Chambers et al. 1992)
    • expanded further to include random effects (Pinheiro et al. 2000; Tanaka et al. 2019)
    • further expansions … (brms package etc.)
  • too compact (McElreath 2015)?
    who really knows what y ~ x1*x2 + (x1*x2|g) means? (unpacked below)
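
A minimal sketch of the translation, using lme4's lFormula() to build the model matrices without fitting (the names dd, x1, x2, g are hypothetical):

    library(lme4)
    set.seed(101)
    dd <- data.frame(y  = rnorm(120),
                     x1 = rnorm(120),
                     x2 = rnorm(120),
                     g  = factor(rep(1:10, each = 12)))
    lf <- lFormula(y ~ x1*x2 + (x1*x2 | g), data = dd)
    dim(lf$X)           ## fixed effects: intercept, x1, x2, x1:x2 (120 x 4)
    nrow(lf$reTrms$Zt)  ## 40 latent variables: the same 4 terms for each of
                        ## 10 groups, with an unstructured 4x4 covariance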

Present

Ever-expanding possibilities

Development and maintenance

Open source software is:

  • powerful
  • equitable
  • reproducible
  • transparent
  • cutting-edge
  • cheap!

Development and maintenance

BUT:

  • academics are paid to publish Cool Stuff
  • researchers are paid to Get Stuff Done
  • computational statisticians are rare
  • mixed models are a niche market:
    lme4 is approximately #92 (out of 21K packages) on CRAN

Development models

  • “the cathedral and the bazaar”
  • R (core): cathedral(ish)
  • R package ecosystem: bazaar-ish

Consolidation

Partial solution to the Babel of the R ecosystem: multi-use front and back ends for workflows (example after the list)

  • workflows: tidymodels/mixedmodels/easystats
  • diagnostics: DHARMa, performance::check_model()
  • coefficient information: broom.mixed, parameters
  • model information: broom.mixed
  • predictions: marginaleffects, emmeans
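
A minimal sketch of one of these front ends, using lme4's built-in sleepstudy data:

    library(lme4)
    library(broom.mixed)
    fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
    tidy(fit, effects = "fixed", conf.int = TRUE)  ## tidy coefficient table
    glance(fit)                                    ## one-row model summary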

Future

Machine learning/AI

machine/statistical learning

  • emphasis on algorithms and scalability
  • elevate prediction over inference (Breiman 2001)
  • (modern) nonparametric models
  • automatic differentiation

new software stacks

  • METIS/SuiteSparse/CHOLMOD
    (Davis, Karypis: used in lme4, TMB)
  • INLA meshes (Lindgren, Rue: INLA, sdmTMB)
  • spline basis construction (Wood/mgcv)
  • autodifferentiation

autodiff everywhere

  • magic algorithm: automatic gradients at less than 5\(\times\) the cost of the objective function itself (Griewank et al. 2003); illustrated after this list
  • CppAD, TMB (Kristensen), Stan (Carpenter/Betancourt), PyTorch, TensorFlow (greta), NIMBLE (de Valpine et al. 2017)
  • memory-intensive
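
A minimal sketch of the idea, via forward-mode dual numbers (TMB, Stan, etc. use reverse mode, which is what makes gradients of a scalar objective cheap, but the bookkeeping is similar in spirit):

    ## a dual number carries a value v and a derivative d; overloaded
    ## arithmetic propagates exact derivatives alongside the computation
    dual <- function(v, d = 0) structure(list(v = v, d = d), class = "dual")
    "+.dual" <- function(a, b) dual(a$v + b$v, a$d + b$d)          ## sum rule
    "*.dual" <- function(a, b) dual(a$v * b$v, a$d*b$v + a$v*b$d)  ## product rule
    exp.dual <- function(x) dual(exp(x$v), exp(x$v) * x$d)         ## chain rule
    ## (dual-dual operands only, for brevity)

    f <- function(x) exp(x * x)  ## f(x) = exp(x^2)
    y <- f(dual(1.5, d = 1))     ## seed dx/dx = 1
    y$v                          ## f(1.5)
    y$d                          ## exactly 2 * 1.5 * exp(1.5^2) = f'(1.5)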

prediction vs inference

  • any ‘parameter’ of a model can be read as an expected change in the response under a specified change in the predictors
    • partial dependence (marginal effects) vs. conditional dependence
  • users are almost always interested in counterfactuals (see the sketch after this list)
  • … and estimates of uncertainty (or they should be)
  • uncertainty in a counterfactual effect == inference ???
  • connections to causal inference
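
A minimal sketch of a counterfactual effect with uncertainty, using emmeans on the sleepstudy fit:

    library(lme4)
    library(emmeans)
    fit <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
    ## expected change in Reaction per unit change in Days, with a
    ## confidence interval: uncertainty about a counterfactual contrast
    emtrends(fit, ~ 1, var = "Days")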

smooths everywhere

  • Revisit this layer: \(\boldsymbol b\sim \text{MVN}(\boldsymbol 0, \Sigma(\boldsymbol \theta))\)
  • if \(\Sigma\) is proportional to the identity (\(\sigma^2 \boldsymbol I\)), this is equivalent to putting a ridge penalty (\(\boldsymbol b^\top \boldsymbol b = \sum b_i^2\)) on the conditional modes
  • we could use structured covariance matrices (compound symmetry, AR1, etc.)
  • … more generally, the MVN assumption penalizes \(\boldsymbol b^\top \Sigma^{-1} \boldsymbol b\), i.e. it is equivalent to smoothing with penalty (smoothing) matrix \(\Sigma^{-1}\)
  • Gaussian processes, splines, Markov random fields …
  • the mgcv package provides smooth constructors that can be reused anywhere (Wood 2017)
    • gamm4, glmmTMB, sdmTMB, brms (sketch below)
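
A minimal sketch of the reuse, via gamm4, which translates an mgcv smooth into an lme4 random effect (simulated data; the names dd, x, g are hypothetical):

    library(gamm4)
    set.seed(101)
    dd <- data.frame(x = runif(200), g = factor(rep(1:10, each = 20)))
    dd$y <- rpois(200, exp(sin(2 * pi * dd$x)))
    fit <- gamm4(y ~ s(x), random = ~ (1 | g), family = poisson, data = dd)
    summary(fit$gam)  ## the smooth, reported on the mgcv side
    summary(fit$mer)  ## the same model, as an lme4 fit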

Scalability

  • dimensions of a mixed model
    • number of observations (\(10^2\) - \(10^8\))
    • number of latent variables (\(10\) - \(10^4\))
    • number of clusters (\(10\) - \(10^4\))
    • number of top-level parameters (\(\beta\), \(\theta\)): \(10\) - \(1000\)
  • MLE scaling typically \({\cal O}(N^{3/2})\); method-of-moments \({\cal O}(N)\)
    (stochastic gradient descent???)
  • Recent work on scalable mixed models (Heiling et al. 2024a, 2024b; Gao et al. 2017, 2020; Ghosh et al. 2022a, 2022b; Ghandwani et al. 2023)

Algorithms

  • linear algebra (decompositions) (Bates et al. 2015)
  • EM algorithm
  • brute-force (RE)ML with autodiff (Brooks et al. 2017)
  • MCEM
  • ??

Sparsity

  • computational trick or statement about the world? (the sketch after this list shows the computational side)
  • combining mixed models with sparsity-inducing penalization (e.g. glmmLasso)
  • INLA meshes leverage conditional independence of spatial points to handle spatial problems efficiently (sdmTMB)
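
A minimal sketch of the computational trick, again with sleepstudy: the random-effects model matrix \(\boldsymbol Z\) is mostly zeros, and sparse storage is what keeps the linear algebra affordable.

    library(lme4)  ## loads Matrix for the sparse-matrix classes
    lf <- lFormula(Reaction ~ Days + (Days | Subject), sleepstudy)
    Z <- t(lf$reTrms$Zt)              ## 180 x 36 random-effects model matrix
    class(Z)                          ## "dgCMatrix": compressed sparse columns
    Matrix::nnzero(Z) / prod(dim(Z))  ## small fraction of entries are nonzero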

Dimension reduction

  • mgcv smooths use rank reduction techniques instead of relying on sparsity
  • factor-analytic (reduced-rank) models: glmmTMB, gllvm (sketch below)
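
A minimal sketch of a reduced-rank covariance in glmmTMB, with hypothetical multispecies count data (y counts of species sp at each site; d = 2 latent factors):

    library(glmmTMB)
    fit <- glmmTMB(y ~ sp + rr(sp + 0 | site, d = 2),
                   family = poisson, data = dd)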

Sliding toward Bayes

  • regularization/priors
    • Bayes: justified by belief
    • frequentism: justified by
      • decreased bias (e.g. Firth logistic regression)
      • decreased error (Stein’s paradox, ridge regression)
  • empirical Bayes: regularization/priors only on intermediate layers
  • maximum a posteriori (MAP) estimation: INLA, blme (sketch after this list)
  • for good confidence intervals, MCMC is faster than parametric bootstrapping (e.g. pharmacokinetic models)
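
A minimal sketch of the MAP approach, via blme: a weak Wishart prior on the random-effects covariance pulls estimates away from singular (zero-variance) fits.

    library(blme)  ## builds on lme4
    fit <- blmer(Reaction ~ Days + (Days | Subject), sleepstudy,
                 cov.prior = wishart)
    summary(fit)   ## lmer()-style output, plus a note about the prior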

Why aren’t we all Bayesian?

  • concern about subjectivity, ease of use (Efron 1986)
  • Bayes is still slower

Looking ahead

  • further integration of advanced/scalable techniques into mainstream packages
  • improved modularity of estimation engines
  • improved workflow support
  • community support?

References

Bates, D et al. 2015. Journal of Statistical Software 67 (1): 1–48. doi:10.18637/jss.v067.i01.

Breiman, L. 2001. Statistical Science 16 (3) (August): 199–215. http://www.jstor.org/stable/2676681.

Brooks, ME et al. 2017. R Journal 9: 378–400. https://journal.r-project.org/archive/2017/RJ-2017-066/RJ-2017-066.pdf.

Chambers, JM et al. 1992. In Statistical Models in S. Routledge.

Efron, B. 1986. The American Statistician 40 (1) (February): 1–5. doi:10.1080/00031305.1986.10475342. https://www.tandfonline.com/doi/abs/10.1080/00031305.1986.10475342.

Gao, K et al. 2017. Electronic Journal of Statistics 11 (1): 1235–1296. doi:10.1214/17-EJS1236. https://projecteuclid.org/euclid.ejs/1492135234.

Gao, K et al. 2020. Statistica Sinica 30 (4): 1741–1771. https://www.jstor.org/stable/26969394.

Gelman, A. 2008. Bayesian Analysis 3: 445–450. doi:10.1214/08-BA318. http://ba.stat.cmu.edu/journal/2008/vol03/issue03/gelman.pdf.

Ghandwani, D et al. 2023. arXiv. doi:10.48550/arXiv.2307.12378. http://arxiv.org/abs/2307.12378.

Ghosh, S et al. 2022a. Electronic Journal of Statistics 16 (2) (January): 4604–4635. doi:10.1214/22-EJS2047. https://projecteuclid.org/journals/electronic-journal-of-statistics/volume-16/issue-2/Scalable-logistic-regression-with-crossed-random-effects/10.1214/22-EJS2047.full.

Ghosh, S et al. 2022b. The Annals of Statistics 50 (1) (February): 560–583. doi:10.1214/21-AOS2121. https://projecteuclid.org/journals/annals-of-statistics/volume-50/issue-1/Backfitting-for-large-scale-crossed-random-effects-regressions/10.1214/21-AOS2121.full.

Griewank, A et al. 2003. Proc. Appl. Math. Mech 2 (1): 45–49. doi:10.1002/pamm.200310012.

Heiling, HM et al. 2024a. Biometrics 80 (1) (March): ujae016. doi:10.1093/biomtc/ujae016. https://doi.org/10.1093/biomtc/ujae016.

Heiling, HM et al. 2024b. arXiv. doi:10.48550/arXiv.2305.08204. http://arxiv.org/abs/2305.08204.

McElreath, R. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton: Chapman & Hall/CRC.

Pinheiro, JC et al. 2000. Mixed-effects models in S and S-PLUS. New York: Springer.

Tanaka, E et al. 2019. In Statistics and Data Science, ed. Hien Nguyen, 3–21. Communications in Computer and Information Science. Singapore: Springer. doi:10.1007/978-981-15-1960-4_1.

Valpine, P de et al. 2017. Journal of Computational and Graphical Statistics 26 (2) (April): 403–413. doi:10.1080/10618600.2016.1172487.

Wilkinson, GN et al. 1973. Applied Statistics 22 (3): 392–399. doi:10.2307/2346786.

Wood, S. 2017. Generalized Additive Models: An Introduction with R. 2nd ed. CRC Texts in Statistical Science. Chapman & Hall/CRC.