Licensed under the Creative Commons Attribution-NonCommercial license. Please share & remix noncommercially, mentioning its origin.
Why Bayes?
- philosophically satisfying
- addresses model complexity problems (via regularizing or truly informative priors)
- accounts for all levels of uncertainty
- makes post-modeling inference easy
Basic Bayes
Posterior probability \(\propto\) likelihood \(\times\) prior
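Written out for parameters \(\theta\) and data \(y\) (the denominator, the marginal likelihood, is a normalizing constant, which is exactly what the sampling methods below avoid computing):

\[
\textrm{Posterior}(\theta \mid y) \;=\;
\frac{\mathcal{L}(y \mid \theta)\,\textrm{Prior}(\theta)}
     {\int \mathcal{L}(y \mid \theta')\,\textrm{Prior}(\theta')\,d\theta'}
\;\propto\; \mathcal{L}(y \mid \theta)\,\textrm{Prior}(\theta)
\]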
Priors, continued
- continued debate over priors
- “uninformative” or “flat” priors: probably not really uninformative
- “conjugate”: mathematically/computationally convenient
- regularizing priors: neutral, but not completely uninformative
- keeps model from misbehaving
- informative priors: non-neutral, real information (Crome, Thomas, and Moore 1996; McCarthy 2007)
- simple ‘uninformative’ priors for variances etc. might not be (Gelman 2006)
- uniform priors are simple but problematic (Carpenter 2017)
- remember, scale of parameter matters!
Typical neutral/regularizing priors
- fixed-effect parameters (Greenland and Mansournia 2015)
- typically Normal, mean 0, std dev 3–5
- assume parameters are scaled or on log/logit scale
- Student-\(t\)/Cauchy allow heavier tails
- variance parameters
- Gamma(small shape): typical but problematic
- Gamma(shape=2, \(\lambda \to 0\)): weakly regularizing (Chung et al. 2013) (`blme` package)
- correlation matrices
- Wishart, inverse-Wishart: small shape parameters
- LKJ or “onion” priors (Lewandowski, Kurowicka, and Joe 2009): `eta > 1` makes extreme correlations less likely
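As a concrete sketch of how priors of these kinds might be written down (in brms syntax, which is an assumption here; the notes themselves only mention blme), with `y`, `x`, `group`, and `dat` as hypothetical placeholders:

```r
library(brms)
## hedged sketch: regularizing priors of the kinds listed above (brms syntax);
## y, x, group, and dat are hypothetical placeholders, not from the notes
priors <- c(
  set_prior("normal(0, 3)", class = "b"),         # fixed effects: Normal(0, 3)
  set_prior("student_t(3, 0, 3)", class = "sd"),  # group-level SDs: heavier tails
  set_prior("lkj(2)", class = "cor")              # correlation matrices: LKJ, eta = 2
)
fit <- brm(y ~ x + (x | group), data = dat, prior = priors)
```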
Sampling
Almost all modern Bayesian methods depend on stochastic sampling schemes, e.g.:
- conjugate sampling: for easy cases where we can derive the posterior distribution
- Gibbs sampling: stepwise sampling of different model components
- Metropolis-Hastings: choose candidate distribution and accept/reject
- Hamiltonian Monte Carlo
Metropolis-Hastings
- sampling parameter space
- start at a point (set of parameter values) \(A\)
- pick a new point \(B\) (from candidate distribution)
- evaluate \(P\)= prior \(\times\) likelihood
- always accept if \(P(B)\) better than \(P(A)\)
- if \(B\) is worse, accept with probability \(P(B)/P(A)\)
- (extra term if candidate distribution is asymmetric)
- generally converges to the posterior distribution (a minimal R sketch follows)
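A minimal sketch of this recipe in R, with a symmetric Gaussian candidate distribution (the function and its defaults are illustrative choices, not part of the original notes):

```r
## random-walk Metropolis: symmetric Gaussian candidate, accept/reject on the log scale
metropolis <- function(logpost, init, n_samples = 5000, sd_cand = 0.5) {
  chain <- matrix(NA_real_, nrow = n_samples, ncol = length(init))
  cur <- init
  cur_lp <- logpost(cur)                # log(prior x likelihood) at current point A
  n_acc <- 0
  for (i in seq_len(n_samples)) {
    cand <- cur + rnorm(length(cur), sd = sd_cand)  # candidate point B
    cand_lp <- logpost(cand)
    ## accept with probability min(1, P(B)/P(A)): always accepts when B is better
    if (log(runif(1)) < cand_lp - cur_lp) {
      cur <- cand; cur_lp <- cand_lp; n_acc <- n_acc + 1
    }
    chain[i, ] <- cur
  }
  attr(chain, "acceptance") <- n_acc / n_samples
  chain
}

## toy example: standard Normal "posterior"
set.seed(101)
ch <- metropolis(function(x) sum(dnorm(x, log = TRUE)), init = 0)
attr(ch, "acceptance")
```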
Acceptance ratio
- candidate distribution too wide: lots of rejections
- candidate distribution too narrow: high acceptance rate, but small moves
- either problem leads to inefficient sampling, highly correlated chains
- optimal acceptance \(\approx 20\%\)
Hamiltonian MC
- start at a point
- simulate a particle moving along the (negative log-)posterior surface with a random initial momentum
- hard parts: finding the gradient, knowing when to stop (“No U-Turn Sampler”)
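One HMC step with a leapfrog integrator, sketched in R under simplifying assumptions (fixed step size `eps` and path length `L`; real samplers such as Stan's NUTS tune these adaptively and obtain the gradient by automatic differentiation):

```r
## one Hamiltonian Monte Carlo step (leapfrog integrator); logpost and grad are
## the log-posterior and its gradient
hmc_step <- function(logpost, grad, cur, eps = 0.1, L = 20) {
  q <- cur
  p <- rnorm(length(q))                 # random initial momentum
  p0 <- p
  p <- p + eps * grad(q) / 2            # half-step for momentum
  for (l in seq_len(L)) {
    q <- q + eps * p                    # full step for position
    if (l < L) p <- p + eps * grad(q)   # full step for momentum
  }
  p <- p + eps * grad(q) / 2            # final half-step for momentum
  ## Metropolis correction on the total energy (potential + kinetic)
  H_cur <- -logpost(cur) + sum(p0^2) / 2
  H_new <- -logpost(q)   + sum(p^2)  / 2
  if (log(runif(1)) < H_cur - H_new) q else cur
}

## toy example: standard Normal target, whose log-density gradient is -x
hmc_step(function(x) sum(dnorm(x, log = TRUE)), function(x) -x, cur = c(3, -3))
```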
Burn-in, adaptation and thinning
- burn-in: wait for chain to travel from starting point to highest posterior density region
- adaptation: use chain performance (acceptance/rejection) to tune candidate distribution
- thinning: if chain is correlated, subsample (e.g. down to 1000 samples)
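For a chain stored as a matrix (one row per sample, e.g. `ch` from the Metropolis sketch above), burn-in removal and thinning are just subsetting; the numbers here are arbitrary:

```r
## drop the first 1000 iterations as burn-in, then keep every 10th sample
post <- ch[-seq_len(1000), , drop = FALSE]
post <- post[seq(1, nrow(post), by = 10), , drop = FALSE]
## equivalently, with a coda object: window(coda::as.mcmc(ch), start = 1001, thin = 10)
```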
Multiple chains
- Best way of assessing convergence
- Can be run in parallel (different cores/machines)
- Start at widely dispersed (but feasible) points
- Run until the coverage of the chains overlaps (they mix)
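Continuing the toy example, several chains from dispersed starting points can be run in parallel (parallel::mclapply is one option on Unix-alikes; the starting values are arbitrary):

```r
## four chains from widely dispersed (but feasible) starting points
starts <- list(-10, -1, 1, 10)
chains <- parallel::mclapply(
  starts,
  function(s) metropolis(function(x) sum(dnorm(x, log = TRUE)), init = s),
  mc.cores = 4
)
```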
Diagnostics: tests of convergence
- Gelman-Rubin statistic (potential scale reduction factor: \(\hat R < 1.1\)? \(< 1.02\)?); needs multiple chains (Vats and Knudson 2018)
- effective sample size (\(>500\))
- traceplot (should look like white noise)
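With the toy chains above converted to coda objects, these diagnostics look roughly like this (a sketch; coda is the assumed tool here):

```r
library(coda)
mc <- do.call(mcmc.list, lapply(chains, mcmc))  # combine the chains into a coda object
gelman.diag(mc)     # potential scale reduction factor (want values close to 1)
effectiveSize(mc)   # effective sample size (want > 500 or so)
traceplot(mc)       # should look like white noise around a stable mean
```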
Tackling fitting/convergence problems
- strengthen priors (keep model from getting in trouble)
- simplify model
- reparameterize (?)
- run the model for longer (and thin correspondingly)
Diagnostics: model adequacy
- plots of predictions
- posterior predictive distributions
- parameter correlations: `pairs(fitted, gap = 0, pch = ".")`
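A bare-bones posterior predictive check, as a sketch with made-up ingredients (observed counts `y_obs` and posterior draws `lambda_post` for a Poisson mean stand in for a real model fit):

```r
## posterior predictive check for a toy Poisson model
set.seed(202)
y_obs <- rpois(50, lambda = 4)                       # stand-in for the observed data
lambda_post <- rgamma(1000, shape = sum(y_obs) + 1,  # stand-in posterior draws of the mean
                      rate = length(y_obs))
y_rep <- sapply(lambda_post, function(l) rpois(length(y_obs), l))  # replicated data sets
## compare an observed summary statistic with its posterior predictive distribution
hist(colMeans(y_rep), main = "posterior predictive distribution of the mean")
abline(v = mean(y_obs), col = "red", lwd = 2)
```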
Model goodness of fit (relative)
- leave-one-out cross-validation (`loo` package)
- WAIC
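Both quantities start from a matrix of pointwise log-likelihoods (one row per posterior draw, one column per observation). A sketch, continuing the toy Poisson example above; for a Stan fit with a `log_lik` block, `loo::extract_log_lik()` would supply the matrix instead:

```r
library(loo)
## pointwise log-likelihood matrix: rows = posterior draws, columns = observations
log_lik <- sapply(y_obs, function(y) dpois(y, lambda_post, log = TRUE))
loo(log_lik)    # PSIS leave-one-out cross-validation (elpd_loo, p_loo)
waic(log_lik)   # WAIC; loo_compare() can then rank competing models
```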
Inference
- samplers return a matrix of parameter values (values of each parameter for each sample)
- Posterior medians (or means)
- Quantile intervals (`quantile()`)
- or highest posterior density intervals (`coda::HPDinterval`, `emdbook::HPDregion`)
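From such a matrix (e.g. the thinned toy chain `post` above, rows = samples, columns = parameters), the summaries are one-liners:

```r
apply(post, 2, median)                         # posterior medians
t(apply(post, 2, quantile, c(0.025, 0.975)))   # 95% quantile (credible) intervals
coda::HPDinterval(coda::as.mcmc(post))         # 95% highest posterior density intervals
```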
Inference on functions of parameters
- e.g. CIs on predictions
- easy: compute the function of interest for each posterior sample, then summarize the results (see the sketch below)
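For instance, a sketch with made-up posterior draws `b0`, `b1` for a log-link regression and a hypothetical new covariate value `xnew`:

```r
## CI on a function of parameters: evaluate it for every posterior draw, then summarize
b0 <- rnorm(1000, 0.5, 0.10)             # stand-in posterior draws of the intercept
b1 <- rnorm(1000, 0.2, 0.05)             # stand-in posterior draws of the slope
xnew <- 2                                # hypothetical new covariate value
pred <- exp(b0 + b1 * xnew)              # predicted mean, one value per draw
quantile(pred, c(0.025, 0.5, 0.975))     # median and 95% CI of the prediction
```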
Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika, 1–25. https://doi.org/10.1007/s11336-013-9328-2.
Crome, F. H. J., M. R. Thomas, and L. A. Moore. 1996. “A Novel Bayesian Approach to Assessing Impacts of Rain Forest Logging.” Ecological Applications 6: 1104–23.
Greenland, Sander, and Mohammad Ali Mansournia. 2015. “Penalization, Bias Reduction, and Default Priors in Logistic and Related Categorical and Survival Regressions.” Statistics in Medicine 34 (23): 3133–43. https://doi.org/10.1002/sim.6537.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100 (9): 1989–2001. https://doi.org/10.1016/j.jmva.2009.04.008.
McCarthy, M. 2007. Bayesian Methods for Ecology. Cambridge, England: Cambridge University Press.