Licensed under the Creative Commons Attribution-NonCommercial license. Please share & remix noncommercially, mentioning its origin.
Why Bayes?
- philosophically satisfying
- addresses model complexity problems (via regularizing or truly informative priors)
- accounts for all levels of uncertainty
- makes post-modeling inference easy
Basic Bayes
Posterior probability \(\propto\) likelihood \(\times\) prior
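Written out for parameters \(\theta\) and data \(y\) (the denominator, the marginal likelihood, is a normalizing constant, which is exactly what the sampling methods below avoid computing):

\[
\textrm{Posterior}(\theta \mid y) \;=\;
\frac{\mathcal{L}(y \mid \theta)\,\textrm{Prior}(\theta)}
     {\int \mathcal{L}(y \mid \theta')\,\textrm{Prior}(\theta')\,d\theta'}
\;\propto\; \mathcal{L}(y \mid \theta)\,\textrm{Prior}(\theta)
\]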
Priors, continued
- continued debate over priors
- “uninformative” or “flat” priors: probably not really uninformative
- “conjugate”: mathematically/computationally convenient
- regularizing priors: neutral, but not completely uninformative
- keeps model from misbehaving
- informative priors: non-neutral, real information (Crome, Thomas, and Moore 1996; McCarthy 2007)
- simple ‘uninformative’ priors for variances etc. might not be (Gelman 2006)
- uniform priors are simple but problematic (Carpenter 2017)
- remember, scale of parameter matters!
Typical neutral/regularizing priors
- fixed-effect parameters (Greenland and Mansournia 2015)
- typically Normal, mean 0, std dev 3–5
- assume parameters are scaled or on log/logit scale
- Student-\(t\)/Cauchy allow heavier tails
- variance parameters
- Gamma(small shape): typical but problematic
- Gamma(shape=2, \(\lambda \to 0\)): weakly regularizing (Chung et al. 2013) (`blme` package)
- correlation matrices
- Wishart, inverse-Wishart: small shape parameters
- LKJ or “onion” priors (Lewandowski, Kurowicka, and Joe 2009): `eta > 1` makes extreme correlations less likely
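As a concrete sketch of how priors of these kinds might be written down (in brms syntax, which is an assumption here; the notes themselves only mention blme), with `y`, `x`, `group`, and `dat` as hypothetical placeholders:

```r
library(brms)
## hedged sketch: regularizing priors of the kinds listed above (brms syntax);
## y, x, group, and dat are hypothetical placeholders, not from the notes
priors <- c(
  set_prior("normal(0, 3)", class = "b"),         # fixed effects: Normal(0, 3)
  set_prior("student_t(3, 0, 3)", class = "sd"),  # group-level SDs: heavier tails
  set_prior("lkj(2)", class = "cor")              # correlation matrices: LKJ, eta = 2
)
fit <- brm(y ~ x + (x | group), data = dat, prior = priors)
```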
Sampling
Almost all modern Bayesian methods depend on stochastic sampling schemes, e.g.:
- conjugate sampling: for easy cases where we can derive the posterior distribution
- Gibbs sampling: stepwise sampling of different model components
- Metropolis-Hastings: choose candidate distribution and accept/reject
- Hamiltonian Monte Carlo
Metropolis-Hastings
- sampling parameter space
- start at a point (set of parameter values) \(A\)
- pick a new point \(B\) (from candidate distribution)
- evaluate \(P\)= prior \(\times\) likelihood
- always accept if \(P(B)\) better than \(P(A)\)
- if \(B\) is worse, accept with probability \(P(B)/P(A)\)
- (extra term if candidate distribution is asymmetric)
- generally converges to the posterior distribution (a minimal R sketch follows)
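A minimal sketch of this recipe in R, with a symmetric Gaussian candidate distribution (the function and its defaults are illustrative choices, not part of the original notes):

```r
## random-walk Metropolis: symmetric Gaussian candidate, accept/reject on the log scale
metropolis <- function(logpost, init, n_samples = 5000, sd_cand = 0.5) {
  chain <- matrix(NA_real_, nrow = n_samples, ncol = length(init))
  cur <- init
  cur_lp <- logpost(cur)                # log(prior x likelihood) at current point A
  n_acc <- 0
  for (i in seq_len(n_samples)) {
    cand <- cur + rnorm(length(cur), sd = sd_cand)  # candidate point B
    cand_lp <- logpost(cand)
    ## accept with probability min(1, P(B)/P(A)): always accepts when B is better
    if (log(runif(1)) < cand_lp - cur_lp) {
      cur <- cand; cur_lp <- cand_lp; n_acc <- n_acc + 1
    }
    chain[i, ] <- cur
  }
  attr(chain, "acceptance") <- n_acc / n_samples
  chain
}

## toy example: standard Normal "posterior"
set.seed(101)
ch <- metropolis(function(x) sum(dnorm(x, log = TRUE)), init = 0)
attr(ch, "acceptance")
```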
Acceptance ratio
- candidate distribution too wide: lots of rejections
- candidate distribution too narrow: high acceptance rate, but small moves
- either problem leads to inefficient sampling, highly correlated chains
- optimal acceptance \(\approx 20\%\)
Hamiltonian MC
- start at a point
- simulate a particle moving along the (negative log-)posterior surface with a random initial momentum
- hard parts: finding the gradient, knowing when to stop (“No U-Turn Sampler”)
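One HMC step with a leapfrog integrator, sketched in R under simplifying assumptions (fixed step size `eps` and path length `L`; real samplers such as Stan's NUTS tune these adaptively and obtain the gradient by automatic differentiation):

```r
## one Hamiltonian Monte Carlo step (leapfrog integrator); logpost and grad are
## the log-posterior and its gradient
hmc_step <- function(logpost, grad, cur, eps = 0.1, L = 20) {
  q <- cur
  p <- rnorm(length(q))                 # random initial momentum
  p0 <- p
  p <- p + eps * grad(q) / 2            # half-step for momentum
  for (l in seq_len(L)) {
    q <- q + eps * p                    # full step for position
    if (l < L) p <- p + eps * grad(q)   # full step for momentum
  }
  p <- p + eps * grad(q) / 2            # final half-step for momentum
  ## Metropolis correction on the total energy (potential + kinetic)
  H_cur <- -logpost(cur) + sum(p0^2) / 2
  H_new <- -logpost(q)   + sum(p^2)  / 2
  if (log(runif(1)) < H_cur - H_new) q else cur
}

## toy example: standard Normal target, whose log-density gradient is -x
hmc_step(function(x) sum(dnorm(x, log = TRUE)), function(x) -x, cur = c(3, -3))
```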
Burn-in, adaptation and thinning
- burn-in: wait for chain to travel from starting point to highest posterior density region
- adaptation: use chain performance (acceptance/rejection) to tune candidate distribution
- thinning: if chain is correlated, subsample (e.g. down to 1000 samples)
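For a chain stored as a matrix (one row per sample, e.g. `ch` from the Metropolis sketch above), burn-in removal and thinning are just subsetting; the numbers here are arbitrary:

```r
## drop the first 1000 iterations as burn-in, then keep every 10th sample
post <- ch[-seq_len(1000), , drop = FALSE]
post <- post[seq(1, nrow(post), by = 10), , drop = FALSE]
## equivalently, with a coda object: window(coda::as.mcmc(ch), start = 1001, thin = 10)
```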
Multiple chains
- Best way of assessing convergence
- Can be run in parallel (different cores/machines)
- Start at widely dispersed (but feasible) points
- Run until the coverage of the chains overlaps (they mix)
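Continuing the toy example, several chains from dispersed starting points can be run in parallel (parallel::mclapply is one option on Unix-alikes; the starting values are arbitrary):

```r
## four chains from widely dispersed (but feasible) starting points
starts <- list(-10, -1, 1, 10)
chains <- parallel::mclapply(
  starts,
  function(s) metropolis(function(x) sum(dnorm(x, log = TRUE)), init = s),
  mc.cores = 4
)
```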
Diagnostics: tests of convergence
- Gelman-Rubin statistic (potential scale reduction factor: \(\hat R < 1.1\)? \(< 1.02\)?); needs multiple chains (Vats and Knudson 2018)
- effective sample size (\(>500\))
- traceplot (should look like white noise)
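With the toy chains above converted to coda objects, these diagnostics look roughly like this (a sketch; coda is the assumed tool here):

```r
library(coda)
mc <- do.call(mcmc.list, lapply(chains, mcmc))  # combine the chains into a coda object
gelman.diag(mc)     # potential scale reduction factor (want values close to 1)
effectiveSize(mc)   # effective sample size (want > 500 or so)
traceplot(mc)       # should look like white noise around a stable mean
```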
Tackling fitting/convergence problems
- strengthen priors (keep model from getting in trouble)
- simplify model
- reparameterize (?)
- run the model for longer (and thin correspondingly)
Diagnostics: model adequacy
- plots of predictions
- posterior predictive distributions
- parameter correlations: `pairs(fitted, gap = 0, pch = ".")`
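A bare-bones posterior predictive check, as a sketch with made-up ingredients (observed counts `y_obs` and posterior draws `lambda_post` for a Poisson mean stand in for a real model fit):

```r
## posterior predictive check for a toy Poisson model
set.seed(202)
y_obs <- rpois(50, lambda = 4)                       # stand-in for the observed data
lambda_post <- rgamma(1000, shape = sum(y_obs) + 1,  # stand-in posterior draws of the mean
                      rate = length(y_obs))
y_rep <- sapply(lambda_post, function(l) rpois(length(y_obs), l))  # replicated data sets
## compare an observed summary statistic with its posterior predictive distribution
hist(colMeans(y_rep), main = "posterior predictive distribution of the mean")
abline(v = mean(y_obs), col = "red", lwd = 2)
```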
Model goodness of fit (relative)
- leave-one-out cross-validation (`loo` package)
- WAIC
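Both quantities start from a matrix of pointwise log-likelihoods (one row per posterior draw, one column per observation). A sketch, continuing the toy Poisson example above; for a Stan fit with a `log_lik` block, `loo::extract_log_lik()` would supply the matrix instead:

```r
library(loo)
## pointwise log-likelihood matrix: rows = posterior draws, columns = observations
log_lik <- sapply(y_obs, function(y) dpois(y, lambda_post, log = TRUE))
loo(log_lik)    # PSIS leave-one-out cross-validation (elpd_loo, p_loo)
waic(log_lik)   # WAIC; loo_compare() can then rank competing models
```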
Inference
- samplers return a matrix of parameter values (values of each parameter for each sample)
- Posterior medians (or means)
- Quantile intervals (`quantile()`)
- or highest posterior density intervals (`coda::HPDinterval`, `emdbook::HPDregion`)
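From such a matrix (e.g. the thinned toy chain `post` above, rows = samples, columns = parameters), the summaries are one-liners:

```r
apply(post, 2, median)                         # posterior medians
t(apply(post, 2, quantile, c(0.025, 0.975)))   # 95% quantile (credible) intervals
coda::HPDinterval(coda::as.mcmc(post))         # 95% highest posterior density intervals
```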
Inference on functions of parameters
- e.g. CIs on predictions
- easy: compute the function of interest for each posterior sample, then summarize the results (see the sketch below)
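For instance, a sketch with made-up posterior draws `b0`, `b1` for a log-link regression and a hypothetical new covariate value `xnew`:

```r
## CI on a function of parameters: evaluate it for every posterior draw, then summarize
b0 <- rnorm(1000, 0.5, 0.10)             # stand-in posterior draws of the intercept
b1 <- rnorm(1000, 0.2, 0.05)             # stand-in posterior draws of the slope
xnew <- 2                                # hypothetical new covariate value
pred <- exp(b0 + b1 * xnew)              # predicted mean, one value per draw
quantile(pred, c(0.025, 0.5, 0.975))     # median and 95% CI of the prediction
```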
Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika, 1–25. https://doi.org/10.1007/s11336-013-9328-2.
Crome, F. H. J., M. R. Thomas, and L. A. Moore. 1996. “A Novel Bayesian Approach to Assessing Impacts of Rain Forest Logging.” Ecological Applications 6: 1104–23.
Greenland, Sander, and Mohammad Ali Mansournia. 2015. “Penalization, Bias Reduction, and Default Priors in Logistic and Related Categorical and Survival Regressions.” Statistics in Medicine 34 (23): 3133–43. https://doi.org/10.1002/sim.6537.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100 (9): 1989–2001. https://doi.org/10.1016/j.jmva.2009.04.008.
McCarthy, M. 2007. Bayesian Methods for Ecology. Cambridge, England: Cambridge University Press.