Licensed under the Creative Commons attribution-noncommercial license. Please share & remix noncommercially, mentioning its origin.
Why Bayes?
- philosophically satisfying
- addresses model complexity problems
(use regularizing or truly informative priors)
- accounts for all levels of uncertainty
- makes post-modeling inference easy
Basic Bayes
Posterior probability \(\propto\) likelihood \(\times\) prior
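As a minimal sketch of this proportionality, the grid approximation below multiplies a binomial likelihood by a prior at each grid point and normalizes. The data (7 successes in 10 trials), the Beta(2, 2) prior and all names are illustrative assumptions, not taken from these notes.

```python
def grid_posterior(successes, trials, prior_a=2.0, prior_b=2.0, n_grid=1001):
    """Grid approximation: posterior is proportional to likelihood times prior."""
    # Midpoints of equal-width cells on (0, 1), so p is never exactly 0 or 1
    grid = [(i + 0.5) / n_grid for i in range(n_grid)]
    unnormalized = []
    for p in grid:
        likelihood = p**successes * (1 - p) ** (trials - successes)
        prior = p ** (prior_a - 1) * (1 - p) ** (prior_b - 1)  # Beta kernel
        unnormalized.append(likelihood * prior)
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]  # normalize to sum to 1
    return grid, weights


grid, weights = grid_posterior(successes=7, trials=10)
post_mean = sum(p * w for p, w in zip(grid, weights))

# This case happens to be conjugate, so the exact posterior is
# Beta(2 + 7, 2 + 3) and its mean can be used as a check.
exact_mean = (2 + 7) / (2 + 2 + 10)
print(f"grid posterior mean = {post_mean:.4f}, exact = {exact_mean:.4f}")
```

Grid approximation only works for one or two parameters, which is why the sampling schemes described later are needed for realistic models.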
Priors, continued
- continued debate over priors
- “uninformative” or “flat”: probably not really
- “conjugate”: mathematically/computationally convenient
- regularizing priors: neutral, but not completely uninformative
- keeps model from misbehaving
- informative priors: non-neutral, real information (Crome, Thomas, and Moore 1996; McCarthy 2007)
- simple ‘uninformative’ priors for variances etc. might not be (Gelman 2006)
- uniform priors are simple but problematic (Carpenter 2017)
- remember, scale of parameter matters!
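One way to see why the scale matters is to put a Normal(0, sd) prior on the logit scale and ask how much prior mass lands on extreme probabilities. The sketch below does this with exact normal tail areas. The cutoff of 0.99 and the sd values are arbitrary illustrative choices.

```python
import math


def extreme_mass(sd, cutoff=0.99):
    """Prior mass on p < 1 - cutoff or p > cutoff when logit(p) ~ Normal(0, sd)."""
    z = math.log(cutoff / (1 - cutoff)) / sd  # logit(cutoff) in sd units
    # Two-sided normal tail probability: P(|Z| > z) = erfc(z / sqrt(2))
    return math.erfc(z / math.sqrt(2))


for sd in (1.5, 3, 10, 100):
    mass = extreme_mass(sd)
    print(f"sd = {sd:>5}: prior mass with p outside (0.01, 0.99) = {mass:.3f}")
```

A very wide ("flat-looking") prior on the logit scale therefore puts most of its mass on probabilities near 0 or 1, which is rarely what is intended.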
typical neutral/regularizing priors
- fixed-effect parameters (Greenland and Mansournia 2015)
- typically Normal, mean 0, std dev 3–5
- assume parameters are scaled or on log/logit scale
- Student-\(t\)/Cauchy allow heavier tails
- variance parameters
- Gamma(small shape): typical but problematic
- Gamma(shape=2, \(\lambda \to 0\)): weakly regularizing (Chung et al. 2013)
- correlation matrices
- Wishart, inverse-Wishart: small shape parameters
- LKJ or “onion” priors (Lewandowski, Kurowicka, and Joe 2009):
makes extreme correlations less likely
Almost all modern Bayesian methods depend on stochastic sampling schemes
- conjugate sampling: for easy cases where we can derive the posterior distribution
- Gibbs sampling: stepwise sampling of different model components
- Metropolis-Hastings: choose candidate distribution and accept/reject
- Hamiltonian Monte Carlo
- sampling parameter space
- start at a point (set of parameter values) \(A\)
- pick a new point \(B\) (from candidate distribution)
- evaluate \(P =\) prior \(\times\) likelihood
- always accept if \(P(B)\) better than \(P(A)\)
- if \(B\) is worse, accept with probability \(P(B)/P(A)\)
- (extra term if candidate distribution is asymmetric)
- generally converges to posterior distribution
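The accept/reject steps above can be sketched as a random-walk Metropolis sampler. This is a minimal illustration, not a production sampler: the target (a standard normal standing in for prior \(\times\) likelihood), the starting point, the step size and all function names are assumptions. The candidate distribution is symmetric, so the extra correction term is not needed.

```python
import math
import random


def log_post(theta):
    """Log of prior times likelihood, up to a constant (standard normal stand-in)."""
    return -0.5 * theta**2


def metropolis(log_p, start, n_steps, step_sd, rng):
    """Random-walk Metropolis with a symmetric Normal candidate distribution."""
    chain = [start]
    current, current_lp = start, log_p(start)
    accepted = 0
    for _ in range(n_steps):
        proposal = current + rng.gauss(0.0, step_sd)  # candidate point B
        proposal_lp = log_p(proposal)
        # Accept with probability min(1, P(B)/P(A)), compared on the log scale;
        # 1 - random() lies in (0, 1], so the log is always defined.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        chain.append(current)  # a rejected proposal repeats the current value
    return chain, accepted / n_steps


rng = random.Random(1)
chain, accept_rate = metropolis(log_post, start=5.0, n_steps=20000, step_sd=2.4, rng=rng)

kept = chain[1000:]  # drop early samples taken while moving in from the start
post_mean = sum(kept) / len(kept)
post_var = sum((x - post_mean) ** 2 for x in kept) / (len(kept) - 1)
print(f"acceptance = {accept_rate:.2f}, mean = {post_mean:.3f}, variance = {post_var:.3f}")
```

Working with log densities avoids numerical underflow when the likelihood is a product of many small terms.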
acceptance ratio
- candidate distribution too big: lots of rejection
- candidate distribution too small: lots of acceptance, but small moves
- either problem leads to inefficient sampling, highly correlated chains
- optimal acceptance \(\approx 20\)–\(25\%\) for random-walk candidates in many dimensions (higher in low dimensions)
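The trade-off can be checked numerically. The sketch below runs a one-dimensional random-walk Metropolis sampler on a standard normal target with three candidate widths and reports the acceptance rate and the lag-1 autocorrelation of each chain. The target, widths, seed and function names are all illustrative assumptions.

```python
import math
import random


def run_chain(step_sd, n_steps=20000, seed=1):
    """Random-walk Metropolis on a standard normal target."""
    rng = random.Random(seed)
    x, chain, accepted = 0.0, [], 0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step_sd)
        # log P(y) - log P(x) for a standard normal is 0.5 * (x^2 - y^2)
        if math.log(1.0 - rng.random()) < 0.5 * (x * x - y * y):
            x, accepted = y, accepted + 1
        chain.append(x)
    return chain, accepted / n_steps


def lag1_autocorr(chain):
    """Sample autocorrelation between successive values of the chain."""
    n = len(chain)
    mean = sum(chain) / n
    num = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in chain)
    return num / den


results = {}
for step_sd in (0.05, 2.4, 50.0):
    chain, rate = run_chain(step_sd)
    results[step_sd] = (rate, lag1_autocorr(chain))
    print(f"step sd {step_sd:>5}: acceptance = {rate:.2f}, "
          f"lag-1 autocorrelation = {results[step_sd][1]:.3f}")
```

Both the very small and the very large candidate widths give chains whose successive values are almost identical, for opposite reasons: tiny moves in one case, repeated rejections in the other.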
Hamiltonian MC
- start at a point
- simulate a particle moving along the surface with random momentum
- hard parts: finding the gradient, knowing when to stop (“No U-Turn Sampler”)
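A minimal sketch of one Hamiltonian Monte Carlo update follows, assuming a one-dimensional standard normal target whose gradient is known analytically. It uses a fixed number of leapfrog steps rather than the No-U-Turn stopping rule, so it illustrates the particle simulation only. The step size, trajectory length and names are assumptions.

```python
import math
import random


def neg_log_post(q):
    """Potential energy U(q) = -log(prior x likelihood); standard normal stand-in."""
    return 0.5 * q * q


def grad_neg_log_post(q):
    """Gradient of the potential energy."""
    return q


def hmc_step(q, step_size, n_leapfrog, rng):
    """One HMC update with a fixed-length leapfrog trajectory."""
    p = rng.gauss(0.0, 1.0)  # random momentum
    q_new, p_new = q, p
    # Leapfrog: half step in momentum, alternating full steps, final half step
    p_new -= 0.5 * step_size * grad_neg_log_post(q_new)
    for i in range(n_leapfrog):
        q_new += step_size * p_new
        if i < n_leapfrog - 1:
            p_new -= step_size * grad_neg_log_post(q_new)
    p_new -= 0.5 * step_size * grad_neg_log_post(q_new)
    # Metropolis correction for the numerical error in total energy
    h_old = neg_log_post(q) + 0.5 * p * p
    h_new = neg_log_post(q_new) + 0.5 * p_new * p_new
    if math.log(1.0 - rng.random()) < h_old - h_new:
        return q_new, True
    return q, False


rng = random.Random(2)
q, samples, n_accept = 3.0, [], 0
for _ in range(5000):
    q, ok = hmc_step(q, step_size=0.3, n_leapfrog=5, rng=rng)
    samples.append(q)
    n_accept += ok

kept = samples[200:]
hmc_mean = sum(kept) / len(kept)
hmc_var = sum((x - hmc_mean) ** 2 for x in kept) / (len(kept) - 1)
hmc_accept = n_accept / len(samples)
print(f"acceptance = {hmc_accept:.2f}, mean = {hmc_mean:.3f}, variance = {hmc_var:.3f}")
```

In real models the gradient is usually obtained by automatic differentiation, and the trajectory length is tuned adaptively instead of being fixed by hand.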
Burn-in, adaptation and thinning
- burn-in: wait for chain to travel from starting point to highest posterior density region
- adaptation: use chain performance (acceptance/rejection) to tune candidate distribution
- thinning: if chain is correlated, subsample (e.g. down to 1000 samples)
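Burn-in and thinning amount to simple slicing of the stored chain. In the sketch below an AR(1) series started far from its stationary mean stands in for sampler output. The series, the burn-in length of 500 and the thinning interval of 10 are illustrative assumptions.

```python
import random


def lag1_autocorr(chain):
    """Sample autocorrelation between successive values of the chain."""
    n = len(chain)
    mean = sum(chain) / n
    num = sum((chain[i] - mean) * (chain[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in chain)
    return num / den


# Stand-in for a correlated chain: AR(1) with stationary mean 0 and variance 1,
# deliberately started at 20 so the early samples are unrepresentative.
rng = random.Random(3)
rho, x, raw = 0.95, 20.0, []
for _ in range(10500):
    x = rho * x + rng.gauss(0.0, (1 - rho**2) ** 0.5)
    raw.append(x)

burn_in, thin = 500, 10
after_burn_in = raw[burn_in:]    # discard the approach from the starting point
thinned = after_burn_in[::thin]  # keep every 10th sample

print(f"samples kept: {len(thinned)}")
print(f"lag-1 autocorrelation before thinning: {lag1_autocorr(after_burn_in):.3f}")
print(f"lag-1 autocorrelation after thinning:  {lag1_autocorr(thinned):.3f}")
```

Thinning reduces storage and autocorrelation among the retained samples, but it does not add information: the thinned chain cannot have a larger effective sample size than the full one.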
Multiple chains
- Best way of assessing convergence
- Can be run in parallel (different cores/machines)
- Start at widely dispersed points (but workable)
- Run until coverage of chains overlaps
Diagnostics: tests of convergence
- Gelman-Rubin statistic (potential scale reduction factor: \(\hat{R} < 1.1\)? 1.02?); needs multiple chains (Vats and Knudson 2018)
- effective sample size (\(>500\))
- traceplot (should look like white noise)
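The first two diagnostics can be sketched directly from their definitions. The code below computes the basic (non-split) Gelman-Rubin statistic from within- and between-chain variances, and a simplified effective sample size that truncates the autocorrelation sum at the first non-positive lag. Established packages use more careful estimators, so treat this as an illustration. The simulated chains and all names are assumptions.

```python
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains (basic version)."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean(statistics.variance(c) for c in chains)  # W
    between = n * statistics.variance(chain_means)                     # B
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


def effective_sample_size(chain):
    """Simplified ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    n = len(chain)
    mean = statistics.fmean(chain)
    centered = [v - mean for v in chain]
    var0 = sum(c * c for c in centered) / n
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var0
        if rho <= 0:  # stop at the first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


rng = random.Random(4)
# Four well-mixed chains, and four chains stuck in two different regions
good_chains = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
bad_chains = [[rng.gauss(shift, 1) for _ in range(1000)] for shift in (0, 0, 3, 3)]

# A strongly autocorrelated chain (AR(1), rho = 0.9) for the ESS comparison
x, ar_chain = 0.0, []
for _ in range(2000):
    x = 0.9 * x + rng.gauss(0, (1 - 0.9**2) ** 0.5)
    ar_chain.append(x)

rhat_good, rhat_bad = gelman_rubin(good_chains), gelman_rubin(bad_chains)
ess_iid = effective_sample_size(good_chains[0])
ess_ar = effective_sample_size(ar_chain)
print(f"R-hat, mixed chains: {rhat_good:.3f}; stuck chains: {rhat_bad:.3f}")
print(f"ESS, independent draws: {ess_iid:.0f} of 1000; AR(1) chain: {ess_ar:.0f} of 2000")
```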
Tackling fitting/convergence problems
- strengthen priors (keep model from getting in trouble)
- simplify model
- reparameterize (?)
- run the model for longer (and thin correspondingly)
Diagnostics: model adequacy
- plots of predictions
- posterior predictive distributions
- parameter correlations
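A posterior predictive check can be sketched as follows: for each posterior draw, simulate a replicated data set and compare the observed data with the distribution of replicates. The binomial example (7 successes in 10 trials), the Beta(9, 5) stand-in for posterior draws and the names are assumptions made for illustration.

```python
import random

rng = random.Random(5)
observed, trials = 7, 10

# Stand-in for sampler output: posterior draws of a success probability.
# Beta(9, 5) is what a Beta(2, 2) prior gives after 7 successes in 10 trials.
posterior_p = [rng.betavariate(9, 5) for _ in range(4000)]

# One replicated data set (a binomial count) per posterior draw
replicated = [sum(rng.random() < p for _ in range(trials)) for p in posterior_p]

# Posterior predictive p-value: how often a replicate is at least as large
# as the observed count. Values near 0 or 1 suggest a poorly fitting model.
ppp = sum(r >= observed for r in replicated) / len(replicated)
rep_mean = sum(replicated) / len(replicated)
print(f"mean replicated count = {rep_mean:.2f}, posterior predictive p = {ppp:.2f}")
```

In practice the comparison uses a summary statistic chosen to probe a suspected weakness of the model, such as the number of zeros or the maximum value.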
Model goodness of fit (relative)
- leave-one-out cross-validation
- samplers return a matrix of parameter values (values of each parameter for each sample)
- Posterior medians (or means)
- Quantile intervals
- or highest posterior density intervals (emdbook::HPDregion)
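These summaries are simple functions of the sampled values. The sketch below computes a posterior median, a 95% quantile interval, and a 95% highest posterior density interval, the latter taken as the shortest window containing 95% of the sorted samples (this assumes a unimodal posterior). The Gamma draws standing in for sampler output and the function names are illustrative assumptions.

```python
import math
import random
import statistics


def quantile(sorted_x, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_x) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])


def hpd_interval(samples, mass=0.95):
    """Shortest interval containing the given fraction of the samples."""
    s = sorted(samples)
    n_in = math.ceil(mass * len(s))  # number of samples inside the interval
    widths = [s[i + n_in - 1] - s[i] for i in range(len(s) - n_in + 1)]
    best = widths.index(min(widths))
    return s[best], s[best + n_in - 1]


rng = random.Random(6)
# Stand-in for one column of the sample matrix: a right-skewed posterior
draws = [rng.gammavariate(2.0, 1.0) for _ in range(5000)]

s = sorted(draws)
post_median = statistics.median(draws)
q_lo, q_hi = quantile(s, 0.025), quantile(s, 0.975)
h_lo, h_hi = hpd_interval(draws)
print(f"median = {post_median:.2f}")
print(f"95% quantile interval = ({q_lo:.2f}, {q_hi:.2f})")
print(f"95% HPD interval      = ({h_lo:.2f}, {h_hi:.2f})")
```

For a skewed posterior the two intervals differ noticeably. The quantile interval is invariant to monotone transformations of the parameter, while the highest posterior density interval is not.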
Inference on functions of parameters
- e.g. CIs on predictions
- easy! compute whatever function you want
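As a minimal sketch of this idea, the code below applies a function (the predicted probability from a logistic regression at a new covariate value) to every posterior sample and then summarizes the results. The intercept and slope draws, the covariate value 2.0 and the names are hypothetical stand-ins for real sampler output.

```python
import math
import random

rng = random.Random(7)

# Stand-in for sampler output: one (intercept, slope) pair per posterior sample.
# Independent draws are used here for simplicity; real posterior samples are
# usually correlated, and row-wise evaluation automatically respects that.
samples = [(rng.gauss(-1.0, 0.3), rng.gauss(0.8, 0.1)) for _ in range(4000)]

x_new = 2.0
# Evaluate the function of interest once per posterior sample
pred = sorted(1 / (1 + math.exp(-(a + b * x_new))) for a, b in samples)

n = len(pred)
pred_median = pred[n // 2]
pred_lower = pred[int(0.025 * (n - 1))]
pred_upper = pred[int(0.975 * (n - 1))]
print(f"predicted probability at x = {x_new}: "
      f"{pred_median:.3f} ({pred_lower:.3f}, {pred_upper:.3f})")
```

No delta-method approximation is needed: the interval for any derived quantity comes directly from the transformed samples.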
Chung, Yeojin, Sophia Rabe-Hesketh, Vincent Dorie, Andrew Gelman, and Jingchen Liu. 2013. “A Nondegenerate Penalized Likelihood Estimator for Variance Parameters in Multilevel Models.” Psychometrika, 1–25.
Crome, F. H. J., M. R. Thomas, and L. A. Moore. 1996. “A Novel Bayesian Approach to Assessing Impacts of Rain Forest Logging.” Ecological Applications 6: 1104–23.
Greenland, Sander, and Mohammad Ali Mansournia. 2015. “Penalization, Bias Reduction, and Default Priors in Logistic and Related Categorical and Survival Regressions.” Statistics in Medicine 34 (23): 3133–43.
Lewandowski, Daniel, Dorota Kurowicka, and Harry Joe. 2009. “Generating Random Correlation Matrices Based on Vines and Extended Onion Method.” Journal of Multivariate Analysis 100 (9): 1989–2001.
McCarthy, M. 2007. Bayesian Methods for Ecology. Cambridge, England: Cambridge University Press.