Includes material from Ian Dworkin and Jonathan Dushoff, but they bear no responsibility for the contents.

Introduction

Why Bayes?

  • alternative philosophy of statistics/approach to inference
  • magic black boxes for estimation (BEAST, MrBayes, BayesTraits, …)
  • build-your-own model (McElreath 2020; M. McCarthy 2007; Clark 2020; Hobbs and Hooten 2015) (JAGS, Stan)
  • better handling of uncertainty/error propagation (Elderd, Dukic, and Dwyer 2006; Ludwig 1996)
  • informative priors for data-poor decisions (M. A. McCarthy and Masters 2005)
  • priors for regularization (Lemoine 2019)

What do you need to know?

  • basic meanings of output (point estimates, CIs)
  • nuts and bolts of particular tools
  • usually, more about probability distributions than you already knew
  • for build-your-own methods, lots more about your model
  • basics of Markov Chain Monte Carlo diagnostics

Tools

  • mixed models: rstanarm, MCMCglmm, brms, INLA
  • build-your-own: toolboxes BUGS et al. (JAGS), Stan/rethinking, TMB, greta, …
  • MCMC diagnostic tools (coda, bayestestR)

Inference

|                                                       | Frequentist                                                                                       | Bayesian                                                                                 |
|-------------------------------------------------------|---------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| Discrete hypothesis testing                           | null-hypothesis significance testing; AIC etc. (every stats textbook; Burnham and Anderson 2002)   | Bayes factors; Bayesian indices of significance (Makowski et al. 2019; Shi and Yin 2021) |
| Continuous/quantitative (estimation with uncertainty) | MLE etc. + confidence intervals (Bolker 2008)                                                      | posterior means/medians and credible intervals                                           |

Principles

  • Use confidence intervals instead of P values, when possible
    • Broad-sense confidence intervals include Bayesian credible intervals
    • Consider scientific “significance” (importance) separately from statistical “significance” (clarity)
  • If you need P values, use values that are consistent with your confidence intervals, when possible

Bayes Theorem

  • If \(A_i\) are alternative events (exactly one must happen), then:

\[ \newcommand{\pr}{\textrm{Pr}} \pr(A_i|B) = \frac{\pr(B|A_i) \pr(A_i)}{\sum \pr(B|A_j) \pr(A_j)} \]

  • \(\pr(A_i)\) the prior probability of \(A_i\)

  • \(\pr(A_i|B)\) is the posterior probability of \(A_i\), given event \(B\)

  • People argue about Bayesian inference, but nobody argues about Bayes theorem

  • Now let’s change \(A_i\) to \(H_i\) (“hypothesis”, which can denote a model or a particular parameter value) and \(B\) to \(D\) (“data”); we get

\[ \begin{split} \pr(H_i|D) & = \frac{\pr(D|H_i) \pr(H_i)}{\sum \pr(D|H_j) \pr(H_j)} \\ & = \frac{\pr(D|H_i) \pr(H_i)}{\pr(D)} \end{split} \]

If \(D\) is the data, then \(\pr(H_i)\) is the prior probability of hypothesis \(H_i\) and \(\pr(D|H_i)\) is the likelihood of hypothesis \(H_i\).

The denominator is the probability of observing the data under any of the hypotheses. It looks scary, and it is computationally scary (when the \(H_i\) represent a set of continuous parameter values, the sum becomes an integral; when a model has lots of continuous parameters, it becomes a high-dimensional integral). However, most tools for Bayesian inference represent elegant ways to avoid ever having to compute the denominator explicitly, so in practice you won’t have to worry about it. You may sometimes see Bayes’ Rule written out as \(\textrm{posterior} \propto \textrm{likelihood} \times \textrm{prior}\), where \(\propto\) means “proportional to”, to emphasize that we can often avoid thinking about the denominator.
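With a discrete set of hypotheses, Bayes' theorem is just a few lines of arithmetic. A minimal sketch in Python, with made-up priors and likelihoods:

```python
# Discrete Bayes' theorem with made-up numbers for three hypotheses.
priors = [0.5, 0.3, 0.2]          # Pr(H_j); must sum to 1
likelihoods = [0.10, 0.40, 0.80]  # Pr(D | H_j)

joint = [lik * p for lik, p in zip(likelihoods, priors)]
pr_data = sum(joint)              # the "scary" denominator, Pr(D)
posterior = [j / pr_data for j in joint]

print(pr_data)    # Pr(D), about 0.33
print(posterior)  # Pr(H_j | D); sums to 1 by construction
```

Note how the hypothesis with the highest likelihood need not have the highest prior; the posterior weighs both.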

Bolker 2008 Figure 4.2: Decomposition of the unconditional probability of the observed data (\(\pr(D)\)) into the sum of the probabilities of the intersection of the data with each possible hypothesis (\(\sum_{j=1}^N \pr(D | H_j) \pr(H_j)\)). The entire gray ellipse in the middle represents \(\pr(D)\). Each wedge (e.g. the hashed area \(H_5\)) represents an alternative hypothesis; the area corresponds to \(\pr(H_5)\). The ellipse is divided into “pizza slices” (e.g. \(D \cap H_5\) , hashed and colored area). The area of each slice corresponds to \(D \cap H_j\) (\(\pr(D \cap H_j) = \pr(D|H_j) \pr(H_j)\)) , the joint probability of the data \(D\) (ellipse) and the particular hypothesis \(H_j\) (wedge). The posterior probability \(\pr(H_j|D)\) is the fraction of the ellipse taken by \(H_j\), i.e. the area of the pizza slice divided by the area of the ellipse.

Bayesian inference

  • Go from a statistical model of how your data are generated, to a probability model of parameter values
    • Requires prior distributions describing the assumed probability of parameter values before these observations are made
    • Use Bayes theorem to go from probability of the data given parameters to the probability of parameters given data
  • Once we have a posterior distribution, we can calculate a best guess for each parameter
    • Mean, median, or mode (the posterior mode is also called the MAP, or maximum a posteriori, estimate)
    • Only median is scale-independent
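The scale-independence point can be checked directly: the median of transformed posterior draws equals the transform of the median, while the mean does not. A small sketch with hypothetical draws:

```python
import math
import statistics

# Hypothetical posterior draws for a positive parameter
samples = [0.8, 1.1, 1.3, 1.6, 2.0, 2.9, 4.5]
log_samples = [math.log(x) for x in samples]

# The median commutes with monotone transformations such as log ...
print(math.exp(statistics.median(log_samples)), statistics.median(samples))

# ... but the mean does not: exp of the mean log is the geometric mean,
# which differs from the arithmetic mean on the original scale
print(math.exp(statistics.mean(log_samples)), statistics.mean(samples))
```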

Confidence intervals

  • We do hypothesis tests using “credible intervals” — these are like confidence intervals, except that we really believe (relying on our assumptions) that there is a 95% chance that the value is in the 95% credible interval
    • There are a lot of ways to do this (Hespanhol et al. 2019). You need to decide in advance.
    • Quantiles are principled, but not easy in >1 dimension
    • Highest posterior density is straightforward, but scale-dependent
    • They’re the same if the posterior distribution is symmetric
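A quantile-based (equal-tailed) credible interval is easy to read off posterior samples. A sketch, using simulated lognormal draws as a stand-in for real MCMC output:

```python
import random

random.seed(42)
# Stand-in for posterior draws (a skewed distribution, as posteriors often are)
samples = sorted(random.lognormvariate(0.0, 0.5) for _ in range(10_000))

def quantile(xs_sorted, q):
    """Linear-interpolation quantile of pre-sorted data."""
    pos = q * (len(xs_sorted) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs_sorted) - 1)
    frac = pos - lo
    return xs_sorted[lo] * (1 - frac) + xs_sorted[hi] * frac

# Equal-tailed ("quantile") 95% credible interval
ci = (quantile(samples, 0.025), quantile(samples, 0.975))
inside = sum(ci[0] <= x <= ci[1] for x in samples) / len(samples)
print(ci, inside)  # about 95% of the draws fall inside
```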

Bolker 2008 Figure 6.11: “Bayesian 95% credible interval (gray), and 5% tail areas (hashed), for the tadpole predation data (weak prior: shape=(1,1)).”

  • For example, a linear relationship is significant if the credible interval for the slope does not include zero
  • A difference between groups is significant if the credible interval for the difference does not include zero

Advantages

  • Assumptions more explicit
  • Probability statements more straightforward
  • Very flexible
  • Can combine information from different sources

Disadvantages

  • More assumptions required
  • More difficult to calculate answers
    • easy problems are easy
    • medium problems are hard
    • hard problems are very hard or impossible

Assumptions

Prior distributions

  • Often start with a prior distribution that has little information
    • Let the data do the work (Edwards 1996)
  • This often means a normal (or lognormal, or Gamma) with a very large variance
    • We can test for sensitivity to this choice
  • Can also use a very broad uniform distribution (on log, or linear scale)
    • Common, but avoid if possible
    • Do you really believe that a value of 1.999 is as likely as any other value in the range, but that a value of 2.001 is absolutely impossible?
  • Prior choices don’t matter much if your data are very informative

Examples

  • “Complete ignorance” can be harder to specify than you think
  • Linear vs. log scale: do we expect the probability of being between 10 and 11 grams to be:
    • ≈ Prob(between 100 and 101 grams) or
    • ≈ Prob(between 100 and 110 grams)
  • Linear vs. inverse scale: if we are waiting for things to happen, do we pick our prior on the time scale (number of minutes per bus) or the rate scale (number of buses per minute)?

Bolker (2008) Fig 4.4: The difficulty of defining an uninformative prior on continuous scales. If we assume that the probabilities are uniform on one scale (linear or logarithmic), they must be non-uniform on the other.
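The bus example can be simulated: a prior that is uniform on the rate scale is far from uniform on the waiting-time scale. A sketch with made-up bounds:

```python
import random

random.seed(1)
# If the rate (buses per minute) is uniform on (0.1, 1), the implied prior
# on waiting time (minutes per bus) is strongly non-uniform.
rates = [random.uniform(0.1, 1.0) for _ in range(100_000)]
times = [1.0 / r for r in rates]

# Compare the prior mass in two equal-width waiting-time bins:
in_1_2 = sum(1 <= t < 2 for t in times) / len(times)
in_9_10 = sum(9 <= t < 10 for t in times) / len(times)
print(in_1_2, in_9_10)  # very different: not uniform on the time scale
```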

  • Discrete hypotheses: subdivision (nest predation example: do we consider species separately, or grouped by higher-level taxon?)

Bolker (2008) Fig 4.3: The difficulty of defining an uninformative prior for discrete hypotheses. Dark gray bars are priors that assume predation by each species is equally likely; light gray bars divide predation by group first, then by species within group.

Improper priors

  • There is no uniform distribution over the real numbers
  • But for Bayesian analysis, we can pretend that there is
    • This is conceptually cool, and usually works out fine
    • Must be able to guarantee that the posterior distribution exists
    • Also need to choose a scale for your uniform prior

Checking priors

  • prior predictive simulation
  • pick parameters from the prior distribution, simulate data, summarize/plot
  • adjust parameters to allow all reasonable outcomes, eliminate ridiculous outcomes
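For a Poisson count model with a Gamma prior on the mean, prior predictive simulation looks like this (the prior parameters here are made up):

```python
import math
import random

random.seed(0)

def rpois(lam):
    """Poisson draw via Knuth's multiplication algorithm (fine for modest lam)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

# Prior predictive simulation: draw lambda from a hypothetical
# Gamma(shape=2, scale=1) prior, then simulate one count from Poisson(lambda).
sim = [rpois(random.gammavariate(2.0, 1.0)) for _ in range(10_000)]
print(min(sim), max(sim), sum(sim) / len(sim))
```

If the simulated counts include ridiculous values (or exclude plausible ones), adjust the prior and repeat.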

Statistical models

  • A statistical model allows us to calculate the likelihood of the data based on parameters
    • Relationships between quantities, e.g.:
      • Y is linearly related to X
      • The variance of Y is linearly related to Z
    • Distributions
      • e.g. Y has a Poisson (or normal, or lognormal) distribution

Making a probability model

Assumptions

  • We need enough assumptions to calculate the likelihood of our data (probability, given the parameters)
  • To make a probability model we need prior distributions for all of the parameters we wish to estimate
  • We then need to make explicit assumptions about how our data are generated, and calculate a likelihood for the data corresponding to any set of parameters

A simple example

  • for a single observation of counts
  • assume the data are Poisson distributed with some mean \(\lambda\)
  • assume the prior on the mean is Gamma distributed
  • we can write down the answer immediately (the Gamma is the conjugate prior for the Poisson, and the integral is a Calc II exercise)
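Concretely: with a Gamma(a, rate = b) prior and one observed count k, the posterior is Gamma(a + k, rate = b + 1). A quick numerical check of this, with made-up values a = 2, b = 1, k = 5:

```python
import math

# Conjugate Gamma-Poisson: Gamma(a, rate=b) prior on lambda, one observed
# count k, posterior Gamma(a + k, rate = b + 1).
a, b, k = 2.0, 1.0, 5               # made-up prior parameters and observation

analytic_mean = (a + k) / (b + 1)   # mean of Gamma(a + k, rate = b + 1)

# Numerical check on a grid: posterior is proportional to likelihood * prior
def unnorm_post(lam):
    prior = lam ** (a - 1) * math.exp(-b * lam)  # Gamma kernel
    lik = lam ** k * math.exp(-lam)              # Poisson kernel (k! cancels)
    return prior * lik

grid = [i * 0.001 for i in range(1, 30_000)]
w = [unnorm_post(lam) for lam in grid]
numeric_mean = sum(lam * wi for lam, wi in zip(grid, w)) / sum(w)
print(analytic_mean, numeric_mean)  # both close to 3.5
```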

MCMC methods

What about hard problems?

  • Bayesian methods are very flexible
  • We can write down reasonable priors, and likelihoods, to cover a wide variety of assumptions and situations
  • Unfortunately, we usually can’t integrate (calculate the denominator of Bayes’ formula)
  • Instead we use Markov chain Monte Carlo methods to sample randomly from the posterior distribution
    • Simple to do, but it takes a long time, and it is hard to know for sure that it’s working

MCMC sampling

  • Rules that ensure that we will visit each point in parameter space in proportion to its posterior probability … eventually
    • Metropolis-Hastings (jumping)
    • Gibbs sampling (sample one parameter, or sets of parameters, at a time)
    • Hamiltonian Monte Carlo (‘shoot’ along trajectories)
  • Efficient sampling moves ‘the right amount’ per step to cover the likely space
  • Checking results
    • Are your steps behaving sensibly? (“Divergences”: HMC-specific)
    • Are your parameters bouncing back and forth rather than going somewhere? (trace plot)
    • Repeat the whole process with a different starting point (in parameter space): do these “chains” converge? (R-hat statistic) (Vats and Knudson 2018; Vehtari et al. 2021)
    • Are you sampling enough (effective sample size)?
  • Checking model fit
    • posterior predictive simulation
      • sample from posterior, simulate, summarize; are important patterns in the data captured by the results?
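A toy Metropolis sampler shows the basic jumping rule. This is a sketch, not production code; the target is the Gamma-Poisson posterior from the simple example above, with made-up values a = 2, b = 1, k = 5:

```python
import math
import random

random.seed(1)

# Unnormalized log-posterior of the Gamma-Poisson model: the normalizing
# constant (Bayes denominator) is never needed, only posterior *ratios*.
def log_unnorm_post(lam, a=2.0, b=1.0, k=5):
    if lam <= 0:
        return -math.inf
    return (a - 1 + k) * math.log(lam) - (b + 1) * lam

cur = 1.0       # arbitrary starting point
chain = []
for _ in range(50_000):
    prop = cur + random.gauss(0.0, 0.8)  # random-walk jump
    # accept with probability min(1, posterior ratio)
    if math.log(random.random()) < log_unnorm_post(prop) - log_unnorm_post(cur):
        cur = prop
    chain.append(cur)

burned = chain[5_000:]                   # discard burn-in
print(sum(burned) / len(burned))         # close to 3.5, the true posterior mean
```

In practice you would run several chains from different starting points and check R-hat and effective sample size rather than trusting a single run.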

Packages

  • Lots of software, including R packages, will do MCMC sampling for you
  • GLMM-like: brms, rstanarm, MCMCglmm, INLA
  • various statistical models: MCMCpack
  • build-your-own: JAGS (wrapped by rjags/r2jags), Stan (wrapped by rethinking), greta, TMB (TMB/tmbstan), NIMBLE

Sampling from the posterior

Great power ⇒ great responsibility

  • Once you have calculated (or estimated) a Bayesian posterior, you can calculate whatever you want!
    • In particular, you can attach a probability to any combination of the parameters
    • You can simulate a model forward in time and get credible intervals not only for the parameters, but also for what you expect to happen
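For instance, given posterior draws for two group means (simulated stand-ins here), the posterior probability of any derived statement is just a proportion of the draws:

```python
import random

random.seed(2)
# Stand-ins for MCMC draws of two group means (made-up values)
mu_a = [random.gauss(1.0, 0.3) for _ in range(20_000)]
mu_b = [random.gauss(1.4, 0.3) for _ in range(20_000)]

# Any derived quantity is just a transformation of the draws:
diff = [b - a for a, b in zip(mu_a, mu_b)]
p_b_greater = sum(d > 0 for d in diff) / len(diff)
print(p_b_greater)  # posterior probability that group B's mean exceeds A's
```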

CRAN Bayesian inference task view

References

Bolker, Benjamin M. 2008. Ecological Models and Data in R. Princeton University Press.
Burnham, Kenneth P., and David R. Anderson. 2002. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer.
Clark, James S. 2020. Models for Ecological Data: An Introduction. Princeton University Press.
Edwards, Don. 1996. “Comment: The First Data Analysis Should Be Journalistic.” Ecological Applications 6 (4): 1090–94.
Elderd, Bret D., Vanja M. Dukic, and Greg Dwyer. 2006. “Uncertainty in Predictions of Disease Spread and Public Health Responses to Bioterrorism and Emerging Diseases.” Proceedings of the National Academy of Sciences 103 (42): 15693–97. https://doi.org/10.1073/pnas.0600816103.
Hespanhol, Luiz, Caio Sain Vallio, Lucíola Menezes Costa, and Bruno T Saragiotto. 2019. “Understanding and Interpreting Confidence and Credible Intervals Around Effect Estimates.” Brazilian Journal of Physical Therapy 23 (4): 290–301. https://doi.org/10.1016/j.bjpt.2018.12.006.
Hobbs, N. Thompson, and Mevin B. Hooten. 2015. Bayesian Models: A Statistical Primer for Ecologists. Princeton, New Jersey: Princeton University Press.
Lemoine, Nathan P. 2019. “Moving Beyond Noninformative Priors: Why and How to Choose Weakly Informative Priors in Bayesian Analyses.” Oikos 128 (7): 912–28. https://doi.org/10.1111/oik.05985.
Ludwig, Donald. 1996. “Uncertainty and the Assessment of Extinction Probabilities.” Ecological Applications 6 (4): 1067–76. https://doi.org/10.2307/2269591.
Makowski, Dominique, Mattan S. Ben-Shachar, S. H. Annabel Chen, and Daniel Lüdecke. 2019. “Indices of Effect Existence and Significance in the Bayesian Framework.” Frontiers in Psychology 10.
McCarthy, M. 2007. Bayesian Methods for Ecology. Cambridge, England: Cambridge University Press.
McCarthy, Michael A., and Pip Masters. 2005. “Profiting from Prior Information in Bayesian Analyses of Ecological Data.” Journal of Applied Ecology 42 (6): 1012–19.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Shi, Haolun, and Guosheng Yin. 2021. “Reconnecting \(p\)-Value and Posterior Probability Under One- and Two-Sided Tests.” The American Statistician 75 (3): 265–75. https://doi.org/10.1080/00031305.2020.1717621.
Vats, Dootika, and Christina Knudson. 2018. “Revisiting the Gelman-Rubin Diagnostic.” arXiv:1812.09384 [Stat], December. https://arxiv.org/abs/1812.09384.
Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2021. “Rank-Normalization, Folding, and Localization: An Improved R-hat for Assessing Convergence of MCMC (with Discussion).” Bayesian Analysis 16 (2): 667–718. https://doi.org/10.1214/20-BA1221.

Last updated: 14 March 2024 19:33