library(ape)
library(lme4)
library(Matrix)
## additional possibilities for PGLMMs ...
library(MCMCglmm)
library(lattice)
library(pez)
## utils
library(coda)
library(broom)
library(dotwhisker)
• The standard problem of phylogenetic comparative methods is to analyze relationships among data where the observations are gathered from nodes (usually tips) of a phylogenetic tree - for example, regression analyses of body temperature as a function of body size for animals within a clade
• More generally, we can frame this in the usual GLMM way as $\begin{split} y & \sim D(\mu,\phi) \\ \mu & = g^{-1}(\eta) = g^{-1}(X \beta + Z b) \\ b & \sim \textrm{MVN}(0,\Sigma) \end{split}$ where the part that makes it specifically phylogenetic is that $$\Sigma$$ captures the phylogenetic correlation. The PC is the correlation among observations due to relatedness; recently diverged taxa have higher correlation than more anciently diverged taxa. In the extreme case of a star phylogeny (all taxa diverged from each other simultaneously at some point in the past) the phylogenetic correlation collapses to a diagonal matrix and we get back to the simple, uncorrelated regression.

Various P(G)LMM (phylogenetic [generalized] linear mixed model] approaches have been proposed. Many depend on Pagel’s lambda transformation, which gives the correlation matrix a particularly simple form (but has been criticized …)

An alternative approach is to model the phylogenetic correlation as a Gaussian process. In particular, suppose that the evolutionary process is a Brownian motion (an almost certainly incorrect/oversimplified model of evolution, but one that many phylogenetic methods are built on). In that case, the phylogenetic variability of a particular observation can be written as the sum of the evolutionary changes that occurred on all of the branches in the phylogeny in its past. If we set up the $$Z$$ matrix appropriately, we can model everything with a sequence of independent errors, rather than having to do fancy things to impose a correlation structure on the random effects.

## Nuts and bolts: from a phylogeny to a $$Z$$ matrix for the GP

library(ape)
set.seed(101)
r <- makeNodeLabel(rtree(4))
plot(r,show.node.label=TRUE)

Information in a phylo object is contained in the edge matrix:

edge: a two-column matrix of mode numeric where each row represents an edge of the tree; the nodes and the tips are symbolized with numbers; the tips are numbered 1, 2, …, and the nodes are numbered after the tips. For each row, the first column gives the ancestor.

t(r$edge) ## [,1] [,2] [,3] [,4] [,5] [,6] ## [1,] 5 6 6 5 7 7 ## [2,] 6 1 2 7 3 4 and a list of edge lengths r$edge.length
## [1] 0.3000548 0.5848666 0.3334671 0.6220120 0.5458286 0.8797957

Inspecting this tree, we can figure out (see $tip.label and $node.label for label-to-number correspondences):

• tips are 1-4, nodes are 5-7
• tip 1 (t1) involves branches 2 (6 $$\to$$ 1) and 1 (5 $$\to$$ 6).
• tip 2 (t3) involves branches 3 (6 $$\to$$ 2) and 1 (5 $$\to$$ 6)
• tip 3 (t2) involves branches 5 (7 $$\to$$ 3) and 4 (5 $$\to$$ 7)
• tip 4 (t4) involves branches 6 (7 $$\to$$ 4) and 4 (5 $$\to$$ 7)

So, for example, we can say that the ‘error’ value corresponding to tip 1 is $$\ell_1 b_1 + \ell_2 b_2$$, where $$\ell_i$$ is the (square root of??) the branch length and the $$b_i$$ are independent, homoscedastic Normal variates. Alternately, the $$Z$$ matrix is

$\begin{pmatrix} \ell_1 & \ell_2 & 0 & 0 & 0 & 0 \\ \ell_1 & 0 & \ell_3 & 0 & 0 & 0 \\ 0 & 0 & 0 & \ell_4 & \ell_5 & 0 \\ 0 & 0 & 0 & \ell_4 & 0 & \ell_6 \end{pmatrix}$ where $$\ell_i$$ is the length of the $$i^\textrm{th}$$ branch, so that the species effects are $$Z b$$.

If we can build the corresponding $$Z$$ matrix, then we can insert it in the lme4 modular model-fitting process (see ?modular).

Here’s a (probably not very efficent) way to construct the Z matrix. (There must be a way to not walk the tree multiple times from every tip …

phylo.to.Z <- function(r) {
ntip <- length(r$tip.label) Z <- Matrix(0.0,ncol=length(r$edge.length),nrow=ntip)
nodes <- (ntip+1):max(r$edge) root <- nodes[!(nodes %in% r$edge[,2])]
for (i in 1:ntip) {
cn <- i  ## current node
while (cn != root) {
ce <- which(r$edge[,2]==cn) ## find current edge Z[i,ce] <- r$edge.length[ce]  ## set Z to branch length
cn <- r$edge[ce,1] ## find previous node } } return(Z) } phylo.to.Z(r) ## 4 x 6 sparse Matrix of class "dgCMatrix" ## ## [1,] 0.3000548 0.5848666 . . . . ## [2,] 0.3000548 . 0.3334671 . . . ## [3,] . . . 0.622012 0.5458286 . ## [4,] . . . 0.622012 . 0.8797957 (This could benefit from the repeated-entry sparse matrix class that Steve Walker wrote.) On the other hand, it only takes a few seconds to run for a 200-species phylogeny (see below). ## constructing a GP PGLMM “All” we need to do is (1) call (g)lFormula, with a formula that includes a (1|phylo) term, to build the basic (wrong) structure; (2) modify the reTrms component of the structure appropriately; (3) go through the rest of the modular procedure for building a (G)LMM. #' split a square (block) matrix into component blocks #' @param M square matrix #' @param ind indices (0,n1,n2,...) giving the endpoint of each block split_blkMat <- function(M,ind) { res <- list() if (length(ind)==1) return(list(M)) for (i in 1:(length(ind)-1)) { v <- (ind[i]+1):ind[i+1] res[[i]] <- M[v,v] } return(res) } #' modify reTrms object #' @param rt a reTrms object #' @param phylo a phylo object (phylogenetic tree) #' @param phylonm name of phylogenetic term in model #' @param phyloZ Z matrix built on branch length modify_phylo_retrms <- function(rt,phylo,phylonm="phylo", phyloZ=phylo.to.Z(phylo)) { ## FIXME: better way to specify phylonm ## need to replace Zt, Lind, Gp, flist, Ztlist ## we have the same number of parameters (theta, lower), ## same number of obs n.edge <- nrow(phylo$edge)
phylo.pos <- which(names(rt$cnms)==phylonm) inds <- c(0,cumsum(sapply(rt$Ztlist,nrow)))
## Zt: substitute phylo Z for previous dummy (scalar-intercept) Z
rt[["Ztlist"]][[phylo.pos]] <- t(phyloZ)
## reconstitute Zt from new Ztlist
rt[["Zt"]] <- do.call(rbind,rt[["Ztlist"]])
## Gp: substitute new # random effects (n.edge) for old # (n.phylo)
Gpdiff <- diff(rt$Gp) ## old numbers Gpdiff_new <- Gpdiff Gpdiff_new[phylo.pos] <- n.edge ## replace rt[["Gp"]] <- as.integer(c(0,cumsum(Gpdiff_new))) ## reconstitute ## Lind: replace phylo block with the same element, just more values Lind_list <- split(rt[["Lind"]],rep(seq_along(Gpdiff),Gpdiff)) Lind_list[[phylo.pos]] <- rep(Lind_list[[phylo.pos]][1],n.edge) rt[["Lind"]] <- unlist(Lind_list) ## Lambdat: replace block-diagonal element in Lambdat with a ## larger diagonal matrix Lambdat_list <- split_blkMat(rt[["Lambdat"]],inds) Lambdat_list[[phylo.pos]] <- Diagonal(n.edge,1.0) rt[["Lambdat"]] <- Matrix::.bdiag(Lambdat_list) ## flist: rt[["flist"]] <- as.list(rt[["flist"]]) rt[["flist"]][[phylonm]] <- factor(paste0("edge_",seq(n.edge))) return(rt) } #' phylo_glmm <- function(formula,data,family,phylo,phyloZ) { glmod <- glFormula(formula=formula,data = data, family = family) glmod$reTrms <- modify_phylo_retrms(glmod$reTrms,phylo, phylonm="phylo",phyloZ) devfun <- do.call(mkGlmerDevfun, glmod) opt <- optimizeGlmer(devfun) devfun <- updateGlmerDevfun(devfun, glmod$reTrms)
opt <- optimizeGlmer(devfun, stage=2)
mkMerMod(environment(devfun), opt, glmod$reTrms, fr = glmod$fr)
}

## example

From chapter 11 of Garamszegi (ed.): data are here

if (!file.exists("data/phylo.nex")) {
dir.create("data")
dest="data/OPM_ch11_data.zip")
setwd("data")
untar("OPM_ch11_data.zip")
setwd("..")
}
dat <- read.table("data/data_pois.txt",header=TRUE)
system.time(phyloZ <- phylo.to.Z(phylo))
##    user  system elapsed
##   5.276   0.012   5.312

Add an observation-level random effect (in case we want to match with MCMCglmm)

dat$obs <- factor(seq(nrow(dat))) phylo_glmm_fit <- phylo_glmm(phen_pois~cofactor+(1|phylo)+(1|obs), data=dat,family=poisson,phylo=phylo, phyloZ=phyloZ) From Garamszegi ch. 11 code examples. nitt <- 5e5 ## was 5e6 inv.phylo <- inverseA(phylo,nodes="TIPS",scale=TRUE) prior <- list(G=list(G1=list(V=1,nu=0.02)),R=list(V=1,nu=0.02)) MCMC_time <- system.time( MCMCglmm_fit <- MCMCglmm(phen_pois~cofactor,random=~phylo, family="poisson",ginverse=list(phylo=inv.phylo$Ainv),
prior=prior,data=dat,nitt=nitt,burnin=1000,
thin=nitt/1000,verbose=FALSE))

MCMCglmm fit takes about 28 minutes … effective sample size is 1000, which means the initial suggested 5 million steps may be overkill ? (However, with only 5e4 parameter sets we get wonky-looking trace plots/effective sample size of only 200, so 5e5 may be necessary …)

ss <- summary(MCMCglmm_fit)
with(ss,rbind(solutions[,1:3],Gcovariances[,1:3],Rcovariances[,1:3]))
##               post.mean     l-95% CI   u-95% CI
## (Intercept) -2.08796523 -2.479168010 -1.6785878
## cofactor     0.25082309  0.229117489  0.2738551
##              0.03959840  0.002052013  0.1047885
##              0.04294357  0.003287147  0.1029546
rbind(coef(summary(phylo_glmm_fit))[,1:2],
cbind(matrix(unlist(VarCorr(phylo_glmm_fit))),NA))
##                Estimate Std. Error
## (Intercept) -2.09051239 0.18813499
## cofactor     0.25032657 0.01117239
##              0.05177132         NA
##              0.04132941         NA

Not exactly identical, but pretty close (priors, etc. etc.) (Getting profile confidence intervals will take some more work, because we can’t use update.merMod transparently within the profile …)

## To do

• Update the tidy method for MCMCglmm_fit to return the standard devs …
• Think about connections to/correspondence with Pagel’s $$\lambda$$, which multiplies the phylogenetic covariance matrix by a factor $$0<\lambda<1$$?
• In an LMM what assumptions do we have to make about residual var, i.e. fix it to a small value?
• When is residual var unidentifiable (e.g. all terminal branch lengths identical)?
• how easy is it to put this into glmmTMB?
• how easily could we implement an O-U process (this would require that Z be recomputed each time with changing $$\alpha$$ [and reference level]: could be considerably more difficult/mess up some of the linear algebra tricks?)