Brain dump: phylogenetic comparative methods in the lme4-verse

library(ape)
library(lme4)
library(Matrix)
## additional possibilities for PGLMMs ...
library(MCMCglmm)
library(lattice)
library(pez)
## utils
library(coda)
library(broom)
library(dotwhisker)

The standard problem of phylogenetic comparative methods is to analyze relationships among data where the observations are gathered from nodes (usually tips) of a phylogenetic tree - for example, regression analyses of body temperature as a function of body size for animals within a clade
More generally, we can frame this in the usual GLMM way as \[ \begin{split} y & \sim D(\mu,\phi) \\ \mu & = g^{-1}(\eta) = g^{-1}(X \beta + Z b) \\ b & \sim \textrm{MVN}(0,\Sigma) \end{split} \] where the part that makes it specifically phylogenetic is that $\Sigma$ captures the phylogenetic correlation. The PC is the correlation among observations due to relatedness; recently diverged taxa have higher correlation than more anciently diverged taxa. In the extreme case of a star phylogeny (all taxa diverged from each other simultaneously at some point in the past) the phylogenetic correlation collapses to a diagonal matrix and we get back to the simple, uncorrelated regression.

Various P(G)LMM (phylogenetic [generalized] linear mixed model] approaches have been proposed. Many depend on Pagel’s lambda transformation, which gives the correlation matrix a particularly simple form (but has been criticized …)

An alternative approach is to model the phylogenetic correlation as a Gaussian process. In particular, suppose that the evolutionary process is a Brownian motion (an almost certainly incorrect/oversimplified model of evolution, but one that many phylogenetic methods are built on). In that case, the phylogenetic variability of a particular observation can be written as the sum of the evolutionary changes that occurred on all of the branches in the phylogeny in its past. If we set up the $Z$ matrix appropriately, we can model everything with a sequence of independent errors, rather than having to do fancy things to impose a correlation structure on the random effects.

Nuts and bolts: from a phylogeny to a $Z$ matrix for the GP

library(ape)
set.seed(101)
r <- makeNodeLabel(rtree(4))
plot(r,show.node.label=TRUE)

Information in a phylo object is contained in the edge matrix:

edge: a two-column matrix of mode numeric where each row represents an edge of the tree; the nodes and the tips are symbolized with numbers; the tips are numbered 1, 2, …, and the nodes are numbered after the tips. For each row, the first column gives the ancestor.

t(r$edge)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    5    6    6    5    7    7
## [2,]    6    1    2    7    3    4

and a list of edge lengths

r$edge.length

## [1] 0.3000548 0.5848666 0.3334671 0.6220120 0.5458286 0.8797957

Inspecting this tree, we can figure out (see $tip.label and $node.label for label-to-number correspondences):

tips are 1-4, nodes are 5-7
tip 1 (t1) involves branches 2 (6 $\to$ 1) and 1 (5 $\to$ 6).
tip 2 (t3) involves branches 3 (6 $\to$ 2) and 1 (5 $\to$ 6)
tip 3 (t2) involves branches 5 (7 $\to$ 3) and 4 (5 $\to$ 7)
tip 4 (t4) involves branches 6 (7 $\to$ 4) and 4 (5 $\to$ 7)

So, for example, we can say that the ‘error’ value corresponding to tip 1 is $\ell_1 b_1 + \ell_2 b_2$, where $\ell_i$ is the (square root of??) the branch length and the $b_i$ are independent, homoscedastic Normal variates. Alternately, the $Z$ matrix is

\[ \begin{pmatrix} \ell_1 & \ell_2 & 0 & 0 & 0 & 0 \\ \ell_1 & 0 & \ell_3 & 0 & 0 & 0 \\ 0 & 0 & 0 & \ell_4 & \ell_5 & 0 \\ 0 & 0 & 0 & \ell_4 & 0 & \ell_6 \end{pmatrix} \] where $\ell_i$ is the length of the $i^\textrm{th}$ branch, so that the species effects are $Z b$.

If we can build the corresponding $Z$ matrix, then we can insert it in the lme4 modular model-fitting process (see ?modular).

Here’s a (probably not very efficent) way to construct the Z matrix. (There must be a way to not walk the tree multiple times from every tip …

phylo.to.Z <- function(r) {
    ntip <- length(r$tip.label)
    Z <- Matrix(0.0,ncol=length(r$edge.length),nrow=ntip)
    nodes <- (ntip+1):max(r$edge)
    root <- nodes[!(nodes %in% r$edge[,2])]
    for (i in 1:ntip) {
        cn <- i  ## current node
        while (cn != root) {
            ce <- which(r$edge[,2]==cn)   ## find current edge
            Z[i,ce] <- r$edge.length[ce]  ## set Z to branch length
            cn <- r$edge[ce,1]            ## find previous node
        }
    }
    return(Z)
}
phylo.to.Z(r)

## 4 x 6 sparse Matrix of class "dgCMatrix"
##                                                                
## [1,] 0.3000548 0.5848666 .         .        .         .        
## [2,] 0.3000548 .         0.3334671 .        .         .        
## [3,] .         .         .         0.622012 0.5458286 .        
## [4,] .         .         .         0.622012 .         0.8797957

(This could benefit from the repeated-entry sparse matrix class that Steve Walker wrote.)

On the other hand, it only takes a few seconds to run for a 200-species phylogeny (see below).

constructing a GP PGLMM

“All” we need to do is (1) call (g)lFormula, with a formula that includes a (1|phylo) term, to build the basic (wrong) structure; (2) modify the reTrms component of the structure appropriately; (3) go through the rest of the modular procedure for building a (G)LMM.

#' split a square (block) matrix into component blocks 
#' @param M square matrix
#' @param ind indices (0,n1,n2,...) giving the endpoint of each block
split_blkMat <- function(M,ind) {
    res <- list()
    if (length(ind)==1) return(list(M))
    for (i in 1:(length(ind)-1)) {
        v <- (ind[i]+1):ind[i+1]
        res[[i]] <- M[v,v]
    }
    return(res)
}

#' modify reTrms object
#' @param rt a reTrms object
#' @param phylo a phylo object (phylogenetic tree)
#' @param phylonm name of phylogenetic term in model
#' @param phyloZ Z matrix built on branch length
modify_phylo_retrms <- function(rt,phylo,phylonm="phylo",
                                phyloZ=phylo.to.Z(phylo)) {
    ## FIXME: better way to specify phylonm
    ## need to replace Zt, Lind, Gp, flist, Ztlist
    ## we have the same number of parameters (theta, lower),
    ##  same number of obs
    n.edge <- nrow(phylo$edge)
    phylo.pos <- which(names(rt$cnms)==phylonm)
    inds <- c(0,cumsum(sapply(rt$Ztlist,nrow)))
    ## Zt: substitute phylo Z for previous dummy (scalar-intercept) Z
    rt[["Ztlist"]][[phylo.pos]] <- t(phyloZ)
    ## reconstitute Zt from new Ztlist
    rt[["Zt"]] <- do.call(rbind,rt[["Ztlist"]])
    ## Gp: substitute new # random effects (n.edge) for old # (n.phylo)
    Gpdiff <- diff(rt$Gp)  ## old numbers
    Gpdiff_new <- Gpdiff
    Gpdiff_new[phylo.pos] <- n.edge  ## replace
    rt[["Gp"]] <- as.integer(c(0,cumsum(Gpdiff_new)))          ## reconstitute
    ## Lind: replace phylo block with the same element, just more values
    Lind_list <- split(rt[["Lind"]],rep(seq_along(Gpdiff),Gpdiff))
    Lind_list[[phylo.pos]] <- rep(Lind_list[[phylo.pos]][1],n.edge)
    rt[["Lind"]] <- unlist(Lind_list)
    ## Lambdat: replace block-diagonal element in Lambdat with a
    ##   larger diagonal matrix
    Lambdat_list <- split_blkMat(rt[["Lambdat"]],inds)
    Lambdat_list[[phylo.pos]] <- Diagonal(n.edge,1.0)
    rt[["Lambdat"]] <- Matrix::.bdiag(Lambdat_list)
    ## flist: 
    rt[["flist"]] <- as.list(rt[["flist"]])
    rt[["flist"]][[phylonm]] <- factor(paste0("edge_",seq(n.edge)))
    return(rt)
}

#'
phylo_glmm <- function(formula,data,family,phylo,phyloZ) {
    glmod <- glFormula(formula=formula,data = data, family = family)
    glmod$reTrms <- modify_phylo_retrms(glmod$reTrms,phylo,
                                        phylonm="phylo",phyloZ)
    devfun <- do.call(mkGlmerDevfun, glmod)
    opt <- optimizeGlmer(devfun)
    devfun <- updateGlmerDevfun(devfun, glmod$reTrms)
    opt <- optimizeGlmer(devfun, stage=2)
    mkMerMod(environment(devfun), opt, glmod$reTrms, fr = glmod$fr)
}

example

From chapter 11 of Garamszegi (ed.): data are here

if (!file.exists("data/phylo.nex")) {
    dir.create("data")
    download.file("http://mpcm-evolution.org/OPM/Chapter11_OPM/data.zip",
                  dest="data/OPM_ch11_data.zip")
    setwd("data")
    untar("OPM_ch11_data.zip")
    setwd("..")
}
phylo <- read.nexus("data/phylo.nex")
dat <- read.table("data/data_pois.txt",header=TRUE)

system.time(phyloZ <- phylo.to.Z(phylo))

##    user  system elapsed 
##   5.276   0.012   5.312

Add an observation-level random effect (in case we want to match with MCMCglmm)

dat$obs <- factor(seq(nrow(dat)))
phylo_glmm_fit <- phylo_glmm(phen_pois~cofactor+(1|phylo)+(1|obs),
                             data=dat,family=poisson,phylo=phylo,
                             phyloZ=phyloZ)

From Garamszegi ch. 11 code examples.

nitt <- 5e5 ## was 5e6
inv.phylo <- inverseA(phylo,nodes="TIPS",scale=TRUE)
prior <- list(G=list(G1=list(V=1,nu=0.02)),R=list(V=1,nu=0.02))
MCMC_time <- system.time(
    MCMCglmm_fit <- MCMCglmm(phen_pois~cofactor,random=~phylo,
                       family="poisson",ginverse=list(phylo=inv.phylo$Ainv),
                       prior=prior,data=dat,nitt=nitt,burnin=1000,
                       thin=nitt/1000,verbose=FALSE))

MCMCglmm fit takes about 28 minutes … effective sample size is 1000, which means the initial suggested 5 million steps may be overkill ? (However, with only 5e4 parameter sets we get wonky-looking trace plots/effective sample size of only 200, so 5e5 may be necessary …)

ss <- summary(MCMCglmm_fit)
with(ss,rbind(solutions[,1:3],Gcovariances[,1:3],Rcovariances[,1:3]))

##               post.mean     l-95% CI   u-95% CI
## (Intercept) -2.08796523 -2.479168010 -1.6785878
## cofactor     0.25082309  0.229117489  0.2738551
##              0.03959840  0.002052013  0.1047885
##              0.04294357  0.003287147  0.1029546

rbind(coef(summary(phylo_glmm_fit))[,1:2],
      cbind(matrix(unlist(VarCorr(phylo_glmm_fit))),NA))

##                Estimate Std. Error
## (Intercept) -2.09051239 0.18813499
## cofactor     0.25032657 0.01117239
##              0.05177132         NA
##              0.04132941         NA

Not exactly identical, but pretty close (priors, etc. etc.) (Getting profile confidence intervals will take some more work, because we can’t use update.merMod transparently within the profile …)

To do

Update the tidy method for MCMCglmm_fit to return the standard devs …
Think about connections to/correspondence with Pagel’s $\lambda$, which multiplies the phylogenetic covariance matrix by a factor $0<\lambda<1$?
In an LMM what assumptions do we have to make about residual var, i.e. fix it to a small value?
When is residual var unidentifiable (e.g. all terminal branch lengths identical)?
how easy is it to put this into glmmTMB?
how easily could we implement an O-U process (this would require that Z be recomputed each time with changing $\alpha$ [and reference level]: could be considerably more difficult/mess up some of the linear algebra tricks?)

Brain dump: phylogenetic comparative methods in the lme4-verse

Ben Bolker

13:07 07 June 2017

Nuts and bolts: from a phylogeny to a \(Z\) matrix for the GP

constructing a GP PGLMM

example

To do