library(ape)
library(lme4)
library(Matrix)
## additional possibilities for PGLMMs ...
library(MCMCglmm)
library(lattice)
library(pez)
## utils
library(coda)
library(broom)
library(dotwhisker)

Various P(G)LMM (phylogenetic [generalized] linear mixed model] approaches have been proposed. Many depend on Pagel’s lambda transformation, which gives the correlation matrix a particularly simple form (but has been criticized …)

An alternative approach is to model the phylogenetic correlation as a Gaussian process. In particular, suppose that the evolutionary process is a Brownian motion (an almost certainly incorrect/oversimplified model of evolution, but one that many phylogenetic methods are built on). In that case, the phylogenetic variability of a particular observation can be written as the sum of the evolutionary changes that occurred on all of the branches in the phylogeny in its past. If we set up the \(Z\) matrix appropriately, we can model everything with a sequence of independent errors, rather than having to do fancy things to impose a correlation structure on the random effects.

Nuts and bolts: from a phylogeny to a \(Z\) matrix for the GP

library(ape)
set.seed(101)
r <- makeNodeLabel(rtree(4))
plot(r,show.node.label=TRUE)

Information in a phylo object is contained in the edge matrix:

edge: a two-column matrix of mode numeric where each row represents an edge of the tree; the nodes and the tips are symbolized with numbers; the tips are numbered 1, 2, …, and the nodes are numbered after the tips. For each row, the first column gives the ancestor.

t(r$edge)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    5    6    6    5    7    7
## [2,]    6    1    2    7    3    4

and a list of edge lengths

r$edge.length
## [1] 0.3000548 0.5848666 0.3334671 0.6220120 0.5458286 0.8797957

Inspecting this tree, we can figure out (see $tip.label and $node.label for label-to-number correspondences):

So, for example, we can say that the ‘error’ value corresponding to tip 1 is \(\ell_1 b_1 + \ell_2 b_2\), where \(\ell_i\) is the (square root of??) the branch length and the \(b_i\) are independent, homoscedastic Normal variates. Alternately, the \(Z\) matrix is

\[ \begin{pmatrix} \ell_1 & \ell_2 & 0 & 0 & 0 & 0 \\ \ell_1 & 0 & \ell_3 & 0 & 0 & 0 \\ 0 & 0 & 0 & \ell_4 & \ell_5 & 0 \\ 0 & 0 & 0 & \ell_4 & 0 & \ell_6 \end{pmatrix} \] where \(\ell_i\) is the length of the \(i^\textrm{th}\) branch, so that the species effects are \(Z b\).

If we can build the corresponding \(Z\) matrix, then we can insert it in the lme4 modular model-fitting process (see ?modular).

Here’s a (probably not very efficent) way to construct the Z matrix. (There must be a way to not walk the tree multiple times from every tip …

phylo.to.Z <- function(r) {
    ntip <- length(r$tip.label)
    Z <- Matrix(0.0,ncol=length(r$edge.length),nrow=ntip)
    nodes <- (ntip+1):max(r$edge)
    root <- nodes[!(nodes %in% r$edge[,2])]
    for (i in 1:ntip) {
        cn <- i  ## current node
        while (cn != root) {
            ce <- which(r$edge[,2]==cn)   ## find current edge
            Z[i,ce] <- r$edge.length[ce]  ## set Z to branch length
            cn <- r$edge[ce,1]            ## find previous node
        }
    }
    return(Z)
}
phylo.to.Z(r)
## 4 x 6 sparse Matrix of class "dgCMatrix"
##                                                                
## [1,] 0.3000548 0.5848666 .         .        .         .        
## [2,] 0.3000548 .         0.3334671 .        .         .        
## [3,] .         .         .         0.622012 0.5458286 .        
## [4,] .         .         .         0.622012 .         0.8797957

(This could benefit from the repeated-entry sparse matrix class that Steve Walker wrote.)

On the other hand, it only takes a few seconds to run for a 200-species phylogeny (see below).

constructing a GP PGLMM

“All” we need to do is (1) call (g)lFormula, with a formula that includes a (1|phylo) term, to build the basic (wrong) structure; (2) modify the reTrms component of the structure appropriately; (3) go through the rest of the modular procedure for building a (G)LMM.

#' split a square (block) matrix into component blocks 
#' @param M square matrix
#' @param ind indices (0,n1,n2,...) giving the endpoint of each block
split_blkMat <- function(M,ind) {
    res <- list()
    if (length(ind)==1) return(list(M))
    for (i in 1:(length(ind)-1)) {
        v <- (ind[i]+1):ind[i+1]
        res[[i]] <- M[v,v]
    }
    return(res)
}

#' modify reTrms object
#' @param rt a reTrms object
#' @param phylo a phylo object (phylogenetic tree)
#' @param phylonm name of phylogenetic term in model
#' @param phyloZ Z matrix built on branch length
modify_phylo_retrms <- function(rt,phylo,phylonm="phylo",
                                phyloZ=phylo.to.Z(phylo)) {
    ## FIXME: better way to specify phylonm
    ## need to replace Zt, Lind, Gp, flist, Ztlist
    ## we have the same number of parameters (theta, lower),
    ##  same number of obs
    n.edge <- nrow(phylo$edge)
    phylo.pos <- which(names(rt$cnms)==phylonm)
    inds <- c(0,cumsum(sapply(rt$Ztlist,nrow)))
    ## Zt: substitute phylo Z for previous dummy (scalar-intercept) Z
    rt[["Ztlist"]][[phylo.pos]] <- t(phyloZ)
    ## reconstitute Zt from new Ztlist
    rt[["Zt"]] <- do.call(rbind,rt[["Ztlist"]])
    ## Gp: substitute new # random effects (n.edge) for old # (n.phylo)
    Gpdiff <- diff(rt$Gp)  ## old numbers
    Gpdiff_new <- Gpdiff
    Gpdiff_new[phylo.pos] <- n.edge  ## replace
    rt[["Gp"]] <- as.integer(c(0,cumsum(Gpdiff_new)))          ## reconstitute
    ## Lind: replace phylo block with the same element, just more values
    Lind_list <- split(rt[["Lind"]],rep(seq_along(Gpdiff),Gpdiff))
    Lind_list[[phylo.pos]] <- rep(Lind_list[[phylo.pos]][1],n.edge)
    rt[["Lind"]] <- unlist(Lind_list)
    ## Lambdat: replace block-diagonal element in Lambdat with a
    ##   larger diagonal matrix
    Lambdat_list <- split_blkMat(rt[["Lambdat"]],inds)
    Lambdat_list[[phylo.pos]] <- Diagonal(n.edge,1.0)
    rt[["Lambdat"]] <- Matrix::.bdiag(Lambdat_list)
    ## flist: 
    rt[["flist"]] <- as.list(rt[["flist"]])
    rt[["flist"]][[phylonm]] <- factor(paste0("edge_",seq(n.edge)))
    return(rt)
}

#'
phylo_glmm <- function(formula,data,family,phylo,phyloZ) {
    glmod <- glFormula(formula=formula,data = data, family = family)
    glmod$reTrms <- modify_phylo_retrms(glmod$reTrms,phylo,
                                        phylonm="phylo",phyloZ)
    devfun <- do.call(mkGlmerDevfun, glmod)
    opt <- optimizeGlmer(devfun)
    devfun <- updateGlmerDevfun(devfun, glmod$reTrms)
    opt <- optimizeGlmer(devfun, stage=2)
    mkMerMod(environment(devfun), opt, glmod$reTrms, fr = glmod$fr)
}

example

From chapter 11 of Garamszegi (ed.): data are here

if (!file.exists("data/phylo.nex")) {
    dir.create("data")
    download.file("http://mpcm-evolution.org/OPM/Chapter11_OPM/data.zip",
                  dest="data/OPM_ch11_data.zip")
    setwd("data")
    untar("OPM_ch11_data.zip")
    setwd("..")
}
phylo <- read.nexus("data/phylo.nex")
dat <- read.table("data/data_pois.txt",header=TRUE)
system.time(phyloZ <- phylo.to.Z(phylo))
##    user  system elapsed 
##   5.276   0.012   5.312

Add an observation-level random effect (in case we want to match with MCMCglmm)

dat$obs <- factor(seq(nrow(dat)))
phylo_glmm_fit <- phylo_glmm(phen_pois~cofactor+(1|phylo)+(1|obs),
                             data=dat,family=poisson,phylo=phylo,
                             phyloZ=phyloZ)

From Garamszegi ch. 11 code examples.

nitt <- 5e5 ## was 5e6
inv.phylo <- inverseA(phylo,nodes="TIPS",scale=TRUE)
prior <- list(G=list(G1=list(V=1,nu=0.02)),R=list(V=1,nu=0.02))
MCMC_time <- system.time(
    MCMCglmm_fit <- MCMCglmm(phen_pois~cofactor,random=~phylo,
                       family="poisson",ginverse=list(phylo=inv.phylo$Ainv),
                       prior=prior,data=dat,nitt=nitt,burnin=1000,
                       thin=nitt/1000,verbose=FALSE))

MCMCglmm fit takes about 28 minutes … effective sample size is 1000, which means the initial suggested 5 million steps may be overkill ? (However, with only 5e4 parameter sets we get wonky-looking trace plots/effective sample size of only 200, so 5e5 may be necessary …)

ss <- summary(MCMCglmm_fit)
with(ss,rbind(solutions[,1:3],Gcovariances[,1:3],Rcovariances[,1:3]))
##               post.mean     l-95% CI   u-95% CI
## (Intercept) -2.08796523 -2.479168010 -1.6785878
## cofactor     0.25082309  0.229117489  0.2738551
##              0.03959840  0.002052013  0.1047885
##              0.04294357  0.003287147  0.1029546
rbind(coef(summary(phylo_glmm_fit))[,1:2],
      cbind(matrix(unlist(VarCorr(phylo_glmm_fit))),NA))
##                Estimate Std. Error
## (Intercept) -2.09051239 0.18813499
## cofactor     0.25032657 0.01117239
##              0.05177132         NA
##              0.04132941         NA

Not exactly identical, but pretty close (priors, etc. etc.) (Getting profile confidence intervals will take some more work, because we can’t use update.merMod transparently within the profile …)

To do