library(ape)
library(lme4)
library(Matrix)
## additional possibilities for PGLMMs ...
library(MCMCglmm)
library(lattice)
library(pez)
## utils
library(coda)
library(broom)
library(dotwhisker)
Various P(G)LMM (phylogenetic [generalized] linear mixed model] approaches have been proposed. Many depend on Pagel’s lambda transformation, which gives the correlation matrix a particularly simple form (but has been criticized …)
An alternative approach is to model the phylogenetic correlation as a Gaussian process. In particular, suppose that the evolutionary process is a Brownian motion (an almost certainly incorrect/oversimplified model of evolution, but one that many phylogenetic methods are built on). In that case, the phylogenetic variability of a particular observation can be written as the sum of the evolutionary changes that occurred on all of the branches in the phylogeny in its past. If we set up the \(Z\) matrix appropriately, we can model everything with a sequence of independent errors, rather than having to do fancy things to impose a correlation structure on the random effects.
library(ape)
set.seed(101)
r <- makeNodeLabel(rtree(4))
plot(r,show.node.label=TRUE)
Information in a phylo
object is contained in the edge matrix:
edge: a two-column matrix of mode numeric where each row represents an edge of the tree; the nodes and the tips are symbolized with numbers; the tips are numbered 1, 2, …, and the nodes are numbered after the tips. For each row, the first column gives the ancestor.
t(r$edge)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 5 6 6 5 7 7
## [2,] 6 1 2 7 3 4
and a list of edge lengths
r$edge.length
## [1] 0.3000548 0.5848666 0.3334671 0.6220120 0.5458286 0.8797957
Inspecting this tree, we can figure out (see $tip.label
and $node.label
for label-to-number correspondences):
t1
) involves branches 2 (6 \(\to\) 1) and 1 (5 \(\to\) 6).t3
) involves branches 3 (6 \(\to\) 2) and 1 (5 \(\to\) 6)t2
) involves branches 5 (7 \(\to\) 3) and 4 (5 \(\to\) 7)t4
) involves branches 6 (7 \(\to\) 4) and 4 (5 \(\to\) 7)So, for example, we can say that the ‘error’ value corresponding to tip 1 is \(\ell_1 b_1 + \ell_2 b_2\), where \(\ell_i\) is the (square root of??) the branch length and the \(b_i\) are independent, homoscedastic Normal variates. Alternately, the \(Z\) matrix is
\[ \begin{pmatrix} \ell_1 & \ell_2 & 0 & 0 & 0 & 0 \\ \ell_1 & 0 & \ell_3 & 0 & 0 & 0 \\ 0 & 0 & 0 & \ell_4 & \ell_5 & 0 \\ 0 & 0 & 0 & \ell_4 & 0 & \ell_6 \end{pmatrix} \] where \(\ell_i\) is the length of the \(i^\textrm{th}\) branch, so that the species effects are \(Z b\).
If we can build the corresponding \(Z\) matrix, then we can insert it in the lme4
modular model-fitting process (see ?modular
).
Here’s a (probably not very efficent) way to construct the Z matrix. (There must be a way to not walk the tree multiple times from every tip …
phylo.to.Z <- function(r) {
ntip <- length(r$tip.label)
Z <- Matrix(0.0,ncol=length(r$edge.length),nrow=ntip)
nodes <- (ntip+1):max(r$edge)
root <- nodes[!(nodes %in% r$edge[,2])]
for (i in 1:ntip) {
cn <- i ## current node
while (cn != root) {
ce <- which(r$edge[,2]==cn) ## find current edge
Z[i,ce] <- r$edge.length[ce] ## set Z to branch length
cn <- r$edge[ce,1] ## find previous node
}
}
return(Z)
}
phylo.to.Z(r)
## 4 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] 0.3000548 0.5848666 . . . .
## [2,] 0.3000548 . 0.3334671 . . .
## [3,] . . . 0.622012 0.5458286 .
## [4,] . . . 0.622012 . 0.8797957
(This could benefit from the repeated-entry sparse matrix class that Steve Walker wrote.)
On the other hand, it only takes a few seconds to run for a 200-species phylogeny (see below).
“All” we need to do is (1) call (g)lFormula
, with a formula that includes a (1|phylo)
term, to build the basic (wrong) structure; (2) modify the reTrms
component of the structure appropriately; (3) go through the rest of the modular procedure for building a (G)LMM.
#' split a square (block) matrix into component blocks
#' @param M square matrix
#' @param ind indices (0,n1,n2,...) giving the endpoint of each block
split_blkMat <- function(M,ind) {
res <- list()
if (length(ind)==1) return(list(M))
for (i in 1:(length(ind)-1)) {
v <- (ind[i]+1):ind[i+1]
res[[i]] <- M[v,v]
}
return(res)
}
#' modify reTrms object
#' @param rt a reTrms object
#' @param phylo a phylo object (phylogenetic tree)
#' @param phylonm name of phylogenetic term in model
#' @param phyloZ Z matrix built on branch length
modify_phylo_retrms <- function(rt,phylo,phylonm="phylo",
phyloZ=phylo.to.Z(phylo)) {
## FIXME: better way to specify phylonm
## need to replace Zt, Lind, Gp, flist, Ztlist
## we have the same number of parameters (theta, lower),
## same number of obs
n.edge <- nrow(phylo$edge)
phylo.pos <- which(names(rt$cnms)==phylonm)
inds <- c(0,cumsum(sapply(rt$Ztlist,nrow)))
## Zt: substitute phylo Z for previous dummy (scalar-intercept) Z
rt[["Ztlist"]][[phylo.pos]] <- t(phyloZ)
## reconstitute Zt from new Ztlist
rt[["Zt"]] <- do.call(rbind,rt[["Ztlist"]])
## Gp: substitute new # random effects (n.edge) for old # (n.phylo)
Gpdiff <- diff(rt$Gp) ## old numbers
Gpdiff_new <- Gpdiff
Gpdiff_new[phylo.pos] <- n.edge ## replace
rt[["Gp"]] <- as.integer(c(0,cumsum(Gpdiff_new))) ## reconstitute
## Lind: replace phylo block with the same element, just more values
Lind_list <- split(rt[["Lind"]],rep(seq_along(Gpdiff),Gpdiff))
Lind_list[[phylo.pos]] <- rep(Lind_list[[phylo.pos]][1],n.edge)
rt[["Lind"]] <- unlist(Lind_list)
## Lambdat: replace block-diagonal element in Lambdat with a
## larger diagonal matrix
Lambdat_list <- split_blkMat(rt[["Lambdat"]],inds)
Lambdat_list[[phylo.pos]] <- Diagonal(n.edge,1.0)
rt[["Lambdat"]] <- Matrix::.bdiag(Lambdat_list)
## flist:
rt[["flist"]] <- as.list(rt[["flist"]])
rt[["flist"]][[phylonm]] <- factor(paste0("edge_",seq(n.edge)))
return(rt)
}
#'
phylo_glmm <- function(formula,data,family,phylo,phyloZ) {
glmod <- glFormula(formula=formula,data = data, family = family)
glmod$reTrms <- modify_phylo_retrms(glmod$reTrms,phylo,
phylonm="phylo",phyloZ)
devfun <- do.call(mkGlmerDevfun, glmod)
opt <- optimizeGlmer(devfun)
devfun <- updateGlmerDevfun(devfun, glmod$reTrms)
opt <- optimizeGlmer(devfun, stage=2)
mkMerMod(environment(devfun), opt, glmod$reTrms, fr = glmod$fr)
}
From chapter 11 of Garamszegi (ed.): data are here
if (!file.exists("data/phylo.nex")) {
dir.create("data")
download.file("http://mpcm-evolution.org/OPM/Chapter11_OPM/data.zip",
dest="data/OPM_ch11_data.zip")
setwd("data")
untar("OPM_ch11_data.zip")
setwd("..")
}
phylo <- read.nexus("data/phylo.nex")
dat <- read.table("data/data_pois.txt",header=TRUE)
system.time(phyloZ <- phylo.to.Z(phylo))
## user system elapsed
## 5.276 0.012 5.312
Add an observation-level random effect (in case we want to match with MCMCglmm
)
dat$obs <- factor(seq(nrow(dat)))
phylo_glmm_fit <- phylo_glmm(phen_pois~cofactor+(1|phylo)+(1|obs),
data=dat,family=poisson,phylo=phylo,
phyloZ=phyloZ)
From Garamszegi ch. 11 code examples.
nitt <- 5e5 ## was 5e6
inv.phylo <- inverseA(phylo,nodes="TIPS",scale=TRUE)
prior <- list(G=list(G1=list(V=1,nu=0.02)),R=list(V=1,nu=0.02))
MCMC_time <- system.time(
MCMCglmm_fit <- MCMCglmm(phen_pois~cofactor,random=~phylo,
family="poisson",ginverse=list(phylo=inv.phylo$Ainv),
prior=prior,data=dat,nitt=nitt,burnin=1000,
thin=nitt/1000,verbose=FALSE))
MCMCglmm
fit takes about 28 minutes … effective sample size is 1000, which means the initial suggested 5 million steps may be overkill ? (However, with only 5e4 parameter sets we get wonky-looking trace plots/effective sample size of only 200, so 5e5 may be necessary …)
ss <- summary(MCMCglmm_fit)
with(ss,rbind(solutions[,1:3],Gcovariances[,1:3],Rcovariances[,1:3]))
## post.mean l-95% CI u-95% CI
## (Intercept) -2.08796523 -2.479168010 -1.6785878
## cofactor 0.25082309 0.229117489 0.2738551
## 0.03959840 0.002052013 0.1047885
## 0.04294357 0.003287147 0.1029546
rbind(coef(summary(phylo_glmm_fit))[,1:2],
cbind(matrix(unlist(VarCorr(phylo_glmm_fit))),NA))
## Estimate Std. Error
## (Intercept) -2.09051239 0.18813499
## cofactor 0.25032657 0.01117239
## 0.05177132 NA
## 0.04132941 NA
Not exactly identical, but pretty close (priors, etc. etc.) (Getting profile confidence intervals will take some more work, because we can’t use update.merMod
transparently within the profile …)
tidy
method for MCMCglmm_fit
to return the standard devs …glmmTMB
?