models for testing

Introduction

How should we think about the connection between numbers of (1) infected people in the population (2) tests done (3) cases reported?

Most generally/mechanistically, on any day we have some number of people who are on day \(\tau\) of their infectious period. On a given day, an infectious (\(I\)) person may be subclinical (asymptomatic or so mildly symptomatic that they don’t report symptoms); mild; severe (hospitalized); or very severe (ICU/ventilator). The progression among these stages will be different for different people (although in general they will progress toward greater severity). The more severe the symptoms, the more likely someone is to be tested.

Testing will also depend on how they got infected, e.g. imported cases and close contacts are more likely to be tested. (An asymptomatic, community-infected person will basically never be tested.)

In a complete agent-based model (or a model based on a complex compartmental structure), we could keep track of people moving through all of these categories and assign them probabilities of being tested. In a sufficiently realistic model we could even take into account detailed testing criteria and phenomena such as limited testing resources (so that, e.g., low-risk, mild cases would never be tested).

There are two extreme scenarios that are easy to reason about.

if testing is random, the proportion of positive tests should be the same as the prevalence (or incidence - I’m not being precise about whether we are measuring a flow or a stock, but I don’t think it makes a difference), regardless of either. Increasing the number of tests when the prevalence is constant leads to a proportional increase in the number of cases. The proportion of positive tests is \(i\) and the number of cases (i.e., positive tests) is \(iT\). (I’m using \(i\), \(t\) for proportion of the population infected or tested and \(I\), \(T\) as the number infected or tested: \(I=iN\), \(T=tN\).) If the prevalence is increasing exponentially with rate \(r_1\) and testing is increasing at rate \(r_2\), the number of cases increases at rate \(r_1 + r_2\) (and doubling time \(\log(2)/(r1+r2)\)).
if testing is perfectly focused on infected people, then tests are 100% positive until we run out of infected people (i.e. proportion of pop tested > prevalence). The proportion of positive tests is \(\min(1,i/t)\) and the number of cases is \(\min(T,I)\).

Now let’s consider that in general we test people in (approximate) order of their probability of infection (ignoring tests skipped because of constraints). We can think about a distribution in the population of probability of infection (higher for known contacts of infected people, travellers from high-risk areas, etc.), and a distribution in the probability of testing (which ideally lines up with the risk). We could think about having two distributions, but these can’t really be separated (FIXME: explain better!).

Machinery

Let’s use a Beta distribution with mean equal to prevalence \(i\) as the distribution of ‘probability infected’. (An alternative that might be more analytically tractable is the Kumaraswamy distribution.) As the dispersion \(\gamma=1/(a+b)\) goes to infinity we end up with point masses \(1-i\) on 0 and \(i\) on 1; as it goes to 0 we end up with a point mass on \(i\). If we always test ‘from the top down’ (i.e. calculate the mean value of the top fraction \(t = T/N\) of the population distribution), these two extreme cases correspond to perfect testing (cases=\(\min(T,I)\)) and random testing (cases=\(\textrm{Binomial}(i,T)\)).

If \(B\) is the Beta distribution with mean \(i\) and dispersion \(\gamma\), \(\Phi_B\) is the CDF, and \(Q\) is the inverse CDF (i.e., the quantile function) then our expected proportion positive from testing a fraction \(t=T/N\) is

\[ \frac{\left(\int_{Q(1-t)}^1 B(y,\phi) y \, dy\right)}{1-\Phi_B(Q(1-t))} = \frac{\left(\int_{Q(1-t)}^1 B(y,\phi) y \, dy\right)}{t} \]

i.e. the mean of the upper tail of the infection-probability distribution.

We can get a little bit farther analytically. Given that \(B(x,a,b) \propto x^{a-1} (1-x)^{b-1}\), the mean is \(a/(a+b)=i \to a \phi =i \to a = i/\phi, b=(a+b)-a = 1/\phi - i/\phi = (1-i)/\phi\). We can show that \(B(x,a,b) \cdot x = a/(a+b) B(x,a+1,b)\), so we should be able to do the integral directly by computing the appropriate CDF (or complementary CDF) of the Beta distribution.

\(B(x,a,b) =\frac{1}{\mathcal{B}(a,b)} x^{a-1} (1-x)^{b-1}\), where \(\mathcal{B}(a,b)\) is the beta function, which satisfies \[\mathcal{B}(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\,.\] It follows that \[ \begin{aligned} B(x,a,b)x = \frac{1}{\mathcal{B}(a,b)} x^{a} (1-x)^{b-1} &= \frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)}\times \frac{1}{\mathcal{B}(a+1,b)} x^{a} (1-x)^{b-1}\\ &=\frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)} B(x,a+1,b)\,, \end{aligned} \] which, using the identity \(\Gamma(z+1) = z\Gamma(z)\), simplifies to \[ \begin{aligned} B(x,a,b)x &=\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)}\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} B(x,a+1,b)\\ &= \frac{a\Gamma(a)\Gamma(b)}{(a+b)\Gamma(a+b)}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} B(x,a+1,b) = \frac{a}{a+b}B(x,a+1,b)\,. \end{aligned} \] Since we set the mean \(a/(a+b)\) of the distribution \(B(x,a,b)\) to \(i\), we have \(B(x,a,b)x = iB(x,a+1,b)\).

Can we do a little bit more here since we’re integrating this from \(Q(a,b)\) to 1 … ?

In order to make the dispersion parameter a little more interpretable we will transform the parameter \(\gamma\) from \((0,\infty)\) to a value \(\phi = 1-\exp(-\gamma) \in (0,1) (i.e. \gamma=-\log(1-\phi))\) (previously: \(\phi=-\log(1-\gamma)\) (i.e. \(\gamma=1-\exp(-\phi)\))), so that 0 corresponds to random testing and 1 corresponds to perfectly focused testing.

If we wanted to get more realistic/detailed we could relax the assumption of a Beta distribution and instead characterize the distribution of probability infected as a mixture of types (general population, symptomatic, travel history, etc.), with the constraint of a mean of \(i\)

Graphical example

Suppose the population incidence is 25% and we test 25% of the population. Our \(\phi\) parameter is 0.2211992 (dispersion of the Beta are 0.25 and shape parameters of the Beta are {1, 3}). The lower bound of the upper 25% tail is 0.37; the integral of \(x\) under that portion of the curve is 0.132, so the test positivity is round(qq/0.25).

Correlation

What is the relationship between \(\phi\) and the correlation between testing and infection?

We need

\[ \newcommand{\var}{\textrm{Var}} \frac{E[(\textrm{tested} - \overline{\textrm{tested}})\cdot - (\textrm{infected} - \overline{\textrm{infected}})]}{\sqrt{\var(\textrm{tested})} \cdot \var(\textrm{infected})} \]

The probability of testing is 0 below the threshold and 1 above it; thus this distribution should be a Bernoulli with weight \(1-t\) on zero and \(t\) on 1. Thus the numerator is the value we’ve already computed for the numerator of the proportion-positive value (\(\int_{1-t}^1 x \cdot \textrm{Beta(x)}\,dx\)), minus the product of the testing and infection means (\(= ti\), using the usual identity \(E[(A-\bar A)(B-\bar B)] = E[AB] - \bar A \bar B\)). The variance of testing probability is (\((1-t)\cdot 0^2 + t \cdot 1^2 = t\)); the variance of infection probability is the variance of the Beta distribution.

inf_test_corr <- function(i, t, phi) {
    gamma <- -log(1-phi)
    a <- i/gamma; b <- (1-i)/gamma
    lwr <- qbeta(t, a, b, lower.tail = FALSE)  ## hope we don't need robust version in covid19testing_funs.R
    num <- pbeta(lwr,a+1,b,lower.tail=FALSE)*(a/(a+b))-t*i
    var_test <- t*(1-t)
    var_inf <- a*b/((a+b)^2*(a+b+1)) ## Wikipedia
    num/sqrt(var_test*var_inf)
}
nn <- 101
phivec <- seq(0, 1, length.out = nn)
cordf <- data.frame(i = rep(c(0.25, 0.1), c(nn, 2*nn)),
                    t = rep(c(0.25, 0.05, 0.01), each = nn),
                    phi = rep(phivec, 3))
cordf$cor <- with(cordf, inf_test_corr(i,t,phi))

## Warning in qbeta(t, a, b, lower.tail = FALSE): NaNs produced

cordf$cat <- with(cordf, sprintf("i = %1.2f, t = %1.2f", i, t))
cordf$cat <- factor(cordf$cat, levels = unique(cordf$cat))
ggplot(cordf, aes(phi, cor, colour = cat)) + geom_line() +
    geom_pointrange(data = numcor, aes(y=cor, ymin = lwr, ymax = upr),
                    colour = "blue", size = 0.1)

## Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_line()`).

?? Is it true that correlation decreases with large \(\phi\) when testing and prevalence are low … ?? (Seems too large an effect to be due to numeric problems)
SZ: current investigation indicates that the “decrease” is caused by the bimodality of Beta when \(\phi\) is large, when \(t\) is small, as \(t\) increase, the test positivity decrease rapidly. But I doubt currently if this is decrease is determined by monotocity (1st order derivative) of the beta distribution curve or together with concave (2nd order derivative) or even more complicated behaviour.
?? Is there a simple expression for cor as a function of \(\phi\) or vice versa? (probably not but ??)

Numerical examples

par(las=1,bty="l")
curve(prop_pos_test(i=0.01,t=0.001,x),from=0.001,to=0.999,
      xlab=expression("testing focus"~(phi)),
      ylab="proportion of positive tests",
      main=c("1% prevalence, 0.1% testing"),
      ylim=c(0,1))
abline(h=0.01,lty=2)

What happens as we test more people than are actually infected in the population?

par(las=1,bty="l")
curve(prop_pos_test(t=x,i=0.01,phi=0.5),
      from=0.001, to=0.4,log="x",xlab="proportion tested",
      ylab="prop positive tests",
      main=expression(list("prevalence=1%",phi==0.5)))
abline(v=0.01,lty=2)

dd <- (expand.grid(time=1:25,phi=c(0.001,0.2,0.5,0.8,0.999))
    %>% as_tibble()
    %>% mutate(inc=0.001*exp(log(2)/3*time),
               tests=1e-4*exp(log(2)/4*time),
               pos_prop=prop_pos_test(inc,tests,phi),
               cases=pos_prop*tests)
    ## issues with pivot_longer? use gather() instead
    ## "Error: `spec` must have `.name` and `.value` columns"
    ## FIXME: figure out what's going on here?
    %>% gather("var","value",inc,tests,pos_prop,cases)
)
print(ggplot(dd,aes(time,value,col=phi,group=phi))
  + geom_line()
  + facet_wrap(~var,scale="free")
  + scale_y_log10()
  + scale_colour_viridis_c()
)

FIXME: direct labels? drop incidence and testing, include as reference-slope lines?

Estimation

When is \(\phi\) identifiable?

can we make a combined model with hospitalizations/deaths?
if we assume that prevalence is increasing exponentially (and we know the rates of testing) can we estimate \(\phi\)?
what other comparisons are possible?
there is some information in the model because change in \(i\) over time is constrained by the dynamical parameters …
can we set a sensible prior on \(\phi\) and integrate over the possibilities?

To do

Simulations

The first thing to do is the minimal identifiability test: if we simulate data directly using prop_pos_test, then add some noise (e.g. binomial testing outcomes), and then we try to take the observed testing data (total tests and positive tests) and get an MLE of both phi and the parameters of the exponential growth of prevalence (initial value and growth rate), can we do it? If \(t\) is the fraction of the population tested and \(N\) is the population size and \(f(\tau)=\textrm{prop\_pos}(i(\tau),t(\tau),\phi)\) is the fraction positive, then we want to simulate \(c(\tau) \sim \textrm{Binomial}(f,tN)\), the number of confirmed positive cases at time \(\tau\). Not worrying about sensitivity/specificity yet (tests are perfect).

MLE: want to estimate \(i_0\), \(r\) (growth rate of prevalence), and \(\phi\). We know \(t(\tau)\) and \(c(\tau)\) (no time lags yet!)

## t (test vector) and c (confirmation vector) are defined; tau is a time vector
set.seed(101)
## scalar parameters
true_pars <- c(I_0=100, r=log(2)/3, phi=0.5,
               T_0=100, r_t=log(2)/4, N=1e6)
true_pars["i_0"] <- true_pars["I_0"]/true_pars["N"]
## vectors of observed stuff
dd <- data.frame(tau=1:20)
dd$T <- round(true_pars["T_0"]*exp(true_pars["r_t"]*dd$tau))
dd$t <- dd$T/true_pars["N"]
## simulating confirmation vector
dd$c <- with(c(as.list(true_pars), dd),
             rbinom(length(tau),
                    size=T,  ## number of tests at time tau
                    prob=prop_pos_test(i_0*exp(r*tau), t, phi)
                    )
             )
mle_out<- mle2(c~dbinom(prob=prop_pos_test(plogis(logit_i_0)*exp(r*tau),
                                           t, exp(log_phi),
                                 debug=FALSE, phiscale = "unconstrained"),
              size=t*N),
              start=list(logit_i_0=qlogis(true_pars["i_0"]),
                         r=true_pars["r"], log_phi=log(true_pars["phi"]) ),
              data= list(tau=dd$tau, c=dd$c, t=dd$t, N=true_pars["N"]),
              control=list(maxit=1000)
)
mle_est0 <- coef(mle_out)

pp <- profile(mle_out)

ggdd <- as.data.frame(pp)
ggplot(ggdd,aes(focal,abs(z))) + geom_point() + geom_line() +
  facet_wrap(~param,scale="free_x")+
  scale_y_continuous(limits=c(NA,5))

invlink <- function(x) c(plogis(x[1]),x[2],exp(x[3]))
tt <- (broom::tidy(mle_out)
  %>% mutate(lwr=estimate-2*std.error,
             upr=estimate+2*std.error)
  %>% select(term,estimate,lwr,upr)
  %>% mutate_if(is.numeric,invlink)
  %>% mutate(term=c("i_0","r","phi"))
  %>% mutate(true=true_pars[c("i_0","r","phi")])
)

RTMB version

tmbdat <- list(tau=dd$tau, c=dd$c, t=dd$t, N=true_pars["N"])
pars <- list(logit_i_0=qlogis(true_pars["i_0"]),
             r=true_pars["r"], plogis_phi=plogis(true_pars["phi"]) )
nllfun <- function(pars) {
    getAll(pars, tmbdat)
    prob <- prop_pos_test1(plogis(logit_i_0)*exp(r*tau),
                                           t, exp(plogis_phi))
    -sum(dbinom(c, size = t*N, prob = prob, log = TRUE))
}
ff <- MakeADFun(nllfun, pars, silent = TRUE)
ff$fn()

## [1] 4058.523

fit <- with(ff, nlminb(pars, fn, gr))

## run tmbprofile on a selected set of parameters, combine results,
## scale to delta-deviance (signed? sqrt?)
## is returning CI as an attribute really the best way to do this?
## allow tracing by variable? parallel computation?
tmbprof2 <- function(obj, pars, conf.level = 0.95, ...) {
    pfun <- function(p) {
        tt <- TMB::tmbprofile(obj, p, ..., trace = FALSE)
        ci <- confint(tt, level = conf.level)
        tt$value <- sqrt(2*(tt$value - min(tt$value, na.rm=TRUE)))
        names(tt) <- c(".focal", "value")
        list(df = data.frame(var = p, tt), ci = data.frame(var = p, ci))
    }
    res <- lapply(pars, pfun)
    res_df <- do.call(rbind, lapply(res, \(x) x$df)) |>
        transform(var = factor(var, levels = pars))
    res_ci <- do.call(rbind, lapply(res, \(x) x$ci))
    rownames(res_ci) <- NULL
    attr(res_df, "ci") <- res_ci
    res_df
}

prof <- tmbprof2(ff, c("logit_i_0", "r", "plogis_phi"), ystep = 0.01)

true_vec <- unlist(sapply(pars, unname))
citab <- (attr(prof, "ci")
    |> mutate(est = fit$par[.data$var],
              true = true_vec[.data$var],
              .before = lower)
    |> mutate(across(upper, ~ replace_na(., Inf)))
)

ggplot(prof, aes(.focal, value)) +
    geom_line() + facet_wrap(~var, scale = "free_x") +
    geom_hline(yintercept = 1.96, lty = 2) +
    geom_vline(data = citab, aes(xintercept = true), lty = 2, colour = "red") +
    theme(panel.spacing = grid::unit(0, "lines"))

knitr::kable(citab)

var	est	true	lower	upper
logit_i_0	-9.3133305	-9.2102404	-9.8106774	-8.1010632
r	0.2328694	0.2310491	0.1877328	0.2703343
plogis_phi	-0.2983374	0.6224593	-1.0402324	Inf

ggplot(citab, aes(est, var)) +
    geom_pointrange(aes(xmin=lower, xmax = upper)) +
    geom_point(aes(x=true), colour = "red", size = 4) +
    facet_wrap(~var, scale = "free", ncol = 1) +
    labs(y = "", x = "")

(run multiple sims for RMSE/coverage/bias/etc.? compare against naive estimates, uncorrecting for testing bias?

The next simulation is for understanding better what we’re really modeling here. The simulation should be a more mechanistic description of the infection and testing process: not sure how this actually worked. There must be some infection process (e.g. an SIR model …) and some process by which people get assigned a probability of being chosen for testing, which we can match up with the mathematical description above. The testing process is also mechanistic; after testing people get removed from the testable pool.

How do we model a testing policy? How do we make the testing policy match up with our distribution?

Give every individual two (correlated) numbers which correspond to infection risk and testing risk? Increase testing risk at onset of symptoms?

Test asymptomatic (susceptible) people at one rate and symptomatic (infectious) people at another rate?

A fraction of the incidence moves into a “testing pool”; people in the testing pool are tested at a constant rate. Every person gets a number sampled from our beta distribution when they become infected, which governs their chance of being selected

Kumaraswamy distribution

Is the inverse CDF for the Kumaraswamy distribution tractable? Seems that way:

\[ \begin{split} q & =1-(1-x^a)^b \\ (1-q)^{1/b} & = 1-x^a \\ x = (1- (1-q)^{1/b})^{1/a} \end{split} \]

(this could be important if we want to embed this machinery in Stan/JAGS/etc. where pbeta()/qbeta() may not exist and/or be sufficiently robust …)

Unfortunately the tradeoff is that the expression for the mean is complicated:

\[ \frac{b\Gamma\left(1+ \frac{1}{a}\right) \Gamma(b)}{\Gamma\left(1+\frac{1}{a}+b\right)} \]

so reparameterizing in terms of mean/dispersion might be hard (and the integral trick with the Beta also might not work )

brain dump/misc

How does this relate to other ways that people are thinking about testing vs incidence? e.g. Noam Ross’s graphs in https://docs.google.com/spreadsheets/d/1ecQ0t1Sn2maR2b9sUacA3RSIUTHFlT-V1XiNnOFiBpw/edit#gid=2019730603