Introduction

How should we think about the connection between numbers of (1) infected people in the population (2) tests done (3) cases reported?

Most generally/mechanistically, on any day we have some number of people who are on day \(\tau\) of their infectious period. On a given day, an infectious (\(I\)) person may be subclinical (asymptomatic or so mildly symptomatic that they don’t report symptoms); mild; severe (hospitalized); or very severe (ICU/ventilator). The progression among these stages will be different for different people (although in general they will progress toward greater severity). The more severe the symptoms, the more likely someone is to be tested.

Testing will also depend on how they got infected, e.g. imported cases and close contacts are more likely to be tested. (An asymptomatic, community-infected person will basically never be tested.)

In a complete agent-based model (or a model based on a complex compartmental structure), we could keep track of people moving through all of these categories and assign them probabilities of being tested. In a sufficiently realistic model we could even take into account detailed testing criteria and phenomena such as limited testing resources (so that, e.g., low-risk, mild cases would never be tested).

There are two extreme scenarios that are easy to reason about.

Now let’s consider that in general we test people in (approximate) order of their probability of infection (ignoring tests skipped because of constraints). We can think about a distribution in the population of probability of infection (higher for known contacts of infected people, travellers from high-risk areas, etc.), and a distribution in the probability of testing (which ideally lines up with the risk). We could think about having two distributions, but these can’t really be separated (FIXME: explain better!).

Machinery

Let’s use a Beta distribution with mean equal to prevalence \(i\) as the distribution of ‘probability infected’. (An alternative that might be more analytically tractable is the Kumaraswamy distribution.) As the dispersion \(\gamma=1/(a+b)\) goes to infinity we end up with point masses \(1-i\) on 0 and \(i\) on 1; as it goes to 0 we end up with a point mass on \(i\). If we always test ‘from the top down’ (i.e. calculate the mean value of the top fraction \(t = T/N\) of the population distribution), these two extreme cases correspond to perfect testing (cases=\(\min(T,I)\)) and random testing (cases=\(\textrm{Binomial}(i,T)\)).

If \(B\) is the Beta distribution with mean \(i\) and dispersion \(\gamma\), \(\Phi_B\) is the CDF, and \(Q\) is the inverse CDF (i.e., the quantile function) then our expected proportion positive from testing a fraction \(t=T/N\) is

\[ \frac{\left(\int_{Q(1-t)}^1 B(y,\phi) y \, dy\right)}{1-\Phi_B(Q(1-t))} = \frac{\left(\int_{Q(1-t)}^1 B(y,\phi) y \, dy\right)}{t} \]

i.e. the mean of the upper tail of the infection-probability distribution.

We can get a little bit farther analytically. Given that \(B(x,a,b) \propto x^{a-1} (1-x)^{b-1}\), the mean is \(a/(a+b)=i \to a \phi =i \to a = i/\phi, b=(a+b)-a = 1/\phi - i/\phi = (1-i)/\phi\). We can show that \(B(x,a,b) \cdot x = a/(a+b) B(x,a+1,b)\), so we should be able to do the integral directly by computing the appropriate CDF (or complementary CDF) of the Beta distribution.

\(B(x,a,b) =\frac{1}{\mathcal{B}(a,b)} x^{a-1} (1-x)^{b-1}\), where \(\mathcal{B}(a,b)\) is the beta function, which satisfies \[\mathcal{B}(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}\,.\] It follows that \[ \begin{aligned} B(x,a,b)x = \frac{1}{\mathcal{B}(a,b)} x^{a} (1-x)^{b-1} &= \frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)}\times \frac{1}{\mathcal{B}(a+1,b)} x^{a} (1-x)^{b-1}\\ &=\frac{\mathcal{B}(a+1,b)}{\mathcal{B}(a,b)} B(x,a+1,b)\,, \end{aligned} \] which, using the identity \(\Gamma(z+1) = z\Gamma(z)\), simplifies to \[ \begin{aligned} B(x,a,b)x &=\frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)}\frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)} B(x,a+1,b)\\ &= \frac{a\Gamma(a)\Gamma(b)}{(a+b)\Gamma(a+b)}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} B(x,a+1,b) = \frac{a}{a+b}B(x,a+1,b)\,. \end{aligned} \] Since we set the mean \(a/(a+b)\) of the distribution \(B(x,a,b)\) to \(i\), we have \(B(x,a,b)x = iB(x,a+1,b)\).

Can we do a little bit more here since we’re integrating this from \(Q(a,b)\) to 1 … ?

In order to make the dispersion parameter a little more interpretable we will transform the parameter \(\gamma\) from \((0,\infty)\) to a value \(\phi = 1-\exp(-\gamma) \in (0,1) (i.e. \gamma=-\log(1-\phi))\) (previously: \(\phi=-\log(1-\gamma)\) (i.e. \(\gamma=1-\exp(-\phi)\))), so that 0 corresponds to random testing and 1 corresponds to perfectly focused testing.

If we wanted to get more realistic/detailed we could relax the assumption of a Beta distribution and instead characterize the distribution of probability infected as a mixture of types (general population, symptomatic, travel history, etc.), with the constraint of a mean of \(i\)

Graphical example

Suppose the population incidence is 25% and we test 25% of the population. Our \(\phi\) parameter is 0.2211992 (dispersion of the Beta are 0.25 and shape parameters of the Beta are {1, 3}). The lower bound of the upper 25% tail is 0.37; the integral of \(x\) under that portion of the curve is 0.132, so the test positivity is round(qq/0.25).

Correlation

What is the relationship between \(\phi\) and the correlation between testing and infection?

We need

\[ \newcommand{\var}{\textrm{Var}} \frac{E[(\textrm{tested} - \overline{\textrm{tested}})\cdot - (\textrm{infected} - \overline{\textrm{infected}})]}{\sqrt{\var(\textrm{tested})} \cdot \var(\textrm{infected})} \]

The probability of testing is 0 below the threshold and 1 above it; thus this distribution should be a Bernoulli with weight \(1-t\) on zero and \(t\) on 1. Thus the numerator is the value we’ve already computed for the numerator of the proportion-positive value (\(\int_{1-t}^1 x \cdot \textrm{Beta(x)}\,dx\)), minus the product of the testing and infection means (\(= ti\), using the usual identity \(E[(A-\bar A)(B-\bar B)] = E[AB] - \bar A \bar B\)). The variance of testing probability is (\((1-t)\cdot 0^2 + t \cdot 1^2 = t\)); the variance of infection probability is the variance of the Beta distribution.

inf_test_corr <- function(i, t, phi) {
    gamma <- -log(1-phi)
    a <- i/gamma; b <- (1-i)/gamma
    lwr <- qbeta(t, a, b, lower.tail = FALSE)  ## hope we don't need robust version in covid19testing_funs.R
    num <- pbeta(lwr,a+1,b,lower.tail=FALSE)*(a/(a+b))-t*i
    var_test <- t*(1-t)
    var_inf <- a*b/((a+b)^2*(a+b+1)) ## Wikipedia
    num/sqrt(var_test*var_inf)
}
nn <- 101
phivec <- seq(0, 1, length.out = nn)
cordf <- data.frame(i = rep(c(0.25, 0.1), c(nn, 2*nn)),
                    t = rep(c(0.25, 0.05, 0.01), each = nn),
                    phi = rep(phivec, 3))
cordf$cor <- with(cordf, inf_test_corr(i,t,phi))
## Warning in qbeta(t, a, b, lower.tail = FALSE): NaNs produced
cordf$cat <- with(cordf, sprintf("i = %1.2f, t = %1.2f", i, t))
cordf$cat <- factor(cordf$cat, levels = unique(cordf$cat))
ggplot(cordf, aes(phi, cor, colour = cat)) + geom_line() +
    geom_pointrange(data = numcor, aes(y=cor, ymin = lwr, ymax = upr),
                    colour = "blue", size = 0.1)
## Warning: Removed 6 rows containing missing values or values outside the scale range (`geom_line()`).

Numerical examples

par(las=1,bty="l")
curve(prop_pos_test(i=0.01,t=0.001,x),from=0.001,to=0.999,
      xlab=expression("testing focus"~(phi)),
      ylab="proportion of positive tests",
      main=c("1% prevalence, 0.1% testing"),
      ylim=c(0,1))
abline(h=0.01,lty=2)

What happens as we test more people than are actually infected in the population?

par(las=1,bty="l")
curve(prop_pos_test(t=x,i=0.01,phi=0.5),
      from=0.001, to=0.4,log="x",xlab="proportion tested",
      ylab="prop positive tests",
      main=expression(list("prevalence=1%",phi==0.5)))
abline(v=0.01,lty=2)

dd <- (expand.grid(time=1:25,phi=c(0.001,0.2,0.5,0.8,0.999))
    %>% as_tibble()
    %>% mutate(inc=0.001*exp(log(2)/3*time),
               tests=1e-4*exp(log(2)/4*time),
               pos_prop=prop_pos_test(inc,tests,phi),
               cases=pos_prop*tests)
    ## issues with pivot_longer? use gather() instead
    ## "Error: `spec` must have `.name` and `.value` columns"
    ## FIXME: figure out what's going on here?
    %>% gather("var","value",inc,tests,pos_prop,cases)
)
print(ggplot(dd,aes(time,value,col=phi,group=phi))
  + geom_line()
  + facet_wrap(~var,scale="free")
  + scale_y_log10()
  + scale_colour_viridis_c()
)

FIXME: direct labels? drop incidence and testing, include as reference-slope lines?

Estimation

When is \(\phi\) identifiable?

To do

Simulations

  • The first thing to do is the minimal identifiability test: if we simulate data directly using prop_pos_test, then add some noise (e.g. binomial testing outcomes), and then we try to take the observed testing data (total tests and positive tests) and get an MLE of both phi and the parameters of the exponential growth of prevalence (initial value and growth rate), can we do it? If \(t\) is the fraction of the population tested and \(N\) is the population size and \(f(\tau)=\textrm{prop\_pos}(i(\tau),t(\tau),\phi)\) is the fraction positive, then we want to simulate \(c(\tau) \sim \textrm{Binomial}(f,tN)\), the number of confirmed positive cases at time \(\tau\). Not worrying about sensitivity/specificity yet (tests are perfect).

MLE: want to estimate \(i_0\), \(r\) (growth rate of prevalence), and \(\phi\). We know \(t(\tau)\) and \(c(\tau)\) (no time lags yet!)

## t (test vector) and c (confirmation vector) are defined; tau is a time vector
set.seed(101)
## scalar parameters
true_pars <- c(I_0=100, r=log(2)/3, phi=0.5,
               T_0=100, r_t=log(2)/4, N=1e6)
true_pars["i_0"] <- true_pars["I_0"]/true_pars["N"]
## vectors of observed stuff
dd <- data.frame(tau=1:20)
dd$T <- round(true_pars["T_0"]*exp(true_pars["r_t"]*dd$tau))
dd$t <- dd$T/true_pars["N"]
## simulating confirmation vector
dd$c <- with(c(as.list(true_pars), dd),
             rbinom(length(tau),
                    size=T,  ## number of tests at time tau
                    prob=prop_pos_test(i_0*exp(r*tau), t, phi)
                    )
             )
mle_out<- mle2(c~dbinom(prob=prop_pos_test(plogis(logit_i_0)*exp(r*tau),
                                           t, exp(log_phi),
                                 debug=FALSE, phiscale = "unconstrained"),
              size=t*N),
              start=list(logit_i_0=qlogis(true_pars["i_0"]),
                         r=true_pars["r"], log_phi=log(true_pars["phi"]) ),
              data= list(tau=dd$tau, c=dd$c, t=dd$t, N=true_pars["N"]),
              control=list(maxit=1000)
)
mle_est0 <- coef(mle_out)
pp <- profile(mle_out)
ggdd <- as.data.frame(pp)
ggplot(ggdd,aes(focal,abs(z))) + geom_point() + geom_line() +
  facet_wrap(~param,scale="free_x")+
  scale_y_continuous(limits=c(NA,5))

invlink <- function(x) c(plogis(x[1]),x[2],exp(x[3]))
tt <- (broom::tidy(mle_out)
  %>% mutate(lwr=estimate-2*std.error,
             upr=estimate+2*std.error)
  %>% select(term,estimate,lwr,upr)
  %>% mutate_if(is.numeric,invlink)
  %>% mutate(term=c("i_0","r","phi"))
  %>% mutate(true=true_pars[c("i_0","r","phi")])
)

RTMB version

tmbdat <- list(tau=dd$tau, c=dd$c, t=dd$t, N=true_pars["N"])
pars <- list(logit_i_0=qlogis(true_pars["i_0"]),
             r=true_pars["r"], plogis_phi=plogis(true_pars["phi"]) )
nllfun <- function(pars) {
    getAll(pars, tmbdat)
    prob <- prop_pos_test1(plogis(logit_i_0)*exp(r*tau),
                                           t, exp(plogis_phi))
    -sum(dbinom(c, size = t*N, prob = prob, log = TRUE))
}
ff <- MakeADFun(nllfun, pars, silent = TRUE)
ff$fn()
## [1] 4058.523
fit <- with(ff, nlminb(pars, fn, gr))

## run tmbprofile on a selected set of parameters, combine results,
## scale to delta-deviance (signed? sqrt?)
## is returning CI as an attribute really the best way to do this?
## allow tracing by variable? parallel computation?
tmbprof2 <- function(obj, pars, conf.level = 0.95, ...) {
    pfun <- function(p) {
        tt <- TMB::tmbprofile(obj, p, ..., trace = FALSE)
        ci <- confint(tt, level = conf.level)
        tt$value <- sqrt(2*(tt$value - min(tt$value, na.rm=TRUE)))
        names(tt) <- c(".focal", "value")
        list(df = data.frame(var = p, tt), ci = data.frame(var = p, ci))
    }
    res <- lapply(pars, pfun)
    res_df <- do.call(rbind, lapply(res, \(x) x$df)) |>
        transform(var = factor(var, levels = pars))
    res_ci <- do.call(rbind, lapply(res, \(x) x$ci))
    rownames(res_ci) <- NULL
    attr(res_df, "ci") <- res_ci
    res_df
}
prof <- tmbprof2(ff, c("logit_i_0", "r", "plogis_phi"), ystep = 0.01)
true_vec <- unlist(sapply(pars, unname))
citab <- (attr(prof, "ci")
    |> mutate(est = fit$par[.data$var],
              true = true_vec[.data$var],
              .before = lower)
    |> mutate(across(upper, ~ replace_na(., Inf)))
)
ggplot(prof, aes(.focal, value)) +
    geom_line() + facet_wrap(~var, scale = "free_x") +
    geom_hline(yintercept = 1.96, lty = 2) +
    geom_vline(data = citab, aes(xintercept = true), lty = 2, colour = "red") +
    theme(panel.spacing = grid::unit(0, "lines"))

knitr::kable(citab)
var est true lower upper
logit_i_0 -9.3133305 -9.2102404 -9.8106774 -8.1010632
r 0.2328694 0.2310491 0.1877328 0.2703343
plogis_phi -0.2983374 0.6224593 -1.0402324 Inf
ggplot(citab, aes(est, var)) +
    geom_pointrange(aes(xmin=lower, xmax = upper)) +
    geom_point(aes(x=true), colour = "red", size = 4) +
    facet_wrap(~var, scale = "free", ncol = 1) +
    labs(y = "", x = "")

(run multiple sims for RMSE/coverage/bias/etc.? compare against naive estimates, uncorrecting for testing bias?

How do we model a testing policy? How do we make the testing policy match up with our distribution?

Give every individual two (correlated) numbers which correspond to infection risk and testing risk? Increase testing risk at onset of symptoms?

Test asymptomatic (susceptible) people at one rate and symptomatic (infectious) people at another rate?

A fraction of the incidence moves into a “testing pool”; people in the testing pool are tested at a constant rate. Every person gets a number sampled from our beta distribution when they become infected, which governs their chance of being selected

Kumaraswamy distribution

Is the inverse CDF for the Kumaraswamy distribution tractable? Seems that way:

\[ \begin{split} q & =1-(1-x^a)^b \\ (1-q)^{1/b} & = 1-x^a \\ x = (1- (1-q)^{1/b})^{1/a} \end{split} \]

(this could be important if we want to embed this machinery in Stan/JAGS/etc. where pbeta()/qbeta() may not exist and/or be sufficiently robust …)

Unfortunately the tradeoff is that the expression for the mean is complicated:

\[ \frac{b\Gamma\left(1+ \frac{1}{a}\right) \Gamma(b)}{\Gamma\left(1+\frac{1}{a}+b\right)} \]

so reparameterizing in terms of mean/dispersion might be hard (and the integral trick with the Beta also might not work )

brain dump/misc

How does this relate to other ways that people are thinking about testing vs incidence? e.g. Noam Ross’s graphs in https://docs.google.com/spreadsheets/d/1ecQ0t1Sn2maR2b9sUacA3RSIUTHFlT-V1XiNnOFiBpw/edit#gid=2019730603