disaggregating clarity of statistical results

Author

Ben Bolker and Jonathan Dushoff

Published

December 27, 2025

Code

library(tidyverse)
library(ggplot2); theme_set(theme_bw() + theme(panel.spacing=grid::unit(0, "lines")))
source("powertargets_funs.R")

Simulate a \(t\)-test-like comparison: what is the distribution of outcomes?

The relevant parameters:

param	default	meaning
\(s\)	1	SESOI (“smallest effect size of interest”), clinically meaningful difference
\(\delta\)	1	unscaled effect size (difference between means)
\(\sigma\)	1	standard error
\(\alpha\)	0.05	critical \(p\)-value
\(n\)		sample size per group

We are classifying outcomes in the following way (from here/here), assuming a positive estimate (symmetric equivalents of all cases except “completely unclear” are possible).

Code

largeEffect=1.3; smallEffect=0.5; tinyEffect = 0.1; smallVar = 0.2; medVar = 0.4; largeVar = 0.7; hugeVar = 1.2
span = 2
vget = Vectorize(get)
vr <- (
    read.table(header=TRUE, strip.white=TRUE, sep=":", text="
        pic : val : unc : atext : ntext
        PL : largeEffect : smallVar : Clearly large|and positive : different
        PU : largeEffect : largeVar : Clearly positive,|maybe large : different
        PS : smallEffect : smallVar : Clearly positive|and not large : different
        US : tinyEffect : medVar : Maybe positive,|clearly small : different
        UU : smallEffect : largeVar : Not both (large|and negative) : different
        nopower : tinyEffect : hugeVar : Should have|done a power|analysis first : different
    ")
)
vf <- (vr
    |> mutate(NULL
        , pic = factor(pic, levels=pic)
        , val = vget(val)
        , unc = vget(unc)
        , lwr = val-unc
        , upr = val+unc
        , atext = gsub('\\|', '\n', atext)
    )
)
print(ggplot(vf)
    + aes(val, pic)
    + geom_pointrange(aes(xmin=lwr, xmax=upr, col = pic))
     ## delete vertical grid lines
    + theme(panel.grid.minor.x = element_blank(),
            legend.position = "none") 
    + geom_vline(xintercept=c(0, -1, 1), lty = c(1, 2, 2))
    + scale_x_continuous(limits=c(-span, span)
        , breaks = -1:1
        , labels = c("cutoff\n(SESOI)", 0 , "cutoff\n(SESOI)")
    )
    + scale_y_discrete(limits = rev, labels=rev(vf$atext))
    + labs(x = "", y = "")
    + out_colour_scale(vf$pic)
)

As an example, we’ll start by determining the sample size to needed to get 80% power for a cutoff/SESOI value (s in the underlying code) with a standard deviation of 1:

Code

pp <- power.t.test(delta = 1, sd = 1, power = 0.8)
n <- ceiling(pp$n)  ## 17
tt <- power.t.test(delta = 1, sd = 1, n = n)
pow1 <- round(tt$power,3) ## 0.807

With per-sample \(n\) of 17 (and with \(\delta=1\), \(\sigma=1\), \(\alpha=0.05\)) we get a power of 0.807.

If we simulate 10,000 such experiments, here is the breakdown of different outcomes:

Code

set.seed(101)
t0 <- system.time(
  tt0 <- tabfun(n=17, nsim =  10000)
)
pow1_sim <- sum(tt0[1:3])
stopifnot(all.equal(pow1_sim, tt$power, tolerance = 2e-3))
print(tt0)


              Clearly large\nand positive 
                                   0.0228 
           Clearly positive,\nmaybe large 
                                   0.7825 
          Clearly positive\nand not large 
                                   0.0002 
           Maybe positive,\nclearly small 
                                   0.0233 
           Not both (large\nand negative) 
                                   0.1712 
Should have\ndone a power\nanalysis first 
                                   0.0000

We confirm that we do get overall power, i.e. all outcomes with a clear sign, very close to the nominal value (0.805)

If the true effect size (\(\delta = 1\)) is equal to the cutoff/SESOI (\(s=1\)) then everything converges to a narrow window around the SESOI: mostly “unclear magnitude/clear sign” but always a clear sign. Indeed, we can see that for \(n=100\) we have converged (for 95% CI/\(\alpha = 0.05\)) to all “clear sign” (i.e. power \(\approx\) 100%) with 95% in the “unclear magnitude” category and 2.5% each in “clearly small” or “clearly large”.

Code

tt1 <- tabfun(n=200, nsim = 10000)
print(tt1)


              Clearly large\nand positive 
                                   0.0274 
           Clearly positive,\nmaybe large 
                                   0.9500 
          Clearly positive\nand not large 
                                   0.0226 
           Maybe positive,\nclearly small 
                                   0.0000 
           Not both (large\nand negative) 
                                   0.0000 
Should have\ndone a power\nanalysis first 
                                   0.0000

Simulations with \(\delta=0.5\), \(s=1\), varying \(n\)

Code

set.seed(101)
nvec <- c(5:10, (2:9)*10, 100, 200)
sim1 <- nsimfun(nvec, nsim =  10000)
plotfun(sim1)

Alternatively:

Code

(sim1
  |> filter(n <= 100)
  |> plotfun(sim1, stack = TRUE)
)

Simulations with \(\delta=1.5\), \(s=1\), varying \(n\)

What if we make the true effect size greater than the SESOI?

Code

sim2 <- nsimfun(nvec, delta=1.5, nsim = 10000)
plotfun(sim2)

(Limiting this plot to \(n \le 50\) so we have a hope of seeing what’s happening …)

Code

(sim2
  |> filter(n <= 50)
  |> plotfun(stack = TRUE)
)

session info

Code

sessionInfo()

R Under development (unstable) (2025-12-22 r89219)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /usr/local/lib/R/lib/libRblas.so 
LAPACK: /usr/local/lib/R/lib/libRlapack.so;  LAPACK version 3.12.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/Toronto
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] ggrepel_0.9.6          directlabels_2025.6.24 lubridate_1.9.4       
 [4] forcats_1.0.1          stringr_1.6.0          dplyr_1.1.4           
 [7] purrr_1.2.0            readr_2.1.6            tidyr_1.3.2           
[10] tibble_3.3.0           ggplot2_4.0.1          tidyverse_2.0.0       

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.6.0     Rcpp_1.1.0        
 [5] tidyselect_1.2.1   bspm_0.5.7         dichromat_2.0-0.1  scales_1.4.0      
 [9] yaml_2.3.12        fastmap_1.2.0      R6_2.6.1           labeling_0.4.3    
[13] generics_0.1.4     knitr_1.51         htmlwidgets_1.6.4  tzdb_0.5.0        
[17] pillar_1.11.1      RColorBrewer_1.1-3 rlang_1.1.6        stringi_1.8.7     
[21] xfun_0.55          quadprog_1.5-8     S7_0.2.1           otel_0.2.0        
[25] timechange_0.3.0   cli_3.6.5          withr_3.0.2        magrittr_2.0.4    
[29] digest_0.6.39      grid_4.6.0         hms_1.1.4          lifecycle_1.0.4   
[33] vctrs_0.6.5        evaluate_1.0.5     glue_1.8.0         farver_2.1.2      
[37] rmarkdown_2.30     tools_4.6.0        pkgconfig_2.0.3    htmltools_0.5.9

references

Julious, Steven A. 2004. “Sample Sizes for Clinical Trials with Normal Data.” Statistics in Medicine 23 (12): 1921–86. https://doi.org/10.1002/sim.1783.

MacCallum, Robert C., Michael W. Browne, and Hazuki M. Sugawara. 1996. “Power Analysis and Determination of Sample Size for Covariance Structure Modeling.” Psychological Methods 1 (2): 130–49. https://doi.org/10.1037/1082-989X.1.2.130.

McBride, Graham B. 1999. “Applications: Equivalence Tests Can Enhance Environmental Science and Management.” Australian & New Zealand Journal of Statistics 41 (1): 19–29. https://doi.org/10.1111/1467-842X.00058.

Morey, Richard D. 2020. “Power and Precision.” Medium. https://richarddmorey.medium.com/power-and-precision-47f644ddea5e.

Owen, D. B. 1965. “A Special Case of a Bivariate Non-Central \(t\)-Distribution.” Biometrika 52 (3/4): 437–46. https://doi.org/10.2307/2333696.