Licensed under the Creative Commons attribution-noncommercial license. Please share & remix noncommercially, mentioning its origin.

Basic criteria for data presentation

If you’re at all interested in this topic, John Rauser’s (2016) talk “How Humans See Data” is strongly recommended.

Visual perception of quantitative information: Cleveland hierarchy (Cleveland and McGill 1984, 1987; Cleveland 1993)

[Figure: the Cleveland hierarchy]

Techniques for multilevel data

faceting (= trellis plots = small multiples) vs grouping (“spaghetti plots”)
join data within a group by lines (perhaps thin/transparent)
colour lines by group (more useful for exploratory than presentation graphics)
dynamic graphics (hovertext; plotly::ggplotly)
other ways to indicate grouping: stat_ellipse, ggalt::geom_encircle, stat_centseg (from ../../R/geom_cstar.R)
depends on context: how many groups, what kind of predictors? time series or scatterplots?
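
As a minimal sketch of the first two techniques, the following contrasts grouping (a spaghetti plot with semi-transparent lines) with faceting. It assumes the built-in ChickWeight data as a stand-in (it is not one of the data sets used in these notes); the alpha value is an arbitrary choice.

```r
library(ggplot2)

## ChickWeight (built into R): weight of each Chick over Time, on one of four Diets
base_plot <- ggplot(ChickWeight, aes(Time, weight))

## grouping: one semi-transparent line per chick, coloured by diet
spaghetti <- base_plot +
    geom_line(aes(group = Chick, colour = Diet), alpha = 0.4)

## faceting: the same lines, one panel per diet (colour no longer needed)
faceted <- base_plot +
    geom_line(aes(group = Chick), alpha = 0.4) +
    facet_wrap(~Diet)
```

Print `spaghetti` or `faceted` to draw the plots. With 50 chicks the grouped version is still readable; with many more groups, faceting or stronger transparency is likely to work better.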

ggplot2 makes it fairly easy to do a simple two-stage analysis on the fly using geom_smooth, e.g. with the CBPP data discussed below (see the two-stage analysis example).

ggplot

Grammar of Graphics: based on Wilkinson (1999)
documented in Wickham (2009), also web site, mailing list, StackOverflow tag
explicit mapping from variables to “aesthetics”: e.g. x, y, colour, size, shape
easier to overlay multiple data sets, data summaries, model predictions etc.
rendering can be slow
ggalt, gridExtra, ggExtra, cowplot, directlabels packages useful
ggplot gallery; ggplot extensions
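
To illustrate the point about overlaying data sets and model predictions, here is a small sketch. It assumes the built-in mtcars data and a simple linear model purely for illustration; the names `fit`, `pred` and `overlay` are hypothetical.

```r
library(ggplot2)

## mtcars (built into R) stands in for a real data set
fit <- lm(mpg ~ wt, data = mtcars)

## model predictions on a regular grid of the predictor
pred <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
pred$mpg <- predict(fit, newdata = pred)

## raw data and predictions share the x/y mapping but come from different data frames
overlay <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(aes(colour = factor(cyl))) +   # colour mapped to a variable
    geom_line(data = pred)                    # second data set, same x/y aesthetics
```

Each layer can take its own `data=` argument; aesthetics set in the `ggplot()` call are inherited by every layer unless overridden.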

Rules of thumb

what goes where? Based on Cleveland hierarchy
- (Continuous) response on the \(y\)-axis, most salient (continuous) predictor on the \(x\)-axis (or a categorical predictor with many categories)
- Most salient comparisons within the same subplot (distinguished by color/shape), and nearby within the subplot when grouping bars/points
- Facet rows > facet columns
flip axes to display labels better (coord_flip(), ggstance package)
use transparency to include important but potentially distracting detail
do category levels need to be identified or just distinguished?
choose geoms according to data size: points < boxplots < violin, hexbin
order categorical variables meaningfully (“What’s wrong with Alabama?”): forcats::fct_reorder(), forcats::fct_infreq()
choose colors carefully (RColorBrewer/ColorBrewer, IWantHue): respect dichromats and B&W printouts
visual design (tweaking) vs. reproducibility (e.g. ggrepel, directlabels packages)
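
Several of these rules can be combined in one short sketch. It assumes the mpg data that ships with ggplot2 (not a data set from these notes) and uses the "Dark2" ColorBrewer palette, which I believe RColorBrewer flags as colour-blind friendly; check `RColorBrewer::brewer.pal.info` before relying on that.

```r
library(ggplot2)
library(forcats)

## mpg ships with ggplot2; order vehicle class by median highway mileage
mpg2 <- as.data.frame(mpg)
mpg2$class <- fct_reorder(mpg2$class, mpg2$hwy, .fun = median)

ordered_box <- ggplot(mpg2, aes(x = class, y = hwy, fill = drv)) +
    geom_boxplot(alpha = 0.6) +            # transparency keeps overlapping detail visible
    coord_flip() +                         # category labels read horizontally
    scale_fill_brewer(palette = "Dark2")   # ColorBrewer qualitative palette
```

The categories now appear in order of their median response rather than alphabetically, which is usually the more informative comparison.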

ggplot intro

data
mappings: between variables in the data frame and aesthetics, or graphical attributes (x position, y position, size, colour …)
first two show up as (e.g.) ggplot(my_data,aes(x=age,y=rootgrowth,colour=phosphate))
geoms:
simple: geom_point, geom_line

load("../../data/gopherdat2.RData")
library("ggplot2"); theme_set(theme_bw())
(ggplot(Gdat,aes(x=year,y=shells/Area,colour=Site))
    + geom_point()
)

more complex: geom_boxplot, geom_smooth
geoms are added to an existing data/mapping combination
facets: facet_wrap (free-form wrapping of subplots), facet_grid (two-D grid of subplots)
also: scales, coordinate transformations, statistical summaries, position adjustments …
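
A brief sketch of how geoms and facets are layered onto a data/mapping template, again assuming the mpg data from ggplot2 as a stand-in:

```r
library(ggplot2)

## data + mappings only: nothing is drawn until a geom is added
template <- ggplot(mpg, aes(x = displ, y = hwy))

## add geoms to the existing data/mapping combination
scatter <- template +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x)

## facet_wrap: subplots for one variable, wrapped into rows as needed
wrapped <- scatter + facet_wrap(~class)

## facet_grid: two-dimensional grid (rows ~ columns)
gridded <- scatter + facet_grid(drv ~ cyl)
```

Because `template` holds only the data and mappings, the same object can be reused with different geoms or facets, which is the pattern used for `g0` in the examples below.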

See Karthik Ram’s ggplot intro or my intro for disease ecologists, among many others.

Multilevel data examples

library("ggalt")
source("../../R/geom_cstar.R")

time series: cbpp data set

Contagious bovine pleuropneumonia (CBPP): from Lesnoff et al. (2004), via the lme4 package. See ?lme4::cbpp for details.

data("cbpp",package="lme4")
## make period *numeric* so lines will be connected/grouping won't happen
cbpp2 <- transform(cbpp,period=as.numeric(as.character(period)))
g0 <- ggplot(cbpp2,aes(period,incidence/size)) ## plot template (no geom)
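
The as.character() step in the conversion above matters in general: calling as.numeric() directly on a factor returns the internal level codes, not the labels. The cbpp periods are labelled 1 to 4, so the two happen to coincide there, but a hypothetical factor with other labels shows the difference:

```r
## hypothetical factor whose labels are not 1, 2, 3, ...
f <- factor(c("10", "20", "40", "20"))

codes  <- as.numeric(f)                 # integer codes of the levels: 1 2 3 2
values <- as.numeric(as.character(f))   # the labels themselves: 10 20 40 20
```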

spaghetti plot

g1 <- (g0
    +geom_line(aes(colour=herd))
    +geom_point(aes(size=size,colour=herd))
)

Do we need the colours?

g2 <- (g0
    +geom_line(aes(group=herd))
    +geom_point(aes(size=size,group=herd))
)

Facet instead:

g4 <- g1+facet_wrap(~herd)

Order by average proportional incidence, using the %+% trick (which replaces the data in an existing plot):

cbpp2R <- transform(cbpp2,herd=reorder(herd,incidence/size))
g4 %+% cbpp2R
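
What reorder() does may be easier to see on a toy example (hypothetical data, not from cbpp): it sorts the factor levels by a summary of the second argument, using the mean unless another FUN is given.

```r
## toy data (hypothetical): three herds with different mean incidence
toy <- data.frame(
    herd = factor(c("a", "a", "b", "b", "c", "c")),
    prop = c(0.30, 0.50, 0.05, 0.15, 0.20, 0.20)
)
levels(toy$herd)   # alphabetical: "a" "b" "c"

## sort levels by the mean of prop within each herd (FUN = mean is the default)
toy$herd <- reorder(toy$herd, toy$prop, FUN = mean)
levels(toy$herd)   # increasing mean (0.1, 0.2, 0.4): "b" "c" "a"
```

Facets and legends follow the level order, so reordering the factor is enough to reorder the panels.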

two-stage analysis:

(g0
    + geom_point(aes(size=size,group=herd))
    + geom_smooth(aes(group=herd,weight=size),
                  method="glm",
                  method.args=list(family=binomial),
                  se=FALSE))

## `geom_smooth()` using formula = 'y ~ x'

(ignore glm.fit warnings if you try this)
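
The same two-stage idea can be written out explicitly, as a rough sketch rather than a recommended analysis: fit a separate binomial GLM to each herd, then summarize the per-herd slopes. The names `fits` and `slopes` are hypothetical, and herds with very few observations may give NA or extreme estimates.

```r
data("cbpp", package = "lme4")
cbpp2 <- transform(cbpp, period = as.numeric(as.character(period)))

## stage 1: a separate binomial GLM for each herd
## (suppressWarnings() hides the glm.fit warnings mentioned above)
fits <- lapply(split(cbpp2, cbpp2$herd), function(d) {
    suppressWarnings(
        glm(cbind(incidence, size - incidence) ~ period,
            family = binomial, data = d)
    )
})

## stage 2: summarize the per-herd slopes (change in log-odds per period)
slopes <- vapply(fits, function(m) unname(coef(m)["period"]), numeric(1))
summary(slopes)
```

The two-column response `cbind(incidence, size - incidence)` is equivalent to modelling `incidence/size` with `weight=size`, as in the geom_smooth() call.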

scatterplots: gopher tortoise mycoplasma data

Gopher tortoise data (from Ozgul et al. (2009), see ecostats chapter)

Plot density of shells from freshly dead tortoises (shells/Area) as a function of mycoplasmal prevalence (%, prev): you may want to consider site, year of collection, or population density as well.

load("../../data/gopherdat2.RData")
g5 <- ggplot(Gdat,aes(prev,shells/Area))+geom_point()

g5+geom_encircle(aes(group=Site))
g5+geom_encircle(aes(group=Site),s_shape=1,expand=0) ## convex hulls
## connect points to center
g5+stat_centseg(aes(group=Site),cfun=mean)
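
If ggalt or the local stat_centseg code is unavailable, similar displays can be approximated with base R and standard geoms. This is a sketch of the idea, not the implementation those functions use; it assumes the built-in iris data as a stand-in for the gopher tortoise data, and the names `hulls`, `centres` and `stars` are hypothetical.

```r
library(ggplot2)

## iris (built into R) stands in for the gopher tortoise data
dat <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

## convex hull per group: chull() returns the row indices of the hull vertices
hulls <- do.call(rbind, lapply(split(dat, dat$Species), function(d) {
    d[chull(d$Sepal.Length, d$Sepal.Width), ]
}))

## group centres (means), merged back onto the points for the "star" segments
centres <- aggregate(dat[, c("Sepal.Length", "Sepal.Width")],
                     by = list(Species = dat$Species), FUN = mean)
names(centres) <- c("Species", "cx", "cy")
stars <- merge(dat, centres, by = "Species")

hull_star <- ggplot(dat, aes(Sepal.Length, Sepal.Width, colour = Species)) +
    geom_point() +
    geom_polygon(data = hulls, fill = NA) +                          # convex hulls
    geom_segment(data = stars, aes(xend = cx, yend = cy), alpha = 0.3)  # points to centre
```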

treatment comparisons: clipping data

Data from Banta, Stevens, and Pigliucci (2010):

Easier if there is one data point per group (connect with lines), but with several points per group, as here, summarize each group first (e.g. with stat_summary):

load("../../data/Banta.RData")
## dat.tf$ltf1 <- log(dat.tf$total.fruits+1)
g6 <- ggplot(dat.tf,aes(nutrient,total.fruits,colour=gen))+
    geom_point()+
    scale_y_continuous(trans="log1p")+
    facet_wrap(~amd)+
    ## `fun` replaces the `fun.y` argument deprecated in ggplot2 3.3.0
    stat_summary(fun=mean,aes(group=interaction(popu,gen)),
                 geom="line")

If stat_summary is used with fun.data=, it can also compute confidence intervals. Try "mean_cl_boot" or "mean_cl_normal" (see ?mean_cl_boot).
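
A function passed to fun.data= takes a numeric vector and returns a data frame with columns y, ymin and ymax. The sketch below uses a hypothetical helper, `mean_ci`, that computes a normal-theory (t-based) interval, which avoids the Hmisc package that mean_cl_boot and mean_cl_normal depend on. The built-in ToothGrowth data is a stand-in for the clipping data.

```r
library(ggplot2)

## hypothetical helper: mean with a t-based confidence interval;
## returns the y / ymin / ymax columns that fun.data expects
mean_ci <- function(x, level = 0.95) {
    m <- mean(x)
    half <- qt(1 - (1 - level) / 2, df = length(x) - 1) * sd(x) / sqrt(length(x))
    data.frame(y = m, ymin = m - half, ymax = m + half)
}

## ToothGrowth (built into R): tooth length by supplement type and dose
ci_plot <- ggplot(ToothGrowth, aes(dose, len, colour = supp)) +
    geom_point(alpha = 0.3) +
    stat_summary(fun = mean, geom = "line") +
    stat_summary(fun.data = mean_ci, geom = "errorbar", width = 0.1)
```

Swapping `mean_ci` for `"mean_cl_boot"` should give bootstrap intervals instead, provided Hmisc is installed.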

Dynamic graphics:

library(plotly)
ggplotly(g6)

exercise

Pick a data set from the list available on the web page (or use your own) and create two plots that indicate the grouping in different ways.

References

Banta, Joshua A., Martin H. H. Stevens, and Massimo Pigliucci. 2010. “A Comprehensive Test of the ’Limiting Resources’ Framework Applied to Plant Tolerance to Apical Meristem Damage.” Oikos 119 (2): 359–69. https://doi.org/10.1111/j.1600-0706.2009.17726.x.

Cleveland, William. 1993. Visualizing Data. Summit, NJ: Hobart Press.

Cleveland, William S., and Robert McGill. 1984. “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Journal of the American Statistical Association 79 (387): 531–54. https://doi.org/10.2307/2288400.

———. 1987. “Graphical Perception: The Visual Decoding of Quantitative Information on Graphical Displays of Data.” Journal of the Royal Statistical Society. Series A (General) 150 (3): 192–229. https://doi.org/10.2307/2981473.

Rauser, John. 2016. “How Humans See Data.” https://www.youtube.com/watch?v=fSgEeI2Xpdc.

Lesnoff, Matthieu, Géraud Laval, Pascal Bonnet, Sintayehu Abdicho, Asseguid Workalemahu, Daniel Kifle, Armelle Peyraud, Renaud Lancelot, and François Thiaucourt. 2004. “Within-Herd Spread of Contagious Bovine Pleuropneumonia in Ethiopian Highlands.” Preventive Veterinary Medicine 64 (1): 27–40. https://doi.org/10.1016/j.prevetmed.2004.03.005.

Ozgul, Arpat, Madan K Oli, Benjamin M Bolker, and Carolina Perez-Heydrich. 2009. “Upper Respiratory Tract Disease, Force of Infection, and Effects on Survival of Gopher Tortoises.” Ecological Applications 19 (3): 786–98. http://www.ncbi.nlm.nih.gov/pubmed/19425439.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. 2nd Printing. Springer.

Wilkinson, L. 1999. The Grammar of Graphics. New York: Springer.

Data visualization, focusing on ggplot and multilevel data

Ben Bolker

20:06 24 June 2023