Packages

library(rainbow)
library(ggplot2); theme_set(theme_bw())
library(ggthemes)
library(directlabels)
theme_update(panel.spacing=grid::unit(0,"lines"))
library(cowplot) ## for arranging multiple plots, labeling, etc.
library(Hmisc)

Tukey and exploratory data analysis

Tukey: principles

  • simplicity
  • speed
  • flexibility
  • robustness
  • parsimony

stem-and-leaf plot

stem(mtcars$hp)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   0 | 5677799
##   1 | 0011111122
##   1 | 55888888
##   2 | 123
##   2 | 556
##   3 | 4

boxplot

ggplot(mtcars,aes(cyl,hp,group=cyl))+geom_boxplot()

bag plot (2D boxplot)

rainbow::fboxplot(data = ElNino,
                  plot.type = "bivariate",
                  type = "bag", projmethod="PCAproj")

is Tukey still relevant?

  • yes (principles)
  • simultaneous increase in data size/complexity and computing power

Cleveland

principles

  • accuracy of quantitative representation
  • visual estimation of differences

perceptual experiments

  • show participants the same data in different formats
  • ask them questions about relative magnitudes

perceptual experiments: results

is Cleveland still relevant?

  • yes!
  • Elliott (2016), "39 studies about human perception in 30 minutes"
    • healthy tradition of scientific experiments on graphical perception
      • accuracy
      • memory
      • preference

Heer and Bostock (2010)

Tufte

Tufte principles

  • functional, minimal graphics
  • maximize data-ink / minimize non-data-ink
  • don't lie (lie factor)
  • small multiples
  • "If a picture isn't worth 1000 words, the hell with it" - Ad Reinhardt
  • information at the point of need (legends etc.)
  • PowerPoint sucks
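
The lie factor can be computed directly: it is the size of the effect shown in the graphic divided by the size of the effect in the data. A minimal sketch, assuming a made-up helper name `lie_factor`; the numbers are recalled from Tufte's frequently cited fuel-economy example and should be treated as illustrative.

```r
## hypothetical helper (not from any package): relative change shown in the
## graphic divided by relative change in the data; values near 1 are honest
lie_factor <- function(data_from, data_to, graphic_from, graphic_to) {
  effect_data    <- (data_to - data_from) / data_from
  effect_graphic <- (graphic_to - graphic_from) / graphic_from
  effect_graphic / effect_data
}
## data: 18 -> 27.5 mpg; graphic: line lengths 0.6 -> 5.3 inches
lf <- lie_factor(18, 27.5, 0.6, 5.3)
round(lf, 1)
```

A graphic that represents the data proportionally has a lie factor close to 1; values well above 1 exaggerate the effect.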

data ink

  • maximize data ink (within reason)

g0 <- ggplot(OrchardSprays, aes(treatment, decrease)) + scale_y_log10()
print(plot_grid(g0 + geom_boxplot(), g0 + geom_tufteboxplot()))

ggthemes::geom_tufteboxplot()

information at the point of need

  • less eye movement is better
  • direct labels > legends > info in caption > info in text

g1 <- ggplot(iris, aes(Sepal.Length, Petal.Length, colour = Species,
                       shape = Species)) + geom_point()
print(plot_grid(g1, direct.label(g1)))

directlabels package

other

Rules of thumb

  • (Continuous) response on the \(y\)-axis
    • assumes we have a single, quantitative/ordered (continuous or discrete) response variable; multivariate responses are more challenging
  • put most salient predictor on the \(x\)-axis
    • highest value in Cleveland hierarchy
    • if most important predictor is categorical, use most important continuous predictor on \(x\)-axis
    • if most important predictor has few categories, use next most important predictor with many categories

Rules of thumb (continued)

  • Put most salient comparisons within the same subplot (distinguished by color/shape), and nearby within the subplot when grouping bars/points
  • Facet rows > facet columns

Rules of thumb (3)

  • Use transparency to include important but potentially distracting detail
  • Do category levels need to be identified or just distinguished? (Direct labeling, e.g. via directlabels package)
  • Order categorical variables meaningfully ("Alabama/Alberta" problem)
  • Think about whether to display population variation (standard deviations, boxplots) or estimation uncertainty (standard errors, mean \(\pm\) 2 SE, boxplot notches)
  • Try to match graphics to statistical analysis, but not at all costs
  • Choose colors carefully (RColorBrewer/ColorBrewer, IWantHue); respect dichromats and B&W printouts (see the dichromat, colorblindr, and cividis packages; Sciani 2018)
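
The "Alabama/Alberta" problem arises because factor levels default to alphabetical order. One possible fix, sketched here with the built-in `InsectSprays` data (a dataset chosen for illustration, not one used elsewhere in these slides), is to reorder the levels by a summary of the response with `reorder()`.

```r
library(ggplot2)
## factor levels default to alphabetical order (A, B, ..., F);
## reorder() sorts them by a summary of the response instead
is_ord <- transform(InsectSprays,
                    spray = reorder(spray, count, FUN = median))
levels(is_ord$spray)   ## levels now sorted by median count
g_ord <- ggplot(is_ord, aes(spray, count)) + geom_boxplot()
print(g_ord)
```

Ordering by the median is only one choice; any ordering that reflects the structure of the data (dose, geography, time) may be more meaningful.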

Data presentation scales with data size

  • small: show all points, possibly dodged/jittered, with some summary statistics: dotplot, beeswarm. Simple trends (linear/GLM/loess)
  • medium: boxplots, loess, histograms, GAM (or linear regression)
  • large: modern nonparametrics: violin plots, hexbin plots, kernel densities. Computational burden and overplotting become relevant
  • combinations or overlays where appropriate (beanplot; rugs+scatterplot)
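
A rough sketch of the two ends of this scale, assuming the built-in `OrchardSprays` data (64 observations) as the small case and `ggplot2::diamonds` (tens of thousands of rows) as the large case. `geom_bin2d()` stands in for a hexbin plot here only to avoid an extra dependency on the hexbin package.

```r
library(ggplot2)
## small data: show every point, jittered, plus group medians
g_small <- ggplot(OrchardSprays, aes(treatment, decrease)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.6) +
  stat_summary(fun = median, geom = "point", colour = "red", size = 3) +
  scale_y_log10()
## large data: individual points overlap, so summarize 2D density instead
g_large <- ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 50) +
  scale_x_log10() + scale_y_log10()
print(g_small)
print(g_large)
```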

examples

Notes

a. the dreaded "dynamite plot". Problems:

  • bar plot on logarithmic axis is inappropriate (anchors graph to arbitrary zero point)
  • assumes distribution is symmetric (although this applies to b,c as well)
  • some forms of this plot show only top whisker (makes comparison even harder)

b. inferential (point \(\pm\) 2 SE) plot

  • same assumptions as dynamite plot
  • less strongly anchored to zero

c. points \(\pm\) 1 and 2 SE

  • de-emphasizes approximate 95% CI
  • equivalent for Bayesian posterior intervals would typically show both 50% and 95% credible intervals (based on quantiles or highest posterior density)
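
A plot in the style of (c) might be built as follows. This is a sketch only: the helper `sem` and the data frame `summ` are names invented here, and the summaries are computed on the log10 scale as an assumption, to match the log axis used in the earlier `OrchardSprays` plots.

```r
library(ggplot2)
## hypothetical helper: standard error of the mean
sem <- function(x) sd(x) / sqrt(length(x))
## per-treatment mean and SE of log10(decrease)
summ <- do.call(rbind, lapply(
  split(OrchardSprays, OrchardSprays$treatment),
  function(d) {
    y <- log10(d$decrease)
    data.frame(treatment = d$treatment[1], mean = mean(y), se = sem(y))
  }))
g_se <- ggplot(summ, aes(treatment, mean)) +
  ## thin line: +/- 2 SE (approximate 95% CI)
  geom_linerange(aes(ymin = mean - 2 * se, ymax = mean + 2 * se)) +
  ## thick line: +/- 1 SE
  geom_linerange(aes(ymin = mean - se, ymax = mean + se), linewidth = 2) +
  geom_point(size = 3) +
  labs(y = "log10(decrease)")
print(g_se)
```

Drawing the 1 SE interval more heavily than the 2 SE interval is what shifts visual emphasis away from the approximate 95% CI.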

Notes (continued)

d. points alone

  • true to the data
  • description only; provides no inferential help
  • can confound sample size and range (larger samples have more extreme values so look more variable)
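
The confounding of sample size and range can be checked with a small simulation. The sketch below (seed, sample sizes, and the helper name `mean_range` are arbitrary choices) draws samples of two sizes from the same standard normal distribution and compares their average observed range.

```r
set.seed(101)  ## arbitrary seed, for reproducibility only
## average observed range (max - min) over nsim samples of size n,
## all drawn from the same standard normal distribution
mean_range <- function(n, nsim = 500) {
  mean(replicate(nsim, diff(range(rnorm(n)))))
}
r_small <- mean_range(10)
r_large <- mean_range(1000)
c(n10 = r_small, n1000 = r_large)
```

Both sets of samples have the same underlying standard deviation, yet the larger samples span a wider range, so a plot of raw points alone makes them look more variable.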

e. boxplots

  • well-established
  • "outliers" can be misleading (Dawson 2011)
  • can add notches to indicate approximate 95% CI on medians (McGill, Tukey, and Larsen 1978)
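
In base R, the notch limits are the median \(\pm\) 1.58 IQR/\(\sqrt{n}\), with the IQR taken between the hinges (see `?boxplot.stats`). The sketch below reproduces them by hand for `mtcars$hp`; the variable names are illustrative.

```r
x <- mtcars$hp
bs <- boxplot.stats(x)
## distance between lower and upper hinges (elements 2 and 4 of $stats)
h_spread <- diff(bs$stats[c(2, 4)])
## median (element 3) +/- 1.58 * spread / sqrt(n)
notch_manual <- bs$stats[3] + c(-1.58, 1.58) * h_spread / sqrt(bs$n)
notch_manual
bs$conf   ## notch limits as computed by boxplot.stats()
```

In ggplot2, `geom_boxplot(notch = TRUE)` draws notched boxplots.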

f. violin plots

  • mirror-image density plots
  • best for large data sets
  • may be funky for small/medium data sets
  • can be combined with jittered data, segments indicating median/quantiles, etc.
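
One way to combine a violin plot with the raw data and a median marker is sketched below, again using `OrchardSprays`. With only 8 observations per treatment, this also illustrates how violins can look odd for small groups.

```r
library(ggplot2)
g_violin <- ggplot(OrchardSprays, aes(treatment, decrease)) +
  geom_violin(fill = "grey90") +                       ## mirrored density
  geom_jitter(width = 0.1, height = 0, alpha = 0.5) +  ## raw data
  stat_summary(fun = median, geom = "point",
               colour = "red", size = 3) +             ## group medians
  scale_y_log10()
print(g_violin)
```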

Example

References

Dawson, Robert. 2011. “How Significant Is a Boxplot Outlier?” Journal of Statistics Education 19 (2): 1–12.

Elliott, Kennedy. 2016. “39 Studies About Human Perception in 30 Minutes.” Medium. https://medium.com/@kennelliott/39-studies-about-human-perception-in-30-minutes-4728f9e31a73.

Heer, Jeffrey, and Michael Bostock. 2010. “Crowdsourcing Graphical Perception: Using Mechanical Turk to Assess Visualization Design.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 203–12. ACM.

McGill, Robert, John W. Tukey, and Wayne A. Larsen. 1978. “Variations of Box Plots.” The American Statistician 32 (1): 12–16. doi:10.2307/2683468.

Sciani, Marco. 2018. “Cividis: Implementation of the Matplotlib ’Viridis’ Color Map in R (Lite Version).” https://github.com/marcosci/cividis.