Course structure

Course goals

General introduction to data viz principles and tools

Course structure

  • lectures from professors on basic ideas (first half)
  • in-class work
  • homework
  • lectures from students on topics/advanced ideas (second half)

Tools

Version control

  • Git: distributed version control system
  • GitHub: centralized hosting service for Git repositories
    • alternatives: BitBucket, GitLab, …
  • Git clients: software for working with Git on your computer
    • command-line (e.g. git add foo.rmd)
    • RStudio
    • others (GitHub Desktop etc.)

Basic Git workflow with RStudio

  • create repository on GitHub
  • copy repository to local machine
    • git clone
    • RStudio: File > New Project > Version Control > Git > paste the repository URL from the "Clone" button on GitHub

  • repeat:
    • pull (fetch and integrate changes from GH) [git pull]
      • RStudio: Git panel > click blue down-arrow
    • do stuff (create, edit files, etc.)
    • stage [git add]
      • RStudio: Git panel > tick the "Staged" checkbox next to each changed file
    • commit [git commit]
      • RStudio: Git panel > click "Commit" icon >
        enter commit message > click "Commit" button (ignore "amend previous commit" button!)
    • push [git push]
      • RStudio: Git panel > click green up-arrow

tidyverse

  • set of R packages: https://www.tidyverse.org/
  • advantages
    • expressiveness
    • speed
    • new hotness
  • disadvantages
    • minor incompatibilities with base R
    • rapid evolution
    • non-standard evaluation

tidyverse: big ideas

  • new verbs
  • piping
  • tibbles

tidyverse: new verbs

  • filter(x,condition): choose rows
    • equivalent to subset(x,condition) or x[condition,] (with non-standard evaluation)
  • select(x,condition): choose columns
    • equivalent to subset(x,select=condition) or x[,condition]
    • helper functions such as starts_with(), matches()
  • mutate(x,var=...): change or add variables (equivalent to x$var = ... or transform(x,var=...))
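
A minimal sketch of the three verbs next to their base R equivalents. The data frame d and its columns (course, score, score_max) are hypothetical toy data, not course material; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
d <- data.frame(
  course    = c("stats", "stats", "viz", "viz"),
  score     = c(70, 85, 90, 60),
  score_max = c(100, 100, 100, 100)
)

# filter(): keep the rows where the condition is TRUE
high_dplyr <- filter(d, score > 65)
high_base  <- subset(d, score > 65)

# select(): keep columns, here chosen with the starts_with() helper
scores_dplyr <- select(d, starts_with("score"))
scores_base  <- d[, c("score", "score_max")]

# mutate(): add (or change) a variable
frac_dplyr <- mutate(d, frac = score / score_max)
frac_base  <- transform(d, frac = score / score_max)
```

Inside the dplyr calls, score and score_max are looked up in d rather than in the calling environment; that is the non-standard evaluation mentioned above.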

tidyverse: split-apply-combine

  • group_by(): adds grouping information
  • summarise(): collapses each variable to a single value per group
  • e.g.
x <- group_by(x,course)
summarise(x,mean_score=mean(score),sd_score=sd(score))
  • equivalent to plyr::ddply() or
d_split <- split(d,d$var)       ## split
d_proc <- lapply(d_split, ...)  ## apply
d_res <- do.call(rbind,d_proc)  ## combine
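
The two approaches above can be run side by side as a rough sketch. The data frame d and its columns (course, score) are hypothetical, and the function passed to lapply() is only one possible way to fill in the "..." of the apply step; the sketch assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
d <- data.frame(
  course = c("a", "a", "b", "b"),
  score  = c(60, 80, 70, 90)
)

# tidyverse: add grouping information, then collapse each group to one row
res_tidy <- summarise(group_by(d, course),
                      mean_score = mean(score),
                      sd_score   = sd(score))

# base R: the same computation as explicit split / apply / combine steps
d_split <- split(d, d$course)                       ## split
d_proc  <- lapply(d_split, function(g) {            ## apply
  data.frame(course     = g$course[1],
             mean_score = mean(g$score),
             sd_score   = sd(g$score))
})
d_res <- do.call(rbind, d_proc)                     ## combine
```

Both results hold one row per course with the same means and standard deviations; the base R version keeps the group names as row names, while the tidyverse version returns a tibble.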

tidyverse: piping

  • new %>% operator (orig. from magrittr package)
  • directs result of previous operation to next function, as first argument
  • e.g.
(d_input
    %>% select(col1,col2)
    %>% filter(cond1,cond2)
    %>% mutate(...)
) -> d_output
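
A runnable version of the pipeline pattern above, as a sketch only. The data frame d_input, its columns (id, x, y, z) and the filter conditions are hypothetical stand-ins for the placeholders; the sketch assumes dplyr is installed (it re-exports %>%).

```r
library(dplyr)

# Hypothetical toy data, invented for illustration
d_input <- data.frame(
  id = 1:4,
  x  = c(1, 5, 3, 8),
  y  = c(2, 2, 4, 4),
  z  = letters[1:4]
)

# Piped version: each result becomes the first argument of the next call
(d_input
    %>% select(x, y)
    %>% filter(x > 2, y > 2)
    %>% mutate(ratio = x / y)
) -> d_output

# The same computation written as nested calls, read inside-out
d_nested <- mutate(filter(select(d_input, x, y), x > 2, y > 2), ratio = x / y)
```

The outer parentheses are what allow each line to start with %>%; without them R would treat the first line as a complete expression.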

tidyverse: tibbles

  • extension of data frames (sort of)
  • differences
    • printing
      • only prints first few rows/columns
      • labels columns by type
    • no rownames
    • never drops dimensions (tib[,"column1"] is still a tibble)
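
A short sketch of the dimension-dropping difference, assuming the tibble package is installed. The column names column1 and column2 and their values are hypothetical.

```r
library(tibble)

# Hypothetical toy data, invented for illustration
df  <- data.frame(column1 = 1:3, column2 = c("a", "b", "c"))
tib <- as_tibble(df)

# Data frame: selecting a single column with [ , ] drops to a plain vector
df_col <- df[, "column1"]

# Tibble: the same indexing keeps a one-column tibble
tib_col <- tib[, "column1"]

# To get the underlying vector from a tibble, use [[ ]] or $
tib_vec <- tib[["column1"]]
```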

tidyverse: reshaping (tidyr package)

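  • gather(data,key,value,<include/exclude>)
    • wide to long
    • equivalent to reshape2::melt()
    • superseded by pivot_longer() in tidyr 1.0.0 and later
  • spread(data,key,value)
    • long to wide
    • equivalent to reshape2::dcast()
    • superseded by pivot_wider() in tidyr 1.0.0 and later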
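
A round-trip sketch of the two reshaping functions, assuming the tidyr package is installed. The data frame wide and its columns (student, test1, test2) are hypothetical. Newer code would more likely use pivot_longer() and pivot_wider(), but gather() and spread() are still available in tidyr.

```r
library(tidyr)

# Hypothetical toy data in wide format: one column per test
wide <- data.frame(
  student = c("s1", "s2"),
  test1   = c(70, 80),
  test2   = c(75, 85)
)

# Wide to long: stack every column except student into key/value pairs
long <- gather(wide, key = "test", value = "score", -student)

# Long to wide: spread the key column back out into separate columns
wide_again <- spread(long, key = "test", value = "score")
```

Here -student is the exclude specification: every other column is gathered, giving one row per student and test.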

types of data visualization

exploratory

  • find patterns in data, explore hypotheses
  • emphasize robust approaches
  • minimize (parametric) assumptions
  • Tukey, Cleveland

diagnostic

  • evaluate assumptions of a model
    • normality
    • homoscedasticity
    • lack of bias/goodness of fit
  • easily spot deviations
  • identify outliers and influential points

inferential

  • coefficient plots
  • replacement for tables
  • also: graphical tests of inference (Wickham et al. 2010)
  • Gelman

expository: data-viz

  • tell an accurate story
  • high information density
  • Tufte, Cleveland

presentation: info-viz

  • grab attention/engage/sell/entertain
  • "puzzle" graphics

dashboards

  • present a quick overview of a data set
  • user control

dynamic

  • engage
  • allow reader to drill down
  • Cook

References

Wickham, H., et al. 2010. "Graphical Inference for Infovis." IEEE Transactions on Visualization and Computer Graphics 16 (6) (November): 973–979. doi:10.1109/TVCG.2010.161.