class: top, right, title-slide

# Best practices for [TYPE] projects
## TYPE: research / data analysis / scientific computing
### Sina Rüeger
### 2018-12-04

---

<!-- From here: https://slides.yihui.name/xaringan/ -->

---
class: left, middle

# install.packages(...)

- janitor
- formatR
- styler
- bannerCommenter
- pRojects
- ProjectTemplate

# Material

bit.ly/2QfkOGZ

---
class: left, middle

# About me

- Background in Data Analysis & Engineering.
- In the past: I worked as a research assistant in data science, in epidemiology and in statistical genetics.
- Right now: PostDoc @ EPFL
Analysis of genetic data in infectious diseases at the [Fellay Lab](https://fellay-lab.epfl.ch/).
- R-Ladies Lausanne co-organiser.

---
class: left, middle

# Inspiration

- [**R-Ladies Remote Journal Club**](https://twitter.com/acastillogill/status/1067699852048510976) where we discussed *Good enough practices in scientific computing* [Wil+17].
- [**rOpenSci Community call**](https://ropensci.org/commcalls/2018-10-16/): *Code Review in the Lab* [Ope18] and the [summary](https://ropensci.org/blog/2018/11/29/codereview/) of it.
- Plus all the references listed throughout the presentation.

---
class: center, middle, inverse

# Life cycle of (data analysis driven research) projects

---
class: left, middle

<img border="0" alt="" src="img/dailyworkflow/dailyworkflow.001.png" width="800">

.small[Adapted from Figure in [R4DS book](http://r4ds.had.co.nz/explore-intro.html)]

---
class: left, middle

<img border="0" alt="" src="img/dailyworkflow/dailyworkflow.002.png" width="800">

.small[Adapted from Figure in [R4DS book](http://r4ds.had.co.nz/explore-intro.html)]

---
class: left, middle

<img border="0" alt="" src="img/dailyworkflow/dailyworkflow.003.png" width="800">

.small[Adapted from Figure in [R4DS book](http://r4ds.had.co.nz/explore-intro.html)]

---
class: left, middle

# Goal in research

### 1. Figuring out stuff.
### 2. Telling others.

- Making it easy for yourself & colleagues to **rerun** (and **understand**) the project → *"repeatability"*.
- Publishing code:
    - making it easy for others to **rerun** and to **understand** the project → [*"reproducibility"*](https://twitter.com/jtleek/status/759822823552606208).
    - making it easy for others to rerun the code **with different data** → [*"replicability"*](https://twitter.com/jtleek/status/759822823552606208).

---
class: center, middle

# ...extremely well organised scientists

<a href="https://docs.google.com/presentation/d/1VK1hngMZSY3FT2SrDd4_AHiB28CHrsuSsaFr7r3SAL8/edit#slide=id.p">
<img border="0" alt="" src="img/heidibaya.png" width="800">
</a>

.small[Extract from presentation by Heidi Seibold @HeidiBaya on [*Tools for reproducibility in Statistics and Machine Learning*](https://docs.google.com/presentation/d/1VK1hngMZSY3FT2SrDd4_AHiB28CHrsuSsaFr7r3SAL8/edit#slide=id.p)]

---
class: left, middle
exclude: true

# Problems

- Computers (servers, OS)
- Trained for domain x and not computing

---
class: left, middle

# Best practices

.small[Box 1 from *Good enough practices in scientific computing* [Wil+17]]

.pull-left[
<img border="0" alt="" src="img/bp-box1.png" width="400">
]

.pull-right[
<img border="0" alt="" src="img/bp-box2.png" width="400">
]

---
class: left, middle

# Focus on:

### Data management
### Code
### Project organisation
### Version control

.small[A selection from *Good enough practices in scientific computing* [Wil+17]]

---
class: left, middle
exclude: true

# What are solutions?

### Data Management

- `janitor`

### Code

- Style guides (`lintr`, `formatR`, `styler`)
- `bannerCommenter`
- Work with RStudio projects and `here::here()`

### Project Organisation

- runall script: `drake`
- directory structure: `ProjectTemplate`, `pRojects`, `rrtools`

### Version control / git

---
class: left, middle

# Creating *reviewable* projects

- Having an **overview of the analysis and its iteration steps** → cleaning, modelling, visualisation, reports.
- Well-styled code makes **code review** easier.

<!----------------------------------->

---
class: inverse, center, middle

.large[Data Management]

---
class: left, middle

## Create data

- *"Create the data you wish to see in the world"* [Wil+17]
- Pay attention to **variable names**.
- Use a **data dictionary** / codebook (variable meta data).

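---
class: left, middle

### Sketch: a minimal data dictionary

A data dictionary can start as a small table with one row per variable. The sketch below uses base R only; the toy data, the descriptions and the object names `dat` and `dictionary` are hypothetical and only illustrate the idea, they are not taken from [Wil+17].

```r
## Toy data with clean variable names (hypothetical example values)
dat <- data.frame(
  id = 1:3,
  sepal_length = c(5.1, 4.9, 6.3),
  species = c("setosa", "setosa", "virginica"),
  stringsAsFactors = FALSE
)

## One row per variable: name, storage type and a human-readable description
dictionary <- data.frame(
  variable = names(dat),
  type = unname(vapply(dat, function(x) class(x)[1], character(1))),
  description = c(
    "Unique flower identifier",
    "Sepal length (unit assumed to be cm)",
    "Species name"
  ),
  stringsAsFactors = FALSE
)

## Fail early if the data and its dictionary drift apart
stopifnot(identical(dictionary$variable, names(dat)))

dictionary
```

Storing such a table next to the data (for example as a CSV file) is one possible way to keep the variable meta data reviewable.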
---
class: left, middle

## library(janitor)

- cleaning data
- `tabyl()`

```r
library(magrittr) ## provides the %>% pipe used on the following slides
dat <- readr::read_delim("janitor-example.txt", delim = " ")
tail(dat)
```

```
# A tibble: 6 x 7
     id sepal.length `Sepal Width` Petal.Length Petal_WIDTH Species   ` `
  <int>        <dbl>         <dbl>        <dbl>       <dbl> <chr>     <chr>
1   147          6.3           2.5          5           1.9 virginica <NA>
2   148          6.5           3            5.2         2   virginica <NA>
3   149          6.2           3.4          5.4         2.3 virginica <NA>
4   150          5.9           3            5.1         1.8 virginica <NA>
5    NA         NA            NA           NA          NA   <NA>      <NA>
6    NA         NA            NA           NA          NA   <NA>      <NA>
```

---
class: left, middle

### Remove empty columns and rows (1/2)
### remove_empty()

```r
dim(dat)
```

```
[1] 152   7
```

```r
## drop rows and columns that contain only missing values
janitor::remove_empty(dat, which = c("rows", "cols")) %>% dim()
```

```
[1] 150   6
```

---
class: left, middle

### Apply consistent variable naming (2/2)
### clean_names()

```r
dat %>% names()
```

```
[1] "id"           "sepal.length" "Sepal Width"  "Petal.Length"
[5] "Petal_WIDTH"  "Species"      " "
```

```r
janitor::clean_names(dat, case = "snake") %>% names()
```

```
[1] "id"           "sepal_length" "sepal_width"  "petal_length"
[5] "petal_width"  "species"      "x"
```

```r
janitor::clean_names(dat, case = "lower_camel") %>% names()
```

```
[1] "id"          "sepalLength" "sepalWidth"  "petalLength" "petalWidth"
[6] "species"     "x"
```

---
class: left, middle

## More

- `library(readxl)` and `library(googlesheets)` to work with spreadsheet data.
- [`library(dataMeta)`](https://cran.r-project.org/web/packages/dataMeta/vignettes/dataMeta_Vignette.html) to create meta data of variables.

<!----------------------------------->

---
class: inverse, center, middle

.large[(Styling) Code]

---
class: left, middle

## Why style code?

Well *styled* code makes reviewing easier.

---
class: left, middle

## Built-in options in RStudio (1/4)

- **Reindent** Lines (Cmd+I) → proper indenting
- **Reformat** Code (Cmd+Shift+A) → adds spaces, breaks long horizontal code
- ↪ apply both in [`mycode.R`:](https://github.com/sinarueeger/20181204-r-lunchs-gva/blob/master/mycode.R)

---
class: left, middle

## Style guides (2/4)

Style guides: a set of opinionated rules.

- [`library(formatR)`](https://yihui.name/formatr/)
- [`library(styler)`](https://github.com/r-lib/styler)

---
class: left, middle

### library(formatR)

Some more details on [Yihui Xie's page](https://yihui.name/formatr/) (make sure you read *6. Further Notes* about all the bad things that can happen).

- `formatR::tidy_file("mycode.R")` rewrites the R file in place
- `formatR::tidy_source("mycode.R")` prints the styled R code and leaves the file unchanged

---
class: left, middle

### Bad code

```r
make.background=function(N,filename ){
cols <- c("#88398A","#B07FB2","#F7CCA4","#C8C05E","#627C63")
## generated here: http://colormind.io/
## colors
set.seed(3)
df=data.frame(x=runif(N), y =runif(N), col = sample(cols, N, replace = TRUE), size=runif(N, 1, 10))
## no border
qp <- ggplot(df, aes(x, y))+ geom_point(aes(size =size, fill =I(col), color = I(col)), alpha = 0.4, shape = 16)+theme_void()+theme(legend.position="none")+scale_size_continuous(range = c(1, 30))
return(qp)
}
```

### Apply formatR::tidy_source()

```r
formatR::tidy_source("mycode.R", blank = FALSE, arrow = TRUE, width.cutoff = 50)
```

---
class: left, middle

### Better code

```r
make.background <- function(N, filename) {
    cols <- c("#88398A", "#B07FB2", "#F7CCA4", "#C8C05E",
        "#627C63")
    ## generated here: http://colormind.io/
    ## colors
    set.seed(3)
    df <- data.frame(x = runif(N), y = runif(N), col = sample(cols,
        N, replace = TRUE), size = runif(N, 1, 10))
    ## no border
    qp <- ggplot(df, aes(x, y)) + geom_point(aes(size = size,
        fill = I(col), color = I(col)), alpha = 0.4,
        shape = 16) + theme_void() + theme(legend.position = "none") +
        scale_size_continuous(range = c(1, 30))
    return(qp)
}
```

---
class: left, middle

### library(styler)

- Follows tidyverse styling rules.
- RStudio add-in
- ↪ [`mycode.R`:](https://github.com/sinarueeger/20181204-r-lunchs-gva/blob/master/mycode.R) select parts, apply `style selection`
- ↪ [`mycode.R`:](https://github.com/sinarueeger/20181204-r-lunchs-gva/blob/master/mycode.R) apply `style active file`

---
class: left, middle

## Linter for R code (3/4)

- `library(lintr)`
- pretty sophisticated, static code analysis
- checks style, syntax and semantics
- ↪ https://github.com/jimhester/lintr

```r
lintr::lint("mycode.R")
```

<img border="0" alt="" src="img/lintr.png" width="500">

---
class: left, middle

## library(bannerCommenter) (4/4)

- For a consistent **commenting style**.
- Check out the [bannerCommenter-vignette](https://cran.r-project.org/web/packages/bannerCommenter/vignettes/Banded_comment_maker.pdf) for more options.
- Main function is `bannerCommenter::banner()`

---
class: left, middle

```r
bannerCommenter::banner("This is an R-script Title", "and this is a subtitle", emph = TRUE, numLines = 3)
```

```
###########################################################################
###########################################################################
###########################################################################
###                                                                     ###
###                                                                     ###
###                      THIS IS AN R-SCRIPT TITLE                      ###
###                       AND THIS IS A SUBTITLE                        ###
###                                                                     ###
###                                                                     ###
###########################################################################
###########################################################################
###########################################################################
```

---
class: left, middle

```r
bannerCommenter::banner("Section", centre = TRUE, bandChar = "/", minHashes = 70)
```

```
##/////////////////////////////////////////////////////////////////////
##                              Section                              //
##/////////////////////////////////////////////////////////////////////
```

---
class: left, middle

```r
bannerCommenter::banner("Subsection", centre = FALSE, snug = TRUE, bandChar = "-")
```

```
##----------------
##  Subsection  --
##----------------
```

---
class: left, middle

```r
bannerCommenter::open_box("Important note", bandChar = ":")
```

```
##::::::::::::::::::
##  Important note
##::::::::::::::::::
```

---
class: left, middle

### Many more options

- *emph*: bigger, bolder banner
- *snug*: box close fitting
- *upper*: text in upper case
- *centre*: text in centre
- *leftSideHashes*: number of hash characters to the left
- *rightSideHashes*: number of hash characters to the right
- *minHashes*: length of bands (e.g. 80)
- *numLines*: number of lines
- *bandChar*: type of character

---
class: left, middle

### Tip: Make it a habit with gist.github.com

For example: https://gist.github.com/sinarueeger/ee42a74cbcb1d834e906199b26bc19f0

```
## Add banner comments to R scripts using
## [bannerCommenter](https://cran.r-project.org/web/packages/bannerCommenter/vignettes/Banded_comment_maker.pdf).

## header -----------------------------
bannerCommenter::banner("G2G-EBV analysis using GASTON", emph = TRUE)

## section -----------------------------
bannerCommenter::banner("0. Setup", centre = TRUE, bandChar = "/")

## subsection -----------------------------
bannerCommenter::banner("0.1 Setup", centre = FALSE, bandChar = "-")

## note -----------------------------
bannerCommenter::open_box("Variables need to be defined", bandChar = ":")
```

---
class: left, middle

## More

- Work with RStudio projects and `here::here()` (if you need convincing, check out [this blog post](https://malco.io/2018/11/05/why-should-i-use-the-here-package/)).
- More about style guides:
    - http://adv-r.had.co.nz/Style.html
    - http://jef.works/R-style-guide/#file-names
    - https://nicercode.github.io/intro/bad-habits.html

<!----------------------------------->

---
class: inverse, center, middle

.large[Project organisation]

---
class: left, middle

### R package folder structure

<a href="http://r-pkgs.had.co.nz/package.html">
<img border="0" alt="" src="img/package-files.png" width="500">
</a>

.small[Figure from http://r-pkgs.had.co.nz/package.html.]

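---
class: left, middle

### Sketch: a folder skeleton in base R

The idea behind the package layout, a fixed set of folders that each have a known purpose, can also be applied to an analysis project by hand. A minimal sketch, assuming hypothetical folder names and a throwaway location under `tempdir()`; none of these names are prescribed by any package.

```r
## Hypothetical folder names for an analysis project
dirs <- c("data/raw", "data/processed", "src", "report")

## Use a temporary location so the sketch does not touch real projects
root <- file.path(tempdir(), "yourproject")

## recursive = TRUE also creates missing parent folders such as data/
for (d in dirs) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}

## A top-level README is a natural place to document what lives where
writeLines("# yourproject", file.path(root, "README.md"))

list.files(root, recursive = TRUE, include.dirs = TRUE)
```

Keeping such a script around makes it cheap to start every project with the same layout.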
---
class: left, middle

## Setting up a project (1/2)

1. [`library(ProjectTemplate)`](http://projecttemplate.net/architecture.html)
2. [`library(pRojects)`](https://itsalocke.com/projects/)

- Directory structure
- Set up version control
- ...

---
class: left, middle

### library(ProjectTemplate)

- ↪ http://projecttemplate.net/architecture.html
- ↪ http://projecttemplate.net/getting_started.html

```r
ProjectTemplate::create.project(project.name = "TB.analysis")
```

.pull-left[
```
TB.analysis/
├── README.md
├── TODO
├── cache
│   └── README.md
├── config
│   ├── README.md
│   └── global.dcf
├── data
│   └── README.md
├── diagnostics
│   ├── 1.R
│   └── README.md
├── docs
│   └── README.md
├── graphs
│   └── README.md
```
]

.pull-right[
```
├── lib
│   ├── README.md
│   ├── globals.R
│   └── helpers.R
├── logs
│   └── README.md
├── munge
│   ├── 01-A.R
│   └── README.md
├── profiling
│   ├── 1.R
│   └── README.md
├── reports
│   └── README.md
├── src
│   ├── README.md
│   └── eda.R
└── tests
    ├── 1.R
    └── README.md
```
]

---
class: left, middle

The minimal template creates only: `cache, config, data, munge, README.md, src`

```r
create.project(project.name = "TB.analysis", template = "minimal")
```

```r
## cd into TB.analysis
load.project()
```

---
class: left, middle

### library(pRojects)

- Create a project layout for an analysis project
- includes support for git, packrat
- ↪ https://itsalocke.com/projects/

```r
pRojects::createAnalysisProject(name = "TB.analysis")
```

```
TB.analysis/
├── DESCRIPTION
├── R
├── TB.analysis.Rproj
└── packrat
    ├── init.R
    ├── lib
    │   └── x86_64-apple-darwin15.6.0
    │       └── 3.5.1
    ├── packrat.lock
    ├── packrat.opts
    └── src
        └── packrat
            └── packrat_0.4.9-3.tar.gz
```

---
class: left, middle

Customise:

```r
pRojects::createAnalysisProject(name = "TB.analysis", dirs = c("data", "docs", "output", "src", "report"))
```

---
class: left, middle

### Self baked

.pull-left[
```
yourproject/
├── README.md
├── bin
├── data
│   ├── raw
│   ├── processed
│   └── results
├── doc
├── img
├── src
│   ├── A_dataprep.R
│   ├── B_fit.R
│   └── functions.R
├── report
│   ├── report-intro.Rmd
│   └── report-analysis.Rmd
└── yourproject.Rproj
```
]

.pull-right[
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">src/ : folder for scripts<br>data/ : original data pulled from database<br>output/ : intermediate RDS data objects needed for Rmd<br>analysis/: Rmd files and HTML output<br>doc/ : any long-form notes to self or documentation<br>ext/: external images or other random files I want to keep in proj</p>&mdash; Emily Riederer (@EmilyRiederer) <a href="https://twitter.com/EmilyRiederer/status/1038982918033604609?ref_src=twsrc%5Etfw">September 10, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
]

---
class: left, middle

## Runall script (2/2)

Helps you with:

- **Clear instructions** → contains a sort of **recipe** of the analysis.
- **Modular code** → makes you use **functions** instead of free floating code.
- (**Minimising** redundant computation → **caching** results.)

---
class: left, middle

### Where

```
yourproject/
├── README.md
├── runall
...
```

### Types

- runall script in `.R`
- runall script in `.sh`
- Makefile

---
class: left, middle

## runall script in `.R`

`runall.R`:

```
source("src/A_dataprep.R")
source("src/B_fit.R")
rmarkdown::render("report/report-intro.Rmd")
rmarkdown::render("report/report-analysis.Rmd")
```

---
class: left, middle

## runall script in `.sh`

`runall.sh`:

```
#!/bin/bash
Rscript --vanilla src/A_dataprep.R
Rscript --vanilla src/B_fit.R
Rscript --vanilla -e "rmarkdown::render(\"report/report-intro.Rmd\")"
Rscript --vanilla -e "rmarkdown::render(\"report/report-analysis.Rmd\")"
```

- In terminal run `chmod a+x runall.sh` to make it executable.
- Run it with `./runall.sh`

---
class: left, middle

## Makefile

- **Build automation tool** that automatically builds targets
- ↪ https://kbroman.org/minimal_make/

---
class: left, middle

## More

- [`library(workflowr)`](https://github.com/jdblischak/workflowr): similar to `pRojects` and `ProjectTemplate`.
- [`library(remake)`](https://github.com/richfitz/remake): Make-like declarative runall scripts.
- `library(drake)`: [drake github repo](https://github.com/ropensci/drake) and [my blogpost](https://sinarueeger.github.io/2018/10/09/workflow/).

<!----------------------------------->

---
class: inverse, center, middle

.large[Version control]

---
class: left, middle

# Version control

- http://happygitwithr.com/bingo.html
- Get private repositories (free for students on GitHub)
- Work with others

---
class: left, middle

# More

### library(rrtools): Packaging Data Analytical Work

- [*Packaging Data Analytical Work Reproducibly Using R (and Friends)*](https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375986) (article)
- [*An introduction to rrtools*](https://benmarwick.github.io/UO-2018-On-Ramps-to-Reproducibility/UO-2018-On-Ramps-to-Reproducibility.html#1) (presentation)

### Talk by Charles T. Gray

Presentation by [Charles T. Gray](https://twitter.com/cantabile) about [*Open and Reproducible Data Analysis*](http://cantabile.rbind.io/talks/replication/marseille-2018.html#1) (at Young Bayesians and Big Data for Social Good 2018, CIRM, Marseille).

### Article

[*Enhancing reproducibility for computational methods*](http://science.sciencemag.org/content/354/6317/1240.full.pdf) [Sto+16].

---
class: left, middle

# References

rOpenSci (2018). "Code Review in the Lab". Community Call. URL: [https://ropensci.org/commcalls/2018-10-16/](https://ropensci.org/commcalls/2018-10-16/).

Stodden, V, M. McNutt, D. H. Bailey, et al. (2016). "Enhancing reproducibility for computational methods". In: _Science_ 354.6317, pp. 1240-1241. ISSN: 0036-8075.
DOI: [10.1126/science.aah6168](https://doi.org/10.1126/science.aah6168). eprint: http://science.sciencemag.org/content/354/6317/1240.full.pdf.

Wilson, G, J. Bryan, K. Cranston, et al. (2017). "Good enough practices in scientific computing". In: _PLOS Computational Biology_ 13.6, pp. 1-20. DOI: [10.1371/journal.pcbi.1005510](https://doi.org/10.1371/journal.pcbi.1005510).

---
class: inverse, center, middle

.big[<font face="Yanone Kaffeesatz"> Thank you! </font>]

<!---------------->

.left[
Slides: [https://sinarueeger.github.io/20181204-r-lunchs-gva/#1](https://sinarueeger.github.io/20181204-r-lunchs-gva/#1)

Source code: [https://github.com/sinarueeger/20181204-r-lunchs-gva/](https://github.com/sinarueeger/20181204-r-lunchs-gva/)

Twitter: [@sinarueeger](https://twitter.com/sinarueeger)
]