Tidying workflows
R community

Download example scripts here: http://bit.ly/2O15oom

Sina Rüeger

2018-10-04 (updated: 2018-10-05)

About me

  • Background in Data Analysis & Engineering

  • PostDoc @ EPFL: analysis of genetic data in infectious diseases at the Fellay Lab.

  • R-Ladies Lausanne co-organiser

  • Data analysis & Genetic data & Data visualisation

  • usethis package

My everyday work

Adapted from Figure in R4DS book

Tidying workflows

The challenges

  • Making it easy for other colleagues to rerun (and understand) the project → "repeatability"

  • Publishing code

    • making it easy for others to rerun and to understand the project → "reproducibility"
    • making it easy for others to rerun the code with different data → "replicability"
  • Keeping up with new data deliveries, changing data formats, generally, data chaos.

  • Having an overview of the analysis and its iteration steps → cleaning, modelling, visualisation, reports.

  • Separating data, processed-data and output-data

  • Having different places for computation (PC, Server1, Server2).

  • Using similar code in many different R scripts → redundant code

There is no magic solution

Extract from presentation by Heidi Seibold @HeidiBaya on Tools for reproducibility in Statistics and Machine Learning

What we need

  • Tidy folders

    • clear folder structure, e.g. data, bin, code, but not data1, data2, code_old
    • only files with "purposes" (no B_mod_old.R)
  • Clear instructions → one file should contain a sort of recipe of the analysis.

  • Modular code → using functions instead of free floating code.

  • Minimising redundant computation → caching results.
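
As a rough sketch of the last two points, using base R only: the analysis step lives in a function rather than in free-floating code, and a small helper stores the result on disk so that a rerun reads the stored copy instead of recomputing it. The names `cached()` and `slow_mean()` and the `processed-data` folder are made up for this illustration and are not taken from the example scripts.

```r
# Hypothetical helper: return the result stored at `path` if it exists,
# otherwise call `compute()`, store its result at `path`, and return it.
cached <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  saveRDS(result, path)
  result
}

# Modular code: the step is a function. The counter only exists to show
# how often the computation actually runs.
counter <- new.env()
counter$n <- 0
slow_mean <- function(x) {
  counter$n <- counter$n + 1
  mean(x)
}

cache_file <- file.path(tempdir(), "processed-data", "mean.rds")
unlink(cache_file)  # start from a clean state

first  <- cached(cache_file, function() slow_mean(1:10))  # computes and stores
second <- cached(cache_file, function() slow_mean(1:10))  # reads the stored copy
```

This helper is deliberately naive: it does not notice when the input data or the function change, so the cached file has to be deleted by hand in that case.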

R package folder structure

Figure from http://r-pkgs.had.co.nz/package.html.

What are the options?

The default, a.k.a. the wild west

https://github.com/sinarueeger/workflow-example/tree/master/wild-west

Folder structure

wild-west/
├── code
│ ├── A_dataprep.R
│ ├── B_fit.R
│ └── functions.R
├── data
│ ├── genotyping_data_subset_train.bim
│ ├── genotyping_data_subset_train.raw
│ └── training_set_details.txt
├── report
│ └── report.Rmd
└── wild-west.Rproj

Wild west "pro"

Folder structure

wild-west-pro/
├── README.md
├── code
│ ├── A_dataprep.R
│ ├── B_fit.R
│ └── functions.R
├── data
│ ├── genotyping_data_subset_train.bim
│ ├── genotyping_data_subset_train.raw
│ └── training_set_details.txt
├── report
│ └── report.Rmd
└── wild-west-pro.Rproj

+ README file

  • Problem: the README.md file has to be kept up to date by hand.

make

From https://kbroman.org/minimal_make/

  • Variations of make, e.g. stu.

  • Problem: what if colleagues don't know make?
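
For colleagues who do not know make, its core idea can be sketched in a few lines of base R: a target is rebuilt when it is missing or older than any file it depends on. `needs_rebuild()` is a hypothetical helper written for this illustration and the file names are placeholders. Real make offers much more, such as rules, pattern matching and parallel builds.

```r
# TRUE when `target` is missing or older than at least one dependency.
needs_rebuild <- function(target, deps) {
  if (!file.exists(target)) {
    return(TRUE)
  }
  any(file.mtime(deps) > file.mtime(target))
}

# Throwaway folder with one script (dependency) and one output (target)
demo_dir <- tempfile("make-demo-")
dir.create(demo_dir)
script <- file.path(demo_dir, "A_dataprep.R")
output <- file.path(demo_dir, "prepared.rds")

writeLines("# data preparation", script)
before_build <- needs_rebuild(output, script)  # TRUE: output does not exist yet

saveRDS(1:3, output)
Sys.setFileTime(script, Sys.time() - 60)       # pretend the script is older
after_build <- needs_rebuild(output, script)   # FALSE: output is up to date

Sys.setFileTime(script, Sys.time() + 60)       # pretend the script was edited
after_edit <- needs_rebuild(output, script)    # TRUE: script is newer than output
```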

Drake

  • drake = Data Frames in R for Make

  • "general-purpose workflow manager for data-driven tasks"

    • borrows some features from make
    • caching of runs (future runs only start from the part where something has changed)
    • scalable (parallel computing)
    • supports easy maintenance of data analysis projects
  • rOpenSci package → code is reviewed

  • Created by Will Landau, with contributions by many others.

cd mini/

Mini example to get familiar with drake (part 1)

  1. install.packages("drake")
  2. Check out the different examples with drake::drake_examples().
  3. Run drake::drake_example("main") → this will download a folder called main.
  4. cd main/
main/
├── COPYRIGHT.md
├── LICENSE.md
├── README.md
├── clean.R
├── make.R
├── raw_data.xlsx
└── report.Rmd

Mini example to get familiar with drake (part 2)

  1. Open make.R: key components are drake_plan() and make().
  2. Add the following bit before and after make(plan).
    config <- drake_config(plan)
    vis_drake_graph(config)
  3. Run all code for a first time.
  4. Change something (e.g. the plot function).
  5. Rerun and watch the colors change in vis_drake_graph(config).
  6. Use functions readd() and loadd() to work with the produced output.
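
The same steps can be tried without the downloaded example. This is a minimal sketch that assumes the drake package is installed. The plan, the function `double_it()` and the temporary cache location are made up for illustration and do not come from the `main` example. Only `drake_plan()`, `make()`, `new_cache()` and `readd()` are used.

```r
if (requireNamespace("drake", quietly = TRUE)) {
  # Hypothetical analysis step
  double_it <- function(x) 2 * x

  # The plan is the recipe: one row per target with the command that builds it
  plan <- drake::drake_plan(
    raw = 1:10,
    doubled = double_it(raw),
    total = sum(doubled)
  )

  # Keep the cache in a temporary folder instead of the default .drake/
  cache <- drake::new_cache(tempfile("drake-cache-"))

  drake::make(plan, cache = cache)  # first run: builds all three targets
  drake::make(plan, cache = cache)  # second run: nothing changed, nothing is rebuilt

  # Fetch a single target from the cache
  total_value <- drake::readd(total, cache = cache)
}
```

Editing `double_it()` and calling `make()` again should rebuild only `doubled` and `total`, which is the behaviour steps 4 and 5 ask you to observe.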

cd drake-land/

Example with our data

https://github.com/sinarueeger/workflow-example/tree/master/drake-land

Folder structure

drake-land/
├── data
│ ├── genotyping_data_subset_train.bim
│ ├── genotyping_data_subset_train.raw
│ └── training_set_details.txt
├── drake-land.Rproj
├── functions.R
├── make.R
└── report.Rmd

More complex example

What if you have folders, instead of a flat folder structure?

https://github.com/sinarueeger/workflow-example/tree/master/drake-land-adv

drake-land-adv/
├── data
│ ├── genotyping_data_subset_train.bim
│ ├── genotyping_data_subset_train.raw
│ └── training_set_details.txt
├── src
│ ├── functions.R
│ └── some-other-stuff.R
└── report
  ├── drake-land-adv.Rproj
  ├── make.R
  └── report.Rmd
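
One possible answer, sketched in base R: because `make.R` sits in `report/`, it can load the functions through a path relative to that folder. The sketch builds a throwaway copy of the layout in a temporary directory. The function `add_one()` and the folder name `drake-land-adv-demo` are placeholders, and the linked repository may solve this differently.

```r
# Recreate the nested layout in a temporary folder (placeholder content)
project <- file.path(tempdir(), "drake-land-adv-demo")
dir.create(file.path(project, "src"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "report"), recursive = TRUE, showWarnings = FALSE)
writeLines("add_one <- function(x) x + 1",
           file.path(project, "src", "functions.R"))

# What make.R could do: work from report/ and reach src/ via a relative path
old_wd <- setwd(file.path(project, "report"))
source(file.path("..", "src", "functions.R"))
setwd(old_wd)  # restore the previous working directory

result <- add_one(1)
```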

.drake/

Resources

Diversity in the R community

Diversity initiatives in R

R Forwards

The task force was set up by the R Foundation in December 2015 to address the underrepresentation of women. It was rebranded in January 2017 to include other underrepresented groups in the R community, such as LGBT people, minority ethnic groups, and people with disabilities.

Activities

For example:

In 2016, 11.4% of package maintainers were women.

2016 saw a rise in the proportion of female attendees from 19% to 28% at useR.

R-Ladies

  • Global organisation.

  • Mission: To increase gender diversity in the R community by encouraging, inspiring, and empowering underrepresented minorities.

  • Founded in 2012 by Gabriela de Queiroz ➡ listen to this interview - it is 👍!

  • Currently 130 R-Ladies meetup groups in 43 countries.

  • Find out more about R-Ladies: https://rladies.org/

Growth of R-Ladies

Source code by Daniela Vázquez.

meetupr package: https://github.com/rladies/meetupr

Find speakers in the R-Ladies directory: https://rladies.org/directory/

You can also add yourself!

@WeAreRLadies: The R-Ladies RoCur

Community Slack

  • have a safe and global space
  • to discuss within public channels #rstats news, packages, community ideas
  • include R-Ladies members around the world

Who can sign up?

People who...

  • identify as a woman or gender minority
  • have read and agreed to the CoC

Sign up here: bit.ly/rladies-slack

Join us in Lausanne!

R-Ladies Lausanne : https://www.meetup.com/rladies-lausanne/

library(dplyr)
rladies_global %>%
  filter(city == "Lausanne")
