Background in Data Analysis & Engineering
PostDoc @ EPFL Analysis of genetic data in infectious diseases at the Fellay Lab.
R-Ladies Lausanne co-organiser
Data analysis & Genetic data & Data visualisation
usethis package
Tidying workflows
Making it easy for other colleagues to rerun (and understand) the project → "repeatability"
Publishing code
Keeping up with new data deliveries, changing data formats, generally, data chaos.
Having an overview of the analysis and its iteration steps → cleaning, modelling, visualisation, reports.
Separating data, processed-data and output-data
Having different places for computation (PC, Server1, Server2).
Using similar code in many different R scripts → redundant code
Extract from presentation by Heidi Seibold @HeidiBaya on Tools for reproducibility in Statistics and Machine Learning
Tidy folders (data, bin, code, but not data1, data2, code_old, B_mod_old.R).
Clear instructions → one file should contain a sort of recipe of the analysis.
Modular code → using functions instead of free floating code.
Minimising redundant computation → caching results.
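The last two points can be illustrated with a minimal base R sketch. The helper names `standardise()` and `cached()`, the cache file name and the use of the built-in `mtcars` data are all assumptions made for illustration; they are not taken from the example repository.

```r
# Hypothetical helper: a function instead of free-floating code.
# Standardises a numeric vector to mean 0 and standard deviation 1.
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Hypothetical caching wrapper: compute `expr` only if `path` does not
# exist yet, otherwise read the stored result from disk.
cached <- function(path, expr) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- expr  # lazy evaluation: `expr` is only evaluated here
  saveRDS(result, path)
  result
}

# Assumed cache location for this illustration.
cache_file <- file.path(tempdir(), "mpg_standardised.rds")

mpg_std <- cached(cache_file, standardise(mtcars$mpg))        # computed and stored
mpg_std_again <- cached(cache_file, standardise(mtcars$mpg))  # read from the cache
```

Note that a cache this simple does not notice when the input data or the code change; the cached file has to be deleted by hand in that case.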
https://github.com/sinarueeger/workflow-example/tree/master/wild-west
wild-west/
├── code
│   ├── A_dataprep.R
│   ├── B_fit.R
│   └── functions.R
├── data
│   ├── genotyping_data_subset_train.bim
│   ├── genotyping_data_subset_train.raw
│   └── training_set_details.txt
├── report
│   └── report.Rmd
└── wild-west.Rproj
wild-west-pro/
├── README.md
├── code
│   ├── A_dataprep.R
│   ├── B_fit.R
│   └── functions.R
├── data
│   ├── genotyping_data_subset_train.bim
│   ├── genotyping_data_subset_train.raw
│   └── training_set_details.txt
├── report
│   └── report.Rmd
└── wild-west-pro.Rproj
The README.md file needs to be updated.
From https://kbroman.org/minimal_make/
Variations of make, e.g. stu.
Problem: what if colleagues don't know make?
drake = Data Frames in R for Make
"general-purpose workflow manager for data-driven tasks"
rOpenSci package → code is reviewed
Created by Will Landau, with contributions by many others.
cd mini/
install.packages("drake")
drake::drake_examples() → lists the available examples.
drake::drake_example("main") → this will download a folder called main.
cd main/
main/
├── COPYRIGHT.md
├── LICENSE.md
├── README.md
├── clean.R
├── make.R
├── raw_data.xlsx
└── report.Rmd
make.R: key components are drake_plan() and make().
Run make(plan).
Visualise the dependency graph:
config <- drake_config(plan)
vis_drake_graph(config)
Use readd() and loadd() to work with the produced output.
cd drake-land/
https://github.com/sinarueeger/workflow-example/tree/master/drake-land
drake-land/
├── data
│   ├── genotyping_data_subset_train.bim
│   ├── genotyping_data_subset_train.raw
│   └── training_set_details.txt
├── drake-land.Rproj
├── functions.R
├── make.R
└── report.Rmd
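To give an idea of what a make.R in such a project could contain, here is a minimal sketch. It is not the make.R from the repository: the targets, the hypothetical helper `fit_model()` and the use of the built-in `mtcars` data instead of the genotyping files are assumptions made so the example is self-contained. It assumes the drake package is installed and writes a `.drake/` cache folder into the working directory.

```r
library(drake)

# Hypothetical helper; in a real project this would live in functions.R
# and be loaded with source("functions.R").
fit_model <- function(data) {
  lm(mpg ~ wt, data = data)
}

# The plan lists the targets (left) and the commands that build them (right).
# drake works out the order from the dependencies between targets.
plan <- drake_plan(
  raw = mtcars,
  clean = raw[raw$cyl != 8, ],
  fit = fit_model(clean),
  coefs = coef(fit)
)

# Build every outdated target; a second call skips targets that are up to date.
make(plan)

# Retrieve results from the cache.
readd(coefs)  # returns the value of the target
loadd(fit)    # loads `fit` into the current environment
```

Changing `fit_model()` and calling `make(plan)` again should rebuild only `fit` and `coefs`, while `raw` and `clean` are taken from the cache.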
What if you have folders, instead of a flat folder structure?
https://github.com/sinarueeger/workflow-example/tree/master/drake-land-adv
drake-land-adv/
├── data
│   ├── genotyping_data_subset_train.bim
│   ├── genotyping_data_subset_train.raw
│   └── training_set_details.txt
├── src
│   ├── functions.R
│   ├── some-other-stuff.R
├── report
│   ├── drake-land-adv.Rproj
│   ├── make.R
│   ├── report.Rmd
Best practices for drake projects.
How drake compares to similar work.
Check out this tutorial by Kirill Müller.
Diversity in the R community
The Forwards task force was set up by the R Foundation in December 2015 to address the underrepresentation of women. It was rebranded in January 2017 to include other under-represented groups in the R community, such as LGBT people, minority ethnic groups, and people with disabilities.
For example:
In 2016, 11.4% of package maintainers were women.
In 2016, the proportion of female attendees at useR! rose from 19% to 28%.
R-Ladies is a global organisation.
Mission: To increase gender diversity in the R community by encouraging, inspiring, and empowering underrepresented minorities.
Founded in 2012 by Gabriela de Queiroz ➡ listen to this interview - it is 👍!
Currently 130 R-Ladies meetup groups in 43 countries.
Find out more about R-Ladies: https://rladies.org/
you can also add yourself!
People that...
Sign up here: bit.ly/rladies-slack
R-Ladies Lausanne : https://www.meetup.com/rladies-lausanne/
library(dplyr)
rladies_global %>% filter(city == "Lausanne")
Thank you!
Slides: https://sinarueeger.github.io/20181004-geneve-rug/slides#1
Source code: https://github.com/sinarueeger/20181004-geneve-rug/