Reprodown: An R package for reproducible data analysis
Replicability and reproducibility
Reproducibility and replicability play important roles in science: they help to confirm new discoveries and to extend our understanding of the natural world. Both are about obtaining consistent results. Reproducibility means obtaining them with essentially the same data, methods and code, while replicability means obtaining them across studies that aim to answer the same scientific question with newly collected data and different methods (National Academies of Sciences, Engineering, and Medicine, 2019).
Concern about reproducibility and replicability has risen sharply in the last decade, because several promising results in medicine and other research areas could not be replicated or reproduced. In particular, Baker (2015) highlighted the difficulty of replicating results in studies that use antibodies (“Y-shaped proteins that bind to specified biomolecules and used to flag their presence in a sample”). The main problems in these applications were that the antibodies were not adequately validated and that standardized information about their quality was not provided.
These concerns are common across research areas. A Nature survey of 1576 researchers (in chemistry, biology, physics and engineering, earth and environment, and other fields) reported that 70% of them had tried and failed to reproduce another researcher’s experiment (Baker, 2016). These researchers thought that “more robust experiment design”, “better statistics” and “better mentorship” could help to improve reproducibility.
Reproducibility can be considered a minimum acceptable standard of research, because replicability cannot always be achieved: it depends on the size of the study, the available budget, time and other factors (Peng, 2015). Reproducibility also helps to validate a data analysis, improve collaboration, and detect errors or bad practices in the analysis.
Reprodown
Reprodown is an R package that helps to improve reproducibility by using:
- Blogdown: An R package that integrates rmarkdown with Hugo to create a website.
- GNU make: A GNU utility that determines which pieces of a program need to be compiled. This is based on a file called Makefile where dependencies are defined.
- scholar-docs: A custom hugo theme for a webpage.
The workflow of reprodown is to write the .Rmd files containing our data analysis inside a sub-folder (e.g. scripts). Then the function reprodown::makefile reads the .Rmd files and automatically creates the Makefile. The outputs are rendered to html files by simply running the utility make on the terminal.
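To make the role of make in this workflow more concrete, the sketch below mimics in base R the rule make applies to each entry of a Makefile: a target is rebuilt when it is missing or older than one of its prerequisites. The helper needs_rebuild is hypothetical and only illustrates the idea; it is not part of reprodown or of make, and it assumes all prerequisites exist.

```r
# Hypothetical helper (not part of reprodown or make): TRUE when `target`
# is missing or older than at least one of its prerequisites.
needs_rebuild <- function(target, prerequisites) {
  if (!file.exists(target)) {
    return(TRUE)
  }
  any(file.mtime(prerequisites) > file.mtime(target))
}

# Temporary files stand in for an input data file and a rendered output.
input <- tempfile(fileext = ".rds")
output <- tempfile(fileext = ".html")
saveRDS(1:3, input)

before <- needs_rebuild(output, input)  # the output does not exist yet

file.create(output)
now <- Sys.time()
Sys.setFileTime(input, now - 60)  # input last modified a minute ago
Sys.setFileTime(output, now)      # output built afterwards
fresh <- needs_rebuild(output, input)

Sys.setFileTime(input, now + 60)  # input edited after the build
stale <- needs_rebuild(output, input)

print(c(before = before, fresh = fresh, stale = stale))
```

Only the missing and the stale cases call for a rebuild, which is why running make after a small change re-renders only the affected files.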
Reprodown example
An example of a website built with reprodown can be found at https://erickchacon.gitlab.io/project-web. You can explore the source code at https://gitlab.com/ErickChacon/project-web.
Reprodown tutorial
Requirements
We need to install the R packages blogdown and reprodown. We need my custom fork of blogdown because I made a pull request to add functionality to the function blogdown:::build_rmds. Hopefully, this will be accepted in the future.
remotes::install_github("ErickChacon/blogdown")
remotes::install_github("ErickChacon/reprodown")
In addition, we also need the GNU make utility, which ships with or is readily installable on any GNU/Linux distribution.
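Before going further, it may help to confirm that everything is in place. The following is a minimal sketch, assuming only the two package names and the make utility mentioned above; check_requirements is a hypothetical helper, not a function from reprodown.

```r
# Hypothetical helper: report which of the required packages and
# command-line tools can be found on this machine.
check_requirements <- function(packages = c("blogdown", "reprodown"),
                               tools = "make") {
  # requireNamespace() returns TRUE when the package can be loaded.
  pkg_ok <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
  # Sys.which() returns an empty string when a tool is not on the PATH.
  tool_ok <- nzchar(Sys.which(tools))
  names(tool_ok) <- tools
  c(pkg_ok, tool_ok)
}

status <- check_requirements()
print(status)
```

Any entry reported as FALSE points to a package or tool that still has to be installed.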
Getting started
- Create the structure of the project (optional): The function reprodown::create_proj() can be used to create the folders of our project. By default, this creates the folders data, docs, scripts and src. However, you can provide the argument yaml_file with a path to a yaml file with a custom structure.
- Create the website-related files: The function blogdown::new_site can be used to create these files. The theme of the website is also downloaded by this function. You can use your custom theme or another hugo theme from https://themes.gohugo.io/. You do not need to work directly with most of these files; blogdown will take care of them. The file docs/config.toml defines the metadata of your website; check it and modify it accordingly.
reprodown::create_proj()
blogdown::new_site('docs', theme = 'ErickChacon/scholar-docs', sample = FALSE)
Create your custom scripts
Create your .Rmd files inside the scripts folder. Take into consideration the following:
- The file _index.Rmd inside the scripts folder controls the homepage. You can define the title in a yaml header. See for example the scripts/_index.Rmd file of the web https://erickchacon.gitlab.io/project-web.
- The dependencies of each file are defined in the yaml header of the .Rmd files. For example, the yaml header below, from the file scripts/30-process/process.Rmd, indicates that this file needs the file data/cleaned/data.rds as input and produces the file data/processed/data.rds as output.
---
title: "Transform covariate"
prerequisites:
- data/cleaned/data.rds
targets:
- data/processed/data.rds
---
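To illustrate how a header like this can be turned into a dependency for make, the sketch below reads the prerequisites and targets fields with base R and assembles a dependency line of the form "targets: prerequisites". This is an assumption-laden illustration, not the code or the exact output of reprodown::makefile: header_items is a hypothetical helper that only understands this simple header shape, and in practice rmarkdown::yaml_front_matter() parses headers far more robustly.

```r
# Header of scripts/30-process/process.Rmd, as shown above.
rmd_path <- "scripts/30-process/process.Rmd"
rmd_lines <- c(
  "---",
  "title: \"Transform covariate\"",
  "prerequisites:",
  "  - data/cleaned/data.rds",
  "targets:",
  "  - data/processed/data.rds",
  "---"
)

# Hypothetical helper: collect the "- item" entries listed under one field.
header_items <- function(lines, field) {
  # Keep only the lines between the first two '---' delimiters.
  delims <- which(trimws(lines) == "---")
  header <- lines[(delims[1] + 1):(delims[2] - 1)]
  start <- which(grepl(paste0("^", field, ":"), header))
  if (length(start) == 0) {
    return(character(0))
  }
  items <- character(0)
  i <- start[1] + 1
  while (i <= length(header) && grepl("^\\s*-\\s+", header[i])) {
    items <- c(items, sub("^\\s*-\\s+", "", header[i]))
    i <- i + 1
  }
  items
}

prerequisites <- header_items(rmd_lines, "prerequisites")
targets <- header_items(rmd_lines, "targets")

# The .Rmd file itself is treated as a prerequisite too (an assumption),
# so that editing the script also marks its targets as out of date.
rule <- paste0(
  paste(targets, collapse = " "), ": ",
  paste(c(rmd_path, prerequisites), collapse = " ")
)
writeLines(rule)
```

A line of this kind is what lets make conclude that data/processed/data.rds must be regenerated whenever data/cleaned/data.rds changes.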
Render the web
The web can be rendered by:
- Creating the Makefile with the following R code: reprodown::makefile().
- Rendering the .Rmd files with the R code system("make") or by running make on your terminal.
- Serving the site using setwd("docs"); blogdown::serve_site(); setwd("..")
- Stopping the server with servr::daemon_stop().
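When the rendering step is run from R, a failed build is easy to miss because system() only returns the exit status of the command. The sketch below wraps that call so that a non-zero status stops the script. run_make is a hypothetical helper, not part of reprodown, and its runner argument exists only so the behaviour can be tried without invoking make.

```r
# Hypothetical wrapper: run make and stop with an error if it fails.
# `runner` defaults to base R's system(), which returns the exit status
# of the command (0 means success).
run_make <- function(runner = system) {
  status <- runner("make")
  if (status != 0) {
    stop("make exited with status ", status, call. = FALSE)
  }
  invisible(status)
}

# Typical use from the project root, once the Makefile has been created:
# run_make()
```

Stopping early like this avoids serving a site whose pages were only partially rendered.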
Publish and update automatically your website
You can host your project in a remote repository to make your website available. An easy way is to host it on gitlab. Use reprodown::create_gitlab_ci() to create the file .gitlab-ci.yml that defines the workflow to create the website.
reprodown::create_gitlab_ci()
In addition, I suggest not pushing the content of the data folder, to avoid publishing confidential data or having issues with big files. The same applies to the docs/public folder, given that it will be automatically created by the gitlab workflow. This can be done by creating a .gitignore file with the following content:
# Public content
/docs/public
# Data sub-folders content
/data/raw/*
!/data/raw/.gitkeep
/data/modelled/*
!/data/modelled/.gitkeep
/data/processed/*
!/data/processed/.gitkeep
/data/cleaned/*
!/data/cleaned/.gitkeep
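The negated patterns above keep a .gitkeep placeholder in each data sub-folder, since git does not track empty directories. If those placeholders are not already present in your project, they could be created with a few lines of base R. This is a minimal sketch: add_gitkeep is a hypothetical helper, and the sub-folder names are simply the ones that appear in the .gitignore.

```r
# Hypothetical helper: make sure each data sub-folder exists and contains
# an empty .gitkeep file, so git keeps the folder without its content.
add_gitkeep <- function(root = ".",
                        subfolders = c("raw", "cleaned", "processed", "modelled")) {
  dirs <- file.path(root, "data", subfolders)
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  keep <- file.path(dirs, ".gitkeep")
  # Only create the placeholders that are missing.
  file.create(keep[!file.exists(keep)])
  invisible(keep)
}

# Try it in a temporary directory rather than in a real project.
demo_root <- file.path(tempdir(), "reprodown-demo")
created <- add_gitkeep(demo_root)
print(basename(dirname(created)))
```

In a real project the call would be add_gitkeep() from the project root, before the first commit.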
Push all your folder content to a gitlab repository and you will have a website that is automatically updated each time you push a commit.
References:
- Baker, M. (2016). A Nature Survey Lifts the Lid on How Researchers View the ‘Crisis’ Rocking Science and What They Think Will Help. Nature, 3.
- Baker, M. (2015). Blame it on the antibodies. Nature, 521(7552), 274.
- National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. National Academies Press.
- Peng, R. (2015). The reproducibility crisis in science: A statistical counterattack. Significance, 12(3), 30-32.