Reproducible R Computing

A framework for good R practice with RStudio and Quarto

Reproducibility can be ensured by adhering to (opinionated) standards for project management and data analysis.
quarto
R
data analysis
reproducibility
Author

G. Bisaccia

Published

Apr 27, 2024

Modified

Aug 31, 2024

This guide is aimed at intermediate R users and was developed with RStudio in mind. Broader coverage of reproducible R workflows, along with further resources, can be found at Prof. Harrell’s website.

1 Install and load ProjectTemplate

1.1 Install ProjectTemplate

ProjectTemplate is an R package allowing for automated organization of R projects. It works by creating an opinionated project structure, pre-loading all packages and data sets used, and pre-processing data where needed.

You can install the package by running

install.packages('ProjectTemplate')

1.2 Load ProjectTemplate

Load the package and create a new minimal project structure:

library('ProjectTemplate')
create.project(project.name = 'projectname',
               template = 'minimal')
Note

create.project() creates the project structure. If no project.name is given, the project is created in a folder named “new-project”. Optionally, you can choose a project template through the template argument: 'full' (the default) and 'minimal' are currently available, and custom templates are also supported.
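To see what create.project() generates without touching a real working directory, the call can be tried in a throwaway location first. The sketch below is only an illustration: the temporary directory and the name 'demo-project' are assumptions made for the demo, not part of the workflow described here.

```r
library('ProjectTemplate')

# Create the project inside a fresh temporary directory (an assumption made
# for this demo; a real project would live somewhere permanent).
demo_root <- tempfile('pt-demo-')
dir.create(demo_root)
old_wd <- setwd(demo_root)

create.project(project.name = 'demo-project', template = 'minimal')

# List the top-level files and folders that were generated.
created <- list.files('demo-project')
print(created)

# Return to the directory we started from.
setwd(old_wd)
```

The printed listing should match the folders described later in this guide (data/, munge/, cache/, src/), plus configuration files.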

1.3 Choose a useful project name

Be mindful of recommendations for naming things.

Note

Generally, you want your project (and any files and folders you’re working with) to be named consistently, so that any file, whether loaded or generated, is immediately recognizable. File names (especially for plots, tables, and exported data sets) should often include a date. Names of R scripts (and Quarto reports) should be self-explanatory.
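One way to keep exported files consistently named is to build the names with a small helper instead of typing them by hand. The function below is a hypothetical sketch in base R (dated_filename() and its naming pattern are assumptions, not part of ProjectTemplate): it puts an ISO date first, so files sort chronologically, followed by a lower-case, hyphen-separated description.

```r
# Hypothetical helper: build "<ISO date>_<description>.<extension>".
dated_filename <- function(description, ext, date = Sys.Date()) {
  # Lower-case the description and replace every run of characters that are
  # not letters or digits with a single hyphen.
  slug <- gsub('[^a-z0-9]+', '-', tolower(description))
  # Drop a hyphen left over at the start or end of the description.
  slug <- gsub('^-|-$', '', slug)
  sprintf('%s_%s.%s', format(as.Date(date), '%Y-%m-%d'), slug, ext)
}

dated_filename('Survival plot (KM)', 'png', date = '2024-04-27')
# "2024-04-27_survival-plot-km.png"
```

Leaving date at its default stamps the file with the current date.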

2 Open the project in RStudio

  1. Open RStudio
  2. Go to File > Open Project…

3 Using your structured project

3.1 Load your project

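With the project folder as your working directory (for example, after opening it as an RStudio project or calling setwd('projectname')), you can load the whole project (data and pre-processing scripts) with:

library('ProjectTemplate')
load.project()
  • Data sets in data/ will be loaded as data frames.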

3.2 Getting to know the project folders

A minimal project structure includes the following folders. Each folder comes with an explanatory README file; they are summarized here in the order in which they are typically used:

  • data/ contains the project’s data sets.

    • Raw data sets should be backed up in multiple locations outside the project.

    • Raw data should never be edited or overwritten, because you can never be sure that an edit preserves the original data faithfully.

  • munge/ includes scripts for pre-processing the data (e.g. adding custom columns, re-assigning variable classes, etc.).

    • Pre-processed data should be saved to the cache/ folder.
  • cache/ includes data sets generated through pre-processing steps.

    • Transformed data sets are stored in this folder.

    • When a data set is available in both cache/ and data/, load.project() will only load the cached version.

  • src/ contains R scripts for data analysis.

    • Each script should start with the same two lines:

      library('ProjectTemplate')
      load.project()
    • Any code that’s shared between different R scripts should be placed in the munge/ directory instead.
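The way data/, munge/ and cache/ fit together can be sketched end to end in a throwaway project. Everything named below is an assumption made for illustration: the 'cache-demo' project, the toy patients.csv file, the age_group column and the 02-age-groups.R script are all hypothetical. The sketch relies on ProjectTemplate's cache() function, which takes the name of a variable as a string and saves that variable to cache/.

```r
library('ProjectTemplate')

# Work in a fresh temporary directory so that nothing real is touched.
demo_root <- tempfile('pt-demo-')
dir.create(demo_root)
old_wd <- setwd(demo_root)
create.project(project.name = 'cache-demo', template = 'minimal')
setwd('cache-demo')

# 1. A raw data set in data/ (hypothetical toy data).
write.csv(data.frame(id = 1:4, age = c(34, 51, 68, 45)),
          file = file.path('data', 'patients.csv'), row.names = FALSE)

# 2. A munge script that derives a column and caches the result.
writeLines(c(
  "patients$age_group <- cut(patients$age, breaks = c(0, 40, 65, Inf),",
  "                          labels = c('<40', '40-64', '65+'), right = FALSE)",
  "cache('patients')"
), con = file.path('munge', '02-age-groups.R'))

# 3. Loading the project reads data/, runs the munge/ scripts and, through
#    the cache() call above, writes the processed data set to cache/.
load.project()
str(patients)
list.files('cache')

# Return to the directory we started from.
setwd(old_wd)
```

On a later load.project() call, the cached version of patients would be loaded instead of data/patients.csv, which is the precedence rule described above. The munge script still runs on each load, so its transformations should be safe to apply more than once.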

References

1. The Stanford Psychology Department Open Science Community. Stanford psychology guide to doing open science. 2020.
2. Wilson G., Bryan J., Cranston K., Kitzes J., Nederbragt L., Teal TK. Good enough practices in scientific computing. PLOS Computational Biology 2017;13(6):e1005510. doi: 10.1371/journal.pcbi.1005510.