Reproducible Data Analysis Workflow in R

Reproducible research is the practice of conducting research in a way that others can easily understand and replicate your results. It is becoming increasingly important in data analysis, as it allows for the verification of results and fosters trust and transparency. However, achieving reproducibility in data analysis can be a complex task due to differences in computational environments, the use of various data sources, and other factors. In this post, we will explore a set of tools in R that can help us overcome these challenges and conduct reproducible research with ease.

I will attempt to cover the basics of project organization and setting up a reliable directory structure with usethis, dependency management with renv, setting up an automatic workflow/pipeline using targets, creating some custom functions to manage constants, and creating a reproducible document using Quarto. By the end, we'll have our code neatly packaged, ready to be handed off to someone else, either as a reproducible R script or as a packaged Docker container.

Reading guide:

In this post, I will follow a hands-on approach, with plenty of code snippets and step-by-step instructions. At the same time, I will briefly discuss the reasoning and benefits behind each step.

It can be difficult to keep track of whether to run the code inside the console or to paste/write it inside a file. To minimise confusion, the first line of every code snippet contains a comment indicating where to run the code.

If code is to be run in the console:

# Console #

print("Reproducible data analysis workflow")

If code is to be written or pasted inside a file:

# R/function_1.R #

test <- function() {
  print("Test")
}        

Furthermore, to enhance reproducibility and clarity in our code, we'll be utilising namespaces when calling functions. When we specify a function along with its namespace, like usethis::create_project(), we're ensuring that R uses the create_project() function from the usethis package specifically. This approach offers several advantages in relation to reproducibility: 1) it makes the package source of each function explicit, 2) it avoids potential conflicts between functions of the same name, and 3) it saves some memory, as the entire package does not need to be attached to the project environment.
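
As a small illustration, both the base stats package and dplyr export a function called filter(). Relying on whichever package happens to be attached can silently change which function runs; the namespace removes that ambiguity. A minimal sketch (assuming dplyr is installed):

# Console #

# Ambiguous: which filter() runs depends on which packages are attached
# filter(mtcars, cyl == 6)

# Explicit: always the dplyr version
dplyr::filter(mtcars, cyl == 6)

# Explicit: always the stats version (a moving-average filter)
stats::filter(1:10, rep(1/3, 3))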


Sections

  1. Reproducible Data Analysis using R
  2. Overview of the Data Analysis Workflow
  3. Project Setup and Organization with usethis
  4. Dependency Management with renv
  5. Create custom metadata system
  6. Define CONSTANTS
  7. Documenting your code
  8. Get some data
  9. Workflow Automation with targets
  10. Create reproducible documents with Quarto
  11. Packaging the Data Analysis Workflow


1. Reproducible Data Analysis using R

R is extremely powerful when it comes to conducting reproducible data analysis: the R community, the RStudio team (now Posit), and countless open-source contributors have developed indispensable tools for reproducibility. In this article, I'll be using the packages fs, usethis, devtools, rlang, renv, here, and targets, along with Quarto, to produce a reproducible data analysis workflow. Lastly, we will package the entire project using Docker.

Below, I briefly describe the packages we'll use and how they enable reproducibility:

Work with files and folders using fs

The challenge: File and folder structures differ between users, operating systems (macOS vs Windows), and storage locations, leading to inconsistencies and confusion.
Our solution: The fs package uses a unified file path language, ensuring consistency across users and systems.

Define file paths with here

The challenge: By default, R interprets file paths relative to a working directory. This approach can be fragile, as the working directory can change, sometimes without our knowledge.
Our solution: The here package allows us to define file paths relative to our R project, ensuring a fixed reference point regardless of the user or script running.
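
To make this concrete, here is a minimal sketch of how the two packages are typically combined (the "data" folder and file name are purely illustrative):

# Console #

# Build an OS-independent path relative to the project root
path <- here::here("data", "example.csv")

# Create the folder and an empty file (neither call errors if the path already exists)
fs::dir_create(here::here("data"))
fs::file_create(path)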

Automate common operations with usethis

The challenge: Setting up a new project involves various manual steps to prepare the project environment. The way these steps are performed can differ amongst individuals and may also change over time.
Our solution: The usethis package provides utility functions that allow us to automate many routine tasks related to project setup and the data analysis workflow.

Ensure we are working with an identical environment using renv

The challenge: Differences in project environments (R version, package versions, etc.) amongst users can lead to discrepancies in results or even code execution failures.
Our solution: renv allows us to create virtual environments that can be packaged with the project. Inside this environment we can define package versions and the R version, thereby ensuring the consistency of the environment across different users.

Automatically produce code and output using targets

The challenge: Imagine that you have completed your data analysis, but are suddenly required to alter parts of your code. Updating a part of your data analysis without affecting the rest can be tricky, especially in large projects.
Our solution: targets is a workflow automation tool that keeps track of dependencies and workflow steps. It allows us to change parts of the code, and it will figure out what needs to be rerun and what can be kept.

More...

In addition to the above-mentioned packages, we will also utilise:

  • rlang for working with data masking
  • magrittr for the pipe
  • ggplot2 to create boxplots
  • readr to read and write data
  • tibble to create tidy data frames
  • devtools to load internal data
  • purrr to create “functional” loops
  • cli to create pretty console outputs
  • glue to write string literals

And the tools:

  • Quarto to produce a reproducible document
  • Docker to package the data analysis workflow


2. Overview of the Data Analysis Workflow

It's important to note that there is no definitive way to structure a data analysis workflow. The following workflow is based on my current preferences and experiences, and may evolve over time.

This process utilises several tools to enhance reproducibility and efficiency in data analysis:

Project Setup and Organization with usethis

Create the foundation for our project by organising its structure and defining its settings.

Steps include:

  1. Install workflow packages
  2. Initiate R project
  3. Set up files and folders
  4. Define project specific settings and packages

Dependency Management with renv

Manage the specific R packages required for our data analysis workflow, ensuring consistency across different environments.

Steps include:

  1. Initialising the Virtual Environment
  2. Setup renv with a DESCRIPTION file
  3. Installing Packages Inside the Virtual Environment

Create custom metadata system

Creation of a metadata system to keep track of various metadata we wish to define.

Steps include:

  1. Create function that creates a metadata file
  2. Create function to add and remove metadata
  3. Create function to extract metadata

Define CONSTANTS

This step involves creating a function to write internal data that can act as constants (key-value pairs).

Documenting our code

Creating comprehensive and helpful documentation for the code and functions.

Get some data

Downloading some data to work with.

Workflow automation with targets

Automate our data analysis workflow using targets.

Steps include:

  1. Setup targets
  2. Customize the _targets.R script
  3. Define the workflow
  4. Run the workflow

Create reproducible documents with Quarto

Use Quarto in conjunction with targets to create reproducible documents.

Packaging the Data Analysis Workflow

Package the entire workflow in a portable format using Docker or an R script, allowing for easy distribution and replication.


3. Project Setup and Organization with usethis

When it comes to project organisation and setup, the usethis package is indispensable. We can use the package to define other packages needed, initialise our project, create new folders and files, and establish project-wide settings that enhance reproducibility.

Step 1: Install workflow packages

To create our data analysis workflow, we need certain workflow packages. Workflow packages help us manage and keep track of our project, simplifying the workflow creation process. Although these packages aren't required for those who wish to view or run our code later, they are instrumental in our initial setup.

Let's install the workflow packages usethis, devtools, and rlang.

# Console #

install.packages("usethis")
install.packages("devtools")
install.packages("rlang")        

Step 2: Initiate R project

Before proceeding, we need to create a new project to contain our data analysis workflow. You might be accustomed to using the RStudio file prompt for this, but as we want to ensure complete reproducibility, we instead initialise the project using code. We can use usethis::create_project() for this.

Write the following inside the console.

# Console #

usethis::create_project("project/folder")        

Replace "project/folder" with your desired project location.

Step 3: Set up files and folders

Create Directories

A crucial but often overlooked aspect of data analysis is the organization of files and folders. While there's no universally agreed-upon structure, there are some general standards:

  • Raw data should be saved in its own folder ("data-raw/").
  • Processed data should be stored in its own folder ("data/").
  • R code should be stored within its own folder ("R/").
  • Documents and various images should be stored in a document folder ("doc/").

In this workflow, we'll create these directories as follows:

# Console #

usethis::use_directory("doc")
usethis::use_directory("data")
usethis::use_directory("data-raw")
usethis::use_directory("R")        

Create description file

Furthermore, when doing reproducible data analysis, it is important to have a way to document various aspects of your code. In R, a common way to do this is a so-called DESCRIPTION file. This file can contain information about your workflow, the packages used, and other metadata. To create a DESCRIPTION file, run the following:

# Console #

usethis::use_description(check_name = FALSE)        

In addition to the DESCRIPTION file, it is often beneficial to include a README.md file that provides an overview of the project and its organisation. We are not going to create a README file in this example.

Step 4: Define project specific settings and packages

Blank slate

Let's start by setting our project to always open with a blank slate. This ensures that we aren't relying on objects or code only saved in the environment. When prompted, choose "yes/absolutely/sure".

# Console #

usethis::use_blank_slate("project")        

Define packages

An important aspect of any data analysis workflow is to define what R packages you are using. In this workflow, we define the packages in the DESCRIPTION file. Let's add some of the packages we'll be using:

# Console #

usethis::use_package("fs")
usethis::use_package("here")
usethis::use_package("cli")
usethis::use_package("magrittr")        

You can add packages to the description file at any time.

The magrittr pipe

Throughout this tutorial, we'll be using the magrittr pipe (%>%) extensively. Instead of manually loading the magrittr package, we can add it to our project by running:

# Console #

usethis::use_pipe()
devtools::document()        

We've now laid the groundwork for our R project and established the backbone of our R environment. Our project is set to always open with a blank slate, and we've documented our project and its dependencies in the DESCRIPTION file. This setup promotes reproducibility and makes it easier for others to understand and use our code.


4. Dependency Management with renv

renv is a powerful tool for creating virtual environments in R. By encapsulating our workflow inside a virtual environment, we can easily share it with others, ensuring full reproducibility. Although the concept of virtual environments may seem complex, the renv package makes it quite accessible.

Step 1: Initialising the Virtual Environment

When we initialise a virtual environment with renv, several things happen:

  1. A new, isolated library is created for the project, ensuring that only the packages required for this specific project are included (located inside the “renv/” folder).
  2. Creation of an renv.lock file. This file is a snapshot of the current state of the library, capturing the exact versions of each package.

Note: At the time of writing, there is a bug in renv's automatic snapshot feature, requiring us to manually create snapshots. This post will be updated when that changes.

Before initialising, let's add renv to our DESCRIPTION file and install it:

# Console #

usethis::use_package("renv")        

With renv installed, we can now initialise our virtual environment:

# Console #

renv::init(bare = TRUE)
renv::snapshot() # Is usually performed automatically        

We're using the bare = TRUE argument so that renv references the DESCRIPTION file for required packages, rather than looking for library calls.

Step 2: Setup renv with a DESCRIPTION file

Given that we are going to utilize renv together with our DESCRIPTION file, it's necessary to configure some virtual environment settings. Specifically, we want renv to look in the DESCRIPTION file for packages, instead of its default behaviour of scanning our code for package calls.

The initialisation process automatically places us inside the virtual environment. To access the packages we installed earlier, we need to deactivate the environment temporarily:

# Console #

renv::deactivate()        

To configure renv to reference the DESCRIPTION file when looking for packages, we need to modify the .Rprofile file. While doing this, it's a good idea to add devtools::load_all() to the profile, as this will ensure that functions we create later will automatically be loaded when opening the project. We can also add an option that will automatically update our renv.lock file when adding packages to the DESCRIPTION file.

Let's open the .Rprofile file.

# Console #

usethis::edit_r_profile("project")        

Then, copy the subsequent code into the .Rprofile. Remember to save the file.

# .Rprofile #

devtools::load_all()
options(
  renv.settings.snapshot.type = "explicit",
  renv.config.auto.snapshot = TRUE
)        

Note: In certain instances, you may need to manually restart RStudio for the changes in the virtual environment to take effect properly.

Before we proceed, let's reactivate our virtual environment:

# Console #

renv::activate()        

Step 3: Installing Packages Inside the Virtual Environment

As we have now moved inside our isolated virtual environment, we will not have access to the packages we installed earlier, and need to reinstall them inside the isolated environment. This can be done by running (this can take some time):

# Console #

renv::install()
renv::snapshot() # Is usually performed automatically        

We now have a fully functional virtual environment that can be shared along with our code, allowing others to use the exact same setup we have, thereby enhancing reproducibility. Usually, if we wish to restore from a lock file, it will be sufficient to run renv::restore().
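
For example, a collaborator who receives the project (including the renv.lock file) can recreate the same library with a single call:

# Console #

renv::restore()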


5. Create custom metadata system

Navigating data and variables can be complex, often requiring various adjustments. Thus, a system is needed that can dynamically alter data or change variable names. In this example, we want a system that can:

  1. Define the data points we are using
  2. Store metadata about various data points
  3. Refer to data in the text that can be dynamically altered.

Since there aren't any existing packages (to my knowledge) that facilitate this workflow, we will create dedicated functions to implement this functionality.

Step 1: Create function that creates a metadata file

There are several ways we could store the metadata. Here, I am going to store it as a .csv file. We are going to use the readr package to save and load the metadata.

Let's add readr to the DESCRIPTION file (when adding packages like this, you will be prompted to install the package - select yes).

# Console #

usethis::use_package("readr")        

We will start by creating a new R file to contain our function. In the console, write:

# Console #

usethis::use_r("create_metadata_file")        

This command will create a file named create_metadata_file.R and place it inside the R/ folder. This is the preferred folder for storing functions.

Now, let's create a function that will generate a metadata file. Copy and paste the following into the newly created file:

# R/create_metadata_file.R #

#' Creates metadata file that can be used to store various metadata inside.
#'
#' @param ... Additional parameters. Supports: Column names.
#' @param path str | Path to store metadata
#' @param overwrite bool | Overwrite old metadata (default = FALSE).
#' @param backup bool | Create backup of old metadata (default = TRUE)
#'
#' @return NA
#' @export
#'
#' @examples
#' # Create metadata
#' create_metadata("constant",
#'                 "column_name",
#'                 "display_name",
#'                 "col_type")
create_metadata <- function(...,
                            path = here::here("metadata.csv"),
                            overwrite = FALSE,
                            backup = TRUE) {

    # Collect dynamic dots (...)
    dots <- rlang::list2(...)

    # Create backup
    if(backup == TRUE & fs::file_exists(path)){
        fs::file_copy(path,
                      new_path = paste0(fs::path_dir(path),
                                        "/backup_",
                                        Sys.Date(),
                                        "_",
                                        fs::path_file(path)),
                      overwrite = TRUE)
    }

    # Check if file exists
    if(fs::file_exists(path) & overwrite == FALSE){
        stop("File already exists. Overwrite with 'overwrite = TRUE'.")
    }

    # Columns
    columns <- append(c("constant", "type", "column_name"), dots) %>%
        unique()

    # Create metadata
    metadata <- tibble::as_tibble(matrix(nrow = 0,
                                         ncol = length(columns),
                                         dimnames = list(NULL, columns)))

    # Write metadata
    readr::write_csv(metadata,
                     file = path)

    # Give status
    cli::cli_alert_success("Created metadata file with the following parameters:")
    cli::cli_ul(columns)
}        

Now, let's try creating a metadata file. If we don't provide any specific arguments, the file will be named "metadata.csv" and be located in the project root directory.

# Console #

devtools::load_all()
create_metadata("constant",
                "column_name",
                "display_name",
                "col_type")        

Note: When creating new functions, we always need to run devtools::load_all() to load the functions into our environment.

Step 2: Create function to add and remove metadata

With the creation of the metadata file, we need a way to add entries/metadata to the file. For this, we will use the dplyr, tibble, and glue packages. Let's begin by adding those to the DESCRIPTION file:

# Console #

usethis::use_package("dplyr")
usethis::use_package("tibble")
usethis::use_package("glue")        

Now, let's create a new R file for our function.

# Console #

usethis::use_r("add_to_metadata")        

Now, copy the following inside the newly created “R/add_to_metadata.R”.

# R/add_to_metadata.R #

#' Adds data field to metadata file.
#'
#' @param ... Additional parameters. Supports: Column names.
#' @param path str | CSV file containing metadata (default: "metadata.csv").
#'
#' @return NA
#' @export
#'
#' @examples
#' # Create metadata
#' create_metadata("constant",
#'                 "column_name",
#'                 "display_name",
#'                 "col_type")
#'
#' # Add metadata
#' add_metadata(constant = BMI,
#'              column_name = BMI,
#'              display_name = "Body mass index (BMI)",
#'              col_type = "dbl")
add_metadata <- function(..., path = here::here("metadata.csv")) {

    # Collect dynamic dots (...)
    dots <- rlang::list2(...)

    # Extract data
    columns <- tibble::as_tibble(dots)

    # Read metadata
    metadata <- readr::read_csv(path,
                                 col_types = readr::cols(.default = "c"))

    # Extract constants
    constants <- metadata %>%
        dplyr::select(constant) %>%
        dplyr::pull()

    # Check if constant exists
    if(columns$constant %in% constants){
        stop(glue::glue("Constant {columns$constant} already present in data file.\n Delete row with 'rm_metadata()' before attempting again."))
    }

    # Append data
    metadata %>%
        dplyr::rows_append(columns) %>%
        readr::write_csv(file = path,
                         na = "NA")

    # Give status
    cli::cli_alert_success("Added the following parameters to the metadata:")
    cli::cli_ul(columns)
}        

It's also useful to have a method to remove metadata. First, create a new .R file:

# Console #

usethis::use_r("remove_metadata")        

Copy the following inside the newly created “R/remove_metadata.R”.

# R/remove_metadata.R #

#' Removes metadata from metadata file
#'
#' @param remove str | Constant you want removed.
#' @param path str | Path to metadata (default: 'metadata.csv').
#'
#' @return NA
#' @export
#'
#' @examples
#' # Create metadata
#' create_metadata("constant",
#'                 "column_name",
#'                 "display_name",
#'                 "col_type")
#'
#' # Add metadata
#' add_metadata(constant = BMI,
#'              column_name = BMI,
#'              display_name = "Body mass index (BMI)",
#'              col_type = "dbl")
#'
#' # Remove metadata
#' rm_metadata("BMI")
rm_metadata <- function(remove,
                        path = here::here("metadata.csv")) {

    # Read metadata
    metadata <- readr::read_csv(file = path,
                                col_types = readr::cols(.default = "c"))

    # Check if constant is in metadata
    if(!remove %in% metadata$constant){
        stop(glue::glue("Constant '{remove}' not in metadata."))
    }

    # Remove constant row
    metadata <- metadata %>%
        dplyr::filter(!constant == remove)

    # Write metadata
    readr::write_csv(metadata, path)

    # Give status
    cli::cli_alert_success("Removed the following parameter from the metadata:")
    cli::cli_ul(remove)
}        

Now, let's test our new functions. Let's begin by adding some metadata.

# Console #

devtools::load_all()
add_metadata(constant = "EDUCATION",
             column_name = "Education",
             display_name = "Education",
             col_type = "str")

add_metadata(constant = "BMI",
             column_name = "BMI",
             display_name = "Body mass index (BMI)",
             col_type = "dbl")

add_metadata(constant = "SEX",
             column_name = "Sex",
             display_name = "Sex",
             col_type = "str")

add_metadata(constant = "DIABETES",
             column_name = "Diabetes",
             display_name = "Diabetes status",
             col_type = "str")        

And then remove SEX:

# Console #

rm_metadata("SEX")        
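
At this point the metadata file should contain roughly the following (the type column comes from the default columns in create_metadata() and was never filled in, so the exact representation of those empty cells may differ):

# metadata.csv (illustrative) #

constant,type,column_name,display_name,col_type
EDUCATION,NA,Education,Education,str
BMI,NA,BMI,Body mass index (BMI),dbl
DIABETES,NA,Diabetes,Diabetes status,str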

Step 3: Create function to extract metadata

While we now have a system to add and remove metadata, there are scenarios where we need to access specific metadata columns. To facilitate this, we can create a function that retrieves metadata for us.

Let's begin by creating a new R file for our function.

# Console #

usethis::use_r("extract_metadata")        

Now, copy and paste the following into “R/extract_metadata.R”.

# R/extract_metadata.R #

#' Extracts metadata
#'
#' @param columns str | Column or vector of columns you want to extract
#' @param path str | Path to metadata. Default: here::here("metadata.csv").
#' @param col_type str | Type of data. Default: NULL.
#'
#' @return Tibble with extracted metadata.
#' @export
#'
#' @examples
#' # Create metadata
#' create_metadata("constant",
#'                 "column_name",
#'                 "display_name",
#'                 "col_type")
#'
#' # Add metadata
#' add_metadata(constant = BMI,
#'              column_name = BMI,
#'              display_name = "Body mass index (BMI)",
#'              col_type = "dbl")
#'
#' # Extract metadata
#' extract_metadata("BMI")
extract_metadata <- function(columns, path = here::here("metadata.csv"), col_type = NULL) {

    # Read data
    metadata <- readr::read_csv(path,
                                col_types = readr::cols(.default = "c"))

    # Extract everything
    if(is.null(col_type)) {
        data <- metadata %>%
            dplyr::select(dplyr::all_of(columns))
    } else {
    # Extract specific field
    data <- metadata %>%
        dplyr::filter(type == {{ col_type }}) %>%
        dplyr::select(dplyr::all_of(columns))
    }

    # Return
    data
}        

As always, let’s test it:

# Console #

devtools::load_all()
extract_metadata("constant")        
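
As a quick illustration of why this is handy, we can look up the display name stored for a given constant, for example when labelling a plot or writing a report (using the entries we added above):

# Console #

devtools::load_all()

# Look up the display name stored for the BMI constant
extract_metadata(c("constant", "display_name")) %>%
    dplyr::filter(constant == "BMI") %>%
    dplyr::pull(display_name)
# Returns "Body mass index (BMI)"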

In the end, this metadata system is very basic and is susceptible to breaking if not handled properly. However, as with anything in R, you have the freedom to modify the functions, add new features, or come up with better (or worse) ways of doing things.


6. Define CONSTANTS

A prevalent programming principle involves the use of constants (or CONSTANTS). Constants represent fixed values that remain unchanged throughout program execution. They provide meaningful names for unchanging values, aiding code readability and maintenance. Although R does not have a straightforward method to manage constants, we can use our metadata system to define variables in our metadata that can act as constants.

In this example, we aim to create constants referring to column names in our data. This approach allows us to change column names dynamically without worrying about breaking the code. Another way of looking at constants is that we create key-value pairs, where the constant is the key, and the value is the column name.
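
For example, later in the targets pipeline we will write dplyr::select(dplyr::all_of(c(DIABETES, EDUCATION, BMI))) instead of hard-coding the column names. A minimal sketch of the idea (the values and the placeholder data frame are illustrative; the real constants are created in the next step):

# Illustrative example #

# The constant (key) holds the column name (value)
BMI <- "BMI"

# Code refers to the constant, never to the column name itself
# ('some_data' is a placeholder for any data frame containing that column)
some_data %>%
    dplyr::select(dplyr::all_of(BMI))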

Step 1: Create function to write internal data

My current way of defining constants is to save them as internal data. While internal data is typically used for R package development, we can adapt its functionality to meet our needs.

Before creating the function to write internal data, we need to add purrr to our DESCRIPTION file (it allows us to create “functional” loops).

# Console #

usethis::use_package("purrr")        

Now we can create a new R script:

# Console #

usethis::use_r("internal_data")        

In the newly created “R/internal_data.R”, insert the following:

# R/internal_data.R #

#' Writes metadata file into internal data.
#'
#' @param path str | Path to metadata file. Default: "metadata.csv".
#'
#' @return NA
#' @export
#'
#' @examples
#' # Create metadata
#' create_metadata("constant",
#'                 "column_name",
#'                 "display_name",
#'                 "col_type")
#'
#' # Add metadata
#' add_metadata(constant = BMI,
#'              column_name = BMI,
#'              display_name = "Body mass index (BMI)",
#'              col_type = "dbl")
#'
#' write_internal_data()
write_internal_data <- function(path = here::here("metadata.csv")) {

    # Read data
    metadata <- readr::read_csv(path,
                                col_types = readr::cols(.default = "c"))

    # Extract constants
    constants <- metadata %>%
        dplyr::select(constant) %>%
        dplyr::pull()

    # Extract column names
    col_names <- metadata %>%
        dplyr::select(column_name) %>%
        dplyr::pull()

    # Create new environment
    temp_env <- rlang::env()

    # Write the constants into the temporary environment
    purrr::map2(col_names, constants, ~assign(.y, .x, envir = temp_env))

    # Save as sysdata
    save(list = constants, file = "R/sysdata.rda", envir = temp_env)

    # Give status
    cli::cli_alert_success("Wrote internal data!")

    # Reload if interactive
    if(interactive()) devtools::load_all()
}        

To test the function, let's try creating/writing our internal data:

# Console #

devtools::load_all()
write_internal_data()         

And see if we can access it from anywhere by printing the column name stored in the constant DIABETES.

# Console #

devtools::load_all()
print(DIABETES)        

Note: An alternative and potentially better method of storing constants (key-value pairs) would be to use custom environments within your project. This method could provide a more organised and flexible way to manage your constants; however, it is not something I have tried as of yet.
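
For the curious, a minimal sketch of what that alternative could look like (untested in this workflow; names and values are illustrative):

# Illustrative example #

# A dedicated environment acting as a simple key-value store
constants <- rlang::env(
    BMI       = "BMI",
    DIABETES  = "Diabetes",
    EDUCATION = "Education"
)

# Look up a value
constants$DIABETES
# Returns "Diabetes"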


7. Documenting your Code

Before continuing, we need to take a small detour and talk about documentation. One of the critical aspects of creating reproducible code is comprehensive documentation. Although the ideal is to create self-explanatory code, it's often challenging to achieve this in practice. A function's purpose, usage, and expected inputs and outputs may not be immediately clear without proper documentation.

In relation to this, you may have noticed the information we've included at the top of each function so far. This is known as a roxygen skeleton, a vital tool for documenting your R code (explained in detail here). Using roxygen skeletons while doing data analysis provides a structured way to describe what each function does, enhancing its usability for others.

Furthermore, the devtools package offers a convenient function to render our roxygen comments as actual help pages inside R. To generate the documentation, run:

# Console #

devtools::document()        

Now, we can access the documentation for our functions:

# Console #

?rm_metadata        

8. Get some data

While setting up an effective workflow is crucial, it is equally important to have actual data to work with. For this tutorial, which focuses on the reproducibility aspect of data analysis, we will use the NHANES dataset, which is relatively clean and tidy.

To get the NHANES dataset, let's create an R script that downloads and stores the raw data inside the data-raw/ folder. First, let's add the NHANES package to the DESCRIPTION file:

# Console # 

usethis::use_package("NHANES")        

Next, let's create an R script to import the data.

# Console #

usethis::use_data_raw("import_data")
        

This should open the data-raw/import_data.R file, where you should replace the existing content with:

# data-raw/import_data.R #

#' Downloads the NHANES data and saves it as raw data.
#'
#' @param ... For compatibility with the targets pipeline.
#' 
#' @return str | File path to data
#' @export
#'
#' @examples
#' get_raw_data()
get_raw_data <- function(...) {

  # Write data
  readr::write_csv(NHANES::NHANES, here::here("data-raw/nhanes.csv"))

  # Return file
  here::here("data-raw/nhanes.csv")
}        

You might have noticed the use of a '...' argument which isn't directly used within the function. This is because the targets package, which we are going to use for workflow automation, requires it (to the best of my knowledge) to link our metadata system to our raw data.


9. Workflow automation with targets

At this stage, we've set up our project environment and have a dataset, meaning it's time to do some data analysis. One challenging aspect is managing the data analysis workflow itself. Here, the targets package allows us to streamline and automate it: targets automatically tracks dependencies, manages file-based resources, and can execute tasks in parallel, allowing us as users to focus on the actual data analysis.

The targets workflow stands in contrast to a script-based workflow (a master script that runs everything in order), which many researchers and data analysts may be familiar with.
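
For contrast, such a master script typically looks something like the sketch below: every step reruns on every execution, regardless of what actually changed (file names are illustrative):

# Illustrative example (script-based workflow) #

source("R/01_import.R")  # always reruns
source("R/02_clean.R")   # always reruns
source("R/03_plot.R")    # always reruns
source("R/04_report.R")  # always reruns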

Step 1: Setup targets

First, we need to install and set up targets. Let's start by adding targets, along with its companion packages tarchetypes and visNetwork, to the DESCRIPTION file:

# Console #

usethis::use_package("targets")
usethis::use_package("tarchetypes")
usethis::use_package("visNetwork")        

Next, we initialize the targets package:

# Console #

targets::use_targets()        

This command will generate several files and folders in your root directory. The most crucial is the _targets.R script, where we will outline our workflow.

Step 2: Customize the _targets.R script

In order to understand how targets works, it is a good idea to give the _targets.R file a good look over. While doing that, we are going to edit it to align with our data analysis workflow.

The initial few lines contain information about the?targets?package, which we don't need to alter.

[Image: _targets.R - first lines]

The subsequent lines define the packages needed to execute our pipeline.

[Image: _targets.R - library calls]

In our case, we need targets and tarchetypes (which allows us to automatically create Quarto documents). We should also include dplyr and magrittr. Let's add them:

# _targets.R #

# Load packages required to define the pipeline.
library(targets)
library(tarchetypes)
library(dplyr)
library(magrittr)

# Load internal data
load("R/sysdata.rda")

Note: Packages defined here are required for the pipeline itself to run, and should not include the packages used in our data analysis.

Note 2: We load the internal data with load("R/sysdata.rda") because we plan to use constants directly inside the targets pipeline. If you are not using constants inside the pipeline, this line can be omitted.

The next section sets options for our?targets?workflow.

[Image: _targets.R - target options]

Here, we need to specify the packages used for our data analysis. Instead of manually listing each package, we can extract them automatically from the DESCRIPTION file using dplyr and desc.

Replace the default options with:

# _targets.R #

# Set target options:
tar_option_set(
    packages = desc::desc_get_deps() %>%
        dplyr::filter(type == "Imports") %>%
        dplyr::select(package) %>%
        dplyr::pull(),
    format = "rds"
)        

The next part of the script runs tar_source(). This function sources all the files in the R/ folder, loading in all our functions.

[Image: _targets.R - functions to source]

In addition to the functions in the R/ folder, we also need our function to download the raw data, located in the data-raw/ folder. So, let's add that:

# _targets.R #

# Run the R scripts in the R/ folder with your custom functions:
tar_source()
source("data-raw/import_data.R")        

This brings us to the last part, which is the actual targets workflow.

[Image: _targets.R - target workflow]

The targets workflow is defined as a list, where each step is defined with the tar_target() function.

Step 3: Define the workflow

The power of the targets package lies in its ability to organize, streamline, and automate complex data analysis workflows. To illustrate this, we'll create a fairly straightforward workflow that can be extended to fit smaller or bigger projects.

  1. Write internal data
  2. Import raw data
  3. Read data
  4. Clean data
  5. Create graphs and tables
  6. Produce a reproducible document

This sequence of steps represents a common data analysis workflow, moving from the ingestion of raw data to the production of a reproducible document containing some results.

Let's begin by creating a skeleton of all the workflow steps:

# _targets.R #

list(
  # 1 - Write internal data
  # TODO

  # 2 - Import raw data
  # TODO

  # 3 - Read data
  # TODO

  # 4 - Clean data
  # TODO

  # 5 - Create visualisations
  # TODO

  # 6 - Create reproducible document
  # TODO
)

Now, let's go through each step in turn.

1 - Write internal data

The first step is to incorporate our metadata system into the pipeline. Here, we want targets to recognise that whenever we add or remove a metadata entry, the workflow should be updated. To accomplish this, we'll define our metadata as a watched file and write new internal data each time the metadata file changes.

Let's create our first target to do this:

# _targets.R #

list(
	# Step 1 - Write internal data
  tar_target(name = metadata,
             command = here::here("metadata.csv"),
             format = "file"),
  tar_target(name = write_sys_data,
             command = write_internal_data(metadata))

	# 2 - Import raw data
  # TODO

  # 3 - Read data
  # TODO

	# 4 - Clean data
	# TODO 

  # 5 - Create visualisations
	# TODO

  # 6 - Create reproducible document
  # TODO
)        

We use the argument format = "file" because we're referring to a specific file, not an object.

2 - Import raw data

The next step involves retrieving our raw data. This step utilizes our get_raw_data() function. Note that we're not editing raw data directly, but rather fetching it from an external source.

# _targets.R #

list(
	# Step 1 - Write internal data
  tar_target(name = metadata,
             command = here::here("metadata.csv"),
             format = "file"),
  tar_target(name = write_sys_data,
             command = write_internal_data(metadata)),

	# 2 - Import raw data
  tar_target(name = raw_data,
             command = get_raw_data(metadata),
             format = "file")

  # 3 - Read data
  # TODO

	# 4 - Clean data
	# TODO 

  # 5 - Create visualisations
	# TODO

  # 6 - Create reproducible document
  # TODO
)        

Note: Often, you would not need a target to retrieve your raw data. In fact, you should avoid directly modifying your raw data. However, as our raw data is located externally, we need a specific target to retrieve it.

3 - Read data

With the raw data retrieved, we want to open it as an R object. We'll use the read_csv() function from the readr package to accomplish this as our next target.

# _targets.R #

list(
	# Step 1 - Write internal data
  tar_target(name = metadata,
             command = here::here("metadata.csv"),
             format = "file"),
  tar_target(name = write_sys_data,
             command = write_internal_data(metadata)),

	# 2 - Import raw data
  tar_target(name = raw_data,
             command = get_raw_data(metadata),
             format = "file"),

  # 3 - Read data
  tar_target(name = nhanes_df,
             command = readr::read_csv(raw_data))

	# 4 - Clean data
	# TODO 

  # 5 - Create visualisations
	# TODO

  # 6 - Create reproducible document
  # TODO
)        

4 - Clean data

Now that we have read our raw data as an object, our next step is to clean the dataset. Although the NHANES dataset is fairly clean, we will select columns based on some of our constants to further refine our data.

By selecting columns based on our constants, we extract only the columns we need for our analysis. This helps improve the efficiency of the analysis and reduces the chance of errors.

# _targets.R #

list(
  # Step 1 - Write internal data
  tar_target(name = metadata,
             command = here::here("metadata.csv"),
             format = "file"),
  tar_target(name = write_sys_data,
             command = write_internal_data(metadata)),

  # 2 - Import raw data
  tar_target(name = raw_data,
             command = get_raw_data(metadata),
             format = "file"),

  # 3 - Read data
  tar_target(name = nhanes_df,
             command = readr::read_csv(raw_data)),

  # 4 - Clean data
  tar_target(name = clean_df,
             command = nhanes_df %>%
             dplyr::select(dplyr::all_of(c(DIABETES,
                                           EDUCATION,   
                                           BMI))))
  # 5 - Create visualisations
  # TODO

  # 6 - Create reproducible document
  # TODO
)        

5 - Create Visualisations

Having cleaned our data, we can create visualisations to better understand the data. We'll utilize the ggplot2 package in R for this purpose.

Before continuing, add ggplot2 to the DESCRIPTION file:

# Console #

usethis::use_package("ggplot2")        

Now, we can create a target that builds a visualisation. When working with ggplot2 and utilising constants, we need to convert our constants into symbols using !!rlang::sym(constant).

Let's do that:

# _targets.R #

list(
  # Step 1 - Write internal data
  tar_target(name = metadata,
             command = here::here("metadata.csv"),
             format = "file"),
  tar_target(name = write_sys_data,
             command = write_internal_data(metadata)),

  # 2 - Import raw data
  tar_target(name = raw_data,
             command = get_raw_data(metadata),
             format = "file"),

  # 3 - Read data
  tar_target(name = nhanes_df,
             command = readr::read_csv(raw_data)),

  # 4 - Clean data
  tar_target(name = clean_df,
             command = nhanes_df %>%
             dplyr::select(dplyr::all_of(c(DIABETES,
                                           EDUCATION,   
                                           BMI)))),
  # 5 - Create visualisations
  tar_target(name = education_vs_bmi,
             command = clean_df %>% 
             ggplot2::ggplot(ggplot2::aes(x = !!rlang::sym(EDUCATION),
                                          y = !!rlang::sym(BMI))) +
             ggplot2::geom_boxplot())
  
  # 6 - Create reproducible document
  # TODO
)        

Step 4: Run the workflow

With most of our workflow now defined, the next crucial step is to put it into action and observe the outcomes. To run the workflow, we use the following command:

# Console #

targets::tar_make()        

Now, whenever we make any changes to the workflow, only the parts affected by the change are re-executed. This feature is particularly beneficial in larger, more complex data analysis projects, where it can save substantial amounts of time.
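
If you want to check which targets are out of date before rerunning anything, targets also provides tar_outdated():

# Console #

targets::tar_outdated()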

Beyond execution, targets also offers functionalities for visualizing the entire workflow. We can generate a visual network of our pipeline using the following command:

# Console #

targets::tar_visnetwork()        

Lastly, one thing to keep in mind is that objects and visualisations created as part of a targets workflow are not immediately viewable in our R environment and need to be accessed through targets. Let's look at the visualisation we just created.

# Console #

targets::tar_read(education_vs_bmi)        

10. Create reproducible documents with Quarto

As a final step for our data analysis pipeline, we would like to create an automated and reproducible document. We can achieve this by integrating Quarto or R Markdown documents with targets.

Let's start by creating a Quarto document inside the doc/ directory:

# Console #

fs::file_create(here::here("doc/report.qmd"))        

We also need to install the rmarkdown and quarto packages.

# Console #

usethis::use_package("rmarkdown")
usethis::use_package("quarto")        

Now, let's create a Quarto document that generates a simple data analysis report. We use targets::tar_read() to access target objects inside Quarto documents.

# doc/report.qmd #

---
title: "Report"
code-fold: true
format: html
---

```{r setup}
#| include: false

# Set targets store location
targets::tar_config_set(store = here::here("_targets"))

# Read data
nhanes_df <- targets::tar_read(clean_df)
```

# Data
The dataset looks like this:

```{r}
#| warning: false
#| message: false

dplyr::glimpse(nhanes_df)
```

# Boxplot
BMI vs education as a boxplot.

```{r}
#| warning: false
#| message: false
targets::tar_read(education_vs_bmi)
```        

Next, we can go back to our pipeline, and add our final workflow step.

# _targets.R #

list(
  # Step 1 - Write internal data
  tar_target(name = metadata,
             command = here::here("metadata.csv"),
             format = "file"),
  tar_target(name = write_sys_data,
             command = write_internal_data(metadata)),

  # 2 - Import raw data
  tar_target(name = raw_data,
             command = get_raw_data(metadata),
             format = "file"),

  # 3 - Read data
  tar_target(name = nhanes_df,
             command = readr::read_csv(raw_data)),

  # 4 - Clean data
  tar_target(name = clean_df,
             command = nhanes_df %>%
             dplyr::select(dplyr::all_of(c(DIABETES,
                                           EDUCATION,   
                                           BMI)))),
  # 5 - Create visualisations
  tar_target(name = education_vs_bmi,
             command = clean_df %>% 
             ggplot2::ggplot(ggplot2::aes(x = !!rlang::sym(EDUCATION),
                                          y = !!rlang::sym(BMI))) +
             ggplot2::geom_boxplot()),

  # 6 - Create reproducible document
  tar_quarto(name = report,
             path = "doc/report.qmd")
)        

Now, let's execute the workflow and verify that it successfully generates a report.html inside the doc/ directory.

# Console # 

targets::tar_make()        

After the pipeline has finished, we should end up with a report.html akin to this:

[Image: Quarto report]

Quarto documents offer an impressive level of customisation, which can be explored further in the official Quarto documentation.


11. Packaging the Data Analysis Workflow

The final step in our data analysis workflow is packaging. This ensures that our code can be easily shared and reproduced by others. There are two primary ways we can package our workflow: by using a run script or by creating a Docker image.

Run Script

We can leverage the run.R file, auto-generated by the targets package, to automate our workflow by writing the necessary R function calls to execute our analysis in its entirety. Here's how we can define it:

# run.R #

# Install renv
install.packages('renv', verbose = FALSE)

# Install all packages
renv::restore()

# Run workflow
targets::tar_make()        

You can then package the necessary files and directories for distribution. At a minimum, you need to include the following files:

  • _targets.R
  • DESCRIPTION
  • metadata.csv
  • NAMESPACE
  • renv.lock
  • run.R
  • reproducible-data-analysis.Rproj (your project file name might differ)
  • .Rprofile

And the following folders:

  • doc/
  • data/
  • data-raw/
  • man/
  • R/

You can also choose to include the renv/ folder. If you don't include it, your scripts will reinstall all packages from source.

After having packaged your project, the entire analysis can be rerun using:

# Console #

source("run.R")        

Docker

Alternatively, we can containerize our entire application using Docker. This ensures that the analysis can be run in an identical environment, further enhancing reproducibility. First, let's create a Dockerfile:

# Console #

fs::file_create("Dockerfile")        

Next, we'll define the container environments and actions:

# Install R
FROM rocker/verse:4.2.3

# Install curl & openssl
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
    libcurl4-openssl-dev libfribidi-dev libharfbuzz-dev \
    libssl-dev libfontconfig1-dev libfreetype6-dev libpng-dev \
    libtiff5-dev libjpeg-dev libxml2-dev

# Install renv
RUN Rscript -e 'install.packages("renv")'

# Copy renv files
WORKDIR /project
COPY renv.lock renv.lock
COPY renv/ renv/

# Set renv lib path
ENV RENV_PATHS_LIBRARY renv/library

# Restore renv
RUN R -e "renv::restore()"

# Copy remaining files
COPY _targets.R _targets.R
COPY DESCRIPTION DESCRIPTION
COPY metadata.csv metadata.csv
COPY NAMESPACE NAMESPACE
COPY run.R run.R

# Copy r project file (This might be different for you!)
COPY reproducible-data-analysis.Rproj reproducible-data-analysis.Rproj

# Copy remaining folders
COPY ./data data/
COPY ./data-raw data-raw/
COPY ./doc doc/
COPY ./R R/
COPY ./man man/

# Run targets
CMD R -e "targets::tar_make()"

We can then build a Docker image from the terminal:

# Terminal #

docker build -t reproducible_data_analysis .        

Once the image is built, you can share it, providing others with an identical environment for running the analysis. To run the analysis, run:

# Terminal #
docker run --rm reproducible_data_analysis        

Note: You need the Docker CLI tools installed to build and run the image.

Final thoughts

_targets.R script

The final “_targets.R” script is outlined below.

# _targets.R #

# Created by use_targets()
# Follow the comments below to fill in this target script.
# Then follow the manual to check and run the pipeline:
#   <https://books.ropensci.org/targets/walkthrough.html#inspect-the-pipeline> # nolint

# Load packages required to define the pipeline:
library(targets)
library(tarchetypes)
library(dplyr)
library(magrittr)
library(here)

# Load internal data
load("R/sysdata.rda")

# Set target options:
tar_option_set(
  packages = desc::desc_get_deps() %>%
    dplyr::filter(type == "Imports") %>%
    dplyr::select(package) %>%
    dplyr::pull(),
  format = "rds" # default storage format
)

# tar_make_clustermq() configuration (okay to leave alone):
options(clustermq.scheduler = "multicore")

# tar_make_future() configuration (okay to leave alone):
# Install packages {{future}}, {{future.callr}}, and {{future.batchtools}} to allow use_targets() to configure tar_make_future() options.

# Run the R scripts in the R/ folder with your custom functions:
tar_source()
source("data-raw/import_data.R")
# source("other_functions.R") # Source other scripts as needed. # nolint

# Replace the target list below with your own:
list(
  # Step 1 - Write internal data
  tar_target(name = metadata,
             command = here::here("metadata.csv"),
             format = "file"),
  tar_target(name = write_sys_data,
             command = write_internal_data(metadata)),

  # 2 - Import raw data
  tar_target(name = raw_data,
             command = get_raw_data(metadata),
             format = "file"),

  # 3 - Read data
  tar_target(name = nhanes_df,
             command = readr::read_csv(raw_data)),

  # 4 - Clean data
  tar_target(name = clean_df,
             command = nhanes_df %>%
             dplyr::select(dplyr::all_of(c(DIABETES,
                                           EDUCATION,
                                           BMI)))),
  # 5 - Create visualisations
  tar_target(name = education_vs_bmi,
             command = clean_df %>%
             ggplot2::ggplot(ggplot2::aes(x = !!rlang::sym(EDUCATION),
                                          y = !!rlang::sym(BMI))) +
             ggplot2::geom_boxplot()),

  # 6 - Create reproducible document
  tar_quarto(name = report,
             path = "doc/report.qmd")
)        

Git

Perhaps the largest omission from this post is the fact that we do not use Git. Git is version control software that can greatly enhance your workflow and reproducibility. However, as the actual installation and setup of Git can differ a lot depending on software and hardware, I instead refer to these excellent guides on setting up and using Git.

Template

If you'd prefer not to set everything up manually, the workflow is available as a GitHub template here.

Learn more...

This post, although lengthy, only covers a fraction of all the functionalities and possibilities for doing reproducible data analysis in R. If you want to learn more, you can access the function documentation that is linked above.

Several of the tools and packages used here were introduced to me by Luke Johnston. Furthermore, some of the tools are covered in Luke Johnston's excellent series of R courses.
