dtrackr

dtrackr: An R package for tracking the provenance of data - Published in JOSS (2022)

https://github.com/terminological/dtrackr

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago

Repository

An R library for managing and documenting dplyr data pipelines

Basic Info
Statistics
  • Stars: 68
  • Watchers: 4
  • Forks: 6
  • Open Issues: 9
  • Releases: 18
Created about 5 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License

README.Rmd

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

library(magrittr)
library(dplyr)
library(dtrackr)

here::i_am("README.Rmd")
```

# dtrackr: Track your Data Pipelines 


[![R-CMD-check](https://github.com/terminological/dtrackr/workflows/R-CMD-check/badge.svg)](https://github.com/terminological/dtrackr/actions)
[![DOI](https://zenodo.org/badge/335974323.svg)](https://zenodo.org/badge/latestdoi/335974323)
[![dtrackr status badge](https://terminological.r-universe.dev/badges/dtrackr)](https://terminological.r-universe.dev)
[![metacran downloads](https://cranlogs.r-pkg.org/badges/dtrackr)](https://cran.r-project.org/package=dtrackr)
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/dtrackr)](https://cran.r-project.org/package=dtrackr)
[![codecov](https://codecov.io/gh/terminological/dtrackr/branch/main/graph/badge.svg?token=FR1SBH82D3)](https://app.codecov.io/gh/terminological/dtrackr)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.04707/status.svg)](https://doi.org/10.21105/joss.04707)
[![EPSRC badge](https://img.shields.io/badge/EPSRC%20grant-EP%2FY028392%2F1-05acb5)](https://gtr.ukri.org/projects?ref=EP%2FY028392%2F1)



## Overview

Accurate documentation of a data pipeline is a first step to reproducibility,
and a flow chart describing the steps taken to prepare data is a useful part of
this documentation. In analyses that rely on frequently updated data,
documenting a data flow by copying and pasting row counts into flowcharts in
PowerPoint quickly becomes tedious. With interactive data analysis, and
particularly with RMarkdown, code execution sometimes happens in a non-linear
fashion, which can lead to confusion at best and erroneous analysis at worst.
Basing such documentation on what the code does when executed sequentially can
be inaccurate when the data has been analysed interactively.

The goal of `dtrackr` is to take away this pain by instrumenting and monitoring
a dataframe through a `dplyr` pipeline, recording a step-by-step summary of the
important parts of the wrangling, as it actually happened to the dataframe, in
the dataframe's own metadata. This metadata can be used to generate
documentation as a flowchart, which allows both a quick overview of the data and
a visual check of the actual data processing.

## Installation

In general use, `dtrackr` is expected to be installed alongside the `tidyverse`
set of packages. It is recommended to install `tidyverse` first.

Binary packages of `dtrackr` are available on CRAN and r-universe for `macOS` 
and `Windows`. `dtrackr` can be installed from source on Linux. `dtrackr` has 
been tested on R versions 3.6, 4.0, 4.1 and 4.2. 

You can install the released version of `dtrackr` from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("dtrackr")
```

### System dependencies for installation from source

For installation from source on Linux, `dtrackr` has transitive dependencies on
a few system libraries. These can be installed with the following commands:

```BASH
# Ubuntu 20.04 and other Debian-based distributions:
sudo apt-get install libcurl4-openssl-dev libssl-dev librsvg2-dev \
  libicu-dev libnode-dev libpng-dev libjpeg-dev libpoppler-cpp-dev

# CentOS 8
sudo dnf install libcurl-devel openssl-devel librsvg2-devel \
  libicu-devel libpng-devel libjpeg-turbo-devel poppler-devel

# For other Linux distributions I suggest using the R pak library:
# install.packages("pak")
# pak::pkg_system_requirements("dtrackr")

# N.B. There are additional suggested R package dependencies on 
# the `tidyverse` and `rstudioapi` packages which have a longer set of dependencies. 
# We suggest you install them individually first if required.
```

### Alternative versions of `dtrackr`

Early release versions are available on the `r-universe`. These will typically
be more up to date than the CRAN release.

```r
# Enable repository from terminological
options(repos = c(
  terminological = 'https://terminological.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
# Download and install dtrackr in R
install.packages('dtrackr')
```

The unstable development version is available from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("terminological/dtrackr")
```

## Example usage

Suppose we are constructing a data set with our initial input being the `iris`
data. Our analysis depends on some `cutOff` parameter and we want to prepare a
stratified data set that excludes flowers with narrow petals, and those with the
biggest petals of each Species. With `dtrackr` we can mix regular `dplyr`
commands with additional `dtrackr` commands such as `comment` and `status`, and
with enhanced implementations of `dplyr::filter` called `exclude_all` and
`include_any`.

```{r example}
# a pipeline parameter
cutOff = 3

# the pipeline
dataset = iris %>% 
  track() %>%
  status() %>%
  group_by(Species) %>%
  status(
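    # count rows either side of the cut-off; both names are used in .messages
    short = p_count_if(Sepal.Width<cutOff), 
    long = p_count_if(Sepal.Width>=cutOff), 
    .messages=c("consisting of {short} short sepal <{cutOff}","and {long} long sepal >={cutOff}")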
  )  %>%
  exclude_all(
    Petal.Width<0.3 ~ "excluding {.excluded} with narrow petals",
    Petal.Width == max(Petal.Width) ~ "and {.excluded} outlier"
  ) %>%
  comment("test message") %>%
  status(.messages = "{.count} of type {Species}") %>%
  ungroup() %>%
  status(.messages = "{.count} together with cutOff {cutOff}") 
```
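The pipeline above uses `exclude_all`; its counterpart `include_any` keeps rows that match at least one criterion. The following is a minimal sketch of how it might be used, assuming the same formula syntax as `exclude_all` with an `{.included}` count in place of `{.excluded}`. The `extremes` name and the thresholds are illustrative only and are not part of the analysis above.

```r
library(dplyr)
library(dtrackr)

# Keep only rows that satisfy at least one inclusion criterion.
# Each criterion is reported per Species group in the tracked history.
extremes = iris %>%
  track() %>%
  group_by(Species) %>%
  include_any(
    Petal.Length > 5 ~ "{.included} with long petals",
    Petal.Length < 2 ~ "{.included} with short petals"
  ) %>%
  ungroup()

# every retained row meets one of the two conditions
all(extremes$Petal.Length > 5 | extremes$Petal.Length < 2)
```

Because the criteria are combined with "any" logic, a row is dropped only when it fails every condition listed.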

Having prepared our dataset, we conduct our analysis and want to write it up and
prepare it for submission. A visual summary is a key part of documenting the
data pipeline and, for bio-medical journals or clinical trials, often a
requirement.


```R
dataset %>% flowchart()
```

```{r include=FALSE}
# Needed because this is a github README to allow relative links to a hosted
# file.
dataset %>% flowchart(
  here::here("man/figures/README-flowchart.png")) %>%
  invisible()
```



And your publication-ready data pipeline, with any assumptions you care to
document, is created in a format of your choice (as long as that choice is one
of `pdf`, `png`, `svg` or `ps`), ready for submission to Nature.

This is a trivial example, but the more complex the pipeline, the bigger the
benefit you will get.
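To write the flowchart to disk in specific formats, something like the following may work. This is a sketch only: it assumes that `flowchart()` accepts `filename` and `formats` arguments as described in the package documentation, and that one file per format is written with the matching extension. The output path is a hypothetical temporary location.

```r
library(dplyr)
library(dtrackr)

# hypothetical output location, given without a file extension
out = file.path(tempdir(), "pipeline-flowchart")

iris %>%
  track() %>%
  status() %>%
  # assumption: one file per requested format is written alongside `out`
  flowchart(filename = out, formats = c("png", "pdf")) %>%
  invisible()

# list whatever files were produced
list.files(tempdir(), pattern = "pipeline-flowchart")
```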

Check out the [main documentation for more details](https://terminological.github.io/dtrackr/),
and in particular the [getting started vignette](https://terminological.github.io/dtrackr/articles/dtrackr.html).

## Funding

The authors gratefully acknowledge the support of the UK Research and Innovation
AI programme of the Engineering and Physical Sciences Research Council [EPSRC
grant EP/Y028392/1](https://gtr.ukri.org/projects?ref=EP%2FY028392%2F1).

Owner

  • Name: terminological
  • Login: terminological
  • Kind: organization
  • Email: rob@terminological.co.uk

Health informatics and data analytics

JOSS Publication

dtrackr: An R package for tracking the provenance of data
Published
December 13, 2022
Volume 7, Issue 80, Page 4707
Authors
Robert Challen ORCID
Engineering Mathematics, University of Bristol, Bristol, United Kingdom; College of Engineering, Mathematics and Physical Sciences, University of Exeter, Devon, United Kingdom
Editor
Andrew Stewart ORCID
Tags
data pipeline, consort diagram, strobe statement, data quality, reproducible research

GitHub Events

Total
  • Create event: 3
  • Issues event: 2
  • Release event: 3
  • Watch event: 5
  • Issue comment event: 3
  • Push event: 7
Last Year
  • Create event: 3
  • Issues event: 2
  • Release event: 3
  • Watch event: 5
  • Issue comment event: 3
  • Push event: 8

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 118
  • Total Committers: 3
  • Avg Commits per committer: 39.333
  • Development Distribution Score (DDS): 0.042
Past Year
  • Commits: 5
  • Committers: 1
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Rob Challen r****b@t****k 113
Lionel Henry l****y@g****m 4
TJ McKinley t****y@e****k 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 32
  • Total pull requests: 3
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 2 days
  • Total issue authors: 6
  • Total pull request authors: 3
  • Average comments per issue: 2.03
  • Average comments per pull request: 1.67
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 5 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • robchallen (15)
  • debruine (7)
  • craig-willis (6)
  • dvaiman (2)
  • pteridin (1)
  • matlyons (1)
Pull Request Authors
  • tjmckinley (1)
  • seabbs (1)
  • lionel- (1)
Top Labels
Issue Labels
enhancement (11), help wanted (3), documentation (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 258 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 5
  • Total maintainers: 1
cran.r-project.org: dtrackr

Track your Data Pipelines

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 258 Last month
Rankings
Stargazers count: 6.6%
Forks count: 9.6%
Average: 22.2%
Dependent repos count: 24.0%
Dependent packages count: 28.8%
Downloads: 42.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/draft-pdf.yml actions
  • actions/checkout v2 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 2.10 depends
  • V8 * imports
  • base64enc * imports
  • dplyr * imports
  • fs * imports
  • glue * imports
  • htmltools * imports
  • magrittr * imports
  • pdftools * imports
  • png * imports
  • purrr * imports
  • rlang * imports
  • rsvg * imports
  • stringr * imports
  • tibble * imports
  • tidyr * imports
  • utils * imports
  • covr * suggests
  • devtools * suggests
  • here * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • rstudioapi * suggests
  • survival * suggests
  • testthat >= 2.1.0 suggests
  • tidyselect * suggests
  • tidyverse * suggests