daiquiri

daiquiri: Data Quality Reporting for Temporal Datasets - Published in JOSS (2022)

https://github.com/ropensci/daiquiri

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    2 of 3 committers (66.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

data-quality initial-data-analysis r r-package reproducible-research rstats temporal-data time-series

Scientific Fields

Engineering Computer Science - 60% confidence
Last synced: 4 months ago

Repository

Data quality reporting for temporal datasets.

Statistics
  • Stars: 38
  • Watchers: 3
  • Forks: 3
  • Open Issues: 7
  • Releases: 8
Created about 4 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Codemeta

README.Rmd

---
output: github_document
---



```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# daiquiri 


[![CRAN Status](https://www.r-pkg.org/badges/version/daiquiri)](https://cran.r-project.org/package=daiquiri)
[![R-CMD-check](https://github.com/ropensci/daiquiri/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/daiquiri/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/ropensci/daiquiri/branch/master/graph/badge.svg)](https://app.codecov.io/gh/ropensci/daiquiri?branch=master)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/535_status.svg)](https://github.com/ropensci/software-review/issues/535)
[![JOSS paper](https://joss.theoj.org/papers/10.21105/joss.05034/status.svg)](https://doi.org/10.21105/joss.05034)


The daiquiri package generates data quality reports that enable quick visual review of temporal shifts in record-level data. Time series plots showing aggregated values are automatically created for each data field (column) depending on its contents (e.g. min/max/mean values for numeric data, no. of distinct values for categorical data), as well as overviews for missing values, non-conformant values, and duplicated rows.

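Essentially, it takes record-level data as input (one row per event, with one column holding the event date) and outputs an HTML report of time series plots, one set per data field.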

The resulting HTML reports are shareable and can contribute to forming a transparent record of the entire analysis process. The package is designed with electronic health records in mind, but can be used for any type of record-level temporal data.
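
To make the idea of per-timepoint aggregation concrete, here is a minimal base R sketch of the kind of daily summary described above. It is an illustration only and not how daiquiri is implemented: the `events` data frame, its column names and its values are all made up for this example.

```{r, eval = FALSE}
# Simulated record-level data: one row per event (illustrative values only)
events <- data.frame(
  event_date = as.Date(c("2023-01-01", "2023-01-01", "2023-01-02",
                         "2023-01-02", "2023-01-02", "2023-01-03")),
  dose = c(10, NA, 20, 40, NA, 5),
  unit = c("mg", "mg", "g", "mg", NA, "mg"),
  stringsAsFactors = FALSE
)

# Split the rows by day, then compute one summary row per day
by_day <- split(events, events$event_date)
daily <- do.call(rbind, lapply(names(by_day), function(d) {
  x <- by_day[[d]]
  data.frame(
    day = as.Date(d),
    n_rows = nrow(x),
    dose_missing_pct = 100 * mean(is.na(x$dose)),          # missing values
    dose_mean = mean(x$dose, na.rm = TRUE),                # numeric aggregate
    unit_distinct = length(unique(x$unit[!is.na(x$unit)])) # categorical aggregate
  )
}))

print(daily)
```

Plotting each of these columns against `day` gives the kind of time series that makes a sudden shift easy to spot by eye.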

## Why should I use it?

Large routinely-collected datasets are increasingly being used in research. However, given their data are collected for operational rather than research purposes, there is a greater-than-usual need for them to be checked for data quality issues before any analyses are conducted. Events occurring at the institutional level such as software updates, new machinery or processes can cause temporal artefacts that, if not identified and taken into account, can lead to biased results and incorrect conclusions. For example, the figures below show real data from a large hospital in the UK, and how it has changed over time.



The first figure shows the percentage of missing values in the 'Duration' field of a dataset containing antibiotic prescriptions, and the second figure shows the mean value of all laboratory tests checking for levels of 'creatinine' in the blood. As you can see, there are points in time where these values shift up or down suddenly and unnaturally, indicating that something changed in the way the data was collected or processed. A careful researcher needs to take these sudden changes into account, particularly if comparing or combining the data before and after these 'change points'.

While these checks should theoretically be conducted by the researcher at the initial data analysis stage, in practice it is unclear to what extent this is actually done, since it is rarely, if ever, reported in published papers. With the increasing drive towards greater transparency and reproducibility within the scientific community, this essential yet often-overlooked part of the analysis process will inevitably begin to come under greater scrutiny. The daiquiri package helps researchers conduct this part of the process more thoroughly, consistently and transparently, hence increasing the quality of their studies as well as trust in the scientific process.

## Installation

```{r, eval = FALSE}
# install the latest release from CRAN
install.packages("daiquiri")

# or install the development version from rOpenSci
install.packages("daiquiri", repos = "https://ropensci.r-universe.dev")

# or install direct from source
# install.packages("remotes")
remotes::install_github("ropensci/daiquiri")
```

## Usage

```{r}
library(daiquiri)

# load delimited file into a data.frame without doing any datatype conversion
path <- system.file("extdata", "example_prescriptions.csv", package = "daiquiri")
raw_data <- read_data(path, show_progress = FALSE)

head(raw_data)

# specify the type of data expected in each column of the data.frame
fts <- field_types(
  PrescriptionID = ft_uniqueidentifier(),
  PrescriptionDate = ft_timepoint(),
  AdmissionDate = ft_datetime(includes_time = FALSE),
  Drug = ft_freetext(),
  Dose = ft_numeric(),
  DoseUnit = ft_categorical(),
  PatientID = ft_ignore(),
  Location = ft_categorical(aggregate_by_each_category = TRUE)
)
```
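
For datasets with many columns, writing the specification by hand can be tedious. The sketch below assumes that your installed version of daiquiri exports `template_field_types()`, which prints a skeleton `field_types()` call for a data frame that you can copy into your script and edit. Check the package reference if the function is not found.

```{r, eval = FALSE}
library(daiquiri)

# load the example data shipped with the package
path <- system.file("extdata", "example_prescriptions.csv", package = "daiquiri")
raw_data <- read_data(path, show_progress = FALSE)

# print a template specification to the console, with one entry per column;
# replace each placeholder field type with the one that fits the column
# (assumes template_field_types() is available in your installed version)
template_field_types(raw_data)
```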

```{r, eval = FALSE}
# create a report in the current directory
daiq_obj <- daiquiri_report(
  raw_data,
  field_types = fts
)
```
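
The call above relies on default settings. As a hedged sketch, the variant below sets a few optional arguments (`dataset_description`, `aggregation_timeunit`, `save_directory`, `save_filename`) as I understand them from the package documentation. The description text, the weekly time unit and the output file name are arbitrary choices for this example, and the report is written to a temporary directory rather than the current one.

```{r, eval = FALSE}
library(daiquiri)

path <- system.file("extdata", "example_prescriptions.csv", package = "daiquiri")
raw_data <- read_data(path, show_progress = FALSE)

fts <- field_types(
  PrescriptionID = ft_uniqueidentifier(),
  PrescriptionDate = ft_timepoint(),
  AdmissionDate = ft_datetime(includes_time = FALSE),
  Drug = ft_freetext(),
  Dose = ft_numeric(),
  DoseUnit = ft_categorical(),
  PatientID = ft_ignore(),
  Location = ft_categorical(aggregate_by_each_category = TRUE)
)

# write the report to a temporary directory (illustrative choice)
out_dir <- tempdir()

daiq_obj <- daiquiri_report(
  raw_data,
  field_types = fts,
  dataset_description = "Example prescriptions data", # free-text label
  aggregation_timeunit = "week",                      # aggregate by week, not day
  save_directory = out_dir,
  save_filename = "example_prescriptions_report",     # no file extension
  show_progress = FALSE
)
```

Coarser time units such as weeks or months can be easier to read when the daily counts are small or noisy.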

An [example report](https://ropensci.github.io/daiquiri/articles/example_prescriptions.html) is available from the [package website](https://ropensci.github.io/daiquiri/index.html).

More detailed guidance can be found in the [walkthrough vignette](https://ropensci.github.io/daiquiri/articles/daiquiri.html):

```{r, eval = FALSE}
vignette("daiquiri", package = "daiquiri")
```
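
If you want to inspect intermediate results, the work can also be done in stages rather than in a single `daiquiri_report()` call. The sketch below assumes that `prepare_data()`, `aggregate_data()` and `report_data()` are exported by your installed version with the arguments shown. The monthly time unit and the output file name are arbitrary choices for this example.

```{r, eval = FALSE}
library(daiquiri)

path <- system.file("extdata", "example_prescriptions.csv", package = "daiquiri")
raw_data <- read_data(path, show_progress = FALSE)

fts <- field_types(
  PrescriptionID = ft_uniqueidentifier(),
  PrescriptionDate = ft_timepoint(),
  AdmissionDate = ft_datetime(includes_time = FALSE),
  Drug = ft_freetext(),
  Dose = ft_numeric(),
  DoseUnit = ft_categorical(),
  PatientID = ft_ignore(),
  Location = ft_categorical(aggregate_by_each_category = TRUE)
)

# step 1: validate the raw values against the field types
source_data <- prepare_data(raw_data, field_types = fts, show_progress = FALSE)

# step 2: aggregate the validated data over time
aggregated_data <- aggregate_data(
  source_data,
  aggregation_timeunit = "month",
  show_progress = FALSE
)

# step 3: render the report from the two objects above
report_data(
  source_data,
  aggregated_data,
  save_directory = tempdir(),
  save_filename = "staged_report",
  show_progress = FALSE
)
```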

## How to cite this package

> Quan, T. P., (2022). daiquiri: Data Quality Reporting for Temporal Datasets. Journal of Open Source Software, 7(80), 5034, https://doi.org/10.21105/joss.05034

## Acknowledgements

This work was supported by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Healthcare Associated Infections and Antimicrobial Resistance at the University of Oxford in partnership with Public Health England (PHE) (NIHR200915), and by the NIHR Oxford Biomedical Research Centre.


## Contributing to this package

Please report any bugs or suggestions by opening a [GitHub issue](https://github.com/ropensci/daiquiri/issues).

Please note that this package is released with a [Contributor Code of Conduct](https://ropensci.org/code-of-conduct/). 
By contributing to this project, you agree to abide by its terms.

Owner

  • Name: rOpenSci
  • Login: ropensci
  • Kind: organization
  • Email: info@ropensci.org
  • Location: Berkeley, CA

JOSS Publication

daiquiri: Data Quality Reporting for Temporal Datasets
Published
December 26, 2022
Volume 7, Issue 80, Page 5034
Authors
T. Phuong Quan
University of Oxford, UK
Editor
Arfon Smith
Tags
data quality time series reproducible research initial data analysis

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "identifier": "daiquiri",
  "description": "Generate reports that enable quick visual review of temporal shifts in record-level data. Time series plots showing aggregated values are automatically created for each data field (column) depending on its contents (e.g. min/max/mean values for numeric data, no. of distinct values for categorical data), as well as overviews for missing values, non-conformant values, and duplicated rows. The resulting reports are shareable and can contribute to forming a transparent record of the entire analysis process. It is designed with Electronic Health Records in mind, but can be used for any type of record-level temporal data (i.e. tabular data where each row represents a single \"event\", one column contains the \"event date\", and other columns contain any associated values for the event).",
  "name": "daiquiri: Data Quality Reporting for Temporal Datasets",
  "relatedLink": [
    "https://ropensci.github.io/daiquiri/",
    "https://CRAN.R-project.org/package=daiquiri"
  ],
  "codeRepository": "https://github.com/ropensci/daiquiri",
  "issueTracker": "https://github.com/ropensci/daiquiri/issues",
  "license": "https://spdx.org/licenses/GPL-3.0",
  "version": "1.1.1.9000",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "url": "https://r-project.org"
  },
  "runtimePlatform": "R version 4.4.2 (2024-10-31 ucrt)",
  "provider": {
    "@id": "https://cran.r-project.org",
    "@type": "Organization",
    "name": "Comprehensive R Archive Network (CRAN)",
    "url": "https://cran.r-project.org"
  },
  "author": [
    {
      "@type": "Person",
      "givenName": [
        "T.",
        "Phuong"
      ],
      "familyName": "Quan",
      "email": "phuong.quan@ndm.ox.ac.uk",
      "@id": "https://orcid.org/0000-0001-8566-1817"
    }
  ],
  "contributor": [
    {
      "@type": "Person",
      "givenName": "Jack",
      "familyName": "Cregan"
    }
  ],
  "copyrightHolder": [
    {
      "@type": "Organization",
      "name": "University of Oxford"
    }
  ],
  "funder": [
    {
      "@type": "Organization",
      "name": "National Institute for Health Research (NIHR)"
    }
  ],
  "maintainer": [
    {
      "@type": "Person",
      "givenName": [
        "T.",
        "Phuong"
      ],
      "familyName": "Quan",
      "email": "phuong.quan@ndm.ox.ac.uk",
      "@id": "https://orcid.org/0000-0001-8566-1817"
    }
  ],
  "softwareSuggestions": [
    {
      "@type": "SoftwareApplication",
      "identifier": "covr",
      "name": "covr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=covr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "knitr",
      "name": "knitr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=knitr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "testthat",
      "name": "testthat",
      "version": ">= 3.0.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=testthat"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "codemetar",
      "name": "codemetar",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=codemetar"
    }
  ],
  "softwareRequirements": {
    "1": {
      "@type": "SoftwareApplication",
      "identifier": "data.table",
      "name": "data.table",
      "version": ">= 1.12.8",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=data.table"
    },
    "2": {
      "@type": "SoftwareApplication",
      "identifier": "readr",
      "name": "readr",
      "version": ">= 2.0.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=readr"
    },
    "3": {
      "@type": "SoftwareApplication",
      "identifier": "ggplot2",
      "name": "ggplot2",
      "version": ">= 3.1.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=ggplot2"
    },
    "4": {
      "@type": "SoftwareApplication",
      "identifier": "scales",
      "name": "scales",
      "version": ">= 1.1.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=scales"
    },
    "5": {
      "@type": "SoftwareApplication",
      "identifier": "cowplot",
      "name": "cowplot",
      "version": ">= 0.9.3",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=cowplot"
    },
    "6": {
      "@type": "SoftwareApplication",
      "identifier": "rmarkdown",
      "name": "rmarkdown",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rmarkdown"
    },
    "7": {
      "@type": "SoftwareApplication",
      "identifier": "reactable",
      "name": "reactable",
      "version": ">= 0.2.3",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=reactable"
    },
    "8": {
      "@type": "SoftwareApplication",
      "identifier": "utils",
      "name": "utils"
    },
    "9": {
      "@type": "SoftwareApplication",
      "identifier": "stats",
      "name": "stats"
    },
    "10": {
      "@type": "SoftwareApplication",
      "identifier": "xfun",
      "name": "xfun",
      "version": ">= 0.15",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=xfun"
    },
    "SystemRequirements": null
  },
  "fileSize": "1653.474KB",
  "citation": [
    {
      "@type": "ScholarlyArticle",
      "datePublished": "2022",
      "author": [
        {
          "@type": "Person",
          "givenName": [
            "T.",
            "Phuong"
          ],
          "familyName": "Quan"
        }
      ],
      "name": "daiquiri: Data Quality Reporting for Temporal Datasets",
      "identifier": "10.21105/joss.05034",
      "pagination": "5034",
      "@id": "https://doi.org/10.21105/joss.05034",
      "sameAs": "https://doi.org/10.21105/joss.05034",
      "isPartOf": {
        "@type": "PublicationIssue",
        "issueNumber": "80",
        "datePublished": "2022",
        "isPartOf": {
          "@type": [
            "PublicationVolume",
            "Periodical"
          ],
          "volumeNumber": "7",
          "name": "Journal of Open Source Software"
        }
      }
    }
  ],
  "releaseNotes": "https://github.com/ropensci/daiquiri/blob/master/NEWS.md",
  "readme": "https://github.com/ropensci/daiquiri/blob/master/README.md",
  "contIntegration": [
    "https://github.com/ropensci/daiquiri/actions/workflows/R-CMD-check.yaml",
    "https://app.codecov.io/gh/ropensci/daiquiri?branch=master"
  ],
  "developmentStatus": "https://www.repostatus.org/#active",
  "review": {
    "@type": "Review",
    "url": "https://github.com/ropensci/software-review/issues/535",
    "provider": "https://ropensci.org"
  },
  "keywords": [
    "data-quality",
    "initial-data-analysis",
    "r",
    "r-package",
    "reproducible-research",
    "rstats",
    "temporal-data",
    "time-series"
  ]
}

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 2
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 4
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 2
  • Watch event: 1
  • Issue comment event: 2
  • Push event: 4

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 298
  • Total Committers: 3
  • Avg Commits per committer: 99.333
  • Development Distribution Score (DDS): 0.017
Past Year
  • Commits: 11
  • Committers: 2
  • Avg Commits per committer: 5.5
  • Development Distribution Score (DDS): 0.091
Top Committers
Name Email Commits
Phuong Quan p****n@n****k 293
Maëlle Salmon m****n@y****e 3
Phuong Quan p****q@n****k 2

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 20
  • Total pull requests: 4
  • Average time to close issues: 3 months
  • Average time to close pull requests: 11 days
  • Total issue authors: 7
  • Total pull request authors: 2
  • Average comments per issue: 0.95
  • Average comments per pull request: 0.0
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 1
  • Average time to close issues: about 3 hours
  • Average time to close pull requests: about 2 months
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • phuongquan (11)
  • fkohrt (4)
  • teunbrand (2)
  • leppott (1)
  • eveliseb (1)
  • Analect (1)
  • louisaslett (1)
Pull Request Authors
  • phuongquan (3)
  • maelle (3)
Top Labels
Issue Labels
enhancement (6) bug (6)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 741 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 5
  • Total maintainers: 1
cran.r-project.org: daiquiri

Data Quality Reporting for Temporal Datasets

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 741 Last month
Rankings
Stargazers count: 10.3%
Forks count: 21.9%
Average: 25.4%
Downloads: 29.5%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Maintainers (1)
Last synced: 4 months ago

Dependencies

DESCRIPTION cran
  • cowplot >= 0.9.3 imports
  • data.table >= 1.12.8 imports
  • ggplot2 >= 3.1.0 imports
  • reactable >= 0.2.3 imports
  • readr >= 1.3.1 imports
  • rmarkdown * imports
  • scales >= 1.1.0 imports
  • stats * imports
  • utils * imports
  • codemetar * suggests
  • covr * suggests
  • knitr * suggests
  • testthat >= 3.0.0 suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite