butterfly

Verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged. Maintained by @thomaszwagerman

https://github.com/ropensci/butterfly

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.6%) to scientific vocabulary

Keywords

data-versioning qaqc r r-package rstats timeseries verification
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 9
  • Watchers: 1
  • Forks: 0
  • Open Issues: 4
  • Releases: 4
Created over 1 year ago · Last pushed 11 months ago
Metadata Files
Readme Changelog Contributing License Codemeta

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# butterfly


[![R-CMD-check](https://github.com/ropensci/butterfly/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/butterfly/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/thomaszwagerman/butterfly/branch/main/graph/badge.svg)](https://app.codecov.io/gh/ropensci/butterfly?branch=main)
[![Lifecycle: stable](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html#stable)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![pkgcheck](https://github.com/ropensci/butterfly/workflows/pkgcheck/badge.svg)](https://github.com/ropensci/butterfly/actions?query=workflow%3Apkgcheck)
[![Status at rOpenSci Software Peer Review](https://badges.ropensci.org/676_status.svg)](https://github.com/ropensci/software-review/issues/676)


The goal of butterfly is to aid in the verification of continually updating timeseries data, where we expect new values over time, but want to ensure previous data remains unchanged, and timesteps remain continuous. 

```{r butterfly_diagram, echo=FALSE, out.width="100%", fig.cap="An illustration of continually updating timeseries data where a previous value unexpectedly changes."}
knitr::include_graphics("man/figures/README-butterfly_diagram.png")
```


Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible.

Unnoticed changes in previous data could have unintended consequences, such as invalidating a published dataset's Digital Object Identifier (DOI), or altering future predictions if used as input in forecasting models.

Other unnoticed changes could include a jump in time or measurement frequency, due to instrument failure or software updates.

```{r timeseries_diagram, echo=FALSE, out.width="100%", fig.cap="An illustration of timeseries data not being continuous in the way it is expected to be."}
knitr::include_graphics("man/figures/README-timeseries_dark.png")
```

This package provides functionality that can be used as part of a data pipeline, to check and flag changes to previous data to prevent changes going unnoticed.

## Installation

You can install butterfly with:

``` r
install.packages("butterfly", repos = "https://ropensci.r-universe.dev")
```

## Overview
The butterfly package contains the following functions:

  * `butterfly::loupe()` - examines in detail whether previous values have changed, and returns `TRUE` if nothing has changed and `FALSE` if something has.
  * `butterfly::catch()` - returns the rows whose previous values have changed, as a dataframe.
  * `butterfly::release()` - drops the rows whose previous values have changed, and returns a dataframe containing new and unchanged rows.
  * `butterfly::create_object_list()` - returns a list of objects required by `loupe()`, `catch()` and `release()`. It contains the underlying functionality shared by all three.
  * `butterfly::timeline()` - checks whether a timeseries is continuous between timesteps.
  * `butterfly::timeline_group()` - groups distinct, but continuous, sequences of a timeseries.

There are also dummy datasets, which are fictional and exist purely to demonstrate butterfly functionality:

  * `butterflycount` - a list of monthly dataframes, which contain fictional butterfly counts for a given date.
  * `forestprecipitation` - a list of monthly dataframes, which contain fictional daily precipitation measurements for a given date.
  * `butterflymess` - a messy version of `butterflycount`, provided for testing purposes.

## Examples

This is a basic example which shows you how to use butterfly:

```{r simple_example}
library(butterfly)

# Imagine a continually updated dataset that starts in January and is updated once a month
butterflycount$january

# In February an additional row appears, all previous data remains the same
butterflycount$february

# In March an additional row appears again
# ...but a previous value has unexpectedly changed
butterflycount$march
```

We can use `butterfly::loupe()` to examine in detail whether previous values have changed.

```{r butterfly_example}
butterfly::loupe(
  butterflycount$february,
  butterflycount$january,
  datetime_variable = "time"
)

butterfly::loupe(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
```

`butterfly::loupe()` uses `dplyr::semi_join()` to match the new and old objects using a common unique identifier, which in a timeseries will be the timestep. `waldo::compare()` is then used to compare these and provide a detailed report of the differences.

`butterfly` follows the `waldo` philosophy of erring on the side of providing too much information, rather than too little. It gives detailed feedback on how the two objects compare.
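
The matching step described above can be sketched in base R. This is a minimal illustration of the idea, not butterfly's implementation: it uses `%in%` in place of `dplyr::semi_join()` and `all.equal()` in place of `waldo::compare()`, and the `old` and `new` dataframes are hypothetical.

```r
# Hypothetical data, not the packaged `butterflycount` dataset
old <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01")),
  count = c(10, 12)
)
new <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  count = c(10, 15, 20)
)

# Keep only the rows of `new` whose timestep also exists in `old`
# (a base R stand-in for a semi-join on "time")
matched <- new[new$time %in% old$time, , drop = FALSE]

# Order both objects by the key so rows line up before comparing
matched <- matched[order(matched$time), , drop = FALSE]
old_sorted <- old[order(old$time), , drop = FALSE]
rownames(matched) <- NULL
rownames(old_sorted) <- NULL

# TRUE when the previous values are unchanged, FALSE otherwise
unchanged <- isTRUE(all.equal(matched, old_sorted))
unchanged
```

Because the new March row is excluded before the comparison, only the change in the February value causes the check to fail.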

### Using butterfly for data wrangling
You might want to return changed rows as a dataframe, or drop them altogether. For this `butterfly::catch()` and `butterfly::release()` are provided.

Here, `butterfly::catch()` only returns rows which have **changed** from the previous version. It will not return new rows.

```{r butterfly_catch}
df_caught <- butterfly::catch(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)

df_caught
```

Conversely, `butterfly::release()` drops all rows which have changed from the previous version. Note that it retains new rows, as these were expected.

```{r butterfly_release}
df_released <- butterfly::release(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)

df_released
```
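
To make the difference between the two results concrete, here is a rough base R sketch of the same split. It is an assumption-laden illustration rather than how `catch()` and `release()` are implemented: the data is hypothetical, only a single `count` column is compared, and missing values are not handled.

```r
# Hypothetical data: the February value changes and a March row is added
old <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01")),
  count = c(10, 12)
)
new <- data.frame(
  time = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  count = c(10, 15, 20)
)

# Look up the old value for every timestep in `new` (NA when the row is new)
old_count <- old$count[match(new$time, old$time)]

# A row has "changed" when it existed before and its value differs
changed <- !is.na(old_count) & new$count != old_count

# Analogous to catch(): changed rows only, new rows excluded
caught <- new[changed, , drop = FALSE]

# Analogous to release(): new and unchanged rows, changed rows dropped
released <- new[!changed, , drop = FALSE]

caught
released
```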

### Checking for continuity: `timeline()`

To check if a timeseries is continuous, `timeline()` and `timeline_group()` are
provided.

```{r rain_gauge_data}
# A rain gauge which measures precipitation every day
butterfly::forestprecipitation$january

# In February there is a power failure in the instrument
butterfly::forestprecipitation$february
```

To check if a timeseries is continuous:

```{r check_continuity}
butterfly::timeline(
  forestprecipitation$january,
  datetime_variable = "time",
  expected_lag = 1
)
```

In February our imaginary rain gauge's onboard computer had a failure.

The timestamp was reset to `1970-01-01`:
 
```{r not_continuous}
forestprecipitation$february

butterfly::timeline(
  forestprecipitation$february,
  datetime_variable = "time",
  expected_lag = 1
)
```
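
The idea behind a continuity check can be sketched with `diff()` in base R. This is only an illustration under assumptions: the timestamps are hypothetical, the lag is measured in days, and `timeline()` itself may work differently.

```r
# Hypothetical daily series; the last two timestamps were reset to 1970
time <- as.Date(c(
  "2024-02-01", "2024-02-02", "2024-02-03",
  "1970-01-01", "1970-01-02"
))

# Difference in days between each timestep and the one before it
lag_days <- as.numeric(diff(time), units = "days")

# The series is continuous only if every lag matches the expected lag
expected_lag <- 1
is_continuous <- all(lag_days == expected_lag)
is_continuous
```

Here a single unexpected lag, the jump back to 1970, is enough to mark the whole series as not continuous.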

If we wanted to group chunks of our timeseries that are distinct, or broken up
in some way, but still continuous, we can use `timeline_group()`:

```{r timeline_group}
butterfly::timeline_group(
  forestprecipitation$february,
  datetime_variable = "time",
  expected_lag = 1
)
```
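
Grouping continuous chunks can be sketched in base R by starting a new group at every unexpected lag. The data and the `group_id` column name are hypothetical, and this is not necessarily how `timeline_group()` labels its output.

```r
# Hypothetical daily series; the last two timestamps were reset to 1970
time <- as.Date(c(
  "2024-02-01", "2024-02-02", "2024-02-03",
  "1970-01-01", "1970-01-02"
))
expected_lag <- 1

# Lag in days relative to the previous row; the first row has no predecessor
lag_days <- c(NA, as.numeric(diff(time), units = "days"))

# A break occurs wherever the lag differs from the expected lag
is_break <- !is.na(lag_days) & lag_days != expected_lag

# Each break starts a new group; rows within a group are continuous
group_id <- cumsum(is_break) + 1

data.frame(time, group_id)
```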

## Relevant packages and functions
The butterfly package was created for a specific use case of handling continuously updating/overwritten timeseries data, where previous values may change without notice. 

There are other R packages and functions which handle object comparison, and which may suit your specific needs better. Below we describe how they overlap with and differ from `butterfly`:

* [waldo](https://github.com/r-lib/waldo) - `butterfly` uses `waldo::compare()` in every function to report differences. There is therefore significant overlap; however, `butterfly` builds on `waldo` by comparing an object with its previous version where we expect some changes, but not others. `butterfly` also provides extra user feedback to clarify what it is and isn't comparing, because only "matched" rows are compared.
* [diffdf](https://github.com/gowerc/diffdf) - similar to `waldo`, but specifically for data frames, `diffdf` provides the ability to compare data frames directly. We could have used `diffdf::diffdf()` in our case, but we prefer `waldo`'s more explicit and clear user feedback. That said, there is significant overlap in functionality: `butterfly::loupe()` and `diffdf::diffdf_has_issues()` both provide a TRUE/FALSE difference check, while `diffdf::diffdf_issue_rows()` and `butterfly::catch()` both return the rows where changes have occurred. However, `diffdf` lacks the flexibility of `butterfly` to compare objects where we expect some changes, but not others.
* [assertr](https://github.com/tonyfischetti/assertr) - `assertr` provides assertion functionality that can be used as part of a pipeline to test assertions on a particular dataset, but it does not offer tools for comparison. We do highly recommend using `assertr` for checks prior to using `butterfly`, so that any data quality issues are caught first.
* [daiquiri](https://github.com/ropensci/daiquiri/) - `daiquiri` provides tools to check data quality and visually inspect timeseries data. It is also a quality assurance package for timeseries, but has a very different purpose to `butterfly`.

Other functions include `all.equal()` (base R) or [dplyr](https://github.com/tidyverse/dplyr)'s `setdiff()`.

## `butterfly` in production

Read more about how `butterfly` is [used in an operational data pipeline](https://docs.ropensci.org/butterfly/articles/butterfly_in_pipeline.html) to verify a continually updated **and** published dataset.

## Contributing

For full guidance on contributions, please refer to `.github/CONTRIBUTING.md`.

### Without write access
Corrections, suggestions and general improvements are welcome as issues.

You can also suggest changes by forking this repository, and opening a pull request. Please target your pull requests to the main branch.

### With write access
You can push directly to main for small fixes. Please use PRs to main for discussing larger updates.

## Code of Conduct

Please note that this package is released with a [Contributor
Code of Conduct](https://ropensci.org/code-of-conduct/). 

By contributing to this project, you agree to abide by its terms.

Owner

  • Name: rOpenSci
  • Login: ropensci
  • Kind: organization
  • Email: info@ropensci.org
  • Location: Berkeley, CA

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "identifier": "butterfly",
  "description": "Verification of continually updating time series data where we expect new values, but want to ensure previous data remains unchanged. Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible. Other unnoticed changes could include a jump in time or measurement frequency, due to instrument failure or software updates. Functionality is provided that can be used to check and flag changes to previous data to prevent changes going unnoticed, as well as unexpected jumps in time.",
  "name": "butterfly: Verification for Continually Updating Time Series Data",
  "relatedLink": "https://docs.ropensci.org/butterfly/",
  "codeRepository": "https://github.com/ropensci/butterfly/",
  "issueTracker": "https://github.com/ropensci/butterfly/issues",
  "license": "https://spdx.org/licenses/MIT",
  "version": "1.1.2",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "url": "https://r-project.org"
  },
  "runtimePlatform": "R version 4.4.3 (2025-02-28)",
  "author": [
    {
      "@type": "Person",
      "givenName": "Thomas",
      "familyName": "Zwagerman",
      "email": "thozwa@bas.ac.uk",
      "@id": "https://orcid.org/0009-0003-3742-3234"
    }
  ],
  "copyrightHolder": [
    {
      "@type": "Organization",
      "name": "British Antarctic Survey"
    }
  ],
  "maintainer": [
    {
      "@type": "Person",
      "givenName": "Thomas",
      "familyName": "Zwagerman",
      "email": "thozwa@bas.ac.uk",
      "@id": "https://orcid.org/0009-0003-3742-3234"
    }
  ],
  "softwareSuggestions": [
    {
      "@type": "SoftwareApplication",
      "identifier": "knitr",
      "name": "knitr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=knitr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "rmarkdown",
      "name": "rmarkdown",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rmarkdown"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "testthat",
      "name": "testthat",
      "version": ">= 3.0.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=testthat"
    }
  ],
  "softwareRequirements": {
    "1": {
      "@type": "SoftwareApplication",
      "identifier": "R",
      "name": "R",
      "version": ">= 4.1.0"
    },
    "2": {
      "@type": "SoftwareApplication",
      "identifier": "cli",
      "name": "cli",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=cli"
    },
    "3": {
      "@type": "SoftwareApplication",
      "identifier": "dplyr",
      "name": "dplyr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=dplyr"
    },
    "4": {
      "@type": "SoftwareApplication",
      "identifier": "lifecycle",
      "name": "lifecycle",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=lifecycle"
    },
    "5": {
      "@type": "SoftwareApplication",
      "identifier": "rlang",
      "name": "rlang",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rlang"
    },
    "6": {
      "@type": "SoftwareApplication",
      "identifier": "waldo",
      "name": "waldo",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=waldo"
    },
    "SystemRequirements": null
  },
  "fileSize": "508.847KB",
  "citation": [
    {
      "@type": "CreativeWork",
      "datePublished": "2024",
      "author": [
        {
          "@type": "Person",
          "givenName": "Thomas",
          "familyName": "Zwagerman"
        }
      ],
      "name": "{butterfly}: quality assurance of continually updating and overwritten time-series data"
    }
  ],
  "releaseNotes": "https://github.com/ropensci/butterfly/blob/master/NEWS.md",
  "readme": "https://github.com/ropensci/butterfly/blob/main/README.md",
  "contIntegration": [
    "https://github.com/ropensci/butterfly/actions/workflows/R-CMD-check.yaml",
    "https://app.codecov.io/gh/ropensci/butterfly?branch=main",
    "https://github.com/ropensci/butterfly/actions?query=workflow%3Apkgcheck"
  ],
  "developmentStatus": [
    "https://lifecycle.r-lib.org/articles/stages.html#stable",
    "https://www.repostatus.org/#active"
  ],
  "review": {
    "@type": "Review",
    "url": "https://github.com/ropensci/software-review/issues/676",
    "provider": "https://ropensci.org"
  },
  "keywords": [
    "qaqc",
    "timeseries",
    "r",
    "r-package",
    "rstats",
    "data-versioning",
    "verification"
  ]
}

GitHub Events

Total
  • Release event: 1
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 12
  • Push event: 14
  • Pull request event: 4
  • Create event: 3
Last Year
  • Release event: 1
  • Watch event: 2
  • Delete event: 2
  • Issue comment event: 12
  • Push event: 14
  • Pull request event: 4
  • Create event: 3

Packages

  • Total packages: 1
  • Total downloads:
    • cran 198 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 1
  • Total maintainers: 1
cran.r-project.org: butterfly

Verification for Continually Updating Time Series Data

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 198 Last month
Rankings
Dependent packages count: 26.8%
Dependent repos count: 33.0%
Average: 48.8%
Downloads: 86.7%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v4 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.5.0 composite
  • actions/checkout v4 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/checkout v4 composite
  • actions/upload-artifact v4 composite
  • codecov/codecov-action v4 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
DESCRIPTION cran
  • R >= 2.10 depends
  • cli * imports
  • dplyr * imports
  • lifecycle * imports
  • waldo * imports
  • knitr * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.0 suggests
.github/workflows/pkgcheck.yaml actions
  • ropensci-review-tools/pkgcheck-action main composite