datacomparer

dataCompareR is an R package that allows users to compare two datasets and view a report on the similarities and differences.

https://github.com/capitalone/datacomparer

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.4%) to scientific vocabulary

Keywords

compare-data data data-analysis data-science r
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 75
  • Watchers: 8
  • Forks: 26
  • Open Issues: 27
  • Releases: 4
Created almost 9 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Codeowners Roadmap

README.md

dataCompareR

dataCompareR is an R package that allows users to compare two datasets and view a report on the similarities and differences.

dataCompareR aims to make it easy to compare two tabular data objects in R. It’s specifically designed to show differences between two sets of data in a useful way that should make it easier to understand the differences, and if necessary, help you work out how to remedy them. In this regard, it aims to offer a more useful output than all.equal when your two datasets do not match, but isn’t intended to replace all.equal if you just want a binary test for equality.
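To make the contrast with all.equal concrete, here is a small base R sketch. The data frames `dfA` and `dfB` are made up for illustration. It shows that all.equal yields either TRUE or a character description of the differences, which is why it suits a binary equality check more than a detailed investigation.

```r
# Two illustrative data frames that differ in a single cell.
dfA <- data.frame(id = 1:3, score = c(10, 20, 30))
dfB <- data.frame(id = 1:3, score = c(10, 25, 30))

# all.equal() returns TRUE when the objects match, otherwise a
# character vector summarising what differs.
res <- all.equal(dfA, dfB)

# isTRUE() reduces that result to a plain yes/no equality test.
identicalData <- isTRUE(res)

print(identicalData)
print(res)
```

The printed description names the differing column but not the affected rows, which is the kind of detail a dataCompareR report is intended to add.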

  • rCompare() does the comparison and creates a dataCompareR object containing all the differences between the two inputted datasets. The object can be used with print and summary.
  • generateMismatchData() generates a list of two data frames, each having the missing rows from the comparison.
  • saveReport() creates a summary of the comparison that is saved into a file.
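The three functions above might be combined roughly as follows. This is a minimal sketch, not the package's canonical workflow: the data frames are invented, and the argument names (`keys`, `reportName`, `reportLocation`, `showInViewer`) reflect our reading of the package documentation and should be checked against `?rCompare` and `?saveReport`. The `requireNamespace()` guard only exists so the script does nothing when dataCompareR is not installed.

```r
# Illustrative inputs: the same four ids, with one differing value.
dfA <- data.frame(id = 1:4, value = c(1.0, 2.0, 3.0, 4.0))
dfB <- data.frame(id = 1:4, value = c(1.0, 2.5, 3.0, 4.0))

if (requireNamespace("dataCompareR", quietly = TRUE)) {
  # Compare the two data frames, matching rows on the 'id' column.
  comp <- dataCompareR::rCompare(dfA, dfB, keys = "id")

  # The comparison object works with print and summary.
  print(comp)
  print(summary(comp))

  # Extract the list of two data frames described above.
  mismatches <- dataCompareR::generateMismatchData(comp, dfA, dfB)
  print(mismatches)

  # Save a report to a temporary directory without opening a viewer.
  dataCompareR::saveReport(comp,
                           reportName = "example_comparison",
                           reportLocation = tempdir(),
                           showInViewer = FALSE)
}
```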

It’s expected that dataCompareR will be used to compare data frames, but it can be used to compare any objects that can be coerced to data frames, such as data tables, tibbles or matrices. dataCompareR cannot compare data that is not tabular (nested JSON, irregular lists, etc.), but it does handle tabular data that needs to be matched (or joined) on one or more keys (or ID columns).
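As a rough illustration of coercion and key matching, the sketch below turns a matrix into a data frame and compares it with a data frame whose rows are in a different order. All object names are invented. The base R part shows why a key is needed; the guarded part passes the same key to rCompare through its `keys` argument, which we assume behaves as documented. The matrix is coerced explicitly here only to make that step visible.

```r
# A matrix with named columns, coerced to a data frame.
m <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
            dimnames = list(NULL, c("id", "amount")))
dfFromMatrix <- as.data.frame(m)

# The same records, stored in a different row order.
dfShuffled <- data.frame(id = c(3, 1, 2), amount = c(30, 10, 20))

# Base R view of key matching: order both tables by the key column,
# reset row names, then test for equality.
sortedA <- dfFromMatrix[order(dfFromMatrix$id), ]
sortedB <- dfShuffled[order(dfShuffled$id), ]
rownames(sortedA) <- NULL
rownames(sortedB) <- NULL
sameAfterKeyMatch <- isTRUE(all.equal(sortedA, sortedB))
print(sameAfterKeyMatch)

# With dataCompareR installed, the key column is supplied directly.
if (requireNamespace("dataCompareR", quietly = TRUE)) {
  comp <- dataCompareR::rCompare(dfFromMatrix, dfShuffled, keys = "id")
  print(summary(comp))
}
```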

Getting started

Requirements

Confirmed as working on R v3.6.3 and v4.0.0 for Windows, as well as v3.6.2, v4.0.0 and the devel release for Linux. Package was built with the following dependencies, but we anticipate it will work with later versions of these packages.

| Package | Version | Source code URL |
| --- | --- | --- |
| dplyr | 0.5.0 | https://github.com/hadley/dplyr |
| knitr | 1.12.3 | https://github.com/yihui/knitr |
| stringi | 1.0-1 | https://github.com/gagolews/stringi |
| markdown | 0.7.7 | https://github.com/rstudio/markdown |
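If you want to check a local installation against the versions listed above, something like the following sketch may help. Treating the listed versions as lower bounds is our assumption, and `checkVersion` is a hypothetical helper, not part of dataCompareR.

```r
# Versions from the table above, treated here as minimums (an assumption).
minimumVersions <- c(dplyr = "0.5.0", knitr = "1.12.3",
                     stringi = "1.0-1", markdown = "0.7.7")

# Hypothetical helper: TRUE if the installed version is at least
# minVersion, FALSE if it is older, NA if the package is not installed.
checkVersion <- function(pkg, minVersion) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NA)
  }
  utils::packageVersion(pkg) >= minVersion
}

status <- vapply(names(minimumVersions),
                 function(p) checkVersion(p, minimumVersions[[p]]),
                 logical(1))
print(status)
```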

Installing the package

You can install from CRAN via:

install.packages("dataCompareR")

You can also install the latest version directly from GitHub via:

library(devtools)
install_git('https://github.com/capitalone/dataCompareR.git', branch = 'master', subdir = 'dataCompareR', type = 'source', repos = NULL, build_vignettes = TRUE)

Using dataCompareR

Please run vignette('dataCompareR') after installation to see an example of the dataCompareR workflow.

Repo Contents

The code is arranged as an R package, with the following contents:

  • dataCompareR/R
  • dataCompareR/man
  • dataCompareR/tests/testthat
  • dataCompareR/tests/performancetesting
  • dataCompareR/inst/css
  • dataCompareR/vignette

The contents will be covered below.

dataCompareR/R

The main body of R code that provides the dataCompareR functionality.

The R package format mandates a flat folder structure. Initial development used a nested structure, so to preserve it as far as possible, each file name is prefixed with a two- or three-letter code that identifies the part of the code the file belongs to. The codes and hierarchy are as follows:

  • rc - rCompare - the entry point of the function
    • pf - processFlow - handles the flow of an rCompare run
      • vd - validateData - checks the data is suitable before starting an rCompare run
      • pd - prepareData - prepares the input data for comparison
      • cd - compareData - does the comparison
    • rco - rCompare object - routines to handle the rCompare object that is generated by an rCompare run
    • out - output - code to provide various views of the output

The filenames consist of the prefix, followed by an underscore, followed by a camelCase description of what the code does. Each .R file tends to contain either one function or a small number of related functions.
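The naming convention makes it possible to recover the original hierarchy from file names alone. The sketch below illustrates this in base R. The file names are hypothetical examples that follow the convention, not a listing of the actual repository.

```r
# Hypothetical file names following the prefix_camelCase.R convention.
files <- c("rc_rCompare.R", "pf_processFlow.R", "vd_validateData.R",
           "pd_prepareData.R", "cd_compareData.R",
           "rco_printObject.R", "out_saveReport.R")

# Prefix codes and the components they identify, as listed above.
prefixLookup <- c(rc = "rCompare", pf = "processFlow",
                  vd = "validateData", pd = "prepareData",
                  cd = "compareData", rco = "rCompare object",
                  out = "output")

# The prefix is everything before the first underscore.
prefixes <- sub("_.*$", "", files)
components <- unname(prefixLookup[prefixes])

fileIndex <- data.frame(file = files, prefix = prefixes,
                        component = components,
                        stringsAsFactors = FALSE)
print(fileIndex)
```

Against a local checkout, `files` could instead come from `list.files("dataCompareR/R", pattern = "\\.R$")`.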

dataCompareR/man

Code is commented using roxygen2 headers, which are used to automatically create the required R man pages by running

devtools::document()

dataCompareR/tests/testthat

Automated tests that are run via

devtools::test()

This consists of both unit tests and some end-to-end tests that MUST pass before any code is merged to dev or main. We've added Travis integration, so this is now mandated. If your development code change breaks an existing test, then it is your responsibility to fix it!

The current unit test coverage can be found in testing.md. Please feel free to add more tests and regenerate this file using covr.

dataCompareR/tests/performancetesting

This folder contains useful, repeatable performance tests, but these are not run automatically, and the results they produce can only be interpreted manually.

CRAN Release Version History

https://cran.r-project.org/package=dataCompareR

  • Version 0.1.0 released on 2017-07-17
  • Version 0.1.1 released on 2017-11-14
  • Version 0.1.2 released on 2019-09-07
  • Version 0.1.3 released on 2020-05-01
  • Version 0.1.4 released on 2021-11-23

External Contributors

We welcome and appreciate your contributions! Before we can accept any contributions, we ask that you please be sure to sign the Contributor License Agreement (CLA).

This project adheres to the Open Source Code of Conduct. By participating, you are expected to honor this code.

Project Roadmap

The project roadmap can be found in ROADMAP.md.

Owner

  • Name: Capital One
  • Login: capitalone
  • Kind: organization
  • Email: opensource@capitalone.com
  • Location: McLean, VA

We’re an open source-first organization — actively using, contributing to and managing open source software projects.

GitHub Events

Total
  • Issues event: 10
  • Watch event: 2
  • Issue comment event: 9
  • Fork event: 1
Last Year
  • Issues event: 10
  • Watch event: 2
  • Issue comment event: 9
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 125
  • Total Committers: 9
  • Avg Commits per committer: 13.889
  • Development Distribution Score (DDS): 0.568
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Rob Noble-Eddy r****y@g****m 54
Krishan Bhasin k****b@g****m 35
Sarah Johnston s****7@g****m 21
rnobleeddy r****y 7
tmbjmu 5****u 3
Ruijing Li R****3 2
whitesource-bolt-for-github[bot] 4****] 1
sclewis23 3****3 1
Zoe Turner z****2@n****k 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 52
  • Total pull requests: 57
  • Average time to close issues: 7 months
  • Average time to close pull requests: 13 days
  • Total issue authors: 15
  • Total pull request authors: 11
  • Average comments per issue: 1.27
  • Average comments per pull request: 0.86
  • Merged pull requests: 32
  • Bot issues: 22
  • Bot pull requests: 17
Past Year
  • Issues: 4
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 4
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mend-for-github-com[bot] (12)
  • mend-bolt-for-github[bot] (10)
  • rnobleeddy (10)
  • sajohnston (8)
  • KrishanBhasin (2)
  • tmbjmu (1)
  • petermeissner (1)
  • Erebus54 (1)
  • maciejmotyka (1)
  • ConorIA (1)
  • hechth (1)
  • benjaminwnelson (1)
  • 1DanielG (1)
  • Longfei2 (1)
  • ben1787 (1)
Pull Request Authors
  • sajohnston (17)
  • mend-for-github-com[bot] (13)
  • rnobleeddy (9)
  • KrishanBhasin (6)
  • mend-bolt-for-github[bot] (4)
  • robne1982 (3)
  • tmbjmu (2)
  • sclewis23 (1)
  • maximskorik (1)
  • Lextuga007 (1)
Top Labels
Issue Labels
  • security vulnerability (11)
  • Mend: dependency security vulnerability (11)
  • enhancement (8)
  • pr_pending (7)
  • false positive (7)
  • bug (5)
  • dependency issue (4)
  • v1.1.2 (4)
  • question (2)
  • more info needed (1)
  • documentation (1)
  • help wanted (1)

Packages

  • Total packages: 2
  • Total downloads:
    • cran 503 last-month
  • Total docker downloads: 95
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 5
    (may contain duplicates)
  • Total versions: 7
  • Total maintainers: 1
cran.r-project.org: dataCompareR

Compare Two Data Frames and Summarise the Difference

  • Versions: 5
  • Dependent Packages: 0
  • Dependent Repositories: 5
  • Downloads: 503 Last month
  • Docker Downloads: 95
Rankings
Forks count: 2.9%
Stargazers count: 4.8%
Dependent repos count: 13.0%
Average: 14.2%
Docker downloads count: 17.0%
Downloads: 18.8%
Dependent packages count: 28.6%
Last synced: 6 months ago
conda-forge.org: r-datacomparer
  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Forks count: 30.3%
Stargazers count: 34.0%
Dependent repos count: 34.0%
Average: 37.4%
Dependent packages count: 51.2%
Last synced: 6 months ago

Dependencies

dataCompareR/DESCRIPTION cran
  • R >= 3.2.3 depends
  • dplyr >= 0.5.0 imports
  • knitr * imports
  • markdown * imports
  • stringi * imports
  • bit64 * suggests
  • data.table * suggests
  • rmarkdown * suggests
  • testthat * suggests
  • tibble * suggests
  • titanic * suggests
.github/workflows/pkgdown.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite
.github/workflows/testing.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/upload-artifact main composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite