errorlocate

Find and replace erroneous fields in data using validation rules

https://github.com/data-cleaning/errorlocate

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.6%) to scientific vocabulary

Keywords

data-cleaning errors invalidation r

Repository

Basic Info
Statistics
  • Stars: 22
  • Watchers: 3
  • Forks: 3
  • Open Issues: 14
  • Releases: 0
Created over 10 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

[![R build status](https://github.com/data-cleaning/errorlocate/workflows/R-CMD-check/badge.svg)](https://github.com/data-cleaning/errorlocate/actions)
[![CRAN](http://www.r-pkg.org/badges/version/errorlocate)](https://CRAN.R-project.org/package=errorlocate)
[![Downloads](http://cranlogs.r-pkg.org/badges/errorlocate)](http://www.r-pkg.org/pkg/errorlocate) 
[![status](https://tinyverse.netlify.com/badge/errorlocate)](https://CRAN.R-project.org/package=errorlocate)
[![Codecov test coverage](https://codecov.io/gh/data-cleaning/errorlocate/branch/master/graph/badge.svg)](https://codecov.io/gh/data-cleaning/errorlocate?branch=master)
[![Mentioned in Awesome Official Statistics ](https://awesome.re/mentioned-badge.svg)](http://www.awesomeofficialstatistics.org)

# Error localization

Find errors in data given a set of validation rules.
The `errorlocate` package helps to identify obvious errors in raw datasets.

It works in tandem with the package `validate`.
With `validate` you formulate the data validation rules with which the data must comply.

For example:

- "age cannot be negative": `age >= 0`.
- "if a person is married, that person must be older than 16 years": `if (married == TRUE) age > 16`.
- "profit is turnover minus cost": `profit == turnover - cost`.
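
As a minimal sketch of how these three rules might be written down and checked with `validate`: the data frame `people` and all of its values are invented for illustration.

```r
library(validate)

# Hypothetical example records; the values are made up for this sketch.
people <- data.frame(
  age      = c(34, -2, 12),
  married  = c(TRUE, FALSE, TRUE),
  profit   = c(100, 50, 10),
  turnover = c(300, 80, 40),
  cost     = c(200, 30, 20)
)

# The three example rules from the list above
rules <- validator(
  age >= 0,
  if (married == TRUE) age > 16,
  profit == turnover - cost
)

# confront() evaluates every rule on every record
checks <- confront(people, rules)

# Logical matrix: one row per record, one column per rule (TRUE = rule satisfied)
values(checks)
```

The result only says which rules a record violates. It does not say which field should be changed, which is the question `errorlocate` addresses.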

While `validate` can check whether a record is valid, it does not identify
which variables are responsible for the invalidation. This may seem a simple task,
but it is actually quite tricky: a set of validation rules forms a web
of dependent variables, so changing a value to repair a record for rule 1 may invalidate
the record for rule 2.
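
This interaction can be shown with plain R, using two of the rules and the record from the Usage section. The helper functions `rule1` and `rule2` are hypothetical and written only for this illustration; they are not part of `validate` or `errorlocate`.

```r
# Two rules written as plain R functions of a record (a named list)
rule1 <- function(r) r$profit == r$turnover - r$cost
rule2 <- function(r) r$cost >= 0.6 * r$turnover

record <- list(profit = 750, cost = 125, turnover = 200)
c(rule1 = rule1(record), rule2 = rule2(record))  # rule 1 fails, rule 2 holds

# Naive repair: adjust turnover so that rule 1 holds ...
repaired <- record
repaired$turnover <- repaired$profit + repaired$cost
c(rule1 = rule1(repaired), rule2 = rule2(repaired))  # ... but now rule 2 fails

# Adjusting profit instead satisfies both rules
repaired2 <- record
repaired2$profit <- repaired2$turnover - repaired2$cost
c(rule1 = rule1(repaired2), rule2 = rule2(repaired2))
```

Repairing the same violated rule through a different field gives a different outcome, so the choice of field cannot be made by looking at one rule at a time.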

`errorlocate` provides a small framework for record-based error detection and implements the Fellegi-Holt
algorithm. This algorithm assumes that no information is available other than the values of a record
and a set of validation rules. It minimizes the (weighted) number of values that need
to be adjusted to remove the invalidation.
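
The minimization principle can be illustrated with a toy brute-force search in base R. This is a sketch under strong assumptions and not the package's implementation: the rules are hard-coded, the `weight` vector is hypothetical, and feasibility is only tested on a coarse, arbitrarily chosen grid of candidate values.

```r
# All rules of the example, evaluated on vectors of candidate values
rules_ok <- function(profit, cost, turnover) {
  profit == turnover - cost & cost >= 0.6 * turnover & turnover >= 0 & cost >= 0
}

record <- c(profit = 750, cost = 125, turnover = 200)
weight <- c(profit = 1, cost = 1, turnover = 1)  # hypothetical weights
grid   <- seq(0, 2000, by = 25)                  # assumed candidate values

# All non-empty subsets of the fields, ordered by their total weight
fields  <- names(record)
subsets <- unlist(
  lapply(seq_along(fields), function(k) combn(fields, k, simplify = FALSE)),
  recursive = FALSE
)
total_weight <- vapply(subsets, function(s) sum(weight[s]), numeric(1))
subsets <- subsets[order(total_weight)]

# Can the fields in `free` be changed (to grid values) so that all rules hold,
# while the other fields keep their observed values?
feasible <- function(free) {
  candidates <- lapply(fields, function(f) if (f %in% free) grid else record[[f]])
  names(candidates) <- fields
  trial <- expand.grid(candidates)
  any(rules_ok(trial$profit, trial$cost, trial$turnover))
}

# The first feasible subset is a cheapest set of fields to change
solution <- Find(feasible, subsets)
solution  # "profit" for this record and these weights
```

Enumerating subsets grows exponentially with the number of fields, so this approach only works for tiny examples. The package lists `lpSolveAPI` among its imports, which suggests that it solves an optimization problem instead of enumerating.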

# Installation

`errorlocate` can be installed from CRAN:

```r
install.packages("errorlocate")
```

Beta versions can be installed with `drat`:

```r
drat::addRepo("data-cleaning")
install.packages("errorlocate")
```

The latest development version of `errorlocate` can be installed from GitHub with `devtools`:

```r
devtools::install_github("data-cleaning/errorlocate")
```

# Usage

```{r}
library(errorlocate)
rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  , cost >= 0 # implied by the two rules above
)

data <- data.frame(profit=750, cost=125, turnover=200)

data_no_error <- replace_errors(data, rules)

# faulty data was replaced with NA
print(data_no_error)

er <- errors_removed(data_no_error)

print(er)

summary(er)

er$errors
```
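
When only the location of the errors is needed, `locate_errors()` can be used without replacing any values. The sketch below also passes a `weight` argument; the weights are invented for illustration and are given as one number per column, in column order.

```r
library(errorlocate)

rules <- validator( profit == turnover - cost
                  , cost >= 0.6 * turnover
                  , turnover >= 0
)

data <- data.frame(profit = 750, cost = 125, turnover = 200)

# Default: every field has the same weight
le <- locate_errors(data, rules)
values(le)   # logical matrix, TRUE marks a field considered erroneous

# Hypothetical weights: profit is trusted three times as much as the other fields
le_weighted <- locate_errors(data, rules, weight = c(3, 1, 1))
values(le_weighted)
```

With these weights, changing `cost` and `turnover` together (total weight 2) is cheaper than changing `profit` alone (weight 3), so one would expect the weighted run to flag those two fields instead.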

Owner

  • Name: Data cleaning for statistical purpose
  • Login: data-cleaning
  • Kind: organization

Software for cleaning data

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 270
  • Total Committers: 2
  • Avg Commits per committer: 135.0
  • Development Distribution Score (DDS): 0.004
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Edwin de Jonge e****e@g****m 269
Mark van der Loo m****o@g****m 1

Issues and Pull Requests

Last synced: over 2 years ago

All Time
  • Total issues: 41
  • Total pull requests: 0
  • Average time to close issues: 5 months
  • Average time to close pull requests: N/A
  • Total issue authors: 4
  • Total pull request authors: 0
  • Average comments per issue: 1.27
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • edwindj (37)
  • markvanderloo (2)
  • smartie5 (1)
  • nickforr (1)
Top Labels
Issue Labels
enhancement (11) bug (9) question (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 413 last-month
  • Total docker downloads: 43,390
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 12
  • Total maintainers: 1
cran.r-project.org: errorlocate

Locate Errors with Validation Rules

  • Versions: 12
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 413 Last month
  • Docker Downloads: 43,390
Rankings
Stargazers count: 12.6%
Forks count: 17.8%
Average: 25.6%
Dependent packages count: 29.8%
Downloads: 32.3%
Dependent repos count: 35.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • validate * depends
  • lpSolveAPI * imports
  • methods * imports
  • parallel * imports
  • covr * suggests
  • knitr * suggests
  • rmarkdown * suggests
  • testthat >= 2.1.0 suggests
.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v3 composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/test-coverage.yaml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • r-lib/actions/setup-pandoc v1 composite
  • r-lib/actions/setup-r v1 composite