dataMaid

An R package for data screening

https://github.com/ekstroem/datamaid

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 6 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.7%) to scientific vocabulary

Keywords

data-cleaning data-screening reproducible-research
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ekstroem
  • Language: HTML
  • Default Branch: master
  • Homepage:
  • Size: 25.5 MB
Statistics
  • Stars: 143
  • Watchers: 9
  • Forks: 26
  • Open Issues: 15
  • Releases: 0
Topics
data-cleaning data-screening reproducible-research
Created over 9 years ago · Last pushed 11 months ago

README.md

dataMaid

dataMaid is an R package for documenting and creating reports on data cleanliness.

dataMaid has become dataReporter

dataMaid has been renamed to dataReporter and is no longer maintained. All future updates and development will go into dataReporter. Install the new package from CRAN like this:

```{r}
install.packages("dataReporter")
```

or install the development version from GitHub:

```{r}
devtools::install_github("ekstroem/dataReporter")
```

*Please report bugs at our new repository.*

Installation

This GitHub page contains the development version of dataMaid. For the latest stable version, install the package directly from CRAN:

```{r}
install.packages("dataMaid")
```

To install the development version of dataMaid, run the following command from within R (this requires that the devtools package is already installed):

```{r}
devtools::install_github("ekstroem/dataMaid")
```

Package overview

A super simple way to get started is to load the package and use the makeDataReport() function on a data frame (if you try to generate several reports for the same data, then it may be necessary to add the replace=TRUE argument to overwrite the existing report).

```{r}
library("dataMaid")
data(trees)
makeDataReport(trees)
```

This will create a report with summaries and error checks for each variable in the trees data frame. The format of the report depends on your OS and on whether you have a LaTeX installation on your computer, which is needed for creating PDF reports.
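If you prefer not to rely on the default choice of format, the format can be requested explicitly. The following is a minimal sketch, assuming the `output`, `replace` and `openResult` arguments of makeDataReport() behave as described on the package help page. The `requireNamespace()` guard is added here only so the snippet does nothing when dataMaid is not installed.

```r
# Guard (an addition for this sketch): skip everything if dataMaid is missing.
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library("dataMaid")
  data(trees)

  # Ask for an HTML report, which does not need LaTeX, and overwrite
  # any report left over from an earlier run on the same data.
  makeDataReport(trees,
                 output = "html",
                 replace = TRUE,
                 openResult = FALSE)  # do not open the finished report
}
```

The report files are written to the current working directory, so run this from a folder where you are happy to have them created.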

Using dataMaid interactively

The dataMaid package can also be used interactively by running checks for individual variables or for all variables in the dataset:

```{r}
data(toyData)
check(toyData$events)  # Individual check of events
check(toyData)         # Check all variables at once
```

By default the standard battery of tests is run depending on the variable type. If we just want a specific test for, say, a numeric variable then we can specify that. All available checks can be viewed by calling allCheckFunctions(). See the documentation for an overview of the checks available or how to create and include your own tests.

```{r}
check(toyData$events, checks = setChecks(numeric = "identifyMissing"))
```
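Several checks can be combined in the same way. The sketch below first lists the available check functions and then runs two of them on a numeric variable. It assumes that `identifyOutliers` is among the checks reported by allCheckFunctions() in your installed version. The variable name `numeric_checks` and the `requireNamespace()` guard are additions for this illustration.

```r
# Names of the checks to run on numeric variables (illustrative choice).
numeric_checks <- c("identifyMissing", "identifyOutliers")

# Guard (an addition for this sketch): skip if dataMaid is not installed.
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library("dataMaid")
  data(toyData)

  # Overview of all check functions known to dataMaid.
  print(allCheckFunctions())

  # Run only the chosen checks instead of the default battery.
  print(check(toyData$events,
              checks = setChecks(numeric = numeric_checks)))
}
```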

We can also access the graphics or summary tables that are produced for a variable by calling the visualize or summarize functions. One can visualize a single variable or a full dataset:

```{r}
# Visualize a variable
visualize(toyData$events)

# Visualize a dataset
visualize(toyData)
```

The same is true for summaries. Note also that the choice of checks, visualizations and summaries is customizable:

```{r}
# Summarize a variable with default settings:
summarize(toyData$events)

# Summarize a variable with user-specified settings:
summarize(toyData$events,
          summaries = setSummaries(all = c("centralValue", "minMax")))
```
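Visualizations can be customized along the same lines. A minimal sketch, assuming that `basicVisual` is one of the visual functions listed by allVisualFunctions() in your installed version and that visualize() accepts a `visuals` argument built with setVisuals(). The variable name `visual_name` and the `requireNamespace()` guard are additions for this illustration.

```r
# Name of the visual function to use for all variable types (illustrative).
visual_name <- "basicVisual"

# Guard (an addition for this sketch): skip if dataMaid is not installed.
if (requireNamespace("dataMaid", quietly = TRUE)) {
  library("dataMaid")
  data(toyData)

  # Overview of the available visual functions.
  print(allVisualFunctions())

  # Plot one variable with the chosen visual function.
  visualize(toyData$events, visuals = setVisuals(all = visual_name))
}
```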

Detailed documentation

You can read the main paper accompanying the package at the Journal of Statistical Software. It provides a detailed introduction to the dataMaid package.

We also have two blog posts that provide an introduction to the package. They can be found here (the primary one) and here.

Moreover, we have created a vignette that describes how to extend dataMaid to include user-defined data screening checks, summaries and visualizations. This vignette is called extending_dataMaid:

```{r}
vignette("extending_dataMaid")
```

Online app

We are currently working on an online version of the tool, where users can upload their data and get a report. A prototype is already up and running - we just need to configure the R server correctly.

Until we have set it up online, you can try it out on your own machine:

```{r}
library(shiny)
runUrl("https://github.com/ekstroem/dataMaid/raw/master/app/app.zip")
```

Owner

  • Name: Claus Ekstrøm
  • Login: ekstroem
  • Kind: user
  • Location: Copenhagen, Denmark
  • Company: University of Copenhagen

Statistician, scientist, researcher, R tinkerer

GitHub Events

Total
  • Push event: 1
Last Year
  • Push event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 427
  • Total Committers: 6
  • Avg Commits per committer: 71.167
  • Development Distribution Score (DDS): 0.485
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| ekstroem | g****b@e****m | 220 |
| annepetersen1 | a****e@s****k | 194 |
| Anne Helby Petersen | z****9@s****k | 8 |
| Anne Petersen | A****n | 3 |
| Nina Jakobsen | n****n | 1 |
| Carl Frederick | c****k@d****v | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 59
  • Total pull requests: 2
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 12 hours
  • Total issue authors: 36
  • Total pull request authors: 2
  • Average comments per issue: 2.47
  • Average comments per pull request: 0.5
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • annennenne (9)
  • richierocks (8)
  • ekstroem (4)
  • b1azk0 (2)
  • aalexandersson (2)
  • Jaeoc (2)
  • 1DanielG (2)
  • eribul (2)
  • goodfr (1)
  • tdemarchin (1)
  • sebastien-foulle (1)
  • carrollrm (1)
  • Ales-G (1)
  • WeeBeasties (1)
  • jclchan (1)
Pull Request Authors
  • carlbfrederick (1)
  • nmjakobsen (1)
Top Labels
Issue Labels
  • enhancement (17)
  • bug (9)
  • Suggestion (2)
  • question (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 1,247 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 2
  • Total versions: 11
  • Total maintainers: 1
cran.r-project.org: dataMaid

A Suite of Checks for Identification of Potential Errors in a Data Frame as Part of the Data Screening Process

  • Versions: 11
  • Dependent Packages: 1
  • Dependent Repositories: 2
  • Downloads: 1,247 Last month
Rankings
Stargazers count: 2.9%
Forks count: 3.0%
Average: 11.4%
Downloads: 13.8%
Dependent packages count: 17.7%
Dependent repos count: 19.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • ggplot2 * imports
  • gridExtra * imports
  • haven * imports
  • htmltools * imports
  • magrittr * imports
  • methods * imports
  • pander * imports
  • rmarkdown >= 1.10 imports
  • robustbase >= 0.93 imports
  • stringi * imports
  • whoami * imports
  • knitr * suggests
  • testthat * suggests