datawizard

datawizard: An R Package for Easy Data Preparation and Statistical Transformations - Published in JOSS (2022)

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
✓ Academic publication links: Links to joss.theoj.org
○ Committers with academic emails
○ Institutional organization owner
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

data dplyr hacktoberfest janitor manipulation r-package reshape rstats tidyr wrangling

Keywords from Contributors

standardization correlation predict easystats gaussian-graphical-models bayes-factors bayesian-correlations biserial cor correlation-analysis

Scientific Fields

Engineering Computer Science - 80% confidence

Last synced: 6 months ago

Repository

Magic potions to clean and transform your data 🧙

Basic Info

Host: GitHub
Owner: easystats
License: other
Language: R
Default Branch: main
Homepage: https://easystats.github.io/datawizard/
Size: 91.3 MB

Statistics

Stars: 230
Watchers: 8
Forks: 16
Open Issues: 33
Releases: 33

Topics

data dplyr hacktoberfest janitor manipulation r-package reshape rstats tidyr wrangling

Created over 4 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog Contributing Funding License Code of conduct Support

README.Rmd

---
output: github_document
---

# `datawizard`: Easy Data Wrangling and Statistical Transformations 

```{r, echo=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  dpi = 300,
  out.width = "100%",
  fig.path = "man/figures/",
  comment = "#>"
)

set.seed(333)
library(datawizard)
```

[![DOI](https://joss.theoj.org/papers/10.21105/joss.04684/status.svg)](https://doi.org/10.21105/joss.04684)
[![downloads](https://cranlogs.r-pkg.org/badges/datawizard)](https://cran.r-project.org/package=datawizard)
[![total](https://cranlogs.r-pkg.org/badges/grand-total/datawizard)](https://cranlogs.r-pkg.org/)









`{datawizard}` is a lightweight package to easily manipulate, clean, transform, and prepare your data for analysis. It is part of the [easystats ecosystem](https://easystats.github.io/easystats/), a suite of R packages to deal with your entire statistical analysis, from cleaning the data to reporting the results.

It covers two aspects of data preparation:

- **Data manipulation**: `{datawizard}` offers a set of functions very similar to those of the *tidyverse* packages, such as `{dplyr}` and `{tidyr}`, to select, filter and reshape data, with a few key differences. 1) Most data manipulation functions start with the prefix `data_` (which makes them easy to identify). 2) Although most functions can be used exactly as their *tidyverse* equivalents, they are also string-friendly (which makes them easy to program with and use inside functions). 3) `{datawizard}` is very lightweight (its only imports are `{insight}` and base R's `stats` and `utils`; compare [poorman](https://github.com/nathaneastwood/poorman)), which makes it well suited for developers to use in their own packages.

- **Statistical transformations**: `{datawizard}` also has powerful functions to easily apply common data [transformations](https://easystats.github.io/datawizard/reference/index.html#statistical-transformations), including standardization, normalization, rescaling, rank-transformation, scale reversing, recoding, binning, etc.
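
To illustrate the string-friendly design mentioned above, here is a minimal sketch of a wrapper that receives column names as a character vector. The helper name `keep_columns()` is hypothetical and not part of the package; the sketch assumes that `data_select()` accepts a character vector in its `select` argument.

```r
library(datawizard)

# Hypothetical helper (not part of datawizard): keep only the columns in `cols`.
keep_columns <- function(data, cols) {
  # `cols` is a plain character vector, so no tidy-evaluation tricks are needed
  data_select(data, select = cols)
}

wanted <- c("Sepal.Length", "Species")
head(keep_columns(iris, wanted))
```

Because the column names are ordinary strings, they can come from a configuration file or a function argument without any special quoting.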











# Installation

[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/datawizard)](https://cran.r-project.org/package=datawizard) [![datawizard status badge](https://easystats.r-universe.dev/badges/datawizard)](https://easystats.r-universe.dev) [![codecov](https://codecov.io/gh/easystats/datawizard/branch/main/graph/badge.svg)](https://app.codecov.io/gh/easystats/datawizard) [![R-CMD-check](https://github.com/easystats/datawizard/workflows/R-CMD-check/badge.svg?branch=main)](https://github.com/easystats/datawizard/actions)

Type | Source | Command
---|---|---
Release | CRAN | `install.packages("datawizard")`
Development | r-universe | `install.packages("datawizard", repos = "https://easystats.r-universe.dev")`
Development | GitHub | `remotes::install_github("easystats/datawizard")`

> **Tip**
>
> **Instead of `library(datawizard)`, use `library(easystats)`.**
> **This will make all features of the easystats ecosystem available.**
>
> **To stay updated, use `easystats::install_latest()`.**

# Citation

To cite the package, run the following command:

```{r, comment=""}
citation("datawizard")
```

# Features

[![Documentation](https://img.shields.io/badge/documentation-datawizard-orange.svg?colorB=E91E63)](https://easystats.github.io/datawizard/)
[![Blog](https://img.shields.io/badge/blog-easystats-orange.svg?colorB=FF9800)](https://easystats.github.io/blog/posts/)
[![Features](https://img.shields.io/badge/features-datawizard-orange.svg?colorB=2196F3)](https://easystats.github.io/datawizard/reference/index.html)

Most courses and tutorials about statistical modeling assume that you are working with a clean and tidy dataset. In practice, however, a major part of doing statistical modeling is preparing your data: cleaning up values, creating new columns, reshaping the dataset, or transforming some variables. `{datawizard}` provides easy-to-use tools to perform these common, critical, and sometimes tedious data preparation tasks.

## Data wrangling

### Select, filter and remove variables

The package provides helpers to filter rows meeting certain conditions...

```{r}
data_match(mtcars, data.frame(vs = 0, am = 1))
```

... or logical expressions:

```{r}
data_filter(mtcars, vs == 0 & am == 1)
```

Finding columns in a data frame, or retrieving the data of selected columns, can be achieved using `extract_column_names()` or `data_select()`:

```{r}
# find column names matching a pattern
extract_column_names(iris, starts_with("Sepal"))

# return data columns matching a pattern
data_select(iris, starts_with("Sepal")) |> head()
```

It is also possible to extract one or more variables:

```{r}
# single variable
data_extract(mtcars, "gear")

# more variables
head(data_extract(iris, ends_with("Width")))
```

Due to the consistent API, removing variables is just as simple:

```{r}
head(data_remove(iris, starts_with("Sepal")))
```

### Reorder or rename

```{r}
head(data_relocate(iris, select = "Species", before = "Sepal.Length"))
```

```{r}
head(data_rename(iris, c("Sepal.Length", "Sepal.Width"), c("length", "width")))
```

### Merge

```{r}
x <- data.frame(a = 1:3, b = c("a", "b", "c"), c = 5:7, id = 1:3)
y <- data.frame(c = 6:8, d = c("f", "g", "h"), e = 100:102, id = 2:4)

x
y

data_merge(x, y, join = "full")

data_merge(x, y, join = "left")

data_merge(x, y, join = "right")

data_merge(x, y, join = "semi", by = "c")

data_merge(x, y, join = "anti", by = "c")

data_merge(x, y, join = "inner")

data_merge(x, y, join = "bind")
```

### Reshape

A common data wrangling task is to reshape data, either from wide/Cartesian to long/tidy format...

```{r}
wide_data <- data.frame(replicate(5, rnorm(10)))

head(data_to_long(wide_data))
```

... or the other way around:

```{r}
long_data <- data_to_long(wide_data, rows_to = "Row_ID") # Save row number

data_to_wide(long_data,
  names_from = "name",
  values_from = "value",
  id_cols = "Row_ID"
)
```

### Empty rows and columns

```{r}
tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)

tmp

# indices of empty columns or rows
empty_columns(tmp)
empty_rows(tmp)

# remove empty columns or rows
remove_empty_columns(tmp)
remove_empty_rows(tmp)

# remove empty columns and rows
remove_empty(tmp)
```
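
As a rough illustration of what "empty" means here, the following base R sketch flags columns and rows that contain only `NA`. It is not the package's implementation, and the package functions may apply additional rules; the object names are chosen for illustration only.

```r
tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)

# logical matrix: TRUE where a value is present
not_missing <- !is.na(tmp)

# a column (or row) is empty when it has no non-missing values at all
empty_cols <- which(colSums(not_missing) == 0)
empty_rows <- which(rowSums(not_missing) == 0)

empty_cols
empty_rows

# drop both in one step
tmp[-empty_rows, -empty_cols]
```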

### Recode or cut dataframe

```{r}
set.seed(123)
x <- sample(1:10, size = 50, replace = TRUE)

table(x)

# cut into 3 groups, based on distribution (quantiles)
table(categorize(x, split = "quantile", n_groups = 3))
```
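
For intuition, a quantile-based split can be sketched in base R with `quantile()` and `cut()`. This is only an approximation of the idea behind `split = "quantile"`; the package may handle ties and boundary values differently.

```r
x <- 1:9

# cut points at the 0%, 33.3%, 66.7% and 100% quantiles
breaks <- quantile(x, probs = c(0, 1 / 3, 2 / 3, 1))

# assign each value to one of three groups (labels = FALSE returns 1, 2, 3)
groups <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

table(groups)
```

With evenly spread values such as `1:9`, each of the three groups receives the same number of observations.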

## Data Transformations

The package also contains multiple functions to help transform data.

### Standardize

For example, to standardize (*z*-score) data:

```{r}
# before
summary(swiss)

# after
summary(standardize(swiss))
```
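
As a quick reminder of what a *z*-score is, the base R sketch below computes it by hand. It assumes the usual non-robust definition (subtract the mean, divide by the sample standard deviation) and does not call the package.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# base R's scale() computes the same quantity (returned as a one-column matrix)
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)

# the standardized values have mean 0 and standard deviation 1
round(c(mean = mean(z_manual), sd = sd(z_manual)), 10)
```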

### Winsorize

To winsorize data:

```{r}
# before
anscombe

# after
winsorize(anscombe)
```
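
The general idea of winsorizing can be sketched in base R as clamping values to a pair of quantiles. The helper `winsorize_sketch()` and its parameter `p` are hypothetical; this is one common percentile-based definition, and the rule used by `winsorize()` may differ in detail.

```r
# Hypothetical helper, not the package's implementation:
# values outside the [p, 1 - p] quantile range are pulled back to those quantiles.
winsorize_sketch <- function(x, p = 0.2) {
  limits <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
  pmin(pmax(x, limits[[1]]), limits[[2]])
}

x <- c(1, 2, 3, 4, 100)
out <- winsorize_sketch(x)

# the extreme value 100 is pulled in; the middle values are untouched
out
```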

### Center

To grand-mean center data:

```{r}
center(anscombe)
```

### Ranktransform

To rank-transform data:

```{r}
# before
head(trees)

# after
head(ranktransform(trees))
```
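
Rank transformation replaces each value by its position in the sorted data. The base R sketch below shows how ties are handled when tied values share the average of their ranks; it is assumed, not confirmed here, that this matches the default behaviour of `ranktransform()`.

```r
x <- c(10, 30, 20, 20)

# the two tied values occupy ranks 2 and 3, so each receives 2.5
r <- rank(x, ties.method = "average")
r
#> [1] 1.0 4.0 2.5 2.5
```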

### Rescale

To rescale a numeric variable to a new range:

```{r}
change_scale(c(0, 1, 5, -5, -2))
```
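
The arithmetic behind rescaling is a linear map from the observed range to a target range. The sketch below uses a hypothetical helper `rescale_sketch()` and assumes a target range of 0 to 100; it is written in base R and is not the package's code.

```r
# Hypothetical helper: map the observed range of `x` linearly onto `to`.
rescale_sketch <- function(x, to = c(0, 100)) {
  from <- range(x, na.rm = TRUE)
  (x - from[1]) / (from[2] - from[1]) * (to[2] - to[1]) + to[1]
}

res <- rescale_sketch(c(0, 1, 5, -5, -2))
res
#> [1]  50  60 100   0  30
```

The minimum (-5) maps to 0, the maximum (5) maps to 100, and every other value keeps its relative position in between.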

### Rotate or transpose

```{r}
x <- mtcars[1:3, 1:4]

x

data_rotate(x)
```


## Data properties

`{datawizard}` can produce a comprehensive descriptive summary of all variables in a data frame:

```{r}
data(iris)
describe_distribution(iris)
```

Or of just a single variable:

```{r}
describe_distribution(mtcars$wt)
```

There are also some additional data properties that can be computed using this package.

```{r}
x <- (-10:10)^3 + rnorm(21, 0, 100)
smoothness(x, method = "diff")
```

## Function design and pipe-workflow

The design of the `{datawizard}` functions follows a consistent principle that makes it easy for users to understand and remember how functions work:

1. the first argument is the data
2. for methods that work on data frames, the next two arguments are `select` and `exclude`, which choose the variables to include or leave out
3. the remaining arguments relate to the specific task of the function

Most importantly, functions that accept data frames usually take the data frame as their first argument and also return a (modified) data frame. Thus, `{datawizard}` integrates smoothly into a "pipe-workflow".

```{r}
iris |>
  # all rows where Species is "versicolor" or "virginica"
  data_filter(Species %in% c("versicolor", "virginica")) |>
  # select only columns with "." in names (i.e. drop Species)
  data_select(contains("\\.")) |>
  # move columns that end with "Length" to start of data frame
  data_relocate(ends_with("Length")) |>
  # remove fourth column
  data_remove(4) |>
  head()
```
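
The `select` and `exclude` arguments described in the design principles can be sketched with `standardize()`. The example assumes that the data frame method leaves unselected columns unchanged; the object names `z1` and `z2` are chosen for illustration.

```r
library(datawizard)

# standardize only the two sepal columns; other columns are returned unchanged
z1 <- standardize(iris, select = c("Sepal.Length", "Sepal.Width"))

# standardize all numeric columns except Petal.Width
z2 <- standardize(iris, exclude = "Petal.Width")

summary(z1$Sepal.Length) # centered around 0
summary(z2$Petal.Width)  # still on the original scale
```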

# Contributing and Support

If you want to file an issue or contribute to the package in another way, please follow [this guide](https://easystats.github.io/datawizard/CONTRIBUTING.html). For questions about the functionality, you may either contact us via email or file an issue.

# Code of Conduct

Please note that this project is released with a
[Contributor Code of Conduct](https://easystats.github.io/datawizard/CODE_OF_CONDUCT.html). By participating in this project you agree to abide by its terms.

Owner

Name: easystats
Login: easystats
Kind: organization
Location: worldwide

Website: https://easystats.github.io/easystats/
Twitter: easystats4u
Repositories: 19
Profile: https://github.com/easystats

Make R stats easy!

JOSS Publication

datawizard: An R Package for Easy Data Preparation and Statistical Transformations

Published

October 09, 2022

DOI

10.21105/joss.04684

Volume 7, Issue 78, Page 4684

Authors

Indrajeet Patil

cynkra Analytics GmbH, Germany

Dominique Makowski

Nanyang Technological University, Singapore

Mattan S. Ben-Shachar

Ben-Gurion University of the Negev, Israel

Brenton M. Wiernik

Independent Researcher

Etienne Bacher

Luxembourg Institute of Socio-Economic Research (LISER), Luxembourg

Daniel Lüdecke

University Medical Center Hamburg-Eppendorf, Germany

Editor

Øystein Sørensen

GitHub Events

Total

Create event: 55
Commit comment event: 1
Release event: 4
Issues event: 43
Watch event: 14
Delete event: 48
Issue comment event: 288
Push event: 488
Pull request review event: 165
Pull request review comment event: 156
Pull request event: 99

Last Year

Create event: 55
Commit comment event: 1
Release event: 4
Issues event: 43
Watch event: 14
Delete event: 48
Issue comment event: 289
Push event: 492
Pull request review event: 167
Pull request review comment event: 157
Pull request event: 100

Name	Email	Commits
Daniel	m**l@d**e	619
Indrajeet Patil	p**e@g**m	334
Etienne Bacher	5****r	199
Mattan S. Ben-Shachar	m**b@m**o	26
Dominique Makowski	d**9@g**m	24
etiennebacher	y**u@e**m	15
github-actions[bot]	4****]	13
Brenton M. Wiernik	b****k	12
Rémi Thériault	1****c	6

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 113
Total pull requests: 311
Average time to close issues: 3 months
Average time to close pull requests: 15 days
Total issue authors: 17
Total pull request authors: 7
Average comments per issue: 4.18
Average comments per pull request: 3.09
Merged pull requests: 264
Bot issues: 0
Bot pull requests: 25

Past Year

Issues: 34
Pull requests: 127
Average time to close issues: 12 days
Average time to close pull requests: 3 days
Issue authors: 9
Pull request authors: 5
Average comments per issue: 3.35
Average comments per pull request: 2.2
Merged pull requests: 103
Bot issues: 0
Bot pull requests: 13


Top Authors

Issue Authors

etiennebacher (25)
IndrajeetPatil (24)
strengejacke (20)
mattansb (11)
DominiqueMakowski (11)
jmgirard (6)
rempsyc (4)
bwiernik (2)
profandyfield (2)
chuxinyuan (1)
Cal-Fang (1)
Cghlewis (1)
albaperis (1)
lewislehe (1)
BalbR (1)

Pull Request Authors

strengejacke (170)
etiennebacher (88)
github-actions[bot] (25)
IndrajeetPatil (16)
mattansb (6)
DominiqueMakowski (4)
rempsyc (2)

Top Labels

Issue Labels

enhancement :boom: (8) bug 🪲 (8) upkeep :broom: (6) Feature idea :fire: (4) feature idea :fire: (3) consistency 🍎🍏 (3) docs 📚 (3) Upkeep :broom: (3) Bug :bug: (3) breaking :skull_and_crossbones: (2) Enhancement :boom: (2) question (1) Docs 📚 (1) Discussion :parrot: (1) Consistency :green_apple: :apple: (1) High priority :running_man: (1) invalid (1) high priority :running_man: (1)

Pull Request Labels

Auto-update (17) auto-update (8) docs 📚 (1)

Packages

Total packages: 2
Total downloads:
- cran 115,922 last-month
Total docker downloads: 48,992

Total dependent packages: 24
(may contain duplicates)
Total dependent repositories: 42
(may contain duplicates)
Total versions: 49
Total maintainers: 1

cran.r-project.org: datawizard

Easy Data Wrangling and Statistical Transformations

Homepage: https://easystats.github.io/datawizard/
Documentation: http://cran.r-project.org/web/packages/datawizard/datawizard.pdf
License: MIT + file LICENSE
Latest release: 1.2.0
published 7 months ago

Versions: 34
Dependent Packages: 18
Dependent Repositories: 41
Downloads: 115,922 Last month
Docker Downloads: 48,992

Rankings

Downloads: 1.3%

Stargazers count: 2.5%

Dependent packages count: 3.7%

Dependent repos count: 4.0%

Average: 6.4%

Forks count: 7.0%

Docker downloads count: 19.8%

Maintainers (1)

etienne.bacher@protonmail.com

Last synced: 6 months ago

conda-forge.org: r-datawizard

Homepage: https://easystats.github.io/datawizard/
License: GPL-3.0-only
Latest release: 0.6.3
published over 3 years ago

Versions: 15
Dependent Packages: 6
Dependent Repositories: 1

Rankings

Dependent packages count: 9.0%

Dependent repos count: 24.4%

Average: 26.9%

Stargazers count: 29.3%

Forks count: 44.9%

Last synced: 6 months ago

Dependencies

DESCRIPTION cran

R >= 3.6 depends
insight >= 0.18.8 imports
stats * imports
utils * imports
bayestestR * suggests
boot * suggests
brms * suggests
data.table * suggests
dplyr >= 1.0 suggests
effectsize * suggests
gamm4 * suggests
ggplot2 * suggests
gt * suggests
haven * suggests
htmltools * suggests
httr * suggests
knitr * suggests
lme4 * suggests
mediation * suggests
parameters * suggests
poorman >= 0.2.6 suggests
psych * suggests
readr * suggests
readxl * suggests
rio * suggests
rmarkdown * suggests
rstanarm * suggests
see * suggests
testthat >= 3.1.0 suggests
tidyr * suggests
withr * suggests

.github/workflows/R-CMD-check-devel-easystats.yaml actions

.github/workflows/R-CMD-check-hard.yaml actions

.github/workflows/R-CMD-check-strict.yaml actions

.github/workflows/R-CMD-check.yaml actions

.github/workflows/check-all-examples.yaml actions

.github/workflows/check-link-rot.yaml actions

.github/workflows/check-random-test-order.yaml actions

.github/workflows/check-readme.yaml actions

.github/workflows/check-spelling.yaml actions

.github/workflows/check-styling.yaml actions

.github/workflows/check-test-warnings.yaml actions

.github/workflows/check-vignette-warnings.yaml actions

.github/workflows/html-5-check.yaml actions

.github/workflows/lint-changed-files.yaml actions

.github/workflows/lint.yaml actions

.github/workflows/pkgdown-no-suggests.yaml actions

.github/workflows/pkgdown.yaml actions

.github/workflows/revdepcheck.yaml actions

.github/workflows/test-coverage-examples.yaml actions

.github/workflows/test-coverage.yaml actions

.github/workflows/update-to-latest-easystats.yaml actions
