datawizard
datawizard: An R Package for Easy Data Preparation and Statistical Transformations - Published in JOSS (2022)
Science Score: 93.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 4 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to joss.theoj.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Keywords
data
dplyr
hacktoberfest
janitor
manipulation
r-package
reshape
rstats
tidyr
wrangling
Keywords from Contributors
standardization
correlation
predict
easystats
gaussian-graphical-models
bayes-factors
bayesian-correlations
biserial
cor
correlation-analysis
Scientific Fields
Engineering: Computer Science (80% confidence)
Last synced: 4 months ago
Repository
Magic potions to clean and transform your data 🧙
Basic Info
- Host: GitHub
- Owner: easystats
- License: other
- Language: R
- Default Branch: main
- Homepage: https://easystats.github.io/datawizard/
- Size: 91.3 MB
Statistics
- Stars: 230
- Watchers: 8
- Forks: 16
- Open Issues: 33
- Releases: 33
Created over 4 years ago
· Last pushed 4 months ago
Metadata Files
Readme
Changelog
Contributing
Funding
License
Code of conduct
Support
Owner
- Name: easystats
- Login: easystats
- Kind: organization
- Location: worldwide
- Website: https://easystats.github.io/easystats/
- Twitter: easystats4u
- Repositories: 19
- Profile: https://github.com/easystats
Make R stats easy!
JOSS Publication
datawizard: An R Package for Easy Data Preparation and Statistical Transformations
Published
October 09, 2022
Volume 7, Issue 78, Page 4684
Authors
Tags
easystats
GitHub Events
Total
- Create event: 55
- Commit comment event: 1
- Release event: 4
- Issues event: 43
- Watch event: 14
- Delete event: 48
- Issue comment event: 288
- Push event: 488
- Pull request review event: 165
- Pull request review comment event: 156
- Pull request event: 99
Last Year
- Create event: 55
- Commit comment event: 1
- Release event: 4
- Issues event: 43
- Watch event: 14
- Delete event: 48
- Issue comment event: 289
- Push event: 492
- Pull request review event: 167
- Pull request review comment event: 157
- Pull request event: 100
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Daniel | m****l@d****e | 619 |
| Indrajeet Patil | p****e@g****m | 334 |
| Etienne Bacher | 5****r | 199 |
| Mattan S. Ben-Shachar | m****b@m****o | 26 |
| Dominique Makowski | d****9@g****m | 24 |
| etiennebacher | y****u@e****m | 15 |
| github-actions[bot] | 4****] | 13 |
| Brenton M. Wiernik | b****k | 12 |
| Rémi Thériault | 1****c | 6 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 113
- Total pull requests: 311
- Average time to close issues: 3 months
- Average time to close pull requests: 15 days
- Total issue authors: 17
- Total pull request authors: 7
- Average comments per issue: 4.18
- Average comments per pull request: 3.09
- Merged pull requests: 264
- Bot issues: 0
- Bot pull requests: 25
Past Year
- Issues: 34
- Pull requests: 127
- Average time to close issues: 12 days
- Average time to close pull requests: 3 days
- Issue authors: 9
- Pull request authors: 5
- Average comments per issue: 3.35
- Average comments per pull request: 2.2
- Merged pull requests: 103
- Bot issues: 0
- Bot pull requests: 13
Top Authors
Issue Authors
- etiennebacher (25)
- IndrajeetPatil (24)
- strengejacke (20)
- mattansb (11)
- DominiqueMakowski (11)
- jmgirard (6)
- rempsyc (4)
- bwiernik (2)
- profandyfield (2)
- chuxinyuan (1)
- Cal-Fang (1)
- Cghlewis (1)
- albaperis (1)
- lewislehe (1)
- BalbR (1)
Pull Request Authors
- strengejacke (170)
- etiennebacher (88)
- github-actions[bot] (25)
- IndrajeetPatil (16)
- mattansb (6)
- DominiqueMakowski (4)
- rempsyc (2)
Top Labels
Issue Labels
enhancement :boom: (8)
bug 🪲 (8)
upkeep :broom: (6)
Feature idea :fire: (4)
feature idea :fire: (3)
consistency 🍎🍏 (3)
docs 📚 (3)
Upkeep :broom: (3)
Bug :bug: (3)
breaking :skull_and_crossbones: (2)
Enhancement :boom: (2)
question (1)
Docs 📚 (1)
Discussion :parrot: (1)
Consistency :green_apple: :apple: (1)
High priority :running_man: (1)
invalid (1)
high priority :running_man: (1)
Pull Request Labels
Auto-update (17)
auto-update (8)
docs 📚 (1)
Packages
- Total packages: 2
- Total downloads:
  - cran: 115,922 last-month
- Total docker downloads: 48,992
- Total dependent packages: 24 (may contain duplicates)
- Total dependent repositories: 42 (may contain duplicates)
- Total versions: 49
- Total maintainers: 1
cran.r-project.org: datawizard
Easy Data Wrangling and Statistical Transformations
- Homepage: https://easystats.github.io/datawizard/
- Documentation: http://cran.r-project.org/web/packages/datawizard/datawizard.pdf
- License: MIT + file LICENSE
- Latest release: 1.2.0 (published 5 months ago)
Rankings
Downloads: 1.3%
Stargazers count: 2.5%
Dependent packages count: 3.7%
Dependent repos count: 4.0%
Average: 6.4%
Forks count: 7.0%
Docker downloads count: 19.8%
Maintainers (1)
Last synced: 4 months ago
conda-forge.org: r-datawizard
- Homepage: https://easystats.github.io/datawizard/
- License: GPL-3.0-only
- Latest release: 0.6.3 (published about 3 years ago)
Rankings
Dependent packages count: 9.0%
Dependent repos count: 24.4%
Average: 26.9%
Stargazers count: 29.3%
Forks count: 44.9%
Last synced: 4 months ago
Dependencies
DESCRIPTION
cran
- R >= 3.6 depends
- insight >= 0.18.8 imports
- stats * imports
- utils * imports
- bayestestR * suggests
- boot * suggests
- brms * suggests
- data.table * suggests
- dplyr >= 1.0 suggests
- effectsize * suggests
- gamm4 * suggests
- ggplot2 * suggests
- gt * suggests
- haven * suggests
- htmltools * suggests
- httr * suggests
- knitr * suggests
- lme4 * suggests
- mediation * suggests
- parameters * suggests
- poorman >= 0.2.6 suggests
- psych * suggests
- readr * suggests
- readxl * suggests
- rio * suggests
- rmarkdown * suggests
- rstanarm * suggests
- see * suggests
- testthat >= 3.1.0 suggests
- tidyr * suggests
- withr * suggests
.github/workflows/R-CMD-check-devel-easystats.yaml
actions
.github/workflows/R-CMD-check-hard.yaml
actions
.github/workflows/R-CMD-check-strict.yaml
actions
.github/workflows/R-CMD-check.yaml
actions
.github/workflows/check-all-examples.yaml
actions
.github/workflows/check-link-rot.yaml
actions
.github/workflows/check-random-test-order.yaml
actions
.github/workflows/check-readme.yaml
actions
.github/workflows/check-spelling.yaml
actions
.github/workflows/check-styling.yaml
actions
.github/workflows/check-test-warnings.yaml
actions
.github/workflows/check-vignette-warnings.yaml
actions
.github/workflows/html-5-check.yaml
actions
.github/workflows/lint-changed-files.yaml
actions
.github/workflows/lint.yaml
actions
.github/workflows/pkgdown-no-suggests.yaml
actions
.github/workflows/pkgdown.yaml
actions
.github/workflows/revdepcheck.yaml
actions
.github/workflows/test-coverage-examples.yaml
actions
.github/workflows/test-coverage.yaml
actions
.github/workflows/update-to-latest-easystats.yaml
actions
```{r, echo=FALSE, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  dpi = 300,
  out.width = "100%",
  fig.path = "man/figures/",
  comment = "#>"
)
set.seed(333)
library(datawizard)
```
[DOI: 10.21105/joss.04684](https://doi.org/10.21105/joss.04684)
[CRAN package page](https://cran.r-project.org/package=datawizard)
[CRAN download logs](https://cranlogs.r-pkg.org/)
`{datawizard}` is a lightweight package to easily manipulate, clean, transform, and prepare your data for analysis. It is part of the [easystats ecosystem](https://easystats.github.io/easystats/), a suite of R packages to deal with your entire statistical analysis, from cleaning the data to reporting the results.
It covers two aspects of data preparation:
- **Data manipulation**: `{datawizard}` offers a set of functions very similar to those of the *tidyverse* packages, such as `{dplyr}` and `{tidyr}`, to select, filter and reshape data, with a few key differences. 1) All data manipulation functions start with the prefix `data_*` (which makes them easy to identify). 2) Although most functions can be used exactly as their *tidyverse* equivalents, they are also string-friendly (which makes them easy to program with and use inside functions). 3) `{datawizard}` is very lightweight (its only non-base import is `{insight}`; similar in spirit to [poorman](https://github.com/nathaneastwood/poorman)), which makes it well suited for developers to use in their own packages.
- **Statistical transformations**: `{datawizard}` also has powerful functions to easily apply common data [transformations](https://easystats.github.io/datawizard/reference/index.html#statistical-transformations), including standardization, normalization, rescaling, rank-transformation, scale reversing, recoding, binning, etc.
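To illustrate the string-friendly design mentioned in the list above, here is a minimal sketch of a wrapper that receives column names as plain character strings. The helper name `keep_columns()` is hypothetical and not part of the package; the sketch assumes that `data_select()` accepts a character vector through its `select` argument, as the package's selection-syntax documentation describes.

```r
library(datawizard)

# Hypothetical helper (not part of datawizard): column names arrive as strings
keep_columns <- function(data, columns) {
  data_select(data, select = columns)
}

selected <- keep_columns(iris, c("Sepal.Length", "Species"))
names(selected)
```

Because the selection is an ordinary character vector, it can be built programmatically (for example from a configuration file) before being passed in.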
# Installation
[CRAN](https://cran.r-project.org/package=datawizard) [r-universe](https://easystats.r-universe.dev) [Codecov](https://app.codecov.io/gh/easystats/datawizard) [GitHub Actions](https://github.com/easystats/datawizard/actions)
Type | Source | Command
---|---|---
Release | CRAN | `install.packages("datawizard")`
Development | r-universe | `install.packages("datawizard", repos = "https://easystats.r-universe.dev")`
Development | GitHub | `remotes::install_github("easystats/datawizard")`
> **Tip**
>
> **Instead of `library(datawizard)`, use `library(easystats)`.**
> **This will make all features of the easystats-ecosystem available.**
>
> **To stay updated, use `easystats::install_latest()`.**
# Citation
To cite the package, run the following command:
```{r, comment=""}
citation("datawizard")
```
# Features
[Documentation](https://easystats.github.io/datawizard/)
[Blog](https://easystats.github.io/blog/posts/)
[Function reference](https://easystats.github.io/datawizard/reference/index.html)
Most courses and tutorials about statistical modeling assume that you are working with a clean and tidy dataset. In practice, however, a major part of doing statistical modeling is preparing your data: cleaning up values, creating new columns, reshaping the dataset, or transforming some variables. `{datawizard}` provides easy-to-use tools to perform these common, critical, and sometimes tedious data preparation tasks.
## Data wrangling
### Select, filter and remove variables
The package provides helpers to filter rows meeting certain conditions...
```{r}
data_match(mtcars, data.frame(vs = 0, am = 1))
```
... or logical expressions:
```{r}
data_filter(mtcars, vs == 0 & am == 1)
```
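For readers more familiar with base R, the logical-expression filter above corresponds roughly to logical row indexing. This is only a sketch of the idea, not of how `data_filter()` is implemented.

```r
# Base R sketch of the same row filter, using logical indexing
filtered <- mtcars[mtcars$vs == 0 & mtcars$am == 1, ]
nrow(filtered)
```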
Finding columns in a data frame, or retrieving the data of selected columns, can be achieved using `extract_column_names()` or `data_select()`:
```{r}
# find column names matching a pattern
extract_column_names(iris, starts_with("Sepal"))
# return data columns matching a pattern
data_select(iris, starts_with("Sepal")) |> head()
```
It is also possible to extract one or more variables:
```{r}
# single variable
data_extract(mtcars, "gear")
# more variables
head(data_extract(iris, ends_with("Width")))
```
Due to the consistent API, removing variables is just as simple:
```{r}
head(data_remove(iris, starts_with("Sepal")))
```
### Reorder or rename
```{r}
head(data_relocate(iris, select = "Species", before = "Sepal.Length"))
```
```{r}
head(data_rename(iris, c("Sepal.Length", "Sepal.Width"), c("length", "width")))
```
### Merge
```{r}
x <- data.frame(a = 1:3, b = c("a", "b", "c"), c = 5:7, id = 1:3)
y <- data.frame(c = 6:8, d = c("f", "g", "h"), e = 100:102, id = 2:4)
x
y
data_merge(x, y, join = "full")
data_merge(x, y, join = "left")
data_merge(x, y, join = "right")
data_merge(x, y, join = "semi", by = "c")
data_merge(x, y, join = "anti", by = "c")
data_merge(x, y, join = "inner")
data_merge(x, y, join = "bind")
```
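The join types can be easier to follow with row counts. The sketch below uses base R's `merge()` on the same two data frames to show what inner, left and full joins keep. It illustrates join semantics only; `data_merge()` may use different defaults and produce differently ordered output.

```r
x <- data.frame(a = 1:3, b = c("a", "b", "c"), c = 5:7, id = 1:3)
y <- data.frame(c = 6:8, d = c("f", "g", "h"), e = 100:102, id = 2:4)

# merge() joins on all shared column names by default (here: "c" and "id")
inner <- merge(x, y)                # only rows with a match in both
left <- merge(x, y, all.x = TRUE)   # every row of x, matched where possible
full <- merge(x, y, all = TRUE)     # every row of x and of y

c(inner = nrow(inner), left = nrow(left), full = nrow(full))
```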
### Reshape
A common data wrangling task is to reshape data, either from wide/Cartesian to long/tidy format:
```{r}
wide_data <- data.frame(replicate(5, rnorm(10)))
head(data_to_long(wide_data))
```
or the other way around:
```{r}
long_data <- data_to_long(wide_data, rows_to = "Row_ID") # Save row number
data_to_wide(long_data,
  names_from = "name",
  values_from = "value",
  id_cols = "Row_ID"
)
```
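As a rough picture of what "long" format means, base R's `stack()` collapses all columns into one value column plus one column naming the source. This is a simplified sketch; the column names it produces (`values`, `ind`) are those of `stack()`, not of `data_to_long()`.

```r
set.seed(333)
wide_data <- data.frame(replicate(5, rnorm(10)))

# stack() puts all values in one column and the source column name in another
long_sketch <- stack(wide_data)
dim(long_sketch)
```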
### Empty rows and columns
```{r}
tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)
tmp
# indices of empty columns or rows
empty_columns(tmp)
empty_rows(tmp)
# remove empty columns or rows
remove_empty_columns(tmp)
remove_empty_rows(tmp)
# remove empty columns and rows
remove_empty(tmp)
```
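The underlying idea can be sketched in base R, under the simplifying assumption that "empty" means "contains only `NA`". The package functions may apply a broader definition.

```r
tmp <- data.frame(
  a = c(1, 2, 3, NA, 5),
  b = c(1, NA, 3, NA, 5),
  c = c(NA, NA, NA, NA, NA),
  d = c(1, NA, 3, NA, 5)
)

# A column or row counts as empty here when it has no non-missing value
empty_cols <- which(colSums(!is.na(tmp)) == 0)
empty_row_ids <- which(rowSums(!is.na(tmp)) == 0)

# Negative indexing is safe here only because both index vectors are non-empty
cleaned <- tmp[-empty_row_ids, -empty_cols]
dim(cleaned)
```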
### Recode or cut dataframe
```{r}
set.seed(123)
x <- sample(1:10, size = 50, replace = TRUE)
table(x)
# cut into 3 groups, based on distribution (quantiles)
table(categorize(x, split = "quantile", n_groups = 3))
```
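Quantile-based grouping can be sketched with base R's `cut()`. The example uses a small deterministic vector so the grouping is easy to verify by hand; `categorize()` may handle boundaries and ties differently.

```r
x <- 1:9

# Tercile break points; include.lowest keeps the minimum in the first group
breaks <- quantile(x, probs = c(0, 1 / 3, 2 / 3, 1))
groups <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
table(groups)
```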
## Data Transformations
The package also contains multiple functions to help transform data.
### Standardize
For example, to standardize (*z*-score) data:
```{r}
# before
summary(swiss)
# after
summary(standardize(swiss))
```
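For a single numeric variable, a *z*-score is the value minus the mean, divided by the standard deviation. The base R sketch below shows that calculation for one column of `swiss`. It assumes the classic mean/SD definition and does not cover other options `standardize()` may offer.

```r
x <- swiss$Fertility

# z-score: subtract the mean, then divide by the standard deviation
z <- (x - mean(x)) / sd(x)

# After standardizing, the mean is (numerically) 0 and the SD is 1
c(mean = round(mean(z), 10), sd = sd(z))
```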
### Winsorize
To winsorize data:
```{r}
# before
anscombe
# after
winsorize(anscombe)
```
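Winsorizing replaces extreme values with less extreme ones instead of dropping them. A generic percentile-based version can be sketched in base R as follows. The 20%/80% cut-offs are chosen for illustration, and this is not necessarily the exact algorithm used by `winsorize()`.

```r
x <- c(1, 2, 3, 4, 100)

# Clamp values to the 20th and 80th percentiles (thresholds chosen for illustration)
limits <- unname(quantile(x, probs = c(0.2, 0.8)))
winsorized <- pmin(pmax(x, limits[1]), limits[2])
winsorized
```

The outlier `100` is pulled in to the upper limit, while values inside the limits are left unchanged.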
### Center
To grand-mean center data:
```{r}
center(anscombe)
```
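Grand-mean centering subtracts the overall mean from every value. The base R sketch below contrasts it with group-mean centering, where each group's own mean is subtracted. The toy vector and the `group` variable are made up for illustration.

```r
x <- c(2, 4, 6, 10, 12, 14)
group <- rep(c("a", "b"), each = 3)

# Grand-mean centering: subtract the overall mean
grand_centered <- x - mean(x)

# Group-mean centering, for contrast: subtract each group's own mean
group_centered <- x - ave(x, group)

rbind(grand_centered, group_centered)
```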
### Ranktransform
To rank-transform data:
```{r}
# before
head(trees)
# after
head(ranktransform(trees))
```
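A rank transformation replaces each value with its position in the sorted data. The base R sketch below shows the idea on a short vector with a tie, using average ranks for tied values (an assumption made here for illustration).

```r
x <- c(10, 30, 20, 20)

# Tied values receive the average of the ranks they span
ranks <- rank(x, ties.method = "average")
ranks
```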
### Rescale
To rescale a numeric variable to a new range:
```{r}
change_scale(c(0, 1, 5, -5, -2))
```
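Rescaling to a new range is a linear (min-max) mapping. The base R sketch below applies it to the same vector, assuming a target range of 0 to 100 chosen for illustration.

```r
x <- c(0, 1, 5, -5, -2)
new_range <- c(0, 100) # target range, assumed for illustration

# Map min(x) to the lower bound and max(x) to the upper bound
rescaled <- (x - min(x)) / (max(x) - min(x)) * diff(new_range) + new_range[1]
rescaled
```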
### Rotate or transpose
```{r}
x <- mtcars[1:3, 1:4]
x
data_rotate(x)
```
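Conceptually, rotating a data frame swaps rows and columns. A base R sketch of that idea uses `t()`, which returns a matrix and therefore needs converting back to a data frame; `data_rotate()` may offer additional handling that this sketch omits.

```r
x <- mtcars[1:3, 1:4]

# t() returns a matrix, so convert the result back to a data frame
rotated <- as.data.frame(t(x))
dim(rotated)
```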
## Data properties
`{datawizard}` can produce a comprehensive descriptive summary for all variables in a data frame:
```{r}
data(iris)
describe_distribution(iris)
```
Or for just a single variable:
```{r}
describe_distribution(mtcars$wt)
```
There are also some additional data properties that can be computed using this package.
```{r}
x <- (-10:10)^3 + rnorm(21, 0, 100)
smoothness(x, method = "diff")
```
## Function design and pipe-workflow
The `{datawizard}` functions follow a common design principle that makes it easy for users to understand and remember how they work:
1. the first argument is the data
2. for methods that work on data frames, the next two arguments are `select` and `exclude`, which choose the variables to include or exclude
3. the remaining arguments relate to the specific task of each function
Most importantly, functions that accept data frames usually take them as their first argument and also return a (modified) data frame. Thus, `{datawizard}` integrates smoothly into a "pipe-workflow".
```{r}
iris |>
  # all rows where Species is "versicolor" or "virginica"
  data_filter(Species %in% c("versicolor", "virginica")) |>
  # select only columns with "." in names (i.e. drop Species)
  data_select(contains("\\.")) |>
  # move columns that end with "Length" to start of data frame
  data_relocate(ends_with("Length")) |>
  # remove fourth column
  data_remove(4) |>
  head()
```
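The same "data in, data out" convention makes it straightforward to slot your own steps into such a pipeline. The sketch below uses only base R; `drop_constant_columns()` is a hypothetical helper written for this example, not a package function.

```r
# Hypothetical helper: takes the data first and returns a data frame,
# so it can be used as a step in a pipe
drop_constant_columns <- function(data) {
  varies <- vapply(data, function(column) length(unique(column)) > 1, logical(1))
  data[, varies, drop = FALSE]
}

result <- iris |>
  transform(source = "field") |> # add a constant column
  drop_constant_columns() |>     # ... which this step removes again
  head()
names(result)
```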
# Contributing and Support
If you want to file an issue or contribute to the package in another way, please follow [this guide](https://easystats.github.io/datawizard/CONTRIBUTING.html). For questions about the functionality, you may either contact us via email or file an issue.
# Code of Conduct
Please note that this project is released with a
[Contributor Code of Conduct](https://easystats.github.io/datawizard/CODE_OF_CONDUCT.html). By participating in this project you agree to abide by its terms.
