tidyfst

tidyfst: Tidy Verbs for Fast Data Manipulation - Published in JOSS (2020)

https://github.com/hope-data-science/tidyfst

Science Score: 93.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 5 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software
Last synced: 6 months ago

Repository

Tidy Verbs for Fast Data Manipulation

Basic Info
Statistics
  • Stars: 106
  • Watchers: 5
  • Forks: 7
  • Open Issues: 0
  • Releases: 12
Created about 6 years ago · Last pushed 10 months ago
Metadata Files
Readme Contributing License Code of conduct Support

README.md

tidyfst: Tidy Verbs for Fast Data Manipulation


Overview

tidyfst is a toolkit of tidy data manipulation verbs with data.table as the backend. It combines the elegant syntax of dplyr with the computing performance of data.table, aiming to give users state-of-the-art data manipulation tools with the least pain. The package is an extension of data.table: besides offering a tidy syntax, it wraps combinations of efficient functions to simplify frequently used data operations. tidyfst also introduces tidy data verbs from other packages, including but not limited to tidyverse and data.table. If you are a dplyr user who has to use data.table for speed, or a data.table user looking for more readable syntax, tidyfst is designed for you (and me, of course). For further details and tutorials, see the vignettes, which are available in both Chinese and English.

tidyfst now has an API that might even surpass its predecessors in places (e.g. select_dt accepts nearly any kind of column selector). Enjoy the efficient data operations in tidyfst!
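As a rough illustration of the flexible column selection mentioned above, the sketch below passes several kinds of selectors to select_dt. It assumes tidyfst is installed and uses the built-in iris data. The selector forms shown (a bare name, integer positions, a regular expression string, and a predicate function) follow the package documentation as understood here and may behave differently across versions.

```R
library(tidyfst)

# Select by bare column name
by_name <- select_dt(iris, Species)

# Select by integer position (the first three columns)
by_index <- select_dt(iris, 1:3)

# Select by a regular expression matched against column names
by_regex <- select_dt(iris, "Pe")

# Select by a predicate function applied to each column
by_type <- select_dt(iris, is.factor)

# Inspect which columns each selector kept
names(by_name)
names(by_index)
names(by_regex)
names(by_type)
```

Each call returns a new data.table, so the selectors can be compared side by side without touching iris itself.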

PS: For extreme performance in tidy syntax, try tidyfst's mirror package tidyft.

Features

  • Receives any data.frame (tibble/data.table/data.frame) and returns a data.table.
  • Shows the variable class of each data.table column by default.
  • Never uses in-place replacement (also known as modification by reference), so the original variable is never modified without notice.
  • Uses a suffix ("_dt") rather than a prefix to make typing more efficient (especially in an IDE with automatic code completion).
  • More flexible verbs (e.g. pairwise_count_dt) for big data manipulation.
  • Supports data importing and parsing with fst, which saves both time and memory. For details, see parse_fst/select_fst/filter_fst and import_fst/export_fst.
  • Low and stable dependency on mature packages (data.table, fst, stringr).
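Some of the features listed above can be sketched in a few lines. The example below assumes tidyfst (with its data.table and fst dependencies) is installed. The toy data frame and the temporary file path are hypothetical and exist only for illustration.

```R
library(tidyfst)

# Any data.frame goes in, and a data.table comes out
df <- data.frame(id = 1:3, value = c(10, 20, 30))
out <- mutate_dt(df, doubled = value * 2)
class(out)

# No modification by reference: the input data.table keeps its columns
dt <- data.table::as.data.table(df)
res <- mutate_dt(dt, doubled = value * 2)
names(dt)    # original columns only
names(res)   # original columns plus the new one

# fst workflow: write once, then query the file without loading all of it
path <- tempfile(fileext = ".fst")         # hypothetical temporary file
export_fst(iris, path)

ft <- parse_fst(path)                      # parse the file without reading the data
sepal <- select_fst(ft, Sepal.Length)      # load a single column
wide <- filter_fst(ft, Sepal.Width > 3)    # load only the matching rows
full <- import_fst(path)                   # load the whole table

unlink(path)                               # clean up the temporary file
```

Parsing first and then selecting or filtering is what saves memory: only the requested columns or rows are read from disk.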

Installation

    install.packages("tidyfst")

Example

```R
library(tidyfst)

# Chain several verbs: rename, select, filter, sort, deduplicate, summarise
iris %>%
  mutate_dt(group = Species, sl = Sepal.Length, sw = Sepal.Width) %>%
  select_dt(group, sl, sw) %>%
  filter_dt(sl > 5) %>%
  arrange_dt(group, sl) %>%
  distinct_dt(sl, .keep_all = TRUE) %>%
  summarise_dt(sw = max(sw), by = group)
#>         group    sw
#>        <fctr> <num>
#> 1:     setosa   4.4
#> 2: versicolor   3.4
#> 3:  virginica   3.8

# Count rows per group and add proportions
iris %>%
  count_dt(Species) %>%
  add_prop()
#>       Species     n      prop prop_label
#>        <fctr> <int>     <num>     <char>
#> 1:     setosa    50 0.3333333      33.3%
#> 2: versicolor    50 0.3333333      33.3%
#> 3:  virginica    50 0.3333333      33.3%

# Update columns only in the rows that meet a condition
iris[3:8, ] %>%
  mutate_when(Petal.Width == .2, one = 1, Sepal.Length = 2)
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   one
#>           <num>       <num>        <num>       <num>  <fctr> <num>
#> 1:          2.0         3.2          1.3         0.2  setosa     1
#> 2:          2.0         3.1          1.5         0.2  setosa     1
#> 3:          2.0         3.6          1.4         0.2  setosa     1
#> 4:          5.4         3.9          1.7         0.4  setosa    NA
#> 5:          4.6         3.4          1.4         0.3  setosa    NA
#> 6:          2.0         3.4          1.5         0.2  setosa     1
```

Future plans

tidyfst will keep up with updates to data.table. The next step is to introduce new features that improve performance and flexibility, facilitating fast data manipulation in tidy syntax.

Vignettes

Cheat sheet

Suggested citation

Huang et al., (2020). tidyfst: Tidy Verbs for Fast Data Manipulation. Journal of Open Source Software, 5(52), 2388, https://doi.org/10.21105/joss.02388

Related work

Acknowledgement

Gregory Demin, the author of maditr, and Marcus Klik, the author of fst, have helped me a lot in the development of this work. I am lucky to have them (and many other selfless contributors) in the same open source community of R.

Owner

  • Name: Hope
  • Login: hope-data-science
  • Kind: user
  • Location: Beijing
  • Company: Chinese Academy of Sciences

Use R to change the world!

JOSS Publication

tidyfst: Tidy Verbs for Fast Data Manipulation
Published
August 21, 2020
Volume 5, Issue 52, Page 2388
Authors
Tian-Yuan Huang ORCID
School of Life Science, Fudan University
Bin Zhao ORCID
School of Life Science, Fudan University
Editor
Mikkel Meyer Andersen ORCID
Tags
data.table data aggregation data manipulation dplyr tidyfst

GitHub Events

Total
  • Watch event: 12
  • Push event: 2
Last Year
  • Watch event: 12
  • Push event: 2

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 349
  • Total Committers: 3
  • Avg Commits per committer: 116.333
  • Development Distribution Score (DDS): 0.02
Past Year
  • Commits: 21
  • Committers: 1
  • Avg Commits per committer: 21.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Hope | 3****e | 342 |
| Hadley Wickham | h****m@g****m | 6 |
| Michael Chirico | m****4@g****m | 1 |
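
The all-time summary figures can be reproduced from the per-committer counts in the Top Committers listing. The sketch below assumes the Development Distribution Score (DDS) is defined as one minus the top committer's share of all commits. That definition is an assumption made here, not something stated on this page, although it is consistent with the reported values.

```R
# Commit counts per committer, taken from the Top Committers listing
commits <- c("Hope" = 342, "Hadley Wickham" = 6, "Michael Chirico" = 1)

total <- sum(commits)               # total commits across all committers
avg <- total / length(commits)      # average commits per committer

# Assumed definition: 1 - share of commits made by the top committer
dds <- 1 - max(commits) / total

round(avg, 3)
round(dds, 2)
```

Under this assumed definition, a project with a single committer scores 0, which matches the past-year figure.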

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 24
  • Total pull requests: 2
  • Average time to close issues: 3 months
  • Average time to close pull requests: 3 days
  • Total issue authors: 19
  • Total pull request authors: 2
  • Average comments per issue: 3.88
  • Average comments per pull request: 3.5
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • markfairbanks (6)
  • michaelaoash (1)
  • XianglinZhang-risker (1)
  • rcannood (1)
  • lssb (1)
  • fc-ibb105 (1)
  • B-1991-ing (1)
  • hwanghan (1)
  • kongdd (1)
  • acpguedes (1)
  • jfdesomzee (1)
  • maskegger (1)
  • xiaoluolorn (1)
  • xiaodaigh (1)
  • hope-data-science (1)
Pull Request Authors
  • MichaelChirico (2)
  • hadley (1)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 3,414 last-month
  • Total docker downloads: 9
  • Total dependent packages: 2
  • Total dependent repositories: 2
  • Total versions: 27
  • Total maintainers: 1
cran.r-project.org: tidyfst

Tidy Verbs for Fast Data Manipulation

  • Versions: 27
  • Dependent Packages: 2
  • Dependent Repositories: 2
  • Downloads: 3,414 Last month
  • Docker Downloads: 9
Rankings
Stargazers count: 4.1%
Forks count: 7.9%
Downloads: 13.1%
Dependent packages count: 13.7%
Average: 14.0%
Dependent repos count: 19.3%
Docker downloads count: 25.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.3.0 depends
  • data.table >= 1.13.0 imports
  • fst >= 0.9.0 imports
  • stringr >= 1.4.0 imports
  • bench * suggests
  • dplyr * suggests
  • ggplot2 * suggests
  • knitr * suggests
  • nycflights13 * suggests
  • pryr * suggests
  • rmarkdown * suggests
  • testthat * suggests
  • tidyr * suggests