multidplyr
A dplyr backend that partitions a data frame over multiple processes
Science Score: 23.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails (1 of 16 committers, 6.3%, from academic institutions)
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 18.7%, to scientific vocabulary)
Keywords
dplyr
multiprocess
Keywords from Contributors
data-manipulation
grammar
tidyverse
package-creation
curl
pandoc
rmarkdown
latex
bigquery
tidy-data
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: tidyverse
- License: other
- Language: R
- Default Branch: main
- Homepage: https://multidplyr.tidyverse.org
- Size: 2.32 MB
Statistics
- Stars: 646
- Watchers: 39
- Forks: 74
- Open Issues: 18
- Releases: 4
Topics
dplyr
multiprocess
Created over 10 years ago · Last pushed over 1 year ago
Metadata Files
Readme
Changelog
License
Code of conduct
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# multidplyr
[Lifecycle: experimental](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[R-CMD-check](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml)
[Codecov test coverage](https://app.codecov.io/gh/tidyverse/multidplyr?branch=main)
[CRAN status](https://cran.r-project.org/package=multidplyr)
## Overview
multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with `partition()` and then the data stays on each node until you explicitly retrieve it with `collect()`. This minimises the amount of time spent moving data around, and maximises parallel performance. This idea is inspired by [partools](https://github.com/matloff/partools) by Norm Matloff and [distributedR](https://github.com/vertica/DistributedR) by the Vertica Analytics team.
Due to the overhead of communicating between the nodes, you won't see much performance improvement for simple operations on fewer than ~10 million observations; for those, you may want to try [dtplyr](https://dtplyr.tidyverse.org/), which uses [data.table](https://R-datatable.com/), instead. multidplyr's strength lies in parallelising calls to slower and more complex functions.
(Note that unlike other packages in the tidyverse, multidplyr requires R 3.5 or greater. We hope to relax this requirement [in the future](https://github.com/traversc/qs/issues/11).)
## Installation
You can install the released version of multidplyr from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("multidplyr")
```
And the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("pak")
pak::pak("tidyverse/multidplyr")
```
## Usage
To use multidplyr, you first create a cluster of the desired number of workers. Each one of these workers is a separate R process, and the operating system will spread their execution across multiple cores:
```{r setup}
library(multidplyr)
cluster <- new_cluster(4)
cluster_library(cluster, "dplyr")
```
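Workers start out as fresh R sessions, so any packages, helper functions, or other objects your pipeline relies on have to be sent to them explicitly. Here is a minimal sketch using multidplyr's cluster API (the `trim_mean` helper and `threshold` value are made up for illustration):
```{r, eval = FALSE}
# A helper function defined in the main session...
trim_mean <- function(x) mean(x, trim = 0.1, na.rm = TRUE)

# ...must be copied to the workers before partitioned code can call it
cluster_copy(cluster, "trim_mean")

# cluster_assign() sets the same value on every worker
cluster_assign(cluster, threshold = 30)
```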
There are two primary ways to use multidplyr. The first, and most efficient, way is to read different files on each worker:
```{r, eval = FALSE}
# Create a filename vector containing different values on each worker
cluster_assign_each(cluster, filename = c("a.csv", "b.csv", "c.csv", "d.csv"))
# Use vroom to quickly load the csvs
cluster_send(cluster, my_data <- vroom::vroom(filename))
# Create a party_df using the my_data variable on each worker
my_data <- party_df(cluster, "my_data")
```
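Since each piece of the data lives in a worker's own session, you can sanity-check what the workers hold with `cluster_call()`, which evaluates an expression on every worker and returns the results. A small sketch, assuming the files above were loaded:
```{r, eval = FALSE}
# Returns one value per worker, e.g. how many rows each one loaded
cluster_call(cluster, nrow(my_data))
```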
Alternatively, if you already have the data loaded in the main session, you can use `partition()` to automatically spread it across the workers. Before calling `partition()`, it's a good idea to call `group_by()` to ensure that all of the observations belonging to a group end up on the same worker.
```{r}
library(nycflights13)
flight_dest <- flights %>% group_by(dest) %>% partition(cluster)
flight_dest
```
Now you can work with it like a regular data frame, but the computations will be spread across multiple cores. Once you've finished computation, use `collect()` to bring the data back to the host session:
```{r}
flight_dest %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE), n = n()) %>%
  collect()
```
Note that there is some overhead associated with copying data from the worker nodes back to the host node (and vice versa), so you're best off using multidplyr with more complex operations. See `vignette("multidplyr")` for more details.
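To give a flavour of the kind of slower per-group work where the parallelism pays off, here is a sketch in the spirit of `vignette("multidplyr")`, fitting a separate GAM to each destination (the model formula is illustrative, and in practice you would first filter out destinations with too few flights):
```{r, eval = FALSE}
# Load mgcv on every worker, then fit one model per group in parallel
cluster_library(cluster, "mgcv")

models <- flight_dest %>%
  do(mod = gam(dep_delay ~ s(dep_time), data = .)) %>%
  collect()
```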
Owner
- Name: tidyverse
- Login: tidyverse
- Kind: organization
- Website: http://tidyverse.org
- Repositories: 43
- Profile: https://github.com/tidyverse
The tidyverse is a collection of R packages that share common principles and are designed to work together seamlessly.
GitHub Events
Total
- Issues event: 1
- Watch event: 10
- Issue comment event: 2
Last Year
- Issues event: 1
- Watch event: 10
- Issue comment event: 2
Committers
Last synced: over 1 year ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Hadley Wickham | h****m@g****m | 156 |
| Romain Francois | r****n@r****m | 10 |
| Dale Maschette | d****e@a****u | 2 |
| Frans van Dunné | F****D | 2 |
| Jenny Bryan | j****n@g****m | 2 |
| Brent Brewington | b****n@g****m | 1 |
| Mara Averick | m****k@g****m | 1 |
| Max | m****n@g****m | 1 |
| Shyam Saladi | s****i@c****u | 1 |
| Will Beasley | w****y@h****m | 1 |
| eipi10 | j****l@j****m | 1 |
| Carlos Scheidegger | 2****d | 1 |
| Michael Grund | 2****d | 1 |
| Shyam Saladi | s****i | 1 |
| anobel | a****l | 1 |
| paulponcet | p****t@y****r | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 83
- Total pull requests: 28
- Average time to close issues: 11 months
- Average time to close pull requests: 6 months
- Total issue authors: 61
- Total pull request authors: 17
- Average comments per issue: 2.0
- Average comments per pull request: 1.14
- Merged pull requests: 19
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 1
- Average time to close issues: 20 days
- Average time to close pull requests: about 1 hour
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- hadley (18)
- lucazav (2)
- stanstrup (2)
- julou (2)
- xabriel (2)
- isaac-florence (2)
- ksallinger1 (1)
- d-morrison (1)
- pmenaq-new (1)
- impactanalysts (1)
- philiporlando (1)
- Ax3man (1)
- Erinaceida (1)
- romainfrancois (1)
- avsdev-cw (1)
Pull Request Authors
- hadley (8)
- romainfrancois (4)
- Maschette (2)
- DavisVaughan (2)
- wibeasley (2)
- FvD (1)
- michaelgrund (1)
- CorradoLanera (1)
- cscheid (1)
- borisveytsman (1)
- batpigandme (1)
- iago-pssjd (1)
- julou (1)
- germanium (1)
- jiho (1)
Top Labels
Issue Labels
- feature (12)
- bug (5)
- upkeep (4)
- reprex (2)
- documentation (2)
Packages
- Total packages: 2
- Total downloads: 780 last month (CRAN)
- Total docker downloads: 2,545
- Total dependent packages: 3 (may contain duplicates)
- Total dependent repositories: 18 (may contain duplicates)
- Total versions: 7
- Total maintainers: 1
cran.r-project.org: multidplyr
A Multi-Process 'dplyr' Backend
- Homepage: https://multidplyr.tidyverse.org
- Documentation: http://cran.r-project.org/web/packages/multidplyr/multidplyr.pdf
- License: MIT + file LICENSE
- Latest release: 0.1.3 (published almost 3 years ago)
Rankings
Stargazers count: 0.5%
Forks count: 0.9%
Dependent repos count: 6.7%
Average: 10.5%
Dependent packages count: 10.9%
Downloads: 16.8%
Docker downloads count: 27.4%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-multidplyr
- Homepage: https://github.com/tidyverse/multidplyr
- License: MIT
- Latest release: 0.1.2 (published over 3 years ago)
Rankings
Stargazers count: 14.5%
Forks count: 19.1%
Average: 29.7%
Dependent repos count: 34.0%
Dependent packages count: 51.2%
Last synced: 7 months ago
Dependencies
DESCRIPTION
cran
- Depends: R (>= 3.4.0)
- Imports: R6, callr (>= 3.5.1), crayon, dplyr (>= 1.0.0), magrittr, qs (>= 0.24.1), rlang, tibble, tidyselect, vctrs (>= 0.3.6)
- Suggests: covr, knitr, lubridate, mgcv, nycflights13, rmarkdown, testthat (>= 3.0.2), vroom, withr