multidplyr

A dplyr backend that partitions a data frame over multiple processes

https://github.com/tidyverse/multidplyr

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 16 committers (6.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.7%) to scientific vocabulary

Keywords

dplyr multiprocess

Keywords from Contributors

data-manipulation grammar tidyverse package-creation curl pandoc rmarkdown latex bigquery tidy-data
Last synced: 6 months ago

Repository

A dplyr backend that partitions a data frame over multiple processes

Basic Info
Statistics
  • Stars: 646
  • Watchers: 39
  • Forks: 74
  • Open Issues: 18
  • Releases: 4
Topics
dplyr multiprocess
Created over 10 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Code of conduct

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# multidplyr


[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidyverse/multidplyr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/multidplyr?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/multidplyr)](https://cran.r-project.org/package=multidplyr)


## Overview

multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with `partition()` and then the data stays on each node until you explicitly retrieve it with `collect()`. This minimises the amount of time spent moving data around, and maximises parallel performance. This idea is inspired by [partools](https://github.com/matloff/partools) by Norm Matloff and [distributedR](https://github.com/vertica/DistributedR) by the Vertica Analytics team.

Due to the overhead associated with communicating between the nodes, you won't see much performance improvement with simple operations on fewer than ~10 million observations, and you may instead want to try [dtplyr](https://dtplyr.tidyverse.org/), which uses [data.table](https://R-datatable.com/). multidplyr's strength lies in parallelising calls to slower and more complex functions.

(Note that unlike other packages in the tidyverse, multidplyr requires R 3.5 or greater. We hope to relax this requirement [in the future](https://github.com/traversc/qs/issues/11).)

## Installation

You can install the released version of multidplyr from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("multidplyr")
```

And the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pak("tidyverse/multidplyr")
```

## Usage

To use multidplyr, you first create a cluster of the desired number of workers. Each one of these workers is a separate R process, and the operating system will spread their execution across multiple cores:

```{r setup}
library(multidplyr)

cluster <- new_cluster(4)
cluster_library(cluster, "dplyr")
```
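
Each worker is a fully independent R process, so a quick sanity check (a sketch, not part of the original README) is to ask every worker for its process ID with `cluster_call()`, which evaluates an expression on each worker and returns the results:

```{r, eval = FALSE}
# Each worker should report a distinct process ID, confirming that
# the cluster spans separate R processes
cluster_call(cluster, Sys.getpid())
```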

There are two primary ways to use multidplyr. The first, and most efficient, way is to read different files on each worker:

```{r, eval = FALSE}
# Create a filename vector containing different values on each worker
cluster_assign_each(cluster, filename = c("a.csv", "b.csv", "c.csv", "d.csv"))

# Use vroom to quickly load the csvs
cluster_send(cluster, my_data <- vroom::vroom(filename))

# Create a party_df using the my_data variable on each worker
my_data <- party_df(cluster, "my_data")
```
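
Since the data now lives only on the workers, you can verify that each worker loaded its file with a quick round trip (a sketch continuing the hypothetical filenames above):

```{r, eval = FALSE}
# Returns one row count per worker, one for each csv read above
cluster_call(cluster, nrow(my_data))
```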

Alternatively, if you already have the data loaded in the main session, you can use `partition()` to automatically spread it across the workers. Before calling `partition()`, it's a good idea to call `group_by()` to ensure that all of the observations belonging to a group end up on the same worker.

```{r}
library(nycflights13)

flight_dest <- flights %>% group_by(dest) %>% partition(cluster)
flight_dest
```

Now you can work with it like a regular data frame, but the computations will be spread across multiple cores. Once you've finished computation, use `collect()` to bring the data back to the host session:

```{r}
flight_dest %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE), n = n()) %>% 
  collect()
```

Note that there is some overhead associated with copying data from the worker nodes back to the host node (and vice versa), so you're best off using multidplyr with more complex operations. See `vignette("multidplyr")` for more details.
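
For example, fitting a separate model for each destination is slow enough per group that the parallel speed-up dominates the transfer cost. Here is a sketch in the spirit of the vignette, using the suggested mgcv package (the model formula is illustrative):

```{r, eval = FALSE}
# Load mgcv on every worker, then fit one GAM per destination group.
# The fits run in parallel; only the fitted models travel back to the
# host when collected.
cluster_library(cluster, "mgcv")

flight_models <- flight_dest %>%
  do(mod = gam(dep_delay ~ s(hour) + origin, data = .)) %>%
  collect()
```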

Owner

  • Name: tidyverse
  • Login: tidyverse
  • Kind: organization

The tidyverse is a collection of R packages that share common principles and are designed to work together seamlessly.

GitHub Events

Total
  • Issues event: 1
  • Watch event: 10
  • Issue comment event: 2
Last Year
  • Issues event: 1
  • Watch event: 10
  • Issue comment event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 183
  • Total Committers: 16
  • Avg Commits per committer: 11.438
  • Development Distribution Score (DDS): 0.148
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
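
The DDS values above are consistent with the usual definition of the score, one minus the top committer's share of commits (an assumed formula; the 156-of-183 figure comes from the table below):

```latex
% Assumed formula: DDS = 1 - (commits by top committer) / (total commits)
\mathrm{DDS}_{\text{all time}} = 1 - \frac{156}{183} \approx 0.148,
\qquad
\mathrm{DDS}_{\text{past year}} = 1 - \frac{1}{1} = 0
```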
Top Committers
| Name | Email | Commits |
|---|---|---|
| Hadley Wickham | h****m@g****m | 156 |
| Romain Francois | r****n@r****m | 10 |
| Dale Maschette | d****e@a****u | 2 |
| Frans van Dunné | F****D | 2 |
| Jenny Bryan | j****n@g****m | 2 |
| Brent Brewington | b****n@g****m | 1 |
| Mara Averick | m****k@g****m | 1 |
| Max | m****n@g****m | 1 |
| Shyam Saladi | s****i@c****u | 1 |
| Will Beasley | w****y@h****m | 1 |
| eipi10 | j****l@j****m | 1 |
| Carlos Scheidegger | 2****d | 1 |
| Michael Grund | 2****d | 1 |
| Shyam Saladi | s****i | 1 |
| anobel | a****l | 1 |
| paulponcet | p****t@y****r | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 83
  • Total pull requests: 28
  • Average time to close issues: 11 months
  • Average time to close pull requests: 6 months
  • Total issue authors: 61
  • Total pull request authors: 17
  • Average comments per issue: 2.0
  • Average comments per pull request: 1.14
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: 20 days
  • Average time to close pull requests: about 1 hour
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • hadley (18)
  • lucazav (2)
  • stanstrup (2)
  • julou (2)
  • xabriel (2)
  • isaac-florence (2)
  • ksallinger1 (1)
  • d-morrison (1)
  • pmenaq-new (1)
  • impactanalysts (1)
  • philiporlando (1)
  • Ax3man (1)
  • Erinaceida (1)
  • romainfrancois (1)
  • avsdev-cw (1)
Pull Request Authors
  • hadley (8)
  • romainfrancois (4)
  • Maschette (2)
  • DavisVaughan (2)
  • wibeasley (2)
  • FvD (1)
  • michaelgrund (1)
  • CorradoLanera (1)
  • cscheid (1)
  • borisveytsman (1)
  • batpigandme (1)
  • iago-pssjd (1)
  • julou (1)
  • germanium (1)
  • jiho (1)
Top Labels
Issue Labels
feature (12) bug (5) upkeep (4) reprex (2) documentation (2)

Packages

  • Total packages: 2
  • Total downloads:
    • cran: 780 last month
  • Total docker downloads: 2,545
  • Total dependent packages: 3
    (may contain duplicates)
  • Total dependent repositories: 18
    (may contain duplicates)
  • Total versions: 7
  • Total maintainers: 1
cran.r-project.org: multidplyr

A Multi-Process 'dplyr' Backend

  • Versions: 4
  • Dependent Packages: 3
  • Dependent Repositories: 18
  • Downloads: 780 last month
  • Docker Downloads: 2,545
Rankings
  • Stargazers count: 0.5%
  • Forks count: 0.9%
  • Dependent repos count: 6.7%
  • Average: 10.5%
  • Dependent packages count: 10.9%
  • Downloads: 16.8%
  • Docker downloads count: 27.4%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-multidplyr
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
  • Stargazers count: 14.5%
  • Forks count: 19.1%
  • Average: 29.7%
  • Dependent repos count: 34.0%
  • Dependent packages count: 51.2%
Last synced: 7 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.4.0 depends
  • R6 * imports
  • callr >= 3.5.1 imports
  • crayon * imports
  • dplyr >= 1.0.0 imports
  • magrittr * imports
  • qs >= 0.24.1 imports
  • rlang * imports
  • tibble * imports
  • tidyselect * imports
  • vctrs >= 0.3.6 imports
  • covr * suggests
  • knitr * suggests
  • lubridate * suggests
  • mgcv * suggests
  • nycflights13 * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.2 suggests
  • vroom * suggests
  • withr * suggests