multidplyr

A dplyr backend that partitions a data frame over multiple processes

https://github.com/tidyverse/multidplyr

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 16 committers (6.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.7%) to scientific vocabulary

Keywords

dplyr multiprocess

Keywords from Contributors

data-manipulation grammar tidyverse package-creation curl pandoc rmarkdown latex bigquery tidy-data
Last synced: 6 months ago

Repository

A dplyr backend that partitions a data frame over multiple processes

Basic Info
Statistics
  • Stars: 646
  • Watchers: 39
  • Forks: 74
  • Open Issues: 18
  • Releases: 4
Topics
dplyr multiprocess
Created over 10 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Code of conduct

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# multidplyr


[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![R-CMD-check](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/multidplyr/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/tidyverse/multidplyr/branch/main/graph/badge.svg)](https://app.codecov.io/gh/tidyverse/multidplyr?branch=main)
[![CRAN status](https://www.r-pkg.org/badges/version/multidplyr)](https://cran.r-project.org/package=multidplyr)


## Overview

multidplyr is a backend for dplyr that partitions a data frame across multiple cores. You tell multidplyr how to split the data up with `partition()` and then the data stays on each node until you explicitly retrieve it with `collect()`. This minimises the amount of time spent moving data around, and maximises parallel performance. This idea is inspired by [partools](https://github.com/matloff/partools) by Norm Matloff and [distributedR](https://github.com/vertica/DistributedR) by the Vertica Analytics team.

Due to the overhead associated with communicating between the nodes, you won't see much performance improvement with simple operations on fewer than ~10 million observations, and you may instead want to try [dtplyr](https://dtplyr.tidyverse.org/), which uses [data.table](https://R-datatable.com/). multidplyr's strength lies in parallelising calls to slower and more complex functions.

(Note that unlike other packages in the tidyverse, multidplyr requires R 3.5 or greater. We hope to relax this requirement [in the future](https://github.com/traversc/qs/issues/11).)

## Installation

You can install the released version of multidplyr from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("multidplyr")
```

And the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("pak")
pak::pak("tidyverse/multidplyr")
```

## Usage

To use multidplyr, you first create a cluster of the desired number of workers. Each one of these workers is a separate R process, and the operating system will spread their execution across multiple cores:

```{r setup}
library(multidplyr)

cluster <- new_cluster(4)
cluster_library(cluster, "dplyr")
```
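
Each worker is a fully independent R process, so a quick sanity check (a sketch, not part of the original README) is to ask every worker for its process ID with `cluster_call()`, which evaluates an expression on each worker and returns the results:

```{r, eval = FALSE}
# Each worker should report a distinct process ID, confirming that
# the cluster spans separate R processes
cluster_call(cluster, Sys.getpid())
```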

There are two primary ways to use multidplyr. The first, and most efficient, way is to read different files on each worker:

```{r, eval = FALSE}
# Create a filename vector containing different values on each worker
cluster_assign_each(cluster, filename = c("a.csv", "b.csv", "c.csv", "d.csv"))

# Use vroom to quickly load the csvs
cluster_send(cluster, my_data <- vroom::vroom(filename))

# Create a party_df using the my_data variable on each worker
my_data <- party_df(cluster, "my_data")
```
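
Since the data now lives only on the workers, you can verify that each worker loaded its file with a quick round trip (a sketch continuing the hypothetical filenames above):

```{r, eval = FALSE}
# Returns one row count per worker, one for each csv read above
cluster_call(cluster, nrow(my_data))
```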

Alternatively, if you already have the data loaded in the main session, you can use `partition()` to automatically spread it across the workers. Before calling `partition()`, it's a good idea to call `group_by()` to ensure that all of the observations belonging to a group end up on the same worker.

```{r}
library(nycflights13)

flight_dest <- flights %>% group_by(dest) %>% partition(cluster)
flight_dest
```

Now you can work with it like a regular data frame, but the computations will be spread across multiple cores. Once you've finished computation, use `collect()` to bring the data back to the host session:

```{r}
flight_dest %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE), n = n()) %>% 
  collect()
```

Note that there is some overhead associated with copying data from the worker nodes back to the host node (and vice versa), so you're best off using multidplyr with more complex operations. See `vignette("multidplyr")` for more details.
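
For example, fitting a separate model for each destination is slow enough per group that the parallel speed-up dominates the transfer cost. Here is a sketch in the spirit of the vignette, using the suggested mgcv package (the model formula is illustrative):

```{r, eval = FALSE}
# Load mgcv on every worker, then fit one GAM per destination group.
# The fits run in parallel; only the fitted models travel back to the
# host when collected.
cluster_library(cluster, "mgcv")

flight_models <- flight_dest %>%
  do(mod = gam(dep_delay ~ s(hour) + origin, data = .)) %>%
  collect()
```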

Owner

  • Name: tidyverse
  • Login: tidyverse
  • Kind: organization

The tidyverse is a collection of R packages that share common principles and are designed to work together seamlessly.

GitHub Events

Total
  • Issues event: 1
  • Watch event: 10
  • Issue comment event: 2
Last Year
  • Issues event: 1
  • Watch event: 10
  • Issue comment event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 183
  • Total Committers: 16
  • Avg Commits per committer: 11.438
  • Development Distribution Score (DDS): 0.148
Past Year
  • Commits: 1
  • Committers: 1
  • Avg Commits per committer: 1.0
  • Development Distribution Score (DDS): 0.0
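
The DDS values above are consistent with the usual definition of the score, one minus the top committer's share of commits (an assumed formula; the 156-of-183 figure comes from the table below):

```latex
% Assumed formula: DDS = 1 - (commits by top committer) / (total commits)
\mathrm{DDS}_{\text{all time}} = 1 - \frac{156}{183} \approx 0.148,
\qquad
\mathrm{DDS}_{\text{past year}} = 1 - \frac{1}{1} = 0
```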
Top Committers
| Name | Email | Commits |
|---|---|---|
| Hadley Wickham | h****m@g****m | 156 |
| Romain Francois | r****n@r****m | 10 |
| Dale Maschette | d****e@a****u | 2 |
| Frans van Dunné | F****D | 2 |
| Jenny Bryan | j****n@g****m | 2 |
| Brent Brewington | b****n@g****m | 1 |
| Mara Averick | m****k@g****m | 1 |
| Max | m****n@g****m | 1 |
| Shyam Saladi | s****i@c****u | 1 |
| Will Beasley | w****y@h****m | 1 |
| eipi10 | j****l@j****m | 1 |
| Carlos Scheidegger | 2****d | 1 |
| Michael Grund | 2****d | 1 |
| Shyam Saladi | s****i | 1 |
| anobel | a****l | 1 |
| paulponcet | p****t@y****r | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 83
  • Total pull requests: 28
  • Average time to close issues: 11 months
  • Average time to close pull requests: 6 months
  • Total issue authors: 61
  • Total pull request authors: 17
  • Average comments per issue: 2.0
  • Average comments per pull request: 1.14
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 1
  • Average time to close issues: 20 days
  • Average time to close pull requests: about 1 hour
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • hadley (18)
  • lucazav (2)
  • stanstrup (2)
  • julou (2)
  • xabriel (2)
  • isaac-florence (2)
  • ksallinger1 (1)
  • d-morrison (1)
  • pmenaq-new (1)
  • impactanalysts (1)
  • philiporlando (1)
  • Ax3man (1)
  • Erinaceida (1)
  • romainfrancois (1)
  • avsdev-cw (1)
Pull Request Authors
  • hadley (8)
  • romainfrancois (4)
  • Maschette (2)
  • DavisVaughan (2)
  • wibeasley (2)
  • FvD (1)
  • michaelgrund (1)
  • CorradoLanera (1)
  • cscheid (1)
  • borisveytsman (1)
  • batpigandme (1)
  • iago-pssjd (1)
  • julou (1)
  • germanium (1)
  • jiho (1)
Top Labels
Issue Labels
feature (12) bug (5) upkeep (4) reprex (2) documentation (2)

Packages

  • Total packages: 2
  • Total downloads:
    • cran: 780 last month
  • Total docker downloads: 2,545
  • Total dependent packages: 3
    (may contain duplicates)
  • Total dependent repositories: 18
    (may contain duplicates)
  • Total versions: 7
  • Total maintainers: 1
cran.r-project.org: multidplyr

A Multi-Process 'dplyr' Backend

  • Versions: 4
  • Dependent Packages: 3
  • Dependent Repositories: 18
  • Downloads: 780 last month
  • Docker Downloads: 2,545
Rankings
  • Stargazers count: 0.5%
  • Forks count: 0.9%
  • Dependent repos count: 6.7%
  • Average: 10.5%
  • Dependent packages count: 10.9%
  • Downloads: 16.8%
  • Docker downloads count: 27.4%
Maintainers (1)
Last synced: 6 months ago
conda-forge.org: r-multidplyr
  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
  • Stargazers count: 14.5%
  • Forks count: 19.1%
  • Average: 29.7%
  • Dependent repos count: 34.0%
  • Dependent packages count: 51.2%
Last synced: 7 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.4.0 depends
  • R6 * imports
  • callr >= 3.5.1 imports
  • crayon * imports
  • dplyr >= 1.0.0 imports
  • magrittr * imports
  • qs >= 0.24.1 imports
  • rlang * imports
  • tibble * imports
  • tidyselect * imports
  • vctrs >= 0.3.6 imports
  • covr * suggests
  • knitr * suggests
  • lubridate * suggests
  • mgcv * suggests
  • nycflights13 * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.2 suggests
  • vroom * suggests
  • withr * suggests