Zoomerjoin

Zoomerjoin: Superlatively-Fast Fuzzy Joins - Published in JOSS (2023)

https://github.com/beniaminogreen/zoomerjoin

Science Score: 98.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

blazinglyfast fuzzyjoin join r r-package rstats rust zoomer
Last synced: 6 months ago

Repository

Superlatively-fast fuzzy-joins in R

Basic Info
Statistics
  • Stars: 105
  • Watchers: 5
  • Forks: 6
  • Open Issues: 9
  • Releases: 2
Created about 3 years ago · Last pushed 11 months ago
Metadata Files
Readme Changelog Contributing License Citation Codemeta

README.Rmd

---
output: github_document
always_allow_html: true
---

```{r, include = FALSE}
library(tidyverse)
library(microbenchmark)
library(fuzzyjoin)

# rextendr::document()
devtools::load_all()
```


# zoomerjoin 

[![DOI](https://joss.theoj.org/papers/10.21105/joss.05693/status.svg)](https://doi.org/10.21105/joss.05693)
[![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental)
[![Codecov test coverage](https://codecov.io/gh/beniaminogreen/zoomerjoin/branch/main/graph/badge.svg)](https://app.codecov.io/gh/beniaminogreen/zoomerjoin?branch=main)


zoomerjoin is an R package that empowers you to fuzzy-join massive datasets
rapidly, and with little memory consumption. It is powered by high-performance
implementations of [Locality Sensitive
Hashing](https://en.wikipedia.org/wiki/Locality-sensitive_hashing), an
algorithm that finds the matching records between two datasets without having
to compare all possible pairs of observations. In practice, this means
zoomerjoin can fuzzily join datasets days, or even years, faster than other
matching packages. zoomerjoin has been used in production to join datasets of
hundreds of millions of names or vectors in a matter of hours.

## Installation

### Installing from CRAN:

You can install from CRAN as you would with any other package. Please be
aware that you will need Cargo (Rust's package manager) and the Rust compiler
installed to build the package from source.

```r
install.packages("zoomerjoin")
```


### Installing Rust

If a pre-built binary is not available for your operating system or version of
R, you must have the [Rust compiler](https://www.rust-lang.org/tools/install)
installed to compile this package from source. After the package is compiled,
Rust is no longer required, and can be safely uninstalled.
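
As a quick check before building from source, the sketch below asks R whether
`cargo` and `rustc` are visible on the `PATH`. It relies only on base R's
`Sys.which()`; the helper name `has_rust_toolchain` is hypothetical and is not
part of zoomerjoin.

```r
# Hypothetical helper (not part of zoomerjoin):
# returns TRUE only if both `cargo` and `rustc` can be found on the PATH.
has_rust_toolchain <- function() {
  tools <- Sys.which(c("cargo", "rustc"))
  all(nzchar(tools))
}

has_rust_toolchain()
```

If this returns `FALSE`, installing Rust with one of the methods below (and
restarting R so that the updated `PATH` is picked up) is a reasonable next step.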

#### Installing Rust on Linux or Mac:

To install Rust on Linux or Mac, you can simply run the following snippet in
your terminal.

``` sh
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

#### Installing Rust on Windows:

To install Rust on Windows, you can use the Rust installation wizard,
`rustup-init.exe`, found [at this
site](https://forge.rust-lang.org/infra/other-installation-methods.html).
Depending on your version of Windows, you may see an error that looks something like this:

```
error: toolchain 'stable-x86_64-pc-windows-gnu' is not installed
```

In this case, you should run `rustup install stable-x86_64-pc-windows-gnu` to
install the missing toolchain. If you're missing another toolchain, simply type
its name in place of `stable-x86_64-pc-windows-gnu` in the command above.

### Installing Package from GitHub:

Once you have Rust installed, you should be able to install the package
with either the `install.packages` function as above, the `install_github`
function from the `devtools` package, or the `pkg_install` function from the
`pak` package.

``` r
## Install with devtools
# install.packages("devtools")
devtools::install_github("beniaminogreen/zoomerjoin")

## Install with pak
# install.packages("pak")
pak::pkg_install("beniaminogreen/zoomerjoin")
```

### Loading The Package

Once the package is installed, you can load it into memory as usual by typing:

```{r, warning = FALSE, message = FALSE}
library(zoomerjoin)
```

## Usage:

The flagship features of zoomerjoin are the `jaccard_join` and `euclidean_join`
families of functions, which are designed to be near drop-ins for the
corresponding dplyr/fuzzyjoin commands:

* `jaccard_left_join()`
* `jaccard_right_join()`
* `jaccard_inner_join()`
* `jaccard_full_join()`
* `euclidean_left_join()`
* `euclidean_right_join()`
* `euclidean_inner_join()`
* `euclidean_full_join()`

The `jaccard_join` family of functions provide fast fuzzy-joins for strings
using the Jaccard distance while the `euclidean_join` family provides
fuzzy-joins for points or vectors using the Euclidean distance.
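
To make the string metric concrete, here is a minimal base-R sketch of the
Jaccard similarity between two strings, computed on their sets of character
n-grams. The helper names `char_ngrams` and `jaccard_similarity` are
hypothetical and are not part of zoomerjoin, and the package's own tokenization
may differ in its details.

```r
# Hypothetical helper: split a string into its set of overlapping
# character n-grams (shingles).
char_ngrams <- function(x, n = 2) {
  chars <- strsplit(x, "")[[1]]
  if (length(chars) < n) {
    return(x)
  }
  starts <- seq_len(length(chars) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(chars[i:(i + n - 1)], collapse = ""),
    character(1)
  )
  unique(grams)
}

# Jaccard similarity: size of the intersection over size of the union.
jaccard_similarity <- function(a, b, n = 2) {
  grams_a <- char_ngrams(a, n)
  grams_b <- char_ngrams(b, n)
  length(intersect(grams_a, grams_b)) / length(union(grams_a, grams_b))
}

jaccard_similarity("john smith", "jon smith") # 0.7
```

The two names share 7 of their 10 distinct bigrams, so their similarity is 0.7
and their Jaccard distance is 0.3.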

### Example: Joining rows of the Database on Ideology, Money in Politics, and Elections (DIME)

Here's a snippet showing how to use `jaccard_inner_join()` to merge two
lists of political donors in the [Database on Ideology, Money in Politics,
and Elections (DIME)](https://data.stanford.edu/dime). You can see a more
detailed version of this example in the [introductory vignette](https://beniamino.org/zoomerjoin/articles/guided_tour.html).

I start with two corpuses I would like to combine, `corpus_1`:

```{r}
corpus_1 <- dime_data %>%
  head(500)
names(corpus_1) <- c("a", "field")
corpus_1
```

And `corpus_2`:

```{r}
corpus_2 <- dime_data %>%
  tail(500)
names(corpus_2) <- c("b", "field")
corpus_2
```
Both corpuses have an observation ID column and a donor name column. We would
like to join the two datasets on the donor name column, but the two can't be
directly joined because of misspellings. Because of this, we will use the
`jaccard_inner_join()` function to fuzzily join the two on the donor name column.

Importantly, Locality Sensitive Hashing is a [probabilistic
algorithm](https://en.wikipedia.org/wiki/Randomized_algorithm), so it may fail
to identify some matches by random chance. I adjust the hyperparameters
`n_bands` and `band_width` until the chance of true matches being dropped is
negligible. By default, the package will issue a warning if the chance of a
true match being discovered is less than 95%. You can use the
`jaccard_probability` and `jaccard_hyper_grid_search` functions to help
understand the probability that true matches will be discarded by the algorithm.

More details and a more thorough description of how to tune the hyperparameters
can be found in the [guided tour
vignette](https://beniamino.org/zoomerjoin/articles/guided_tour.html).
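
The trade-off between `n_bands` and `band_width` can be sketched with the
standard banding formula from Leskovec, Rajaraman, and Ullman: a pair of
records with Jaccard similarity `s` agrees on every hash in at least one band
with probability `1 - (1 - s^band_width)^n_bands`. The function name
`lsh_match_probability` below is hypothetical, and the sketch assumes the
package follows this textbook scheme; `jaccard_probability` remains the
package's own tool for this calculation.

```r
# Textbook LSH banding formula (an assumption about the package's internals):
# probability that two records with Jaccard similarity `s` end up being
# compared, i.e. that they agree on every hash in at least one band.
lsh_match_probability <- function(s, n_bands, band_width) {
  1 - (1 - s^band_width)^n_bands
}

# Probability of detection across a range of similarities.
similarities <- seq(0.5, 1, by = 0.1)
round(lsh_match_probability(similarities, n_bands = 20, band_width = 6), 3)
```

Under this formula, raising `n_bands` increases the chance of catching a true
match, while raising `band_width` makes the filter stricter and reduces the
number of dissimilar pairs that get compared.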

```{r}
set.seed(1)
start_time <- Sys.time()
join_out <- jaccard_inner_join(corpus_1, corpus_2, n_gram_width = 6, n_bands = 20, band_width = 6)
print(Sys.time() - start_time)
print(join_out)
```

Zoomerjoin is able to quickly find the matching rows without comparing all
pairs of records. This saves more and more time as the size of each list
increases, so it can scale to join datasets with millions or hundreds of
millions of rows.
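
For a sense of scale, the arithmetic below counts how many pairs a brute-force
join would have to score if it compared every row of one table with every row
of the other. The helper name `all_pairs` and the table sizes are illustrative
only; they are not measurements of zoomerjoin or of any other package.

```r
# Number of candidate pairs an all-pairs (brute-force) fuzzy join must score.
all_pairs <- function(n_left, n_right) {
  as.numeric(n_left) * as.numeric(n_right)
}

# Illustrative table sizes, from the 500-row example above to 100 million rows.
sizes <- c(500, 1e4, 1e6, 1e8)
data.frame(
  rows_per_table = sizes,
  brute_force_comparisons = all_pairs(sizes, sizes)
)
```

The comparison count grows with the product of the two table sizes, which is
why avoiding the all-pairs step matters more as the inputs get larger.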

# Contributing

Thanks for your interest in contributing to Zoomerjoin!

I am using a GitHub-centric workflow to manage the package. You can file a bug
report, request a new feature, or ask a question about the package by [filing
an issue on the issues page](https://github.com/beniaminogreen/zoomerjoin/issues),
where you will also find a range of templates to help you out. If you'd like to
make changes to the code, you can write and file a [pull
request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests)
on [this page](https://github.com/beniaminogreen/zoomerjoin/pulls). I'll try to
respond to all of these in a timely manner (within a week), although
occasionally I may take longer to respond to a complicated question or issue.

Please also be aware of the [contributor code of
conduct](https://github.com/beniaminogreen/zoomerjoin/blob/main/CONTRIBUTING.md)
for contributing to the repository.

## Acknowledgments:


The zoomerjoin logo was made using [this SQL join
illustration](https://commons.wikimedia.org/wiki/File:SQL_Join_-_08_A_Cross_Join_B.svg)
by [GermanX](https://commons.wikimedia.org/wiki/User:GermanX) and [this speed
limit sign](https://commons.wikimedia.org/wiki/File:Speed_limit_75_sign.svg) from the
Federal Highway Administration - MUTCD.

## References:

Bonica, Adam. 2016. Database on Ideology, Money in Politics, and Elections: Public version 2.0 [Computer file]. Stanford, CA: Stanford University Libraries.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of Massive Datasets (2nd ed.). Cambridge University Press, USA.

Broder, Andrei Z. 1997. "On the resemblance and containment of documents." Compression and Complexity of Sequences: Proceedings. Positano, Salerno, Italy.

Owner

  • Name: Beniamino Green
  • Login: beniaminogreen
  • Kind: user
  • Location: New Haven, CT
  • Company: Yale University

Pre-doctoral Fellow

JOSS Publication

Zoomerjoin: Superlatively-Fast Fuzzy Joins
Published
September 26, 2023
Volume 8, Issue 89, Page 5693
Authors
Beniamino Green ORCID
Yale University, USA
Editor
Samuel Forbes ORCID
Tags
Record linkage Fuzzy-joining Tidy data manipulation

Citation (CITATION.cff)

cff-version: "1.2.0"
authors:
- family-names: Green
  given-names: Beniamino
  orcid: "https://orcid.org/0009-0006-4501-597X"
contact:
- family-names: Green
  given-names: Beniamino
  orcid: "https://orcid.org/0009-0006-4501-597X"
doi: 10.5281/zenodo.8370652
message: If you use this software, please cite our article in the
  Journal of Open Source Software.
preferred-citation:
  authors:
  - family-names: Green
    given-names: Beniamino
    orcid: "https://orcid.org/0009-0006-4501-597X"
  date-published: 2023-09-26
  doi: 10.21105/joss.05693
  issn: 2475-9066
  issue: 89
  journal: Journal of Open Source Software
  publisher:
    name: Open Journals
  start: 5693
  title: "Zoomerjoin: Superlatively-Fast Fuzzy Joins"
  type: article
  url: "https://joss.theoj.org/papers/10.21105/joss.05693"
  volume: 8
title: "Zoomerjoin: Superlatively-Fast Fuzzy Joins"

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "identifier": "zoomerjoin",
  "description": "Empowers users to fuzzily-merge data frames with millions or tens of millions of rows in minutes with low memory usage. The package uses the locality sensitive hashing algorithms developed by Datar, Immorlica, Indyk and Mirrokni (2004) <doi:10.1145/997817.997857>, and Broder (1998) <doi:10.1109/SEQUEN.1997.666900> to avoid having to compare every pair of records in each dataset, resulting in fuzzy-merges that finish in linear time.",
  "name": "zoomerjoin: Superlatively Fast Fuzzy Joins",
  "codeRepository": "https://beniamino.org/zoomerjoin/",
  "issueTracker": "https://github.com/beniaminogreen/zoomerjoin/issues/",
  "license": "https://spdx.org/licenses/GPL-3.0",
  "version": "0.1.4",
  "programmingLanguage": {
    "@type": "ComputerLanguage",
    "name": "R",
    "url": "https://r-project.org"
  },
  "runtimePlatform": "R version 4.3.2 (2023-10-31)",
  "author": [
    {
      "@type": "Person",
      "givenName": "Beniamino",
      "familyName": "Green",
      "email": "beniamino.green@yale.edu"
    }
  ],
  "contributor": [
    {
      "@type": "Person",
      "givenName": "Etienne",
      "familyName": "Bacher",
      "email": "etienne.bacher@protonmail.com"
    },
    {
      "@type": "Organization",
      "name": "The authors of the dependency Rust crates"
    }
  ],
  "copyrightHolder": [
    {
      "@type": "Person",
      "givenName": "Beniamino",
      "familyName": "Green",
      "email": "beniamino.green@yale.edu"
    },
    {
      "@type": "Organization",
      "name": "The authors of the dependency Rust crates"
    }
  ],
  "maintainer": [
    {
      "@type": "Person",
      "givenName": "Beniamino",
      "familyName": "Green",
      "email": "beniamino.green@yale.edu"
    }
  ],
  "softwareSuggestions": [
    {
      "@type": "SoftwareApplication",
      "identifier": "babynames",
      "name": "babynames",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=babynames"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "covr",
      "name": "covr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=covr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "fuzzyjoin",
      "name": "fuzzyjoin",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=fuzzyjoin"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "igraph",
      "name": "igraph",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=igraph"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "knitr",
      "name": "knitr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=knitr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "rmarkdown",
      "name": "rmarkdown",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=rmarkdown"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "stringdist",
      "name": "stringdist",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=stringdist"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "testthat",
      "name": "testthat",
      "version": ">= 3.0.0",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=testthat"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "tidyverse",
      "name": "tidyverse",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tidyverse"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "purrr",
      "name": "purrr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=purrr"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "microbenchmark",
      "name": "microbenchmark",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=microbenchmark"
    },
    {
      "@type": "SoftwareApplication",
      "identifier": "profmem",
      "name": "profmem",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=profmem"
    }
  ],
  "softwareRequirements": {
    "1": {
      "@type": "SoftwareApplication",
      "identifier": "dplyr",
      "name": "dplyr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=dplyr"
    },
    "2": {
      "@type": "SoftwareApplication",
      "identifier": "tibble",
      "name": "tibble",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tibble"
    },
    "3": {
      "@type": "SoftwareApplication",
      "identifier": "tidyr",
      "name": "tidyr",
      "provider": {
        "@id": "https://cran.r-project.org",
        "@type": "Organization",
        "name": "Comprehensive R Archive Network (CRAN)",
        "url": "https://cran.r-project.org"
      },
      "sameAs": "https://CRAN.R-project.org/package=tidyr"
    },
    "4": {
      "@type": "SoftwareApplication",
      "identifier": "R",
      "name": "R",
      "version": ">= 2.10"
    },
    "SystemRequirements": "Cargo (>= 1.56) (Rust's package manager), rustc"
  },
  "fileSize": "350841.715KB",
  "contIntegration": "https://app.codecov.io/gh/beniaminogreen/zoomerjoin?branch=main",
  "developmentStatus": "https://lifecycle.r-lib.org/articles/stages.html#experimental"
}

GitHub Events

Total
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 2
  • Pull request event: 1
  • Fork event: 1
Last Year
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 2
  • Pull request event: 1
  • Fork event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 293
  • Total Committers: 4
  • Avg Commits per committer: 73.25
  • Development Distribution Score (DDS): 0.106
Past Year
  • Commits: 9
  • Committers: 2
  • Avg Commits per committer: 4.5
  • Development Distribution Score (DDS): 0.111
Top Committers
Name Email Commits
Beniamino Green b****n@g****m 262
Etienne Bacher 5****r 28
Josiah Parry j****y@g****m 2
Florian Caro f****0@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 25
  • Total pull requests: 79
  • Average time to close issues: 26 days
  • Average time to close pull requests: about 5 hours
  • Total issue authors: 9
  • Total pull request authors: 4
  • Average comments per issue: 3.04
  • Average comments per pull request: 0.43
  • Merged pull requests: 72
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 12 hours
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 6.0
  • Average comments per pull request: 1.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • beniaminogreen (10)
  • etiennebacher (6)
  • wincowgerDEV (3)
  • floriancaro (1)
  • aidanhorn (1)
  • JosiahParry (1)
  • werkstattcodes (1)
  • sfd99 (1)
Pull Request Authors
  • beniaminogreen (63)
  • etiennebacher (32)
  • JosiahParry (3)
  • floriancaro (2)

Packages

  • Total packages: 1
  • Total downloads:
    • cran 225 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 4
  • Total maintainers: 1
cran.r-project.org: zoomerjoin

Superlatively Fast Fuzzy Joins

  • Versions: 4
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 225 Last month
Rankings
Dependent packages count: 28.2%
Dependent repos count: 36.1%
Average: 49.6%
Downloads: 84.5%
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/R-CMD-check.yaml actions
  • actions/checkout v2 composite
  • dtolnay/rust-toolchain master composite
  • r-lib/actions/check-r-package v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
.github/workflows/pkgdown.yaml actions
  • JamesIves/github-pages-deploy-action v4.4.1 composite
  • actions/checkout v3 composite
  • r-lib/actions/setup-pandoc v2 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
src/rust/Cargo.lock cargo
  • autocfg 1.1.0
  • bitflags 1.3.2
  • cfg-if 1.0.0
  • crossbeam-channel 0.5.6
  • crossbeam-deque 0.8.2
  • crossbeam-epoch 0.9.13
  • crossbeam-utils 0.8.14
  • dashmap 5.4.0
  • either 1.8.0
  • extendr-api 0.3.1
  • extendr-engine 0.3.1
  • extendr-macros 0.3.1
  • getrandom 0.2.8
  • hashbrown 0.12.3
  • hermit-abi 0.2.6
  • lazy_static 1.4.0
  • libR-sys 0.3.0
  • libc 0.2.139
  • libm 0.2.6
  • lock_api 0.4.9
  • matrixmultiply 0.3.2
  • memoffset 0.7.1
  • ndarray 0.15.6
  • ndarray-rand 0.14.0
  • nohash-hasher 0.2.0
  • num-complex 0.4.3
  • num-integer 0.1.45
  • num-traits 0.2.15
  • num_cpus 1.15.0
  • once_cell 1.17.0
  • parking_lot_core 0.9.6
  • paste 1.0.11
  • ppv-lite86 0.2.17
  • proc-macro2 1.0.50
  • quote 1.0.23
  • rand 0.8.5
  • rand_chacha 0.3.1
  • rand_core 0.6.4
  • rand_distr 0.4.3
  • rawpointer 0.2.1
  • rayon 1.6.1
  • rayon-core 1.10.1
  • redox_syscall 0.2.16
  • rustc-hash 1.1.0
  • scopeguard 1.1.0
  • smallvec 1.10.0
  • syn 1.0.107
  • unicode-ident 1.0.6
  • wasi 0.11.0+wasi-snapshot-preview1
  • winapi 0.3.9
  • winapi-i686-pc-windows-gnu 0.4.0
  • winapi-x86_64-pc-windows-gnu 0.4.0
  • windows-sys 0.42.0
  • windows_aarch64_gnullvm 0.42.1
  • windows_aarch64_msvc 0.42.1
  • windows_i686_gnu 0.42.1
  • windows_i686_msvc 0.42.1
  • windows_x86_64_gnu 0.42.1
  • windows_x86_64_gnullvm 0.42.1
  • windows_x86_64_msvc 0.42.1
DESCRIPTION cran
  • dplyr * imports
  • babynames * suggests
  • fuzzyjoin * suggests
  • knitr * suggests
  • preferably * suggests
  • rmarkdown * suggests
  • testthat >= 3.0.0 suggests
  • tibble * suggests
.github/workflows/test-coverage.yaml actions
  • actions/checkout v3 composite
  • actions/upload-artifact v3 composite
  • r-lib/actions/setup-r v2 composite
  • r-lib/actions/setup-r-dependencies v2 composite
src/rust/Cargo.toml cargo