fastjaccard

An R package that computes Jaccard similarity in parallel

https://github.com/alrobles/fastjaccard

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (11.7%) to scientific vocabulary

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: alrobles
Language: C
Default Branch: main
Size: 40 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 1

Created about 3 years ago · Last pushed over 2 years ago

Metadata Files

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# fastJaccard




This package is designed to compute the Jaccard similarity
for binary matrices in parallel using Rcpp and RcppParallel.
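
For reference, the textbook definition of the Jaccard similarity between two binary vectors $x$ and $y$ of equal length is sketched below. This is the standard formula, not something taken from the package source; the symbols $x_k$ and $y_k$ are notation introduced here for illustration.

$$
J(x, y) = \frac{|x \cap y|}{|x \cup y|}
        = \frac{\sum_k x_k y_k}{\sum_k x_k + \sum_k y_k - \sum_k x_k y_k}
$$

The numerator counts positions where both vectors are 1, and the denominator counts positions where at least one of them is 1.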

## Installation

You can install the development version of fastJaccard from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("alrobles/fastJaccard")
```

## Example

We can create an example binary matrix and compare the package against a baseline written in plain R.

### Implementation in R
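As a baseline we implement the Jaccard similarity in plain R. The function is named `jaccard_distance`, but it returns intersection over union for each pair of rows, which is a similarity.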

``` r
# Baseline: pairwise Jaccard similarity between the rows of a binary matrix
jaccard_distance <- function(mat) {
  # number of positions where both vectors are 1
  intersection <- function(p, q) {
    sum(p + q == 2)
  }
  # |p| + |q| - |p and q|
  union <- function(p, q) {
    sum(p) + sum(q) - intersection(p, q)
  }

  n <- nrow(mat)
  res <- matrix(0, n, n)  # the diagonal is left at 0

  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      sim <- intersection(mat[i, ], mat[j, ]) / union(mat[i, ], mat[j, ])
      res[j, i] <- sim
      res[i, j] <- sim
    }
  }
  res
}
```
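
The double loop above can also be written with matrix algebra, which makes the intersection and union counts explicit. The following is a minimal base R sketch, not code from the package; the function name `jaccard_similarity_matrix` and the `toy` matrix are hypothetical names introduced here for illustration.

``` r
# Vectorised reference: Jaccard similarity between all pairs of rows
# of a binary (0/1) matrix, using base R only.
jaccard_similarity_matrix <- function(mat) {
  # Entry (i, j) counts the columns where rows i and j are both 1.
  inter <- tcrossprod(mat)
  # Number of ones in each row.
  ones <- rowSums(mat)
  # |A union B| = |A| + |B| - |A intersect B|
  uni <- outer(ones, ones, "+") - inter
  sim <- inter / uni
  # The loop baseline leaves the diagonal at 0; match that convention here.
  diag(sim) <- 0
  sim
}

toy <- matrix(c(1, 1, 0, 0,
                1, 0, 1, 0,
                0, 1, 1, 1), nrow = 3, byrow = TRUE)
toy_sim <- jaccard_similarity_matrix(toy)
toy_sim
```

Rows 1 and 2 share one column out of three occupied columns, so `toy_sim[1, 2]` is 1/3. As in the loop version, a pair of all-zero rows gives `0 / 0`, which is `NaN`.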
### Benchmarks

We now create a random binary matrix and run both implementations.


``` r
library(fastJaccard)
## basic example code

# create a matrix
n = 1000
k = 2000
m = matrix(ifelse(runif(n*k) > 0.5, 1, 0), ncol = k)

# ensure that serial and parallel versions give the same result
r_res <- jaccard_distance(m)
rcpp_parallel_res <- fastJaccard::jaccard_fast_matrix(m)
stopifnot(all(abs(rcpp_parallel_res - r_res) < 1e-10)) ## tolerate floating-point differences

# compare performance
library(rbenchmark)
res <- benchmark(jaccard_distance(m),
                 jaccard_fast_matrix(m),
                 replications = 30,
                 order="relative")
res[,1:4]
```

## Jaccard similarity for a pair of vectors

We can also compute the Jaccard similarity for a pair of binary vectors.

``` r
set.seed(1235)
x = rbinom(1e6,1,.5)
y = rbinom(1e6,1,.5)

fastJaccard::jaccard_fast(x, y)
```
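
To sanity-check the result on small inputs, a plain R reference for two binary vectors can be sketched as follows. The function name `jaccard_reference` is hypothetical, and it is an assumption (not confirmed by the package source) that `jaccard_fast` follows the same intersection-over-union definition.

``` r
# Plain R reference for the Jaccard similarity of two binary (0/1) vectors.
jaccard_reference <- function(x, y) {
  stopifnot(length(x) == length(y))
  both <- sum(x == 1 & y == 1)    # positions where both are 1
  either <- sum(x == 1 | y == 1)  # positions where at least one is 1
  both / either
}

a <- c(1, 1, 0, 0, 1)
b <- c(1, 0, 0, 1, 1)
jaccard_reference(a, b)
#> [1] 0.5
```

Here the vectors share two positions (1 and 5) out of four occupied positions (1, 2, 4 and 5), which gives 0.5.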

Owner

Name: Angel Luis Robles Fernández
Login: alrobles
Kind: user
Location: Xalapa Mexico
Company: Vida Analytics

Website: https://vidaanalytics.com/
Repositories: 60
Profile: https://github.com/alrobles

PhD student at Arizona State University

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Robles Fernández"
  given-names: "Ángel Luis"
  orcid: "https://orcid.org/0000-0002-4674-4270"
title: "fastJaccard. An r package that computes Jaccard similarity running on parallel"
version: 1.0.4
doi: 10.5281/zenodo.8121171
date-released: 2017-12-18
url: "https://github.com/alrobles/fastJaccard"
