fastjaccard

An R package that computes Jaccard similarity in parallel

https://github.com/alrobles/fastjaccard

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: alrobles
  • Language: C
  • Default Branch: main
  • Size: 40 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Created almost 3 years ago · Last pushed about 2 years ago
Metadata Files
Readme Citation

README.Rmd

---
output: github_document
---



```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# fastJaccard




This package is designed to compute the Jaccard similarity
for binary matrices in parallel using Rcpp and RcppParallel.
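
For reference, the quantity being computed can be written as follows. This is the standard definition of the Jaccard similarity, with $A$ and $B$ standing for the sets of positions where two binary vectors equal 1 (the symbols are introduced here for illustration and do not appear in the package):

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}
$$

As a small worked case, take $x = (1, 1, 0, 1, 0)$ and $y = (1, 0, 0, 1, 1)$. Both vectors are 1 at positions 1 and 4, so $|A \cap B| = 2$, while $|A| = |B| = 3$, giving

$$
J = \frac{2}{3 + 3 - 2} = 0.5
$$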

## Installation

You can install the development version of fastJaccard from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("alrobles/fastJaccard")
```

## Example

As an example, we create a binary matrix and compare the package against a baseline written in plain R.

### Implementation in R
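As a baseline, we implement the pairwise Jaccard similarity between the rows of a matrix in plain R. Despite its name, `jaccard_distance` returns the similarity (intersection over union), not the distance (one minus the similarity).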

``` r
jaccard_distance <- function(mat) {
  # number of positions where both binary vectors are 1
  intersection <- function(p, q) {
    sum(p + q == 2)
  }
  # number of positions where at least one vector is 1
  union <- function(p, q) {
    sum(p) + sum(q) - intersection(p, q)
  }

  n <- nrow(mat)
  res <- matrix(0, n, n)

  # fill both triangles of the symmetric result; the diagonal stays at 0
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      sim <- intersection(mat[i, ], mat[j, ]) / union(mat[i, ], mat[j, ])
      res[j, i] <- sim
      res[i, j] <- sim
    }
  }
  res
}
```
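
The double loop above can also be expressed with matrix operations. The following is a minimal sketch in base R, not the package's implementation: `jaccard_matrix_vectorised` and `m_small` are hypothetical names introduced here for illustration. It relies on the fact that, for a 0/1 matrix, `tcrossprod(mat)` counts the shared ones for every pair of rows.

```r
# Vectorised plain-R sketch: all pairwise Jaccard similarities between rows.
# `jaccard_matrix_vectorised` is a hypothetical name, not a fastJaccard function.
jaccard_matrix_vectorised <- function(mat) {
  inter <- tcrossprod(mat)                        # inter[i, j] = ones shared by rows i and j
  sizes <- rowSums(mat)                           # number of ones in each row
  union_size <- outer(sizes, sizes, "+") - inter  # |A| + |B| - |A and B|
  inter / union_size                              # NaN where both rows are all zeros
}

# small hand-checkable input
m_small <- rbind(c(1, 1, 0, 1),
                 c(1, 0, 0, 1),
                 c(0, 0, 1, 1))
sim <- jaccard_matrix_vectorised(m_small)
sim
```

Unlike the loop-based baseline, this sketch puts 1 on the diagonal for every non-empty row, so the diagonal would need to be set to 0 (`diag(sim) <- 0`) before comparing the two results element by element.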
### Benchmarks

We now create a random binary matrix and run both implementations.


``` r
library(fastJaccard)

# create a random binary matrix with n rows and k columns
n <- 1000
k <- 2000
m <- matrix(ifelse(runif(n * k) > 0.5, 1, 0), ncol = k)

# ensure that the serial and parallel versions give the same result,
# allowing for small floating-point differences in either direction
r_res <- jaccard_distance(m)
rcpp_parallel_res <- fastJaccard::jaccard_fast_matrix(m)
stopifnot(all(abs(rcpp_parallel_res - r_res) < 1e-10))

# compare performance
library(rbenchmark)
res <- benchmark(jaccard_distance(m),
                 jaccard_fast_matrix(m),
                 replications = 30,
                 order = "relative")
res[, 1:4]
```

## Jaccard similarity for a pair of vectors

We can also compute the Jaccard similarity between two binary vectors:

``` r
set.seed(1235)
x = rbinom(1e6,1,.5)
y = rbinom(1e6,1,.5)

fastJaccard::jaccard_fast(x, y)
```
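
A result like the one above can be sanity-checked against a plain-R reference. The sketch below assumes that `jaccard_fast` returns intersection over union for 0/1 vectors, which this README does not state explicitly. `jaccard_reference` is a hypothetical helper, not part of the package. For two independent vectors with a 0.5 probability of ones, the value should land close to 1/3.

```r
# Plain-R reference: Jaccard similarity of two binary (0/1) vectors.
# `jaccard_reference` is a hypothetical helper, not part of fastJaccard.
jaccard_reference <- function(x, y) {
  stopifnot(length(x) == length(y))
  both <- sum(x == 1 & y == 1)    # intersection size
  either <- sum(x == 1 | y == 1)  # union size
  both / either                   # NaN when both vectors are all zeros
}

set.seed(1235)
x <- rbinom(1e4, 1, 0.5)
y <- rbinom(1e4, 1, 0.5)

ref <- jaccard_reference(x, y)
ref
```

If the package is installed, `all.equal(ref, fastJaccard::jaccard_fast(x, y))` would be one way to compare the two values under the assumption stated above.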

Owner

  • Name: Angel Luis Robles Fernández
  • Login: alrobles
  • Kind: user
  • Location: Xalapa Mexico
  • Company: Vida Analytics

PhD student at Arizona State University

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Robles Fernández"
  given-names: "Ángel Luis"
  orcid: "https://orcid.org/0000-0002-4674-4270"
title: "fastJaccard. An r package that computes Jaccard similarity running on parallel"
version: 1.0.4
doi: 10.5281/zenodo.8121171
date-released: 2017-12-18
url: "https://github.com/alrobles/fastJaccard"
