fastjaccard
An R package that computes Jaccard similarity in parallel
Repository
Basic Info
- Host: GitHub
- Owner: alrobles
- Language: C
- Default Branch: main
- Size: 40 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 1
Created almost 3 years ago
· Last pushed about 2 years ago
Metadata Files
Readme
Citation
README.Rmd
---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# fastJaccard
This package is designed to compute the Jaccard similarity
for binary matrices in parallel using Rcpp and RcppParallel.
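
For reference, the Jaccard similarity of two binary vectors is the number of positions where both are 1 divided by the number of positions where at least one is 1. The following is a minimal base-R sketch of that definition; the helper name `jaccard_pair` is hypothetical and is not part of fastJaccard.

```r
# Hypothetical helper, not part of fastJaccard: Jaccard similarity of two
# binary (0/1) vectors of equal length, computed in base R.
jaccard_pair <- function(p, q) {
  stopifnot(length(p) == length(q))
  both   <- sum(p == 1 & q == 1)  # size of the intersection
  either <- sum(p == 1 | q == 1)  # size of the union
  both / either                   # NaN when both vectors are all zeros
}

p <- c(1, 1, 0, 0, 1)
q <- c(1, 0, 0, 1, 1)
jaccard_pair(p, q)
#> [1] 0.5
```

Here the vectors share two positions with a 1 (the first and the last), and four positions have a 1 in at least one vector, so the similarity is 2/4.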
## Installation
You can install the development version of fastJaccard from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("alrobles/fastJaccard")
```
## Example
As an example, we create a binary matrix and compare the package against a baseline written in plain R.
### Implementation in R
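As a baseline, we implement the pairwise Jaccard similarity between the rows of a matrix in plain R. The function is named `jaccard_distance`, but it returns similarities (intersection over union):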
``` r
jaccard_distance <- function(mat) {
  # Number of positions where both binary vectors are 1
  intersection <- function(p, q) {
    sum(p + q == 2)
  }
  # Number of positions where at least one vector is 1
  union <- function(p, q) {
    sum(p) + sum(q) - intersection(p, q)
  }
  n <- nrow(mat)
  # Only off-diagonal entries are filled in; the diagonal is left at 0
  res <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d1 <- intersection(mat[i, ], mat[j, ])
      d2 <- union(mat[i, ], mat[j, ])
      res[j, i] <- d1 / d2
      res[i, j] <- d1 / d2
    }
  }
  res
}
```
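
The double loop above can also be written with matrix products, which is usually much faster in plain R and gives a second baseline to compare against. The sketch below is not part of fastJaccard: `jaccard_matrix_vec` is a hypothetical name, and it assumes a 0/1 matrix with one observation per row. Like the loop version, it leaves the diagonal at 0.

```r
# Hypothetical vectorized baseline, not part of fastJaccard.
# Assumes `mat` is a binary (0/1) matrix with one observation per row.
jaccard_matrix_vec <- function(mat) {
  inter <- tcrossprod(mat)                 # inter[i, j] = sum(mat[i, ] * mat[j, ])
  sizes <- rowSums(mat)                    # number of 1s in each row
  uni   <- outer(sizes, sizes, "+") - inter  # |A| + |B| - |A and B|
  res   <- inter / uni                     # NaN where both rows are all zeros
  diag(res) <- 0                           # match the loop version's diagonal
  res
}

small <- rbind(c(1, 1, 0, 0),
               c(1, 0, 0, 1),
               c(0, 0, 1, 1))
sim <- jaccard_matrix_vec(small)
```

For binary input, `tcrossprod(mat)` counts the shared 1s for every pair of rows in one call, so no explicit loop over pairs is needed.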
### Benchmarks
We now create a random binary matrix and run both implementations:
``` r
library(fastJaccard)
## basic example code
# create a random binary matrix with n rows and k columns
n <- 1000
k <- 2000
m <- matrix(ifelse(runif(n * k) > 0.5, 1, 0), ncol = k)
# ensure that the serial and parallel versions give the same result
r_res <- jaccard_distance(m)
rcpp_parallel_res <- fastJaccard::jaccard_fast_matrix(m)
# compare absolute differences to allow for floating-point precision
stopifnot(all(abs(rcpp_parallel_res - r_res) < 1e-10))
# compare performance
library(rbenchmark)
res <- benchmark(jaccard_distance(m),
                 jaccard_fast_matrix(m),
                 replications = 30,
                 order = "relative")
res[, 1:4]
```
## Jaccard similarity for a pair of vectors
We can also compute the Jaccard similarity for a pair of binary vectors:
``` r
set.seed(1235)
x = rbinom(1e6,1,.5)
y = rbinom(1e6,1,.5)
fastJaccard::jaccard_fast(x, y)
```
Owner
- Name: Angel Luis Robles Fernández
- Login: alrobles
- Kind: user
- Location: Xalapa Mexico
- Company: Vida Analytics
- Website: https://vidaanalytics.com/
- Repositories: 60
- Profile: https://github.com/alrobles
PhD student at Arizona State University
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Ángel Luis"
    given-names: "Robles Fernández"
    orcid: "https://orcid.org/0000-0002-4674-4270"
title: "fastJaccard. An r package that computes Jaccard similarity running on parallel"
version: 1.0.4
doi: 10.5281/zenodo.8121171
date-released: 2017-12-18
url: "https://github.com/alrobles/fastJaccard"