treeminhash

TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation

https://github.com/oertl/treeminhash

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary

Keywords

hash-algorithm jaccard jaccard-coefficient jaccard-distance jaccard-index jaccard-similarity jaccard-similarity-estimation locality-sensitive locality-sensitive-hashing lsh-algorithm minhash minwise-hashing minwise-hashing-algorithm similarity-measures similarity-metric similarity-search sketching sketching-algorithm weighted-sets
Last synced: 4 months ago · JSON representation ·

Repository

TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation

Basic Info
  • Host: GitHub
  • Owner: oertl
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 2.62 MB
Statistics
  • Stars: 14
  • Watchers: 3
  • Forks: 3
  • Open Issues: 1
  • Releases: 1
Topics
hash-algorithm jaccard jaccard-coefficient jaccard-distance jaccard-index jaccard-similarity jaccard-similarity-estimation locality-sensitive locality-sensitive-hashing lsh-algorithm minhash minwise-hashing minwise-hashing-algorithm similarity-measures similarity-metric similarity-search sketching sketching-algorithm weighted-sets
Created over 5 years ago · Last pushed 5 months ago
Metadata Files
Readme Citation

README.md

DOI

TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation

TreeMinHash is a sketching algorithm for weighted sets. It is able to compute signatures that can be used for weighted Jaccard similarity estimation and locality-sensitive hashing. The algorithm requires multiple passes over the data and its time complexity is O(n + m log m) where n denotes the size of the weighted set (the number of elements with weight > 0) and m denotes the signature size (sketch size).

TreeMinHash combines several ideas of recently proposed algorithms. It uses a tree-like splitting of the weight domain as proposed by BagMinHash. Compared to TreeMinHash it uses a coarser weight discretization. To incorporate the values of the weights exactly, it uses rejection sampling as recently proposed by DartMinHash [2]. Furthermore, similar to DartMinHash, TreeMinHash estimates the stop limit in a first pass. In contrast, BagMinHash must update the stop limit permanently. We also use sampling without replacement for the selection of signature components as was already done before by SuperMinHash [3] and ProbMinHash [4].

The slides of a recent presentation which also covered the basic ideas of TreeMinHash can be found on SlideShare.

Results

We compared the performance of BagMinHash2 [1], DartMinHash [2], improved consistent weighted sampling (ICWS) [5], and TreeMinHash. The test setup was essentially the same as described in [4].

The performance results below show that the calculation time of TreeMinHash is independent of the weight sum, unlike DartMinHash. Furthermore, TreeMinHash is always faster for very small input.

speed_charts.svg

For verification we used synthetically generated weighted sets for which the weighted Jaccard similarity can be calculated in advance as described in [4]. The results below show that the relative empirical MSE for all tested algorithms is within the expected (gray) range.

paper/error_charts.svg

Steps to Reproduce the Results and Figures on Windows 11

  1. Install Windows Subsystem for Linux (WSL) with Ubuntu 22.04.1 LTS

  2. Install required packages: sudo apt install gradle g++ libboost-dev python3-matplotlib python3-scipy texlive-full make

  3. Clone repository including submodules: git clone --recursive https://github.com/oertl/treeminhash.git

  4. Download and compile xxHash as needed by BagMinHash: cd treeminhash/bagminhash/c++/xxhash wget https://github.com/Cyan4973/xxHash/archive/v0.8.1.zip unzip v0.8.1.zip cd xxHash-0.8.1 make lib cp libxxhash.a .. cp xxhash.h .. cd ../../../..

  5. Run simulations in treeminhash directory (takes several hours): gradle execute

  6. Generate figures: gradle figures

Sketches for Inner Product Estimation

It has been proposed to use weighted minwise hashing to create sketches for vectors that can be used for inner product estimation [6]. innerproducttest.cpp demonstrates how TreeMinHash can be combined with their ideas. The result is a cleaner and probably faster algorithm, since discretization of the input vectors is not required as in the original approach.

References

[1] Ertl, O. (2018). Bagminhash-minwise hashing algorithm for weighted sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1368-1377). [paper] [GitHub]

[2] Christiani, T. (2020). DartMinHash: Fast Sketching for Weighted Sets. arXiv preprint arXiv:2005.11547. [paper] [GitHub]

[3] Ertl, O. (2017). Superminhash-A new minwise hashing algorithm for jaccard similarity estimation. arXiv preprint arXiv:1706.05698. [paper]

[4] Ertl, O. (2019). ProbMinHash--A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity. arXiv preprint arXiv:1911.00675. [paper] [GitHub]

[5] Ioffe, S. (2010). Improved consistent sampling, weighted minhash and l1 sketching. In 2010 IEEE International Conference on Data Mining (pp. 246-255). [paper]

[6] Bessa, A., et al. (2023). Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation. arXiv preprint arXiv:2301.05811. [paper]

Owner

  • Name: Otmar Ertl
  • Login: oertl
  • Kind: user
  • Location: Linz, Austria
  • Company: Dynatrace Research

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Ertl"
  given-names: "Otmar"
  orcid: "https://orcid.org/0000-0001-7322-6332"
title: "TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation"
date-released: 2020-08-24
url: "https://github.com/oertl/treeminhash"

GitHub Events

Total
  • Create event: 1
  • Issues event: 1
  • Release event: 1
  • Issue comment event: 3
  • Push event: 1
Last Year
  • Create event: 1
  • Issues event: 1
  • Release event: 1
  • Issue comment event: 3
  • Push event: 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jianshu93 (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels