treeminhash
TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
✓DOI references
Found 2 DOI reference(s) in README -
✓Academic publication links
Links to: arxiv.org, zenodo.org -
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (12.2%) to scientific vocabulary
Keywords
Repository
TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation
Statistics
- Stars: 14
- Watchers: 3
- Forks: 3
- Open Issues: 1
- Releases: 1
Topics
Metadata Files
README.md
TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation
TreeMinHash is a sketching algorithm for weighted sets. It is able to compute signatures that can be used for weighted Jaccard similarity estimation and locality-sensitive hashing. The algorithm requires multiple passes over the data and its time complexity is O(n + m log m) where n denotes the size of the weighted set (the number of elements with weight > 0) and m denotes the signature size (sketch size).
TreeMinHash combines several ideas of recently proposed algorithms. It uses a tree-like splitting of the weight domain as proposed by BagMinHash. Compared to TreeMinHash it uses a coarser weight discretization. To incorporate the values of the weights exactly, it uses rejection sampling as recently proposed by DartMinHash [2]. Furthermore, similar to DartMinHash, TreeMinHash estimates the stop limit in a first pass. In contrast, BagMinHash must update the stop limit permanently. We also use sampling without replacement for the selection of signature components as was already done before by SuperMinHash [3] and ProbMinHash [4].
The slides of a recent presentation which also covered the basic ideas of TreeMinHash can be found on SlideShare.
Results
We compared the performance of BagMinHash2 [1], DartMinHash [2], improved consistent weighted sampling (ICWS) [5], and TreeMinHash. The test setup was essentially the same as described in [4].
The performance results below show that the calculation time of TreeMinHash is independent of the weight sum, unlike DartMinHash. Furthermore, TreeMinHash is always faster for very small input.
For verification we used synthetically generated weighted sets for which the weighted Jaccard similarity can be calculated in advance as described in [4]. The results below show that the relative empirical MSE for all tested algorithms is within the expected (gray) range.
Steps to Reproduce the Results and Figures on Windows 11
Install Windows Subsystem for Linux (WSL) with Ubuntu 22.04.1 LTS
Install required packages:
sudo apt install gradle g++ libboost-dev python3-matplotlib python3-scipy texlive-full makeClone repository including submodules:
git clone --recursive https://github.com/oertl/treeminhash.gitDownload and compile xxHash as needed by BagMinHash:
cd treeminhash/bagminhash/c++/xxhash wget https://github.com/Cyan4973/xxHash/archive/v0.8.1.zip unzip v0.8.1.zip cd xxHash-0.8.1 make lib cp libxxhash.a .. cp xxhash.h .. cd ../../../..Run simulations in
treeminhashdirectory (takes several hours):gradle executeGenerate figures:
gradle figures
Sketches for Inner Product Estimation
It has been proposed to use weighted minwise hashing to create sketches for vectors that can be used for inner product estimation [6]. innerproducttest.cpp demonstrates how TreeMinHash can be combined with their ideas. The result is a cleaner and probably faster algorithm, since discretization of the input vectors is not required as in the original approach.
References
[1] Ertl, O. (2018). Bagminhash-minwise hashing algorithm for weighted sets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1368-1377). [paper] [GitHub]
[2] Christiani, T. (2020). DartMinHash: Fast Sketching for Weighted Sets. arXiv preprint arXiv:2005.11547. [paper] [GitHub]
[3] Ertl, O. (2017). Superminhash-A new minwise hashing algorithm for jaccard similarity estimation. arXiv preprint arXiv:1706.05698. [paper]
[4] Ertl, O. (2019). ProbMinHash--A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity. arXiv preprint arXiv:1911.00675. [paper] [GitHub]
[5] Ioffe, S. (2010). Improved consistent sampling, weighted minhash and l1 sketching. In 2010 IEEE International Conference on Data Mining (pp. 246-255). [paper]
[6] Bessa, A., et al. (2023). Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation. arXiv preprint arXiv:2301.05811. [paper]
Owner
- Name: Otmar Ertl
- Login: oertl
- Kind: user
- Location: Linz, Austria
- Company: Dynatrace Research
- Website: research.dynatrace.com
- Twitter: otmar_ertl
- Repositories: 1
- Profile: https://github.com/oertl
Citation (CITATION.cff)
cff-version: 1.2.0 message: "If you use this software, please cite it as below." authors: - family-names: "Ertl" given-names: "Otmar" orcid: "https://orcid.org/0000-0001-7322-6332" title: "TreeMinHash: Fast Sketching for Weighted Jaccard Similarity Estimation" date-released: 2020-08-24 url: "https://github.com/oertl/treeminhash"
GitHub Events
Total
- Create event: 1
- Issues event: 1
- Release event: 1
- Issue comment event: 3
- Push event: 1
Last Year
- Create event: 1
- Issues event: 1
- Release event: 1
- Issue comment event: 3
- Push event: 1
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jianshu93 (1)