https://github.com/bytehamster/consensusrecsplit

ConsensusRecSplit is a perfect hash function with very small space consumption based on Combined Search and Encoding of Successful Seeds (Consensus)

https://github.com/bytehamster/consensusrecsplit

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.2%) to scientific vocabulary
Last synced: 10 months ago · JSON representation

Repository

ConsensusRecSplit is a perfect hash function with very small space consumption based on Combined Search and Encoding of Successful Seeds (Consensus)

Basic Info
  • Host: GitHub
  • Owner: ByteHamster
  • License: gpl-3.0
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 229 KB
Statistics
  • Stars: 3
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme License

README.md

ConsensusRecSplit

License: GPL v3 Build status

A minimal perfect hash function (MPHF) maps a set S of n keys to the first n integers without collisions. Perfect hash functions have applications in databases, bioinformatics, and as a building block of various space-efficient data structures.

ConsensusRecSplit is a perfect hash function with very small space consumption. It is based on Combined Search and Encoding of Successful Seeds (Consensus), applied to the recursive splitting idea of RecSplit. Compared to previous approaches, ConsensusRecSplit achieves a space consumption that is orders of magnitude closer to the lower bound. On 100 million keys and about an hour of construction time, it achieves a stunning 1.4448 bits per key, while the lower bound is 1.4427 bits per key. RecSplit achieves 1.6127 bits per key in the same construction time.

While the RecSplit tree with Consensus has polynomial running time, the first splittings touch a large number of keys, hurting cache locality. This is why we combine it with a simple threshold-based k-perfect hash function. We then perform combined search and encoding on the splitting seeds, while also combining the k-perfect buckets with one another. The k-perfect hash function itself currently does not use Consensus, even though it should in the future to improve space efficiency. The bucket size (k) gives a trade-off between query performance, construction performance, and space consumption. Rather large k such as 32768 work best in our experiments.

Construction Performance with 100M Keys

Plot

Library usage

Clone this repository (with submodules) and add the following to your CMakeLists.txt.

add_subdirectory(path/to/ConsensusRecSplit) target_link_libraries(YourTarget PRIVATE ConsensusRecSplit)

You can construct a ConsensusRecSplit perfect hash function as follows.

cpp std::vector<std::string> keys = {"abc", "def", "123", "456"}; consensus::ConsensusRecSplit</* k */ 4096, /* overhead */ 0.01> hashFunc(keys); std::cout << hashFunc("abc") << std::endl;

Licensing

This code is licensed under the GPLv3.

If you use this work in an academic context or publication, please cite our paper:

@article{lehmann2025consensus, author = {Hans-Peter Lehmann and Peter Sanders and Stefan Walzer and Jonatan Ziegler}, title = {Combined Search and Encoding for Seeds, with an Application to Minimal Perfect Hashing}, journal = {CoRR}, volume = {abs/2502.05613}, year = {2025}, doi = {10.48550/ARXIV.2502.05613} }

Owner

  • Login: ByteHamster
  • Kind: user
  • Location: Germany
  • Company: Karlsruhe Institute of Technology

I'm a PhD student at Karlsruhe Institute of Technology. In my freetime, I maintain AntennaPod and contribute to other projects like K-9 Mail and Baikal Server.

GitHub Events

Total
  • Watch event: 3
  • Push event: 3
  • Public event: 1
Last Year
  • Watch event: 3
  • Push event: 3
  • Public event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/build.yml actions
  • actions/checkout v4 composite