lemonhash

Learned Monotone Minimal Perfect Hashing

https://github.com/bytehamster/lemonhash

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.9%) to scientific vocabulary

Keywords

data-structures hashing learned-index
Last synced: 6 months ago · JSON representation ·

Repository

Learned Monotone Minimal Perfect Hashing

Basic Info
  • Host: GitHub
  • Owner: ByteHamster
  • License: gpl-3.0
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 548 KB
Statistics
  • Stars: 25
  • Watchers: 6
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
data-structures hashing learned-index
Created over 3 years ago · Last pushed 11 months ago
Metadata Files
Readme License Citation

README.md

Learned Monotone Minimal Perfect Hashing

Logo

License: GPL v3 Build status

A monotone minimal perfect hash function (MMPHF) maps a set S of n input keys to the first n integers without collisions. At the same time, it respects the natural order of the input universe. In other words, it maps each input key to its rank. MMPHFs have many applications in databases and space-efficient data structures.

LeMonHash (Learned Monotone Minimal Perfect Hashing) is a novel MMPHF that learns about regularities in the input data to achieve significant space and performance improvements. It uses the PGM-Index to calculate a learned rank estimate for each key and then solves collisions between these estimates using the retrieval data structure BuRR. Compared to competitors that are mostly based on tree-like data structures, LeMonHash is a lot more flat and therefore faster to query. LeMonHash dominates most competitors in terms of construction throughput, query throughput, and space consumption -- simultaneously. We also give a variant for variable-length strings that achieves significantly faster queries than competitors.

Usage

Requirements:

  • GCC 11 or later
  • boost

Clone the repository (as a submodule) and add the following to your CMakeLists.txt.

cmake add_subdirectory(path/to/LeMonHash) target_link_libraries(YourTarget PRIVATE LeMonHash)

Then you can use the straight-forward interface of LeMonHash:

cpp std::vector<uint64_t> inputData {0, 1, 7, 15, 23, 42, 250}; lemonhash::LeMonHash<> hashFunc(inputData); for (uint64_t x : inputData) { std::cout << x << ": \t" << hashFunc(x) << std::endl; }

Query performance

Plots preview

License

This code is licensed under the GPLv3. If you use the project in an academic context or publication, please cite our paper:

bibtex @inproceedings{DBLP:conf/esa/FerraginaL0V23, author = {Paolo Ferragina and Hans{-}Peter Lehmann and Peter Sanders and Giorgio Vinciguerra}, title = {Learned Monotone Minimal Perfect Hashing}, booktitle = {{ESA}}, series = {LIPIcs}, volume = {274}, pages = {46:1--46:17}, publisher = {Schloss Dagstuhl - Leibniz-Zentrum f{\"{u}}r Informatik}, year = {2023}, doi = {10.4230/LIPICS.ESA.2023.46} }

The code of the experiments comparing LeMonHash to competitors from the literature is available here.

Owner

  • Login: ByteHamster
  • Kind: user
  • Location: Germany
  • Company: Karlsruhe Institute of Technology

I'm a PhD student at Karlsruhe Institute of Technology. In my freetime, I maintain AntennaPod and contribute to other projects like K-9 Mail and Baikal Server.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software in an academic context or publication, please cite it as below."
authors:
- family-names: "Ferragina"
  given-names: "Paolo"
  orcid: "https://orcid.org/0000-0003-1353-360X"
- family-names: "Lehmann"
  given-names: "Hans-Peter"
  orcid: "https://orcid.org/0000-0002-0474-1805"
- family-names: "Sanders"
  given-names: "Peter"
  orcid: "https://orcid.org/0000-0003-3330-9349"
- family-names: "Vinciguerra"
  given-names: "Giorgio"
  orcid: "https://orcid.org/0000-0003-0328-7791"
title: "Learned Monotone Minimal Perfect Hashing"
preferred-citation:
  type: conference-paper
  title: "Learned Monotone Minimal Perfect Hashing"
  authors:
    - family-names: "Ferragina"
      given-names: "Paolo"
      orcid: "https://orcid.org/0000-0003-1353-360X"
    - family-names: "Lehmann"
      given-names: "Hans-Peter"
      orcid: "https://orcid.org/0000-0002-0474-1805"
    - family-names: "Sanders"
      given-names: "Peter"
      orcid: "https://orcid.org/0000-0003-3330-9349"
    - family-names: "Vinciguerra"
      given-names: "Giorgio"
      orcid: "https://orcid.org/0000-0003-0328-7791"
  doi: "10.4230/LIPICS.ESA.2023.46"
  journal: "ESA"
  year: 2023

GitHub Events

Total
  • Watch event: 4
  • Push event: 2
Last Year
  • Watch event: 4
  • Push event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 205
  • Total Committers: 2
  • Avg Commits per committer: 102.5
  • Development Distribution Score (DDS): 0.38
Past Year
  • Commits: 3
  • Committers: 1
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
ByteHamster i****o@b****m 127
Giorgio Vinciguerra i@g****m 78
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels