hll

Fast HyperLogLog for Python.

https://github.com/ascv/hyperloglog

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

cardinality cardinality-counter cardinality-estimation hyperloglog python
Last synced: 6 months ago · JSON representation

Repository

Fast HyperLogLog for Python.

Basic Info
  • Host: GitHub
  • Owner: ascv
  • License: mit
  • Language: C
  • Default Branch: master
  • Homepage:
  • Size: 308 KB
Statistics
  • Stars: 109
  • Watchers: 6
  • Forks: 19
  • Open Issues: 2
  • Releases: 0
Topics
cardinality cardinality-counter cardinality-estimation hyperloglog python
Created over 13 years ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

Overview

The HyperLogLog algorithm [1] is a space efficient method to estimate the cardinality of extraordinarily large datasets. This module is written in C for Python >= 3.6. It implements a 64 bit version of HyperLogLog [2] using a Murmur64A hash.

Quick start

Install Python development libraries. This step will depend on your OS. On Ubuntu 24.0.4 LTS: sudo apt install python-dev-is-python3

Install HLL: pip install HLL

Example usage: ``` from HLL import HyperLogLog

hll = HyperLogLog(10) # use 2^10 registers hll.add('some data')

estimate = hll.cardinality() print(estimate) ```

Changelog

2.4

  • Use compiler built-ins for leading zero counts if using GCC or Clang.

2.3

  • Fix bug causing cardinalities on the order of $2^{45}$ to be under-estimated.

2.2

  • Remove support for Python 2.7.

2.1

Deprecation notice: this is the last supported version for Python 2.7.x.

  • Fixed bug where HyperLogLogs of unequal sizes could be merged.
  • Fixed bug causing cardinality estimates to be off when repeatedly merging sparse HyperLogLogs loaded from a pickle dump.

2.0

  • Algorithm has been updated to a 64 bit version [2]. This fixes the spike in relative error when switching from linear counting in the original HyperLogLog algorithm.
  • Hash function has been updated to the 64 bit Murmur64A function.
  • More efficiently store registers using a combination of sparse and dense representations.
  • Improved method for counting the number of leading zeroes.
  • Changed the return type of cardinality() from float to integer.
  • Changed the return logic of add(). This method no longer always indicates if a register was updated using its return value. This behavior is only preserved in dense representation. In sparse representation, add() always returns False.
  • HyperLogLog objects pickled in 1.x and 2.x are not compatible.
  • Added get_register()
  • Added hash()
  • Added _get_meta()
  • Deprecated murmur2_hash()
  • Deprecated registers()
  • Deprecated set_register()

Documentation

HyperLogLog objects

HyperLogLog objects implement a 64 bit HyperLogLog algorithm [2]. They can be used to estimate the cardinality of very large datasets. The estimation accuracy is proportional to the number of registers. Using more registers increases the accuracy and using less registers decreases the accuracy. The number of registers is set in powers of 2 using the parameter p and defaults to p=12 or $2^{12}$ registers. ```

from hll import HyperLogLog hll = HyperLogLog() # Default to 2^12 registers hll.size() 4096 hll = HyperLogLog(3) # Use 2^3 registers hll.size() 8 for data in ['one', 'two', 'three', 'four',]: ... hll.add(data) hll.cardinality() 4 ```

HyperLogLogs use a Murmur64A hash. This function is fast and has a good uniform distribution of bits which is necessary for accurate estimations. The seed to this hash function can be set in the HyperLogLog constructor: ```

hll = HyperLogLog(p=2, seed=123456789) hll.seed() 123456789 ```

The hash function can also be called directly: ```

hll.hash('something') 393810339 ```

Individual registers can be printed: ```

for i in range(2**4): ... print(hll.get_register(i)) 0 0 3 0 4 ```

HyperLogLog objects can be merged. This is done by taking the maximum value of their respective registers: ```

A = HyperLogLog(p=4) A.add('hello') B = HyperLogLog(p=4) B.add('world') A.merge(B) A.cardinality() 2 ```

Register representation

Registers are stored using both sparse and dense representation. Originally all registers are initialized to zero. However storing all these zeroes individually is wasteful. Instead a sorted linked list [3] is used to store only registers that have been set (e.g. have a non-zero value). When this list reaches sufficient size the HyperLogLog object will switch to using dense representation where registers are stored invidiaully using 6 bits.

Sparse representation can be disabled using the sparse flag: ```

HyperLogLog(p=2, sparse=False) ```

The maximum list size for the sparse register list determines when the HyperLogLog object switches to dense representation. This can be set using max_list_size: ```

HyperLogLog(p=15, maxlistsize=10**6) ```

Traversing the sparse register list every time an item is added to the HyperLogLog to update a register is expensive. A temporary buffer is instead used to defer this operation. Items added to the HyperLogLog are first added to the temporary buffer. When the buffer is full the items are sorted and then any register updates occur. These updates can be done in one pass since both the temproary buffer and sparse register list are sorted.

The buffer size can be set using max_buffer_size: ```

HyperLogLog(p=15, maxbuffersize=10**5) ```

License

This software is released under the MIT License.

References

[1] P. Flajolet, E. Fusy, O. Gandouet, F. Meunier. "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm," Conference on the Analysis of Algorithms 2007.

[2] O. Ertl, "New Cardinality Estimation Methods for HyperLogLog Sketches," arXiv:1706.07290 [cs], June 2017.

[3] S. Heule, M. Nunkesser, A. Hall. "HyperLogLog in Practice: Algorithimic Engineering of a State of the Art Cardinality Estimation Algorithm," Proceedings of the EDBT 2013 Conference, ACM, Genoa March 2013.

Owner

  • Name: Joshua Andersen
  • Login: ascv
  • Kind: user
  • Location: Seattle
  • Company: Fullbeaker.com

GitHub Events

Total
  • Issues event: 2
  • Watch event: 9
  • Issue comment event: 2
  • Push event: 6
  • Pull request event: 2
  • Create event: 1
Last Year
  • Issues event: 2
  • Watch event: 9
  • Issue comment event: 2
  • Push event: 6
  • Pull request event: 2
  • Create event: 1

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 418
  • Total Committers: 12
  • Avg Commits per committer: 34.833
  • Development Distribution Score (DDS): 0.256
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Joshua Andersen j****n@g****m 311
Josh Andersen j****h@c****m 94
Ilya Kramer i****a@l****m 2
Mauricio N. Guignard m****d@g****m 2
Mayank Asthana m****3@g****m 2
Chia-Yung Su c****g@g****m 1
David Fuhry d****y@g****m 1
Ed Hou h****d@g****m 1
zwindl z****l@p****m 1
beau b****t@g****m 1
Martin Packman m****n@z****m 1
josjosh s****e@s****) 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 27
  • Total pull requests: 24
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 28 days
  • Total issue authors: 17
  • Total pull request authors: 11
  • Average comments per issue: 3.67
  • Average comments per pull request: 0.54
  • Merged pull requests: 22
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 2
  • Average time to close issues: 4 days
  • Average time to close pull requests: 4 days
  • Issue authors: 1
  • Pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ascv (10)
  • mayankasthana (2)
  • itismewxg (1)
  • wordhou (1)
  • skipperkongen (1)
  • jjmalina (1)
  • ngrislain (1)
  • ZWindL (1)
  • JohnEmhoff (1)
  • DeoLeung (1)
  • erp12 (1)
  • B3AU (1)
  • Guangyi-Z (1)
  • ekimekim (1)
  • tatrats (1)
Pull Request Authors
  • ascv (18)
  • mayankasthana (1)
  • mauguignard (1)
  • wordhou (1)
  • ZWindL (1)
  • josephsu (1)
  • superjer (1)
  • B3AU (1)
  • dfuhry (1)
  • ilya-lt (1)
  • bz2 (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 18,124 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 19
  • Total maintainers: 1
pypi.org: hll

Fast HyperLogLog for Python

  • Versions: 19
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 18,124 Last month
Rankings
Downloads: 2.8%
Stargazers count: 7.3%
Forks count: 8.7%
Dependent packages count: 10.0%
Average: 19.3%
Dependent repos count: 67.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi