Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Committers with academic emails
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (11.0%) to scientific vocabulary
Keywords
Repository
Fast HyperLogLog for Python.
Basic Info
Statistics
- Stars: 109
- Watchers: 6
- Forks: 19
- Open Issues: 2
- Releases: 0
Topics
Metadata Files
README.md
Overview
The HyperLogLog algorithm [1] is a space efficient method to estimate the cardinality of extraordinarily large datasets. This module is written in C for Python >= 3.6. It implements a 64 bit version of HyperLogLog [2] using a Murmur64A hash.
Quick start
Install Python development libraries. This step will depend on your OS. On
Ubuntu 24.0.4 LTS:
sudo apt install python-dev-is-python3
Install HLL:
pip install HLL
Example usage: ``` from HLL import HyperLogLog
hll = HyperLogLog(10) # use 2^10 registers hll.add('some data')
estimate = hll.cardinality() print(estimate) ```
Changelog
2.4
- Use compiler built-ins for leading zero counts if using GCC or Clang.
2.3
- Fix bug causing cardinalities on the order of $2^{45}$ to be under-estimated.
2.2
- Remove support for Python 2.7.
2.1
Deprecation notice: this is the last supported version for Python 2.7.x.
- Fixed bug where HyperLogLogs of unequal sizes could be merged.
- Fixed bug causing cardinality estimates to be off when repeatedly merging sparse HyperLogLogs loaded from a pickle dump.
2.0
- Algorithm has been updated to a 64 bit version [2]. This fixes the spike in relative error when switching from linear counting in the original HyperLogLog algorithm.
- Hash function has been updated to the 64 bit Murmur64A function.
- More efficiently store registers using a combination of sparse and dense representations.
- Improved method for counting the number of leading zeroes.
- Changed the return type of
cardinality()from float to integer. - Changed the return logic of
add(). This method no longer always indicates if a register was updated using its return value. This behavior is only preserved in dense representation. In sparse representation,add()always returnsFalse. HyperLogLogobjects pickled in 1.x and 2.x are not compatible.- Added
get_register() - Added
hash() - Added
_get_meta() - Deprecated
murmur2_hash() - Deprecated
registers() - Deprecated
set_register()
Documentation
HyperLogLog objects
HyperLogLog objects implement a 64 bit HyperLogLog algorithm [2]. They can
be used to estimate the cardinality of very large datasets. The estimation
accuracy is proportional to the number of registers. Using more registers
increases the accuracy and using less registers decreases the accuracy. The
number of registers is set in powers of 2 using the parameter p and defaults
to p=12 or $2^{12}$ registers.
```
from hll import HyperLogLog hll = HyperLogLog() # Default to 2^12 registers hll.size() 4096 hll = HyperLogLog(3) # Use 2^3 registers hll.size() 8 for data in ['one', 'two', 'three', 'four',]: ... hll.add(data) hll.cardinality() 4 ```
HyperLogLogs use a Murmur64A hash. This function is fast and has a good
uniform distribution of bits which is necessary for accurate estimations. The
seed to this hash function can be set in the HyperLogLog constructor:
```
hll = HyperLogLog(p=2, seed=123456789) hll.seed() 123456789 ```
The hash function can also be called directly: ```
hll.hash('something') 393810339 ```
Individual registers can be printed: ```
for i in range(2**4): ... print(hll.get_register(i)) 0 0 3 0 4 ```
HyperLogLog objects can be merged. This is done by taking the maximum value
of their respective registers:
```
A = HyperLogLog(p=4) A.add('hello') B = HyperLogLog(p=4) B.add('world') A.merge(B) A.cardinality() 2 ```
Register representation
Registers are stored using both sparse and dense representation. Originally
all registers are initialized to zero. However storing all these zeroes
individually is wasteful. Instead a sorted linked list [3] is used to store
only registers that have been set (e.g. have a non-zero value). When this list
reaches sufficient size the HyperLogLog object will switch to using dense
representation where registers are stored invidiaully using 6 bits.
Sparse representation can be disabled using the sparse flag:
```
HyperLogLog(p=2, sparse=False) ```
The maximum list size for the sparse register list determines when the
HyperLogLog object switches to dense representation. This can be set
using max_list_size:
```
HyperLogLog(p=15, maxlistsize=10**6) ```
Traversing the sparse register list every time an item is added to the
HyperLogLog to update a register is expensive. A temporary buffer is instead
used to defer this operation. Items added to the HyperLogLog are first added
to the temporary buffer. When the buffer is full the items are sorted and then
any register updates occur. These updates can be done in one pass since both
the temproary buffer and sparse register list are sorted.
The buffer size can be set using max_buffer_size:
```
HyperLogLog(p=15, maxbuffersize=10**5) ```
License
This software is released under the MIT License.
References
[1] P. Flajolet, E. Fusy, O. Gandouet, F. Meunier. "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm," Conference on the Analysis of Algorithms 2007.
[2] O. Ertl, "New Cardinality Estimation Methods for HyperLogLog Sketches," arXiv:1706.07290 [cs], June 2017.
[3] S. Heule, M. Nunkesser, A. Hall. "HyperLogLog in Practice: Algorithimic Engineering of a State of the Art Cardinality Estimation Algorithm," Proceedings of the EDBT 2013 Conference, ACM, Genoa March 2013.
Owner
- Name: Joshua Andersen
- Login: ascv
- Kind: user
- Location: Seattle
- Company: Fullbeaker.com
- Repositories: 3
- Profile: https://github.com/ascv
GitHub Events
Total
- Issues event: 2
- Watch event: 9
- Issue comment event: 2
- Push event: 6
- Pull request event: 2
- Create event: 1
Last Year
- Issues event: 2
- Watch event: 9
- Issue comment event: 2
- Push event: 6
- Pull request event: 2
- Create event: 1
Committers
Last synced: over 2 years ago
Top Committers
| Name | Commits | |
|---|---|---|
| Joshua Andersen | j****n@g****m | 311 |
| Josh Andersen | j****h@c****m | 94 |
| Ilya Kramer | i****a@l****m | 2 |
| Mauricio N. Guignard | m****d@g****m | 2 |
| Mayank Asthana | m****3@g****m | 2 |
| Chia-Yung Su | c****g@g****m | 1 |
| David Fuhry | d****y@g****m | 1 |
| Ed Hou | h****d@g****m | 1 |
| zwindl | z****l@p****m | 1 |
| beau | b****t@g****m | 1 |
| Martin Packman | m****n@z****m | 1 |
| josjosh | s****e@s****) | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 27
- Total pull requests: 24
- Average time to close issues: about 2 months
- Average time to close pull requests: 28 days
- Total issue authors: 17
- Total pull request authors: 11
- Average comments per issue: 3.67
- Average comments per pull request: 0.54
- Merged pull requests: 22
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 2
- Average time to close issues: 4 days
- Average time to close pull requests: 4 days
- Issue authors: 1
- Pull request authors: 1
- Average comments per issue: 2.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ascv (10)
- mayankasthana (2)
- itismewxg (1)
- wordhou (1)
- skipperkongen (1)
- jjmalina (1)
- ngrislain (1)
- ZWindL (1)
- JohnEmhoff (1)
- DeoLeung (1)
- erp12 (1)
- B3AU (1)
- Guangyi-Z (1)
- ekimekim (1)
- tatrats (1)
Pull Request Authors
- ascv (18)
- mayankasthana (1)
- mauguignard (1)
- wordhou (1)
- ZWindL (1)
- josephsu (1)
- superjer (1)
- B3AU (1)
- dfuhry (1)
- ilya-lt (1)
- bz2 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
-
Total downloads:
- pypi 18,124 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 19
- Total maintainers: 1
pypi.org: hll
Fast HyperLogLog for Python
- Homepage: https://github.com/ascv/HyperLogLog
- Documentation: https://hll.readthedocs.io/
- License: MIT
-
Latest release: 2.3.0
published about 1 year ago