pyprobables

Probabilistic data structures in python http://pyprobables.readthedocs.io/en/latest/index.html

https://github.com/barrust/pyprobables

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (18.1%) to scientific vocabulary

Keywords

bitarray bloom-filter count-mean-min-sketch count-mean-sketch count-min-sketch counting-bloom-filter counting-cuckoo-filter cuckoo-filter data-analysis data-mining data-science data-structures datastructures heavy-hitters probabilistic-programming probability python quotient-filter stream-threshold

Keywords from Contributors

sequences clade interactive genetic-algorithm mesh interpretability generic projection optim embedded
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: barrust
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 4.36 MB
Statistics
  • Stars: 118
  • Watchers: 6
  • Forks: 11
  • Open Issues: 1
  • Releases: 30
Created over 8 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Citation

README.rst

PyProbables
===========

.. image:: https://img.shields.io/badge/license-MIT-blue.svg
    :target: https://opensource.org/licenses/MIT/
    :alt: License
.. image:: https://img.shields.io/github/release/barrust/pyprobables.svg
    :target: https://github.com/barrust/pyprobables/releases
    :alt: GitHub release
.. image:: https://github.com/barrust/pyprobables/workflows/Python%20package/badge.svg
    :target: https://github.com/barrust/pyprobables/actions?query=workflow%3A%22Python+package%22
    :alt: Build Status
.. image:: https://codecov.io/gh/barrust/pyprobables/branch/master/graph/badge.svg?token=OdETiNgz9k
    :target: https://codecov.io/gh/barrust/pyprobables
    :alt: Test Coverage
.. image:: https://readthedocs.org/projects/pyprobables/badge/?version=latest
    :target: http://pyprobables.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status
.. image:: https://badge.fury.io/py/pyprobables.svg
    :target: https://pypi.org/project/pyprobables/
    :alt: Pypi Release
.. image:: https://pepy.tech/badge/pyprobables
    :target: https://pepy.tech/project/pyprobables
    :alt: Downloads

**pyprobables** is a pure-Python library of probabilistic data structures.
Its goal is to give developers pure-Python implementations of common
probabilistic data structures to use in their work.

To achieve better raw performance, it is recommended to supply an alternative
hashing algorithm that has been compiled in C. This could mean using the
md5 and sha512 algorithms provided, or installing a third-party package and
writing your own hashing strategy. Some options include the murmur hash
``mmh3`` or those from the ``pyhash`` library. Each data object in
**pyprobables** makes it easy to pass in a custom hashing function.

Read more in `Supplying a pre-defined, alternative hashing strategies`_
or `Defining hashing function using the provided decorators`_.

Installation
------------------

Pip Installation:

::

    $ pip install pyprobables

To install from source, clone the `repository on GitHub
<https://github.com/barrust/pyprobables>`__, then run from the project folder:

::

    $ python setup.py install

`pyprobables` supports Python 3.6 - 3.11+

For *Python 2.7* support, install release 0.3.2:

::

    $ pip install pyprobables==0.3.2


API Documentation
---------------------

The documentation is hosted on
`readthedocs.io <http://pyprobables.readthedocs.io/en/latest/index.html>`__

You can build the documentation locally by running:

::

    $ pip install sphinx
    $ cd docs/
    $ make html



Automated Tests
------------------

To run the automated tests, run the following command from the
downloaded folder:

::

  $ python setup.py test



Quickstart
------------------

Import pyprobables and set up a Bloom Filter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from probables import BloomFilter
    blm = BloomFilter(est_elements=1000, false_positive_rate=0.05)
    blm.add('google.com')
    blm.check('facebook.com')  # should return False
    blm.check('google.com')  # should return True
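
The ``est_elements`` and ``false_positive_rate`` arguments together determine how much memory a Bloom filter needs. As a rough guide, the sketch below applies the textbook sizing formulas to the values used above. The helper name ``bloom_parameters`` is hypothetical, and the library may round or size its filters differently.

```python
import math
from typing import Tuple


def bloom_parameters(est_elements: int, false_positive_rate: float) -> Tuple[int, int]:
    """Return (number_of_bits, number_of_hashes) from the textbook formulas.

    Hypothetical helper for illustration; not part of pyprobables.
    """
    # m = -n * ln(p) / (ln 2)^2
    bits = math.ceil(-est_elements * math.log(false_positive_rate) / math.log(2) ** 2)
    # k = (m / n) * ln 2, with at least one hash function
    hashes = max(1, round(bits / est_elements * math.log(2)))
    return bits, hashes


bits, hashes = bloom_parameters(est_elements=1000, false_positive_rate=0.05)
print(f"{bits} bits (~{bits // 8} bytes), {hashes} hash functions")
```

Lowering the false positive rate or raising the expected element count increases the number of bits required.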


Import pyprobables and set up a Count-Min Sketch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from probables import CountMinSketch
    cms = CountMinSketch(width=1000, depth=5)
    cms.add('google.com')  # should return 1
    cms.add('facebook.com', 25)  # insert 25 at once; should return 25
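
To make the roles of ``width`` and ``depth`` more concrete, here is a minimal, standard-library-only sketch of the count-min idea. The class ``TinyCountMinSketch`` and its SHA-256 based row hashing are assumptions made for illustration; they are not the library's implementation.

```python
import hashlib


class TinyCountMinSketch:
    """Illustrative count-min sketch: `depth` rows of `width` counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One column per row, derived by salting the key with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        # Increment one counter in every row, then report the new estimate.
        for row, col in self._columns(key):
            self.table[row][col] += count
        return self.check(key)

    def check(self, key):
        # The minimum across rows is the estimate; collisions can only inflate it.
        return min(self.table[row][col] for row, col in self._columns(key))


sketch = TinyCountMinSketch(width=1000, depth=5)
first = sketch.add("google.com")
second = sketch.add("facebook.com", 25)
print(first, second, sketch.check("never-added.example"))
```

A wider table reduces collisions within a row, while more rows make it less likely that every counter for a key has been inflated.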


Import pyprobables and set up a Cuckoo Filter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from probables import CuckooFilter
    cko = CuckooFilter(capacity=100, max_swaps=10)
    cko.add('google.com')
    cko.check('facebook.com')  # should return False
    cko.check('google.com')  # should return True


Import pyprobables and set up a Quotient Filter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from probables import QuotientFilter
    qf = QuotientFilter(quotient=24)
    qf.add('google.com')
    qf.check('facebook.com')  # should return False
    qf.check('google.com')  # should return True
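
The ``quotient`` argument is a number of bits. A quotient filter splits each hash into a quotient, which selects a slot, and a remainder, which is stored in that slot. The sketch below shows that split under the assumption of a 32-bit hash; the hash width, the SHA-256 based hash, and the helper name ``split_hash`` are illustrative assumptions, not details taken from the library.

```python
import hashlib


def split_hash(key, quotient_bits, hash_bits=32):
    """Split a hash of `key` into (quotient, remainder).

    Hypothetical helper; assumes a `hash_bits`-wide hash value.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") & ((1 << hash_bits) - 1)
    remainder_bits = hash_bits - quotient_bits
    quotient = value >> remainder_bits               # high bits pick the slot
    remainder = value & ((1 << remainder_bits) - 1)  # low bits are stored
    return quotient, remainder


quotient, remainder = split_hash("google.com", quotient_bits=24)
print(f"slot {quotient} of {1 << 24}, stored remainder {remainder}")
```

Under this assumption, ``quotient=24`` gives 2**24 slots with an 8-bit remainder each.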


Supplying a pre-defined, alternative hashing strategies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    from probables import BloomFilter
    from probables.hashes import default_sha256
    blm = BloomFilter(est_elements=1000, false_positive_rate=0.05,
                      hash_function=default_sha256)
    blm.add('google.com')
    blm.check('facebook.com')  # should return False
    blm.check('google.com')  # should return True


.. _use-custom-hashing-strategies:

Defining hashing function using the provided decorators
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: python

    import mmh3  # murmur hash 3 implementation (pip install mmh3)
    from probables.hashes import hash_with_depth_bytes
    from probables import BloomFilter

    @hash_with_depth_bytes
    def my_hash(key, depth):
        return mmh3.hash_bytes(key, seed=depth)

    blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)

.. code:: python

    import hashlib
    from probables.hashes import hash_with_depth_int
    from probables.constants import UINT64_T_MAX
    from probables import BloomFilter

    @hash_with_depth_int
    def my_hash(key, seed=0, encoding="utf-8"):
        max64mod = UINT64_T_MAX + 1
        val = int(hashlib.sha512(key.encode(encoding)).hexdigest(), 16)
        val += seed  # not a good example, but uses the seed value
        return val % max64mod

    blm = BloomFilter(est_elements=1000, false_positive_rate=0.05, hash_function=my_hash)
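
The decorators above turn a single seeded hash into a function that returns one value per hash "depth". As a rough mental model only, the sketch below shows one way such a wrapper could be written with the standard library. The names ``with_depth`` and ``seeded_sha256`` are hypothetical, and the library's own decorators may derive successive values differently.

```python
import hashlib
from functools import wraps


def with_depth(func):
    """Hypothetical decorator: call func(key, seed) once per depth level."""

    @wraps(func)
    def hashing_func(key, depth=1):
        # Use the level index as the seed so each level yields a different hash.
        return [func(key, seed) for seed in range(depth)]

    return hashing_func


@with_depth
def seeded_sha256(key, seed=0):
    # Salt the key with the seed and keep the first 64 bits of the digest.
    data = f"{seed}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


values = seeded_sha256("google.com", depth=3)
print(values)
```

Each call returns a list of ``depth`` unsigned 64-bit integers, and the same key always produces the same list.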


See the API documentation and the quickstart page on
`readthedocs.io <http://pyprobables.readthedocs.io/en/latest/index.html>`__
for the other data structures available and for more examples!


Changelog
------------------

Please see the changelog for a list of all changes.


Backward Compatible Changes
---------------------------

If you are using previously exported probabilistic data structures (v0.4.1 or below)
that were built with the default hashing strategy, use the following code
to mimic the original default hashing algorithm.

.. code:: python

    from probables import BloomFilter
    from probables.constants import UINT64_T_MAX
    from probables.hashes import hash_with_depth_int

    @hash_with_depth_int
    def old_fnv1a(key, depth=1):
        return tmp_fnv_1a(key)

    def tmp_fnv_1a(key):
        max64mod = UINT64_T_MAX + 1
        hval = 14695981039346656073
        fnv_64_prime = 1099511628211
        tmp = map(ord, key)
        for t_str in tmp:
            hval ^= t_str
            hval *= fnv_64_prime
            hval %= max64mod
        return hval

    blm = BloomFilter(filepath="old-file-path.blm", hash_function=old_fnv1a)

Owner

  • Name: Tyler Barrus
  • Login: barrust
  • Kind: user
  • Location: Richmond Va

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: PyProbables
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Tyler
    family-names: Barrus
    email: barrust@gmail.com
    orcid: 'https://orcid.org/0000-0002-6691-0360'
repository-code: 'https://github.com/barrust/pyprobables'
abstract: >-
  A set of probabilistic data structures written in
  python
keywords:
  - Probabilistic
  - Data Structures
  - Bloom Filter
  - Count-Min Sketch
  - Cuckoo Filter
  - Counting Bloom Filter
  - Count-Mean-Min Sketch
  - Count-Mean Sketch
  - Heavy Hitters
  - Stream Threshold
  - Rolling Bloom Filter
  - Expanding Bloom Filter
  - Counting Cuckoo Filter
  - Quotient Filter
license: MIT
version: 0.6.0
date-released: '2024-01-10'

GitHub Events

Total
  • Create event: 8
  • Release event: 1
  • Issues event: 5
  • Watch event: 8
  • Delete event: 10
  • Issue comment event: 12
  • Push event: 32
  • Pull request event: 17
  • Fork event: 1
Last Year
  • Create event: 8
  • Release event: 1
  • Issues event: 5
  • Watch event: 8
  • Delete event: 10
  • Issue comment event: 12
  • Push event: 32
  • Pull request event: 17
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 149
  • Total Committers: 8
  • Avg Commits per committer: 18.625
  • Development Distribution Score (DDS): 0.107
Past Year
  • Commits: 11
  • Committers: 2
  • Avg Commits per committer: 5.5
  • Development Distribution Score (DDS): 0.091
Top Committers
Name               Email            Commits
Tyler Barrus       b****t@g****m    133
KOLANICH           K****H           6
dependabot[bot]    4****]           5
dnanto             d****2@g****u    1
Marcus McCurdy     m****y@g****m    1
Leonhard Masche    l****e@g****m    1
Dominik Kozaczko   d****k@k****o    1
Daniel M           c****a           1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 48
  • Total pull requests: 76
  • Average time to close issues: 3 months
  • Average time to close pull requests: 10 days
  • Total issue authors: 13
  • Total pull request authors: 10
  • Average comments per issue: 1.96
  • Average comments per pull request: 1.14
  • Merged pull requests: 70
  • Bot issues: 0
  • Bot pull requests: 5
Past Year
  • Issues: 2
  • Pull requests: 6
  • Average time to close issues: 25 days
  • Average time to close pull requests: 5 days
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 3.5
  • Average comments per pull request: 0.33
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • barrust (34)
  • KOLANICH (2)
  • racinmat (2)
  • simonmandlik (1)
  • JSai23 (1)
  • mrqc (1)
  • pyup-bot (1)
  • pcoccoli (1)
  • Glfrey (1)
  • sfletc (1)
  • dekoza (1)
  • huuthonguyen76 (1)
  • suokunlong (1)
Pull Request Authors
  • barrust (65)
  • dependabot[bot] (11)
  • KOLANICH (8)
  • cunla (2)
  • racinmat (2)
  • leonhma (1)
  • volker48 (1)
  • dekoza (1)
  • dnanto (1)
  • pyup-bot (1)
Top Labels
Issue Labels
  • enhancement (8)
  • help wanted (1)
Pull Request Labels
  • dependencies (11)
  • enhancement (1)
  • hacktoberfest-accepted (1)
  • github_actions (1)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 130,931 last-month
  • Total dependent packages: 3
  • Total dependent repositories: 1
  • Total versions: 31
  • Total maintainers: 1
pypi.org: pyprobables

Probabilistic data structures in python

  • Versions: 31
  • Dependent Packages: 3
  • Dependent Repositories: 1
  • Downloads: 130,931 Last month
Rankings
Stargazers count: 7.2%
Dependent packages count: 10.1%
Downloads: 10.7%
Forks count: 11.9%
Average: 12.3%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

docs/requirements.txt pypi
  • sphinx >=3.0
pyproject.toml pypi
  • black ^20.8b1 develop
  • flake8 ^3.6.0 develop
  • isort ^5.6.4 develop
  • pre-commit >=2.18.1 develop
  • pytest ^6.1.1 develop
.github/workflows/publish.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • codecov/codecov-action v2 composite
  • psf/black stable composite