datasketch
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Science Score: 46.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ✓ Academic publication links: Links to zenodo.org
- ✓ Committers with academic emails: 1 of 30 committers (3.3%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.1%) to scientific vocabulary
Keywords
data-sketches
data-summary
hnsw
hyperloglog
jaccard-similarity
locality-sensitive-hashing
lsh
lsh-ensemble
lsh-forest
minhash
python
search
top-k
weighted-quantiles
Keywords from Contributors
distributed
Last synced: 6 months ago
Repository
Basic Info
- Host: GitHub
- Owner: ekzhu
- License: mit
- Language: Python
- Default Branch: master
- Homepage: https://ekzhu.github.io/datasketch
- Size: 5.68 MB
Statistics
- Stars: 2,699
- Watchers: 48
- Forks: 299
- Open Issues: 54
- Releases: 32
Created almost 11 years ago · Last pushed over 1 year ago
README.rst
datasketch: Big Data Looks Small
================================

.. image:: https://static.pepy.tech/badge/datasketch/month
   :target: https://pepy.tech/project/datasketch

.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.598238.svg
   :target: https://zenodo.org/doi/10.5281/zenodo.598238

datasketch gives you probabilistic data structures that can process and
search very large amounts of data very quickly, with little loss of
accuracy.

This package contains the following data sketches:

+-------------------------+-----------------------------------------------+
| Data Sketch             | Usage                                         |
+=========================+===============================================+
| `MinHash`_              | estimate Jaccard similarity and cardinality   |
+-------------------------+-----------------------------------------------+
| `Weighted MinHash`_     | estimate weighted Jaccard similarity          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog`_          | estimate cardinality                          |
+-------------------------+-----------------------------------------------+
| `HyperLogLog++`_        | estimate cardinality                          |
+-------------------------+-----------------------------------------------+

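To make the first row of the table concrete, here is a minimal standard-library sketch of the MinHash idea. It illustrates the technique only and is not datasketch's implementation: the helper names (``minhash_signature``, ``estimate_jaccard``, ``exact_jaccard``) and the use of salted BLAKE2b as the hash family are assumptions made for this example.

```python
import hashlib


def _hash(value: bytes, seed: int) -> int:
    """64-bit hash of `value`; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm=128):
    """For each of `num_perm` hash functions, keep the smallest hash in the set."""
    encoded = [item.encode("utf-8") for item in items]
    if not encoded:
        raise ValueError("cannot sketch an empty set")
    return [min(_hash(v, seed) for v in encoded) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


set_a = {"minhash", "is", "a", "probabilistic", "data", "structure"}
set_b = {"minhash", "is", "a", "probability", "data", "structure"}

sig_a = minhash_signature(set_a)
sig_b = minhash_signature(set_b)

print("estimated:", estimate_jaccard(sig_a, sig_b))
print("exact:    ", exact_jaccard(set_a, set_b))
```

The signature has a fixed size (``num_perm`` integers) regardless of how large the input set is, which is what lets a sketch stand in for the full data. Increasing ``num_perm`` tightens the estimate at the cost of more hashing.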

The following indexes for data sketches are provided to support
sub-linear query time:

+---------------------------+-----------------------------+------------------------+
| Index                     | For Data Sketch             | Supported Query Type   |
+===========================+=============================+========================+
| `MinHash LSH`_            | MinHash, Weighted MinHash   | Jaccard Threshold      |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Forest`_     | MinHash, Weighted MinHash   | Jaccard Top-K          |
+---------------------------+-----------------------------+------------------------+
| `MinHash LSH Ensemble`_   | MinHash                     | Containment Threshold  |
+---------------------------+-----------------------------+------------------------+
| `HNSW`_                   | Any                         | Custom Metric Top-K    |
+---------------------------+-----------------------------+------------------------+

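As a rough illustration of how an index can answer a Jaccard threshold query without comparing against every stored sketch, the following standard-library example shows the classic banding form of locality-sensitive hashing over MinHash signatures. It is a sketch under stated assumptions, not datasketch's code: the ``BandedLSH`` class, its ``bands`` and ``rows`` parameters, and the helper functions are hypothetical names invented for this example.

```python
import hashlib
from collections import defaultdict


def _hash(value: bytes, seed: int) -> int:
    """64-bit hash of `value`; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_perm):
    """Smallest hash of the set under each of `num_perm` hash functions."""
    encoded = [item.encode("utf-8") for item in items]
    return [min(_hash(v, seed) for v in encoded) for seed in range(num_perm)]


class BandedLSH:
    """Hypothetical banding index: `bands` hash tables, `rows` values per band."""

    def __init__(self, bands, rows):
        self.bands = bands
        self.rows = rows
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.bands * self.rows:
            raise ValueError("signature length must equal bands * rows")
        for i in range(self.bands):
            start = i * self.rows
            yield i, tuple(signature[start:start + self.rows])

    def insert(self, key, signature):
        """Store `key` in one bucket per band."""
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return keys that share at least one full band with `signature`."""
        candidates = set()
        for i, band in self._band_keys(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


BANDS, ROWS = 32, 4
docs = {
    "doc1": {f"word{i}" for i in range(20)},
    "doc2": {f"other{i}" for i in range(20)},
}

index = BandedLSH(BANDS, ROWS)
for key, words in docs.items():
    index.insert(key, minhash_signature(words, BANDS * ROWS))

# A query set that differs from doc1 in a single element.
query_words = (docs["doc1"] - {"word0"}) | {"changed"}
print(index.query(minhash_signature(query_words, BANDS * ROWS)))
```

A query touches only ``bands`` buckets instead of every stored signature, which is where the sub-linear lookup comes from. More rows per band makes a match stricter (fewer false positives), while more bands makes a match easier (fewer false negatives), so the two together set the effective similarity threshold.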
datasketch must be used with Python 3.7 or above, NumPy 1.11 or above, and SciPy.
Note that `MinHash LSH`_ and `MinHash LSH Ensemble`_ also support Redis and Cassandra
storage layers (see `MinHash LSH at Scale`_).
Install
-------

To install datasketch using ``pip``:

::

    pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

::

    pip install datasketch[redis]

To install with the Cassandra dependency:

::

    pip install datasketch[cassandra]

.. _`MinHash`: https://ekzhu.github.io/datasketch/minhash.html
.. _`Weighted MinHash`: https://ekzhu.github.io/datasketch/weightedminhash.html
.. _`HyperLogLog`: https://ekzhu.github.io/datasketch/hyperloglog.html
.. _`HyperLogLog++`: https://ekzhu.github.io/datasketch/hyperloglog.html#hyperloglog-plusplus
.. _`MinHash LSH`: https://ekzhu.github.io/datasketch/lsh.html
.. _`MinHash LSH Forest`: https://ekzhu.github.io/datasketch/lshforest.html
.. _`MinHash LSH Ensemble`: https://ekzhu.github.io/datasketch/lshensemble.html
.. _`Minhash LSH at Scale`: http://ekzhu.github.io/datasketch/lsh.html#minhash-lsh-at-scale
.. _`HNSW`: https://ekzhu.github.io/datasketch/documentation.html#hnsw
Owner
- Name: Eric Zhu
- Login: ekzhu
- Kind: user
- Website: https://ekzhu.com
- Repositories: 90
- Profile: https://github.com/ekzhu
GitHub Events
Total
- Issues event: 2
- Watch event: 207
- Issue comment event: 11
- Pull request review event: 3
- Pull request review comment event: 6
- Pull request event: 2
- Fork event: 7
Last Year
- Issues event: 2
- Watch event: 207
- Issue comment event: 11
- Pull request review event: 3
- Pull request review comment event: 6
- Pull request event: 2
- Fork event: 7
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| ekzhu | e****u@c****u | 187 |
| Vadim Markovtsev | g****r@g****m | 8 |
| aastafiev | a****v@g****m | 3 |
| ae-foster | a****r | 2 |
| Wojciech Łukasiewicz | w****s@g****m | 2 |
| Chris Ha | h****9@g****m | 2 |
| Arham Khan | a****n@g****m | 2 |
| Jordan Martin | j****n@h****c | 2 |
| Eric Zhu | e****u@n****m | 1 |
| Peter Kubov | p****v@a****m | 1 |
| Andrii Oriekhov | a****v@g****m | 1 |
| Ekevoo | e****o | 1 |
| Joe Halliwell | j****l@g****m | 1 |
| JonR | 5****v | 1 |
| ronassa | a****1@g****m | 1 |
| oisincar | o****r@g****m | 1 |
| long2ice | l****e@g****m | 1 |
| hguhlich | 9****h | 1 |
| fpug | f****b@p****n | 1 |
| Zac Bentley | z****y | 1 |
| Vojtech Letal | l****j | 1 |
| Titusz | t****n@g****m | 1 |
| Stefano Ortolani | o****o | 1 |
| Spandan Thakur | s****r@a****m | 1 |
| Senad Ibraimoski | s****i@g****m | 1 |
| Rupesh Kumar | 5****r | 1 |
| Qin TianHuan | 6****7@q****m | 1 |
| Michael Joseph Rosenthal | r****3@g****m | 1 |
| Keyur Joshi | k****i@g****m | 1 |
| Kevin Mann | K****3@g****m | 1 |
Committer Domains (Top 20 + Academic)
qq.com: 1
adobe.com: 1
puglier.in: 1
avast.com: 1
noreply.users.github.com: 1
heimdall.llc: 1
cs.toronto.edu: 1
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 97
- Total pull requests: 46
- Average time to close issues: 4 months
- Average time to close pull requests: 20 days
- Total issue authors: 79
- Total pull request authors: 23
- Average comments per issue: 3.06
- Average comments per pull request: 2.52
- Merged pull requests: 37
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 1
- Average time to close issues: 5 months
- Average time to close pull requests: N/A
- Issue authors: 3
- Pull request authors: 1
- Average comments per issue: 0.5
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- pavelnemirovsky (3)
- ajosh0504 (3)
- Priyabrata409 (3)
- blah-crusader (3)
- ekzhu (3)
- rocke2020 (2)
- ghost (2)
- charlotte-ling (2)
- 123epsilon (2)
- surkova (2)
- bryanyzhu (2)
- hsicsa (2)
- gmanlan (2)
- tomorrow1pan (1)
- ophiry (1)
Pull Request Authors
- ekzhu (17)
- 123epsilon (4)
- Sinusoidal36 (2)
- chris-ha458 (2)
- rupeshkumaar (2)
- mkmohangb (1)
- oisincar (1)
- IbraheemTaha (1)
- edholland (1)
- EliseAv (1)
- xkubov (1)
- ronassa (1)
- SenadI (1)
- researcher2 (1)
- QthCN (1)
Top Labels
Issue Labels
question (17)
enhancement (10)
help wanted (7)
bug (1)
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: pypi 3,839,058 last-month
- Total docker downloads: 380,633
- Total dependent packages: 22 (may contain duplicates)
- Total dependent repositories: 557 (may contain duplicates)
- Total versions: 115
- Total maintainers: 1
pypi.org: datasketch
Probabilistic data structures for processing and searching very large datasets
- Homepage: https://ekzhu.github.io/datasketch
- Documentation: https://datasketch.readthedocs.io/
- License: MIT
- Latest release: 1.6.5 (published over 1 year ago)
Rankings
Downloads: 0.3%
Dependent packages count: 0.6%
Dependent repos count: 0.6%
Docker downloads count: 1.0%
Average: 1.2%
Stargazers count: 1.5%
Forks count: 3.2%
Maintainers (1)
Last synced: 6 months ago
proxy.golang.org: github.com/ekzhu/datasketch
- Documentation: https://pkg.go.dev/github.com/ekzhu/datasketch#section-documentation
- License: mit
- Latest release: v1.6.5 (published almost 2 years ago)
Rankings
Stargazers count: 1.3%
Forks count: 1.6%
Average: 4.9%
Dependent packages count: 7.5%
Dependent repos count: 9.4%
Last synced: 6 months ago
Dependencies
benchmark/indexes/containment/requirements.txt
pypi
- SetSimilaritySearch *
- farmhash *
- matplotlib *
- pandas *
- scipy *
benchmark/indexes/jaccard/requirements.txt
pypi
- SetSimilaritySearch *
- matplotlib *
- nmslib *
- pyfarmhash *
setup.py
pypi
- numpy >=1.11
- scipy >=1.0.0
.github/workflows/doc.yml
actions
- JamesIves/github-pages-deploy-action v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
.github/workflows/pypi.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
- pypa/gh-action-pypi-publish 27b31702a0e7fc50959f5ad993c78deac1bdfc29 composite
.github/workflows/test-cassandra.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
- cassandra * docker
.github/workflows/test-mongo.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
- supercharge/mongodb-github-action 1.8.0 composite
.github/workflows/test.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite