Recent Releases of datasketch

datasketch - v1.6.5

What's Changed

  • Retrieve MinHash from LSHForest by @123epsilon in https://github.com/ekzhu/datasketch/pull/234
  • Merging (Identically Specified) MinHashLSH objects by @rupeshkumaar in https://github.com/ekzhu/datasketch/pull/232
  • Update bBitMinHash Benchmark by @123epsilon in https://github.com/ekzhu/datasketch/pull/238

New Contributors

  • @123epsilon made their first contribution in https://github.com/ekzhu/datasketch/pull/234
  • @rupeshkumaar made their first contribution in https://github.com/ekzhu/datasketch/pull/232

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.6.4...v1.6.5

- Python
Published by ekzhu over 1 year ago

datasketch - v1.6.4

What's Changed

  • HNSW bug fixes by @ekzhu in https://github.com/ekzhu/datasketch/pull/230

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.6.3...v1.6.4

- Python
Published by ekzhu over 2 years ago

datasketch - v1.6.3

What's Changed

  • Update docs by @ekzhu in https://github.com/ekzhu/datasketch/pull/224
  • HNSW remove() point in-place. by @ekzhu in https://github.com/ekzhu/datasketch/pull/225
  • Benchmark HNSW for Jaccard by @ekzhu in https://github.com/ekzhu/datasketch/pull/226
  • HNSW support for soft-remove and hard-remove. by @ekzhu in https://github.com/ekzhu/datasketch/pull/227

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.6.2...v1.6.3

- Python
Published by ekzhu over 2 years ago

datasketch - v1.6.2

What's Changed

  • HNSW as MutableMap by @ekzhu in https://github.com/ekzhu/datasketch/pull/223

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.6.1...v1.6.2

- Python
Published by ekzhu over 2 years ago

datasketch - v1.6.1

What's Changed

  • simplify reshapes by @chris-ha458 in https://github.com/ekzhu/datasketch/pull/217
  • HNSW Update Point by @ekzhu in https://github.com/ekzhu/datasketch/pull/220
  • HNSW Dict interface by @ekzhu in https://github.com/ekzhu/datasketch/pull/221
  • HNSW Doc by @ekzhu in https://github.com/ekzhu/datasketch/pull/222

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.6.0...v1.6.1

- Python
Published by ekzhu over 2 years ago

datasketch - v1.6.0

What's Changed

  • Update MinHashLSH.query docstring detailing proximal nature of results by @micimize in https://github.com/ekzhu/datasketch/pull/199
  • Fix doc with new template. by @ekzhu in https://github.com/ekzhu/datasketch/pull/202
  • Update lsh.rst by @ekzhu in https://github.com/ekzhu/datasketch/pull/208
  • Benchmark ANN index for Jaccard by @ekzhu in https://github.com/ekzhu/datasketch/pull/210
  • Update hashfunc.py by @chris-ha458 in https://github.com/ekzhu/datasketch/pull/211
  • HNSW Index by @ekzhu in https://github.com/ekzhu/datasketch/pull/218

New Contributors

  • @micimize made their first contribution in https://github.com/ekzhu/datasketch/pull/199
  • @chris-ha458 made their first contribution in https://github.com/ekzhu/datasketch/pull/211

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.5.9...v1.6.0

- Python
Published by ekzhu over 2 years ago

datasketch - v1.5.9

What's Changed

  • Create python-publish.yml by @ekzhu in https://github.com/ekzhu/datasketch/pull/191
  • Support numpy>=1.20.0 by @joehalliwell in https://github.com/ekzhu/datasketch/pull/192
  • Add note to documentation to address #195 by @ekzhu in https://github.com/ekzhu/datasketch/pull/197

New Contributors

  • @joehalliwell made their first contribution in https://github.com/ekzhu/datasketch/pull/192

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.5.8...v1.5.9

- Python
Published by ekzhu about 3 years ago

datasketch - v1.5.8

What's Changed

  • Add GitHub URL for PyPi by @andriyor in https://github.com/ekzhu/datasketch/pull/179
  • Support asyncio redis by @long2ice in https://github.com/ekzhu/datasketch/pull/185
  • Fix name construction for all values of b by @SenadI in https://github.com/ekzhu/datasketch/pull/190

New Contributors

  • @andriyor made their first contribution in https://github.com/ekzhu/datasketch/pull/179
  • @long2ice made their first contribution in https://github.com/ekzhu/datasketch/pull/185
  • @SenadI made their first contribution in https://github.com/ekzhu/datasketch/pull/190

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.5.7...v1.5.8

- Python
Published by ekzhu over 3 years ago

datasketch - v1.5.7

What's Changed

  • Unable to create multiple lsh indices each one in its own keyspace - issue #171 by @ronassa in https://github.com/ekzhu/datasketch/pull/172

New Contributors

  • @ronassa made their first contribution in https://github.com/ekzhu/datasketch/pull/172

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.5.6...v1.5.7

- Python
Published by ekzhu about 4 years ago

datasketch - Fixed broken packaging script for datasketch/experimental/aio

Fixed broken packaging setup.py that missed experimental/aio.

- Python
Published by ekzhu about 4 years ago

datasketch - v1.5.5

What's Changed

  • Adding minhash_many to WeightedMinHashGenerator. by @jroose-jv in https://github.com/ekzhu/datasketch/pull/165
  • Add query buffer by @hguhlich in https://github.com/ekzhu/datasketch/pull/167

New Contributors

  • @jroose-jv made their first contribution in https://github.com/ekzhu/datasketch/pull/165
  • @hguhlich made their first contribution in https://github.com/ekzhu/datasketch/pull/167

Full Changelog: https://github.com/ekzhu/datasketch/compare/v1.5.4...v1.5.5

- Python
Published by ekzhu about 4 years ago

datasketch - v1.5.4

What's Changed

  • Fixes #146; MinhashLSH creates mongo index key. by @oisincar in https://github.com/ekzhu/datasketch/pull/148
  • Add redis_buffer configuration. by @QthCN in https://github.com/ekzhu/datasketch/pull/152
  • minhash: Get rid of deprecation warning by @xkubov in https://github.com/ekzhu/datasketch/pull/156

New Contributors

  • @oisincar made their first contribution in https://github.com/ekzhu/datasketch/pull/148
  • @QthCN made their first contribution in https://github.com/ekzhu/datasketch/pull/152
  • @xkubov made their first contribution in https://github.com/ekzhu/datasketch/pull/156

Full Changelog: https://github.com/ekzhu/datasketch/compare/1.5.2...v1.5.4

- Python
Published by ekzhu about 4 years ago

datasketch - Improved performance for MinHash and MinHashLSH

  • Performance improvement for MinHash's update method.
  • Make MinHash updates 4.5X faster by using the update_batch method for bulk updates on MinHash. See the API doc (http://ekzhu.com/datasketch/documentation.html#datasketch.MinHash.update_batch).
  • Further performance gains from bulk generation of MinHash objects using MinHash.bulk or MinHash.generator. See API doc and pull request.
  • Optional compression for the MinHash LSH index by hashing the bucket key produced by MinHashLSH._H. See pull request. This reduces the memory and storage space used by the index.
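
The bucket-key compression can be pictured with a small sketch. This is not datasketch's implementation: `band_key` and `compress_key` are hypothetical helpers, and SHA-1 is an assumed choice of hash function. The point is only that a band of hash values serializes to a long byte string, while hashing that string yields a short, fixed-size bucket key.

```python
import hashlib
import struct


def band_key(hashvalues):
    """Serialize one band of 64-bit hash values into a raw bucket key."""
    return struct.pack(f"<{len(hashvalues)}Q", *hashvalues)


def compress_key(raw_key):
    """Replace the raw key with a fixed-size digest (assumed hash choice)."""
    return hashlib.sha1(raw_key).digest()


band = list(range(1, 17))      # 16 hash values in one band
raw = band_key(band)           # 16 values * 8 bytes each
short = compress_key(raw)      # SHA-1 digests are always 20 bytes
print(len(raw), len(short))
```

The trade-off in a sketch like this is that two different bands could in principle hash to the same digest, which would merge their buckets; with a wide digest this is very unlikely.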

Thank you @Sinusoidal36!

- Python
Published by ekzhu about 5 years ago

datasketch - Add Cassandra storage layer.

  • Minor bug fixes
  • Cassandra storage layer, thanks to @ostefano! You can now specify the Cassandra config just like the Redis one.

```python
from datasketch import MinHashLSH

lsh = MinHashLSH(
    threshold=0.5,
    num_perm=128,
    storage_config={
        'type': 'cassandra',
        'cassandra': {
            'seeds': ['127.0.0.1'],
            'keyspace': 'lshtest',
            'replication': {
                'class': 'SimpleStrategy',
                'replication_factor': '1',
            },
            'drop_keyspace': False,
            'drop_tables': False,
        },
    },
)
```

- Python
Published by ekzhu about 6 years ago

datasketch - hashfunc to replace hashobj

MinHash and HyperLogLog now support a hashfunc parameter. The old hashobj parameter has been removed.

```python
# Let's use MurmurHash3.
import mmh3
from datasketch import MinHash

# We need to define a new hash function that outputs an integer that
# can be encoded in 32 bits.
def hashfunc(d):
    return mmh3.hash32(d)

# Use this function in MinHash constructor.
m = MinHash(hashfunc=hashfunc)
```

- Python
Published by ekzhu about 7 years ago

datasketch - Better LSH Ensemble

Use dynamic programming to create an optimal partition, allowing the LSH Ensemble index to adapt to any set size distribution.
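
As a rough illustration of partitioning by dynamic programming, the sketch below splits sorted set sizes into contiguous partitions. It is not the LSH Ensemble algorithm itself: `optimal_partitions` is a hypothetical function, and the cost of a partition (number of sets times the spread of sizes inside it) is an assumed stand-in for the real objective.

```python
def optimal_partitions(sizes, num_part):
    """Split sorted sizes into num_part contiguous groups with minimal total cost.

    Assumed cost of a group: (number of sets) * (largest size - smallest size).
    Returns a list of (lower, upper) size bounds, one per partition.
    """
    sizes = sorted(sizes)
    n = len(sizes)
    if not 1 <= num_part <= n:
        raise ValueError("num_part must be between 1 and len(sizes)")

    def cost(i, j):
        # Cost of the group sizes[i:j].
        return (j - i) * (sizes[j - 1] - sizes[i])

    inf = float("inf")
    # best[p][j]: minimal cost of splitting the first j sizes into p groups.
    best = [[inf] * (n + 1) for _ in range(num_part + 1)]
    cut = [[0] * (n + 1) for _ in range(num_part + 1)]
    best[0][0] = 0
    for p in range(1, num_part + 1):
        for j in range(p, n + 1):
            for i in range(p - 1, j):
                candidate = best[p - 1][i] + cost(i, j)
                if candidate < best[p][j]:
                    best[p][j] = candidate
                    cut[p][j] = i

    # Walk the recorded cut points backwards to recover the partitions.
    bounds = []
    j = n
    for p in range(num_part, 0, -1):
        i = cut[p][j]
        bounds.append((sizes[i], sizes[j - 1]))
        j = i
    return bounds[::-1]


partitions = optimal_partitions([1, 2, 3, 100, 101, 102], num_part=2)
print(partitions)
```

With a skewed distribution like the one above, the partition boundary falls in the gap between the small and the large sets rather than at a fixed midpoint, which is the kind of adaptation the release note describes.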

- Python
Published by ekzhu about 7 years ago

datasketch - Batch removal of keys from Async MinHashLSH index

  • Adding batch removal functionality for Async MinHashLSH
  • Because Redis does not support async operation, removed Redis support from Async MinHashLSH

For details see Pull #70. Thanks @aastafiev for the contribution.

- Python
Published by ekzhu over 7 years ago

datasketch - MongoDB replicas

Add support for MongoDB replica set

- Python
Published by ekzhu over 7 years ago

datasketch - Fix bug #68

- Python
Published by ekzhu over 7 years ago

datasketch - Asynchronous MinHash LSH module and storage base name

  • Added Asynchronous MinHash LSH module. Thanks @aastafiev!
  • Added the ability to set the base name in the storage config. The base name is used as the prefix for generating keys in the underlying storage (e.g., Redis). This change allows a client to "reconnect" to an existing LSH index in the storage through its base name.
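
The role of the base name can be shown with a toy example. `PrefixedStorage` is a hypothetical class, not part of datasketch, and a plain dict stands in for an external store such as Redis. The sketch only shows how a shared key prefix lets a second client see the keys written by the first.

```python
class PrefixedStorage:
    """Toy key-value storage that namespaces every key with a base name."""

    def __init__(self, backend, basename):
        self._backend = backend            # shared dict standing in for Redis
        self._prefix = basename + b":"

    def insert(self, key, value):
        self._backend.setdefault(self._prefix + key, set()).add(value)

    def get(self, key):
        return self._backend.get(self._prefix + key, set())


backend = {}
first = PrefixedStorage(backend, b"my_index")
first.insert(b"bucket-1", "doc-a")

# A second client "reconnects" by reusing the same base name.
second = PrefixedStorage(backend, b"my_index")
# A client with a different base name sees a separate, empty namespace.
other = PrefixedStorage(backend, b"other_index")

print(second.get(b"bucket-1"))
print(other.get(b"bucket-1"))
```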

- Python
Published by ekzhu over 7 years ago

datasketch - Fix bug in storage

Fix a bug with UnorderedStorage.get_many (#56)

- Python
Published by ekzhu over 7 years ago

datasketch - Fix bug in LSH Forest for Weighted MinHash

  • Fix issue #35
  • Test cases for checking consistency of hash value length in LSH.

- Python
Published by ekzhu over 8 years ago

datasketch - Optional redis storage requirement.

Thanks @vmarkovtsev

- Python
Published by ekzhu over 8 years ago

datasketch - Redis storage layer for MinHash LSH

  • Introduced a Redis storage layer for MinHash LSH. Thanks to @ae-foster
  • Added __hash__ method for Lean MinHash.

- Python
Published by ekzhu over 8 years ago

datasketch - LSH Ensemble

  • Added a slightly simplified version of LSH Ensemble that supports containment search with MinHash data sketches.
  • Added an introduction to containment.
  • Updated the documentation.
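
For readers new to the term, containment differs from Jaccard similarity in its denominator: it measures what fraction of the query set appears in a candidate set. The sketch below computes both exactly on small Python sets; the function names are illustrative, and LSH Ensemble itself works on MinHash sketches rather than raw sets.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)


def containment(query, candidate):
    """Fraction of the query set that is contained in the candidate set."""
    return len(query & candidate) / len(query)


query = {"a", "b", "c"}
big = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"}

j = jaccard(query, big)          # 3 shared items out of 12 in the union
c = containment(query, big)      # all 3 query items appear in `big`
print(j, c)
```

A small query fully covered by a much larger set scores low on Jaccard but high on containment, which is why containment search needs its own index.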

- Python
Published by ekzhu almost 9 years ago

datasketch - Consistent MinHash hash values across Python versions

MinHash now uses Numpy's random number generator instead of Python's built-in random. This makes MinHash generate consistent hash values across different Python versions.

The side effect is that MinHash objects created before version 1.1.3 will not work correctly (i.e., jaccard, merge, and union) with those created after.
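
Why the generator matters can be seen in a toy MinHash. This is a simplified sketch, not datasketch's code: the helper names are hypothetical, `zlib.crc32` is an assumed base hash, and Python's seeded `random.Random` merely stands in for whichever generator produces the permutation parameters. Two signatures are comparable only if they were built from identical parameters.

```python
import random
import zlib

PRIME = (1 << 61) - 1  # modulus for the toy permutation functions


def make_permutations(seed, num_perm=64):
    """Draw (a, b) parameters for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_perm)]


def signature(items, perms):
    """Toy MinHash signature: the minimum permuted hash per function."""
    hashed = [zlib.crc32(item.encode("utf8")) for item in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in perms]


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two sketches agree."""
    matches = sum(1 for x, y in zip(sig1, sig2) if x == y)
    return matches / len(sig1)


words = {"minhash", "lsh", "sketch", "jaccard"}

# Same parameters on both sides: identical sets give an estimate of 1.0.
same = estimate_jaccard(signature(words, make_permutations(1)),
                        signature(words, make_permutations(1)))

# Different parameters: the same set no longer looks similar to itself.
mixed = estimate_jaccard(signature(words, make_permutations(1)),
                         signature(words, make_permutations(2)))
print(same, mixed)
```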

- Python
Published by ekzhu almost 9 years ago

datasketch - Introduce Lean MinHash and better documentation

  • LeanMinHash is a subclass of MinHash. It uses less memory and allows faster (de)serialization. See documentation for details.
  • Removed serialize, deserialize, and bytesize methods from MinHash. These are supported in LeanMinHash instead.
  • Serialized MinHash objects before this version will not be deserialized properly. To migrate see here.
  • Documentation now has its own website!

- Python
Published by ekzhu almost 9 years ago

datasketch - First stable release

After nearly 2 years working on this project on-and-off, the API is now stable, and the features of MinHash-related sketches are completed.

I will continue to add more data sketches and indexes.

- Python
Published by ekzhu about 9 years ago

datasketch - MinHash LSH Forest

  • MinHash LSH Forest implementation and benchmark using synthetic data
  • Improve existing MinHash LSH benchmark using synthetic data for more tunable data distributions
  • Improve MinHash and LSH performance

- Python
Published by ekzhu about 9 years ago

datasketch - Windows compatibility

  • Fixed Issue #4 - int overflow error on Windows platform
  • Use Python's built-in random number generator for better MinHash accuracy

- Python
Published by ekzhu over 9 years ago

datasketch - Functionality for removing key from LSH index

  • Add remove method for LSH index - lsh.remove(key)
  • Add membership check for LSH - key in lsh

- Python
Published by ekzhu almost 10 years ago

datasketch - Introduce Weighted MinHash and interface change

  • Add Weighted MinHash data sketch
  • Add Weighted MinHash LSH index
  • Performance and accuracy benchmark for Weighted MinHash
  • Rename digest to update for MinHash and HyperLogLog, and use bytes as input argument.
  • Make hashobj customizable through data sketch constructors
  • Add new methods for data sketches
  • Bug fixes

- Python
Published by ekzhu almost 10 years ago