hnswlib

Header-only C++/python library for fast approximate nearest neighbors

https://github.com/nmslib/hnswlib

Keywords from Contributors

distributed

Last synced: 7 months ago · JSON representation

Repository

Header-only C++/python library for fast approximate nearest neighbors

Basic Info

Host: GitHub
Owner: nmslib
License: apache-2.0
Language: C++
Default Branch: master
Homepage: https://github.com/nmslib/hnswlib
Size: 626 KB

Statistics

Stars: 4,857
Watchers: 65
Forks: 722
Open Issues: 259
Releases: 9

Created over 8 years ago · Last pushed 9 months ago

Metadata Files

Readme License

Hnswlib - fast approximate nearest neighbor search

Header-only C++ HNSW implementation with python bindings, insertions and updates.

NEWS:

version 0.8.0

Multi-vector document search and epsilon search (for now, only in C++)
By default, there is no statistic aggregation, which speeds up the multi-threaded search (it does not seem like people are using it anyway: Issue #495).
Various bugfixes and improvements
get_items now have return_type parameter, which can be either 'numpy' or 'list'

Full list of changes: https://github.com/nmslib/hnswlib/pull/523

version 0.7.0

Added support to filtering (#402, #430) by @kishorenc
Added python interface for filtering (though note its performance is limited by GIL) (#417) by @gtsoukas
Added support for replacing the elements that were marked as delete with newly inserted elements (to control the size of the index, #418) by @dyashuni
Fixed data races/deadlocks in updates/insertion, added stress test for multithreaded operation (#418) by @dyashuni
Documentation, tests, exception handling, refactoring (#375, #379, #380, #395, #396, #401, #406, #404, #409, #410, #416, #415, #431, #432, #433) by @jlmelville, @dyashuni, @kishorenc, @korzhenevski, @yoshoku, @jianshu93, @PLNech
global linkages (#383) by @MasterAler, USE_SSE usage in MSVC (#408) by @alxvth

Highlights:

1) Lightweight, header-only, no dependencies other than C++ 11 2) Interfaces for C++, Python, external support for Java and R (https://github.com/jlmelville/rcpphnsw). 3) Has full support for incremental index construction and updating the elements (thanks to the contribution by Apoorv Sharma). Has support for element deletions (by marking them in index, later can be replaced with other elements). Python index is picklable. 4) Can work with custom user defined distances (C++). 5) Significantly less memory footprint and faster build time compared to current nmslib's implementation.

Description of the algorithm parameters can be found in ALGO_PARAMS.md.

Python bindings

Supported distances:

| Distance | parameter | Equation | | ------------- |:---------------:| -----------------------:| |Squared L2 |'l2' | d = sum((Ai-Bi)^2) | |Inner product |'ip' | d = 1.0 - sum(Ai*Bi) | |Cosine similarity |'cosine' | d = 1.0 - sum(Ai*Bi) / sqrt(sum(Ai*Ai) * sum(Bi*Bi))|

Note that inner product is not an actual metric. An element can be closer to some other element than to itself. That allows some speedup if you remove all elements that are not the closest to themselves from the index.

For other spaces use the nmslib library https://github.com/nmslib/nmslib.

API description

hnswlib.Index(space, dim) creates a non-initialized index an HNSW in space space with integer dimension dim.

hnswlib.Index methods: * init_index(max_elements, M = 16, ef_construction = 200, random_seed = 100, allow_replace_deleted = False) initializes the index from with no elements. * max_elements defines the maximum number of elements that can be stored in the structure(can be increased/shrunk). * ef_construction defines a construction time/accuracy trade-off (see ALGO_PARAMS.md). * M defines tha maximum number of outgoing connections in the graph (ALGO_PARAMS.md). * allow_replace_deleted enables replacing of deleted elements with new added ones.

add_items(data, ids, num_threads = -1, replace_deleted = False) - inserts the data(numpy array of vectors, shape:N*dim) into the structure.
- num_threads sets the number of cpu threads to use (-1 means use default).
- ids are optional N-size numpy array of integer labels for all elements in data.
  - If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
- replace_deleted replaces deleted elements. Note it allows to save memory.
  - to use it init_index should be called with allow_replace_deleted=True
- Thread-safe with other add_items calls, but not with knn_query.
mark_deleted(label) - marks the element as deleted, so it will be omitted from search results. Throws an exception if it is already deleted.
unmark_deleted(label) - unmarks the element as deleted, so it will be not be omitted from search results.
resize_index(new_size) - changes the maximum capacity of the index. Not thread safe with add_items and knn_query.
set_ef(ef) - sets the query time accuracy/speed trade-off, defined by the ef parameter ( ALGO_PARAMS.md). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
knn_query(data, k = 1, num_threads = -1, filter = None) make a batch query for k closest elements for each element of the
- data (shape:N*dim). Returns a numpy array of (shape:N*k).
- num_threads sets the number of cpu threads to use (-1 means use default).
- filter filters elements by its labels, returns elements with allowed ids. Note that search with a filter works slow in python in multithreaded mode. It is recommended to set num_threads=1
- Thread-safe with other knn_query calls, but not with add_items.
load_index(path_to_index, max_elements = 0, allow_replace_deleted = False) loads the index from persistence to the uninitialized index.
- max_elements(optional) resets the maximum number of elements in the structure.
- allow_replace_deleted specifies whether the index being loaded has enabled replacing of deleted elements.
save_index(path_to_index) saves the index from persistence.
set_num_threads(num_threads) set the default number of cpu threads used during data insertion/querying.
get_items(ids, return_type = 'numpy') - returns a numpy array (shape:N*dim) of vectors that have integer identifiers specified in ids numpy vector (shape:N) if return_type is list return list of lists. Note that for cosine similarity it currently returns normalized vectors.
get_ids_list() - returns a list of all elements' ids.
get_max_elements() - returns the current capacity of the index
get_current_count() - returns the current number of element stored in the index

Read-only properties of hnswlib.Index class:

space - name of the space (can be one of "l2", "ip", or "cosine").
dim - dimensionality of the space.
M - parameter that defines the maximum number of outgoing connections in the graph.
ef_construction - parameter that controls speed/accuracy trade-off during the index construction.
max_elements - current capacity of the index. Equivalent to p.get_max_elements().
element_count - number of items in the index. Equivalent to p.get_current_count().

Properties of hnswlib.Index that support reading and writing:

ef - parameter controlling query time/accuracy trade-off.
num_threads - default number of threads to use in add_items or knn_query. Note that calling p.set_num_threads(3) is equivalent to p.num_threads=3.

Python bindings examples

See more examples here: * Creating index, inserting elements, searching, serialization/deserialization * Filtering during the search with a boolean function * Deleting the elements and reusing the memory of the deleted elements for newly added elements

An example of creating index, inserting elements, searching and pickle serialization: ```python import hnswlib import numpy as np import pickle

dim = 128 num_elements = 10000

Generating sample data

data = np.float32(np.random.random((numelements, dim))) ids = np.arange(numelements)

Declaring index

p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

Initializing index - the maximum number of elements should be known beforehand

p.initindex(maxelements = numelements, efconstruction = 200, M = 16)

Element insertion (can be called several times):

p.add_items(data, ids)

Controlling the recall by setting ef:

p.set_ef(50) # ef should always be > k

Query dataset, k - number of the closest elements (returns 2 numpy arrays)

labels, distances = p.knn_query(data, k = 1)

Index objects support pickling

WARNING: serialization via pickle.dumps(p) or p.getstate() is NOT thread-safe with p.add_items method!

Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load

p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip

Index parameters are exposed as class properties:

print(f"Parameters passed to constructor: space={pcopy.space}, dim={pcopy.dim}") print(f"Index construction: M={pcopy.M}, efconstruction={pcopy.efconstruction}") print(f"Index size is {pcopy.elementcount} and index capacity is {pcopy.maxelements}") print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}") ```

An example with updates after serialization/deserialization: ```python import hnswlib import numpy as np

dim = 16 num_elements = 10000

Generating sample data

data = np.float32(np.random.random((num_elements, dim)))

We split the data in two batches:

data1 = data[:numelements // 2] data2 = data[numelements // 2:]

Declaring index

p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

Initializing index

max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded

during insertion of an element.

The capacity can be increased by saving/loading the index, see below.

ef_construction - controls index search speed/build speed tradeoff

M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)

Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

p.initindex(maxelements=numelements//2, efconstruction=100, M=16)

Controlling the recall by setting ef:

higher ef leads to better accuracy, but slower search

p.set_ef(10)

Set number of threads used during batch search/construction

By default using all available cores

p.setnumthreads(4)

print("Adding first batch of %d elements" % (len(data1))) p.add_items(data1)

Query the elements for themselves and measure recall:

labels, distances = p.knn_query(data1, k=1) print("Recall for the first batch:", np.mean(labels.reshape(-1) == np.arange(len(data1))), "\n")

Serializing and deleting the index:

indexpath='firsthalf.bin' print("Saving index to '%s'" % indexpath) p.saveindex("first_half.bin") del p

Re-initializing, loading the index

p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")

Increase the total capacity (max_elements), so that it will handle the new data

p.loadindex("firsthalf.bin", maxelements = numelements)

print("Adding the second batch of %d elements" % (len(data2))) p.add_items(data2)

Query the elements for themselves and measure recall:

labels, distances = p.knn_query(data, k=1) print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(data))), "\n") ```

C++ examples

See examples here: * creating index, inserting elements, searching, serialization/deserialization * filtering during the search with a boolean function * deleting the elements and reusing the memory of the deleted elements for newly added elements * multithreaded usage * multivector search * epsilon search

Bindings installation

You can install from sources: bash apt-get install -y python-setuptools python-pip git clone https://github.com/nmslib/hnswlib.git cd hnswlib pip install .

or you can install via pip: pip install hnswlib

For developers

Contributions are highly welcome!

Please make pull requests against the develop branch.

When making changes please run tests (and please add a test to tests/python in case there is new functionality): bash python -m unittest discover --start-directory tests/python --pattern "bindings_test*.py"

Other implementations

Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib
Faiss library by facebook, uses own HNSW implementation for coarse quantization (python, C++): https://github.com/facebookresearch/faiss
Code for the paper "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors" (current state-of-the-art in compressed indexes, C++): https://github.com/dbaranchuk/ivf-hnsw
Amazon PECOS https://github.com/amzn/pecos
TOROS N2 (python, C++): https://github.com/kakao/n2
Online HNSW (C++): https://github.com/andrusha97/online-hnsw)
Go implementation: https://github.com/Bithack/go-hnsw
Python implementation (as a part of the clustering code by by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
Julia implmentation https://github.com/JuliaNeighbors/HNSW.jl
Java implementation: https://github.com/jelmerk/hnswlib
Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
.Net implementation: https://github.com/curiosity-ai/hnsw-sharp
CUDA implementation: https://github.com/js1010/cuhnsw
Rust implementation https://github.com/rust-cv/hnsw
Rust implementation for memory and thread safety purposes and There is A Trait to enable the user to implement its own distances. It takes as data slices of types T satisfying T:Serialize+Clone+Send+Sync.: https://github.com/jean-pierreBoth/hnswlib-rs

200M SIFT test reproduction

To download and extract the bigann dataset (from root directory): bash python tests/cpp/download_bigann.py To compile: bash mkdir build cd build cmake .. make all

To run the test on 200M SIFT subset: bash ./main

The size of the BigANN subset (in millions) is controlled by the variable subsetsizemillions hardcoded in sift_1b.cpp.

Updates test

To generate testing data (from root directory): bash cd tests/cpp python update_gen_data.py To compile (from root directory): bash mkdir build cd build cmake .. make To run test without updates (from build directory) bash ./test_updates

To run test with updates (from build directory) bash ./test_updates update

HNSW example demos

Visual search engine for 1M amazon products (MXNet + HNSW): website, code, demo by @ThomasDelteil

References

HNSW paper: @article{malkov2018efficient, title={Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs}, author={Malkov, Yu A and Yashunin, Dmitry A}, journal={IEEE transactions on pattern analysis and machine intelligence}, volume={42}, number={4}, pages={824--836}, year={2018}, publisher={IEEE} }

The update algorithm supported in this repository is to be published in "Dynamic Updates For HNSW, Hierarchical Navigable Small World Graphs" US Patent 15/929,802 by Apoorv Sharma, Abhishek Tayal and Yury Malkov.

Owner

Name: nmslib
Login: nmslib
Kind: organization

Repositories: 2
Profile: https://github.com/nmslib

GitHub Events

Total

Issues event: 19
Watch event: 502
Issue comment event: 37
Push event: 2
Pull request review comment event: 6
Pull request review event: 9
Pull request event: 31
Fork event: 98

Last Year

Issues event: 19
Watch event: 502
Issue comment event: 37
Push event: 2
Pull request review comment event: 6
Pull request review event: 9
Pull request event: 31
Fork event: 98

Committers

Last synced: 10 months ago

All Time

Total Commits: 354
Total Committers: 70
Avg Commits per committer: 5.057
Development Distribution Score (DDS): 0.751

Past Year

Commits: 1
Committers: 1
Avg Commits per committer: 1.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Yury Malkov	y**v@m**u	88
dbespalov	u**a@g**m	23
Dmitry Yashunin	y**a@y**u	21
Marek Hanuš	m**o@g**m	14
Kishore Nallan	k**c@g**m	13
Dmitry Yashunin	D**n@h**m	13
Yury Malkov	y**v@s**m	12
James Melville	j**e@g**m	12
uestc-lfs	u**s@o**m	12
Tony Kuo	t**o@c**m	11
apoorv sharma	a**s@t**m	8
louisabraham	l**m@y**r	6
Greg Roodt	g**g@c**m	6
Martin Dimitrov	m**v@i**m	6
Paul Brossier	p**m@p**g	6
alon	a**4@g**m	5
TakaakiFuruse	t**e@g**m	5
Bespalov, Dmitriy (CORP)	D**v@a**m	4
Uri Goren	u**i@g**m	4
Yury Malkov	y**v@n**m	4
LTLA	i**s@g**m	4
yoshoku	y**u@o**m	4
Taras Tsugrii	t**y@g**m	4
Alexander Vieth	a**h@t**l	3
Bo Li	l**o@b**g	3
Peter Sobot	p**t@s**m	3
ctero-graham	e**e@l**m	3
Étienne Mollier	e**r@d**g	3
Jack Wimberley	j**l@a**m	2
Yury	y**v@v**n	2
and 40 more...

Committer Domains (Top 20 + Academic)

yandex.ru: 2 verse.in: 2 bytedance.com: 2 twitter.com: 2 intel.com: 2 debian.org: 1 spotify.com: 1 broadinstitute.org: 1 tudelft.nl: 1 ntechlab.com: 1 goren4u.com: 1 adp.com: 1 piem.org: 1 canva.com: 1 contractors.roche.com: 1 samsung.com: 1 harman.com: 1 mail.ru: 1 akamai.com: 1 sberbank.ru: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 174
Total pull requests: 117
Average time to close issues: 3 months
Average time to close pull requests: about 1 month
Total issue authors: 134
Total pull request authors: 54
Average comments per issue: 2.49
Average comments per pull request: 1.55
Merged pull requests: 52
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 24
Pull requests: 34
Average time to close issues: 9 days
Average time to close pull requests: 5 days
Issue authors: 21
Pull request authors: 13
Average comments per issue: 0.83
Average comments per pull request: 0.26
Merged pull requests: 5
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

screwface106 (6)
jianshu93 (6)
shaozhixue (4)
siddhsql (4)
patelprateek (4)
Arthur-Bi (3)
DavidGOrtega (3)
chasingegg (2)
clstaudt (2)
zlwu92 (2)
yurymalkov (2)
lizhiyuell (2)
wilsonzlin (2)
davnn (2)
iamsabhoho (2)

Pull Request Authors

dyashuni (19)
michaelbautin (6)
kiplingw (6)
jlmelville (6)
kishorenc (5)
ttsugriy (5)
yurymalkov (5)
Axlgrep (5)
LockedThread (4)
jeongseophan (4)
drons (4)
sanketkedia (3)
yuhaijun999 (2)
jrade (2)
atroyn (2)

Top Labels

Issue Labels

enhancement (1) help wanted (1)

Pull Request Labels

Packages

Total packages: 4
Total downloads:
- nuget 919 total

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 1
(may contain duplicates)
Total versions: 15
Total maintainers: 1

proxy.golang.org: github.com/nmslib/hnswlib

Documentation: https://pkg.go.dev/github.com/nmslib/hnswlib#section-documentation
License: apache-2.0
Latest release: v0.8.0
published over 2 years ago

Versions: 9
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 7.0%

Average: 8.2%

Dependent repos count: 9.3%

Last synced: 8 months ago

nuget.org: hnswlib

hnswlib: Header-only C++ library for fast approximate nearest neighbors

Homepage: https://github.com/nmslib/hnswlib
License: Apache-2.0
Latest release: 0.0.0.1
published almost 7 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 919 Total

Rankings

Stargazers count: 0.5%

Forks count: 0.8%

Dependent repos count: 12.7%

Average: 15.6%

Dependent packages count: 19.5%

Downloads: 44.5%

Maintainers (1)

hani

Last synced: 8 months ago

pypi.org: prebuilt-hnsw

Prebuilt wheels for hnswlib - Hierarchical Navigable Small World Graph Library

Homepage: https://github.com/nmslib/hnswlib
Documentation: https://prebuilt-hnsw.readthedocs.io/
License: Apache Software License
Latest release: 0.8.0
published over 1 year ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Stargazers count: 1.8%

Forks count: 2.7%

Dependent packages count: 9.9%

Average: 17.5%

Dependent repos count: 55.6%

Last synced: over 1 year ago

conda-forge.org: hnswlib

Homepage: https://github.com/nmslib/hnswlib
License: Apache-2.0
Latest release: 0.6.2
published about 4 years ago

Versions: 4
Dependent Packages: 0
Dependent Repositories: 1

Rankings

Forks count: 7.8%

Stargazers count: 8.1%

Average: 22.9%

Dependent repos count: 24.1%

Dependent packages count: 51.5%

Last synced: 8 months ago

hnswlib

Science Score: 46.0%

Keywords from Contributors

Repository

Basic Info

Statistics

Metadata Files

README.md

Hnswlib - fast approximate nearest neighbor search

Highlights:

Python bindings

Supported distances:

API description

Python bindings examples

Generating sample data

Declaring index

Initializing index - the maximum number of elements should be known beforehand

Element insertion (can be called several times):

Controlling the recall by setting ef:

Query dataset, k - number of the closest elements (returns 2 numpy arrays)

Index objects support pickling

WARNING: serialization via pickle.dumps(p) or p.getstate() is NOT thread-safe with p.add_items method!

Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load

Index parameters are exposed as class properties:

Generating sample data

We split the data in two batches:

Declaring index

Initializing index

max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded

during insertion of an element.

The capacity can be increased by saving/loading the index, see below.

ef_construction - controls index search speed/build speed tradeoff

M - is tightly connected with internal dimensionality of the data. Strongly affects memory consumption (~M)

Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

Controlling the recall by setting ef:

higher ef leads to better accuracy, but slower search

Set number of threads used during batch search/construction

By default using all available cores

Query the elements for themselves and measure recall:

Serializing and deleting the index:

Re-initializing, loading the index

Increase the total capacity (max_elements), so that it will handle the new data

Query the elements for themselves and measure recall:

C++ examples

Bindings installation

For developers

Other implementations

200M SIFT test reproduction

Updates test

HNSW example demos

References

Owner

GitHub Events

Total

Last Year

Committers

All Time

Past Year

Top Committers

Committer Domains (Top 20 + Academic)

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

proxy.golang.org: github.com/nmslib/hnswlib

Rankings

nuget.org: hnswlib

Rankings

Maintainers (1)

pypi.org: prebuilt-hnsw

Rankings

conda-forge.org: hnswlib

Rankings

Dependencies