mlann

A Multilabel Classification Framework for Approximate Nearest Neighbor Search

https://github.com/ejaasaari/mlann

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.7%) to scientific vocabulary

Keywords

approximate-nearest-neighbor-search cpp machine-learning nearest-neighbor-search python vector-search

Last synced: 6 months ago · JSON representation ·

Repository

A Multilabel Classification Framework for Approximate Nearest Neighbor Search

Basic Info

Host: GitHub
Owner: ejaasaari
License: mit
Language: C++
Default Branch: main
Homepage:
Size: 1.35 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Topics

approximate-nearest-neighbor-search cpp machine-learning nearest-neighbor-search python vector-search

Created 11 months ago · Last pushed 7 months ago

Metadata Files

Readme License Citation

MLANN

Approximate nearest neighbor search library implementing the Multilabel Classification Framework (NeurIPS '22). This is a research library and will not offer state-of-the-art performance in most scenarios. However, it can be useful in extreme out-of-distribution (OOD) settings or in maximum inner product search (MIPS) where a small portion of queries have the highest inner products with most queries.

An extended version of the paper was published in Journal of Machine Learning Research (JMLR).

The original code used in the paper is available here.

Getting started

Install the Python module with pip install git+https://github.com/ejaasaari/mlann

[!TIP] On macOS, it is recommended to use the Homebrew version of Clang as the compiler:

shell script brew install llvm libomp CC=/opt/homebrew/opt/llvm/bin/clang CXX=/opt/homebrew/opt/llvm/bin/clang++ pip install git+https://github.com/ejaasaari/mlann

An example for indexing and querying a dataset using MLANN is provided below:

```python import mlann import numpy as np from sklearn.datasets import fetch_openml # scikit-learn is used only for loading the data

k = 10 trainingk = 50 # should be equal or larger to k ntrees = 10 # increase for higher recall, slower search depth = 6 # increase for lower recall, faster search voting_threshold = 5 # increase for lower recall, faster search dist = mlann.IP # or mlann.L2

for RF index, the voting threshold should be a probability:

voting_threshold = 0.000005

X, _ = fetchopenml("mnist784", version=1, returnXy=True, as_frame=False) X = np.ascontiguousarray(X, dtype=np.float32)

data = X[:30000] trainingdata = X[30000:60000]

q = X[-1]

index = mlann.MLANNIndex(data, "PCA") # one of RP, PCA, or RF knn = index.exactsearch(trainingdata, training_k, dist=dist) # required for training

index.build(trainingdata, knn, ntrees, depth)

print('Exact: ', index.exactsearch(q, k, dist=dist)) print('Approximate:', index.ann(q, k, votingthreshold, dist=dist)) ```

The following distances are available: L2, IP. Cosine distance can be used with IP by normalizing vectors.

The following index types are available: - RF: random forest - RP: random projection tree - PCA: PCA tree

On most datasets, RF will likely provide the best query performance but can be slower to build. RP will likely be the fastest to build while offering the worst query performance, and PCA is a compromise between the two.

Building an MLANN index requires a training set of queries and their k nearest neighbors. If no separate training set is available, the database vectors can be used also as the training set. The k nearest neighbors can be computed e.g. by using

index.exact_search(training_data, training_k, dist=dist)

If this is too slow, the following can be tried:

Sample a smaller training set
Use a different approximate nearest neighbor library to search for approximate nearest neighbors instead
If available, use a GPU to compute the nearest neighbors (with e.g. cuVS)

Citation

If you use the library in an academic context, please consider citing the following paper:

Hyvönen, V., Jääsaari, E., and Roos, T. "A Multilabel Classification Framework for Approximate Nearest Neighbor Search." Advances in Neural Information Processing Systems 35 (2022): 35741-35754.

~~~~ @article{hyvonen2022multilabel, title={A Multilabel Classification Framework for Approximate Nearest Neighbor Search}, author={Hyv{\"o}nen, Ville and J{\"a}{\"a}saari, Elias and Roos, Teemu}, journal={Advances in Neural Information Processing Systems}, volume={35}, pages={35741--35754}, year={2022} } ~~~~

License

MLANN is available under the MIT License (see LICENSE). Note that third-party libraries in the cpp/lib folder may be distributed under other open source licenses (see licenses).

Owner

Name: Elias Jääsaari
Login: ejaasaari
Kind: user
Company: Carnegie Mellon University

Website: https://eliasjaasaari.com
Repositories: 1
Profile: https://github.com/ejaasaari

Citation (CITATION)

@article{hyvonen2022multilabel,
  title={A Multilabel Classification Framework for Approximate Nearest Neighbor Search},
  author={Hyv{\"o}nen, Ville and J{\"a}{\"a}saari, Elias and Roos, Teemu},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={35741--35754},
  year={2022}
}

GitHub Events

Total

Public event: 1
Push event: 15

Last Year

Public event: 1
Push event: 15

Dependencies

pyproject.toml pypi

setup.py pypi

numpy *

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science