mlann
A Multilabel Classification Framework for Approximate Nearest Neighbor Search
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (12.7%) to scientific vocabulary
Keywords
Repository
A Multilabel Classification Framework for Approximate Nearest Neighbor Search
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
MLANN
Approximate nearest neighbor search library implementing the Multilabel Classification Framework (NeurIPS '22). This is a research library and will not offer state-of-the-art performance in most scenarios. However, it can be useful in extreme out-of-distribution (OOD) settings, or in maximum inner product search (MIPS) where a small portion of the database vectors have the highest inner products with most queries.
An extended version of the paper was published in Journal of Machine Learning Research (JMLR).
The original code used in the paper is available here.
Getting started
Install the Python module with `pip install git+https://github.com/ejaasaari/mlann`
> [!TIP]
> On macOS, it is recommended to use the Homebrew version of Clang as the compiler:

```shell
brew install llvm libomp
CC=/opt/homebrew/opt/llvm/bin/clang CXX=/opt/homebrew/opt/llvm/bin/clang++ pip install git+https://github.com/ejaasaari/mlann
```
An example for indexing and querying a dataset using MLANN is provided below:
```python
import mlann
import numpy as np
from sklearn.datasets import fetch_openml  # scikit-learn is used only for loading the data

k = 10
training_k = 50        # should be equal to or larger than k
n_trees = 10           # increase for higher recall, slower search
depth = 6              # increase for lower recall, faster search
voting_threshold = 5   # increase for lower recall, faster search
dist = mlann.IP        # or mlann.L2

# for an RF index, the voting threshold should be a probability:
# voting_threshold = 0.000005

X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = np.ascontiguousarray(X, dtype=np.float32)

data = X[:30000]
training_data = X[30000:60000]
q = X[-1]

index = mlann.MLANNIndex(data, "PCA")  # one of RP, PCA, or RF
knn = index.exact_search(training_data, training_k, dist=dist)  # required for training
index.build(training_data, knn, n_trees, depth)

print('Exact:      ', index.exact_search(q, k, dist=dist))
print('Approximate:', index.ann(q, k, voting_threshold, dist=dist))
```
The following distances are available: L2, IP. Cosine distance can be used with IP by normalizing vectors.
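Since cosine distance is not built in, it can be obtained with the `IP` distance by unit-normalizing all vectors first. A minimal NumPy sketch (the `normalize` helper here is illustrative, not part of the mlann API):

```python
import numpy as np

# cos(a, b) = <a, b> / (||a|| ||b||) = <a/||a||, b/||b||>,
# so the inner product of unit-normalized vectors equals their cosine similarity.
def normalize(X):
    """Scale each row of X to unit L2 norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16)).astype(np.float32)
Xn = normalize(X)

# inner product of normalized rows matches cosine similarity of the originals
a, b = X[0], X[1]
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
assert np.allclose(Xn[0] @ Xn[1], cosine, atol=1e-6)
```

Normalizing both the database vectors and the queries before indexing makes an `IP` search return cosine-nearest neighbors.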
The following index types are available:
- RF: random forest
- RP: random projection tree
- PCA: PCA tree
On most datasets, RF will likely provide the best query performance but can be slower to build. RP will likely be the fastest to build while offering the worst query performance, and PCA is a compromise between the two.
Building an MLANN index requires a training set of queries and their k nearest neighbors. If no separate training set is available, the database vectors can also be used as the training set. The k nearest neighbors can be computed, e.g., using `index.exact_search(training_data, training_k, dist=dist)`.
If this is too slow, the following can be tried:
- Sample a smaller training set
- Use a different approximate nearest neighbor library to compute approximate (rather than exact) nearest neighbors for training
- If available, use a GPU to compute the nearest neighbors (with e.g. cuVS)
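For reference, the exact k-nearest-neighbor step is a plain top-k selection over all database vectors. A self-contained NumPy sketch of the inner-product case (illustrative brute force, not the mlann API) shows what this training step has to compute:

```python
import numpy as np

def exact_knn_ip(data, queries, k):
    """Brute-force k-NN by inner product: indices of the k largest
    scores for each query row, ordered from best to worst."""
    scores = queries @ data.T                              # (n_queries, n_data)
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]   # unordered top-k
    order = np.argsort(-np.take_along_axis(scores, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)

rng = np.random.default_rng(1)
data = rng.standard_normal((1000, 32)).astype(np.float32)
queries = rng.standard_normal((5, 32)).astype(np.float32)
knn = exact_knn_ip(data, queries, k=10)
assert knn.shape == (5, 10)
```

The cost grows with n_queries × n_data × dim, which is why sampling a smaller training set or offloading the computation to a GPU helps for large corpora.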
Citation
If you use the library in an academic context, please consider citing the following paper:
Hyvönen, V., Jääsaari, E., and Roos, T. "A Multilabel Classification Framework for Approximate Nearest Neighbor Search." Advances in Neural Information Processing Systems 35 (2022): 35741-35754.
~~~~
@article{hyvonen2022multilabel,
  title={A Multilabel Classification Framework for Approximate Nearest Neighbor Search},
  author={Hyv{\"o}nen, Ville and J{\"a}{\"a}saari, Elias and Roos, Teemu},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={35741--35754},
  year={2022}
}
~~~~
License
MLANN is available under the MIT License (see LICENSE). Note that third-party libraries in the cpp/lib folder may be distributed under other open source licenses (see licenses).
Owner
- Name: Elias Jääsaari
- Login: ejaasaari
- Kind: user
- Company: Carnegie Mellon University
- Website: https://eliasjaasaari.com
- Repositories: 1
- Profile: https://github.com/ejaasaari
Citation (CITATION)
@article{hyvonen2022multilabel,
title={A Multilabel Classification Framework for Approximate Nearest Neighbor Search},
author={Hyv{\"o}nen, Ville and J{\"a}{\"a}saari, Elias and Roos, Teemu},
journal={Advances in Neural Information Processing Systems},
volume={35},
pages={35741--35754},
year={2022}
}
GitHub Events
Total
- Public event: 1
- Push event: 15
Last Year
- Public event: 1
- Push event: 15
Dependencies
- numpy *