torchpq

Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda

https://github.com/demoriarty/torchpq

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

cuda nearest-neighbor-search pytorch
Last synced: 6 months ago · JSON representation ·

Repository

Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda

Basic Info
  • Host: GitHub
  • Owner: DeMoriarty
  • License: mit
  • Language: Cuda
  • Default Branch: main
  • Homepage:
  • Size: 15.5 MB
Statistics
  • Stars: 225
  • Watchers: 5
  • Forks: 22
  • Open Issues: 6
  • Releases: 5
Topics
cuda nearest-neighbor-search pytorch
Created about 5 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Code of conduct Citation

README.md

TorchPQ

TorchPQ is a python library for Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) on GPU using Product Quantization (PQ) algorithm. TorchPQ is implemented mainly with PyTorch, with some extra CUDA kernels to accelerate clustering, indexing and searching.

Install

  • make sure you have the latest version of PyTorch installed: https://pytorch.org/get-started/locally/
  • install or upgrade to CUDA toolkit 11.0 or greater
  • install a version of CuPy library that matches your CUDA toolkit version pip install cupy-cuda110 pip install cupy-cuda111 pip install cupy-cuda112 ... for a full list of cupy-cuda versions, please go to Installation Guide
  • install TorchPQ pip install torchpq

Quick Start

IVFPQ

InVerted File Product Quantization (IVFPQ) is a type of ANN search algorithm that is designed to do fast and efficient vector search in million, or even billion scale vector sets. check the original paper for more details.

Training

```python from torchpq.index import IVFPQIndex import torch

ndata = 1000000 # number of data points dvector = 128 # dimentionality / number of features

index = IVFPQIndex( dvector=dvector, nsubvectors=64, ncells=1024, initial_size=2048, distance="euclidean", )

trainset = torch.randn(dvector, ndata, device="cuda:0") index.train(trainset) `` There are some important parameters that need to be explained: - **d_vector**: dimentionality of input vectors. there are 2 constraints ondvector: (1) it needs to be divisible bynsubvectors; (2) it needs to be a multiple of 4.* - **n_subvectors**: number of subquantizers, essentially this is the byte size of each quantized vector, 64 byte per vector in the above example.** - **n_cells**: number of coarse quantizer clusters - **initial_size**: initial capacity assigned to each voronoi cell of coarse quantizer. ncells * initialsizeis the number of vectors that can be stored initially. if any cell has reached its capacity, that cell will be automatically expanded. If you need to add vectors frequently, a larger value forinitial_size` is recommended.

Remember that the shape of any tensor that contains data points has to be [d_vector, n_data].

* the second constraint could be removed in the future
** actual byte size would be (nsubvectors+9) bytes, 8 bytes for ID and 1 byte for isempty

Adding new vectors

python baseset = torch.randn(d_vector, n_data, device="cuda:0") ids = torch.arange(n_data, device="cuda") index.add(baseset, ids=ids) Each ID in ids needs to be a unique int64 (torch.long) value that identifies a vector in x. if ids is not provided, it will be set to torch.arange(n_data, device="cuda") + previous_max_id

Removing vectors

python index.remove(ids=ids) index.remove(ids=ids) will virtually remove vectors with specified ids from storage. It ignores ids that doesn't exist.

Topk search

python index.n_probe = 32 n_query = 10000 queryset = torch.randn(d_vector, n_query, device="cuda:0") topk_values, topk_ids = index.search(queryset, k=100) - when distance="inner", topk_values are inner product of queries and topk closest data points. - when distance="euclidean", topk_values are negative squared L2 distance between queries and topk closest data points. - when distance="manhattan", topk_values are negative L1 distance between queries and topk closest data points. - when distance="cosine", topk_values are cosine similarity between queries and topk closest data points.

Encode and Decode

you can use IVFPQ as a vector codec for lossy compression of vectors python code = index.encode(queryset) # compression reconstruction = index.decode(code) # reconstruction

Save and Load

Most of the TorchPQ modules are inherited from torch.nn.Module, this means you can save and load them just like a regular pytorch model. ```python

Save to PATH

torch.save(index.state_dict(), PATH)

Load from PATH

index.loadstatedict(torch.load(PATH)) ```

Clustering

K-means

```python from torchpq.clustering import KMeans import torch

ndata = 1000000 # number of data points dvector = 128 # dimentionality / number of features x = torch.randn(dvector, ndata, device="cuda")

kmeans = KMeans(nclusters=4096, distance="euclidean") labels = kmeans.fit(x) Notice that the shape of the tensor that contains data points has to be[dvector, n_data]```, this is consistant in TorchPQ.

Multiple concurrent K-means

Sometimes, we have multiple independent datasets that need to be clustered, instead of running multiple KMeans sequentianlly, we can perform multiple kmeans concurrently with MultiKMeans ```python from torchpq.clustering import MultiKMeans import torch

ndata = 1000000 nkmeans = 16 dvector = 64 x = torch.randn(nkmeans, dvector, ndata, device="cuda") kmeans = MultiKMeans(n_clusters=256, distance="euclidean") labels = kmeans.fit(x) ```

Prediction with K-means

labels = kmeans.predict(x)

Benchmarks

Owner

  • Login: DeMoriarty
  • Kind: user

Beep boop

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Omer"
  given-names: "Sehban"
  orcid: "https://orcid.org/0000-0002-5465-5841"
title: "TorchPQ"
version: 0.3.0
doi: 10.5281/zenodo.7115607
date-released: 2021-02-08
url: "https://github.com/DeMoriarty/TorchPQ"

GitHub Events

Total
  • Watch event: 13
  • Fork event: 1
Last Year
  • Watch event: 13
  • Fork event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 289
  • Total Committers: 5
  • Avg Commits per committer: 57.8
  • Development Distribution Score (DDS): 0.017
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
DeMoriarty s****2@v****m 284
Rémi Francis f****r 2
mhamilton723 m****l@m****m 1
bezilla5 h****0@g****m 1
John Hughes j****h@s****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 18
  • Total pull requests: 8
  • Average time to close issues: 27 days
  • Average time to close pull requests: 10 days
  • Total issue authors: 18
  • Total pull request authors: 5
  • Average comments per issue: 2.44
  • Average comments per pull request: 0.25
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • nlgranger (1)
  • tgy (1)
  • SEC4SR (1)
  • Chen-Cai-OSU (1)
  • ZhangIceNight (1)
  • Songweiping (1)
  • lucidrains (1)
  • hellodrx (1)
  • Tomiinek (1)
  • mhudecheck (1)
  • abhinavvs (1)
  • JohnTailor (1)
  • bryanhx (1)
  • nmynol (1)
  • LiZhangMing (1)
Pull Request Authors
  • DeMoriarty (2)
  • francisr (2)
  • mhamilton723 (2)
  • bezilla5 (1)
  • McHughes288 (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,477 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 4
  • Total versions: 51
  • Total maintainers: 1
pypi.org: torchpq

Efficient implementations of Product Quantization and its variants

  • Versions: 51
  • Dependent Packages: 0
  • Dependent Repositories: 4
  • Downloads: 1,477 Last month
Rankings
Stargazers count: 5.2%
Dependent repos count: 7.5%
Average: 8.4%
Forks count: 8.7%
Dependent packages count: 10.0%
Downloads: 10.8%
Maintainers (1)
Last synced: 7 months ago

Dependencies

setup.py pypi
  • numpy *
  • torch *