torchpq
Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.7%) to scientific vocabulary
Keywords
Repository
Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda
Basic Info
Statistics
- Stars: 225
- Watchers: 5
- Forks: 22
- Open Issues: 6
- Releases: 5
Topics
Metadata Files
README.md
TorchPQ
TorchPQ is a Python library for Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) on GPU using the Product Quantization (PQ) algorithm. TorchPQ is implemented mainly in PyTorch, with some extra CUDA kernels to accelerate clustering, indexing and searching.
Install
- Make sure you have the latest version of PyTorch installed: https://pytorch.org/get-started/locally/
- Install or upgrade to CUDA toolkit 11.0 or greater.
- Install a version of the CuPy library that matches your CUDA toolkit version:
```
pip install cupy-cuda110
pip install cupy-cuda111
pip install cupy-cuda112
...
```
For a full list of cupy-cuda versions, please go to the Installation Guide.
- Install TorchPQ:
```
pip install torchpq
```
Quick Start
IVFPQ
InVerted File Product Quantization (IVFPQ) is a type of ANN search algorithm that is designed to do fast and efficient vector search in million- or even billion-scale vector sets. Check the original paper for more details.
Training
```python
from torchpq.index import IVFPQIndex
import torch

n_data = 1000000  # number of data points
d_vector = 128    # dimensionality / number of features

index = IVFPQIndex(
    d_vector=d_vector,
    n_subvectors=64,
    n_cells=1024,
    initial_size=2048,
    distance="euclidean",
)

trainset = torch.randn(d_vector, n_data, device="cuda:0")
index.train(trainset)
```
There are some important parameters that need to be explained:
- **d_vector**: dimensionality of input vectors. There are 2 constraints on `d_vector`: (1) it needs to be divisible by `n_subvectors`; (2) it needs to be a multiple of 4.\*
- **n_subvectors**: number of subquantizers; essentially this is the byte size of each quantized vector, 64 bytes per vector in the above example.\*\*
- **n_cells**: number of coarse quantizer clusters.
- **initial_size**: initial capacity assigned to each Voronoi cell of the coarse quantizer. `n_cells * initial_size` is the number of vectors that can be stored initially. If any cell has reached its capacity, that cell will be automatically expanded. If you need to add vectors frequently, a larger value for `initial_size` is recommended.

Remember that the shape of any tensor that contains data points has to be `[d_vector, n_data]`.

\* the second constraint could be removed in the future
\*\* actual byte size would be (n_subvectors + 9) bytes: 8 bytes for the ID and 1 byte for is_empty
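As a back-of-the-envelope check of the footnote above, here is a quick sketch (plain Python arithmetic, not TorchPQ API) of the per-vector storage and resulting compression ratio for the example parameters:

```python
# Rough storage math for the example above (plain Python, not TorchPQ API).
d_vector = 128
n_subvectors = 64

raw_bytes = d_vector * 4            # float32 input vector: 128 * 4 = 512 bytes
quantized_bytes = n_subvectors + 9  # 64 PQ codes + 8-byte ID + 1-byte is_empty = 73 bytes

print(f"compression ratio: {raw_bytes / quantized_bytes:.1f}x")  # roughly 7.0x
```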
Adding new vectors
```python
baseset = torch.randn(d_vector, n_data, device="cuda:0")
ids = torch.arange(n_data, device="cuda")
index.add(baseset, ids=ids)
```
Each ID in `ids` needs to be a unique int64 (`torch.long`) value that identifies a vector in `baseset`.
If `ids` is not provided, it will be set to `torch.arange(n_data, device="cuda") + previous_max_id`.
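A minimal sketch of that auto-ID behavior, assuming `index` and `d_vector` from the earlier snippets:

```python
# Sketch: adding a batch without explicit ids. Per the rule above,
# TorchPQ assigns arange(n_new) + previous_max_id automatically.
more_vectors = torch.randn(d_vector, 500, device="cuda:0")
index.add(more_vectors)  # ids continue from the previous maximum ID
```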
Removing vectors
```python
index.remove(ids=ids)
```
`index.remove(ids=ids)` will virtually remove vectors with the specified `ids` from storage.
It ignores ids that don't exist.
Topk search
```python
index.n_probe = 32
n_query = 10000
queryset = torch.randn(d_vector, n_query, device="cuda:0")
topk_values, topk_ids = index.search(queryset, k=100)
```
- when `distance="inner"`, `topk_values` are the inner products of queries and the top-k closest data points.
- when `distance="euclidean"`, `topk_values` are the negative squared L2 distances between queries and the top-k closest data points.
- when `distance="manhattan"`, `topk_values` are the negative L1 distances between queries and the top-k closest data points.
- when `distance="cosine"`, `topk_values` are the cosine similarities between queries and the top-k closest data points.
Encode and Decode
You can use IVFPQ as a vector codec for lossy compression of vectors:
```python
code = index.encode(queryset)        # compression
reconstruction = index.decode(code)  # reconstruction
```
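To gauge how lossy the codec is, a quick sketch that measures the relative reconstruction error, using `queryset` and `reconstruction` from the snippet above:

```python
# Sketch: relative reconstruction error of the PQ codec.
rel_error = (queryset - reconstruction).norm() / queryset.norm()
print(f"relative reconstruction error: {rel_error.item():.3f}")
```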
Save and Load
Most TorchPQ modules inherit from `torch.nn.Module`, which means you can save and load them just like a regular PyTorch model.
```python
# save to PATH
torch.save(index.state_dict(), PATH)

# load from PATH
index.load_state_dict(torch.load(PATH))
```
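As with any `torch.nn.Module`, loading a state dict requires an index constructed with the same hyperparameters. A minimal round-trip sketch, assuming the hypothetical path `ivfpq_index.pt` and the constructor arguments from the training example:

```python
# Sketch: full save/load round trip. "ivfpq_index.pt" is a hypothetical path.
torch.save(index.state_dict(), "ivfpq_index.pt")

restored = IVFPQIndex(
    d_vector=128,  # must match the saved index
    n_subvectors=64,
    n_cells=1024,
    initial_size=2048,
    distance="euclidean",
)
restored.load_state_dict(torch.load("ivfpq_index.pt"))
```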
Clustering
K-means
```python
from torchpq.clustering import KMeans
import torch

n_data = 1000000  # number of data points
d_vector = 128    # dimensionality / number of features
x = torch.randn(d_vector, n_data, device="cuda")

kmeans = KMeans(n_clusters=4096, distance="euclidean")
labels = kmeans.fit(x)
```
Notice that the shape of the tensor that contains data points has to be `[d_vector, n_data]`; this is consistent throughout TorchPQ.
Multiple concurrent K-means
Sometimes we have multiple independent datasets that need to be clustered; instead of running multiple KMeans sequentially, we can perform multiple k-means concurrently with MultiKMeans:
```python
from torchpq.clustering import MultiKMeans
import torch

n_data = 1000000  # number of data points per dataset
n_kmeans = 16     # number of independent datasets
d_vector = 64     # dimensionality / number of features
x = torch.randn(n_kmeans, d_vector, n_data, device="cuda")

kmeans = MultiKMeans(n_clusters=256, distance="euclidean")
labels = kmeans.fit(x)
```
Prediction with K-means
```python
labels = kmeans.predict(x)
```
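A short sketch of predicting cluster assignments for held-out data; `new_x` is a hypothetical batch and must use the same `[d_vector, n_data]` layout as `fit`:

```python
# Sketch: assign unseen points to the clusters learned by fit() above.
new_x = torch.randn(d_vector, 10000, device="cuda")
new_labels = kmeans.predict(new_x)  # one cluster index per column of new_x
```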
Benchmarks
Owner
- Login: DeMoriarty
- Kind: user
- Website: demoriarty.github.io
- Repositories: 5
- Profile: https://github.com/DeMoriarty
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Omer"
    given-names: "Sehban"
    orcid: "https://orcid.org/0000-0002-5465-5841"
title: "TorchPQ"
version: 0.3.0
doi: 10.5281/zenodo.7115607
date-released: 2021-02-08
url: "https://github.com/DeMoriarty/TorchPQ"
```
GitHub Events
Total
- Watch event: 13
- Fork event: 1
Last Year
- Watch event: 13
- Fork event: 1
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| DeMoriarty | s****2@v****m | 284 |
| Rémi Francis | f****r | 2 |
| mhamilton723 | m****l@m****m | 1 |
| bezilla5 | h****0@g****m | 1 |
| John Hughes | j****h@s****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 18
- Total pull requests: 8
- Average time to close issues: 27 days
- Average time to close pull requests: 10 days
- Total issue authors: 18
- Total pull request authors: 5
- Average comments per issue: 2.44
- Average comments per pull request: 0.25
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- nlgranger (1)
- tgy (1)
- SEC4SR (1)
- Chen-Cai-OSU (1)
- ZhangIceNight (1)
- Songweiping (1)
- lucidrains (1)
- hellodrx (1)
- Tomiinek (1)
- mhudecheck (1)
- abhinavvs (1)
- JohnTailor (1)
- bryanhx (1)
- nmynol (1)
- LiZhangMing (1)
Pull Request Authors
- DeMoriarty (2)
- francisr (2)
- mhamilton723 (2)
- bezilla5 (1)
- McHughes288 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 1,477 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 4
- Total versions: 51
- Total maintainers: 1
pypi.org: torchpq
Efficient implementations of Product Quantization and its variants
- Homepage: https://github.com/DeMoriarty/TorchPQ
- Documentation: https://torchpq.readthedocs.io/
- License: MIT
- Latest release: 0.1.8 (published almost 5 years ago)
Rankings
Maintainers (1)
Dependencies
- numpy *
- torch *