torchpq
Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.7%) to scientific vocabulary
Keywords
Repository
Approximate nearest neighbor search with product quantization on GPU in pytorch and cuda
Basic Info
Statistics
- Stars: 225
- Watchers: 5
- Forks: 22
- Open Issues: 6
- Releases: 5
Topics
Metadata Files
README.md
TorchPQ
TorchPQ is a Python library for Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) on GPU using the Product Quantization (PQ) algorithm. TorchPQ is implemented mainly in PyTorch, with some extra CUDA kernels to accelerate clustering, indexing and searching.
Install
- Make sure you have the latest version of PyTorch installed: https://pytorch.org/get-started/locally/
- Install or upgrade to CUDA toolkit 11.0 or greater.
- Install a version of the CuPy library that matches your CUDA toolkit version:
```
pip install cupy-cuda110
pip install cupy-cuda111
pip install cupy-cuda112
...
```
For a full list of cupy-cuda versions, please go to the Installation Guide.
- Install TorchPQ:
```
pip install torchpq
```
Quick Start
IVFPQ
InVerted File Product Quantization (IVFPQ) is a type of ANN search algorithm that is designed to do fast and efficient vector search in million- or even billion-scale vector sets. Check the original paper for more details.
Training
```python
from torchpq.index import IVFPQIndex
import torch

n_data = 1000000  # number of data points
d_vector = 128    # dimensionality / number of features

index = IVFPQIndex(
    d_vector=d_vector,
    n_subvectors=64,
    n_cells=1024,
    initial_size=2048,
    distance="euclidean",
)

trainset = torch.randn(d_vector, n_data, device="cuda:0")
index.train(trainset)
```
There are some important parameters that need to be explained:
- **d_vector**: dimensionality of input vectors. There are 2 constraints on `d_vector`: (1) it needs to be divisible by `n_subvectors`; (2) it needs to be a multiple of 4.\*
- **n_subvectors**: number of subquantizers; essentially this is the byte size of each quantized vector, 64 bytes per vector in the above example.\*\*
- **n_cells**: number of coarse quantizer clusters.
- **initial_size**: initial capacity assigned to each Voronoi cell of the coarse quantizer. `n_cells * initial_size` is the number of vectors that can be stored initially. If any cell has reached its capacity, that cell will be automatically expanded. If you need to add vectors frequently, a larger value for `initial_size` is recommended.

Remember that the shape of any tensor that contains data points has to be `[d_vector, n_data]`.

\* the second constraint could be removed in the future
\*\* actual byte size would be (n_subvectors + 9) bytes: 8 bytes for the ID and 1 byte for is_empty
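As a back-of-the-envelope check of the footnote above, here is a quick sketch (plain Python arithmetic, not TorchPQ API) of the per-vector storage and resulting compression ratio for the example parameters:

```python
# Rough storage math for the example above (plain Python, not TorchPQ API).
d_vector = 128
n_subvectors = 64

raw_bytes = d_vector * 4            # float32 input vector: 128 * 4 = 512 bytes
quantized_bytes = n_subvectors + 9  # 64 PQ codes + 8-byte ID + 1-byte is_empty = 73 bytes

print(f"compression ratio: {raw_bytes / quantized_bytes:.1f}x")  # roughly 7.0x
```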
Adding new vectors
```python
baseset = torch.randn(d_vector, n_data, device="cuda:0")
ids = torch.arange(n_data, device="cuda")
index.add(baseset, ids=ids)
```
Each ID in `ids` needs to be a unique int64 (`torch.long`) value that identifies a vector in `baseset`.
If `ids` is not provided, it will be set to `torch.arange(n_data, device="cuda") + previous_max_id`.
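A minimal sketch of that auto-ID behavior, assuming `index` and `d_vector` from the earlier snippets:

```python
# Sketch: adding a batch without explicit ids. Per the rule above,
# TorchPQ assigns arange(n_new) + previous_max_id automatically.
more_vectors = torch.randn(d_vector, 500, device="cuda:0")
index.add(more_vectors)  # ids continue from the previous maximum ID
```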
Removing vectors
```python
index.remove(ids=ids)
```
`index.remove(ids=ids)` will virtually remove vectors with the specified `ids` from storage.
It ignores ids that don't exist.
Topk search
```python
index.n_probe = 32
n_query = 10000
queryset = torch.randn(d_vector, n_query, device="cuda:0")
topk_values, topk_ids = index.search(queryset, k=100)
```
- when `distance="inner"`, `topk_values` are the inner products of queries and the top-k closest data points.
- when `distance="euclidean"`, `topk_values` are the negative squared L2 distances between queries and the top-k closest data points.
- when `distance="manhattan"`, `topk_values` are the negative L1 distances between queries and the top-k closest data points.
- when `distance="cosine"`, `topk_values` are the cosine similarities between queries and the top-k closest data points.
Encode and Decode
You can use IVFPQ as a vector codec for lossy compression of vectors:
```python
code = index.encode(queryset)        # compression
reconstruction = index.decode(code)  # reconstruction
```
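To gauge how lossy the codec is, a quick sketch that measures the relative reconstruction error, using `queryset` and `reconstruction` from the snippet above:

```python
# Sketch: relative reconstruction error of the PQ codec.
rel_error = (queryset - reconstruction).norm() / queryset.norm()
print(f"relative reconstruction error: {rel_error.item():.3f}")
```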
Save and Load
Most TorchPQ modules inherit from `torch.nn.Module`, which means you can save and load them just like a regular PyTorch model.
```python
# save to PATH
torch.save(index.state_dict(), PATH)

# load from PATH
index.load_state_dict(torch.load(PATH))
```
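As with any `torch.nn.Module`, loading a state dict requires an index constructed with the same hyperparameters. A minimal round-trip sketch, assuming the hypothetical path `ivfpq_index.pt` and the constructor arguments from the training example:

```python
# Sketch: full save/load round trip. "ivfpq_index.pt" is a hypothetical path.
torch.save(index.state_dict(), "ivfpq_index.pt")

restored = IVFPQIndex(
    d_vector=128,  # must match the saved index
    n_subvectors=64,
    n_cells=1024,
    initial_size=2048,
    distance="euclidean",
)
restored.load_state_dict(torch.load("ivfpq_index.pt"))
```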
Clustering
K-means
```python
from torchpq.clustering import KMeans
import torch

n_data = 1000000  # number of data points
d_vector = 128    # dimensionality / number of features
x = torch.randn(d_vector, n_data, device="cuda")

kmeans = KMeans(n_clusters=4096, distance="euclidean")
labels = kmeans.fit(x)
```
Notice that the shape of the tensor that contains data points has to be `[d_vector, n_data]`; this is consistent throughout TorchPQ.
Multiple concurrent K-means
Sometimes we have multiple independent datasets that need to be clustered; instead of running multiple KMeans sequentially, we can perform multiple k-means concurrently with MultiKMeans:
```python
from torchpq.clustering import MultiKMeans
import torch

n_data = 1000000  # number of data points per dataset
n_kmeans = 16     # number of independent datasets
d_vector = 64     # dimensionality / number of features
x = torch.randn(n_kmeans, d_vector, n_data, device="cuda")

kmeans = MultiKMeans(n_clusters=256, distance="euclidean")
labels = kmeans.fit(x)
```
Prediction with K-means
```python
labels = kmeans.predict(x)
```
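A short sketch of predicting cluster assignments for held-out data; `new_x` is a hypothetical batch and must use the same `[d_vector, n_data]` layout as `fit`:

```python
# Sketch: assign unseen points to the clusters learned by fit() above.
new_x = torch.randn(d_vector, 10000, device="cuda")
new_labels = kmeans.predict(new_x)  # one cluster index per column of new_x
```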
Benchmarks
Owner
- Login: DeMoriarty
- Kind: user
- Website: demoriarty.github.io
- Repositories: 5
- Profile: https://github.com/DeMoriarty
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Omer"
    given-names: "Sehban"
    orcid: "https://orcid.org/0000-0002-5465-5841"
title: "TorchPQ"
version: 0.3.0
doi: 10.5281/zenodo.7115607
date-released: 2021-02-08
url: "https://github.com/DeMoriarty/TorchPQ"
```
GitHub Events
Total
- Watch event: 13
- Fork event: 1
Last Year
- Watch event: 13
- Fork event: 1
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| DeMoriarty | s****2@v****m | 284 |
| Rémi Francis | f****r | 2 |
| mhamilton723 | m****l@m****m | 1 |
| bezilla5 | h****0@g****m | 1 |
| John Hughes | j****h@s****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 18
- Total pull requests: 8
- Average time to close issues: 27 days
- Average time to close pull requests: 10 days
- Total issue authors: 18
- Total pull request authors: 5
- Average comments per issue: 2.44
- Average comments per pull request: 0.25
- Merged pull requests: 7
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- nlgranger (1)
- tgy (1)
- SEC4SR (1)
- Chen-Cai-OSU (1)
- ZhangIceNight (1)
- Songweiping (1)
- lucidrains (1)
- hellodrx (1)
- Tomiinek (1)
- mhudecheck (1)
- abhinavvs (1)
- JohnTailor (1)
- bryanhx (1)
- nmynol (1)
- LiZhangMing (1)
Pull Request Authors
- DeMoriarty (2)
- francisr (2)
- mhamilton723 (2)
- bezilla5 (1)
- McHughes288 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: 1,477 last month (PyPI)
- Total dependent packages: 0
- Total dependent repositories: 4
- Total versions: 51
- Total maintainers: 1
pypi.org: torchpq
Efficient implementations of Product Quantization and its variants
- Homepage: https://github.com/DeMoriarty/TorchPQ
- Documentation: https://torchpq.readthedocs.io/
- License: MIT
- Latest release: 0.1.8 (published almost 5 years ago)
Rankings
Maintainers (1)
Dependencies
- numpy *
- torch *