https://github.com/cwida/pdx

⚡ Faster similarity search with PDX: A vertical data layout for vectors

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
✓
Academic publication links
Links to: arxiv.org
○
Academic email domains
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (12.1%) to scientific vocabulary

Keywords

faiss information-retrieval nearest-neighbor-search simd similarity-search vector-indexing vector-search vector-similarity-search

Last synced: 5 months ago · JSON representation

Repository

⚡ Faster similarity search with PDX: A vertical data layout for vectors

Basic Info

Host: GitHub
Owner: cwida
License: mit
Language: C++
Default Branch: main
Homepage: https://www.lkuffo.com/sub-milisecond-similarity-search-with-pdx/
Size: 193 MB

Statistics

Stars: 54
Watchers: 5
Forks: 5
Open Issues: 3
Releases: 2

Topics

faiss information-retrieval nearest-neighbor-search simd similarity-search vector-indexing vector-search vector-similarity-search

Created 12 months ago · Last pushed 6 months ago

Metadata Files

Readme License

PDX: A Transposed Data Layout for Similarity Search

PDX is a data layout that transposes vectors in a column-major order. This layout unleashes the true potential of dimension pruning, accelerating vanilla IVF indexes by factors:

PDX Layout

PDX benefits:

⚡ Up to 10x faster IVF searches than FAISS+AVX512.
⚡ Up to 30x faster exhaustive search.
🔍 Efficient filtered search.

Pruning in a nutshell
Try PDX
Use cases (comparison with FAISS)
The data layout
Roadmap

Pruning in a nutshell

Pruning means avoiding checking all the dimensions of a vector to determine if it is a neighbour of a query. The PDX layout unleashes the true potential of these algorithms (e.g., ADSampling), accelerating vanilla IVF indexes by factors.

Pruning is especially effective for large embeddings (d > 512) and when targeting high recalls (> 0.90) or nearly exact results.

Down below, you will find benchmarks against FAISS.

Try PDX

Try PDX with your data using our Python bindings and examples. We have implemented PDX on Flat (float32) and Quantized (8-bit) IVF indexes and exhaustive search settings.

Prerequisites

PDX is available for x86_64 (with AVX512), ARM, and Apple silicon
Python 3.11 or higher
FAISS with Python Bindings
Clang++17 or higher
CMake 3.26 or higher

Installation Steps

Clone and init submodules sh git clone https://github.com/cwida/PDX git submodule init git submodule update
[Optional] Install FFTW (for higher throughput) sh wget https://www.fftw.org/fftw-3.3.10.tar.gz tar -xvzf fftw-3.3.10.tar.gz cd fftw-3.3.10 ./configure --enable-float --enable-shared sudo make sudo make install ldconfig
Install Python dependencies and the bindings. sh export CXX="/usr/bin/clang++-18" # Set proper CXX first pip install -r requirements.txt python setup.py clean --all python -m pip install .
Run the examples under /examples ```sh

Creates an IVF index with FAISS on random data

Then, it compares the search performance of PDXearch and FAISS

python ./examples/pdx_simple.py ``` For more details on the available examples and how to use your own data, refer to /examples/README.md.

[!NOTE] We heavily rely on FAISS to create the underlying IVF indexes.

Use Cases and Benchmarks

We present single-threaded benchmarks against FAISS+AVX512 on an r7iz.xlarge (Intel Sapphire Rapids) instance.

Two-Level IVF (IVF₂)

IVF₂ tackles a bottleneck of IVF indexes: finding the nearest centroids. By clustering the original IVF centroids, we can use PDX to quickly scan them (thanks to pruning) without sacrificing recall. This achieves significant throughput improvements when paired with 8-bit quantization.

PDX Layout

Vanilla IVF

Here, PDX, paired with the pruning algorithm ADSampling on float32, achieves significant speedups.

PDX Layout

Exhaustive search + IVF

An exhaustive search scans all the vectors in the collection. Having an IVF index with PDX can EXTREMELY accelerate this without sacrificing recall, thanks to the reliable pruning of ADSampling.

PDX Layout

The key observation here is that thanks to the underlying IVF index, the exhaustive search starts with the most promising clusters. A tight threshold is found early on, which enables the quick pruning of most candidates.

Exhaustive search without an index

By creating random clusters with the PDX layout, you can still accelerate exhaustive search without an index. Unlike ADSampling, with BOND (our pruning algorithm), you can use the raw vectors. Gains vary depending on the distribution of the data.

PDX Layout

No pruning and no index

Even without pruning, PDX distance kernels can be faster than SIMD ones in most CPU microarchitectures. For detailed information, check Figure 3 of our publication. You can also try it yourself in our playground here.

The Data Layout

PDX is a transposed layout (a.k.a. columnar, or decomposed layout), which means that the same dimensions of different vectors are stored sequentially. This decomposition occurs within a block (e.g., a cluster in an IVF index).

We have evolved our layout from the one presented in our publication to reduce random access, and adapted it to work with 8-bit and (in the future) 1-bit vectors.

`float32`

For float32, the first 25% of the dimensions are fully decomposed. We refer to this as the "vertical block." The rest (75%) are decomposed into subvectors of 64 dimensions. We refer to this as the "horizontal block." The vertical block is used for efficient pruning, and the horizontal block is accessed on the candidates that were not pruned. This horizontal block is still decomposed every 64 dimensions. The idea behind this is that we still have a chance to prune the few remaining candidates every 64 dimensions.

The following image shows this layout. Storage is sequential from left to right, and from top to bottom.

PDX Layout F32

`8 bits`

Smaller data types are not friendly to PDX, as we must accumulate distances on wider types, resulting in asymmetry. We can work around this by changing the PDX layout. For 8 bits, the vertical block is decomposed every 4 dimensions. This allows us to use dot product instructions (VPDPBUSD in x86 and UDOT/SDOT in NEON) to calculate L2 or IP kernels while still benefiting from PDX. The horizontal block remains decomposed every 64 dimensions.

PDX Layout F32

`binary`

For Hamming/Jaccard kernels, we use a layout decomposed every 8 dimensions (naturally grouped into bytes). The population count accumulation can be done in bytes. If d > 256, we flush the popcounts into a wider type every 32 words (corresponding to 256 dimensions). This has not been implemented in this repository yet, but you can find some promising benchmarks here.

Roadmap

Out-of-core execution (disk-based setting).
Add unit tests.
Implement multi-threading capabilities.
Add PDX to the VIBE benchmark.
Adaptive quantization on 8-bit and 4-bit.
Create a documentation.

[!IMPORTANT]
PDX is an ongoing research project. In its current state, it is not production-ready code.

Benchmarking

To run our benchmark suite in C++, refer to BENCHMARKING.md.

Citation

If you use PDX for your research, consider citing us:

@article{kuffo2025pdx, title={PDX: A Data Layout for Vector Similarity Search}, author={Kuffo, Leonardo and Krippner, Elena and Boncz, Peter}, journal={Proceedings of the ACM on Management of Data}, volume={3}, number={3}, pages={1--26}, year={2025}, publisher={ACM New York, NY, USA} }

SIGMOD

The code used for the experiments presented at SIGMOD'25 can be found in the sigmod branch.

Owner

Name: CWI Database Architectures Group
Login: cwida
Kind: organization
Location: Amsterdam, The Netherlands

Website: https://www.cwi.nl/research/groups/database-architectures
Twitter: cwi_da
Repositories: 19
Profile: https://github.com/cwida

GitHub Events

Total

Create event: 5
Issues event: 3
Release event: 1
Watch event: 40
Delete event: 2
Issue comment event: 11
Member event: 1
Push event: 70
Pull request event: 7
Fork event: 3

Last Year

Create event: 5
Issues event: 3
Release event: 1
Watch event: 40
Delete event: 2
Issue comment event: 11
Member event: 1
Push event: 70
Pull request event: 7
Fork event: 3

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 1
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 4 minutes
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 4 minutes
Issue authors: 1
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

https://github.com/cwida/pdx

Science Score: 36.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

PDX: A Transposed Data Layout for Similarity Search

PDX benefits:

Contents

Pruning in a nutshell

Try PDX

Prerequisites

Installation Steps

Creates an IVF index with FAISS on random data

Then, it compares the search performance of PDXearch and FAISS

Use Cases and Benchmarks

Two-Level IVF (IVF2)

Vanilla IVF

Exhaustive search + IVF

Exhaustive search without an index

No pruning and no index

The Data Layout

float32

8 bits

binary

Roadmap

Benchmarking

Citation

SIGMOD

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Two-Level IVF (IVF₂)

`float32`

`8 bits`

`binary`