candidatevectorsearch

Searching for peptide candidates using sparse matrix + matrix/vector multiplication.

https://github.com/hgb-bin-proteomics/candidatevectorsearch

Keywords

cuda eigen engine gpu identification mass mass-spectrometry peptide peptide-identification proteomics psm search search-engine sparse spectrometry spgemm spmm spmv

Repository

Basic Info

Host: GitHub
Owner: hgb-bin-proteomics
License: mit
Language: C++
Default Branch: master
Homepage: https://hgb-bin-proteomics.github.io/CandidateVectorSearch
Size: 200 MB

Statistics

Stars: 1
Watchers: 2
Forks: 1
Open Issues: 1
Releases: 9

Created over 2 years ago · Last pushed about 1 year ago

CandidateVectorSearch

Searching for peptide candidates using sparse matrix + [sparse] vector/matrix multiplication. This is the computational backend for CandidateSearch - a search engine that aims to (quickly) identify peptide candidates for a given mass spectrum without any information about precursor mass or variable modifications. This is also the computational backend for the non-cleavable crosslink search in MS Annika.

Implements the following methods across two DLLs:
- VectorSearch.dll:
  - findTopCandidates: sparse matrix - sparse vector multiplication [f32] using Eigen.
  - findTopCandidatesInt: sparse matrix - sparse vector multiplication [i32] using Eigen.
  - findTopCandidates2: sparse matrix - dense vector multiplication [f32] using Eigen.
  - findTopCandidates2Int: sparse matrix - dense vector multiplication [i32] using Eigen.
  - findTopCandidatesBatched: sparse matrix - sparse matrix multiplication [f32] using Eigen.
  - findTopCandidatesBatchedInt: sparse matrix - sparse matrix multiplication [i32] using Eigen.
  - findTopCandidatesBatched2: sparse matrix - dense matrix multiplication [f32] using Eigen.
  - findTopCandidatesBatched2Int: sparse matrix - dense matrix multiplication [i32] using Eigen.
- VectorSearchCUDA.dll:
  - findTopCandidatesCuda: sparse matrix - dense vector multiplication [f32] using CUDA (SpMV).
  - findTopCandidatesCudaBatched: sparse matrix - sparse matrix multiplication [f32] using CUDA (SpGEMM).
  - findTopCandidatesCudaBatched2: sparse matrix - dense matrix multiplication [f32] using CUDA (SpMM).
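To illustrate the idea behind the sparse matrix - dense vector variants, here is a minimal sketch in plain C++ (standard library only, no Eigen or CUDA). It multiplies a candidate matrix stored in CSR form with a dense spectrum vector and returns the indices of the highest-scoring rows. The names CsrMatrix, spmv and topN are hypothetical and chosen for this example; they are not the functions exported by the DLLs, and the actual data layout in the library may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical CSR container: one row per peptide candidate,
// one column per discretized m/z bin. Not the library's actual layout.
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> rowOffsets;  // size rows + 1
    std::vector<std::size_t> colIndices;  // size = number of non-zeros
    std::vector<float> values;            // size = number of non-zeros
};

// y = A * x for a sparse matrix A and a dense vector x (SpMV).
// Each entry of y is the score of one candidate against the spectrum.
std::vector<float> spmv(const CsrMatrix& a, const std::vector<float>& x) {
    assert(x.size() == a.cols);
    assert(a.rowOffsets.size() == a.rows + 1);
    std::vector<float> y(a.rows, 0.0f);
    for (std::size_t r = 0; r < a.rows; ++r) {
        for (std::size_t k = a.rowOffsets[r]; k < a.rowOffsets[r + 1]; ++k) {
            y[r] += a.values[k] * x[a.colIndices[k]];
        }
    }
    return y;
}

// Indices of the n highest scores, best first.
// Ties are broken in favour of the lower index.
std::vector<std::size_t> topN(const std::vector<float>& scores, std::size_t n) {
    std::vector<std::size_t> idx(scores.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    n = std::min(n, idx.size());
    std::partial_sort(idx.begin(),
                      idx.begin() + static_cast<std::ptrdiff_t>(n),
                      idx.end(),
                      [&scores](std::size_t l, std::size_t r) {
                          if (scores[l] != scores[r]) return scores[l] > scores[r];
                          return l < r;
                      });
    idx.resize(n);
    return idx;
}
```

Calling topN(spmv(candidates, spectrum), n) on a candidate matrix and an encoded spectrum yields the n best candidate indices. The batched variants can be thought of as performing the same product for many spectra at once, with the spectra stacked into a matrix.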

VectorSearch.dll implements functions that run on the CPU, while VectorSearchCUDA.dll implements functions that run on an NVIDIA GPU using CUDA (version 12.2.0_536.25_windows).

Which functions should be used depends on the problem size and the available hardware. A general recommendation is to use findTopCandidates2 or findTopCandidates2Int on CPUs and findTopCandidatesCuda on GPUs.

Documentation

Functions are documented within the source code:
- VectorSearch.dll
- VectorSearchCUDA.dll

A better description of the input arrays is given in input.md.

Example usage, where the functions are called from a C# application, is given here (CPU) and here (GPU). A wrapper for C# is given here.

Documentation is also available on https://hgb-bin-proteomics.github.io/CandidateVectorSearch/.

Benchmarks

See benchmarks.md.

Requirements

- .NET may be required on some systems to run the DataLoader testing suite.
- [Optional] Using GPU-based approaches (i.e. anything implemented in VectorSearchCUDA.dll) requires a CUDA-capable GPU and CUDA version == 12.2.0 (download here). Other CUDA versions may or may not produce the desired results (see this issue).

Downloads

Compiled DLLs are available in the dll folder or in Releases.

We supply compiled executables and DLLs for:
- Windows 10/11 (x86, 64-bit)
- Ubuntu 22.04 (x86, 64-bit)
- macOS 14.4 (arm, 64-bit)

For other operating systems or architectures, please compile the source code yourself. Example compilation instructions for Linux and macOS can be found in linux.md and macos.md.

Limitations

Please be aware of the following limitations:
- Ions/peaks up to 5000 m/z are supported, beyond that they are discarded.
- The encoding precision is 0.01 (m/z, Dalton).
- Only matrices up to 2 * 10⁹ non-zero elements are supported [see this issue].
- [Eigen][Sparse] Sparse candidate matrices support up to 100 elements per row, beyond that matrix creation might be slow due to resizing.
  - This means every peptide candidate can be encoded with up to 100 ions.
- [Eigen][Sparse] Sparse spectrum matrices support up to 1000 elements per row, beyond that matrix creation might be slow due to resizing.
  - This means spectra with more than 1000 peaks should be deisotoped, deconvoluted or peak picked to decrease the number of peaks.
  - This does not affect dense spectrum matrices.
- [Eigen][i32] The rounding precision of converting floats to integers is 0.001, the exact rounding for a float val is (int) round(val * 1000.0f).
- [Eigen][i32] Integer-based methods do not allow tolerances below 0.01 because they might cause overflows.
- [CUDA] Sparse matrix - sparse matrix multiplication tends to be very slow and very memory hungry, most likely caused by memory overhead and the output matrix not being sparse.
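The first two limitations and the integer rounding rule can be made concrete with a small sketch. It assumes that an m/z value is mapped to the bin round(mz * 100), which gives 5000 * 100 = 500000 bins at a precision of 0.01, and that peaks outside this range are dropped. The constant and function names are hypothetical, tolerance windows are ignored, and the library's actual encoding may differ in detail (see input.md). Only the expression in toFixedPoint is taken directly from the list above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed constants derived from the limitations above (hypothetical names).
constexpr float kMaxMz = 5000.0f;      // peaks beyond this are discarded
constexpr int kBinsPerMz = 100;        // encoding precision of 0.01
constexpr int kEncodingSize = 500000;  // 5000 * 100 bins

// Map an m/z value to its bin index, or return -1 if it is out of range.
int encodeMz(float mz) {
    if (!(mz >= 0.0f) || mz >= kMaxMz) return -1;  // also rejects NaN
    const int bin = static_cast<int>(std::round(mz * kBinsPerMz));
    return bin < kEncodingSize ? bin : -1;
}

// Encode a peak list as a dense vector with 1.0 at every occupied bin.
// Simplification: no tolerance window is applied around each peak.
std::vector<float> encodeSpectrum(const std::vector<float>& peaks) {
    std::vector<float> dense(static_cast<std::size_t>(kEncodingSize), 0.0f);
    for (float mz : peaks) {
        const int bin = encodeMz(mz);
        if (bin >= 0) dense[static_cast<std::size_t>(bin)] = 1.0f;
    }
    return dense;
}

// Float to integer conversion as stated for the [i32] methods.
int toFixedPoint(float val) {
    return (int) std::round(val * 1000.0f);
}
```

With this scheme a peak at 500.25 m/z lands in bin 50025, while a peak at 6000 m/z is silently dropped, which is why spectra with relevant ions above 5000 m/z are not fully represented.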

Implementing your own matrix products

If you want to implement your own (and hopefully faster) computation for matrix products, we offer a template repository that walks you through that: CandidateVectorSearch_template

Acknowledgements

This project uses Eigen and CUDA to implement sparse linear algebra. Eigen is licensed under MPL2, and CUDA is owned by NVIDIA Corporation.
Special thanks go to the Eigen Community Discord, who helped fix a bug in the original implementation of VectorSearch::findTopCandidates.

Citing

If you are using [parts of] CandidateVectorSearch please cite this publication:

Proteome-wide non-cleavable crosslink identification with MS Annika 3.0 reveals the structure of the C. elegans Box C/D complex
Micha J. Birklbauer, Fränze Müller, Sowmya Sivakumar Geetha, Manuel Matzinger, Karl Mechtler, and Viktoria Dorfer
Communications Chemistry 2024 7 (300)
DOI: 10.1038/s42004-024-01386-x

License

MIT

Contact

micha.birklbauer@fh-hagenberg.at

Owner

Name: FHOOE Hagenberg Bioinformatics/Proteomics Research Group
Login: hgb-bin-proteomics
Kind: organization
Location: Austria

Website: https://bioinformatics.fh-hagenberg.at/
Repositories: 3
Profile: https://github.com/hgb-bin-proteomics

Bioinformatics/Proteomics Research Group of the FH OOE Hagenberg

GitHub Events

Total

Issues event: 1
Push event: 2
Pull request event: 2
Fork event: 1

Last Year

Issues event: 1
Push event: 2
Pull request event: 2
Fork event: 1
