protein-search

Semantic similarity search for proteins.

https://github.com/braceal/protein-search

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (7.0%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: braceal
License: mit
Language: Python
Default Branch: main
Size: 18.4 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 3
Releases: 0

Created over 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme, License, Citation

protein-search

Semantic similarity search for proteins.

Installation

To install the package for a GPU system, run the following commands:

```bash
git clone git@github.com:braceal/protein-search.git
cd protein-search
pip install -e .
pip install faiss-gpu==1.7.2
```

Usage

CATH Example

We provide an example of how to use the package to search the CATH database. For convenience, a pre-built index and embeddings for the CATH database are included. To use them, skip ahead to the `protein-search search-index` command.

First, to download the CATH database, run the following command:

```bash
protein-search download-dataset --dataset cath --output_dir data/cath
```

Then, to create embeddings, run the following command:

```bash
nohup python -m protein_search.distributed_inference --config examples/cath/cath_esm_8m_polaris.yaml &> nohup.out &
```

To build the search index, run the following command:

```bash
protein-search build-index \
    --fasta_dir data/cath/ \
    --embedding_dir examples/cath/cath_esm_8m_embeddings/embeddings \
    --dataset_dir examples/cath/cath_esm_8m_faiss
```
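The `build-index` command takes both a FASTA directory and an embedding directory, which suggests that each embedding is stored together with the header (tag) of its sequence. How protein-search does this internally is not described here. The sketch below only illustrates reading (tag, sequence) pairs from a FASTA file with the Python standard library. The helper name `read_fasta` is hypothetical, and the sequences are made-up placeholders.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (tag, sequence) pairs from an open FASTA file.

    Hypothetical helper for illustration; not part of protein-search.
    """
    tag, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if tag is not None:
                yield tag, "".join(chunks)
            tag, chunks = line[1:], []
        elif tag is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if tag is not None:
        yield tag, "".join(chunks)


# Placeholder sequences; only the header format mirrors the CATH example.
example = io.StringIO(
    ">cath|4_2_0|12asA00/4-330\nMKTAY\nIAKQR\n"
    ">cath|4_2_0|132lA00/2-129\nKVFGR\n"
)
records = list(read_fasta(example))
tags = [tag for tag, _ in records]
print(tags)
```

Keeping the tags in file order means that row `i` of an embedding matrix built in the same order can be mapped back to `tags[i]`.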

To search the index, run the following command:

```bash
protein-search search-index \
    --dataset_dir examples/cath/cath_esm_8m_faiss \
    --query_file examples/cath/faiss-test-cath-20.fasta \
    --top_k 1
```

This should output the following:

```console
scores: [0.01191352], indices: [0], tags: ['cath|4_2_0|12asA00/4-330']
scores: [0.03016754], indices: [1], tags: ['cath|4_2_0|132lA00/2-129']
```
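To make the three output fields concrete, here is a minimal sketch of a top-k nearest-neighbor lookup in plain Python. It is a brute-force stand-in, not the FAISS index the package builds. It assumes that scores are distances (smaller means more similar), that indices are row positions in the index, and that tags are the sequence headers stored for those rows. The source does not confirm the distance metric, so squared Euclidean distance is used purely for illustration. The function `search` and the toy data are hypothetical.

```python
def search(index_vectors, index_tags, query, top_k=1):
    """Return (scores, indices, tags) for the top_k closest vectors.

    Brute-force illustration using squared Euclidean distance;
    the metric is an assumption, not taken from protein-search.
    """
    distances = [
        sum((a - b) ** 2 for a, b in zip(vec, query))
        for vec in index_vectors
    ]
    # Sort row positions by distance and keep the closest top_k.
    order = sorted(range(len(distances)), key=distances.__getitem__)[:top_k]
    return (
        [distances[i] for i in order],
        order,
        [index_tags[i] for i in order],
    )


# Toy 3-dimensional "embeddings"; real protein embeddings are much larger.
vectors = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
tags = ["protein_a", "protein_b", "protein_c"]

scores, indices, hits = search(vectors, tags, query=[0.9, 0.1, 0.0], top_k=1)
print(f"scores: {scores}, indices: {indices}, tags: {hits}")
```

With `--top_k 1`, each query yields one line with a single score, index, and tag, which matches the shape of the output shown above.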

Swiss-Prot Example

This example demonstrates how to use the package to search the Swiss-Prot database.

A copy of the dataset is located on Polaris@ALCF here:

```bash
/lus/eagle/projects/FoundEpidem/braceal/projects/kbase-protein-search/data/sprot
```

Then, to create embeddings, run the following command:

```bash
nohup python -m protein_search.distributed_inference --config examples/sprot/sprot_esm_8m.yaml &> nohup.out &
```

To build the search index, run the following command:

```bash
protein-search build-index \
    --fasta_dir /lus/eagle/projects/FoundEpidem/braceal/projects/kbase-protein-search/data/sprot \
    --embedding_dir examples/sprot/sprot_esm_8m_embeddings/embeddings \
    --dataset_dir examples/sprot/sprot_esm_8m_faiss
```

To search the index, run the following command:

```bash
protein-search search-index \
    --dataset_dir examples/sprot/sprot_esm_8m_faiss \
    --query_file examples/sprot/faiss-test-sprot.fasta \
    --top_k 1
```

This should output the following:

```console
scores: [0.00508352], indices: [467483], tags: ['Q5HAN0']
scores: [0.01382443], indices: [467484], tags: ['Q5AYI7']
```

Converting UniProt XML to FASTA

To process the sprot XML files into FASTA files, run the following command (on lambda10):

```bash
nohup python protein_search/xml_to_fasta.py \
    --input_dir /nfs/ml_lab/projects/ml_lab/afreiburger/proteins/Uniprot/uniprot/sprot \
    --output_dir data/sprot \
    --num_workers 10 \
    --chunk_size 100 &> sprot.log &
```

To process the trembl XML files into FASTA files, run the following command (on lambda10):

```bash
nohup python protein_search/xml_to_fasta.py \
    --input_dir /nfs/ml_lab/projects/ml_lab/afreiburger/proteins/Uniprot/uniprot/trembl \
    --output_dir data/trembl \
    --num_workers 20 \
    --chunk_size 100 &> trembl.log &
```
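For readers who want to see what such a conversion involves, the following is a minimal sketch using only the standard library's `xml.etree.ElementTree`. It is not the logic of `protein_search/xml_to_fasta.py`, which is not shown here and may rely on other parsers. The sketch assumes each `<entry>` has an `<accession>` and a `<sequence>` child and that the files use the namespace `http://uniprot.org/uniprot`; check both against your actual files. The function name `xml_to_fasta` and the sample entry (accession `X00001`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed UniProt XML namespace; verify against the files you process.
NS = {"up": "http://uniprot.org/uniprot"}


def xml_to_fasta(xml_text: str) -> str:
    """Convert UniProt-style XML text into FASTA text.

    Hypothetical helper: uses the first accession of each entry as the tag.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for entry in root.findall("up:entry", NS):
        accession = entry.findtext("up:accession", namespaces=NS)
        sequence = entry.findtext("up:sequence", namespaces=NS)
        if accession and sequence:
            # Sequences may be wrapped across lines; remove all whitespace.
            lines.append(f">{accession.strip()}")
            lines.append("".join(sequence.split()))
    return "".join(line + "\n" for line in lines)


# Made-up entry for illustration only.
sample = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>X00001</accession>
    <sequence length="10">MKTAY
IAKQR</sequence>
  </entry>
</uniprot>"""

fasta_text = xml_to_fasta(sample)
print(fasta_text, end="")
```

Parsing a whole file with `ET.fromstring` loads it into memory, so for files as large as TrEMBL a streaming parser and chunked output would be needed, which is presumably why the script above exposes `--num_workers` and `--chunk_size`.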

Contributing

For development, it is recommended to use a virtual environment. The following commands will create a virtual environment, install the package in editable mode, and install the pre-commit hooks.

```bash
python3.10 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -e '.[dev,docs]'
pre-commit install
```

To test the code, run the following commands:

```bash
pre-commit run --all-files
tox -e py310
```

Owner

Name: Alex Brace
Login: braceal
Kind: user
Company: University of Chicago

Repositories: 11
Profile: https://github.com/braceal

Citation (CITATION.cff)

cff-version: 1.2.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Brace
    given-names: Alexander
    orcid: https://orcid.org/0000-0001-9873-9177
license: MIT
repository-code: https://github.com/braceal/protein-search
title: protein-search
url: https://github.com/braceal/protein-search

Dependencies

pyproject.toml pypi

accelerate ==0.26.1
beautifulsoup4 ==4.12.3
datasets ==2.17.0
lxml ==5.1.0
parsl ==2024.1.29
pydantic ==2.6.0
torch *
transformers ==4.37.1
typer[all] ==0.9.0
