Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file (found)
- ✓ codemeta.json file (found)
- ✓ .zenodo.json file (found)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (7.0%) to scientific vocabulary
Repository
Semantic similarity search for proteins.
Basic Info
- Host: GitHub
- Owner: braceal
- License: mit
- Language: Python
- Default Branch: main
- Size: 18.4 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 3
- Releases: 0
Metadata Files
README.md
protein-search
Semantic similarity search for proteins.
Installation
To install the package for a GPU system, run the following command:
bash
git clone git@github.com:braceal/protein-search.git
cd protein-search
pip install -e .
pip install faiss-gpu==1.7.2
Usage
CATH Example
We provide an example of how to use the package to search the CATH database.
For convenience, we provide a pre-built index and embeddings for the CATH database.
To use the pre-built index and embeddings, please skip to the protein-search search-index command.
First, to download the CATH database, run the following command:
bash
protein-search download-dataset --dataset cath --output_dir data/cath
Then, to create the embeddings, run the following command:
bash
nohup python -m protein_search.distributed_inference --config examples/cath/cath_esm_8m_polaris.yaml &> nohup.out &
To build the search index, run the following command:
bash
protein-search build-index --fasta_dir data/cath/ --embedding_dir examples/cath/cath_esm_8m_embeddings/embeddings --dataset_dir examples/cath/cath_esm_8m_faiss
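To make the role of the index concrete, here is a minimal sketch of what a similarity index does conceptually: given a query embedding, it returns the closest stored embeddings together with their tags. This is not the package's implementation. It is a brute-force search in plain Python, and the squared Euclidean metric, the toy vectors, and the tag names are all assumptions chosen for illustration. FAISS performs this kind of lookup far more efficiently on real embeddings.

```python
def top_k_nearest(query, database, k=1):
    """Return the k (distance, index) pairs closest to `query`.

    Assumption: squared Euclidean distance, where smaller means more similar.
    The metric used by the actual index may differ.
    """
    scored = []
    for idx, vec in enumerate(database):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # ascending: nearest first
    return scored[:k]


# Toy 3-dimensional "embeddings"; real protein embeddings are much larger.
database = [
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
tags = ["protein_a", "protein_b", "protein_c"]  # hypothetical labels

query = [0.9, 0.1, 0.0]
hits = top_k_nearest(query, database, k=1)
for dist, idx in hits:
    print(f"score: {dist:.4f}, index: {idx}, tag: {tags[idx]}")
```

The index built by `build-index` plays the role of `database` and `tags` here, stored on disk so that it does not have to be recomputed for every query.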
To search the index, run the following command:
bash
protein-search search-index --dataset_dir examples/cath/cath_esm_8m_faiss --query_file examples/cath/faiss-test-cath-20.fasta --top_k 1
This should output the following:
console
scores: [0.01191352], indices: [0], tags: ['cath|4_2_0|12asA00/4-330']
scores: [0.03016754], indices: [1], tags: ['cath|4_2_0|132lA00/2-129']
Swiss-Prot Example
This example demonstrates how to use the package to search the Swiss-Prot database.
A copy of the dataset is located on Polaris@ALCF here:
bash
/lus/eagle/projects/FoundEpidem/braceal/projects/kbase-protein-search/data/sprot
Then, to create the embeddings, run the following command:
bash
nohup python -m protein_search.distributed_inference --config examples/sprot/sprot_esm_8m.yaml &> nohup.out &
To build the search index, run the following command:
bash
protein-search build-index --fasta_dir /lus/eagle/projects/FoundEpidem/braceal/projects/kbase-protein-search/data/sprot --embedding_dir examples/sprot/sprot_esm_8m_embeddings/embeddings --dataset_dir examples/sprot/sprot_esm_8m_faiss
To search the index, run the following command:
bash
protein-search search-index --dataset_dir examples/sprot/sprot_esm_8m_faiss --query_file examples/sprot/faiss-test-sprot.fasta --top_k 1
This should output the following:
console
scores: [0.00508352], indices: [467483], tags: ['Q5HAN0']
scores: [0.01382443], indices: [467484], tags: ['Q5AYI7']
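The `--query_file` arguments in both examples point to FASTA files. To search with your own sequences, you need a FASTA file of the same shape. Below is a minimal sketch, using only the standard library, of writing and reading such a file. The function names, the tags, and the sequences are made up for illustration, and the exact header format the package expects is an assumption.

```python
import tempfile
from pathlib import Path


def write_fasta(records, path, width=60):
    """Write (tag, sequence) pairs as FASTA, wrapping sequences at `width`."""
    with open(path, "w") as fh:
        for tag, seq in records:
            fh.write(f">{tag}\n")
            for start in range(0, len(seq), width):
                fh.write(seq[start:start + width] + "\n")


def read_fasta(path):
    """Read a FASTA file back into a list of (tag, sequence) pairs."""
    records, tag, chunks = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if tag is not None:
                    records.append((tag, "".join(chunks)))
                tag, chunks = line[1:], []
            else:
                chunks.append(line)
    if tag is not None:
        records.append((tag, "".join(chunks)))
    return records


# Made-up tags and sequences, for illustration only.
records = [
    ("query_1", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("query_2", "MSFVKSHFSRQLEERLGLIEVQ"),
]
query_path = Path(tempfile.mkdtemp()) / "my_queries.fasta"
write_fasta(records, query_path, width=10)
print(query_path.read_text())
```

The resulting path could then be passed as `--query_file` to `protein-search search-index`.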
Converting UniProt XML to FASTA
To convert the Swiss-Prot (sprot) XML files to FASTA files, run the following command (on lambda10):
bash
nohup python protein_search/xml_to_fasta.py --input_dir /nfs/ml_lab/projects/ml_lab/afreiburger/proteins/Uniprot/uniprot/sprot --output_dir data/sprot --num_workers 10 --chunk_size 100 &> sprot.log &
To convert the TrEMBL (trembl) XML files to FASTA files, run the following command (on lambda10):
bash
nohup python protein_search/xml_to_fasta.py --input_dir /nfs/ml_lab/projects/ml_lab/afreiburger/proteins/Uniprot/uniprot/trembl --output_dir data/trembl --num_workers 20 --chunk_size 100 &> trembl.log &
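For readers unfamiliar with the conversion, the following minimal sketch shows the general idea: each `entry` element in a UniProt XML file contributes one FASTA record. This is not the code in `protein_search/xml_to_fasta.py`. The function names, the sample XML, and the accessions `X00001` and `X00002` are hypothetical, and using the first accession as the FASTA header is an assumption based on the bare accession tags in the Swiss-Prot search output above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}entry' -> 'entry'."""
    return tag.rsplit("}", 1)[-1]


def entries_to_fasta(xml_text):
    """Convert UniProt-style XML text to a FASTA string (illustrative only)."""
    root = ET.fromstring(xml_text)
    lines = []
    for entry in root:
        if local_name(entry.tag) != "entry":
            continue
        accession, sequence = None, None
        # Only look at direct children so nested <sequence> references are ignored.
        for child in entry:
            name = local_name(child.tag)
            if name == "accession" and accession is None and child.text:
                accession = child.text.strip()
            elif name == "sequence" and child.text and child.text.strip():
                sequence = "".join(child.text.split())  # drop embedded whitespace
        if accession and sequence:
            lines.append(f">{accession}\n{sequence}\n")
    return "".join(lines)


# Tiny made-up sample; real UniProt files are large and better streamed
# with xml.etree.ElementTree.iterparse.
sample_xml = """
<uniprot xmlns="http://uniprot.org/uniprot">
  <entry dataset="Swiss-Prot">
    <accession>X00001</accession>
    <name>EXAMPLE_1</name>
    <sequence length="10">MKTAYIAKQR</sequence>
  </entry>
  <entry dataset="Swiss-Prot">
    <accession>X00002</accession>
    <accession>X00003</accession>
    <sequence length="10">MSFVK
      SHFSR</sequence>
  </entry>
</uniprot>
"""

fasta = entries_to_fasta(sample_xml)
print(fasta, end="")
```

The `--num_workers` and `--chunk_size` options in the commands above suggest that the actual script processes the files in parallel chunks, which this sketch does not attempt.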
Contributing
For development, it is recommended to use a virtual environment. The following commands will create a virtual environment, install the package in editable mode, and install the pre-commit hooks.
bash
python3.10 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -e '.[dev,docs]'
pre-commit install
To test the code, run the following command:
bash
pre-commit run --all-files
tox -e py310
Owner
- Name: Alex Brace
- Login: braceal
- Kind: user
- Company: University of Chicago
- Repositories: 11
- Profile: https://github.com/braceal
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Brace
    given-names: Alexander
    orcid: https://orcid.org/0000-0001-9873-9177
license: MIT
repository-code: https://github.com/braceal/protein-search
title: protein-search
url: https://github.com/braceal/protein-search
Dependencies
- accelerate ==0.26.1
- beautifulsoup4 ==4.12.3
- datasets ==2.17.0
- lxml ==5.1.0
- parsl ==2024.1.29
- pydantic ==2.6.0
- torch *
- transformers ==4.37.1
- typer [all]==0.9.0