https://github.com/braceal/protein_search_evals
Protein search project
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.7%) to scientific vocabulary
Repository
Protein search project
Basic Info
- Host: GitHub
- Owner: braceal
- License: mit
- Language: Python
- Default Branch: main
- Size: 159 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
protein_search_evals
Protein search project
Installation
To install the package, run the following command:
```bash
git clone git@github.com:braceal/protein_search_evals.git
cd protein_search_evals
pip install -U pip setuptools wheel
pip install -e .
```
To install Faiss with GPU support for CUDA 12, run the following command:
```bash
pip install faiss-gpu-cu12
```
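After installing, it can help to confirm that Faiss actually sees a GPU. The following is a minimal sketch, not part of the package: the helper name `faiss_gpu_count` is hypothetical, and it assumes the installed build exposes `faiss.get_num_gpus` (CPU-only builds may not, in which case the helper returns `None`).

```python
import importlib


def faiss_gpu_count():
    """Return the number of GPUs Faiss can see, or None if unavailable.

    Hypothetical helper for illustration; not part of protein_search_evals.
    """
    try:
        faiss = importlib.import_module("faiss")
    except ImportError:
        return None  # Faiss is not installed in this environment
    # CPU-only builds may not expose get_num_gpus, so look it up defensively.
    get_num_gpus = getattr(faiss, "get_num_gpus", None)
    return get_num_gpus() if callable(get_num_gpus) else None


if __name__ == "__main__":
    count = faiss_gpu_count()
    if count is None:
        print("Faiss GPU support not detected")
    else:
        print(f"Faiss sees {count} GPU(s)")
```

A result of `0` would suggest that Faiss imported correctly but no CUDA device is visible to the process.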
For ESMC, you can install the following packages and model weights:
```bash
pip uninstall transformers
pip install 'transformers<4.48.2'
pip install esm
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download EvolutionaryScale/esmc-300m-2024-12
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download EvolutionaryScale/esmc-600m-2024-12
```
For ESM2 with faesm, you can install the following packages:
```bash
pip install flash-attn --no-build-isolation
pip install "faesm[flash_attn]"
```
Note: requires CUDA 11.7 or later.
Alternatively, to forgo flash attention and use SDPA (scaled dot-product attention) instead:
```bash
pip install faesm
```
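The flash-attention install path is stated to require CUDA 11.7 or later. One way to check this before building `flash-attn` is to parse the output of `nvcc --version`. This is a rough sketch under the assumption that `nvcc` is on `PATH` and prints its usual `release X.Y` line; the function names are hypothetical and not part of the package.

```python
import re
import shutil
import subprocess

MIN_CUDA = (11, 7)  # minimum stated in the faesm flash-attention note


def parse_cuda_release(nvcc_output):
    """Extract (major, minor) from `nvcc --version` text, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cuda_is_sufficient(nvcc_output, minimum=MIN_CUDA):
    """True when the reported CUDA release is at least `minimum`."""
    release = parse_cuda_release(nvcc_output)
    return release is not None and release >= minimum


if __name__ == "__main__":
    if shutil.which("nvcc") is None:
        print("nvcc not found on PATH")
    else:
        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True
        ).stdout
        print("CUDA OK" if cuda_is_sufficient(out) else "CUDA too old or not detected")
```

If the toolkit is older than required or `nvcc` is missing, the SDPA-only `pip install faesm` route avoids building flash attention altogether.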
Building the datasets
The Pfam20 benchmark dataset can be built using the following command:
```bash
python -m protein_search_evals.datasets.pfam
```
The Radical SAM benchmark dataset can be built using the following command:
```bash
tar -zxvf data/radicalsam.tar.gz -C data
python -m protein_search_evals.datasets.radicalsam
```
Running the embedding computation
To compute the embeddings for the Pfam20 dataset using ESM2-3B with faesm, run the following command:
```bash
nohup python -m protein_search_evals.distributed_embeddings --config examples/pfam/embedding_configs/esm2-3B-faesm.yaml &> nohup.log &
```
Modify the YAML file to use different models or datasets.
Computing embeddings on Polaris
Create a new conda environment with the following commands:
```bash
qsub -I -l select=1 -l filesystems=home:eagle -l walltime=1:00:00 -q debug -A FoundEpidem
module use /soft/modulefiles; module load conda
conda create -n protein_search_evals_03_25 python=3.12 -y
conda activate protein_search_evals_03_25
```
Then install the package and dependencies:
```bash
git clone git@github.com:braceal/protein_search_evals.git
cd protein_search_evals
pip install -U pip setuptools wheel
pip install -e .
pip install flash-attn --no-build-isolation
pip install "faesm[flash_attn]"
pip install faiss-gpu-cu12
```
Then run the embedding computation for SwissProt:
```bash
qsub examples/swissprot/submit.sh
```
To run the embedding computation for TrEMBL:
```bash
qsub examples/trembl/submit.sh
```
See the `examples/swissprot` and `examples/trembl` directories for more configuration details.
Merging embeddings
To combine embeddings from multiple workflow runs, you can use symlinks:
```bash
SRCDIR=/path/to/sprot-embeddings/esm3-3Bfaesmembeddings/embeddings
DSTDIR=/path/to/combined_embeddings
mkdir -p "$DSTDIR"
for dir in "$SRCDIR"/*; do
    ln -s "$(realpath "$dir")" "$DSTDIR/$(basename "$dir")"
done
```
Replace `SRCDIR` and `DSTDIR` with the paths to the embeddings you want to combine.
You can run the command for multiple SRCDIRs to merge embeddings from multiple runs.
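When several source directories are involved, the same symlinking step can be scripted in Python. The sketch below is an illustration only: `link_embedding_dirs` is a hypothetical helper, not part of the package. Unlike the shell loop, it stops with an error if two runs contain an entry with the same name, since the second link would otherwise collide with the first.

```python
from pathlib import Path


def link_embedding_dirs(src_dirs, dst_dir):
    """Symlink every entry of each source directory into dst_dir.

    Hypothetical helper for illustration; not part of protein_search_evals.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    linked = []
    for src in map(Path, src_dirs):
        for entry in sorted(src.iterdir()):
            target = dst / entry.name
            if target.exists() or target.is_symlink():
                # Refuse to overwrite: two runs produced the same entry name.
                raise FileExistsError(f"{target} already exists")
            # Link to the absolute path, mirroring `realpath` in the shell loop.
            target.symlink_to(entry.resolve())
            linked.append(target)
    return linked
```

For example, `link_embedding_dirs(["/path/to/run1/embeddings", "/path/to/run2/embeddings"], "/path/to/combined_embeddings")` would populate the combined directory in one call; the paths here are placeholders.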
Once you have all the embeddings in the same directory, you can run the following command to merge
them into a single Arrow file:
```bash
protein_search_evals merge --dataset_dir /path/to/combined_embeddings/ --output_dir /path/to/combined_embeddings.merge
```
Contributing
For development, it is recommended to use a virtual environment. The following
commands will create a virtual environment, install the package in editable
mode, and install the pre-commit hooks.
```bash
python -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -e '.[dev,docs]'
pre-commit install
```
To test the code, run the following command:
```bash
pre-commit run --all-files
tox -e py310
```
Owner
- Name: Alex Brace
- Login: braceal
- Kind: user
- Company: University of Chicago
- Repositories: 11
- Profile: https://github.com/braceal
GitHub Events
Total
- Delete event: 9
- Public event: 1
- Push event: 66
- Pull request event: 16
- Create event: 8
Last Year
- Delete event: 9
- Public event: 1
- Push event: 66
- Pull request event: 16
- Create event: 8
Dependencies
- biopython >=1.85
- datasets >=3.3.2
- h5py >=3.13.0
- natsort >=8.4.0
- numba >=0.61.0
- parsl >=2025.3.10
- parsl-object-registry @ git+https://github.com/braceal/parsl_object_registry.git
- pydantic >=2.10.6
- sentence-transformers >=3.4.1
- sentencepiece >=0.2.0
- torch >=2.6.0
- tqdm >=4.67.1
- transformers >=4.49.0
- typer >=0.15.2
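To see which of these dependencies are present in the current environment, the standard library's `importlib.metadata` can report installed versions. This is a small sketch with a hypothetical helper name; it only reports versions and does not compare them against the minimums listed above. It may be useful because the ESMC instructions pin `transformers<4.48.2` while this list requires `transformers >=4.49.0`, so the installed version depends on which steps were run last.

```python
from importlib import metadata

# Distribution names copied from the dependency list above.
REQUIRED = [
    "biopython", "datasets", "h5py", "natsort", "numba", "parsl",
    "parsl-object-registry", "pydantic", "sentence-transformers",
    "sentencepiece", "torch", "tqdm", "transformers", "typer",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'not installed'}")
```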