https://github.com/braceal/protein_search_evals

Protein search project


Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 9 months ago · JSON representation

Repository

Protein search project

Basic Info
  • Host: GitHub
  • Owner: braceal
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 159 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed 10 months ago
Metadata Files
Readme License

README.md

protein_search_evals

Protein search project

Installation

To install the package, run the following commands:

```bash
git clone git@github.com:braceal/protein_search_evals.git
cd protein_search_evals
pip install -U pip setuptools wheel
pip install -e .
```

To install Faiss with GPU support for CUDA 12, run the following command:

```bash
pip install faiss-gpu-cu12
```

For ESMC, you can install the following packages and model weights:

```bash
pip uninstall transformers
pip install 'transformers<4.48.2'
pip install esm
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download EvolutionaryScale/esmc-300m-2024-12
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download EvolutionaryScale/esmc-600m-2024-12
```

For ESM2 with faesm, you can install the following packages:

```bash
pip install flash-attn --no-build-isolation
pip install faesm[flash_attn]
```

Note: requires CUDA 11.7 or later.

Or, if you want to forgo flash attention and use SDPA instead:

```bash
pip install faesm
```
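Because Faiss, ESMC, and faesm are installed separately from the package, it can help to check which of them are importable before starting a long run. The following is a minimal sketch using only the standard library. The helper name `check_optional_modules` is hypothetical, and the import names are assumptions based on the pip packages listed above (for example, `faiss-gpu-cu12` is assumed to be imported as `faiss`).

```python
from importlib.util import find_spec

# Assumed import names for the optional packages installed above.
# The import name can differ from the pip name (e.g. flash-attn -> flash_attn).
OPTIONAL_MODULES = ("faiss", "esm", "faesm", "flash_attn")


def check_optional_modules(names=OPTIONAL_MODULES):
    """Return a mapping of module name -> whether Python can locate it."""
    # find_spec locates a top-level module without importing it.
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_optional_modules().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

A module reported as missing only means it was not found in the current environment. It does not verify that a found module matches the required CUDA version.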

Building the datasets

The Pfam20 benchmark dataset can be built using the following command:

```bash
python -m protein_search_evals.datasets.pfam
```

The Radical SAM benchmark dataset can be built using the following commands:

```bash
tar -zxvf data/radicalsam.tar.gz -C data
python -m protein_search_evals.datasets.radicalsam
```

Running the embedding computation

To compute the embeddings for the Pfam20 dataset using ESM2-3B with faesm, run the following command:

```bash
nohup python -m protein_search_evals.distributed_embeddings --config examples/pfam/embedding_configs/esm2-3B-faesm.yaml &> nohup.log &
```

Modify the YAML file to use different models or datasets.

Computing embeddings on Polaris

Create a new conda environment with the following commands:

```bash
qsub -I -l select=1 -l filesystems=home:eagle -l walltime=1:00:00 -q debug -A FoundEpidem
module use /soft/modulefiles; module load conda
conda create -n protein_search_evals_03_25 python=3.12 -y
conda activate protein_search_evals_03_25
```

Then install the package and dependencies:

```bash
git clone git@github.com:braceal/protein_search_evals.git
cd protein_search_evals
pip install -U pip setuptools wheel
pip install -e .
pip install flash-attn --no-build-isolation
pip install faesm[flash_attn]
pip install faiss-gpu-cu12
```

Then run the embedding computation for SwissProt:

```bash
qsub examples/swissprot/submit.sh
```

To run the embedding computation for TrEMBL:

```bash
qsub examples/trembl/submit.sh
```

See the `examples/swissprot` and `examples/trembl` directories for more configuration details.

Merging embeddings

To combine embeddings from multiple workflow runs, you can use symlinks:

```bash
SRCDIR=/path/to/sprot-embeddings/esm3-3Bfaesmembeddings/embeddings
DSTDIR=/path/to/combined_embeddings

mkdir -p "$DSTDIR"
for dir in "$SRCDIR"/*; do
    ln -s "$(realpath "$dir")" "$DSTDIR/$(basename "$dir")"
done
```

Replace `SRCDIR` and `DSTDIR` with the paths to the embeddings you want to combine. You can run the commands once per `SRCDIR` to merge embeddings from multiple runs.
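The symlink loop above can also be sketched in Python, which may be convenient when combining many runs from a script. This is an illustrative equivalent, not part of the package: the function name `link_embeddings` is hypothetical. Unlike `ln -s`, which reports an error on a name collision, this sketch skips entries whose name already exists in the destination.

```python
from pathlib import Path


def link_embeddings(src_dir, dst_dir):
    """Symlink each entry of src_dir into dst_dir and return the new links."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # same role as `mkdir -p`
    created = []
    for entry in sorted(src.iterdir()):
        link = dst / entry.name
        if link.exists() or link.is_symlink():
            # Assumption: keep the link from an earlier run on a name clash.
            continue
        # Resolve to an absolute path, mirroring `realpath` in the shell loop.
        link.symlink_to(entry.resolve())
        created.append(link)
    return created
```

Calling `link_embeddings` once per source directory, with the same destination each time, mirrors running the shell loop for multiple `SRCDIR` values.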

Once you have all the embeddings in the same directory, you can run the following command to merge them into a single Arrow file:

```bash
protein_search_evals merge --dataset_dir /path/to/combined_embeddings/ --output_dir /path/to/combined_embeddings.merge
```

Contributing

For development, it is recommended to use a virtual environment. The following commands will create a virtual environment, install the package in editable mode, and install the pre-commit hooks.

```bash
python -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -e '.[dev,docs]'
pre-commit install
```

To test the code, run the following commands:

```bash
pre-commit run --all-files
tox -e py310
```

Owner

  • Name: Alex Brace
  • Login: braceal
  • Kind: user
  • Company: University of Chicago

GitHub Events

Total
  • Delete event: 9
  • Public event: 1
  • Push event: 66
  • Pull request event: 16
  • Create event: 8
Last Year
  • Delete event: 9
  • Public event: 1
  • Push event: 66
  • Pull request event: 16
  • Create event: 8

Dependencies

pyproject.toml pypi
  • biopython >=1.85
  • datasets >=3.3.2
  • h5py >=3.13.0
  • natsort >=8.4.0
  • numba >=0.61.0
  • parsl >=2025.3.10
  • parsl-object-registry @ git+https://github.com/braceal/parsl_object_registry.git
  • pydantic >=2.10.6
  • sentence-transformers >=3.4.1
  • sentencepiece >=0.2.0
  • torch >=2.6.0
  • tqdm >=4.67.1
  • transformers >=4.49.0
  • typer >=0.15.2