multicom_ligand

Comprehensive ensembling of protein-ligand structure and affinity prediction methods (CASP16)

https://github.com/bioinfomachinelearning/multicom_ligand

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com, science.org, acs.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (17.0%) to scientific vocabulary

Keywords

binding-affinity deep-learning diffusion-model drug-discovery flow-matching pose-prediction protein-ligand-interactions

Repository

Basic Info
  • Host: GitHub
  • Owner: BioinfoMachineLearning
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 94.2 MB
Statistics
  • Stars: 7
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 1 year ago · Last pushed 11 months ago

README.md

# MULTICOM_ligand

[![Paper](http://img.shields.io/badge/arXiv-2405.14108-B31B1B.svg)](https://www.authorea.com/users/885651/articles/1263768-protein-ligand-structure-and-affinity-prediction-in-casp16-using-a-geometric-deep-learning-ensemble-and-flow-matching?commit=81affaec47d6d23f58d02a1e7bbbdfcb09935ef7)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.11477766.svg)](https://doi.org/10.5281/zenodo.11477766)
[![PyPI version](https://badge.fury.io/py/multicom_ligand.svg)](https://badge.fury.io/py/multicom_ligand)
[![Project Status: Active The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![Docs](https://assets.readthedocs.org/static/projects/badges/passing-flat.svg)](https://bioinfomachinelearning.github.io/MULTICOM_ligand/)
Config: Hydra
Code style: black
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Description

Comprehensive ensembling of protein-ligand structure and affinity prediction methods

Documentation

Contents

Installation

### Portable installation

To reuse modules and utilities within `MULTICOM_ligand` in other projects, one can simply use `pip`

```bash
pip install multicom_ligand
```

### Full installation

To reproduce, customize, or extend the `MULTICOM_ligand` benchmark, we recommend fully installing `MULTICOM_ligand` using `mamba` as follows:

First, install `mamba` for dependency management (as a fast alternative to Anaconda)

```bash
wget "https://github.com/conda-forge/miniforge/releases/download/24.11.3-0/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh  # accept all terms and install to the default location
rm Miniforge3-$(uname)-$(uname -m).sh  # (optionally) remove installer after using it
source ~/.bashrc  # alternatively, one can restart their shell session to achieve the same result
```

Install dependencies for each method's environment (as desired)

```bash
# clone project
sudo apt-get install git-lfs  # NOTE: run this if you have not already installed `git-lfs`
git lfs install
git clone https://github.com/BioinfoMachineLearning/MULTICOM_ligand --recursive
cd MULTICOM_ligand

# create conda environments (~80 GB total)
# - MULTICOM_ligand environment (~15 GB)
mamba env create -f environments/multicom_ligand_environment.yaml
conda activate MULTICOM_ligand  # NOTE: one still needs to use `conda` to (de)activate environments
pip3 install -e .
pip3 install numpy==1.26.4 --no-dependencies
pip3 install prody==2.4.1 --no-dependencies
# - casp15_ligand_scoring environment (~3 GB)
mamba env create -f environments/casp15_ligand_scoring_environment.yaml
conda activate casp15_ligand_scoring  # NOTE: one still needs to use `conda` to (de)activate environments
pip3 install -e .
# - DiffDock environment (~13 GB)
mamba env create -f environments/diffdock_environment.yaml --prefix forks/DiffDock/DiffDock/
conda activate forks/DiffDock/DiffDock/ && pip3 install pyg-lib -f https://data.pyg.org/whl/torch-2.1.0+cu118.html  # NOTE: one still needs to use `conda` to (de)activate environments
# - FABind environment (~6 GB)
mamba env create -f environments/fabind_environment.yaml --prefix forks/FABind/FABind/
conda activate forks/FABind/FABind/  # NOTE: one still needs to use `conda` to (de)activate environments
# - DynamicBind environment (~13 GB)
mamba env create -f environments/dynamicbind_environment.yaml --prefix forks/DynamicBind/DynamicBind/
conda activate forks/DynamicBind/DynamicBind/ && pip3 install pyg-lib -f https://data.pyg.org/whl/torch-2.1.0+cu118.html  # NOTE: one still needs to use `conda` to (de)activate environments
# - NeuralPLexer environment (~14 GB)
mamba env create -f environments/neuralplexer_environment.yaml --prefix forks/NeuralPLexer/NeuralPLexer/
conda activate forks/NeuralPLexer/NeuralPLexer/  # NOTE: one still needs to use `conda` to (de)activate environments
cd forks/NeuralPLexer/ && pip3 install -e . && cd ../../
# - RoseTTAFold-All-Atom environment (~14 GB) - NOTE: after running these commands, follow the installation instructions in `forks/RoseTTAFold-All-Atom/README.md` starting at Step 4 (with `forks/RoseTTAFold-All-Atom/` as the current working directory)
mamba env create -f environments/rfaa_environment.yaml --prefix forks/RoseTTAFold-All-Atom/RFAA/
conda activate forks/RoseTTAFold-All-Atom/RFAA/  # NOTE: one still needs to use `conda` to (de)activate environments
cd forks/RoseTTAFold-All-Atom/rf2aa/SE3Transformer/ && pip3 install --no-cache-dir -r requirements.txt && python3 setup.py install && cd ../../../../
# - AutoDock Vina Tools environment (~1 GB)
mamba env create -f environments/adfr_environment.yaml --prefix forks/Vina/ADFR/
conda activate forks/Vina/ADFR/  # NOTE: one still needs to use `conda` to (de)activate environments
# - P2Rank (~0.5 GB)
wget -P forks/P2Rank/ https://github.com/rdk/p2rank/releases/download/2.4.2/p2rank_2.4.2.tar.gz
tar -xzf forks/P2Rank/p2rank_2.4.2.tar.gz -C forks/P2Rank/
rm forks/P2Rank/p2rank_2.4.2.tar.gz
```

Download checkpoints (~8.25 GB total)

```bash
# DynamicBind checkpoint (~0.25 GB)
cd forks/DynamicBind/
wget https://zenodo.org/records/10137507/files/workdir.zip
unzip workdir.zip
rm workdir.zip
cd ../../

# NeuralPLexer checkpoint (~6.5 GB)
cd forks/NeuralPLexer/
wget https://zenodo.org/records/10373581/files/neuralplexermodels_downstream_datasets_predictions.zip
unzip neuralplexermodels_downstream_datasets_predictions.zip
rm neuralplexermodels_downstream_datasets_predictions.zip
cd ../../

# RoseTTAFold-All-Atom checkpoint (~1.5 GB)
cd forks/RoseTTAFold-All-Atom/
wget http://files.ipd.uw.edu/pub/RF-All-Atom/weights/RFAA_paper_weights.pt
cd ../../
```
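
Since the environments listed above total roughly 80 GB and the checkpoints add about 8.25 GB, it may be worth confirming free disk space before starting a full installation. The following is a minimal sketch and is not part of the repository: the function name `has_free_gb` and the 90 GB threshold (the two documented totals, rounded up) are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with MULTICOM_ligand): succeed only if DIR
# has at least REQUIRED_GB gigabytes available.
has_free_gb() {
  local dir="$1" required_gb="$2" avail_kb
  # `df -Pk` prints POSIX-format output in 1024-byte blocks; column 4 of the
  # second line is the available space for the filesystem holding DIR.
  avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
  [ "$avail_kb" -ge $(( required_gb * 1024 * 1024 )) ]
}

# Assumed threshold: ~80 GB of environments + ~8.25 GB of checkpoints, rounded up.
if has_free_gb . 90; then
  echo "enough free space for a full installation"
else
  echo "less than 90 GB free in $(pwd)"
fi
```

If only a subset of methods is needed, the threshold can be lowered to the sum of the sizes noted next to the corresponding environments.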

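Because each method's environment is optional, it can be handy to see which prefix-based environments already exist in a checkout. This is a hedged sketch: `report_env_prefixes` is a hypothetical helper, and it only checks that the prefix directories named in the installation commands exist, not that the environments are complete. The named environments `MULTICOM_ligand` and `casp15_ligand_scoring` are not prefix-based, so they are not covered here (`conda env list` shows those).

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which prefix-based conda environments from the
# installation steps are present under a repository checkout (default: ".").
report_env_prefixes() {
  local root="${1:-.}" prefix
  for prefix in \
    forks/DiffDock/DiffDock \
    forks/FABind/FABind \
    forks/DynamicBind/DynamicBind \
    forks/NeuralPLexer/NeuralPLexer \
    forks/RoseTTAFold-All-Atom/RFAA \
    forks/Vina/ADFR; do
    if [ -d "$root/$prefix" ]; then
      echo "found   $prefix"
    else
      echo "missing $prefix"
    fi
  done
}

report_env_prefixes .
```
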
Tutorials

We provide a two-part tutorial series of Jupyter notebooks with examples of how to extend `MULTICOM_ligand`, as outlined below.

1. [Adding a new dataset](https://github.com/BioinfoMachineLearning/MULTICOM_ligand/blob/main/notebooks/adding_new_dataset_tutorial.ipynb)
2. [Adding a new method](https://github.com/BioinfoMachineLearning/MULTICOM_ligand/blob/main/notebooks/adding_new_method_tutorial.ipynb)

How to prepare MULTICOM_ligand data

### Downloading Astex, PoseBusters, DockGen, and CASP15 data

```bash
# fetch, extract, and clean-up preprocessed Astex Diverse, PoseBusters Benchmark, DockGen, and CASP15 data (~3 GB)
wget https://zenodo.org/records/11477766/files/astex_diverse_set.tar.gz
wget https://zenodo.org/records/11477766/files/posebusters_benchmark_set.tar.gz
wget https://zenodo.org/records/11477766/files/dockgen_set.tar.gz
wget https://zenodo.org/records/11477766/files/casp15_set.tar.gz
tar -xzf astex_diverse_set.tar.gz
tar -xzf posebusters_benchmark_set.tar.gz
tar -xzf dockgen_set.tar.gz
tar -xzf casp15_set.tar.gz
rm astex_diverse_set.tar.gz
rm posebusters_benchmark_set.tar.gz
rm dockgen_set.tar.gz
rm casp15_set.tar.gz
```

### Downloading benchmark method predictions

```bash
# fetch, extract, and clean-up benchmark method predictions to reproduce paper results (~19 GB)
# DiffDock predictions and results
wget https://zenodo.org/records/11477766/files/diffdock_benchmark_method_predictions.tar.gz
tar -xzf diffdock_benchmark_method_predictions.tar.gz
rm diffdock_benchmark_method_predictions.tar.gz
# FABind predictions and results
wget https://zenodo.org/records/11477766/files/fabind_benchmark_method_predictions.tar.gz
tar -xzf fabind_benchmark_method_predictions.tar.gz
rm fabind_benchmark_method_predictions.tar.gz
# DynamicBind predictions and results
wget https://zenodo.org/records/11477766/files/dynamicbind_benchmark_method_predictions.tar.gz
tar -xzf dynamicbind_benchmark_method_predictions.tar.gz
rm dynamicbind_benchmark_method_predictions.tar.gz
# NeuralPLexer predictions and results
wget https://zenodo.org/records/11477766/files/neuralplexer_benchmark_method_predictions.tar.gz
tar -xzf neuralplexer_benchmark_method_predictions.tar.gz
rm neuralplexer_benchmark_method_predictions.tar.gz
# RoseTTAFold-All-Atom predictions and results
wget https://zenodo.org/records/11477766/files/rfaa_benchmark_method_predictions.tar.gz
tar -xzf rfaa_benchmark_method_predictions.tar.gz
rm rfaa_benchmark_method_predictions.tar.gz
# TULIP predictions and results
wget https://zenodo.org/records/11477766/files/tulip_benchmark_method_predictions.tar.gz
tar -xzf tulip_benchmark_method_predictions.tar.gz
rm tulip_benchmark_method_predictions.tar.gz
# AutoDock Vina predictions and results
wget https://zenodo.org/records/11477766/files/vina_benchmark_method_predictions.tar.gz
tar -xzf vina_benchmark_method_predictions.tar.gz
rm vina_benchmark_method_predictions.tar.gz
# Astex Diverse, PoseBusters Benchmark (w/ pocket-only results), DockGen, and CASP15 consensus ensemble predictions and results
wget https://zenodo.org/records/11477766/files/astex_diverse_ensemble_benchmark_method_predictions.tar.gz
wget https://zenodo.org/records/11477766/files/posebusters_benchmark_ensemble_benchmark_method_predictions.tar.gz
wget https://zenodo.org/records/11477766/files/dockgen_ensemble_benchmark_method_predictions.tar.gz
wget https://zenodo.org/records/11477766/files/casp15_ensemble_benchmark_method_predictions.tar.gz
tar -xzf astex_diverse_ensemble_benchmark_method_predictions.tar.gz
tar -xzf posebusters_benchmark_ensemble_benchmark_method_predictions.tar.gz
tar -xzf dockgen_ensemble_benchmark_method_predictions.tar.gz
tar -xzf casp15_ensemble_benchmark_method_predictions.tar.gz
rm astex_diverse_ensemble_benchmark_method_predictions.tar.gz
rm posebusters_benchmark_ensemble_benchmark_method_predictions.tar.gz
rm dockgen_ensemble_benchmark_method_predictions.tar.gz
rm casp15_ensemble_benchmark_method_predictions.tar.gz
```

**NOTE:** One can reproduce the _pocket-only_ experiments with the PoseBusters Benchmark set by adding the argument `pocket_only_baseline=true` to each command below used to run PoseBusters Benchmark dataset inference with all the baseline methods, since the pocket-only versions of the dataset's holo-aligned predicted protein structures have also been included in the downloadable Zenodo archive `posebusters_benchmark_set.tar.gz` referenced above. However, be aware that one then needs to _rename_ any existing directories containing PoseBusters Benchmark dataset inference results for each baseline method, to prevent these existing inference directories from being merged with new pocket-only results. Please see the config files within `configs/data/`, `configs/model/`, and `configs/analysis/` for more details.

### Downloading sequence databases (required only for RoseTTAFold-All-Atom inference)

```bash
# acquire multiple sequence alignment databases for RoseTTAFold-All-Atom (~2.5 TB)
cd forks/RoseTTAFold-All-Atom/

# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06

# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd

# structure templates (including *_a3m.ffdata, *_a3m.ffindex)
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz
cd ../../
```

### Predicting apo protein structures using ESMFold

First create all the corresponding FASTA files for each protein sequence

```bash
python3 multicom_ligand/data/components/esmfold_fasta_preparation.py dataset=posebusters_benchmark
python3 multicom_ligand/data/components/esmfold_fasta_preparation.py dataset=astex_diverse
```

To generate the apo version of each protein structure, create ESMFold-ready versions of the combined FASTA files prepared above by the script `esmfold_fasta_preparation.py` for the PoseBusters Benchmark and Astex Diverse sets, respectively

```bash
python3 multicom_ligand/data/components/esmfold_sequence_preparation.py dataset=posebusters_benchmark
python3 multicom_ligand/data/components/esmfold_sequence_preparation.py dataset=astex_diverse
```

Then, predict each apo protein structure using ESMFold's batch inference script

```bash
python3 multicom_ligand/data/components/esmfold_batch_structure_prediction.py -i data/posebusters_benchmark_set/posebusters_benchmark_esmfold_sequences.fasta -o data/posebusters_benchmark_set/posebusters_benchmark_esmfold_structures --skip-existing
python3 multicom_ligand/data/components/esmfold_batch_structure_prediction.py -i data/astex_diverse_set/astex_diverse_esmfold_sequences.fasta -o data/astex_diverse_set/astex_diverse_esmfold_structures --skip-existing
```

**NOTE:** Having a CUDA-enabled device available when running ESMFold is highly recommended

**NOTE:** ESMFold may not be able to predict apo protein structures for a handful of exceedingly-long (e.g., >2000 token) input sequences

Lastly, align each apo protein structure to its corresponding holo protein structure counterpart in the PoseBusters Benchmark or Astex Diverse set, taking ligand conformations into account during each alignment

```bash
python3 multicom_ligand/data/components/esmfold_apo_to_holo_alignment.py dataset=posebusters_benchmark num_workers=1
python3 multicom_ligand/data/components/esmfold_apo_to_holo_alignment.py dataset=astex_diverse num_workers=1
```

**NOTE:** The preprocessed DockGen and CASP15 data available via [Zenodo](https://doi.org/10.5281/zenodo.11477766) provide pre-holo-aligned predicted protein structures for these respective datasets.
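
The Zenodo downloads in this section all repeat the same fetch, extract, and remove pattern, which could be wrapped in a small loop. The sketch below is illustrative only: `fetch_archive` and the `DRY_RUN` switch are hypothetical names, and in the default dry-run mode the commands are printed rather than executed so they can be reviewed first.

```bash
#!/usr/bin/env bash
# Base URL taken from the download commands in this section.
ZENODO_BASE="https://zenodo.org/records/11477766/files"

# Hypothetical helper: fetch, extract, and remove one archive. With DRY_RUN=1
# (the default in this sketch) the commands are only printed, not executed.
fetch_archive() {
  local archive="$1.tar.gz"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "wget $ZENODO_BASE/$archive"
    echo "tar -xzf $archive"
    echo "rm $archive"
  else
    # Stop at the first failing step so a partial download is not deleted.
    wget "$ZENODO_BASE/$archive" && tar -xzf "$archive" && rm "$archive"
  fi
}

for name in astex_diverse_set posebusters_benchmark_set dockgen_set casp15_set; do
  fetch_archive "$name"
done
```

Setting `DRY_RUN=0` before the loop would run the real commands; the same helper would apply to the `*_benchmark_method_predictions` archives by passing their names without the `.tar.gz` suffix.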

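The four ESMFold steps described above (FASTA preparation, sequence preparation, batch structure prediction, and apo-to-holo alignment) use the same file-naming scheme for both datasets. As a rough sketch, the commands can be generated per dataset; `esmfold_pipeline_commands` is a hypothetical helper that only prints the commands from this section and runs nothing itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the four ESMFold preparation commands for one
# dataset (posebusters_benchmark or astex_diverse) without executing them.
esmfold_pipeline_commands() {
  local dataset="$1"
  local scripts="multicom_ligand/data/components"
  local data_dir="data/${dataset}_set"
  echo "python3 $scripts/esmfold_fasta_preparation.py dataset=$dataset"
  echo "python3 $scripts/esmfold_sequence_preparation.py dataset=$dataset"
  echo "python3 $scripts/esmfold_batch_structure_prediction.py -i $data_dir/${dataset}_esmfold_sequences.fasta -o $data_dir/${dataset}_esmfold_structures --skip-existing"
  echo "python3 $scripts/esmfold_apo_to_holo_alignment.py dataset=$dataset num_workers=1"
}

for dataset in posebusters_benchmark astex_diverse; do
  esmfold_pipeline_commands "$dataset"
done
```

Printing first makes it easy to check the input and output paths before committing GPU time to the structure prediction step.
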
Available inference methods

### Methods available individually

#### Fixed Protein Methods

| Name            | Source                                                                | Astex Benchmarked | PoseBusters Benchmarked | DockGen Benchmarked | CASP Benchmarked |
| --------------- | --------------------------------------------------------------------- | ----------------- | ----------------------- | ------------------- | ---------------- |
| `DiffDock`      | [Corso et al.](https://openreview.net/forum?id=UfBIxpTK10)            |                   |                         |                     |                  |
| `FABind`        | [Pei et al.](https://openreview.net/forum?id=PnWakgg1RL)              |                   |                         |                     |                  |
| `AutoDock Vina` | [Eberhardt et al.](https://pubs.acs.org/doi/10.1021/acs.jcim.1c00203) |                   |                         |                     |                  |
| `TULIP`         |                                                                       |                   |                         |                     |                  |

#### Flexible Protein Methods

| Name                   | Source                                                                | Astex Benchmarked | PoseBusters Benchmarked | DockGen Benchmarked | CASP Benchmarked |
| ---------------------- | --------------------------------------------------------------------- | ----------------- | ----------------------- | ------------------- | ---------------- |
| `DynamicBind`          | [Lu et al.](https://www.nature.com/articles/s41467-024-45461-2)       |                   |                         |                     |                  |
| `NeuralPLexer`         | [Qiao et al.](https://www.nature.com/articles/s42256-024-00792-z)     |                   |                         |                     |                  |
| `RoseTTAFold-All-Atom` | [Krishna et al.](https://www.science.org/doi/10.1126/science.adl2528) |                   |                         |                     |                  |

### Methods available for ensembling

#### Fixed Protein Methods

| Name            | Source                                                                | Astex Benchmarked | PoseBusters Benchmarked | DockGen Benchmarked | CASP Benchmarked |
| --------------- | --------------------------------------------------------------------- | ----------------- | ----------------------- | ------------------- | ---------------- |
| `DiffDock`      | [Corso et al.](https://openreview.net/forum?id=UfBIxpTK10)            |                   |                         |                     |                  |
| `AutoDock Vina` | [Eberhardt et al.](https://pubs.acs.org/doi/10.1021/acs.jcim.1c00203) |                   |                         |                     |                  |
| `TULIP`         |                                                                       |                   |                         |                     |                  |

#### Flexible Protein Methods

| Name                   | Source                                                                | Astex Benchmarked | PoseBusters Benchmarked | DockGen Benchmarked | CASP Benchmarked |
| ---------------------- | --------------------------------------------------------------------- | ----------------- | ----------------------- | ------------------- | ---------------- |
| `DynamicBind`          | [Lu et al.](https://www.nature.com/articles/s41467-024-45461-2)       |                   |                         |                     |                  |
| `NeuralPLexer`         | [Qiao et al.](https://www.nature.com/articles/s42256-024-00792-z)     |                   |                         |                     |                  |
| `RoseTTAFold-All-Atom` | [Krishna et al.](https://www.science.org/doi/10.1126/science.adl2528) |                   |                         |                     |                  |

**NOTE**: Have a new method to add? Please let us know by creating a pull request. We would be happy to work with you to integrate new methodology into this benchmark!

How to run inference with individual methods
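
Most of the per-method walkthroughs in this section repeat the same inference, relaxation, and analysis commands across datasets, with `...` standing in for further `repeat_index` values. The sketch below prints those commands for a method and a chosen number of repeats. It is illustrative only: `print_method_runs` is a hypothetical helper, the repeat count of 3 is an assumption (the commands in this section only hint at "e.g., 2, 3, ..."), and the pattern as written matches the `diffdock`, `fabind`, and `dynamicbind` commands; `neuralplexer` and `rfaa` use different relaxation flags and an extra alignment step.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the inference, relaxation, and analysis commands
# for one method across datasets and repeat indices. Nothing is executed.
print_method_runs() {
  local method="$1" num_repeats="$2" dataset repeat
  for dataset in posebusters_benchmark astex_diverse dockgen; do
    for (( repeat = 1; repeat <= num_repeats; repeat++ )); do
      echo "python3 multicom_ligand/models/${method}_inference.py dataset=$dataset repeat_index=$repeat"
      echo "python3 multicom_ligand/models/inference_relaxation.py method=$method dataset=$dataset remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=$repeat"
      echo "python3 multicom_ligand/analysis/inference_analysis.py method=$method dataset=$dataset repeat_index=$repeat"
    done
  done
}

# Assumed repeat count of 3; adjust to match the number of repeats you intend to run.
print_method_runs diffdock 3
```

Input preparation is left out because it runs once per dataset rather than once per repeat, and each method's commands are meant to be run inside that method's own environment.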

### How to run inference with `DiffDock`

Prepare CSV input files

```bash
python3 multicom_ligand/data/diffdock_input_preparation.py dataset=posebusters_benchmark
python3 multicom_ligand/data/diffdock_input_preparation.py dataset=astex_diverse
python3 multicom_ligand/data/diffdock_input_preparation.py dataset=dockgen
python3 multicom_ligand/data/diffdock_input_preparation.py dataset=casp15 input_data_dir="$PWD"/data/casp15_set/targets input_protein_structure_dir="$PWD"/data/casp15_set/predicted_structures
```

Run inference on each dataset

```bash
python3 multicom_ligand/models/diffdock_inference.py dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/models/diffdock_inference.py dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/models/diffdock_inference.py dataset=dockgen repeat_index=1
...
python3 multicom_ligand/models/diffdock_inference.py dataset=casp15 batch_size=1 repeat_index=1
...
```

Relax the generated ligand structures inside of their respective protein pockets

```bash
python3 multicom_ligand/models/inference_relaxation.py method=diffdock dataset=posebusters_benchmark remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=diffdock dataset=astex_diverse remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=diffdock dataset=dockgen remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
```

**NOTE**: Increase `num_processes` according to your available CPU/GPU resources to improve throughput

Analyze inference results for each dataset

```bash
python3 multicom_ligand/analysis/inference_analysis.py method=diffdock dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=diffdock dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=diffdock dataset=dockgen repeat_index=1
...
```

Analyze inference results for the CASP15 dataset

```bash
# first assemble (unrelaxed and post ranking-relaxed) CASP15-compliant prediction submission files for scoring
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[diffdock\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_diffdock_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=false export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[diffdock\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_diffdock_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=true export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
# NOTE: the suffixes for both `output_dir` and `ensemble_benchmarking_repeat_index` should be modified to e.g., 2, 3, ...
...
# now score the CASP15-compliant submissions using the official CASP scoring pipeline
python3 multicom_ligand/analysis/inference_analysis_casp.py method=diffdock dataset=casp15 repeat_index=1
...
```

### How to run inference with `FABind`

Prepare CSV input files

```bash
python3 multicom_ligand/data/fabind_input_preparation.py dataset=posebusters_benchmark
python3 multicom_ligand/data/fabind_input_preparation.py dataset=astex_diverse
python3 multicom_ligand/data/fabind_input_preparation.py dataset=dockgen
```

Run inference on each dataset

```bash
python3 multicom_ligand/models/fabind_inference.py dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/models/fabind_inference.py dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/models/fabind_inference.py dataset=dockgen repeat_index=1
...
```

Relax the generated ligand structures inside of their respective protein pockets

```bash
python3 multicom_ligand/models/inference_relaxation.py method=fabind dataset=posebusters_benchmark remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=fabind dataset=astex_diverse remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=fabind dataset=dockgen remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
```

**NOTE**: Increase `num_processes` according to your available CPU/GPU resources to improve throughput

Analyze inference results for each dataset

```bash
python3 multicom_ligand/analysis/inference_analysis.py method=fabind dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=fabind dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=fabind dataset=dockgen repeat_index=1
...
```

### How to run inference with `DynamicBind`

Prepare CSV input files

```bash
python3 multicom_ligand/data/dynamicbind_input_preparation.py dataset=posebusters_benchmark
python3 multicom_ligand/data/dynamicbind_input_preparation.py dataset=astex_diverse
python3 multicom_ligand/data/dynamicbind_input_preparation.py dataset=dockgen
python3 multicom_ligand/data/dynamicbind_input_preparation.py dataset=casp15 input_data_dir="$PWD"/data/casp15_set/targets
```

Run inference on each dataset

```bash
python3 multicom_ligand/models/dynamicbind_inference.py dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/models/dynamicbind_inference.py dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/models/dynamicbind_inference.py dataset=dockgen repeat_index=1
...
python3 multicom_ligand/models/dynamicbind_inference.py dataset=casp15 batch_size=1 input_data_dir="$PWD"/data/casp15_set/predicted_structures repeat_index=1
...
```

Relax the generated ligand structures inside of their respective protein pockets

```bash
python3 multicom_ligand/models/inference_relaxation.py method=dynamicbind dataset=posebusters_benchmark remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=dynamicbind dataset=astex_diverse remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=dynamicbind dataset=dockgen remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
```

**NOTE**: Increase `num_processes` according to your available CPU/GPU resources to improve throughput

Analyze inference results for each dataset

```bash
python3 multicom_ligand/analysis/inference_analysis.py method=dynamicbind dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=dynamicbind dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=dynamicbind dataset=dockgen repeat_index=1
...
```

Analyze inference results for the CASP15 dataset

```bash
# first assemble (unrelaxed and post ranking-relaxed) CASP15-compliant prediction submission files for scoring
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[dynamicbind\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_dynamicbind_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=false export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[dynamicbind\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_dynamicbind_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=true export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
# NOTE: the suffixes for both `output_dir` and `ensemble_benchmarking_repeat_index` should be modified to e.g., 2, 3, ...
...
# now score the CASP15-compliant submissions using the official CASP scoring pipeline
python3 multicom_ligand/analysis/inference_analysis_casp.py method=dynamicbind dataset=casp15 repeat_index=1
...
```

### How to run inference with `NeuralPLexer`

Prepare CSV input files

```bash
python3 multicom_ligand/data/neuralplexer_input_preparation.py dataset=posebusters_benchmark
python3 multicom_ligand/data/neuralplexer_input_preparation.py dataset=astex_diverse
python3 multicom_ligand/data/neuralplexer_input_preparation.py dataset=dockgen
python3 multicom_ligand/data/neuralplexer_input_preparation.py dataset=casp15 input_data_dir="$PWD"/data/casp15_set/targets input_receptor_structure_dir="$PWD"/data/casp15_set/predicted_structures
```

Run inference on each dataset

```bash
python3 multicom_ligand/models/neuralplexer_inference.py dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/models/neuralplexer_inference.py dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/models/neuralplexer_inference.py dataset=dockgen repeat_index=1
...
python3 multicom_ligand/models/neuralplexer_inference.py dataset=casp15 repeat_index=1
...
```

Relax the generated ligand structures inside of their respective protein pockets

```bash
python3 multicom_ligand/models/inference_relaxation.py method=neuralplexer dataset=posebusters_benchmark num_processes=1 remove_initial_protein_hydrogens=true assign_partial_charges_manually=true cache_files=false repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=neuralplexer dataset=astex_diverse num_processes=1 remove_initial_protein_hydrogens=true assign_partial_charges_manually=true cache_files=false repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=neuralplexer dataset=dockgen num_processes=1 remove_initial_protein_hydrogens=true assign_partial_charges_manually=true cache_files=false repeat_index=1
...
```

**NOTE**: Increase `num_processes` according to your available CPU/GPU resources to improve throughput

Align predicted protein-ligand structures to ground-truth complex structures

```bash
python3 multicom_ligand/analysis/complex_alignment.py method=neuralplexer dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/analysis/complex_alignment.py method=neuralplexer dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/analysis/complex_alignment.py method=neuralplexer dataset=dockgen repeat_index=1
...
```

Analyze inference results for each dataset

```bash
python3 multicom_ligand/analysis/inference_analysis.py method=neuralplexer dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=neuralplexer dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=neuralplexer dataset=dockgen repeat_index=1
...
```

Analyze inference results for the CASP15 dataset

```bash
# first assemble (unrelaxed and post ranking-relaxed) CASP15-compliant prediction submission files for scoring
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[neuralplexer\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_neuralplexer_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=false export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[neuralplexer\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_neuralplexer_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=true export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
# NOTE: the suffixes for both `output_dir` and `ensemble_benchmarking_repeat_index` should be modified to e.g., 2, 3, ...
...
# now score the CASP15-compliant submissions using the official CASP scoring pipeline
python3 multicom_ligand/analysis/inference_analysis_casp.py method=neuralplexer dataset=casp15 repeat_index=1
...
```

### How to run inference with `RoseTTAFold-All-Atom`

Prepare CSV input files

```bash
python3 multicom_ligand/data/rfaa_input_preparation.py dataset=posebusters_benchmark
python3 multicom_ligand/data/rfaa_input_preparation.py dataset=astex_diverse
python3 multicom_ligand/data/rfaa_input_preparation.py dataset=dockgen
python3 multicom_ligand/data/rfaa_input_preparation.py dataset=casp15 input_data_dir="$PWD"/data/casp15_set/targets
```

Run inference on each dataset

```bash
conda activate forks/RoseTTAFold-All-Atom/RFAA/
python3 multicom_ligand/models/rfaa_inference.py dataset=posebusters_benchmark run_inference_directly=true
python3 multicom_ligand/models/rfaa_inference.py dataset=astex_diverse run_inference_directly=true
python3 multicom_ligand/models/rfaa_inference.py dataset=dockgen run_inference_directly=true
python3 multicom_ligand/models/rfaa_inference.py dataset=casp15 run_inference_directly=true
conda deactivate
```

Extract predictions into separate files for proteins and ligands

```bash
python3 multicom_ligand/data/rfaa_output_extraction.py dataset=posebusters_benchmark
python3 multicom_ligand/data/rfaa_output_extraction.py dataset=astex_diverse
python3 multicom_ligand/data/rfaa_output_extraction.py dataset=dockgen
python3 multicom_ligand/data/rfaa_output_extraction.py dataset=casp15
```

Relax the generated ligand structures inside of their respective protein pockets

```bash
python3 multicom_ligand/models/inference_relaxation.py method=rfaa dataset=posebusters_benchmark num_processes=1 remove_initial_protein_hydrogens=true
python3 multicom_ligand/models/inference_relaxation.py method=rfaa dataset=astex_diverse num_processes=1 remove_initial_protein_hydrogens=true
python3 multicom_ligand/models/inference_relaxation.py method=rfaa dataset=dockgen num_processes=1 remove_initial_protein_hydrogens=true
```

**NOTE**: Increase `num_processes` according to your available CPU/GPU resources to improve throughput

Align predicted protein-ligand structures to ground-truth complex structures

```bash
python3 multicom_ligand/analysis/complex_alignment.py method=rfaa dataset=posebusters_benchmark
python3 multicom_ligand/analysis/complex_alignment.py method=rfaa dataset=astex_diverse
python3 multicom_ligand/analysis/complex_alignment.py method=rfaa dataset=dockgen
```

Analyze inference results for each dataset

```bash
python3 multicom_ligand/analysis/inference_analysis.py method=rfaa dataset=posebusters_benchmark
python3 multicom_ligand/analysis/inference_analysis.py method=rfaa dataset=astex_diverse
python3 multicom_ligand/analysis/inference_analysis.py method=rfaa dataset=dockgen
```

Analyze inference results for the CASP15 dataset

```bash
# first assemble (unrelaxed and post ranking-relaxed) CASP15-compliant prediction submission files for scoring
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[rfaa\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_rfaa_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=false export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[rfaa\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_rfaa_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=true export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
# NOTE: the suffixes for both `output_dir` and `ensemble_benchmarking_repeat_index` should be modified to e.g., 2, 3, ...
...
# now score the CASP15-compliant submissions using the official CASP scoring pipeline
python3 multicom_ligand/analysis/inference_analysis_casp.py method=rfaa dataset=casp15 targets="[T1124, T1127v2, T1146, T1152, T1158v1, T1158v2, T1158v3, T1158v4, T1186, T1187, T1188]" repeat_index=1
...
```

### How to run inference with `AutoDock Vina`

Prepare CSV input files

```bash
cp forks/DiffDock/inference/diffdock_posebusters_benchmark_inputs.csv forks/Vina/inference/vina_posebusters_benchmark_inputs.csv
cp forks/DiffDock/inference/diffdock_astex_diverse_inputs.csv forks/Vina/inference/vina_astex_diverse_inputs.csv
cp forks/DiffDock/inference/diffdock_dockgen_inputs.csv forks/Vina/inference/vina_dockgen_inputs.csv
cp forks/DiffDock/inference/diffdock_casp15_inputs.csv forks/Vina/inference/vina_casp15_inputs.csv
```

Run inference on each dataset

```bash
python3 multicom_ligand/models/vina_inference.py dataset=posebusters_benchmark method=diffdock repeat_index=1  # NOTE: DiffDock-L's binding pockets are recommended as the default Vina input
...
python3 multicom_ligand/models/vina_inference.py dataset=astex_diverse method=diffdock repeat_index=1
...
python3 multicom_ligand/models/vina_inference.py dataset=dockgen method=diffdock repeat_index=1
...
python3 multicom_ligand/models/vina_inference.py dataset=casp15 method=diffdock repeat_index=1
...
```

Copy Vina's predictions to the corresponding inference directory for each repeat

```bash
mkdir -p forks/Vina/inference/vina_diffdock_posebusters_benchmark_outputs_1 && cp -r data/test_cases/posebusters_benchmark/vina_diffdock_posebusters_benchmark_outputs_1/* forks/Vina/inference/vina_diffdock_posebusters_benchmark_outputs_1
...
mkdir -p forks/Vina/inference/vina_diffdock_astex_diverse_outputs_1 && cp -r data/test_cases/astex_diverse/vina_diffdock_astex_diverse_outputs_1/* forks/Vina/inference/vina_diffdock_astex_diverse_outputs_1
...
mkdir -p forks/Vina/inference/vina_diffdock_dockgen_outputs_1 && cp -r data/test_cases/dockgen/vina_diffdock_dockgen_outputs_1/* forks/Vina/inference/vina_diffdock_dockgen_outputs_1
...
mkdir -p forks/Vina/inference/vina_diffdock_casp15_outputs_1 && cp -r data/test_cases/casp15/vina_diffdock_casp15_outputs_1/* forks/Vina/inference/vina_diffdock_casp15_outputs_1
...
```

Relax the generated ligand structures inside of their respective protein pockets

```bash
python3 multicom_ligand/models/inference_relaxation.py method=vina vina_binding_site_method=diffdock dataset=posebusters_benchmark remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=vina vina_binding_site_method=diffdock dataset=astex_diverse remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
python3 multicom_ligand/models/inference_relaxation.py method=vina vina_binding_site_method=diffdock dataset=dockgen remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1 repeat_index=1
...
```

**NOTE**: Increase `num_processes` according to your available CPU/GPU resources to improve throughput

Analyze inference results for each dataset

```bash
python3 multicom_ligand/analysis/inference_analysis.py method=vina vina_binding_site_method=diffdock dataset=posebusters_benchmark repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=vina vina_binding_site_method=diffdock dataset=astex_diverse repeat_index=1
...
python3 multicom_ligand/analysis/inference_analysis.py method=vina vina_binding_site_method=diffdock dataset=dockgen repeat_index=1
...
```

Analyze inference results for the CASP15 dataset

```bash
# assemble (unrelaxed and post ranking-relaxed) CASP15-compliant prediction submission files for scoring
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[vina\] vina_binding_site_methods=\[diffdock\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_vina_diffdock_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=false export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[vina\] vina_binding_site_methods=\[diffdock\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_vina_diffdock_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=true export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
# NOTE: the suffixes for both `output_dir` and `ensemble_benchmarking_repeat_index` should be modified to e.g., 2, 3, ...
...
# now score the CASP15-compliant submissions using the official CASP scoring pipeline
python3 multicom_ligand/analysis/inference_analysis_casp.py method=vina vina_binding_site_method=diffdock dataset=casp15 repeat_index=1
...
```

### How to run inference with `TULIP`

Gather all template ligands generated by `TULIP` via its dedicated [GitHub repository](https://github.com/BioinfoMachineLearning/tulip) and collate the resulting ligand fragment SDF files

```bash
python3 multicom_ligand/data/tulip_output_extraction.py dataset=posebusters_benchmark
python3 multicom_ligand/data/tulip_output_extraction.py dataset=astex_diverse
python3 multicom_ligand/data/tulip_output_extraction.py dataset=casp15
```

Relax the generated ligand structures inside of their respective protein pockets

```bash
python3 multicom_ligand/models/inference_relaxation.py method=tulip dataset=posebusters_benchmark remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1
...
python3 multicom_ligand/models/inference_relaxation.py method=tulip dataset=astex_diverse remove_initial_protein_hydrogens=true assign_partial_charges_manually=true num_processes=1
...
```

**NOTE**: Increase `num_processes` according to your available CPU/GPU resources to improve throughput

Analyze inference results for each dataset

```bash
python3 multicom_ligand/analysis/inference_analysis.py method=tulip dataset=posebusters_benchmark
...
python3 multicom_ligand/analysis/inference_analysis.py method=tulip dataset=astex_diverse
...
```

Analyze inference results for the CASP15 dataset

```bash
# first assemble (unrelaxed and post ranking-relaxed) CASP15-compliant prediction submission files for scoring
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[tulip\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_tulip_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=false export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py ensemble_methods=\[tulip\] input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_tulip_ensemble_predictions_1 skip_existing=true relax_method_ligands_post_ranking=true export_file_format=casp15 export_top_n=5 combine_casp_output_files=true max_method_predictions=40 method_top_n_to_select=40 resume=true ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 cuda_device_index=0 ensemble_benchmarking_repeat_index=1
# NOTE: the suffixes for both `output_dir` and `ensemble_benchmarking_repeat_index` should be modified to e.g., 2, 3, ...
...
# now score the CASP15-compliant submissions using the official CASP scoring pipeline
python3 multicom_ligand/analysis/inference_analysis_casp.py method=tulip dataset=casp15 targets='[H1135, H1171v1, H1171v2, H1172v1, H1172v2, H1172v3, H1172v4, T1124, T1127v2, T1152, T1158v1, T1158v2, T1158v3, T1158v4, T1186, T1187]'
...
```
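The `NOTE` in the CASP15 blocks above asks for the `output_dir` suffix and `ensemble_benchmarking_repeat_index` to change together for every repeat. The following is a minimal Python sketch of generating those command pairs. The helper name `casp15_assembly_commands` and the choice of three repeats are assumptions made for illustration; the overrides themselves are copied from the `tulip` commands above. The sketch only prints the commands and does not run them.

```python
import shlex

SCRIPT = "multicom_ligand/models/ensemble_generation.py"
REPEAT_INDICES = (1, 2, 3)  # assumption: three repeats; the README only says "e.g., 2, 3, ..."


def casp15_assembly_commands(method, repeat_index):
    """Return the unrelaxed and relaxed assembly commands (argv lists) for one repeat.

    Hypothetical helper: it keeps the `output_dir` suffix and
    `ensemble_benchmarking_repeat_index` in sync for a single-method ensemble.
    """
    commands = []
    for relax in ("false", "true"):
        commands.append([
            "python3", SCRIPT,
            f"ensemble_methods=[{method}]",
            "input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv",
            f"output_dir=data/test_cases/casp15/top_{method}_ensemble_predictions_{repeat_index}",
            "skip_existing=true",
            f"relax_method_ligands_post_ranking={relax}",
            "export_file_format=casp15",
            "export_top_n=5",
            "combine_casp_output_files=true",
            "max_method_predictions=40",
            "method_top_n_to_select=40",
            "resume=true",
            "ensemble_benchmarking=true",
            "ensemble_benchmarking_dataset=casp15",
            "cuda_device_index=0",
            f"ensemble_benchmarking_repeat_index={repeat_index}",
        ])
    return commands


for repeat_index in REPEAT_INDICES:
    for argv in casp15_assembly_commands("tulip", repeat_index):
        # shlex.join quotes `ensemble_methods=[tulip]` so the shell does not treat the brackets as a glob
        print(shlex.join(argv))
```

To execute a command instead of printing it, each `argv` list can be passed to `subprocess.run(argv, check=True)`. AutoDock Vina needs the additional `vina_binding_site_methods` override and a different `output_dir` prefix, so this helper does not cover it.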

How to run inference with a method ensemble

Using an `ensemble` of methods, generate predictions for a new protein target using each method and (e.g., consensus-)rank the pool of predictions (n.b., see the function `execute_steps` within `scripts/execute_casp16_ensemble_generation_strategy.py` for more details regarding `MULTICOM_ligand`'s usage during CASP16)

```bash
# generate each method's prediction script for a target
# NOTE: when the input ESMFold protein structures are not already available locally in `data/ensemble_proteins/`, they need to be predicted, so make sure a GPU is available for inference (e.g., on a SLURM cluster, first run `srun --partition=gpu --gres=gpu:A100:1 --mem=59G --time=01:00:00 --pty bash`)
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/5S8I_2LY/ensemble_inputs.csv output_dir=data/test_cases/5S8I_2LY/top_consensus_ensemble_predictions_1 max_method_predictions=40 ensemble_ranking_method=consensus resume=false ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa]'
# ...
# now, manually run each desired method's generated prediction script, with the exception of AutoDock Vina which uses other methods' predictions
# ...
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/5S8I_2LY/ensemble_inputs.csv output_dir=data/test_cases/5S8I_2LY/top_consensus_ensemble_predictions_1 max_method_predictions=40 ensemble_ranking_method=consensus resume=true generate_vina_scripts=true vina_binding_site_methods='[diffdock]'
# now, manually run AutoDock Vina's generated prediction script for each binding site prediction method
# ...
# lastly, organize each method's predictions together
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/5S8I_2LY/ensemble_inputs.csv output_dir=data/test_cases/5S8I_2LY/top_consensus_ensemble_predictions_1 max_method_predictions=40 ensemble_ranking_method=consensus resume=true generate_vina_scripts=false vina_binding_site_methods='[diffdock]'
```

Benchmark (ensemble-)ranked predictions across each test dataset

```bash
# benchmark using the PoseBusters Benchmark dataset e.g., after generating 40 complexes per target with each method
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/posebusters_benchmark/ensemble_inputs.csv output_dir=data/test_cases/posebusters_benchmark/top_consensus_ensemble_predictions_1 max_method_predictions=40 export_top_n=1 export_file_format=null skip_existing=true relax_method_ligands_post_ranking=false resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, tulip, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=posebusters_benchmark ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/posebusters_benchmark/ensemble_inputs.csv output_dir=data/test_cases/posebusters_benchmark/top_consensus_ensemble_predictions_1 max_method_predictions=40 export_top_n=1 export_file_format=null skip_existing=true relax_method_ligands_post_ranking=true resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, tulip, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=posebusters_benchmark ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
...
# benchmark using the Astex Diverse dataset e.g., after generating 40 complexes per target with each method
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/astex_diverse/ensemble_inputs.csv output_dir=data/test_cases/astex_diverse/top_consensus_ensemble_predictions_1 max_method_predictions=40 export_top_n=1 export_file_format=null skip_existing=true relax_method_ligands_post_ranking=false resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, tulip, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=astex_diverse ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/astex_diverse/ensemble_inputs.csv output_dir=data/test_cases/astex_diverse/top_consensus_ensemble_predictions_1 max_method_predictions=40 export_top_n=1 export_file_format=null skip_existing=true relax_method_ligands_post_ranking=true resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, tulip, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=astex_diverse ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
...
# benchmark using the DockGen dataset e.g., after generating 40 complexes per target with each method
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/dockgen/ensemble_inputs.csv output_dir=data/test_cases/dockgen/top_consensus_ensemble_predictions_1 max_method_predictions=40 export_top_n=1 export_file_format=null skip_existing=true relax_method_ligands_post_ranking=false resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=dockgen ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/dockgen/ensemble_inputs.csv output_dir=data/test_cases/dockgen/top_consensus_ensemble_predictions_1 max_method_predictions=40 export_top_n=1 export_file_format=null skip_existing=true relax_method_ligands_post_ranking=true resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=dockgen ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
...
# benchmark using the CASP15 dataset e.g., after generating 40 complexes per target with each method
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_consensus_ensemble_predictions_1 combine_casp_output_files=true max_method_predictions=40 export_top_n=5 export_file_format=casp15 skip_existing=true relax_method_ligands_post_ranking=false resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, tulip, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
python3 multicom_ligand/models/ensemble_generation.py input_csv_filepath=data/test_cases/casp15/ensemble_inputs.csv output_dir=data/test_cases/casp15/top_consensus_ensemble_predictions_1 combine_casp_output_files=true max_method_predictions=40 export_top_n=5 export_file_format=casp15 skip_existing=true relax_method_ligands_post_ranking=true resume=true cuda_device_index=0 ensemble_methods='[diffdock, dynamicbind, neuralplexer, rfaa, tulip, vina]' ensemble_benchmarking=true ensemble_benchmarking_dataset=casp15 ensemble_ranking_method=consensus ensemble_benchmarking_repeat_index=1
...
# analyze benchmarking results for the PoseBusters Benchmark dataset
python3 multicom_ligand/analysis/inference_analysis.py method=ensemble dataset=posebusters_benchmark repeat_index=1
...
# analyze benchmarking results for the Astex Diverse dataset
python3 multicom_ligand/analysis/inference_analysis.py method=ensemble dataset=astex_diverse repeat_index=1
...
# analyze benchmarking results for the DockGen dataset
python3 multicom_ligand/analysis/inference_analysis.py method=ensemble dataset=dockgen repeat_index=1
...
# analyze benchmarking results for the CASP15 dataset
python3 multicom_ligand/analysis/inference_analysis_casp.py method=ensemble dataset=casp15 ensemble_ranking_method=consensus repeat_index=1
...
```

To benchmark ensemble ranking using the above commands, you must have already run the corresponding `*_inference.py` script for each method described in the section [How to run inference with individual methods](#how-to-run-inference-with-individual-methods) (with the exception of `FABind`, which will not be referenced during CASP15 benchmarking).

**NOTE**: Besides `consensus`, `ensemble_ranking_method` also accepts `ff`, which selects the method ensemble's top-ranked predictions using the criterion of "minimum (molecular dynamics) force field energy" (albeit at the cost of a much longer runtime).
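To make the `consensus` idea concrete, here is a small illustrative sketch that ranks a pool of ligand poses by their mean RMSD to all other poses, so that the pose most similar to the rest comes first. This is an assumption about what a consensus criterion can look like, not the implementation used by `ensemble_generation.py`. The function names, pose labels, and coordinates are hypothetical, and the plain RMSD used here ignores molecular symmetry and assumes all poses share one coordinate frame and atom ordering.

```python
import math


def rmsd(pose_a, pose_b):
    """Plain coordinate RMSD between two poses with identical atom ordering."""
    if len(pose_a) != len(pose_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for xa, xb in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(pose_a))


def consensus_rank(poses):
    """Sort pose names by mean RMSD to every other pose (most central pose first)."""
    names = list(poses)
    scores = {}
    for name in names:
        distances = [rmsd(poses[name], poses[other]) for other in names if other != name]
        # a single-pose pool has nothing to compare against, so give it a neutral score
        scores[name] = sum(distances) / len(distances) if distances else 0.0
    return sorted(names, key=lambda name: scores[name])


# hypothetical two-atom poses: two methods roughly agree, one is an outlier
poses = {
    "diffdock_rank1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "neuralplexer_rank1": [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    "rfaa_rank1": [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
}
ranking = consensus_rank(poses)
print(ranking)
```

In this toy pool the two poses that agree with each other rank ahead of the outlier, which is the behavior a consensus criterion is meant to reward.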

How to create comparative plots of inference results

Execute (and customize as desired) notebooks to prepare paper-ready result plots

```bash
jupyter notebook notebooks/posebusters_astex_inference_results_plotting.ipynb
jupyter notebook notebooks/posebusters_pocket_only_inference_results_plotting.ipynb
jupyter notebook notebooks/dockgen_inference_results_plotting.ipynb
jupyter notebook notebooks/casp15_inference_results_plotting.ipynb
```
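For unattended runs (e.g., on a cluster without a browser), the same notebooks can also be executed non-interactively with `jupyter nbconvert`. The sketch below only builds and prints the commands. The helper name `headless_command` is hypothetical, and writing the results back with `--inplace` is a choice made here for illustration, not something the repository prescribes.

```python
import shlex

NOTEBOOKS = [
    "notebooks/posebusters_astex_inference_results_plotting.ipynb",
    "notebooks/posebusters_pocket_only_inference_results_plotting.ipynb",
    "notebooks/dockgen_inference_results_plotting.ipynb",
    "notebooks/casp15_inference_results_plotting.ipynb",
]


def headless_command(notebook_path):
    """Build a `jupyter nbconvert` call that executes a notebook and overwrites it with its outputs."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", notebook_path]


commands = [headless_command(path) for path in NOTEBOOKS]
for argv in commands:
    # printing keeps this a dry run; use subprocess.run(argv, check=True) to execute
    print(shlex.join(argv))
```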

For developers

### Dependency management

We use `mamba` to manage the project's underlying dependencies. Notably, to update the dependencies listed in a particular `environments/*_environment.yaml` file:

```bash
mamba env export > env.yaml # e.g., run this after installing new dependencies locally within a given `conda` environment
diff environments/multicom_ligand_environment.yaml env.yaml # note the differences and copy accepted changes back into e.g., `environments/multicom_ligand_environment.yaml`
rm env.yaml # clean up temporary environment file
```

### Code formatting

We use `pre-commit` to automatically format the project's code. To set up `pre-commit` (one time only) for automatic code linting and formatting upon each execution of `git commit`:

```bash
pre-commit install
```

To manually reformat all files in the project as desired:

```bash
pre-commit run -a
```

### Documentation

We use `sphinx` to maintain the project's code documentation. To build a local version of the project's `sphinx` documentation web pages:

```bash
# assuming you are located in the `MULTICOM_ligand` top-level directory
pip install -r docs/.docs.requirements # one-time only
rm -rf docs/build/ && sphinx-build docs/source/ docs/build/ # NOTE: errors can safely be ignored
```

Acknowledgements

MULTICOM_ligand builds upon the source code and data from the following projects:

We thank all their contributors and maintainers!

Citing this work

If you use the code or benchmark method predictions associated with this repository or otherwise find this work useful, please cite:

```bibtex
@inproceedings{morehead2024multicom,
  title={Protein-ligand structure and affinity prediction in CASP16 using a geometric deep learning ensemble and flow matching},
  author={Morehead, Alex and Liu, Jian and Neupane, Pawan and Giri, Nabin and Cheng, Jianlin},
  booktitle={CASP16 Abstracts},
  year={2025},
  note={presented at CASP16 as a top-5 ligand prediction method},
}
```

Bonus

Lastly, thanks to Stable Diffusion for generating this quaint representation of what my brain looked like after assembling this codebase.

Owner

  • Name: BioinfoMachineLearning
  • Login: BioinfoMachineLearning
  • Kind: organization

GitHub Events

Total
  • Release event: 1
  • Watch event: 7
  • Delete event: 1
  • Public event: 1
  • Push event: 22
  • Create event: 2
Last Year
  • Release event: 1
  • Watch event: 7
  • Delete event: 1
  • Public event: 1
  • Push event: 22
  • Create event: 2

Dependencies

.github/workflows/changelog-enforcer.yaml actions
  • actions/checkout v3 composite
  • dangoslen/changelog-enforcer v3 composite
.github/workflows/code-quality-main.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • pre-commit/action v3.0.1 composite
.github/workflows/code-quality-pr.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • pre-commit/action v3.0.1 composite
  • trilom/file-changes-action v1.2.4 composite
.github/workflows/documentation.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • conda-incubator/setup-miniconda v3 composite
  • peaceiris/actions-gh-pages v4 composite
.github/workflows/publish.yaml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
  • pypa/gh-action-pypi-publish v1.9.0 composite
.github/workflows/release-drafter.yml actions
  • release-drafter/release-drafter v6 composite
forks/DiffDock/Dockerfile docker
  • nvidia/cuda 11.7.1-devel-ubuntu22.04 build
  • nvidia/cuda 11.7.1-runtime-ubuntu22.04 build
forks/DiffDock/app/Dockerfile docker
  • rbgcsail/diffdock latest build
forks/NeuralPLexer/Dockerfile docker
  • nvidia/cuda 11.7.1-devel-ubuntu22.04 build
forks/RoseTTAFold-All-Atom/rf2aa/SE3Transformer/Dockerfile docker
  • ${FROM_IMAGE_NAME} latest build
forks/DiffDock/app/requirements.txt pypi
  • gradio ==3.50.
  • requests *
forks/DiffDock/requirements.txt pypi
  • e3nn ==0.5.0
  • fair-esm ==2.0.0
  • networkx ==2.8.4
  • pandas ==1.5.1
  • prody ==2.2.0
  • pybind11 ==2.11.1
  • rdkit ==2022.03.3
  • scikit-learn ==1.1.0
  • scipy ==1.12.0
  • torch ==1.13.1
  • torch-cluster ==1.6.0
  • torch-geometric ==2.2.0
  • torch-scatter ==2.1.0
  • torch-sparse ==0.6.16
  • torch-spline-conv ==1.2.1
  • torchmetrics ==0.11.0
forks/DynamicBind/esm/pyproject.toml pypi
forks/DynamicBind/esm/setup.py pypi
forks/FlowDock/pyproject.toml pypi
forks/FlowDock/setup.py pypi
  • lightning *
forks/NeuralPLexer/NeuralPLexer-requirements.txt pypi
  • fairscale *
  • mendeleev *
  • pytorch-lightning <2.0.0
  • seaborn *
forks/NeuralPLexer/requirements.txt pypi
  • fairscale *
  • mendeleev *
  • pytorch-lightning <2.0.0
  • seaborn *
  • torch-scatter *
forks/NeuralPLexer/setup.py pypi
forks/RoseTTAFold-All-Atom/rf2aa/SE3Transformer/requirements.txt pypi
  • dllogger *
  • e3nn ==0.3.3
  • pynvml ==11.0.0
  • wandb ==0.12.0
forks/RoseTTAFold-All-Atom/rf2aa/SE3Transformer/setup.py pypi
pyproject.toml pypi
  • beartype *
  • biopandas *
  • biopython ==1.79
  • hydra-colorlog ==1.2.0
  • hydra-core ==1.3.2
  • hydra-optuna-sweeper ==1.2.0
  • ipykernel *
  • jaxtyping >=0.2.12
  • joblib *
  • lightning *
  • loguru *
  • lovely-numpy *
  • lovely-tensors *
  • meeko *
  • numpy *
  • omegaconf *
  • pandas >=1.3.5
  • plotly *
  • posebusters @ git+https://github.com/amorehead/posebusters.git@posebench
  • pre-commit *
  • prody *
  • prolif *
  • pydantic >=1.10.15
  • pypdb *
  • pyyaml *
  • rdkit >=2023.3.2
  • rich *
  • rootutils *
  • scikit-learn >=1.0.2
  • seaborn *
  • setuptools *
  • spyrmsd *
  • timeout_decorator >=0.5.0
  • torch *
  • torchmetrics *
  • torchvision *
  • tqdm *
forks/DiffDock/environment.yml pypi
  • torch ==1.13.1
forks/DynamicBind/environment.yml pypi
  • biopython ==1.78
  • brotlipy ==0.7.0
  • cffi ==1.15.1
  • comm ==0.1.2
  • contourpy ==1.0.7
  • cryptography ==38.0.4
  • debugpy ==1.5.1
  • e3nn ==0.5.1
  • fair-esm ==2.0.0
  • filelock ==3.9.0
  • fonttools ==4.39.3
  • gmpy2 ==2.1.2
  • greenlet ==2.0.2
  • idna ==3.4
  • ipykernel ==6.19.2
  • ipython ==8.12.0
  • jedi ==0.18.1
  • jinja2 ==3.1.2
  • joblib ==1.1.1
  • jupyter-client ==8.1.0
  • jupyter-core ==5.3.0
  • kiwisolver ==1.4.4
  • markupsafe ==2.1.1
  • matplotlib ==3.7.1
  • matplotlib-inline ==0.1.6
  • mkl-fft ==1.3.6
  • mkl-random ==1.2.2
  • mkl-service ==2.4.0
  • mpmath ==1.2.1
  • nest-asyncio ==1.5.6
  • networkx ==2.8.4
  • numpy ==1.24.3
  • opt-einsum ==3.3.0
  • opt-einsum-fx ==0.1.4
  • pandas ==2.0.1
  • pillow ==9.4.0
  • pip ==23.0.1
  • platformdirs ==2.5.2
  • prompt-toolkit ==3.0.36
  • psutil ==5.9.0
  • pygments ==2.15.1
  • pyopenssl ==23.0.0
  • pysocks ==1.7.1
  • pyyaml ==6.0
  • pyzmq ==25.0.2
  • reportlab ==3.6.13
  • requests ==2.29.0
  • scikit-learn ==1.2.2
  • scipy ==1.10.1
  • setuptools ==66.0.0
  • spyrmsd ==0.5.2
  • sqlalchemy ==1.4.46
  • sympy ==1.11.1
  • torch ==2.0.1
  • torch-geometric ==2.3.0
  • torchaudio ==2.0.2
  • torchvision ==0.15.2
  • tornado ==6.2
  • tqdm ==4.65.0
  • traitlets ==5.7.1
  • triton ==2.0.0
  • typing-extensions ==4.5.0
  • unicodedata2 ==15.0.0
  • urllib3 ==1.26.15
  • wheel ==0.38.4
forks/DynamicBind/esm/environment.yml pypi
  • PyYAML ==5.4.1
  • biopython ==1.79
  • deepspeed ==0.5.9
  • dm-tree ==0.1.6
  • ml-collections ==0.1.0
  • numpy ==1.21.2
  • pytorch_lightning ==1.5.10
  • requests ==2.26.0
  • scipy ==1.7.1
  • tqdm ==4.62.2
  • typing-extensions ==3.10.0.2
  • wandb ==0.12.21