mdeepfri

Pipeline for searching and aligning contact maps for proteins, then running DeepFri's GCN.

https://github.com/bioinf-mcb/metagenomic-deepfri

Science Score: 75.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: nature.com
  • Academic email domains
  • Institutional organization owner
    Organization bioinf-mcb has institutional domain (mcb.uj.edu.pl)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.5%) to scientific vocabulary

Keywords

genomics mmseqs protein-function-prediction protein-sequences protein-structure
Last synced: 4 months ago

Repository

Basic Info
Statistics
  • Stars: 40
  • Watchers: 5
  • Forks: 6
  • Open Issues: 3
  • Releases: 10
Created almost 4 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog Contributing License Citation Codeowners

README.md

🍳 Metagenomic-DeepFRI

A pipeline for annotation of genes with DeepFRI, a deep learning model for functional protein annotation with Gene Ontology (GO) terms. It incorporates FoldComp databases of predicted protein structures for fast annotation of metagenomic gene catalogues.

🔍 Overview

Proteins perform most of the work of living cells. Amino acid sequence and structural features of proteins determine a wide range of functions: from binding specificity and conferring mechanical stability, to catalysis of biochemical reactions, transport, and signal transduction. DeepFRI is a neural network designed to predict protein function within the framework of the Gene Ontology (GO). The exponential growth in the number of available protein sequences, driven by advancements in low-cost sequencing technologies and computational methods (e.g. gene prediction), has resulted in a pressing need for efficient software to facilitate the annotation of protein databases. Metagenomic-DeepFRI addresses such needs, building upon efficient libraries. It incorporates novel databases of predicted structures (AlphaFold, ESMFold, MIP, etc.) and improves runtimes of DeepFRI by 2-12 times!

📋 Pipeline stages

  1. Search for proteins similar to the query in the PDB and in the supplied FoldComp databases with MMseqs2.
  2. Find the best alignment among the MMseqs2 hits using PyOpal.
  3. Align the contact map of the target protein to the query protein, whose structure is unknown.
  4. Run DeepFRI with the structure if one was found in the databases; otherwise run DeepFRI with the sequence only.
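Stage 3 can be pictured with a small sketch. This is not the project's implementation: it assumes the pairwise alignment is given as two equal-length gapped strings, the target contact map is a square 0/1 matrix, and the function names `aligned_pairs` and `transfer_contact_map` are hypothetical. Query residues that have no aligned target residue keep only their self-contact.

```python
def aligned_pairs(query_aln, target_aln, gap="-"):
    """Map query residue index -> target residue index for aligned columns."""
    if len(query_aln) != len(target_aln):
        raise ValueError("aligned strings must have equal length")
    mapping = {}
    q_idx = t_idx = 0
    for q_char, t_char in zip(query_aln, target_aln):
        if q_char != gap and t_char != gap:
            mapping[q_idx] = t_idx
        # Advance each residue counter only when that sequence has a residue.
        if q_char != gap:
            q_idx += 1
        if t_char != gap:
            t_idx += 1
    return mapping


def transfer_contact_map(query_aln, target_aln, target_cmap, gap="-"):
    """Build a query-sized contact map by copying contacts of aligned residues."""
    mapping = aligned_pairs(query_aln, target_aln, gap)
    query_len = sum(1 for char in query_aln if char != gap)
    cmap = [[0] * query_len for _ in range(query_len)]
    for i in range(query_len):
        cmap[i][i] = 1  # every residue is in contact with itself
    for i, ti in mapping.items():
        for j, tj in mapping.items():
            cmap[i][j] = max(cmap[i][j], target_cmap[ti][tj])
    return cmap


# Illustrative input (made up): the query lacks the target's third residue
# and has one extra residue at its end.
target_cmap = [
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
query_cmap = transfer_contact_map("AC-DE", "ACGD-", target_cmap)
```

In this toy case the contact between the second and fourth target residues is carried over to the second and third query residues, while the unaligned last query residue stays isolated.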

🛠️ Built With

Python requirements

Python >= 3.9, < 3.12. The app was tested with Python 3.11.

🔧 Installation

  1. Install from PyPI. Installation might take a few minutes because the MMseqs2 binaries are downloaded.
       pip install mdeepfri
  2. Run the tool and view the help message.
       mDeepFRI --help

💡 Usage

1. Prepare structural database

1.1 Existing FoldComp databases

The PDB database will be automatically downloaded and installed during the first run of mDeepFRI. The PDB suffers from formatting inconsistencies, so around 10% of PDB alignments will fail and will be reported with a WARNING. We suggest coupling the PDB search with databases of predicted structures, as this massively improves the structural coverage of the protein universe. A good protein structure allows DeepFRI to annotate the function in more detail. However, the sequence branch of the model has the largest weight, so even an erroneous predicted structure will have only a minor effect on the prediction. The details can be found in the original manuscript, fig. 2A.

You can download additional databases from the website. During the first run, FASTA sequences will be extracted from the FoldComp database, and an MMseqs2 database will be created and indexed. You can use different databases, but be mindful that computation time might increase exponentially with the size of the database.

Tested databases:
  • afdb_swissprot
  • afdb_swissprot_v4
  • afdb_rep_v4
  • afdb_rep_dark_v4
  • afdb_uniprot_v4
  • esmatlas
  • esmatlas_v2023_02
  • highquality_clust30

ATTENTION: Please do not rename downloaded databases. FoldComp is inconsistent in the way FASTA sequences are extracted (example), so the pipeline was tweaked for each database. If the database you need does not work, please report it in the issues and we will add it as soon as possible. Sorry for the inconvenience.

ATTENTION: Database creation is a very sensitive step that relies on external software. If the pipeline is interrupted during this step, the databases might be corrupted. If you are not sure about your database, rerun the pipeline with the --overwrite flag; it will rerun the database creation process.

1.2. Custom FoldComp database

To use a personal database of structures, you have to create a custom FoldComp database. To do so, download a FoldComp executable and run the following command:

    foldcomp compress [-t number] <dir|tar(.gz)> [<dir|tar|db>]

2. Download models

Two versions of the models are available:
  • v1.0 - the original version from the DeepFRI publication.
  • v1.1 - a version fine-tuned on AlphaFold models and machine-generated Gene Ontology UniProt annotations. You can read details about v1.1 in the ISMB 2023 presentation by Pawel Szczerbiak.

To download the models, run the command:

    mDeepFRI get-models --output path/to/weights/folder -v {1.0 or 1.1}

3. Predict protein function & capture log

mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path 2> log.txt

The logging module writes its output to stderr, so use 2> to redirect it to a file. Other available parameters are listed by mDeepFRI --help.
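When the pipeline is launched from a Python script rather than a shell, the same stderr redirection can be done with the standard `subprocess` module. The sketch below is only an illustration: `run_with_log` is a hypothetical helper, not part of mDeepFRI, and it works for any command given as a list of arguments.

```python
import subprocess


def run_with_log(command, log_path):
    """Run a command and write its stderr to log_path (like `2> log.txt`)."""
    with open(log_path, "w", encoding="utf-8") as log_file:
        completed = subprocess.run(command, stderr=log_file, check=False)
    return completed.returncode


# Possible usage, with the placeholder paths from the command above:
# run_with_log(
#     ["mDeepFRI", "predict-function",
#      "-i", "/path/to/protein/sequences",
#      "-d", "/path/to/foldcomp/database/",
#      "-w", "/path/to/deepfri/weights/folder",
#      "-o", "/output_path"],
#     "log.txt",
# )
```

Passing the command as a list avoids shell quoting issues, and `check=False` leaves the handling of a non-zero exit code to the caller.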

✅ Results

The output folder will contain:
  1. {database_name}.search_results.tsv
  2. query.mmseqsDB + index from the MMseqs2 search.
  3. results.tsv - the final output from the DeepFRI model.

Example output (results.tsv)

| Protein     | GOterm/ECnumer | Score | Annotation                                   | Neuralnet | DeepFRImode | DBhit                 | DBname            | Identity |
|-------------|----------------|-------|----------------------------------------------|-----------|-------------|-----------------------|-------------------|----------|
| MIP00215364 | GO:0016798     | 0.218 | hydrolase activity, acting on glycosyl bonds | gcn       | mf          | MIP00215364           | miprosettahq      | 0.933    |
| 1GVH1       | GO:0009055     | 0.217 | electron transfer activity                   | gcn       | mf          | AF-P24232-F1-model_v4 | afdb_swissprot_v4 | 1.0      |
| unaligned   | 3.2.1.-        | 0.215 | 3.2.1.-                                      | cnn       | ec          | nan                   | nan               | nan      |

This is an example of protein annotation with the AlphaFold database.
  • Protein - the name of the protein from the FASTA file.
  • GOterm/ECnumer - the predicted GO term or EC number (depending on the mode).
  • Score - the DeepFRI score, which translates to model confidence in the prediction. Details in the publication.
  • Annotation - the annotation from the ontology.
  • Neuralnet - the type of neural network used for prediction (gcn = Graph Convolutional Network; cnn = Convolutional Neural Network). The GCN is used when structural information is available in the database, which generally allows for more confident predictions. When no protein is above the similarity cut-off (50% identity by default), the CNN is used.
  • DeepFRImode - the prediction mode:
      mf = molecular_function
      bp = biological_process
      cc = cellular_component
      ec = enzyme_commission
  • DBhit - the name of the hit in the database. Empty if no hit was found.
  • DBname - the name of the database. Empty if no hit was found.
  • Identity - the sequence identity between query and hit. Empty if no hit was found.
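A results file of this shape can be filtered with the standard `csv` module. The sketch below is a minimal example under stated assumptions: the file is tab-separated with a header row, the column names are the ones printed in the table above (the headers in an actual results.tsv may be spelled differently, for example with underscores), and `load_predictions` and `count_by` are hypothetical helper names.

```python
import csv
from collections import Counter


def load_predictions(handle, min_score=0.0, score_column="Score"):
    """Read tab-separated predictions and keep rows with score >= min_score."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row[score_column] = float(row[score_column])
        if row[score_column] >= min_score:
            rows.append(row)
    return rows


def count_by(rows, column):
    """Count rows per distinct value of one column, e.g. per prediction mode."""
    return Counter(row[column] for row in rows)


# Possible usage (the 0.2 threshold is an arbitrary illustration):
# with open("results.tsv", newline="", encoding="utf-8") as handle:
#     confident = load_predictions(handle, min_score=0.2)
# print(count_by(confident, "Neuralnet"))
```

The score threshold is a choice for the user; the publication cited for the Score column is the place to look for guidance on suitable values.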

⚙️Features

1. Prediction modes

The GO ontology contains three subontologies, defined by their root nodes:
  • Molecular Function (MF)
  • Biological Process (BP)
  • Cellular Component (CC)

Additionally, Metagenomic-DeepFRI v1.0 is able to predict Enzyme Commission (EC) numbers. By default, the tool makes predictions in all 4 categories. To select only some of them, pass the parameter -p or --processing-modes several times, e.g.:

    mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/foldcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path -p mf -p bp
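When such commands are assembled in a script, repeating the flag once per mode is easy to get wrong by hand. The helper below is a hypothetical sketch, not part of mDeepFRI: it uses only the flags shown in this README (-i, -d, -w, -o, -p), and it assumes the accepted mode codes are the four listed for the DeepFRImode column (mf, bp, cc, ec). It accepts several databases because -d can be repeated in the same way.

```python
VALID_MODES = ("mf", "bp", "cc", "ec")


def build_predict_command(sequences, databases, weights, output, modes=()):
    """Return the argument list for `mDeepFRI predict-function`."""
    unknown = [mode for mode in modes if mode not in VALID_MODES]
    if unknown:
        raise ValueError(f"unknown processing modes: {unknown}")
    command = ["mDeepFRI", "predict-function", "-i", str(sequences)]
    for database in databases:
        command += ["-d", str(database)]  # one -d per database
    command += ["-w", str(weights), "-o", str(output)]
    for mode in modes:
        command += ["-p", mode]  # one -p per selected mode
    return command


command = build_predict_command(
    "/path/to/protein/sequences",
    ["/path/to/foldcomp/database/"],
    "/path/to/deepfri/weights/folder",
    "/output_path",
    modes=["mf", "bp"],
)
```

Leaving `modes` empty emits no -p flag at all, which corresponds to the default of predicting in all categories.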

2. Hierarchical database search

Different databases carry different levels of evidence. For example, PDB structures are real experimental structures, so they are considered to be the data of highest quality. Therefore new proteins are first queried against the PDB. Computational predictions differ in quality, e.g. AlphaFold predictions are often more accurate than ESMFold predictions. We provide an opportunity to search multiple databases in a hierarchical manner. For example, if you want to search the AlphaFold database first, and then ESMFold, you can pass the parameter -d or --databases several times, e.g.:

    mDeepFRI predict-function -i /path/to/protein/sequences -d /path/to/alphafold/database/ -d /path/to/another/esmcomp/database/ -w /path/to/deepfri/weights/folder -o /output_path
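The idea of a hierarchical search can be sketched in a few lines. This is a simplified illustration, not the pipeline's actual logic: it assumes hits are available as (hit name, identity) pairs per database, that databases are tried in the order given, and that the 50% identity cut-off mentioned in the results section applies. The function name `pick_structure` and the hit names are hypothetical.

```python
def pick_structure(hits_by_database, database_order, min_identity=0.5):
    """Return (database, hit, identity) from the first database with a usable hit.

    Databases are tried in priority order; within a database the hit with the
    highest identity wins. Returns None when no hit reaches the cut-off, which
    corresponds to the sequence-only fallback.
    """
    for database in database_order:
        candidates = [
            (hit, identity)
            for hit, identity in hits_by_database.get(database, [])
            if identity >= min_identity
        ]
        if candidates:
            hit, identity = max(candidates, key=lambda pair: pair[1])
            return database, hit, identity
    return None


# Made-up hits: the PDB hit is below the cut-off, so the next database is used.
hits = {
    "pdb": [("hypothetical_pdb_hit", 0.42)],
    "afdb": [("hypothetical_af_hit_1", 0.93), ("hypothetical_af_hit_2", 0.61)],
}
choice = pick_structure(hits, ["pdb", "afdb"])
```

Because a higher-priority database wins as soon as it has any hit above the cut-off, a lower-priority database is consulted only for queries the earlier ones could not cover.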

3. Temporary files

The first run of mDeepFRI with a database will create temporary files needed by the pipeline. If you don't want to keep them for the next run, add the flag --remove-intermediate.

4. CPU / GPU utilization

If the threads argument is provided, the app will parallelize certain steps (alignment, contact map alignment, functional annotation). GPUs are often used to speed up neural networks. If CUDA is installed on your machine, mDeepFRI will automatically use it for prediction; otherwise the model will run on CPUs. Technical tip: a single instance of DeepFRI on a GPU requires 2 GB of VRAM. Every currently available GPU with CUDA support should be able to run the model.

🔖 Citations

Metagenomic-DeepFRI is scientific software. If you use it in academic work, please cite the papers behind it:
  • Gligorijević et al. "Structure-based protein function prediction using graph convolutional networks" Nat. Commun. (2021). https://doi.org/10.1038/s41467-021-23303-9
  • Steinegger & Söding "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets" Nat. Biotechnol. (2017). https://doi.org/10.1038/nbt.3988
  • Kim, Mirdita & Steinegger "Foldcomp: a library and format for compressing and indexing large protein structure sets" Bioinformatics (2023). https://doi.org/10.1093/bioinformatics/btad153
  • Maranga et al. "Comprehensive Functional Annotation of Metagenomes and Microbial Genomes Using a Deep Learning-Based Method" mSystems (2023). https://doi.org/10.1128/msystems.01178-22

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the 3-Clause BSD License.

Owner

  • Name: Bioinformatics at Małopolska Centre of Biotechnology
  • Login: bioinf-mcb
  • Kind: organization

Citation (CITATION.cff)

# YAML 1.2
# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)

cff-version: 1.2.0
message: If you use this software, please cite it as below.
title: 'Structure-based protein function prediction using
  graph convolutional networks'
doi: 10.1038/s41467-021-23303-9
authors:
- given-names: Vladimir
  family-names: Gligorijević
  affiliation: Center for Computational Biology, Flatiron
    Institute, New York, NY, USA
  orcid: https://orcid.org/0000-0002-5165-0973
- given-names: P. Douglas
  family-names: Renfrew
  affiliation: Center for Computational Biology, Flatiron Institute,
    New York, NY, USA
- given-names: Tomasz
  family-names: Kosciolek
  affiliation: Malopolska Centre of Biotechnology, Jagiellonian University,
    Krakow, Poland
  orcid: https://orcid.org/0000-0002-5693-3593
- given-names: Julia Koehler
  family-names: Leman
  affiliation: Center for Computational Biology, Flatiron Institute,
    New York, NY, USA
  orcid: https://orcid.org/0000-0002-5693-3593
- given-names: Daniel
  family-names: Berenberg
  affiliation: Center for Computational Biology, Flatiron Institute,
    New York, NY, USA
- given-names: Tommi
  family-names: Vatanen
  affiliation: Broad Institute of MIT and Harvard, Cambridge, MA, USA
  orcid: https://orcid.org/0000-0003-0949-1291
- given-names: Chris
  family-names: Chandler
  affiliation: Center for Computational Biology, Flatiron Institute,
    New York, NY, USA
- given-names: Bryn C.
  family-names: Taylor
  affiliation: Biomedical Sciences Graduate Program,
    University of California San Diego, La Jolla, CA, USA
- given-names: Ian M.
  family-names: Fisk
  affiliation: Scientific Computing Core, Flatiron Institute,
    Simons Foundation, New York, NY, USA
- given-names: Hera
  family-names: Vlamakis
  affiliation: Broad Institute of MIT and Harvard, Cambridge, MA, USA
  orcid: https://orcid.org/0000-0003-1086-9191
- given-names: Ramnik J.
  family-names: Xavier
  affiliation: Broad Institute of MIT and Harvard, Cambridge, MA, USA
  orcid: https://orcid.org/0000-0002-5630-5167
- given-names: Rob
  family-names: Knight
  affiliation: Department of Pediatrics, University of California San Diego,
    La Jolla, CA, USA
  orcid: https://orcid.org/0000-0002-0975-9019
- given-names: Kyunghyun
  family-names: Cho
  affiliation: Center for Data Science, New York University, New York, NY, USA
- given-names: Richard
  family-names: Bonneau
  affiliation: Center for Computational Biology, Flatiron Institute, New York, NY, USA
  orcid: https://orcid.org/0000-0003-4354-7906
version: 1.0.0
date-released: 2021-03-31
repository-code: https://github.com/flatironinstitute/DeepFRI
license: BSD-3-Clause
keywords:
- "Graph Neural Networks"
- "Protein function"
- "Function prediction"
preferred-citation:
  type: article
  authors:
  - given-names: Vladimir
    family-names: Gligorijević
    affiliation: Center for Computational Biology, Flatiron
      Institute, New York, NY, USA
    orcid: https://orcid.org/0000-0002-5165-0973
  - given-names: P. Douglas
    family-names: Renfrew
    affiliation: Center for Computational Biology, Flatiron Institute,
      New York, NY, USA
  - given-names: Tomasz
    family-names: Kosciolek
    affiliation: Malopolska Centre of Biotechnology, Jagiellonian University,
      Krakow, Poland
    orcid: https://orcid.org/0000-0002-5693-3593
  - given-names: Julia Koehler
    family-names: Leman
    affiliation: Center for Computational Biology, Flatiron Institute,
      New York, NY, USA
    orcid: https://orcid.org/0000-0002-5693-3593
  - given-names: Daniel
    family-names: Berenberg
    affiliation: Center for Computational Biology, Flatiron Institute,
      New York, NY, USA
  - given-names: Tommi
    family-names: Vatanen
    affiliation: Broad Institute of MIT and Harvard, Cambridge, MA, USA
    orcid: https://orcid.org/0000-0003-0949-1291
  - given-names: Chris
    family-names: Chandler
    affiliation: Center for Computational Biology, Flatiron Institute,
      New York, NY, USA
  - given-names: Bryn C.
    family-names: Taylor
    affiliation: Biomedical Sciences Graduate Program,
      University of California San Diego, La Jolla, CA, USA
  - given-names: Ian M.
    family-names: Fisk
    affiliation: Scientific Computing Core, Flatiron Institute,
      Simons Foundation, New York, NY, USA
  - given-names: Hera
    family-names: Vlamakis
    affiliation: Broad Institute of MIT and Harvard, Cambridge, MA, USA
    orcid: https://orcid.org/0000-0003-1086-9191
  - given-names: Ramnik J.
    family-names: Xavier
    affiliation: Broad Institute of MIT and Harvard, Cambridge, MA, USA
    orcid: https://orcid.org/0000-0002-5630-5167
  - given-names: Rob
    family-names: Knight
    affiliation: Department of Pediatrics, University of California San Diego,
      La Jolla, CA, USA
    orcid: https://orcid.org/0000-0002-0975-9019
  - given-names: Kyunghyun
    family-names: Cho
    affiliation: Center for Data Science, New York University, New York, NY, USA
  - given-names: Richard
    family-names: Bonneau
    affiliation: Center for Computational Biology, Flatiron Institute, New York, NY, USA
    orcid: https://orcid.org/0000-0003-4354-7906
  doi: "10.1038/s41467-021-23303-9"
  journal: "Nature Communications"
  month: 5
  title: "Structure-based protein function prediction using
    graph convolutional networks"
  abstract: 'The rapid increase in the number of proteins in sequence databases
    and the diversity of their functions challenge computational approaches for
    automated function prediction. Here, we introduce DeepFRI,
    a Graph Convolutional Network for predicting protein functions by leveraging
    sequence features extracted from a protein language model and protein structures.
    It outperforms current leading methods and sequence-based Convolutional Neural Networks
    and scales to the size of current sequence repositories. Augmenting the training set
    of experimental structures with homology models allows us to significantly
    expand the number of predictable functions. DeepFRI has significant de-noising capability,
    with only a minor drop in performance when experimental structures are replaced
    by protein models. Class activation mapping allows function predictions
    at an unprecedented resolution, allowing site-specific annotations at the
    residue-level in an automated manner. We show the utility and high performance
    of our method by annotating structures from the PDB and SWISS-MODEL,
    making several new confident function predictions.
    DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/.'
  year: 2021

GitHub Events

Total
  • Issues event: 10
  • Watch event: 7
  • Issue comment event: 7
  • Push event: 2
Last Year
  • Issues event: 10
  • Watch event: 7
  • Issue comment event: 7
  • Push event: 2

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 5
  • Total pull requests: 0
  • Average time to close issues: 5 days
  • Average time to close pull requests: N/A
  • Total issue authors: 3
  • Total pull request authors: 0
  • Average comments per issue: 0.2
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 0
  • Average time to close issues: 5 days
  • Average time to close pull requests: N/A
  • Issue authors: 3
  • Pull request authors: 0
  • Average comments per issue: 0.2
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • valentynbez (6)
  • lmszydlowski (6)
  • Galaxy-228 (2)
  • tkosciol (2)
  • gilles-20 (2)
  • annotatebio (1)
  • lilithfeer (1)
  • FilipSchymik (1)
  • SoliareofAstora (1)
  • achousal (1)
  • jiaojiaoguan (1)
Pull Request Authors
  • valentynbez (9)
Top Labels
Issue Labels
bug (4) enhancement (3) question (3) wontfix (2) refactor (2) documentation (1) help wanted (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 216 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 8
  • Total maintainers: 1
pypi.org: mdeepfri

Pipeline for searching and aligning contact maps for proteins, and function prediction with DeepFRI.

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 216 Last month
Rankings
Dependent packages count: 10.1%
Average: 38.3%
Dependent repos count: 66.5%
Maintainers (1)
Last synced: 5 months ago

Dependencies

docs/requirements.txt pypi
  • myst-parser *
  • sphinx *
  • sphinx-argparse *
  • sphinxawesome-theme *
  • sphinxcontrib-bibtex *
requirements.txt pypi
  • biopython ==1.79
  • dataclasses ==0.6
  • numpy ==1.21.5
  • pandas ==1.3.5
  • pathos ==0.2.8
  • requests ==2.27.1
  • setuptools ==58.0.4
  • tensorflow ==2.8.0
  • tqdm *