bgcflow

Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)

https://github.com/nbchub/bgcflow

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 51 DOI reference(s) in README
✓ Academic publication links: Links to: nature.com
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (15.0%) to scientific vocabulary

Keywords

biosynthetic-gene-clusters genome-analysis genome-annotation pangenome-pipeline

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: NBChub
License: mit
Language: Python
Default Branch: main
Homepage: https://github.com/NBChub/bgcflow/wiki
Size: 57.9 MB

Statistics

Stars: 41
Watchers: 2
Forks: 9
Open Issues: 40
Releases: 34

Created almost 5 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Citation

BGCFlow

BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.

At present, BGCFlow is only tested and confirmed to work on Linux systems with conda / mamba package manager.

Publication

Matin Nuhamunada, Omkar S Mohite, Patrick V Phaneuf, Bernhard O Palsson, Tilmann Weber, BGCFlow: systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets, Nucleic Acids Research, 2024, gkae314, https://doi.org/10.1093/nar/gkae314

Pre-requisites

BGCFlow requires gcc and the conda/mamba package manager. See the installation instructions for details.
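The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of BGCFlow: `missing_tools` is a hypothetical helper that reports which commands are absent from `PATH`, and the script treats either `conda` or `mamba` as satisfying the package-manager requirement.

```bash
# Preflight sketch: report which prerequisites are missing.
# `missing_tools` is a hypothetical helper, not a BGCFlow command.

missing_tools() {
    # Print the name of every argument that is not found on PATH.
    for _tool in "$@"; do
        command -v "$_tool" >/dev/null 2>&1 || printf '%s\n' "$_tool"
    done
}

# BGCFlow is only tested on Linux, so warn on other systems.
if [ "$(uname -s)" != "Linux" ]; then
    echo "Warning: BGCFlow is only tested on Linux."
fi

# gcc is required; either conda or mamba covers the package manager.
missing="$(missing_tools gcc)"
if [ -n "$(missing_tools conda)" ] && [ -n "$(missing_tools mamba)" ]; then
    missing="${missing:+$missing }conda/mamba"
fi

if [ -n "$missing" ]; then
    echo "Missing prerequisites: $missing"
else
    echo "All prerequisites found."
fi
```

The script only prints a report and never exits early, so it is safe to paste into an interactive shell.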

Please use the latest version of BGCFlow available.

Quick Start

A quick and easy way to use BGCFlow is through the command line interface wrapper:

Create a conda environment and install the BGCFlow Python wrapper:

```bash
# create and activate a new conda environment
mamba create -n bgcflow -c conda-forge python=3.11 pip openjdk -y # also install java for metabase
conda activate bgcflow

# install `BGCFlow` wrapper
pip install bgcflow_wrapper==0.4.0

# make sure to use bgcflow_wrapper version >= 0.2.7
bgcflow --version
```
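The minimum wrapper version mentioned in the install step (0.2.7) can also be checked automatically. This is a sketch under two assumptions: `version_ge` is a hypothetical helper built on GNU `sort -V`, and the installed version is read from the `Version:` field of `pip show` rather than from `bgcflow --version`, whose exact output format is not documented here.

```bash
# version_ge A B: succeed when version A >= version B.
# Hypothetical helper; relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    # If B is the smallest of the two after version sorting, then A >= B.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

minimum="0.2.7"

# Read the installed version from pip metadata (empty if not installed).
installed="$(pip show bgcflow_wrapper 2>/dev/null | awk '/^Version:/ {print $2}')"

if [ -z "$installed" ]; then
    echo "bgcflow_wrapper is not installed in this environment."
elif version_ge "$installed" "$minimum"; then
    echo "bgcflow_wrapper $installed satisfies >= $minimum"
else
    echo "bgcflow_wrapper $installed is older than $minimum; please upgrade."
fi
```

Version sorting is used instead of a plain string comparison so that, for example, 0.10.0 is correctly treated as newer than 0.2.7.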

Additional pre-requisites: with the environment activated, apply the following configuration:
- Set the conda channel priority with `conda config --set channel_priority disabled`, then confirm the setting with `conda config --describe channel_priority`.

Deploy and run BGCFlow, replacing the directory name `bgcflow` with your own as needed:

```bash
# Deploy and run BGCFlow
bgcflow clone bgcflow # clone BGCFlow to a directory named bgcflow
cd bgcflow # move to bgcflow directory
bgcflow init # initiate BGCFlow config and examples from template
bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
```
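The dry-run-then-run pattern in the deploy step can be wrapped so that the real run only starts when the dry run succeeds. A minimal sketch, assuming only the `-n` flag shown in that step; `run_after_dry_run` is a hypothetical helper and not part of the wrapper.

```bash
# run_after_dry_run CMD [ARGS...]
# Hypothetical helper: run "CMD ARGS -n" first, then "CMD ARGS" only if
# the dry run exits successfully.
run_after_dry_run() {
    if "$@" -n; then
        "$@"
    else
        echo "Dry run failed; skipping the real run." >&2
        return 1
    fi
}

# Usage, from inside the cloned BGCFlow directory:
#   run_after_dry_run bgcflow run
```

Keeping the command as an argument makes the helper easy to try with a harmless stand-in command before pointing it at a long-running workflow.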
Build and serve the interactive report (after `bgcflow run` has finished). The report will be served at http://localhost:8001/. A demo of the report is also available online.

```bash
# build a report
bgcflow build report

# show available projects
bgcflow serve

# serve interactive report
bgcflow serve --project Lactobacillus_delbrueckii
```

For detailed usage and configuration, have a look at the wiki: https://github.com/NBChub/bgcflow/wiki
Read more about bgcflow_wrapper for a detailed overview of the command line interface.

Workflow overview

The main Snakefile workflow comprises various pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics for Prokaryotic datasets.

[Figure: workflow DAG]

Available pipelines in the main Snakefile can be listed with `bgcflow pipelines`.

List of Available Pipelines

Here you can find the pipeline keywords that you can run using the main Snakefile of BGCFlow.

| | Keyword | Description | Links |
|---:|:---|:---|:---|
| 0 | eggnog | Annotate samples with eggNOG database (http://eggnog5.embl.de) | eggnog-mapper |
| 1 | mash | Calculate distance estimation for all samples using MinHash. | Mash |
| 2 | fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI |
| 3 | automlst-wrapper | Simplified Tree building using autoMLST | automlst-simplified-wrapper |
| 4 | roary | Build pangenome using Roary. | Roary |
| 5 | eggnog-roary | Annotate Roary output using eggNOG mapper | eggnog-mapper |
| 6 | seqfu | Calculate sequence statistics using SeqFu. | seqfu2 |
| 7 | bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice) | bigslice |
| 8 | query-bigslice | Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/) | bigfam.bioinformatics.nl |
| 9 | checkm | Assess genome quality with CheckM. | CheckM |
| 10 | gtdbtk | Taxonomic placement with GTDB-Tk | GTDBTk |
| 11 | prokka-gbk | Copy annotated genbank results. | prokka |
| 12 | antismash | Summarizes antiSMASH result. | antismash |
| 13 | arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts |
| 14 | deeptfactor | Use deep learning to find Transcription Factors. | deeptfactor |
| 15 | deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary |
| 16 | cblaster-genome | Build diamond database of genomes for cblaster search. | cblaster |
| 17 | cblaster-bgc | Build diamond database of BGCs for cblaster search. | cblaster |
| 18 | bigscape | Cluster BGCs using BiG-SCAPE | BiG-SCAPE |
| 19 | gecco | GEne Cluster prediction with COnditional random fields. | GECCO |

Development & Funding

The development of BGCFlow began in the Natural Products Genome Mining research group at the Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark (DTU Biosustain). It has been made possible through the support of the following funding organizations:

- Novo Nordisk Foundation: BGCFlow development was supported by grants from the Novo Nordisk Foundation, specifically [NNF20CC0035580] and [NNF16OC0021746]. Matin Nuhamunada received support from the NNF Copenhagen Bioscience PhD Program, grant [NNF20SA0035588].
- Danish National Research Foundation: Additional funding was provided by the Danish National Research Foundation for the Center for Microbial Secondary Metabolites (CeMiSt), under the grant [DNRF137].

References

Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.

Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Genome Biol. 2019 Nov 5;20(1):232. doi: 10.1186/s13059-019-1841-x.

Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9

Mohammad Alanjary, Katharina Steinke, Nadine Ziemert, AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W276–W282

Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, 'Roary: Rapid large-scale prokaryote pan genome analysis', Bioinformatics, 2015;31(22):3691-3693 doi:10.1093/bioinformatics/btv421

eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293

eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085

Telatin, A., Birolo, G., & Fariselli, P. SeqFu [Computer software]. GITHUB: https://github.com/telatin/seqfu2

Satria A Kautsar, Kai Blin, Simon Shaw, Tilmann Weber, Marnix H Medema, BiG-FAM: the biosynthetic gene cluster families database, Nucleic Acids Research, gkaa812, https://doi.org/10.1093/nar/gkaa812

Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154.

Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2014. Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25: 1043-1055.

Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848.

Parks DH, et al. 2020. A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.

Parks DH, et al. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, http://dx.doi.org/10.1038/nbt.4229.

Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063

antiSMASH 6.0: improving cluster detection and comparison capabilities. Kai Blin, Simon Shaw, Alexander M Kloosterman, Zach Charlop-Powers, Gilles P van Wezel, Marnix H Medema, & Tilmann Weber. Nucleic Acids Research (2021) doi: 10.1093/nar/gkab335.

Mungan,M.D., Alanjary,M., Blin,K., Weber,T., Medema,M.H. and Ziemert,N. (2020) ARTS 2.0: feature updates and expansion of the Antibiotic Resistant Target Seeker for comparative genome mining. Nucleic Acids Res.,10.1093/nar/gkaa374

Alanjary,M., Kronmiller,B., Adamek,M., Blin,K., Weber,T., Huson,D., Philmus,B. and Ziemert,N. (2017) The Antibiotic Resistant Target Seeker (ARTS), an exploration engine for antibiotic cluster prioritization and novel drug target discovery. Nucleic Acids Res.,10.1093/nar/gkx360

Kim G.B., Gao Y., Palsson B.O., Lee S.Y. 2020. DeepTFactor: A deep learning-based tool for the prediction of transcription factors. PNAS. doi: 10.1073/pnas.2021171118

Gilchrist, C., Booth, T. J., van Wersch, B., van Grieken, L., Medema, M. H., & Chooi, Y. (2021). cblaster: a remote search tool for rapid identification and visualisation of homologous gene clusters (Version 1.3.9) [Computer software]. https://doi.org/10.1101/2020.11.08.370601

Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).

Navarro-Muñoz, J.C., Selem-Mojica, N., Mullowney, M.W. et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 16, 60–68 (2020)

Kai Blin, Simon Shaw, Hannah E Augustijn, Zachary L Reitz, Friederike Biermann, Mohammad Alanjary, Artem Fetter, Barbara R Terlouw, William W Metcalf, Eric J N Helfrich, Gilles P van Wezel, Marnix H Medema, Tilmann Weber, antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation, Nucleic Acids Research, Volume 51, Issue W1, 5 July 2023, Pages W46–W50, https://doi.org/10.1093/nar/gkad344

Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. 2021. Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509

Owner

Name: NBCHub
Login: NBChub
Kind: organization
Location: Denmark

Repositories: 18
Profile: https://github.com/NBChub

Repositories for analyzing large scale datasets for natural products discovery developed at DTU Biosustain

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you find this repository useful, please cite it using the following publication."
authors:
- family-names: "Nuhamunada"
  given-names: "Matin"
  orcid: "https://orcid.org/0000-0003-3177-8299"
- family-names: "Mohite"
  given-names: "Omkar S."
  orcid: "https://orcid.org/0000-0002-3240-1656"
- family-names: "Phaneuf"
  given-names: "Patrick V."
  orcid: "https://orcid.org/0000-0002-4122-6589"
- family-names: "Palsson"
  given-names: "Bernhard O."
  orcid: "https://orcid.org/0000-0003-2357-6785"
- family-names: "Weber"
  given-names: "Tilmann"
  orcid: "https://orcid.org/0000-0002-8260-5120"
title: "BGCFlow"
version: 0.6.3
doi: "N/A"
date-released: 2023-06-15
url: "https://github.com/NBChub/bgcflow_wrapper"
preferred-citation:
  type: article
  authors:
  - family-names: "Nuhamunada"
    given-names: "Matin"
    orcid: "https://orcid.org/0000-0003-3177-8299"
  - family-names: "Mohite"
    given-names: "Omkar S."
    orcid: "https://orcid.org/0000-0002-3240-1656"
  - family-names: "Phaneuf"
    given-names: "Patrick V."
    orcid: "https://orcid.org/0000-0002-4122-6589"
  - family-names: "Palsson"
    given-names: "Bernhard O."
    orcid: "https://orcid.org/0000-0003-2357-6785"
  - family-names: "Weber"
    given-names: "Tilmann"
    orcid: "https://orcid.org/0000-0002-8260-5120"
  doi: "10.1101/2023.06.14.545018"
  journal: "bioRxiv"
  publisher: "Cold Spring Harbor Laboratory"
  title: "BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets"
  elocation-id: "2023.06.14.545018"
  year: 2023
  url: "https://www.biorxiv.org/content/early/2023/06/15/2023.06.14.545018"
  eprint: "https://www.biorxiv.org/content/early/2023/06/15/2023.06.14.545018.full.pdf"

GitHub Events

Total

Create event: 3
Issues event: 8
Release event: 2
Watch event: 12
Delete event: 4
Issue comment event: 9
Push event: 12
Pull request event: 2
Fork event: 1

Last Year

Create event: 3
Issues event: 8
Release event: 2
Watch event: 12
Delete event: 4
Issue comment event: 9
Push event: 12
Pull request event: 2
Fork event: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 73
Total pull requests: 30
Average time to close issues: 4 months
Average time to close pull requests: about 21 hours
Total issue authors: 9
Total pull request authors: 2
Average comments per issue: 2.42
Average comments per pull request: 0.9
Merged pull requests: 29
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 3
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 3
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

matinnuhamunada (39)
OmkarSaMo (39)
MatissMaleckis (4)
yzjie6 (3)
TanyaC505 (2)
robertosanchezn (2)
Sam-Will (1)
FriederikeBiermann (1)
coderbins (1)
andrekind17 (1)
artfet (1)
alpole23 (1)
540889956 (1)
sadhanokar (1)
MagnusHaahr (1)

Pull Request Authors

matinnuhamunada (44)
OmkarSaMo (15)
JackSun1997 (1)
andrekind17 (1)
anpanche (1)

Top Labels

Issue Labels

enhancement (30) bug (22) question (12) documentation (5) good first issue (1)

Pull Request Labels

enhancement (6) bug (3)

Dependencies

.github/workflows/build.yml actions

actions/checkout v4 composite
mamba-org/setup-micromamba v1 composite

.github/workflows/push.yml actions

actions/checkout v4 composite
actions/setup-python v4 composite
coroo/pytest-coverage-commentator v1.0.2 composite
github/super-linter v4 composite
snakemake/snakemake-github-action v1.24.0 composite

Dockerfile docker

condaforge/mambaforge latest build

workflow/bgcflow/requirements.txt pypi

alive_progress *
biopython *
bokeh *
datetime *
ete3 *
ipykernel *
ipywidgets >=7.6
jupyter-dash *
jupyterlab >3
nbconvert *
ncbi-genome-download ==0.3.1
networkx *
openpyxl *
pandas *
pathlib *
plotly *
seaborn *
subprocess *
xlrd >=1.0.0
zip *

workflow/bgcflow/requirements_dev.txt pypi

Sphinx ==1.8.5 development
bump2version ==0.5.11 development
coverage ==4.5.4 development
flake8 ==3.7.8 development
pip ==21.1 development
tox ==3.14.0 development
twine ==1.14.0 development
watchdog ==0.9.0 development
wheel >=0.38.1 development

workflow/bgcflow/setup.py pypi