bgcflow
Snakemake workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes)
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 51 DOI reference(s) in README
- ✓ Academic publication links: Links to nature.com
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.0%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: NBChub
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://github.com/NBChub/bgcflow/wiki
- Size: 57.9 MB
Statistics
- Stars: 41
- Watchers: 2
- Forks: 9
- Open Issues: 40
- Releases: 34
Metadata Files
README.md
BGCFlow
BGCFlow is a systematic workflow for the analysis of biosynthetic gene clusters across large collections of genomes (pangenomes) from internal & public datasets.
At present, BGCFlow has only been tested and confirmed to work on Linux systems with the conda/mamba package manager.
Publication
Matin Nuhamunada, Omkar S Mohite, Patrick V Phaneuf, Bernhard O Palsson, Tilmann Weber, BGCFlow: systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets, Nucleic Acids Research, 2024, gkae314, https://doi.org/10.1093/nar/gkae314
Pre-requisites
BGCFlow requires gcc and the conda/mamba package manager. See the installation instructions for details.
Please use the latest version of BGCFlow available.
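The Quick Start asks for `bgcflow_wrapper` version >= 0.2.7. A minimum like this can be checked in a script. The following is a minimal sketch and not part of BGCFlow: the helper name `version_ge` is hypothetical, and it assumes GNU `sort` with the `-V` (version sort) option, which is available on typical Linux systems.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of BGCFlow): succeeds when version $1 >= version $2.
# Assumes GNU sort with -V (version sort).
version_ge() {
    # After a version sort, the smaller version comes first,
    # so $1 >= $2 exactly when the first line equals $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: compare a version string against the documented minimum.
minimum="0.2.7"
if version_ge "0.4.0" "$minimum"; then
    echo "bgcflow_wrapper 0.4.0 satisfies >= $minimum"
fi
```

To check an installed wrapper, the version string could be taken from the `Version:` line printed by `pip show bgcflow_wrapper` and passed as the first argument.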
Quick Start
A quick and easy way to run BGCFlow is through the command line interface wrapper:
- Create a conda environment and install the `BGCFlow` python wrapper:
```bash
# create and activate a new conda environment
mamba create -n bgcflow -c conda-forge python=3.11 pip openjdk -y # also install java for metabase
conda activate bgcflow

# install BGCFlow wrapper
pip install bgcflow_wrapper==0.4.0

# make sure to use bgcflow_wrapper version >= 0.2.7
bgcflow --version
```
- Additional pre-requisites: with the environment activated, install or set up the following configurations:
  - Set `conda` channel priorities to `flexible`
```bash
conda config --set channel_priority disabled
conda config --describe channel_priority
```
- Set
- Deploy and run BGCFlow, changing the `your_bgcflow_directory` variable accordingly (the example uses a directory named `bgcflow`):
```bash
# Deploy and run BGCFlow
bgcflow clone bgcflow # clone `BGCFlow` to a directory named bgcflow
cd bgcflow # move to bgcflow directory
bgcflow init # initiate `BGCFlow` config and examples from template
bgcflow run -n # do a dry run, remove the flag "-n" to run the example dataset
```
- Build and serve the interactive report (after `bgcflow run` has finished). The report will be served at http://localhost:8001/. A demo of the report is available here:
```bash
# build a report
bgcflow build report

# show available projects
bgcflow serve

# serve interactive report
bgcflow serve --project Lactobacillus_delbrueckii
```
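Since the report is served at http://localhost:8001/, a script can wait until that address responds before opening it. This is a minimal sketch under stated assumptions, not part of BGCFlow: the helper name `wait_for_report` and its attempt and delay parameters are hypothetical, and `curl` is assumed to be installed.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of BGCFlow): poll a URL until it responds.
# Assumes curl is installed.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_report() {
    local url="$1" attempts="${2:-30}" delay="${3:-2}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # --fail makes curl exit non-zero on HTTP error responses (status >= 400).
        if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
            echo "Report is reachable at $url"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "No response from $url after $attempts attempts" >&2
    return 1
}
```

With `bgcflow serve --project Lactobacillus_delbrueckii` running in another terminal, `wait_for_report http://localhost:8001/` would return once the page can be fetched.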
- For detailed usage and configurations, have a look at the WIKI: https://github.com/NBChub/bgcflow/wiki
- Read more about `bgcflow_wrapper` for a detailed overview of the command line interface.
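The Quick Start recommends a dry run (`bgcflow run -n`) before the real run. The two steps can be chained so that the real run starts only when the dry run succeeds. This is a minimal sketch, assuming it is called from inside the BGCFlow directory; the function name `run_with_dry_run_check` is hypothetical, and only the two `bgcflow run` invocations shown in the Quick Start are used.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper (not part of BGCFlow): do a dry run first and start
# the real run only if the dry run exits successfully.
# Uses only `bgcflow run -n` and `bgcflow run`, as shown in the Quick Start.
run_with_dry_run_check() {
    if ! bgcflow run -n; then
        echo "Dry run failed; not starting the real run." >&2
        return 1
    fi
    bgcflow run
}
```

Calling `run_with_dry_run_check` returns the exit status of the real run, or 1 if the dry run failed.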
Workflow overview
The main Snakefile workflow comprises various pipelines for data selection, functional annotation, phylogenetic analysis, genome mining, and comparative genomics for Prokaryotic datasets.

Available pipelines in the main Snakefile can be checked using the following command:
```bash
bgcflow pipelines
```
List of Available Pipelines
Here you can find the pipeline keywords that you can run using the main Snakefile of BGCFlow.
| | Keyword | Description | Links |
|---:|:---|:---|:---|
| 0 | eggnog | Annotate samples with eggNOG database (http://eggnog5.embl.de) | eggnog-mapper |
| 1 | mash | Calculate distance estimation for all samples using MinHash. | Mash |
| 2 | fastani | Do pairwise Average Nucleotide Identity (ANI) calculation across all samples. | FastANI |
| 3 | automlst-wrapper | Simplified Tree building using autoMLST | automlst-simplified-wrapper |
| 4 | roary | Build pangenome using Roary. | Roary |
| 5 | eggnog-roary | Annotate Roary output using eggNOG mapper | eggnog-mapper |
| 6 | seqfu | Calculate sequence statistics using SeqFu. | seqfu2 |
| 7 | bigslice | Cluster BGCs using BiG-SLiCE (https://github.com/medema-group/bigslice) | bigslice |
| 8 | query-bigslice | Map BGCs to BiG-FAM database (https://bigfam.bioinformatics.nl/) | bigfam.bioinformatics.nl |
| 9 | checkm | Assess genome quality with CheckM. | CheckM |
| 10 | gtdbtk | Taxonomic placement with GTDB-Tk | GTDBTk |
| 11 | prokka-gbk | Copy annotated genbank results. | prokka |
| 12 | antismash | Summarizes antiSMASH result. | antismash |
| 13 | arts | Run Antibiotic Resistant Target Seeker (ARTS) on samples. | arts |
| 14 | deeptfactor | Use deep learning to find Transcription Factors. | deeptfactor |
| 15 | deeptfactor-roary | Use DeepTFactor on Roary outputs. | Roary |
| 16 | cblaster-genome | Build diamond database of genomes for cblaster search. | cblaster |
| 17 | cblaster-bgc | Build diamond database of BGCs for cblaster search. | cblaster |
| 18 | bigscape | Cluster BGCs using BiG-SCAPE | BiG-SCAPE |
| 19 | gecco | GEne Cluster prediction with COnditional random fields. | GECCO |
Development & Funding
The development of BGCFlow commenced within the Natural Products Genome Mining research group at the Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark (DTU Biosustain). BGCFlow development has been made possible through the generous support of the following funding organizations:
- Novo Nordisk Foundation: BGCFlow development was supported by grants from the Novo Nordisk Foundation, specifically [NNF20CC0035580] and [NNF16OC0021746]. Matin Nuhamunada received support from the NNF Copenhagen Bioscience PhD Program, grant [NNF20SA0035588].
- Danish National Research Foundation: Additional funding was provided by the Danish National Research Foundation for the Center for Microbial Secondary Metabolites (CeMiSt), under the grant [DNRF137].
References
- Mash: fast genome and metagenome distance estimation using MinHash. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Genome Biol. 2016 Jun 20;17(1):132. doi: 10.1186/s13059-016-0997-x.
- Mash Screen: high-throughput sequence containment estimation for genome discovery. Ondov BD, Starrett GJ, Sappington A, Kostic A, Koren S, Buck CB, Phillippy AM. Genome Biol. 2019 Nov 5;20(1):232. doi: 10.1186/s13059-019-1841-x.
- Jain, C., Rodriguez-R, L.M., Phillippy, A.M. et al. High throughput ANI analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat Commun 9, 5114 (2018). https://doi.org/10.1038/s41467-018-07641-9
- Mohammad Alanjary, Katharina Steinke, Nadine Ziemert, AutoMLST: an automated web server for generating multi-locus species trees highlighting natural product potential, Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019, Pages W276–W282
- Andrew J. Page, Carla A. Cummins, Martin Hunt, Vanessa K. Wong, Sandra Reuter, Matthew T. G. Holden, Maria Fookes, Daniel Falush, Jacqueline A. Keane, Julian Parkhill, 'Roary: Rapid large-scale prokaryote pan genome analysis', Bioinformatics, 2015;31(22):3691-3693 doi:10.1093/bioinformatics/btv421
- eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Carlos P. Cantalapiedra, Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021. Molecular Biology and Evolution, msab293
- eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork Nucleic Acids Res. 2019 Jan 8; 47(Database issue): D309–D314. doi: 10.1093/nar/gky1085
- Telatin, A., Birolo, G., & Fariselli, P. SeqFu [Computer software]. GITHUB: https://github.com/telatin/seqfu2
- Satria A Kautsar, Kai Blin, Simon Shaw, Tilmann Weber, Marnix H Medema, BiG-FAM: the biosynthetic gene cluster families database, Nucleic Acids Research, gkaa812, https://doi.org/10.1093/nar/gkaa812
- Satria A Kautsar, Justin J J van der Hooft, Dick de Ridder, Marnix H Medema, BiG-SLiCE: A highly scalable tool maps the diversity of 1.2 million biosynthetic gene clusters, GigaScience, Volume 10, Issue 1, January 2021, giaa154.
- Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Research, 25: 1043-1055.
- Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848.
- Parks DH, et al. 2020. A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
- Parks DH, et al. 2018. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nature Biotechnology, http://dx.doi.org/10.1038/nbt.4229.
- Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063
- antiSMASH 6.0: improving cluster detection and comparison capabilities. Kai Blin, Simon Shaw, Alexander M Kloosterman, Zach Charlop-Powers, Gilles P van Wezel, Marnix H Medema, & Tilmann Weber. Nucleic Acids Research (2021) doi: 10.1093/nar/gkab335.
- Mungan,M.D., Alanjary,M., Blin,K., Weber,T., Medema,M.H. and Ziemert,N. (2020) ARTS 2.0: feature updates and expansion of the Antibiotic Resistant Target Seeker for comparative genome mining. Nucleic Acids Res.,10.1093/nar/gkaa374
- Alanjary,M., Kronmiller,B., Adamek,M., Blin,K., Weber,T., Huson,D., Philmus,B. and Ziemert,N. (2017) The Antibiotic Resistant Target Seeker (ARTS), an exploration engine for antibiotic cluster prioritization and novel drug target discovery. Nucleic Acids Res.,10.1093/nar/gkx360
- Kim G.B., Gao Y., Palsson B.O., Lee S.Y. 2020. DeepTFactor: A deep learning-based tool for the prediction of transcription factors. PNAS. doi: 10.1073/pnas.2021171118
- Gilchrist, C., Booth, T. J., van Wersch, B., van Grieken, L., Medema, M. H., & Chooi, Y. (2021). cblaster: a remote search tool for rapid identification and visualisation of homologous gene clusters (Version 1.3.9) [Computer software]. https://doi.org/10.1101/2020.11.08.370601
- Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
- Navarro-Muñoz, J.C., Selem-Mojica, N., Mullowney, M.W. et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 16, 60–68 (2020)
- Kai Blin, Simon Shaw, Hannah E Augustijn, Zachary L Reitz, Friederike Biermann, Mohammad Alanjary, Artem Fetter, Barbara R Terlouw, William W Metcalf, Eric J N Helfrich, Gilles P van Wezel, Marnix H Medema, Tilmann Weber, antiSMASH 7.0: new and improved predictions for detection, regulation, chemical structures and visualisation, Nucleic Acids Research, Volume 51, Issue W1, 5 July 2023, Pages W46–W50, https://doi.org/10.1093/nar/gkad344
- Laura M Carroll, Martin Larralde, Jonas Simon Fleck, Ruby Ponnudurai, Alessio Milanese, Elisa Cappio Barazzone, Georg Zeller. 2021. Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv 2021.05.03.442509; doi:10.1101/2021.05.03.442509
Owner
- Name: NBCHub
- Login: NBChub
- Kind: organization
- Location: Denmark
- Repositories: 18
- Profile: https://github.com/NBChub
Repositories for analyzing large scale datasets for natural products discovery developed at DTU Biosustain
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you find this repository useful, please cite it using the following publication."
authors:
  - family-names: "Nuhamunada"
    given-names: "Matin"
    orcid: "https://orcid.org/0000-0003-3177-8299"
  - family-names: "Mohite"
    given-names: "Omkar S."
    orcid: "https://orcid.org/0000-0002-3240-1656"
  - family-names: "Phaneuf"
    given-names: "Patrick V."
    orcid: "https://orcid.org/0000-0002-4122-6589"
  - family-names: "Palsson"
    given-names: "Bernhard O."
    orcid: "https://orcid.org/0000-0003-2357-6785"
  - family-names: "Weber"
    given-names: "Tilmann"
    orcid: "https://orcid.org/0000-0002-8260-5120"
title: "BGCFlow"
version: 0.6.3
doi: "N/A"
date-released: 2023-06-15
url: "https://github.com/NBChub/bgcflow_wrapper"
preferred-citation:
  type: article
  authors:
    - family-names: "Nuhamunada"
      given-names: "Matin"
      orcid: "https://orcid.org/0000-0003-3177-8299"
    - family-names: "Mohite"
      given-names: "Omkar S."
      orcid: "https://orcid.org/0000-0002-3240-1656"
    - family-names: "Phaneuf"
      given-names: "Patrick V."
      orcid: "https://orcid.org/0000-0002-4122-6589"
    - family-names: "Palsson"
      given-names: "Bernhard O."
      orcid: "https://orcid.org/0000-0003-2357-6785"
    - family-names: "Weber"
      given-names: "Tilmann"
      orcid: "https://orcid.org/0000-0002-8260-5120"
  doi: "10.1101/2023.06.14.545018"
  journal: "bioRxiv"
  publisher: "Cold Spring Harbor Laboratory"
  title: "BGCFlow: Systematic pangenome workflow for the analysis of biosynthetic gene clusters across large genomic datasets"
  elocation-id: "2023.06.14.545018"
  year: 2023
  url: "https://www.biorxiv.org/content/early/2023/06/15/2023.06.14.545018"
  eprint: "https://www.biorxiv.org/content/early/2023/06/15/2023.06.14.545018.full.pdf"
GitHub Events
Total
- Create event: 3
- Issues event: 8
- Release event: 2
- Watch event: 12
- Delete event: 4
- Issue comment event: 9
- Push event: 12
- Pull request event: 2
- Fork event: 1
Last Year
- Create event: 3
- Issues event: 8
- Release event: 2
- Watch event: 12
- Delete event: 4
- Issue comment event: 9
- Push event: 12
- Pull request event: 2
- Fork event: 1
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 73
- Total pull requests: 30
- Average time to close issues: 4 months
- Average time to close pull requests: about 21 hours
- Total issue authors: 9
- Total pull request authors: 2
- Average comments per issue: 2.42
- Average comments per pull request: 0.9
- Merged pull requests: 29
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 3
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- matinnuhamunada (39)
- OmkarSaMo (39)
- MatissMaleckis (4)
- yzjie6 (3)
- TanyaC505 (2)
- robertosanchezn (2)
- Sam-Will (1)
- FriederikeBiermann (1)
- coderbins (1)
- andrekind17 (1)
- artfet (1)
- alpole23 (1)
- 540889956 (1)
- sadhanokar (1)
- MagnusHaahr (1)
Pull Request Authors
- matinnuhamunada (44)
- OmkarSaMo (15)
- JackSun1997 (1)
- andrekind17 (1)
- anpanche (1)
Dependencies
- actions/checkout v4 composite
- mamba-org/setup-micromamba v1 composite
- actions/checkout v4 composite
- actions/setup-python v4 composite
- coroo/pytest-coverage-commentator v1.0.2 composite
- github/super-linter v4 composite
- snakemake/snakemake-github-action v1.24.0 composite
- condaforge/mambaforge latest build
- alive_progress *
- biopython *
- bokeh *
- datetime *
- ete3 *
- ipykernel *
- ipywidgets >=7.6
- jupyter-dash *
- jupyterlab >3
- nbconvert *
- ncbi-genome-download ==0.3.1
- networkx *
- openpyxl *
- pandas *
- pathlib *
- plotly *
- seaborn *
- subprocess *
- xlrd >=1.0.0
- zip *
- Sphinx ==1.8.5 development
- bump2version ==0.5.11 development
- coverage ==4.5.4 development
- flake8 ==3.7.8 development
- pip ==21.1 development
- tox ==3.14.0 development
- twine ==1.14.0 development
- watchdog ==0.9.0 development
- wheel >=0.38.1 development