MetaGenePipe

MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis - Published in JOSS (2023)

https://github.com/parkvilledata/metagenepipe

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 9 DOI reference(s) in README and JOSS metadata
  • Academic publication links
    Links to: joss.theoj.org
  • Committers with academic emails
    10 of 13 committers (76.9%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Scientific Fields

Biology Life Sciences - 34% confidence

Repository

MGP

Basic Info
  • Host: GitHub
  • Owner: ParkvilleData
  • License: apache-2.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Size: 53.1 MB
Statistics
  • Stars: 2
  • Watchers: 4
  • Forks: 1
  • Open Issues: 0
  • Releases: 1
Created almost 4 years ago · Last pushed over 2 years ago
Metadata Files
Readme Contributing License

README.rst

MetaGenePipe
============

.. start-badges

|pipline| |docs| |Contributor Covenant| |joss|

.. |pipline| image:: https://github.com/parkvilledata/MetaGenePipe/actions/workflows/testing.yml/badge.svg
.. |docs| image:: https://github.com/parkvilledata/MetaGenePipe/actions/workflows/docs.yml/badge.svg
   :target: https://parkvilledata.github.io/MetaGenePipe
.. |Contributor Covenant| image:: https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg
   :target: https://www.contributor-covenant.org/version/2/1/code_of_conduct/
.. |joss| image:: https://joss.theoj.org/papers/10.21105/joss.04851/status.svg
   :target: https://doi.org/10.21105/joss.04851

.. end-badges


MetaGenePipe (MGP) is an efficient, flexible, portable, and scalable
metagenomics pipeline that uses performant bioinformatics software
suites and genomic databases to create an accurate taxonomic and
functional characterization of the prokaryotic fraction of sequenced
microbiomes.

Microorganisms such as bacteria, viruses, archaea, and fungi are
ubiquitous in our environment. The study of microorganisms and their
full genomes has been enabled through advances in culture-independent
techniques and high-throughput sequencing technologies. Whole genome
metagenomics shotgun sequencing (WGS) empowers researchers to study
biological functions of microorganisms, and how their presence affects
human disease or a specific ecosystem. However, advanced and novel
bioinformatics techniques are required to process the data into a
suitable format. There is no universally accepted standardized
bioinformatics framework that a computational microbiologist can use
effectively.

MGP is written in WDL and thus differs from existing assembly-based
workflow pipelines such as Atlas (which uses Snakemake) and Muffin
(which is written in Nextflow). MGP is an example of WDL and
containerization best practice. Similar to NF-core/Mag, MGP employs
co-assembly of multiple metagenome samples as a feature.

MGP overcomes traditional portability obstacles by using Singularity
containers, and increases flexibility of research focus by using the
DIAMOND aligner, which can create bespoke databases in a matter of
minutes. Researchers often have institutional datasets resulting from
previous research, which can be incorporated into MGP in place of the
default DIAMOND and BLAST databases. Once these databases have been
created, updating the relevant line in the configuration file allows
the workflow to use the new database. While MGP is focussed on
prokaryotes, it can be adapted to eukaryotes or viruses by
replacing the prokaryotic gene prediction software, Prodigal, with
eukaryotic gene prediction software such as GeneMark-EP+ or
EuGene, or with a gene-finding tool for viruses.
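
Building a bespoke DIAMOND database might look roughly like the sketch
below. It only assembles and optionally runs ``diamond makedb`` (``--in``
and ``--db`` are DIAMOND's own options). The file names are hypothetical,
and the configuration line that points MGP at the new database is not
shown because it depends on your configuration file.

.. code-block:: python

    import shutil
    import subprocess


    def makedb_command(fasta, db_prefix):
        """Return the argument list for ``diamond makedb``."""
        return ["diamond", "makedb", "--in", str(fasta), "--db", str(db_prefix)]


    def build_database(fasta, db_prefix):
        """Run ``diamond makedb`` if DIAMOND is on the PATH.

        Returns True when the command was run, False when DIAMOND is missing.
        """
        if shutil.which("diamond") is None:
            return False
        subprocess.run(makedb_command(fasta, db_prefix), check=True)
        return True


    # Hypothetical file names, for illustration only.
    print(" ".join(makedb_command("institutional_proteins.fa", "institutional_proteins")))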

A microbiome is the collection of all microbes, such as bacteria, in a
given environment. The environment can be human related, e.g. the human
gut, or natural, such as soil or water. The composition of the
microbiome, in terms of the percentages of bacterial species present,
helps in determining the effect bacteria have on that environment.

Microbial communities which make up microbiomes have the potential to be
major regulators in biogeochemical processes and can determine ecosystem
function. Understanding the composition, both in terms of diversity and
taxonomic profiles, is useful for determining potential effects of
changes in environmental conditions. For example, the taxonomic
composition of soil samples taken under baseline (regular) conditions
can be compared to that of areas experiencing distress from physical
ecological changes such as mining operations.

MGP is a WDL workflow created using existing bioinformatics tools with
the view of allowing the user to incorporate extra flexibility in
creating taxonomic profiles of their microbiome samples. MGP processes
raw microbiome (metagenomic) samples, performs quality checks to remove
poor-quality sequences, and assembles the samples into longer
sequences, i.e. contigs, which are then searched for open reading
frames. These open reading frames are potential protein sequences, which
are then aligned to major protein databases. The matches from the
alignment are parsed and, using in-house scripts, assigned taxonomic IDs
using KEGG's BRITE hierarchy. Example output from MGP can be seen in the
table below.

======================================================== =====
Pathway                                                  Count
======================================================== =====
09121 Transcription                                      0
09182 Protein families: genetic information processing   0
09183 Protein families: signaling and cellular processes 0
09192 Unclassified: genetic information processing       0
09194 Poorly characterized                               0
Brite Hierarchies                                        6
DNA repair and recombination proteins [BR:ko03400]       0
DNA replication proteins [BR:ko03032]                    0
Function unknown                                         0
Genetic Information Processing                           1
Not Included in Pathway or Brite                         3
Prokaryotic defense system [BR:ko02048]                  0
RNA polymerase [PATH:ko03020]                            0
Replication and repair                                   0
Transcription machinery [BR:ko03021]                     0
Transporters [BR:ko02000]                                0
======================================================== =====

More details can be found in the `documentation <https://parkvilledata.github.io/MetaGenePipe>`__.

Installation
====================

To install MetaGenePipe, clone the repository:

.. code-block:: bash

    git clone https://github.com/ParkvilleData/MetaGenePipe.git
    cd MetaGenePipe

MetaGenePipe requires Java, Singularity, and other dependencies to run.
See the installation instructions in the `documentation <https://parkvilledata.github.io/MetaGenePipe>`__ for further information.

Usage
======

You can start the workflow with the following command:

.. code-block:: bash

   java -Dconfig.file=./metaGenePipe.config -jar cromwell-latest.jar run metaGenePipe.wdl -i metaGenePipe.json -o metaGenePipe.options.json

The pipeline is set up to run against the SwissProt database. Sample FASTQ files of 100,000 reads are supplied so that the pipeline can be tested.
To use your own samples, modify the input file ``input_file.txt``.

Read the `documentation <https://parkvilledata.github.io/MetaGenePipe>`__ for more information about usage and modifying the configuration files.
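
If you launch the workflow from a script, the command above can be
assembled programmatically. The following is a minimal sketch, assuming
the file names from the command above sit in the working directory. The
helper names are hypothetical and not part of MetaGenePipe.

.. code-block:: python

    from pathlib import Path


    def cromwell_command(
        jar="cromwell-latest.jar",
        config="./metaGenePipe.config",
        wdl="metaGenePipe.wdl",
        inputs="metaGenePipe.json",
        options="metaGenePipe.options.json",
    ):
        """Return the Cromwell invocation as an argument list."""
        return [
            "java", f"-Dconfig.file={config}", "-jar", jar,
            "run", wdl, "-i", inputs, "-o", options,
        ]


    def missing_files(command):
        """List the file arguments of the command that do not exist yet."""
        # Positions 3, 5, 7 and 9 hold the jar, WDL, inputs and options files.
        candidates = [command[3], command[5], command[7], command[9]]
        return [name for name in candidates if not Path(name).exists()]


    command = cromwell_command()
    print(" ".join(command))
    # Pass ``command`` to ``subprocess.run(command, check=True)`` once
    # ``missing_files(command)`` returns an empty list.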

Output
======
.. start-output

There are four main output folders: ``qc`` (quality control), ``assembly``, ``readalignment``, and ``geneprediction``. One intermediary folder, ``data``, contains the samples passed to assembly after they have run through TrimGalore and, if co-assembly is specified, been concatenated.

Quality control
~~~~~~~~~~~~~~~~

* trimmed

  * {sampleName}.T{G|T}_R{1|2}.fq.gz: Trimmed output for each of the individual sample files, TG if the chosen trimmer is TrimGalore, and TT if it is Trimmomatic

* fastqc

  * {sampleName}.T{G|T}_R{1|2}_fastqc.zip: FastQC output for each of the individual sample files

* multiqc_report.html: Combined report of all FastQC files
* flash

  * {sampleName}.extendedFrags.fastq: The merged reads
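
The naming scheme for trimmed files can be unpacked with a regular
expression when you post-process the ``qc`` folder. This is a minimal
sketch, and ``parse_trimmed_name`` is a hypothetical helper, not part of
MetaGenePipe.

.. code-block:: python

    import re

    # {sampleName}.T{G|T}_R{1|2}.fq.gz
    TRIMMED_NAME = re.compile(r"(?P<sample>.+)\.T(?P<trimmer>[GT])_R(?P<read>[12])\.fq\.gz")
    TRIMMERS = {"G": "TrimGalore", "T": "Trimmomatic"}


    def parse_trimmed_name(filename):
        """Split a trimmed file name into sample, trimmer and read number.

        Returns None when the name does not follow the naming scheme.
        """
        match = TRIMMED_NAME.fullmatch(filename)
        if match is None:
            return None
        return {
            "sample": match.group("sample"),
            "trimmer": TRIMMERS[match.group("trimmer")],
            "read": int(match.group("read")),
        }


    print(parse_trimmed_name("SRR5808831.TG_R1.fq.gz"))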

Data
~~~~~~~~~~~~~~~~

* {sampleName}_R{1|2}.fq.gz: Sample files after trimming and/or concatenating for co-assembly. If files are concatenated for co-assembly, the sample name is set to ``combined``
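
To illustrate what concatenating samples for co-assembly involves: the
gzip format allows compressed files to be appended to one another
byte for byte, and the result is still a valid gzip file. The sketch below
shows that general technique. It is an assumption about how such a
``combined`` file could be produced, not necessarily how the workflow
does it, and the function names are hypothetical.

.. code-block:: python

    import gzip
    import shutil


    def concatenate_gzip(parts, destination):
        """Append gzip files to one another without recompressing them."""
        with open(destination, "wb") as out:
            for part in parts:
                with open(part, "rb") as source:
                    shutil.copyfileobj(source, out)


    def count_reads(fastq_gz):
        """Count FASTQ records, assuming four lines per record."""
        with gzip.open(fastq_gz, "rt") as handle:
            return sum(1 for _ in handle) // 4

For example, ``concatenate_gzip(["a_R1.fq.gz", "b_R1.fq.gz"], "combined_R1.fq.gz")``
would write one file whose read count is the sum of the two inputs.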

Assembly
~~~~~~~~~~~~~~~~

* {sampleName}.megahit.contigs.fa: Final assembled contigs
* {sampleName}.{kmer}.fastg: Assembly graph for the {kmer} assembled contigs, where {kmer} is the k-mer size that produces the largest contig file in the ``intermediate_contigs`` folder
* intermediate_contigs: a folder containing all intermediate assembled contigs, named {sampleName}.contigs.k{kmer}.fa
* {sampleName}.megahit.blast.out: Raw BLAST results for the contigs
* {sampleName}.megahit.blast.parsed: BLAST results parsed into TSV format for easy viewing
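
The rule for choosing the k-mer size of the assembly graph (the
intermediate contig file with the largest size) can be sketched as
follows. This assumes the intermediate files end in ``.contigs.k{kmer}.fa``,
as in the example output tree. ``largest_kmer`` is a hypothetical helper,
and breaking ties in favour of the larger k-mer is an arbitrary choice
made here.

.. code-block:: python

    import re
    from pathlib import Path

    KMER_FILE = re.compile(r"\.contigs\.k(\d+)\.fa$")


    def largest_kmer(intermediate_dir):
        """Return the k-mer size whose contig file is largest, or None."""
        best = None
        for path in Path(intermediate_dir).iterdir():
            match = KMER_FILE.search(path.name)
            if match is None:
                continue
            # Compare by file size first, then by k-mer size on ties.
            candidate = (path.stat().st_size, int(match.group(1)))
            if best is None or candidate > best:
                best = candidate
        return None if best is None else best[1]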

Read alignment
~~~~~~~~~~~~~~~~

* {sampleName}.T{G|T}.flagstat.txt: Samtools flagstat output. Reports statistics on alignment of reads back to assembled contigs
* {sampleName}.T{G|T}.sam: Alignment of reads back to contigs in SAM format
* {sampleName}.T{G|T}.sorted.bam: Alignment of reads back to contigs in BAM format
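
The flagstat report is plain text, so the fraction of reads mapped back
to the contigs can be pulled out with a short parser. This is a minimal
sketch, assuming the usual ``N + M in total`` and ``N + M mapped`` lines
of ``samtools flagstat``. It uses the QC-passed counts only, and
``mapping_rate`` is a hypothetical helper.

.. code-block:: python

    import re

    TOTAL_LINE = re.compile(r"^(\d+) \+ (\d+) in total")
    MAPPED_LINE = re.compile(r"^(\d+) \+ (\d+) mapped \(")


    def mapping_rate(flagstat_text):
        """Return mapped / total for QC-passed reads, or None if unavailable."""
        total = mapped = None
        for line in flagstat_text.splitlines():
            match = TOTAL_LINE.match(line)
            if match:
                total = int(match.group(1))
                continue
            match = MAPPED_LINE.match(line)
            if match:
                mapped = int(match.group(1))
        if not total or mapped is None:
            return None
        return mapped / total

Reading a report is then ``mapping_rate(open(path).read())``, where
``path`` points at one of the ``flagstat.txt`` files.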

Gene prediction
~~~~~~~~~~~~~~~~

* {sampleName}.megahit.proteins.fa.xml.out.xml: XML output of the alignment of predicted amino acid sequences to an NCBI database (we chose SwissProt, but any BLAST database can be substituted)
* diamond

  * {sampleName}.megahit.proteins.fa.xml.out: DIAMOND alignment output for the predicted protein sequences
  
* hmmer

  * {sampleName}.megahit.proteins.hmmer.out: Raw hmmer output aligned to Koalafam profiles
  * {sampleName}.megahit.proteins.hmmer.tblout: Parsed hmmer output aligned to Koalafam profiles
  
* prodigal

  * {sampleName}.megahit.gene_coordinates.gbk: Gene coordinates file (GenBank-like format)
  * {sampleName}.megahit.nucl_genes.fa: Predicted gene nucleotide sequences
  * {sampleName}.megahit.proteins.fa: Predicted gene amino acid sequences
  * {sampleName}.megahit.starts.txt: Prodigal starts file
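
Prodigal writes each gene's coordinates into the FASTA header, with
fields separated by `` # `` (identifier, start, end, strand). A minimal
sketch for reading them from the predicted protein file is shown below.
``read_gene_coordinates`` is a hypothetical helper, and headers that do
not follow this layout are skipped.

.. code-block:: python

    def read_gene_coordinates(lines):
        """Return (gene_id, start, end, strand) tuples from Prodigal FASTA headers."""
        genes = []
        for line in lines:
            if not line.startswith(">"):
                continue  # sequence line
            fields = line[1:].strip().split(" # ")
            if len(fields) < 4:
                continue  # header without Prodigal coordinates
            gene_id, start, end, strand = fields[:4]
            genes.append((gene_id, int(start), int(end), int(strand)))
        return genes


    example = [
        ">k141_1_1 # 3 # 350 # -1 # ID=1_1;partial=10",
        "MKTAYIAKQR",
    ]
    print(read_gene_coordinates(example))

With a real file, pass the open file handle, for example
``read_gene_coordinates(open("combined.megahit.proteins.fa"))``.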

Classification
~~~~~~~~~~~~~~~~

* taxon - These files are produced for each sample (pair of read files) if the inputs are assembled separately (as opposed to co-assembly).

  * LevelA.brite.counts.tsv: Level A Kegg Brite Hierarchical gene count
  * LevelB.brite.counts.tsv: Level B Kegg Brite Hierarchical gene count
  * LevelC.brite.counts.tsv: Level C Kegg Brite Hierarchical gene count
  * OTU.brite.tsv: Table with counts of taxonomic (organism) IDs of genes
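
The count tables can be loaded with the standard ``csv`` module. The
sketch below assumes each row holds a name in the first tab-separated
column and an integer count in the second. The exact column layout is an
assumption, so rows without a numeric count (for example a header) are
skipped. Both helper names are hypothetical.

.. code-block:: python

    import csv


    def read_brite_counts(lines):
        """Map each name in column one to the integer count in column two."""
        counts = {}
        for row in csv.reader(lines, delimiter="\t"):
            if len(row) < 2:
                continue
            try:
                counts[row[0]] = int(row[1])
            except ValueError:
                continue  # header row or non-numeric count
        return counts


    def nonzero(counts):
        """Keep only the entries with a count above zero."""
        return {name: count for name, count in counts.items() if count > 0}

With a real file, open it with ``open(path, newline="")`` and pass the
handle to ``read_brite_counts``.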

.. end-output

Output Tree
~~~~~~~~~~~

.. start-output-tree

Below is an example tree of the output directory:

::

  .
  ├── assembly
  │   ├── combined.57.fastg
  │   ├── combined.megahit.blast.out
  │   ├── combined.megahit.blast.parsed
  │   ├── combined.megahit.contigs.fa
  │   └── intermediate_contigs
  │       ├── combined.contigs.k27.fa
  │       ├── combined.contigs.k37.fa
  │       ├── combined.contigs.k47.fa
  │       ├── combined.contigs.k57.fa
  │       ├── combined.contigs.k67.fa
  │       ├── combined.contigs.k77.fa
  │       ├── combined.contigs.k87.fa
  │       └── combined.contigs.k97.fa
  ├── data
  │   ├── combined_R1.fq.gz
  │   └── combined_R2.fq.gz
  ├── geneprediction
  │   ├── combined.megahit.proteins.fa.xml.out.xml
  │   ├── diamond
  │   │   └── combined.megahit.proteins.fa.xml.out
  │   ├── hmmer
  │   │   ├── combined.megahit.proteins.hmmer.out
  │   │   └── combined.megahit.proteins.hmmer.tblout
  │   └── prodigal
  │       ├── combined.megahit.gene_coordinates.gbk
  │       ├── combined.megahit.nucl_genes.fa
  │       ├── combined.megahit.proteins.fa
  │       └── combined.megahit.starts.txt
  ├── qc
  │   ├── fastqc
  │   │   ├── SRR5808831.TG_R1_fastqc.zip
  │   │   ├── SRR5808831.TG_R2_fastqc.zip
  │   │   ├── SRR5808882.TG_R1_fastqc.zip
  │   │   └── SRR5808882.TG_R2_fastqc.zip
  │   ├── flash
  │   │   ├── SRR5808831.extendedFrags.fastq
  │   │   └── SRR5808882.extendedFrags.fastq
  │   ├── multiqc_report.html
  │   └── trimmed
  │       ├── SRR5808831.TG_R1.fq.gz
  │       ├── SRR5808831.TG_R2.fq.gz
  │       ├── SRR5808882.TG_R1.fq.gz
  │       └── SRR5808882.TG_R2.fq.gz
  ├── readalignment
  │   ├── SRR5808831.TG.flagstat.txt
  │   ├── SRR5808831.TG.sam
  │   ├── SRR5808831.TG.sorted.bam
  │   ├── SRR5808882.TG.flagstat.txt
  │   ├── SRR5808882.TG.sam
  │   └── SRR5808882.TG.sorted.bam
  └── taxon
      ├── LevelA.brite.counts.tsv
      ├── LevelB.brite.counts.tsv
      ├── LevelC.brite.counts.tsv
      └── OTU.brite.tsv
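
After a run, a quick check that the top-level folders from the example
tree are present can catch an incomplete run early. This is a minimal
sketch. The folder names are taken from the tree above,
``missing_output_dirs`` is a hypothetical helper, and the output location
depends on your options file.

.. code-block:: python

    from pathlib import Path

    EXPECTED_DIRS = ("assembly", "data", "geneprediction", "qc", "readalignment", "taxon")


    def missing_output_dirs(output_root):
        """Return the expected top-level folders that are absent."""
        root = Path(output_root)
        return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]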

.. end-output-tree

Please refer to the
`documentation <https://parkvilledata.github.io/MetaGenePipe>`__ for
how to run the workflow.

Citation and Attribution
========================

.. start-citation

MetaGenePipe was developed at the Melbourne Data Analytics Platform (MDAP).

An article about the software package is published in the `Journal of Open Source Software <https://doi.org/10.21105/joss.04851>`__:

  Shaban, Bobbie, Maria del Mar Quiroga, Robert Turnbull, Edoardo Tescari, Kim-Anh Lê Cao, Heroen Verbruggen. (2023). 
  MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis. Journal of Open Source Software, 
  8 (82), 4851. doi: 10.21105/joss.04851

Here are the citation details in BibTeX format:

.. code-block:: BibTeX

  @article{
      Shaban2023, 
      doi = {10.21105/joss.04851}, 
      url = {https://doi.org/10.21105/joss.04851}, 
      year = {2023}, 
      publisher = {The Open Journal}, 
      volume = {8}, 
      number = {82}, 
      pages = {4851}, 
      author = {Babak Shaban and Maria del Mar Quiroga and Robert Turnbull and Edoardo Tescari and Kim-Anh Lê Cao and Heroen Verbruggen}, 
      title = {{MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis}}, 
      journal = {Journal of Open Source Software} 
  }


If you create a derivative work from this software package, attribution
should be included as follows:

   This is a derivative work of MetaGenePipe, originally released under
   the Apache 2.0 license, developed by Bobbie Shaban, Mar Quiroga,
   Robert Turnbull and Edoardo Tescari at Melbourne Data Analytics
   Platform (MDAP) at the University of Melbourne.


.. end-citation


Contributing
========================

If you would like to contribute to this software package, please make sure you follow the code of conduct.


Owner

  • Name: Parkville Data Workflow User Group
  • Login: ParkvilleData
  • Kind: user

Parkville Data Workflow User Group - A data workflow knowledge base.

JOSS Publication

MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis
Published
February 20, 2023
Volume 8, Issue 82, Page 4851
Authors
Babak Shaban ORCID
Melbourne Data Analytics Platform, The University of Melbourne
Maria del Mar Quiroga ORCID
Melbourne Data Analytics Platform, The University of Melbourne
Robert Turnbull ORCID
Melbourne Data Analytics Platform, The University of Melbourne
Edoardo Tescari ORCID
Melbourne Data Analytics Platform, The University of Melbourne
Kim-Anh Lê Cao ORCID
School of Mathematics and Statistics, Melbourne Integrative Genomics, The University of Melbourne
Heroen Verbruggen ORCID
School of BioSciences, The University of Melbourne
Editor
Jacob Schreiber ORCID
Tags
metagenomics WDL Singularity Containerization

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers


All Time
  • Total Commits: 323
  • Total Committers: 13
  • Avg Commits per committer: 24.846
  • Development Distribution Score (DDS): 0.598
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Maria del Mar Quiroga m****a@u****u 130
Robert Turnbull r****l@u****u 75
bshaban b****n@u****u 63
Bobbie Shaban b****n@u****u 41
etescari e****i@u****u 3
Ubuntu u****u@v****l 3
Edoardo Tescari e****i@s****u 2
Heroen Verbruggen h****n@g****m 1
etescari e****i@u****u 1
Ubuntu u****u@m****l 1
Robert Turnbull r****l@s****u 1
Robert Turnbull r****l@s****u 1
Bobbie Shaban (unimelb) b****n@s****u 1

Issues and Pull Requests


All Time
  • Total issues: 0
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 minutes
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • mariadelmarq (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/docs.yml actions
  • JamesIves/github-pages-deploy-action 4.1.5 composite
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
.github/workflows/draft-pdf.yml actions
  • actions/checkout v2 composite
  • actions/upload-artifact v1 composite
  • openjournals/openjournals-draft-action master composite
.github/workflows/testing.yml actions
  • actions/checkout v2 composite
  • eWaterCycle/setup-singularity v7 composite
docs/poetry.lock pypi
  • attrs 21.4.0 develop
  • beautifulsoup4 4.11.1 develop
  • bleach 5.0.1 develop
  • cffi 1.15.1 develop
  • defusedxml 0.7.1 develop
  • entrypoints 0.4 develop
  • fastjsonschema 2.16.1 develop
  • importlib-resources 5.9.0 develop
  • jsonschema 4.15.0 develop
  • jupyter-client 7.3.5 develop
  • jupyter-core 4.11.1 develop
  • jupyterlab-pygments 0.2.2 develop
  • livereload 2.6.3 develop
  • lxml 4.9.1 develop
  • markdown-it-py 1.1.0 develop
  • mdit-py-plugins 0.2.8 develop
  • mistune 2.0.4 develop
  • myst-parser 0.15.2 develop
  • nbclient 0.6.7 develop
  • nbconvert 7.0.0 develop
  • nbformat 5.4.0 develop
  • nbsphinx 0.8.9 develop
  • nest-asyncio 1.5.5 develop
  • pandocfilters 1.5.0 develop
  • pkgutil_resolve_name 1.3.10 develop
  • py 1.11.0 develop
  • pycparser 2.21 develop
  • pyrsistent 0.18.1 develop
  • python-dateutil 2.8.2 develop
  • pywin32 304 develop
  • pyzmq 23.2.1 develop
  • soupsieve 2.3.2.post1 develop
  • sphinx-autobuild 2021.3.14 develop
  • sphinx-copybutton 0.4.0 develop
  • sphinx-rtd-theme 1.0.0 develop
  • tinycss2 1.1.1 develop
  • tornado 6.2 develop
  • traitlets 5.3.0 develop
  • webencodings 0.5.1 develop
  • Babel 2.10.3
  • Jinja2 3.1.2
  • MarkupSafe 2.1.1
  • PyYAML 6.0
  • Pygments 2.13.0
  • Sphinx 4.5.0
  • alabaster 0.7.12
  • certifi 2022.6.15
  • charset-normalizer 2.1.1
  • colorama 0.4.5
  • docutils 0.17.1
  • idna 3.3
  • imagesize 1.4.1
  • importlib-metadata 4.12.0
  • latexcodec 2.0.1
  • packaging 21.3
  • pybtex 0.24.0
  • pybtex-docutils 1.0.2
  • pyparsing 3.0.9
  • pytz 2022.2.1
  • requests 2.28.1
  • six 1.16.0
  • snowballstemmer 2.2.0
  • sphinxcontrib-applehelp 1.0.2
  • sphinxcontrib-bibtex 2.5.0
  • sphinxcontrib-devhelp 1.0.2
  • sphinxcontrib-htmlhelp 2.0.0
  • sphinxcontrib-jsmath 1.0.1
  • sphinxcontrib-qthelp 1.0.3
  • sphinxcontrib-serializinghtml 1.1.5
  • typing-extensions 4.3.0
  • urllib3 1.26.12
  • zipp 3.8.1
docs/pyproject.toml pypi
  • Sphinx ^4.2.0 develop
  • myst-parser ^0.15.2 develop
  • nbsphinx ^0.8.7 develop
  • sphinx-autobuild ^2021.3.14 develop
  • sphinx-copybutton ^0.4.0 develop
  • sphinx-rtd-theme ^1.0.0 develop
  • python ^3.7.1
  • requests ^2.28.1
  • sphinxcontrib-bibtex ^2.4.1
  • urllib3 ^1.26.12
setup.py pypi