MetaGenePipe

MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis - Published in JOSS (2023)

https://github.com/parkvilledata/metagenepipe

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

CITATION.cff file: not found
codemeta.json file: found
.zenodo.json file: found
DOI references: found 9 DOI reference(s) in README and JOSS metadata
Academic publication links: joss.theoj.org
Committers with academic emails: 10 of 13 committers (76.9%) from academic institutions
Institutional organization owner: not found
JOSS paper metadata: published in Journal of Open Source Software

Scientific Fields

Biology Life Sciences - 34% confidence

Last synced: 6 months ago

Repository

MGP

Basic Info

Host: GitHub
Owner: ParkvilleData
License: apache-2.0
Language: Jupyter Notebook
Default Branch: master
Size: 53.1 MB

Statistics

Stars: 2
Watchers: 4
Forks: 1
Open Issues: 0
Releases: 1

Created almost 4 years ago · Last pushed almost 3 years ago

Metadata Files

Readme Contributing License

README.rst

MetaGenePipe
============

.. start-badges

|pipeline| |docs| |Contributor Covenant| |joss|

.. |pipeline| image:: https://github.com/parkvilledata/MetaGenePipe/actions/workflows/testing.yml/badge.svg
.. |docs| image:: https://github.com/parkvilledata/MetaGenePipe/actions/workflows/docs.yml/badge.svg
   :target: https://parkvilledata.github.io/MetaGenePipe
.. |Contributor Covenant| image:: https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg
   :target: https://www.contributor-covenant.org/version/2/1/code_of_conduct/
.. |joss| image:: https://joss.theoj.org/papers/10.21105/joss.04851/status.svg
   :target: https://doi.org/10.21105/joss.04851

.. end-badges


MetaGenePipe (MGP) is an efficient, flexible, portable, and scalable
metagenomics pipeline that uses performant bioinformatics software
suites and genomic databases to create an accurate taxonomic and
functional characterization of the prokaryotic fraction of sequenced
microbiomes.

Microorganisms such as bacteria, viruses, archaea, and fungi are
ubiquitous in our environment. The study of microorganisms and their
full genomes has been enabled by advances in culture-independent
techniques and high-throughput sequencing technologies. Whole-genome
shotgun metagenomic sequencing (WGS) empowers researchers to study the
biological functions of microorganisms and how their presence affects
human disease or a specific ecosystem. However, advanced and novel
bioinformatics techniques are required to process the data into a
suitable format, and there is no universally accepted, standardized
bioinformatics framework that a computational microbiologist can use
effectively.

MGP is written in WDL and thus differs from existing assembly-based
workflow pipelines such as Atlas (which uses Snakemake) and Muffin
(which is written in Nextflow). MGP is an example of WDL and
containerization best practice. Like nf-core/mag, MGP supports
co-assembly of multiple metagenome samples.

MGP overcomes traditional portability obstacles by using Singularity
containers, and increases flexibility of research focus by using the
DIAMOND aligner, which can create bespoke databases in a matter of
minutes. Researchers often have institutional datasets resulting from
previous research, which can be incorporated into MGP in place of the
default DIAMOND and BLAST databases. Once these databases have been
created, updating the relevant line in the configuration file allows
the workflow to use the new database. While MGP is focussed on
prokaryotes, it can easily be adapted to eukaryotes or viruses by
replacing the prokaryotic gene prediction software, Prodigal, with
eukaryotic gene prediction software such as GeneMark-EP+ or EuGene, or
with a gene finding tool for viruses.
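
A custom database of this kind is built with DIAMOND's ``makedb``
subcommand. The sketch below only assembles and, optionally, runs that
command from Python. The FASTA and database names are hypothetical
placeholders, and the configuration entry that points MGP at the new
database is not shown because it depends on your configuration file.

.. code-block:: python

    import shlex
    import subprocess
    from pathlib import Path


    def makedb_command(protein_fasta, db_prefix):
        """Return the argument list for building a DIAMOND database."""
        return [
            "diamond", "makedb",
            "--in", str(Path(protein_fasta)),  # protein FASTA to index
            "--db", str(Path(db_prefix)),      # output database prefix
        ]


    def build_database(protein_fasta, db_prefix):
        """Run ``diamond makedb``; DIAMOND must be available on the PATH."""
        subprocess.run(makedb_command(protein_fasta, db_prefix), check=True)


    # Hypothetical file names, used only to show the resulting command line.
    print(shlex.join(makedb_command("institutional_proteins.faa", "institutional_db")))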

A microbiome is the collection of all microbes, such as bacteria, in a
given environment. The environment can be human-related, e.g. the human
gut, or natural, such as soil or water. The composition of the
microbiome, in terms of the percentages of bacterial species present,
can help determine the effect the bacteria have on that environment.

Microbial communities, which make up microbiomes, have the potential to
be major regulators of biogeochemical processes and can determine
ecosystem function. Understanding their composition, in terms of both
diversity and taxonomic profiles, is useful for determining the
potential effects of changes in environmental conditions. For example,
the taxonomic composition of soil samples can be measured under
baseline (regular) conditions and then compared with samples from areas
experiencing distress from physical ecological changes such as mining
operations.

MGP is a WDL workflow built from existing bioinformatics tools, with
the aim of giving users extra flexibility when creating taxonomic
profiles of their microbiome samples. MGP processes raw microbiome
(metagenomic) samples, performs quality checks to remove poor-quality
sequences, and assembles the samples into longer sequences (contigs),
which are then searched for open reading frames. These open reading
frames are potential protein sequences, which are aligned to major
protein databases. The matches from the alignment are parsed and, using
in-house scripts, assigned taxonomic IDs using KEGG's BRITE hierarchy.
Example output from MGP can be seen in the table below.

======================================================== =====
Pathway                                                  Count
======================================================== =====
09121 Transcription                                      0
09182 Protein families: genetic information processing   0
09183 Protein families: signaling and cellular processes 0
09192 Unclassified: genetic information processing       0
09194 Poorly characterized                               0
Brite Hierarchies                                        6
DNA repair and recombination proteins [BR:ko03400]       0
DNA replication proteins [BR:ko03032]                    0
Function unknown                                         0
Genetic Information Processing                           1
Not Included in Pathway or Brite                         3
Prokaryotic defense system [BR:ko02048]                  0
RNA polymerase [PATH:ko03020]                            0
Replication and repair                                   0
Transcription machinery [BR:ko03021]                     0
Transporters [BR:ko02000]                                0
======================================================== =====
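
The counting behind a table like this one can be sketched in a few
lines of Python. This is an illustrative approximation, not MGP's
in-house script: the ``KO_TO_BRITE`` mapping and the list of hits are
made-up examples, and a real run would derive the mapping from the KEGG
BRITE hierarchy files.

.. code-block:: python

    from collections import Counter

    # Illustrative mapping: KO identifier -> (level A, level B, level C).
    KO_TO_BRITE = {
        "K03040": ("Genetic Information Processing",
                   "09121 Transcription",
                   "RNA polymerase [PATH:ko03020]"),
        "K03043": ("Genetic Information Processing",
                   "09121 Transcription",
                   "RNA polymerase [PATH:ko03020]"),
        "K02003": ("Brite Hierarchies",
                   "09183 Protein families: signaling and cellular processes",
                   "Transporters [BR:ko02000]"),
    }


    def count_brite_levels(ko_hits, ko_to_brite):
        """Count how many hits fall into each label at levels A, B and C."""
        counts = {"A": Counter(), "B": Counter(), "C": Counter()}
        for ko in ko_hits:
            levels = ko_to_brite.get(ko)
            if levels is None:
                continue  # hit has no entry in the mapping
            for level_name, label in zip("ABC", levels):
                counts[level_name][label] += 1
        return counts


    # Hypothetical hits; "K99999" is deliberately absent from the mapping.
    example_counts = count_brite_levels(
        ["K03040", "K03043", "K02003", "K99999"], KO_TO_BRITE
    )
    for label, count in sorted(example_counts["A"].items()):
        print(f"{label}\t{count}")

Keeping one counter per level mirrors the separate Level A, B and C
count files listed in the output section.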

More details can be found in the `documentation <https://parkvilledata.github.io/MetaGenePipe>`__.

Installation
====================

To install MetaGenePipe, clone the repository:

.. code-block:: bash

    git clone https://github.com/ParkvilleData/MetaGenePipe.git
    cd MetaGenePipe

MetaGenePipe requires Java, Singularity, and other dependencies to run.
See the installation instructions in the `documentation <https://parkvilledata.github.io/MetaGenePipe>`__ for further information.

Usage
======

You can start the workflow with the following command:

.. code-block:: bash

   java -Dconfig.file=./metaGenePipe.config -jar cromwell-latest.jar run metaGenePipe.wdl -i metaGenePipe.json -o metaGenePipe.options.json

The pipeline has been set up to run against the Swiss-Prot database. We have supplied sample FASTQ files consisting of 100,000 reads so that the pipeline can be tested.
You can modify the input file ``input_file.txt`` to reflect your sample files.

Read the `documentation <https://parkvilledata.github.io/MetaGenePipe>`__ for more information about usage and modifying the configuration files.
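
If you prefer to launch the workflow from a script, a small pre-flight
check can catch a missing file before Cromwell starts. The sketch below
is one possible wrapper, not part of MGP. It assumes that the five files
named in the command above sit together in one directory; the function
names are hypothetical.

.. code-block:: python

    import subprocess
    from pathlib import Path

    # File names taken from the command shown above.
    REQUIRED_FILES = (
        "metaGenePipe.config",
        "cromwell-latest.jar",
        "metaGenePipe.wdl",
        "metaGenePipe.json",
        "metaGenePipe.options.json",
    )


    def missing_inputs(workdir="."):
        """Return the required files that are absent from ``workdir``."""
        base = Path(workdir)
        return [name for name in REQUIRED_FILES if not (base / name).is_file()]


    def cromwell_command(workdir="."):
        """Build the same Cromwell invocation as the shell command above."""
        base = Path(workdir)
        return [
            "java", f"-Dconfig.file={base / 'metaGenePipe.config'}",
            "-jar", str(base / "cromwell-latest.jar"),
            "run", str(base / "metaGenePipe.wdl"),
            "-i", str(base / "metaGenePipe.json"),
            "-o", str(base / "metaGenePipe.options.json"),
        ]


    def run_pipeline(workdir="."):
        """Run the workflow, refusing to start if any input is missing."""
        missing = missing_inputs(workdir)
        if missing:
            raise FileNotFoundError(f"missing files: {', '.join(missing)}")
        subprocess.run(cromwell_command(workdir), check=True)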

Output
======
.. start-output

There are five main output folders: qc (quality control), assembly, readalignment, geneprediction, and taxon (classification). There is also one intermediary folder, data, which contains the samples for assembly after they have been run through TrimGalore and, if co-assembly is specified, concatenated.

Quality control
~~~~~~~~~~~~~~~~

* trimmed

  * {sampleName}.T{G|T}_R{1|2}.fq.gz: Trimmed output for each of the individual sample files, TG if the chosen trimmer is TrimGalore, and TT if it is Trimmomatic

* fastqc

  * {sampleName}.T{G|T}_R{1|2}_fastqc.zip: FastQC output for each of the individual sample files

* multiqc_report.html: Combined report of all FastQC files
* flash

  * {sampleName}.extendedFrags.fastq: The merged reads
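
When post-processing these files, the naming pattern above can be
decoded with a regular expression. The sketch below is an illustration
based only on the pattern ``{sampleName}.T{G|T}_R{1|2}.fq.gz``; the
function name is hypothetical.

.. code-block:: python

    import re

    # {sampleName}.T{G|T}_R{1|2}.fq.gz
    TRIMMED_NAME = re.compile(
        r"^(?P<sample>.+)\.T(?P<trimmer>[GT])_R(?P<read>[12])\.fq\.gz$"
    )
    TRIMMERS = {"G": "TrimGalore", "T": "Trimmomatic"}


    def parse_trimmed_name(filename):
        """Split a trimmed-read file name into sample, trimmer and read number.

        Returns None when the name does not follow the pattern.
        """
        match = TRIMMED_NAME.match(filename)
        if match is None:
            return None
        return {
            "sample": match.group("sample"),
            "trimmer": TRIMMERS[match.group("trimmer")],
            "read": int(match.group("read")),
        }


    print(parse_trimmed_name("SRR5808831.TG_R1.fq.gz"))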

Data
~~~~~~~~~~~~~~~~

* {sampleName}_R{1|2}.fq.gz: Sample files after trimming and/or concatenating for co-assembly. If files are concatenated for co-assembly, the sample name is set to ``combined``.

Assembly
~~~~~~~~~~~~~~~~

* {sampleName}.megahit.contigs.fa: Final assembled contigs
* {sampleName}.{kmer}.fastg: Assembly graph for the {kmer} assembled contigs, where {kmer} is the k-mer size whose contig file in the ``intermediate_contigs`` folder is the largest
* intermediate_contigs: A folder containing all intermediate assembled contigs, named {sampleName}.contigs.k{kmer}.fa
* {sampleName}.megahit.blast.out: Raw BLAST results for the contigs
* {sampleName}.megahit.blast.parsed: BLAST results parsed into TSV format for easy viewing
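
The k-mer selection described for the assembly graph can be approximated
as follows. This is a sketch of the stated rule (largest intermediate
contig file wins), not MGP's own code. It assumes the intermediate files
end in ``.contigs.k{kmer}.fa``, as in the example output tree, and
breaks ties in favour of the larger k-mer, which is an arbitrary choice.

.. code-block:: python

    import re
    from pathlib import Path

    KMER_PATTERN = re.compile(r"\.contigs\.k(\d+)\.fa$")


    def largest_kmer(intermediate_dir):
        """Return the k-mer size whose contig file is largest, or None."""
        best = None  # (file size in bytes, k-mer size)
        for path in Path(intermediate_dir).iterdir():
            match = KMER_PATTERN.search(path.name)
            if match is None:
                continue  # not an intermediate contig file
            candidate = (path.stat().st_size, int(match.group(1)))
            if best is None or candidate > best:
                best = candidate
        return None if best is None else best[1]

For the example tree, ``largest_kmer("assembly/intermediate_contigs")``
would be expected to return the k-mer used in the graph file name.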

Read alignment
~~~~~~~~~~~~~~~~

* {sampleName}.T{G|T}.flagstat.txt: Samtools flagstat output. Reports statistics on alignment of reads back to assembled contigs
* {sampleName}.T{G|T}.sam: Alignment of reads back to contigs in SAM format
* {sampleName}.T{G|T}.sorted.bam: Alignment of reads back to contigs in BAM format
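
The flagstat report is plain text, so the overall mapping rate can be
extracted with a short script. The sketch below assumes the usual
``samtools flagstat`` layout, in which each line starts with
``<QC-passed> + <QC-failed>``; the example report is invented for
illustration and is not real pipeline output.

.. code-block:: python

    import re

    TOTAL = re.compile(r"^(\d+) \+ (\d+) in total", re.MULTILINE)
    # Anchoring at the line start skips the "primary mapped" line.
    MAPPED = re.compile(r"^(\d+) \+ (\d+) mapped \(", re.MULTILINE)


    def summarise_flagstat(text):
        """Return total reads, mapped reads and mapped fraction (QC-passed only)."""
        total = TOTAL.search(text)
        mapped = MAPPED.search(text)
        if total is None or mapped is None:
            raise ValueError("text does not look like samtools flagstat output")
        total_reads = int(total.group(1))
        mapped_reads = int(mapped.group(1))
        fraction = mapped_reads / total_reads if total_reads else 0.0
        return {"total": total_reads, "mapped": mapped_reads, "fraction": fraction}


    # Invented example report.
    EXAMPLE_REPORT = """\
    200000 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    190000 + 0 mapped (95.00% : N/A)
    200000 + 0 paired in sequencing
    """.replace("    ", "")

    print(summarise_flagstat(EXAMPLE_REPORT))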

Gene prediction
~~~~~~~~~~~~~~~~

* {sampleName}.megahit.proteins.fa.xml.out.xml: XML output of the alignment of predicted amino acid sequences to an NCBI database (we chose Swiss-Prot, but any BLAST database can be substituted)
* diamond

  * {sampleName}.megahit.proteins.fa.xml.out: DIAMOND alignment output for the predicted protein sequences
  
* hmmer

  * {sampleName}.megahit.proteins.hmmer.out: Raw HMMER output aligned to Koalafam profiles
  * {sampleName}.megahit.proteins.hmmer.tblout: Parsed HMMER output aligned to Koalafam profiles
  
* prodigal

  * {sampleName}.megahit.gene_coordinates.gbk: Gene coordinates file (GenBank-like file)
  * {sampleName}.megahit.nucl_genes.fa: Predicted gene nucleotide sequences
  * {sampleName}.megahit.proteins.fa: Predicted gene amino acid sequences
  * {sampleName}.megahit.starts.txt: Prodigal starts file
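
The HMMER ``tblout`` file listed above is whitespace-delimited with
``#`` comment lines, so it is easy to reduce to one best hit per
sequence. The sketch below assumes the standard per-target table, in
which the first column is the target name, the third the query name, the
fifth the full-sequence E-value and the sixth the score. Whether the
protein appears as target or query depends on the HMMER program used,
which is not stated here, so the grouping column is a parameter.

.. code-block:: python

    def parse_tblout(lines):
        """Parse HMMER tblout lines into dictionaries, skipping comments."""
        hits = []
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.split(maxsplit=18)
            if len(fields) < 6:
                continue  # malformed or truncated line
            hits.append({
                "target": fields[0],
                "query": fields[2],
                "evalue": float(fields[4]),
                "score": float(fields[5]),
            })
        return hits


    def best_hit_by(hits, key):
        """Keep the lowest E-value hit for each value of ``key``."""
        best = {}
        for hit in hits:
            current = best.get(hit[key])
            if current is None or hit["evalue"] < current["evalue"]:
                best[hit[key]] = hit
        return best

With a file handle, ``best_hit_by(parse_tblout(handle), "target")``
groups by the first column; pass ``"query"`` to group by the third.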

Classification
~~~~~~~~~~~~~~~~

* taxon - These files are produced for each sample (pair of read files) if the inputs are assembled separately (as opposed to co-assembly).

  * LevelA.brite.counts.tsv: Level A KEGG BRITE hierarchy gene counts
  * LevelB.brite.counts.tsv: Level B KEGG BRITE hierarchy gene counts
  * LevelC.brite.counts.tsv: Level C KEGG BRITE hierarchy gene counts
  * OTU.brite.tsv: Table with counts of taxonomic (organism) IDs of genes

.. end-output

Output Tree
~~~~~~~~~~~

.. start-output-tree

Below is an example tree of the output directory:

::

  .
  ├── assembly
  │   ├── combined.57.fastg
  │   ├── combined.megahit.blast.out
  │   ├── combined.megahit.blast.parsed
  │   ├── combined.megahit.contigs.fa
  │   └── intermediate_contigs
  │       ├── combined.contigs.k27.fa
  │       ├── combined.contigs.k37.fa
  │       ├── combined.contigs.k47.fa
  │       ├── combined.contigs.k57.fa
  │       ├── combined.contigs.k67.fa
  │       ├── combined.contigs.k77.fa
  │       ├── combined.contigs.k87.fa
  │       └── combined.contigs.k97.fa
  ├── data
  │   ├── combined_R1.fq.gz
  │   └── combined_R2.fq.gz
  ├── geneprediction
  │   ├── combined.megahit.proteins.fa.xml.out.xml
  │   ├── diamond
  │   │   └── combined.megahit.proteins.fa.xml.out
  │   ├── hmmer
  │   │   ├── combined.megahit.proteins.hmmer.out
  │   │   └── combined.megahit.proteins.hmmer.tblout
  │   └── prodigal
  │       ├── combined.megahit.gene_coordinates.gbk
  │       ├── combined.megahit.nucl_genes.fa
  │       ├── combined.megahit.proteins.fa
  │       └── combined.megahit.starts.txt
  ├── qc
  │   ├── fastqc
  │   │   ├── SRR5808831.TG_R1_fastqc.zip
  │   │   ├── SRR5808831.TG_R2_fastqc.zip
  │   │   ├── SRR5808882.TG_R1_fastqc.zip
  │   │   └── SRR5808882.TG_R2_fastqc.zip
  │   ├── flash
  │   │   ├── SRR5808831.extendedFrags.fastq
  │   │   └── SRR5808882.extendedFrags.fastq
  │   ├── multiqc_report.html
  │   └── trimmed
  │       ├── SRR5808831.TG_R1.fq.gz
  │       ├── SRR5808831.TG_R2.fq.gz
  │       ├── SRR5808882.TG_R1.fq.gz
  │       └── SRR5808882.TG_R2.fq.gz
  ├── readalignment
  │   ├── SRR5808831.TG.flagstat.txt
  │   ├── SRR5808831.TG.sam
  │   ├── SRR5808831.TG.sorted.bam
  │   ├── SRR5808882.TG.flagstat.txt
  │   ├── SRR5808882.TG.sam
  │   └── SRR5808882.TG.sorted.bam
  └── taxon
      ├── LevelA.brite.counts.tsv
      ├── LevelB.brite.counts.tsv
      ├── LevelC.brite.counts.tsv
      └── OTU.brite.tsv

.. end-output-tree
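
After a run, a quick check that the top-level folders from the example
tree exist can flag an incomplete run early. The sketch below is a
convenience helper rather than part of MGP. The folder names are taken
from the tree above, and the function name is hypothetical.

.. code-block:: python

    from pathlib import Path

    # Top-level folders shown in the example output tree.
    EXPECTED_DIRS = (
        "assembly",
        "data",
        "geneprediction",
        "qc",
        "readalignment",
        "taxon",
    )


    def missing_output_dirs(output_root):
        """Return the expected output folders that are absent under ``output_root``."""
        root = Path(output_root)
        return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]

An empty list means every expected folder is present; it does not
confirm that the files inside them are complete.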

Please refer to the
`documentation <https://parkvilledata.github.io/MetaGenePipe>`__ for
instructions on how to run the pipeline.

Citation and Attribution
========================

.. start-citation

MetaGenePipe was developed at the Melbourne Data Analytics Platform (MDAP).

An article about the software package is published in the `Journal of Open Source Software <https://doi.org/10.21105/joss.04851>`__:

  Shaban, Bobbie, Maria del Mar Quiroga, Robert Turnbull, Edoardo Tescari, Kim-Anh Lê Cao, Heroen Verbruggen. (2023). 
  MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis. Journal of Open Source Software, 
  8 (82), 4851. doi: 10.21105/joss.04851

Here are the citation details in BibTeX format:

.. code-block:: BibTeX

  @article{
      Shaban2023, 
      doi = {10.21105/joss.04851}, 
      url = {https://doi.org/10.21105/joss.04851}, 
      year = {2023}, 
      publisher = {The Open Journal}, 
      volume = {8}, 
      number = {82}, 
      pages = {4851}, 
      author = {Babak Shaban and Maria del Mar Quiroga and Robert Turnbull and Edoardo Tescari and Kim-Anh Lê Cao and Heroen Verbruggen}, 
      title = {{MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis}}, 
      journal = {Journal of Open Source Software} 
  }


If you create a derivative work from this software package, attribution
should be included as follows:

   This is a derivative work of MetaGenePipe, originally released under
   the Apache 2.0 license, developed by Bobbie Shaban, Mar Quiroga,
   Robert Turnbull and Edoardo Tescari at Melbourne Data Analytics
   Platform (MDAP) at the University of Melbourne.


.. end-citation


Contributing
========================

If you would like to contribute to this software package, please make sure you follow the code of conduct.

Owner

Name: Parkville Data Workflow User Group
Login: ParkvilleData
Kind: user

Repositories: 2
Profile: https://github.com/ParkvilleData

Parkville Data Workflow User Group - A data workflow knowledge base.

JOSS Publication

MetaGenePipe: An Automated, Portable Pipeline for Contig-based Functional and Taxonomic Analysis

Published

February 20, 2023

DOI

10.21105/joss.04851

Volume 8, Issue 82, Page 4851

Authors

Babak Shaban

Melbourne Data Analytics Platform, The University of Melbourne

Maria del Mar Quiroga

Melbourne Data Analytics Platform, The University of Melbourne

Robert Turnbull

Melbourne Data Analytics Platform, The University of Melbourne

Edoardo Tescari

Melbourne Data Analytics Platform, The University of Melbourne

Kim-Anh Lê Cao

School of Mathematics and Statistics, Melbourne Integrative Genomics, The University of Melbourne

Heroen Verbruggen

School of BioSciences, The University of Melbourne

Editor

Jacob Schreiber

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 323
Total Committers: 13
Avg Commits per committer: 24.846
Development Distribution Score (DDS): 0.598

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Maria del Mar Quiroga	m**a@u**u	130
Robert Turnbull	r**l@u**u	75
bshaban	b**n@u**u	63
Bobbie Shaban	b**n@u**u	41
etescari	e**i@u**u	3
Ubuntu	u**u@v**l	3
Edoardo Tescari	e**i@s**u	2
Heroen Verbruggen	h**n@g**m	1
etescari	e**i@u**u	1
Ubuntu	u**u@m**l	1
Robert Turnbull	r**l@s**u	1
Robert Turnbull	r**l@s**u	1
Bobbie Shaban (unimelb)	b**n@s**u	1

Committer Domains (Top 20 + Academic)

unimelb.edu.au: 6
spartan-login3.hpc.unimelb.edu.au: 2
spartan.hpc.unimelb.edu.au: 1
spartan-login2.hpc.unimelb.edu.au: 1
metagenomics-download.novalocal: 1
verbruggen.novalocal: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 0
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: 3 minutes
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Pull Request Authors

mariadelmarq (2)


Dependencies

.github/workflows/docs.yml actions

JamesIves/github-pages-deploy-action 4.1.5 composite
actions/checkout v3 composite
actions/setup-python v3 composite

.github/workflows/draft-pdf.yml actions

actions/checkout v2 composite
actions/upload-artifact v1 composite
openjournals/openjournals-draft-action master composite

.github/workflows/testing.yml actions

actions/checkout v2 composite
eWaterCycle/setup-singularity v7 composite

docs/poetry.lock pypi

attrs 21.4.0 develop
beautifulsoup4 4.11.1 develop
bleach 5.0.1 develop
cffi 1.15.1 develop
defusedxml 0.7.1 develop
entrypoints 0.4 develop
fastjsonschema 2.16.1 develop
importlib-resources 5.9.0 develop
jsonschema 4.15.0 develop
jupyter-client 7.3.5 develop
jupyter-core 4.11.1 develop
jupyterlab-pygments 0.2.2 develop
livereload 2.6.3 develop
lxml 4.9.1 develop
markdown-it-py 1.1.0 develop
mdit-py-plugins 0.2.8 develop
mistune 2.0.4 develop
myst-parser 0.15.2 develop
nbclient 0.6.7 develop
nbconvert 7.0.0 develop
nbformat 5.4.0 develop
nbsphinx 0.8.9 develop
nest-asyncio 1.5.5 develop
pandocfilters 1.5.0 develop
pkgutil_resolve_name 1.3.10 develop
py 1.11.0 develop
pycparser 2.21 develop
pyrsistent 0.18.1 develop
python-dateutil 2.8.2 develop
pywin32 304 develop
pyzmq 23.2.1 develop
soupsieve 2.3.2.post1 develop
sphinx-autobuild 2021.3.14 develop
sphinx-copybutton 0.4.0 develop
sphinx-rtd-theme 1.0.0 develop
tinycss2 1.1.1 develop
tornado 6.2 develop
traitlets 5.3.0 develop
webencodings 0.5.1 develop
Babel 2.10.3
Jinja2 3.1.2
MarkupSafe 2.1.1
PyYAML 6.0
Pygments 2.13.0
Sphinx 4.5.0
alabaster 0.7.12
certifi 2022.6.15
charset-normalizer 2.1.1
colorama 0.4.5
docutils 0.17.1
idna 3.3
imagesize 1.4.1
importlib-metadata 4.12.0
latexcodec 2.0.1
packaging 21.3
pybtex 0.24.0
pybtex-docutils 1.0.2
pyparsing 3.0.9
pytz 2022.2.1
requests 2.28.1
six 1.16.0
snowballstemmer 2.2.0
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-bibtex 2.5.0
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
typing-extensions 4.3.0
urllib3 1.26.12
zipp 3.8.1

docs/pyproject.toml pypi

Sphinx ^4.2.0 develop
myst-parser ^0.15.2 develop
nbsphinx ^0.8.7 develop
sphinx-autobuild ^2021.3.14 develop
sphinx-copybutton ^0.4.0 develop
sphinx-rtd-theme ^1.0.0 develop
python ^3.7.1
requests ^2.28.1
sphinxcontrib-bibtex ^2.4.1
urllib3 ^1.26.12

setup.py pypi
