mag_pipeline

🐍 🧬 Turning my metagenomic MAG (metagenome-assembled genome) pipeline into a snakemake pipeline for increased reproducibility and scalability.

https://github.com/patriciatran/mag_pipeline

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: nature.com
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.9%) to scientific vocabulary

Keywords

metagenomics pipeline
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: patriciatran
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 200 KB
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
metagenomics pipeline
Created about 3 years ago · Last pushed about 3 years ago
Metadata Files
Readme License Citation

README.md

About:

I built this Snakemake pipeline to show how FASTQ files can be taken all the way to a set of good-quality, dereplicated MAGs.

MAG pipeline logo

The general steps are:

  • Quality-check the reads using FastQC
  • Assemble the FASTQ reads using SPAdes
  • Bin the MAGs using metaWRAP
  • Refine the MAGs using DAS Tool
  • Dereplicate the MAGs (if relevant) using dRep
  • Determine MAG quality using CheckM
  • Select only MAGs that meet the MIMAG quality standard for further analyses (e.g. >50% complete, <10% contamination)
  • Assign taxonomy to this MAG set using GTDB-Tk
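The quality-selection step can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the tab-separated layout and the column names `bin_id`, `completeness` and `contamination` are hypothetical stand-ins for whatever quality table the pipeline produces, and only the two thresholds come from the step list above.

```python
import csv
import io

# Thresholds quoted in the step list (>50% complete, <10% contamination).
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0


def select_mimag_bins(table_text, min_completeness=MIN_COMPLETENESS,
                      max_contamination=MAX_CONTAMINATION):
    """Return the IDs of bins that pass both thresholds.

    `table_text` is tab-separated text with a header row. The column names
    `bin_id`, `completeness` and `contamination` are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    kept = []
    for row in reader:
        completeness = float(row["completeness"])
        contamination = float(row["contamination"])
        # Strict inequalities, matching ">50%" and "<10%" in the text.
        if completeness > min_completeness and contamination < max_contamination:
            kept.append(row["bin_id"])
    return kept


# Made-up quality values for four bins.
example = (
    "bin_id\tcompleteness\tcontamination\n"
    "bin.1\t95.2\t1.3\n"
    "bin.2\t48.0\t0.5\n"
    "bin.3\t72.4\t12.9\n"
    "bin.4\t50.0\t2.0\n"
)
print(select_mimag_bins(example))
```

Only `bin.1` passes here: `bin.2` is too incomplete, `bin.3` is too contaminated, and `bin.4` sits exactly on the completeness threshold, which the strict comparison excludes.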

DAG of the workflow

Example result folder:

https://github.com/patriciatran/MAGpipeline/blob/main/exampleresults_folder.txt

Relevant folders for output:

  • results/{sample}/finalbinset/.fasta* : all the final bins in FASTA format
  • results/{sample}/taxonomyfinalbin_set.tsv : final GTDBTK taxonomic assignment for the final bin set of MAGs
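Given an output layout like the one listed above, the number of final bins per sample could be counted with a short Python sketch. The folder name `finalbinset` and the `.fasta` suffix are taken as printed in the list and exposed as parameters, since the real names in a run may differ. The sample names in the demo are made up.

```python
from pathlib import Path
import tempfile


def count_final_bins(results_dir, bin_dir_name="finalbinset", suffix=".fasta"):
    """Map each sample folder under `results_dir` to its number of bin files."""
    counts = {}
    for sample_dir in sorted(Path(results_dir).iterdir()):
        bin_dir = sample_dir / bin_dir_name
        if not bin_dir.is_dir():
            continue  # skip entries without a final bin folder
        counts[sample_dir.name] = sum(
            1 for path in bin_dir.iterdir()
            if path.is_file() and path.name.endswith(suffix)
        )
    return counts


# Demo on a throwaway directory tree with hypothetical sample names.
with tempfile.TemporaryDirectory() as tmp:
    for sample, n_bins in [("sampleA", 3), ("sampleB", 1)]:
        bin_dir = Path(tmp) / "results" / sample / "finalbinset"
        bin_dir.mkdir(parents=True)
        for i in range(n_bins):
            (bin_dir / f"bin.{i}.fasta").write_text(">contig_1\nACGT\n")
    demo_counts = count_final_bins(Path(tmp) / "results")

print(demo_counts)
```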

Status:

April 17, 2023

  • Pipeline works without errors!
  • Next step: improve the documentation and distribute the pipeline as a package.
  • Add ways to report final information, e.g. run time of the pipeline and how many MAGs are in the final bin set for each sample.
  • Add ways to report final information, e.g. a bar plot of taxonomies across samples.
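As a rough illustration of the planned taxonomy report, the sketch below tallies phyla per sample from GTDB-style lineage strings (`d__...;p__...;c__...`). It assumes the taxonomy table exposes one such string per MAG. The function names, sample names and lineages are invented for the example and are not part of the pipeline.

```python
from collections import Counter


def phylum_of(lineage):
    """Extract the phylum from a 'd__...;p__...;c__...' style lineage string."""
    for rank in lineage.split(";"):
        rank = rank.strip()
        if rank.startswith("p__"):
            # An empty "p__" means the phylum was not assigned.
            return rank[3:] or "Unclassified"
    return "Unclassified"


def tally_phyla(lineages_by_sample):
    """Return {sample: Counter(phylum -> number of MAGs)}."""
    return {
        sample: Counter(phylum_of(lineage) for lineage in lineages)
        for sample, lineages in lineages_by_sample.items()
    }


# Made-up lineages for two hypothetical samples.
demo = {
    "sampleA": [
        "d__Bacteria;p__Bacteroidota;c__Bacteroidia",
        "d__Bacteria;p__Bacteroidota;c__Bacteroidia",
        "d__Bacteria;p__Firmicutes;c__Bacilli",
    ],
    "sampleB": ["d__Archaea;p__;c__"],
}
phyla = tally_phyla(demo)
for sample, counts in phyla.items():
    print(sample, dict(counts))
```

The per-sample counters could then be passed to any plotting library to draw the bar plot.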

Thanks to:

This pipeline exists because of the folks who make these programs available. Please cite their work:

  • SPAdes: https://github.com/ablab/spades
  • metaWRAP: https://github.com/bxlab/metaWRAP
  • MetaBAT 1 and MetaBAT 2: https://bitbucket.org/berkeleylab/metabat
  • MaxBin2: https://sourceforge.net/projects/maxbin/
  • DAS Tool: https://github.com/cmks/DAS_Tool
  • dRep: https://github.com/MrOlm/drep
  • CheckM: https://github.com/Ecogenomics/CheckM
  • GTDB-Tk: https://github.com/Ecogenomics/GTDBTk
  • Snakemake: https://snakemake.readthedocs.io/en/stable/
  • conda/anaconda: https://docs.anaconda.com/anaconda/user-guide/faq/
  • mamba: https://github.com/mamba-org/mamba

Further Reading:

MIMAG Standards: https://www.nature.com/articles/nbt.3893

Data used for pipeline testing:

Tisza MJ et al., "A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases," Proc Natl Acad Sci U S A, 2021 Jun 8;118(23).

Owner

  • Name: Patricia Tran
  • Login: patriciatran
  • Kind: user
  • Location: Madison, WI, USA
  • Company: University of Wisconsin - Madison

Bioinformatician & Computational Pipeline Scientist

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Tran
    given-names: Patricia
    orcid: https://orcid.org/0000-0003-3948-3938
title: "MAG analysis pipeline"
version: 0.0.1
doi: TBD
date-released: TBD
