mag_pipeline
🐍 🧬 Turning my metagenomic MAG (metagenome-assembled genome) pipeline into a snakemake pipeline for increased reproducibility and scalability.
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
README.md
About:
I built this Snakemake pipeline to showcase how FASTQ files can be processed all the way into a set of good-quality, dereplicated MAGs.

The general steps are:
- Quality-check the reads using FastQC.
- Assemble the FASTQ reads using SPAdes.
- Bin the assembly into MAGs using metaWRAP.
- Refine the MAGs using DAS Tool.
- Dereplicate the MAGs (if relevant) using dRep.
- Determine MAG quality using CheckM.
- Select only MAGs that meet the MIMAG quality standard for further analyses (e.g. >50% complete, <10% contamination).
- Assign taxonomy to this MAG set using GTDB-Tk.
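The quality-selection step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the column names (`bin_id`, `completeness`, `contamination`), the function `select_quality_bins`, and the example values are all assumptions made for illustration. Only the two cutoffs come from the list above.

```python
import csv
import io

# Cutoffs taken from the example in the step list (>50% complete, <10% contamination).
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0


def select_quality_bins(rows, min_completeness=MIN_COMPLETENESS,
                        max_contamination=MAX_CONTAMINATION):
    """Return the IDs of bins that pass both the completeness and contamination cutoffs."""
    kept = []
    for row in rows:
        completeness = float(row["completeness"])
        contamination = float(row["contamination"])
        if completeness > min_completeness and contamination < max_contamination:
            kept.append(row["bin_id"])
    return kept


# Hypothetical quality table; the column names and values are made up for illustration.
example_table = """bin_id\tcompleteness\tcontamination
bin.1\t95.2\t1.3
bin.2\t48.0\t0.5
bin.3\t72.4\t12.8
bin.4\t50.1\t9.9
"""

rows = list(csv.DictReader(io.StringIO(example_table), delimiter="\t"))
passing_bins = select_quality_bins(rows)
print(passing_bins)  # ['bin.1', 'bin.4']
```

A real quality report from CheckM would need its own column names mapped onto the ones assumed here before filtering.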
Example result folder:
https://github.com/patriciatran/MAGpipeline/blob/main/exampleresults_folder.txt
Relevant folders for output:
- results/{sample}/final_bin_set/*.fasta : all the final bins in FASTA format
- results/{sample}/taxonomy_final_bin_set.tsv : final GTDB-Tk taxonomic assignment for the final bin set of MAGs
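As a rough illustration of how this per-sample layout could be summarized, the sketch below counts the FASTA files in each sample's bin folder. It is an assumption-laden example, not part of the pipeline: `count_final_bins` is a hypothetical helper, the bin folder name is passed in as a parameter, and the demo builds a temporary directory with placeholder names (`sampleA`, `sampleB`, `bins`) rather than reading real output.

```python
from pathlib import Path
import tempfile


def count_final_bins(results_dir, bin_subdir, pattern="*.fasta"):
    """Map each sample directory under results_dir to its number of matching FASTA files."""
    counts = {}
    for sample_dir in sorted(Path(results_dir).iterdir()):
        if sample_dir.is_dir():
            counts[sample_dir.name] = len(list((sample_dir / bin_subdir).glob(pattern)))
    return counts


# Build a throwaway layout so the sketch runs without real pipeline output.
with tempfile.TemporaryDirectory() as tmp:
    for sample, n_bins in {"sampleA": 3, "sampleB": 1}.items():
        bin_dir = Path(tmp) / sample / "bins"  # "bins" is a placeholder folder name
        bin_dir.mkdir(parents=True)
        for i in range(n_bins):
            (bin_dir / f"bin.{i}.fasta").touch()
    bin_counts = count_final_bins(tmp, "bins")

print(bin_counts)  # {'sampleA': 3, 'sampleB': 1}
```

To use it on real output, point `results_dir` at the results folder and set `bin_subdir` to the name of the final bin folder.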
Status:
April 17, 2023
- Pipeline works without errors!
- Next step: improve the documentation and distribute the pipeline as a package.
- Add ways to report final information, e.g. the run time of the pipeline and how many MAGs are in the final bin set for each sample.
- Add ways to report final information, e.g. a bar plot of taxonomies across samples.
Thanks to:
This pipeline exists because of the folks making these programs available. Please cite their work:
- SPAdes: https://github.com/ablab/spades
- metaWRAP: https://github.com/bxlab/metaWRAP
- MetaBAT 1 and MetaBAT 2: https://bitbucket.org/berkeleylab/metabat
- MaxBin2: https://sourceforge.net/projects/maxbin/
- DAS Tool: https://github.com/cmks/DAS_Tool
- dRep: https://github.com/MrOlm/drep
- CheckM: https://github.com/Ecogenomics/CheckM
- GTDB-Tk: https://github.com/Ecogenomics/GTDBTk
- Snakemake: https://snakemake.readthedocs.io/en/stable/
- conda/anaconda: https://docs.anaconda.com/anaconda/user-guide/faq/
- mamba: https://github.com/mamba-org/mamba
Further Reading:
MIMAG Standards: https://www.nature.com/articles/nbt.3893
Data used for pipeline testing:
Tisza MJ et al., "A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases.", Proc Natl Acad Sci U S A, 2021 Jun 8;118(23)
Owner
- Name: Patricia Tran
- Login: patriciatran
- Kind: user
- Location: Madison, WI, USA
- Company: University of Wisconsin - Madison
- Repositories: 19
- Profile: https://github.com/patriciatran
Bioinformatician & Computational Pipeline Scientist
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Tran
    given-names: Patricia
    orcid: https://orcid.org/0000-0003-3948-3938
title: "MAG analysis pipeline"
version: 0.0.1
doi: TBD
date-released: TBD