metagenome-assembly

https://github.com/mlplace/metagenome-assembly


Repository

Basic Info

Host: GitHub
Owner: mlplace
Language: Python
Default Branch: main
Size: 2.88 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago


Metagenome Pipeline for GLBRC

This pipeline analyzes both Illumina and PacBio sequencing data. Currently there is a single set of programs for Illumina analysis and another for PacBio analysis. The pipeline is designed so that additional programs can be added and the programs used at each step can be changed; this capability will be added in a future update based on feedback from users.

For feedback, contact: glbrc-bioinformatics-help@g-groups.wisc.edu

Objectives for the pipeline design:

1) Provide access to a simple command line interface to process metagenomics data.
2) Report intermediate files and log files. Document programs used, versions, and commands for the user.
3) Allow users to use default options or provide a job template file to indicate what programs/settings to use at each step. Default options: illumina or pacbio.
4) Work on Scarcity, taking advantage of HTCondor.
5) Require minimal input or action from users. No need to clone the repo or modify the scripts in any way. The only input is a single job template file.

Getting started

1) Log in to scarcity-submit.org and unset the default PYTHONPATH variable (this avoids potential conflicts) by running:

    unset PYTHONPATH

2) Activate the conda environment containing the programs required by the pipeline

    conda activate /home/glbrc.org/mplace/.conda/envs/metagenomics

3) Move to your project directory in your home directory that contains all the required files

    The recommended way to run this pipeline is in a dedicated project
    directory using the provided conda environment.

4) Generate the job parameter file (jobfile.txt) and the list of FASTQ files to process (fastqfiles.txt)

    See below for information on constructing these files

5) Run the pipeline using the following command:

    /mnt/bigdata/linuxhome/mplace/scripts/metagenome-pipeline/metagenomics.py -j jobfile.txt

    An email will be sent to you when the pipeline is complete.

There is no need to clone this repository for use on GLBRC Scarcity compute cluster.

Input

Two input files are required: 1) jobfile.txt and 2) fastqfiles.txt

These can be made on Scarcity using Nano or uploaded from your computer.

NOTE: If uploading from a Windows machine, run dos2unix on the file(s) to ensure proper line endings, for example: dos2unix jobfile.txt fastqfiles.txt

jobfile.txt

illumina example:
## metagenomics.py job template file for illumina pipeline
Pipeline illumina
SampleIDs fastqfiles.txt
Email user@wisc.edu

pacbio example:
## metagenomics.py job template file for pacbio pipeline
Pipeline pacbio
SampleIDs fastqfiles.txt
Email user@wisc.edu
Assembler metaMDBG  (optional, only use if choosing metaMDBG, default is Flye)
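
The job template format shown above (one `Key value` pair per line, comment lines starting with `#`) can be read with a few lines of Python. This is only an illustrative sketch: `parse_jobfile` and `VALID_PIPELINES` are hypothetical names, not part of metagenomics.py, and the handling of the default assembler is an assumption based on the note above that Flye is the default for pacbio.

```python
# Hypothetical helper, not part of metagenomics.py: reads the job template
# format shown above into a dictionary.
VALID_PIPELINES = {"illumina", "pacbio"}


def parse_jobfile(text):
    """Return a dict of job parameters from job template text."""
    params = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and comment lines such as "## metagenomics.py ..."
        if not fields or fields[0].startswith("#"):
            continue
        key = fields[0]
        # Only the first token after the key is kept as the value.
        value = fields[1] if len(fields) > 1 else ""
        params[key] = value

    pipeline = params.get("Pipeline")
    if pipeline not in VALID_PIPELINES:
        raise ValueError("Pipeline must be 'illumina' or 'pacbio', got %r" % pipeline)

    # Assumption: Flye is used for pacbio unless an Assembler line is given.
    if pipeline == "pacbio":
        params.setdefault("Assembler", "Flye")
    return params
```

Calling `parse_jobfile(open("jobfile.txt").read())` on the pacbio example would give a dictionary with the keys Pipeline, SampleIDs, Email, and Assembler.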

Fastq files are provided in a file called fastqfiles.txt, with one sample per line. There are two tab separated columns with no header. The first column is the sample ID and the second is the FASTQ name(s). The sample ID should be a short, human readable name for the experimental sample.

NOTE: FASTQ FILES NEED TO BE ZIPPED (fastq.gz)

For the pipeline the fastqfiles.txt looks like this:

example:
bcAd1067T bcAd1067T.fastq.gz
bcAd1071T bcAd1071T.fastq.gz
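
Before submitting a job it can help to check fastqfiles.txt against the rules above (two tab-separated columns, no header, gzipped FASTQ). The sketch below is a hypothetical checker, not part of the pipeline; `read_sample_sheet` is an invented name, and it assumes a single FASTQ file name per sample.

```python
# Hypothetical checker for fastqfiles.txt; not part of the pipeline.
def read_sample_sheet(text):
    """Return a list of (sample_id, fastq_name) tuples, or raise ValueError."""
    samples = []
    # splitlines() also copes with Windows (\r\n) line endings.
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 2 tab-separated columns, found %d"
                % (number, len(fields))
            )
        sample_id, fastq = fields[0].strip(), fields[1].strip()
        if not sample_id:
            raise ValueError("line %d: empty sample ID" % number)
        # Assumption: one file name per sample, gzipped as required above.
        if not fastq.endswith(".fastq.gz"):
            raise ValueError("line %d: %r is not a .fastq.gz file" % (number, fastq))
        samples.append((sample_id, fastq))
    return samples
```

A line separated by spaces instead of a tab is reported as having one column, which is a common mistake when the file is typed by hand.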

NOTE: FASTQ files for the illumina pipeline are expected to be interleaved. Use this script to interleave your files:

conda activate /home/glbrc.org/mplace/.conda/envs/metagenomics

/home/glbrc.org/mplace/scripts/metagenome-pipeline/interleaveFasta.py -f fastq_files.txt

fastq_files.txt is a tab-delimited text file with one sample's information per line (sampleName  sampleName-R1.fastq.gz  sampleName-R2.fastq.gz).
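
For readers unfamiliar with the term, an interleaved FASTQ file alternates the R1 and R2 records of each read pair: read 1 forward, read 1 reverse, read 2 forward, and so on. The following is a minimal sketch of that idea, not the code of interleaveFasta.py; the function names are hypothetical and it assumes both inputs are gzipped, four-line-per-record FASTQ files with reads in the same order.

```python
import gzip
from itertools import zip_longest


def read_fastq_records(handle):
    """Yield FASTQ records as lists of four newline-terminated lines."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return  # end of file
        if not record[3]:
            raise ValueError("truncated FASTQ record")
        # Make sure every line ends with a newline, including the last one.
        yield [line if line.endswith("\n") else line + "\n" for line in record]


def interleave(r1_path, r2_path, out_path):
    """Write R1/R2 records alternately to out_path; return the pair count."""
    pairs = 0
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2, \
            gzip.open(out_path, "wt") as out:
        for rec1, rec2 in zip_longest(read_fastq_records(r1), read_fastq_records(r2)):
            if rec1 is None or rec2 is None:
                raise ValueError("R1 and R2 contain different numbers of reads")
            out.writelines(rec1)
            out.writelines(rec2)
            pairs += 1
    return pairs
```

The sketch does not check that the read names in R1 and R2 match; it only checks that both files hold the same number of records.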

Parameters

-j : str  A job parameter file, used to set pipeline parameters.

Example

On scarcity-submit usage:

/mnt/bigdata/linuxhome/mplace/scripts/metagenome-pipeline/metagenomics.py -j jobfile.txt

Pipelines

illumina pipeline:

illumina pipeline paper (https://journals.asm.org/doi/10.1128/mSystems.00804-20)

1) bbduk.sh  in=fastq out=outFile trimq=int maxns=int avgqual=int minlen=int mlf=int
“Duk” stands for Decontamination Using Kmers. BBDuk was developed to combine most common 
data-quality-related trimming, filtering, and masking operations into a single high-performance tool.

2) bbcms.sh in=fastq out=outFile mincount=int hcf=float overwrite=t -Xmx36g
Error corrects reads and/or filters by depth, storing kmer counts in a count-min sketch (a Bloom filter variant).

3) metaspades --12 fastq -o outdir --only-assembler
SPAdes - St. Petersburg genome assembler - is an assembly toolkit containing various assembly pipelines.

4) bbmap ref=reference in=fastq out=outDir covstats=outDir/file ambiguous=random touppercase=t nodisk 
( this step also sorts results with samtools).
BBMap is a splice-aware global aligner for DNA and RNA sequencing reads.

5) runMetaBat.sh  fasta bamfile
MetaBAT2 is an automated metagenome binning software tool that reconstructs single genomes from 
microbial communities for subsequent analyses of uncultivated microbial species. At this point all 
clean bin fasta files are copied to the bins directory and processed together from this point onward.

6) /opt/bifxapps/prodege/bin/prodege.sh 
ProDeGe: a computational protocol for fully automated decontamination of genomes

7) Calculate GC content, tetranucleotide frequencies, and contig length using gcContent.py,
which combines the scripts Kevin provided into a single script.

8) Run correlations using Kevin's script:  Calculating_TF_Correlations.R

9) Remove contaminated contigs (see cleanFasta.py). The correlation R script identifies low 
tetranucleotide frequency contigs, and ProDeGe writes a *_output_contam.fna file that identifies 
contigs to remove. Contigs whose identifiers appear in both results are removed from the original 
contig fasta files to create the clean contig files.

10) Make histograms using make_histogram_boxplot_contig_lengths.R (your script).

11) checkm lineage_wf -x fa -t 16 cleanDir outDir
    checkm qa -o 2 outDir/lineage.ms outDir
CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes.

12) dRep dereplicate dRep_out cleanFastaDir/*.fa -conW 0.5 -N50W 5
dRep is a Python program which performs rapid pair-wise comparison of genome sets. One of its major purposes is genome de-replication.

13) 2nd round of checkm

    checkm lineage_wf -x fa -t 16 dRep_out outDir
    checkm qa -o 2 outDir/lineage.ms outDir

14) gtdbtk classify_wf --genome_dir /dereplicated_genomes/ --out_dir /gtdb-tk-out/ -x fa --cpus 16 --mash_db mashdb
GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and
archaeal genomes based on the Genome Taxonomy Database (GTDB).

15) RAxML-ng run on MAGs alone and run on MAGs plus nearest reference genome.
RAxML-NG is a phylogenetic tree inference tool which uses the maximum-likelihood (ML) optimality criterion.
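
Steps 7 and 9 of the illumina pipeline can be illustrated with a short Python sketch. This is not the code of gcContent.py or cleanFasta.py: the function names are hypothetical, contigs are assumed to be held in a dictionary of name to sequence, and the handling of ambiguous bases is an assumption. It shows GC content, tetranucleotide frequencies, and removal of contigs flagged by both ProDeGe and the tetranucleotide correlation step.

```python
from collections import Counter


def gc_content(sequence):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def tetranucleotide_frequencies(sequence):
    """Relative frequency of each overlapping 4-mer made only of A, C, G, T."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + 4] for i in range(len(sequence) - 3))
    # Assumption: 4-mers containing ambiguous bases such as N are ignored.
    counts = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def remove_flagged_contigs(contigs, prodege_ids, low_tf_ids):
    """Drop contigs whose names appear in BOTH sets of flagged identifiers."""
    flagged = set(prodege_ids) & set(low_tf_ids)
    return {name: seq for name, seq in contigs.items() if name not in flagged}
```

Using the intersection of the two identifier sets follows the description in step 9: a contig is removed only when both methods agree that it is a contaminant.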

Pacbio Lab pipeline:

1) BBTools icecreamfinder.sh, finds PacBio reads containing inverted repeats.  These are candidate 
triangle reads (ice cream cones).  Either ice cream cones only, or all inverted repeats, can be filtered.

icecreamfinder.sh jni=f json=f ow=t ksr=f trim=f ccs=t in=MyFastq.gz stats=Stats.txt out=Filtered.fastq outa=InvertedRepeats.fastq outb=Chimeric.fastq

2) Assembly: two options, metaFlye or metaMDBG

Flye is a de novo assembler for single-molecule sequencing reads, such as those produced by PacBio 
and Oxford Nanopore Technologies. Polishing is performed as the final assembly stage. By default, Flye runs one polishing iteration. 

flye --pacbio-hifi Filtered.fastq --meta -t 16 --out-dir KW_ome_29_FD

or

MetaMDBG is a fast and low-memory assembler for long and accurate metagenomics reads (e.g. PacBio HiFi).

metaMDBG asm assembly bcAd1067T.fastq.gz -t 4 (for initial assembly)     

3) Align reads to the assembly with pbmm2 (a minimap2 wrapper) in preparation for polishing

pbmm2 index assembly.fasta assembly.mmi
pbmm2 align --preset=CCS --sort assembly.mmi KW_ome_29_FD.fastq KW_ome_29_FD.bam --log-level INFO --rg KW_ome_29_FD

4) Polish with Racon. Racon is intended as a standalone consensus module to correct raw contigs 
generated by rapid assembly methods which do not include a consensus step.

racon -u -t 32 KW_ome_29_FD.fastq KW_ome_29_FD.sam assembly.fasta

5) Using minimap2 to align original filtered fastq to polished assembly.

minimap2 -t 16 -a -x map-pb KW_ome_29_FD.fasta KW_ome_29_FD.fastq

6) pileup.sh from BBTools calculates per-scaffold or per-base coverage information from an unsorted sam or bam file. 

pileup.sh in=minimap2_filtered.bam ref=racon-assembly.fasta

7) Bin metagenomes using MetaBAT, which uses consistency of coverage and tetranucleotide frequency as guides.
Contaminated contigs are then removed based on the taxonomic classification.

runMetaBat.sh racon-assembly.fasta minimap2_filtered.bam

8) Phylogenetic lineage is determined using GTDB-Tk.  GTDB-Tk is a software toolkit for assigning objective 
taxonomic classifications to bacterial and archaeal genomes based on the Genome Taxonomy Database.

gtdbtk classify_wf --genome_dir metabat/sample/bins/ --out_dir gtdbtk/sample/classify -x fa --cpus 16 --mash_db 

9) Generate stats on assembly with CheckM. CheckM provides a set of tools for assessing the quality of genomes
recovered from isolates, single cells, or metagenomes. CheckM provides robust estimates of genome completeness
and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage.

10) dRep can rapidly and accurately compare a list of genomes in a pair-wise manner.

This allows identification of groups of organisms that share similar DNA content in terms of Average Nucleotide Identity (ANI).

Requirements

NOTE: These are all included in the Conda environment. There is no need to recreate the environment or install these programs

illumina:
1. BBTools (39.01)  https://jgi.doe.gov/data-and-tools/software-tools/bbtools/
2. SPAdes  (3.15.5) https://github.com/ablab/spades
3. MetaBAT (2.15)   https://bitbucket.org/berkeleylab/metabat/src/master/
4. CheckM  (1.2.2)  https://github.com/Ecogenomics/CheckM
5. dRep    (3.4.5)  https://github.com/MrOlm/drep
6. GTDB-Tk (2.3.2)  https://github.com/Ecogenomics/GTDBTk
7. RAxML-NG (1.2.0) https://github.com/amkozlov/raxml-ng
pacbio:
1. BBTools (39.01)  https://jgi.doe.gov/data-and-tools/software-tools/bbtools/
2. Flye    (2.9.2)  https://github.com/fenderglass/Flye/tree/flye
2. MetaMDBG (0.3)   https://github.com/GaetanBenoitDev/metaMDBG
3. Pbmm2   (1.13.1) https://github.com/PacificBiosciences/pbmm2
4. Racon   (1.5.0)  https://github.com/isovic/racon
5. MetaBAT (2.15)   https://bitbucket.org/berkeleylab/metabat/src/master/
6. dRep    (3.4.5)  https://github.com/MrOlm/drep
7. GTDB-Tk (2.3.2)  https://github.com/Ecogenomics/GTDBTk
8. CheckM  (1.2.2)  https://github.com/Ecogenomics/CheckM

Conda

To make racon_wrapper work, the script file has to be modified as follows:

    import os

    # Locate racon and rampler inside the active conda environment.
    CONDA_ENV = os.environ['CONDA_PREFIX']

    class RaconWrapper:

        __racon = '{}/bin/racon'.format(CONDA_ENV)
        __rampler = '{}/bin/rampler'.format(CONDA_ENV)

from: https://github.com/isovic/racon/issues/81

References

BBMap - Bushnell B. - sourceforge.net/projects/bbmap/

Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics, btz848.

Chaumeil PA, et al. 2022. GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database. Bioinformatics, btac672.

Clum A, Huntemann M, et al. DOE JGI Metagenome Workflow. mSystems. 2021 May 18; 6(3):e00804-20. doi: 10.1128/mSystems.00804-20. PMID: 34006627; PMCID: PMC8269246.

Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011 Oct;7(10):e1002195. doi: 10.1371/journal.pcbi.1002195. Epub 2011 Oct 20. PMID: 22039361; PMCID: PMC3197634.

Benoit, G., Raguideau, S., James, R. et al. High-quality metagenome assembly from long accurate reads with metaMDBG. Nat Biotechnol (2024). https://doi.org/10.1038/s41587-023-01983-6

Haft DH, DiCuccio M, Badretdin A, et al. RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.

Hyatt D, et al. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11:119. doi: 10.1186/1471-2105-11-119.

Jain C, et al. 2019. High-throughput ANI Analysis of 90K Prokaryotic Genomes Reveals Clear Species Boundaries. Nat. Communications, doi: 10.1038/s41467-018-07641-9.

Kang DD, Froula J, Egan R, Wang Z. 2015. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3:e1165 https://doi.org/10.7717/peerj.1165

Kang DD, et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ. 2019 Jul 26;7:e7359. doi: 10.7717/peerj.7359. PMID: 31388474; PMCID: PMC6662567.

Kolmogorov M, Bickhart D, Behsaz B, Gurevich A, Rayko M, et al. "metaFlye: scalable long-read metagenome assembly using repeat graphs", Nature Methods, 2020. doi: 10.1038/s41592-020-00971-x

Li W, O'Neill KR, Haft DH, et al. RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028.

Nurk S., Meleshko D., Korobeynikov A., Pevzner P. A. metaSPAdes: a new versatile de novo metagenomics assembler. Genome Research, 2017

Olm MR, Brown CT, Brooks B, Banfield JF. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 2017 Dec;11(12):2864-2868. doi: 10.1038/ismej.2017.126. Epub 2017 Jul 25. PMID: 28742071; PMCID: PMC5702732.

Ondov BD, et al. 2016. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132. doi: 10.1186/s13059-016-0997-x.

Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55. doi: 10.1101/gr.186072.114. Epub 2015 May 14. PMID: 25977477; PMCID: PMC4484387.

Togkousidis A, Kozlov OM, Haag J, Höhler D, Stamatakis A. Adaptive RAxML-NG: Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty. Mol Biol Evol. 2023 Oct 4;40(10):msad227. doi: 10.1093/molbev/msad227. PMID: 37804116; PMCID: PMC10584362.

Owner

Name: Mike Place
Login: mlplace
Kind: user
Company: University of Wisconsin

Repositories: 1
Profile: https://github.com/mlplace

Bioinformatics Programmer

Citation (citations.py)

#!/usr/bin/env python
"""citations.py

Print the relevant (illumina, pacbio) pipeline citations to file.

"""
pipeline = { 'illumina': ['bbduk', 'bbcms', 'metaspades', 'bbmap', 'metabat', 'QC',
                     'dRep', 'checkm', 'gtdbtk', 'raxml'], 
            'pacbio' : [ 'icecreamfinder.sh', 'samtools', 'flye', 'pbmm2',
                         'racon', 'fungalrelease.sh', 'minimap2', 'pileup.sh',
                         'metabat2' ]}

citations = { 'illumina': """
    Bushnell B. BBMAP - sourceforge.net/projects/bbmap/

    Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with
    the Genome Taxonomy Database. Bioinformatics, btz848.    

    Chaumeil PA, et al. 2022. GTDB-Tk v2: memory friendly classification with
    the Genome Taxonomy Database. Bioinformatics, btac672.
    
    Clum A, Huntemann M, et al. DOE JGI Metagenome Workflow. mSystems. 2021 May 18;
    6(3):e00804-20. doi: 10.1128/mSystems.00804-20. PMID: 34006627; PMCID: PMC8269246.
             
    Danecek P, Bonfield James K, Liddle J, et al. Twelve years of SAMtools and BCFtools,
    GigaScience, Volume 10, Issue 2, February 2021, giab008, 
    https://doi.org/10.1093/gigascience/giab008

    Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011 Oct;7(10):e1002195.
    doi: 10.1371/journal.pcbi.1002195. Epub 2011 Oct 20. PMID: 22039361; PMCID: PMC3197634.    

    Haft DH, DiCuccio M, Badretdin A, et. al.
    RefSeq: an update on prokaryotic genome annotation and curation.
    Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.    

    Hyatt D, et al. 2010. Prodigal: prokaryotic gene recognition and translation
    initiation site identification. BMC Bioinformatics, 11:119. 
    doi: 10.1186/1471-2105-11-119.    

    Jain C, et al. 2019. High-throughput ANI Analysis of 90K Prokaryotic Genomes 
    Reveals Clear Species Boundaries. Nat. Communications, doi: 10.1038/s41467-018-07641-9.    

    Kang DD, Froula J, Egan R, Wang Z. 2015. MetaBAT, an efficient tool for accurately
    reconstructing single genomes from complex microbial communities. 
    PeerJ 3:e1165 https://doi.org/10.7717/peerj.1165

    Kang DD, et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient
    genome reconstruction from metagenome assemblies. PeerJ. 2019 Jul 26;7:e7359. 
    doi: 10.7717/peerj.7359. PMID: 31388474; PMCID: PMC6662567.

    Li W, O'Neill KR, Haft DH, et. al.
    Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.
    RefSeq: Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028.
        
    Matsen FA, Kodner RB, Armbrust EV. pplacer: linear time maximum-likelihood and 
    Bayesian phylogenetic placement of sequences onto a fixed reference tree. 
    BMC Bioinformatics. 2010 Oct 30;11:538. doi: 10.1186/1471-2105-11-538. 
    PMID: 21034504; PMCID: PMC3098090.    

    Nurk S., Meleshko D., Korobeynikov A., Pevzner P. A. metaSPAdes: a new versatile
    de novo metagenomics assembler. Genome Research, 2017

    Olm MR, Brown CT, Brooks B, Banfield JF. dRep: a tool for fast and accurate 
    genomic comparisons that enables improved genome recovery from metagenomes
    through de-replication. ISME J. 2017 Dec;11(12):2864-2868. 
    doi: 10.1038/ismej.2017.126. Epub 2017 Jul 25. PMID: 28742071; 
    PMCID: PMC5702732.

    Ondov BD, et al. 2016. Mash: fast genome and metagenome distance estimation 
    using MinHash. Genome Biol 17, 132. doi: 10.1186/s13059-016-0997-x.    

    Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered
    from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55.
    doi: 10.1101/gr.186072.114. Epub 2015 May 14. PMID: 25977477; PMCID: PMC4484387.

    Price MN, Dehal PS, Arkin AP. FastTree 2--approximately maximum-likelihood 
    trees for large alignments. PLoS One. 2010 Mar 10;5(3):e9490. 
    doi: 10.1371/journal.pone.0009490. PMID: 20224823; PMCID: PMC2835736.

    Tatusova T, et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res.
    2016 Aug 19;44(14):6614-24. doi: 10.1093/nar/gkw569. Epub 2016 Jun 24. 
    PMID: 27342282; PMCID: PMC5001611.

    Togkousidis A, Kozlov OM, Haag J, Höhler D, Stamatakis A. Adaptive RAxML-NG: 
    Accelerating Phylogenetic Inference under Maximum Likelihood using Dataset Difficulty.
    Mol Biol Evol. 2023 Oct 4;40(10):msad227. doi: 10.1093/molbev/msad227. 
    PMID: 37804116; PMCID: PMC10584362.
    """, 

    'pacbio': """
    Bushnell B. BBMAP - sourceforge.net/projects/bbmap/

    Chaumeil PA, et al. 2019. GTDB-Tk: A toolkit to classify genomes with
    the Genome Taxonomy Database. Bioinformatics, btz848.    

    Chaumeil PA, et al. 2022. GTDB-Tk v2: memory friendly classification with
    the Genome Taxonomy Database. Bioinformatics, btac672.

    Danecek P, Bonfield James K, Liddle J, et al. Twelve years of SAMtools and BCFtools,
    GigaScience, Volume 10, Issue 2, February 2021, giab008, 
    https://doi.org/10.1093/gigascience/giab008

    Kang DD, Froula J, Egan R, Wang Z. 2015. MetaBAT, an efficient tool for accurately
    reconstructing single genomes from complex microbial communities. 
    PeerJ 3:e1165 https://doi.org/10.7717/peerj.1165  
    
    Kolmogorov M, Bickhart D, Behsaz B, Gurevich A, Rayko M, et al. 
    "metaFlye: scalable long-read metagenome assembly using repeat graphs",
    Nature Methods, 2020. doi: 10.1038/s41592-020-00971-x

    Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. 
    Bioinformatics, 34:3094-3100. doi:10.1093/bioinformatics/bty191

    PacBio, pbmm2: pbmm2 is a SMRT C++ wrapper for minimap2's C API. 
    https://github.com/PacificBiosciences/pbmm2
    
    Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered
    from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55.
    doi: 10.1101/gr.186072.114. Epub 2015 May 14. PMID: 25977477; PMCID: PMC4484387.

    Vaser R, Sović I, Nagarajan N, Šikić M. Fast and accurate de novo genome assembly
    from long uncorrected reads. Genome Res. 2017 May;27(5):737-746. 
    doi: 10.1101/gr.214270.116. Epub 2017 Jan 18. PMID: 28100585; PMCID: PMC5411768.
    """ }

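The module docstring says the citations are printed to a file, but the listing above only defines the two dictionaries. One way this could be done is sketched below. `write_citations` is a hypothetical helper, not a function from the repository, and it takes the `citations` dictionary as an argument so that the example is self-contained.

```python
import textwrap


def write_citations(citations, pipeline_name, out_path):
    """Write the citation text for one pipeline to out_path."""
    if pipeline_name not in citations:
        raise KeyError("unknown pipeline: %r" % pipeline_name)
    # Remove the common leading indentation of the triple-quoted block.
    text = textwrap.dedent(citations[pipeline_name]).strip()
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(text + "\n")
    return out_path
```

For example, `write_citations(citations, 'pacbio', 'citations.txt')` would write the pacbio reference list with its indentation removed.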