meta-short

Nf-core based workflow for short read metagenomic data

https://github.com/srusher/meta-short

Repository

Basic Info
  • Host: GitHub
  • Owner: srusher
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 220 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago

README.md

Introduction

metashort is a bioinformatics workflow that accepts short reads as input and runs them through the following processes/analyses:

  1. OPTIONAL: Subsampling (BBMap)

  2. Read QC (FastQC)

  3. Adapter Trimming and Quality Filtering (trimmomatic or fastp)

  4. Taxonomic Classification (kraken2)

  5. Taxonomy distribution visualization (Krona)

  6. OPTIONAL: Taxonomic Filtering (KrakenTools)

  7. De Novo Assembly (SPAdes)

  8. Assembly QC (quast)

  9. Binning (Maxbin2)

  10. Contig alignment and identification (blast)

  11. Generate summary report (MultiQC)

Setup

This workflow uses assets and dependencies native to the CDC's SciComp environment. If you do not have access to the SciComp environment, you can request an account here.

Usage

First, prepare a samplesheet with your input data that looks as follows:

samplesheet.csv:

```csv
sample,fastq_1,fastq_2
SAMPLE_1,/data/reads/sample1-R1.fastq.gz,/data/reads/sample1-R2.fastq.gz
SAMPLE_2,/data/reads/sample2-R1.fastq.gz,/data/reads/sample2-R2.fastq.gz
```

The top row is the header row ("sample,fastq_1,fastq_2") and should never be altered. Each row below the header represents one pair of paired-end FASTQ files with a unique identifier in the "sample" column (SAMPLE_1 and SAMPLE_2 in the example above). Each FASTQ file needs to be gzipped/compressed to prevent validation errors from occurring when the pipeline initializes.
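The rules above can be checked before launching a run. The following is a minimal sketch of such a pre-check, not the pipeline's own validation code (which is not shown here); the function name `validate_samplesheet` is hypothetical, and the gzip check only looks at the `.gz` file extension rather than the file contents.

```python
import csv
import io

EXPECTED_HEADER = ["sample", "fastq_1", "fastq_2"]


def validate_samplesheet(text):
    """Return a list of human-readable problems found in samplesheet text."""
    problems = []
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows or rows[0] != EXPECTED_HEADER:
        problems.append("header must be exactly: " + ",".join(EXPECTED_HEADER))
        return problems
    seen = set()
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 columns, got {len(row)}")
            continue
        sample, fastq_1, fastq_2 = (field.strip() for field in row)
        # Each sample identifier must be unique.
        if sample in seen:
            problems.append(f"line {line_no}: duplicate sample ID '{sample}'")
        seen.add(sample)
        # Assumption: a ".gz" suffix is used as a proxy for "is gzipped".
        for path in (fastq_1, fastq_2):
            if not path.endswith(".gz"):
                problems.append(f"line {line_no}: '{path}' is not gzipped")
    return problems


example = """sample,fastq_1,fastq_2
SAMPLE_1,/data/reads/sample1-R1.fastq.gz,/data/reads/sample1-R2.fastq.gz
SAMPLE_2,/data/reads/sample2-R1.fastq.gz,/data/reads/sample2-R2.fastq
"""
for problem in validate_samplesheet(example):
    print(problem)
```

In this example the second sample's reverse read lacks the `.gz` suffix, so a single problem is reported for line 3.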

There is an example samplesheet located under the assets folder (assets/samplesheet.csv) that you can view and edit yourself. NOTE: If you use this samplesheet, please make a backup copy of it, as it will be overwritten each time you pull an updated version of this repository.

Once the samplesheet has been formatted, we can run the workflow using one of the three methods listed below.

Method 1: Cluster Submission:

The qsub method allows you to submit the job to SciComp's high memory cluster computing nodes for fast performance and load distribution. This is a good "fire and forget" method for new users who aren't as familiar with SciComp's compute environment.

Format:

```bash
bash ./run_qsub.sh --input "/path/to/samplesheet" --outdir "/path/to/output/directory" "<additional-parameters>"
```

Example:

```bash
bash ./run_qsub.sh --input "assets/samplesheet.csv" --outdir "results/test" "--skip_subsample false --num_subsamples 1000 --skip_kraken2 false"
```

Method 2: Local Execution:

The local method may be a better option if you are experiencing technical issues with the qsub method. qsub adds additional layers of complexity to workflow execution, while local simply runs the workflow on your local machine or the host that you're connected to, provided it has sufficient memory/RAM and CPUs to execute the workflow.

Format:

```bash
bash ./run_local.sh --input "/path/to/samplesheet" --outdir "/path/to/output/directory" "<additional-parameters>"
```

Example:

```bash
bash ./run_local.sh --input "./assets/samplesheet.csv" --outdir "./results/test" "--skip_subsample false --num_subsamples 1000 --skip_kraken2 false"
```

Method 3: Native Nextflow Execution:

If you are familiar with Nextflow and SciComp's computing environment, you can invoke the nextflow command straight from the terminal. NOTE: If you are using this method, you will need to load a Nextflow environment via module load or conda.

Format:

```bash
nextflow run main.nf -profile singularity,local --input "/path/to/samplesheet" --outdir "/path/to/output/directory" <additional flags>
```

Example:

```bash
nextflow run main.nf -profile singularity,local --input "./assets/samplesheet.csv" --outdir "./results/test" --skip_subsample false --num_subsamples 1000 --skip_kraken2 false
```
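When the native command is launched from a script rather than typed by hand, it can help to assemble the argument list programmatically. The sketch below reproduces the example command above from a dictionary of parameters; the helper name `build_nextflow_command` is hypothetical and not part of this repository, and the code only builds and prints the command without running it.

```python
import shlex


def build_nextflow_command(samplesheet, outdir, extra_params=None,
                           profile="singularity,local"):
    """Assemble the argument list for a native Nextflow run."""
    command = ["nextflow", "run", "main.nf", "-profile", profile,
               "--input", samplesheet, "--outdir", outdir]
    for name, value in (extra_params or {}).items():
        # Booleans are written as lowercase true/false, as in the examples above.
        if isinstance(value, bool):
            value = "true" if value else "false"
        command += [f"--{name}", str(value)]
    return command


command = build_nextflow_command(
    "./assets/samplesheet.csv",
    "./results/test",
    {"skip_subsample": False, "num_subsamples": 1000, "skip_kraken2": False},
)
# shlex.join quotes arguments only where the shell requires it.
print(shlex.join(command))
```

The returned list can be passed to `subprocess.run` directly, which avoids shell quoting issues with paths that contain spaces.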

Parameters

See below for all possible input parameters:

Global Variables:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --metagenomic_sample | boolean | true |

Workflow processes:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --skip_subsample | boolean | true |
| --skip_fastq_screen | boolean | true |
| --skip_kraken2 | boolean | true |
| --skip_extract_kraken_reads | boolean | true |
| --skip_metaphlan | boolean | true |
| --skip_assembly | boolean | true |
| --skip_medaka | boolean | true |
| --skip_binning | boolean | true |
| --skip_blast | boolean | true |

BBMap subsampling parameters:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --num_subsamples | integer | 1000 |

Global Trimming parameters:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --trim_tool | string | "trimmomatic" |
| --adapt_ref | string | "./assets/sequencing-adapters.fasta" |

Trimmomatic parameters:

| Parameter | Data Type | Default Value | Notes |
|:---------:|:---------:|:-------------:|:-------------:|
| --trimmomatic params | string | "ILLUMINACLIP:./assets/sequencing-adapters.fasta:2:30:10 SLIDINGWINDOW:3:20 MINLEN:36" | This string can be modified to any known command line arguments for trimmomatic - simply format the string in the same way you would enter it on the command line |
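To make the default Trimmomatic string easier to read, the sketch below splits it into its individual trimming steps. In Trimmomatic's syntax, ILLUMINACLIP takes the adapter FASTA followed by the seed mismatches, palindrome clip threshold, and simple clip threshold; SLIDINGWINDOW takes the window size and required average quality; MINLEN takes the minimum read length to keep. The helper name `parse_trimmomatic_steps` is hypothetical, and the simple colon split assumes the adapter path itself contains no colons.

```python
def parse_trimmomatic_steps(params):
    """Split a Trimmomatic step string into (step name, [arguments]) pairs."""
    steps = []
    for token in params.split():
        # Everything before the first colon is the step name.
        name, _, rest = token.partition(":")
        steps.append((name, rest.split(":") if rest else []))
    return steps


default_params = ("ILLUMINACLIP:./assets/sequencing-adapters.fasta:2:30:10 "
                  "SLIDINGWINDOW:3:20 MINLEN:36")

for name, args in parse_trimmomatic_steps(default_params):
    print(name, args)
```

Trimmomatic applies the steps in the order they are written, so reordering the string changes the result (for example, placing MINLEN before SLIDINGWINDOW would filter on length before quality trimming).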

fastp parameters:

| Parameter | Data Type | Default Value | Notes |
|:---------:|:---------:|:-------------:|:-------------:|
| --adapter_auto_detect | boolean | false | when set to false fastp will use the --adapter_ref fasta to locate adapter sequences |
| --fastp params | string | "" | This string can be modified to any known command line arguments for fastp - simply format the string in the same way you would enter it on the command line |

Kraken2 parameters:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --kraken_db_main | string | "/scicomp/groups-pure/OID/NCEZID/DFWED/WDPB/EMEL/Projects/LongReadAnalysis/data/kraken-db/bactarchvirfungiamoeba-DB41-mer" |
| `--krakencustom_params` | string | "" |

Kraken tools - Extract Kraken Reads:

| Parameter | Data Type | Default Value | Notes |
|:---------:|:---------:|:-------------:|:-------------:|
| --kraken_tax_id | string | "5754" | Taxonomic ID value(s) that you want krakentools to pull out of your classified reads |
| --include-children | boolean | true | filter for all child taxonomic IDs of the parent tax ID declared in --kraken_tax_id |
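The effect of --include-children can be illustrated with a toy taxonomy. KrakenTools derives the child taxa from the Kraken2 report; the sketch below instead uses a small hand-written child-to-parent mapping, so it only demonstrates the selection logic. Only the tax ID 5754 comes from the default above; every other ID, the mapping `TOY_PARENTS`, and the function `select_tax_ids` are made up for illustration.

```python
# Hypothetical toy taxonomy: child tax ID -> parent tax ID.
# Only 5754 comes from the default above; the other IDs are invented.
TOY_PARENTS = {5754: 1, 900001: 5754, 900002: 5754, 900003: 900001, 900004: 1}


def select_tax_ids(root, parents, include_children):
    """Return the set of tax IDs whose reads would be kept."""
    selected = {root}
    if not include_children:
        return selected
    # Repeatedly add any taxon whose parent is already selected,
    # until no further descendants are found.
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in selected and child not in selected:
                selected.add(child)
                changed = True
    return selected


print(sorted(select_tax_ids(5754, TOY_PARENTS, include_children=False)))
print(sorted(select_tax_ids(5754, TOY_PARENTS, include_children=True)))
```

With include_children disabled only reads assigned exactly to 5754 are selected; with it enabled, reads assigned to any descendant (here 900001, 900002, and 900003) are selected as well, while the unrelated taxon 900004 is not.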

FastqScreen parameters:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --fastq_screen_conf | string | "./assets/fastq_screen.conf" |

Metaphlan parameters:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --methaphlan_db | string | "/scicomp/groups-pure/OID/NCEZID/DFWED/WDPB/EMEL/Projects/LongReadAnalysis/data/metaphlan/metaphlan_databases" |

Assembler parameters:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --assembler | string | 'spades' |

BLAST parameters:

| Parameter | Data Type | Default Value |
|:---------:|:---------:|:-------------:|
| --blast_db | string | "/scicomp/groups-pure/OID/NCEZID/DFWED/WDPB/EMEL/Projects/LongReadAnalysis/data/blast/arch-bact-fung-hum-amoebarefseq/arch-bact-fung-hum-amoebarefseq" |
| --blast_evalue | string | "1e-10" |
| --blast_perc_identity | string | "90" |
| --blast_target_seqs | string | "5" |
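To give a feel for what the default BLAST thresholds mean, the sketch below applies them to a few rows in BLAST's standard tabular layout (outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). This is an illustration only: BLAST applies these limits during the search itself, this page does not show how the pipeline forwards the parameters, and the per-query cap here is a simplification of BLAST's target sequence limit. The function `filter_hits` and the example rows are hypothetical.

```python
from collections import defaultdict


def filter_hits(lines, max_evalue=1e-10, min_identity=90.0, max_hits=5):
    """Keep up to max_hits hits per query that pass both thresholds."""
    kept = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, evalue = float(fields[2]), float(fields[10])
        if (identity >= min_identity and evalue <= max_evalue
                and len(kept[query]) < max_hits):
            kept[query].append((subject, identity, evalue))
    return dict(kept)


# Hypothetical outfmt 6 rows, invented for illustration.
rows = [
    "contig_1\tsubject_A\t98.5\t500\t7\t0\t1\t500\t1\t500\t1e-150\t900",
    "contig_1\tsubject_B\t85.0\t500\t75\t0\t1\t500\t1\t500\t1e-80\t400",
    "contig_2\tsubject_C\t95.0\t60\t3\t0\t1\t60\t1\t60\t1e-05\t60",
]
for query, hits in filter_hits(rows).items():
    print(query, hits)
```

Here subject_B is dropped because its identity is below 90%, and subject_C is dropped because its e-value is above 1e-10, so only the contig_1 hit to subject_A remains.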

Credits

Meta-short was originally written by Sam Rusher (rtq0@cdc.gov).

Citations

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Name: Samuel Rusher
  • Login: srusher
  • Kind: user
  • Location: Frankfort, KY
  • Company: Bioinformatics Specialist with Leidos

Programs with an emphasis on bioinformatics | Experience with C#, Python, Java, SQL, R, HTML, and CSS

Citation (CITATIONS.md)

# emel/metashort: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online].

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

Dependencies

modules/nf-core/blast/makeblastdb/meta.yml cpan
modules/nf-core/busco/busco/meta.yml cpan
modules/nf-core/checkm/lineagewf/meta.yml cpan
modules/nf-core/chopper/meta.yml cpan
modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan
modules/nf-core/fastp/meta.yml cpan
modules/nf-core/fastqc/meta.yml cpan
modules/nf-core/flye/meta.yml cpan
modules/nf-core/kraken2/kraken2/meta.yml cpan
modules/nf-core/krakentools/extractkrakenreads/meta.yml cpan
modules/nf-core/krakentools/kreport2krona/meta.yml cpan
modules/nf-core/krona/krona_db/meta.yml cpan
modules/nf-core/krona/ktimporttaxonomy/meta.yml cpan
modules/nf-core/krona/ktimporttext/meta.yml cpan
modules/nf-core/maxbin2/meta.yml cpan
modules/nf-core/medaka/meta.yml cpan
modules/nf-core/megahit/meta.yml cpan
modules/nf-core/metabat2/jgisummarizebamcontigdepths/meta.yml cpan
modules/nf-core/metabat2/metabat2/meta.yml cpan
modules/nf-core/metaphlan/meta.yml cpan
modules/nf-core/minimap2/meta.yml cpan
modules/nf-core/multiqc/meta.yml cpan
modules/nf-core/nanoplot/meta.yml cpan
modules/nf-core/porechop/porechop/meta.yml cpan
modules/nf-core/quast/meta.yml cpan
modules/nf-core/samtools/fastq/meta.yml cpan
modules/nf-core/samtools/index/meta.yml cpan
modules/nf-core/samtools/sort/meta.yml cpan
modules/nf-core/spades/meta.yml cpan
modules/nf-core/trimmomatic/meta.yml cpan
modules/nf-core/blast/makeblastdb/environment.yml pypi
modules/nf-core/busco/busco/environment.yml pypi
modules/nf-core/checkm/lineagewf/environment.yml pypi
modules/nf-core/chopper/environment.yml pypi
modules/nf-core/fastp/environment.yml pypi
modules/nf-core/flye/environment.yml pypi
modules/nf-core/kraken2/kraken2/environment.yml pypi
modules/nf-core/krakentools/extractkrakenreads/environment.yml pypi
modules/nf-core/krakentools/kreport2krona/environment.yml pypi
modules/nf-core/krona/krona_db/environment.yml pypi
modules/nf-core/krona/ktimporttaxonomy/environment.yml pypi
modules/nf-core/krona/ktimporttext/environment.yml pypi
modules/nf-core/maxbin2/environment.yml pypi
modules/nf-core/medaka/environment.yml pypi
modules/nf-core/megahit/environment.yml pypi
modules/nf-core/metabat2/jgisummarizebamcontigdepths/environment.yml pypi
modules/nf-core/metabat2/metabat2/environment.yml pypi
modules/nf-core/metaphlan/environment.yml pypi
modules/nf-core/minimap2/environment.yml pypi
modules/nf-core/nanoplot/environment.yml pypi
modules/nf-core/porechop/porechop/environment.yml pypi
modules/nf-core/samtools/fastq/environment.yml pypi
modules/nf-core/samtools/index/environment.yml pypi
modules/nf-core/samtools/sort/environment.yml pypi
modules/nf-core/spades/environment.yml pypi
pyproject.toml pypi