gari

https://github.com/rki-mf1/gari

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: rki-mf1
License: mit
Language: Nextflow
Default Branch: main
Size: 2.59 MB

Statistics

Stars: 0
Watchers: 5
Forks: 0
Open Issues: 1
Releases: 3

Created over 1 year ago · Last pushed 10 months ago

Metadata Files

Readme Changelog License Citation

Generic Assembly and Reconstruction pIpeline (GARI)

Introduction

Nextflow pipeline for the de novo genome reconstruction of bacterial pathogens. The pipeline comprises the following steps/modules:

1. Read QC
   - fastp to remove remaining adapters and perform very basic quality trimming
   - Kraken2 to check for contamination
2. Genome Assembly/Reconstruction
   - spades (default) / shovill / skesa
3. Assembly QC
   - bbrename to rename contigs and remove contigs < 200bp (default value)
   - bbmap to remap the reads to the assembly and calculate coverage, etc.
   - assembly-scan to produce general assembly statistics
   - Kraken2 to check for contamination
   - skani to identify the reference genome with the highest nucleotide identity
   - CheckM to check for genomic completeness and contamination using conserved single-copy core genes
4. Classification combining read & assembly parameters

Usage

Note: If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. (NOT SETUP YET...)

Now, you can install the pipeline using:

```bash
# get the most recent pipeline version (the same command updates the pipeline)
nextflow pull rki-mf1/GARI

# check the available release versions and development branches
nextflow info rki-mf1/GARI

# select a recent release and run
# (replace <profile> with singularity, conda, mamba or docker)
nextflow run rki-mf1/GARI -r v1.1.1 -profile <profile> -params-file params.yaml
```

Another option is to clone the repository and run the pipeline from there, but we recommend using the nextflow pull option and stable release versions via -r.

The pipeline needs a few input parameters to be defined. This can be done either directly on the command line or via a parameter file (params.yaml as in the command above). Using a params file is advised. Here is a minimal example of a params file with the required parameters:

params.yaml:

    input: '/path/to/input/samplesheet.csv'
    outdir: '/path/to/output'
    skani_db: '/path/to/skani_database'
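As a rough illustration of the minimal params file above, the following Python sketch renders such a file and refuses to do so when a required parameter is missing. The helper name `render_params_yaml` is hypothetical and not part of GARI; the sketch also assumes every value is a plain string, so boolean or integer parameters would need separate handling.

```python
# Required keys as listed in the minimal params.yaml example.
REQUIRED_KEYS = ("input", "outdir", "skani_db")


def render_params_yaml(params: dict) -> str:
    """Return YAML text for a flat mapping of string parameters.

    Hypothetical helper for illustration only; it is not shipped with GARI.
    """
    missing = [key for key in REQUIRED_KEYS if not params.get(key)]
    if missing:
        raise ValueError(f"missing required parameter(s): {', '.join(missing)}")
    lines = []
    for key, value in params.items():
        # YAML single-quoted scalars escape a single quote by doubling it.
        escaped = str(value).replace("'", "''")
        lines.append(f"{key}: '{escaped}'")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_params_yaml({
        "input": "/path/to/input/samplesheet.csv",
        "outdir": "/path/to/output",
        "skani_db": "/path/to/skani_database",
    }))
```

Writing the returned text to params.yaml gives a file that can be passed via -params-file.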

The additional flags in the command, e.g. "-profile", define how the pipeline is executed: singularity, conda, mamba or docker (we recommend using singularity, if available). When executing the pipeline on an HPC with a queuing system, you might want to limit the number of jobs submitted in parallel. Adding the option "-queue-size 20" to the nextflow command above limits the jobs submitted to the queue to 20.
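For scripted runs, the command line described here could be assembled as in the sketch below. It uses only the flags named in this section (-r, -profile, -params-file, -queue-size); the function name `build_gari_command` is a hypothetical helper, not something provided by the pipeline.

```python
import shlex


def build_gari_command(release, profile, params_file, queue_size=None):
    """Assemble the nextflow call as an argument list (hypothetical helper)."""
    cmd = [
        "nextflow", "run", "rki-mf1/GARI",
        "-r", release,
        "-profile", profile,
        "-params-file", params_file,
    ]
    if queue_size is not None:
        if queue_size < 1:
            raise ValueError("queue_size must be a positive integer")
        # Caps the number of jobs submitted to the scheduler in parallel.
        cmd += ["-queue-size", str(queue_size)]
    return cmd


if __name__ == "__main__":
    command = build_gari_command("v1.1.1", "singularity", "params.yaml", queue_size=20)
    # Print a copy-pasteable shell line instead of launching anything.
    print(shlex.join(command))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.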

Inputs:

Parameters and Filter Options

| name | required (to set by user) | description | type in config | default value |
|---|---|---|---|---|
| input | YES | path to input samplesheet in csv format (more detailed explanation below) | string | null |
| outdir | YES | path to output directory | string | null |
| skani_db | YES | path to precomputed skani database to use for reference/species verification | string | null |
| kraken_db | NO | path to precomputed Kraken2 database to use for classification | string | null (will download and use the babykraken DB if no local DB is specified) |
| tmpdir | NO | path to temp directory (used for some processes) | string | /tmp/ |
| qcmode | NO | if set to true, expects assemblies as input and only performs QC | boolean | false |
| preset | NO | predefined preset settings to use (more details below) | string | |
| assembler | NO | assembler to use (options: 'spades', 'shovill' or 'skesa') | string | 'spades' |
| checkmdb | NO | path to local copy of the CheckM database; if not set, CheckM downloads the database itself | string | |
| minsize | NO | minimum contig size in bp; shorter contigs are removed | integer | 200 |
| fastpparams | NO | additional parameters to add to the FASTP command | string | '--detect_adapter_for_pe' |
| assemblyscanparams | NO | additional parameters to add to the assembly-scan command | string | '--json' (json flag required for QC assessment!) |
| krakenRparams | NO | additional parameters to add to the KRAKEN2 command assessing reads | string | '--minimum-base-quality 10 --minimum-hit-groups 3 --confidence 0.05' |
| krakenAparams | NO | additional parameters to add to the KRAKEN2 command assessing assemblies | string | '--minimum-base-quality 10 --minimum-hit-groups 3 --confidence 0.05' |
| spadesparams | NO | additional parameters to add to the SPADES command | string | '--isolate' |
| shovillparams | NO | additional parameters to add to the SHOVILL command | string | |
| skesaparams | NO | additional parameters to add to the SKESA command | string | |
| skaniparams | NO | additional parameters to add to the skani command | string | '--mode genome' |
| publishdirenabled | NO | - | boolean | false |
| publishdirmode | NO | - | string | 'copy' |

Detailed walkthrough

First, prepare a samplesheet with your input data that looks as follows, with each row representing a pair of fastq files (paired end):

samplesheet.csv:

    sample,fastq_1,fastq_2,species
    S1,/path/to/S1_R1.fastq.gz,/path/to/S1_R2.fastq.gz,Escherichia coli
    S2,/path/to/S2_R1.fastq.gz,/path/to/S2_R2.fastq.gz,Acinetobacter baumannii
    ...

A samplesheet like the one shown above can be created using the Python script 'createSampleSheetGARI.py' within the bin folder. Given a directory containing paired-end reads and a separator/delimiter to reduce the filename to a sample ID, this script creates a samplesheet directly usable for GARI (for more information on usage, check the help function of the Python script). If no species information is available or provided, the species field is filled with NA. In this case, no comparison to the identified reference is performed.

When run in QC mode (--qcmode true), assemblies need to be provided as input in the samplesheet. For this purpose, assemblies are added to the fastq_1 column while the fastq_2 column is left empty. A samplesheet in QC mode would look like:

    sample,fastq_1,fastq_2,species
    S1,/path/to/S1_ASM.fasta,,Escherichia coli
    S2,/path/to/S2_ASM.fasta,,Acinetobacter baumannii
    ...

The kraken_db is the path of the Kraken2 database used to classify reads and assembly. Some precomputed Kraken2 databases can be found here.
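To make the paired-end samplesheet layout from this walkthrough concrete, here is a minimal Python sketch that pairs read files and emits the four expected columns. It is not the bundled createSampleSheetGARI.py, whose options may differ; the function names are hypothetical, and it assumes file names contain `_R1` / `_R2` and that the sample ID is everything before the first delimiter.

```python
import csv
import io
from pathlib import PurePosixPath

HEADER = ["sample", "fastq_1", "fastq_2", "species"]


def pair_reads(files, delimiter="_", species="NA"):
    """Group read files into one row per sample (illustrative assumption:
    forward/reverse files are marked with _R1 and _R2 in the file name)."""
    pairs = {}
    for file in files:
        name = PurePosixPath(file).name
        if "_R1" in name:
            slot = 0
        elif "_R2" in name:
            slot = 1
        else:
            continue  # not recognisable as part of a read pair
        sample = name.split(delimiter)[0]
        pairs.setdefault(sample, [None, None])[slot] = file
    rows = []
    for sample in sorted(pairs):
        r1, r2 = pairs[sample]
        if r1 is None or r2 is None:
            raise ValueError(f"incomplete read pair for sample {sample}")
        # Species defaults to NA when no information is provided.
        rows.append([sample, r1, r2, species])
    return rows


def to_csv(rows):
    """Serialise the rows with the samplesheet header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    reads = ["/path/to/S1_R1.fastq.gz", "/path/to/S1_R2.fastq.gz"]
    print(to_csv(pair_reads(reads, species="Escherichia coli")), end="")
```

For QC mode, the same header applies, with the assembly path in fastq_1 and an empty fastq_2 field.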

The skani_db is the path of the skani database used to check and identify the closest reference genome. We recommend the use of the GTDB database. A link to precomputed databases as well as a tutorial on how to set up a local version can be found here.

Warning: Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Thresholds and QC filters:

GARI outputs a variety of assembly and assembly quality statistics that can be found in the final GARI QC report. A subset of these values is used to "classify" each generated assembly into the categories "PASSED", "FLAGGED" and "FAILED".

Here is an overview of parameters and thresholds used for the assessment:

| parameter | only species specific | default threshold | description |
|---|---|---|---|
| flagmaxtotalcontigs | NO | 500 | - |
| flagAvgCov | NO | 50 | - |
| failAvgCov | NO | 30 | - |
| flagPercMapped | NO | 95 | - |
| flagrefident | NO | 90 | - |
| flagcheckMcomplete | NO | 98 | - |
| failcheckMcomplete | NO | 95 | - |
| flagcheckMcontamination | NO | 2 | - |
| failcheckMcontamination | NO | 10 | - |
| flagkrakenTarget | NO | 60 | - |
| flagkrakenHost | NO | 5 | - |
| flagmaxlength | YES | - | - |
| failmaxlength | YES | - | - |
| flagminlength | YES | - | - |
| failminlength | YES | - | - |
| flagmaxGC | YES | - | - |
| flagminGC | YES | - | - |

Thresholds are defined in the file QC_thresholds.json within the GARI assets folder. Species-specific thresholds are defined for 56 relevant bacterial pathogens. These species-specific thresholds were calculated using complete RefSeq genomes from NCBI. Thresholds to flag assemblies are generally based on the 10/90% quantiles of the RefSeq assemblies, while thresholds to fail assemblies are based on the 0/100% quantiles.

Feel free to add a new species to QC_thresholds.json if needed.
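The following Python sketch shows how a subset of the default thresholds could map an assembly to "PASSED", "FLAGGED" or "FAILED". It is an illustration, not GARI's actual classification code. The threshold names and values are copied from the table as shown, but the comparison directions (coverage and completeness as lower bounds, contamination and contig count as upper bounds), the strict inequalities and the keys of the `stats` dictionary are all assumptions.

```python
# Default values copied from the threshold table; only a subset is used here.
DEFAULT_THRESHOLDS = {
    "flagmaxtotalcontigs": 500,
    "flagAvgCov": 50,
    "failAvgCov": 30,
    "flagcheckMcomplete": 98,
    "failcheckMcomplete": 95,
    "flagcheckMcontamination": 2,
    "failcheckMcontamination": 10,
}


def classify(stats, thresholds=DEFAULT_THRESHOLDS):
    """Return PASSED, FLAGGED or FAILED for one assembly.

    `stats` uses hypothetical keys: total_contigs, avg_cov,
    completeness, contamination. Comparison directions are assumed.
    """
    t = thresholds
    failed = (
        stats["avg_cov"] < t["failAvgCov"]
        or stats["completeness"] < t["failcheckMcomplete"]
        or stats["contamination"] > t["failcheckMcontamination"]
    )
    if failed:
        return "FAILED"
    flagged = (
        stats["total_contigs"] > t["flagmaxtotalcontigs"]
        or stats["avg_cov"] < t["flagAvgCov"]
        or stats["completeness"] < t["flagcheckMcomplete"]
        or stats["contamination"] > t["flagcheckMcontamination"]
    )
    return "FLAGGED" if flagged else "PASSED"


if __name__ == "__main__":
    example = {"total_contigs": 120, "avg_cov": 40,
               "completeness": 99.5, "contamination": 0.5}
    print(classify(example))
```

Checking the fail conditions first means an assembly that violates both a flag and a fail threshold is reported as FAILED, which is the stricter of the two outcomes.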

Presets:

GARI allows you to define presets of run-wide parameters. This is mostly useful when running GARI with specific settings that should also be applied to a later dataset, so the parameters don't need to be specified for each run.

Such presets are defined in the presets.config within the config folder. Setting the preset parameter to any string defined in the presets.config will automatically load these settings and apply them to the GARI run. Feel free to define new presets within bin/presets.config if needed.

HPC & QOL

When running GARI on the HPC, you need to set the executor to use slurm. This can be done either in a config file provided via -c or by defining it in e.g. your .bashrc like:

    export NXF_EXECUTOR=slurm

Additionally, you can set the directory where singularity images used to run nextflow modules are cached, so they can be reused and do not need to be downloaded for each execution of GARI:

    export NXF_SINGULARITY_CACHEDIR='/path/to/NEXTFLOW_CACHE_folder/'
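When GARI is launched from a wrapper script rather than an interactive shell, the same two variables can be set for the child process only. The sketch below assumes a hypothetical helper `nextflow_env`; the variable names NXF_EXECUTOR and NXF_SINGULARITY_CACHEDIR are the ones given in this section.

```python
import os


def nextflow_env(cache_dir, executor="slurm", base=None):
    """Return a copy of the environment with the Nextflow settings added.

    Hypothetical helper; `base` defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    env["NXF_EXECUTOR"] = executor
    # Cached singularity images are reused instead of downloaded per run.
    env["NXF_SINGULARITY_CACHEDIR"] = cache_dir
    return env


if __name__ == "__main__":
    env = nextflow_env("/path/to/NEXTFLOW_CACHE_folder/")
    print(env["NXF_EXECUTOR"], env["NXF_SINGULARITY_CACHEDIR"])
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when starting nextflow, which leaves the calling shell's environment untouched.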

Credits

GARI was originally written by Maximilian Driller.

We thank the following people for their extensive assistance in the development of this pipeline:

Silver A. Wolf, Torsten Houwaart, Lakshmipriya Thrukonda, Vladimir Bajić and Mustafa Helal

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Shifu Chen, Yanqing Zhou, Yaru Chen, Jia Gu, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17, September 2018, Pages i884–i890, https://doi.org/10.1093/bioinformatics/bty560

Prjibelski, A., Antipov, D., Meleshko, D., Lapidus, A., & Korobeynikov, A. (2020). Using SPAdes de novo assembler. Current Protocols in Bioinformatics, 70, e102. doi: 10.1002/cpbi.102

Wood, D.E., Lu, J. & Langmead, B. Improved metagenomic analysis with Kraken 2. Genome Biol 20, 257 (2019). https://doi.org/10.1186/s13059-019-1891-0

Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015 Jul;25(7):1043-55. doi: 10.1101/gr.186072.114. Epub 2015 May 14. PMID: 25977477; PMCID: PMC4484387.

Souvorov, A., Agarwala, R. & Lipman, D. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biol 19, 153 (2018). https://doi.org/10.1186/s13059-018-1540-z

Jim Shaw and Yun William Yu. Fast and robust metagenomic sequence comparison through sparse chaining with skani. Nature Methods (2023). https://doi.org/10.1038/s41592-023-02018-3

Owner

Name: RKI MF1 Bioinformatics
Login: rki-mf1
Kind: organization
Location: Germany

Repositories: 9
Profile: https://github.com/rki-mf1

Bioinformatics code of MF1

Citation (CITATIONS.md)

# gari/gari: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

- [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

  > Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online https://www.bioinformatics.babraham.ac.uk/projects/fastqc/.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

  > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Create event: 4
Release event: 2
Issues event: 5
Member event: 2
Issue comment event: 2
Push event: 12
Public event: 1
Pull request event: 2

Last Year

Create event: 4
Release event: 2
Issues event: 5
Member event: 2
Issue comment event: 2
Push event: 12
Public event: 1
Pull request event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: 3 months
Average time to close pull requests: 3 months
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 1.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 1
Average time to close issues: 3 months
Average time to close pull requests: 3 months
Issue authors: 1
Pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 1.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

gari

Science Score: 57.0%
