dryad

Dryad is a Nextflow pipeline for examining prokaryote relatedness. Dryad can perform a reference free analysis and/or SNP analysis.

https://github.com/wslh-bio/dryad

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: pubmed.ncbi, ncbi.nlm.nih.gov
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.2%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: wslh-bio
  • License: gpl-3.0
  • Language: Groovy
  • Default Branch: main
  • Homepage:
  • Size: 4.42 MB
Statistics
  • Stars: 18
  • Watchers: 3
  • Forks: 5
  • Open Issues: 0
  • Releases: 0
Created over 8 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog License Citation

README.md

Dryad

Dryad is a Nextflow pipeline for examining prokaryote relatedness. Dryad can perform a reference free analysis and/or SNP analysis.

Dryad analyzes FASTA files that have been processed by either Spriggan or PHoeNIx. Dryad is split into two major workflows:

1. A workflow dedicated to fine-scale investigations within a single outbreak. This alignment based workflow uses a reference to determine relatedness and SNP distances. The reference can be removed from the alignment to create a phylogenetic tree that gives a high-resolution look at a single outbreak.
2. A workflow dedicated to identifying historical relatedness across multiple years and multiple outbreaks without the use of a reference. This alignment free workflow gives a low-resolution look at historical relatedness.

Table of Contents:

Usage
Input
Parameters
Workflow
Output
Credits
Contributions and Support
Citations

Usage

[!NOTE] If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. Dryad requires Nextflow version 24.04.2.5914 or later.
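The minimum-version requirement above can be checked before launching a run. The following is a minimal sketch, not part of Dryad: `version_ge` is a hypothetical helper, it relies on `sort -V` (GNU coreutils), and the commented usage assumes that `nextflow -v` prints the version as the third word of its output.

```bash
# Hypothetical helper: succeeds when version $1 is at least version $2.
# Relies on version-aware sorting (sort -V), so the smaller of the two
# versions always ends up on the first line.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible usage (assumes `nextflow -v` prints "nextflow version X.Y.Z.BUILD"):
#   installed=$(nextflow -v | awk '{print $3}')
#   version_ge "$installed" 24.04.2.5914 || echo "Nextflow is older than Dryad requires" >&2
```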

To run an alignment free comparison, use:

    nextflow run wslh-bio/dryad \
        -latest \
        -profile <docker/singularity/.../institute> \
        --input samplesheet.csv \
        --outdir <OUTDIR> \
        --alignment_free

Alternatively, to run an alignment based comparison, use:

    nextflow run wslh-bio/dryad \
        -latest \
        -profile <docker/singularity/.../institute> \
        --input samplesheet.csv \
        --outdir <OUTDIR> \
        --fasta <REFERENCE_FASTA | random> \
        --alignment_based

To run both an alignment based and an alignment free comparison, use:

    nextflow run wslh-bio/dryad \
        -latest \
        -profile <docker/singularity/.../institute> \
        --input samplesheet.csv \
        --outdir <OUTDIR> \
        --fasta <REFERENCE_FASTA | random> \
        --alignment_based \
        --alignment_free

  • Nextflow caches previously run pipelines, which can result in an older version of a pipeline being used. To get the most up-to-date version of a pipeline like Dryad, use the -latest flag.

Input

Prepare a samplesheet with your input data with each row representing one fasta file. The samplesheet will look as follows:

samplesheet.csv:

| sample | fasta |
| ------------- | ------------- |
| sample1 | 20241.contigs.fa |
| sample2 | 20242.contigs.fa |
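For a directory containing many assemblies, the samplesheet can be generated rather than typed by hand. The sketch below is illustrative only: `make_samplesheet` is a hypothetical helper, and it assumes that assemblies end in `.contigs.fa` and that the sample name can be taken from the file name, neither of which Dryad requires.

```bash
# Hypothetical helper: print a samplesheet (columns: sample,fasta) for every
# assembly in a directory.
# Assumptions: files end in ".contigs.fa" (override with a second argument)
# and the sample name is the file name without that suffix.
make_samplesheet() {
    local fasta_dir="$1"
    local suffix="${2:-.contigs.fa}"
    local f
    echo "sample,fasta"
    for f in "$fasta_dir"/*"$suffix"; do
        [ -e "$f" ] || continue   # skip the unexpanded glob when nothing matches
        echo "$(basename "$f" "$suffix"),$f"
    done
}

# Possible usage:
#   make_samplesheet /path/to/assemblies > samplesheet.csv
```

The fasta column here holds whatever path was passed in, so giving the helper an absolute directory yields absolute paths.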

Parameters

Dryad's main parameters and their defaults are shown in the table below:

| Parameter | Parameter description and defaults | Example usage |
| ------------- | ------------- | ------------- |
| input | Path to comma-separated file containing information about the samples in the experiment | --input |
| outdir | Output directory where the results will be saved. Absolute path must be used for storage on cloud infrastructure | --outdir |
| profile | Denotes how to access containerized software. | -profile aws |
| fasta | Reference fasta used for alignment based comparisons. Default is no reference fasta. | --fasta |
| fasta random | Reference fasta used for alignment based comparisons is chosen by Parsnp's algorithm. Default is not to use a random fasta file as a reference. | --fasta random |
| alignment_based | Performs a fine scale analysis within a singular outbreak | --alignment_based |
| alignment_free | Performs a historical analysis across multiple years and outbreaks | --alignment_free |
| task.cpus | Denotes how many cpus to use for Mashtree. Default task.cpus is 2. | --task.cpus 4 |
| cg_tree_model | Tells IQ-TREE what model to use. Default cg_tree_model is GTR+G | --cg_tree_model "GTR+G" |
| parsnp_partition | Tells parsnp the minimum partition amount or to not partition. Default is --no-partition.* | --parsnp_partition "--min-partition-size 50" |
| skip_quast | If the data was run through PHoeNIx or another pipeline with a quality check, skips QUAST and the summary options. Default is to run QUAST as if quality summaries were not previously run. | --skip_quast |
| add_reference | Used to include reference in outputs. This option should not be used if you are using --fasta random. Default is false | --add_reference |

*If you are running an alignment based workflow on >100 samples, it may be beneficial to take into account a higher partitioning value than the default of 100. More information can be found in parsnp 2.0's paper.

Workflow

dryad_workflow

1. Universal Steps

  • Enter assembled FASTA genomes into a samplesheet.
  • QUAST v5.2.0 is used to determine assembly quality unless --skip_quast is specified.
  • QUAST results are summarized with a custom python script to increase readability.

2. Comparison Steps

  • Historical Comparison
  • Fine scale Comparison
    • Bootstrapping in IQ-TREE2 requires at least 4 genomes. If fewer than 4 genomes are used, IQ-TREE2 will not perform bootstrapping.
    • Parsnp v2.0.5 is used to perform a core genome alignment.
    • IQ-TREE2 v2.3.4 is used for inferring a phylogenetic tree.
    • Snp-dists v0.8.2 is used to calculate the SNP distance matrix.
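Because bootstrapping is skipped for fewer than 4 genomes, it can help to count the samplesheet rows before starting a fine scale run. This is a minimal sketch with hypothetical helper names (`count_samples`, `can_bootstrap`); it assumes the samplesheet has a single header row, and it does not account for whether the reference counts toward the four genomes, which the README does not state.

```bash
# Hypothetical helper: count non-blank data rows, skipping the header line.
count_samples() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Hypothetical helper: succeeds when the samplesheet lists at least 4 genomes,
# the minimum stated above for IQ-TREE2 bootstrapping.
can_bootstrap() {
    [ "$(count_samples "$1")" -ge 4 ]
}

# Possible usage:
#   can_bootstrap samplesheet.csv || echo "Fewer than 4 genomes: no bootstrap support expected" >&2
```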

Output

An example of Dryad's output directory structure for both alignment based and alignment free steps can be seen below. These directories will not include QUAST if --skip_quast is used:

    alignment_based_output/
    ├── compare
    │   └── sample_exclusion_status.csv
    ├── dryad
    │   └── dryad_summary.csv
    ├── iqtree
    │   └── parsnp.snps.mblocks.treefile
    ├── parse
    │   └── aligner_log.tsv
    ├── parsnp
    │   └── parsnp_output
    │       ├── config
    │       │   ├── all.mumi
    │       │   └── all_mumi.ini
    │       ├── log
    │       │   ├── harvest-mblocks.err
    │       │   ├── harvest-mblocks.out
    │       │   ├── parsnp-aligner.err
    │       │   ├── parsnpAligner.log
    │       │   ├── parsnp-aligner.out
    │       │   ├── parsnp-mumi.err
    │       │   ├── parsnp-mumi.out
    │       │   ├── raxml.err
    │       │   └── raxml.out
    │       ├── parsnpAligner.ini
    │       ├── parsnp.ggr
    │       ├── parsnp.maf
    │       ├── parsnp.snps.mblocks
    │       ├── parsnp.tree
    │       ├── parsnp.xmfa
    │       └── *.fna.ref
    ├── pipeline_info
    │   ├── execution_report_*.html
    │   ├── execution_timeline_*.html
    │   ├── execution_trace_*.txt
    │   ├── pipeline_dag_*.html
    │   └── samplesheet.valid.csv
    ├── quast
    │   ├── *.quast.report.tsv
    │   ├── *.transposed.quast.tsv
    │   └── quast_results.tsv
    ├── sample
    │   └── count.txt
    └── snpdists
        └── snp_dists_matrix.tsv

    alignment_free_output/
    ├── mashtree
    │   └── mashtree.bootstrap.dnd
    ├── pipeline_info
    │   ├── *.html
    │   ├── *.txt
    │   └── samplesheet.valid.csv
    ├── quast
    │   ├── *.quast.report.tsv
    │   ├── *.transposed.quast.report.tsv
    │   └── quast_results.tsv
    └── rejected_samples
        └── Empty_samples.csv

Notable output files:

Alignment based

| File | Output |
| ------------- | ------------- |
| quast_results.tsv* | Assembly quality results |
| snp_dists_matrix.tsv | SNP distances between each pair of isolates |
| parsnp.snps.mblocks.treefile | Maximum likelihood phylogenetic tree |
| aligner_log.tsv | Coverage statistics calculated by parsnp |
| excluded_samples_from_parsnp.txt | Lists samples that were excluded from parsnp's analysis due to a MUMi distance > 0.01 |
| dryad_summary.csv | Summarizes quast report, if run, and core genome percentages |
| Empty_samples.csv | Lists any samples that are empty and were removed from the pipeline |

*QUAST results will not be present if --skip_quast was used.
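The SNP distance matrix can be screened for closely related isolate pairs with a few lines of awk. This is a sketch under stated assumptions: `close_pairs` is a hypothetical helper, the matrix is assumed to be the tab-separated square layout written by snp-dists (a header row of sample names, then one row per sample), and the cutoff is supplied by the user since the README recommends no particular threshold.

```bash
# Hypothetical helper: list isolate pairs whose SNP distance is at or below
# a user-chosen cutoff.
# Assumption: tab-separated square matrix, first row and first column hold
# sample names in the same order.
close_pairs() {
    local matrix="$1"
    local cutoff="$2"
    awk -F '\t' -v cutoff="$cutoff" '
        NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
        {
            # Only read cells right of the diagonal so each pair appears once
            for (i = NR + 1; i <= NF; i++)
                if ($i + 0 <= cutoff + 0) print $1 "\t" name[i] "\t" $i
        }
    ' "$matrix"
}

# Possible usage (the cutoff of 10 is an arbitrary illustration):
#   close_pairs alignment_based_output/snpdists/snp_dists_matrix.tsv 10
```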

Alignment free

| File | Output |
| ------------- | ------------- |
| quast_results.tsv* | Assembly quality results |
| mashtree.bootstrap.dnd | Neighbor joining tree based on mash distances |
| Empty_samples.csv | Lists any samples that are empty and were removed from the pipeline |

*QUAST results will not be present if --skip_quast was utilized.

[!WARNING] Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
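As an illustration of the params-file route, the sketch below writes a YAML params file for a combined run. `write_params` is a hypothetical helper, the key names are assumed to mirror the CLI flags shown under Usage without the leading dashes, and `results` is a placeholder output directory.

```bash
# Hypothetical helper: write a Nextflow params file (YAML) to the given path.
# Assumptions: keys mirror the CLI flags from the Usage section; "results"
# and "samplesheet.csv" are placeholder values.
write_params() {
    cat > "$1" <<'EOF'
input: samplesheet.csv
outdir: results
fasta: random
alignment_based: true
alignment_free: true
EOF
}

# Possible usage:
#   write_params params.yaml
#   nextflow run wslh-bio/dryad -latest -profile docker -params-file params.yaml
```

Keeping parameters in a file like this, rather than in a config passed with -c, follows the warning above.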

Credits

Dryad was written by Dr. Kelsey Florek, Dr. Abigail C. Shockey, and Eva Gunawan.

We thank the bioinformatics group at the Wisconsin State Laboratory of Hygiene for all of their contributions.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use Dryad for your analysis, please cite it using the following:

K. Florek, A.C. Shockey, & E. Gunawan (2014). Dryad (Version 4.1.1) [https://github.com/wslh-bio/dryad].

An extensive list of references for the tools used by Dryad can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

Owner

  • Name: WSLH Bioinformatics
  • Login: wslh-bio
  • Kind: organization

Wisconsin State Laboratory of Hygiene Bioinformatics

Citation (CITATIONS.md)

# wslh-bio/dryad: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

## [Quast](https://quast.sourceforge.net/docs/manual.html)
> A. Mikheenko, A. Prjibelski, V. Saveliev, D. Antipov, A. Gurevich, Versatile genome assembly evaluation with QUAST-LG, Bioinformatics (2018) 34 (13): i142-i150. doi: 10.1093/bioinformatics/bty266

## [Parsnp](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0524-x)
> Kille B, Nute MG, Huang V, Kim E, Phillippy AM, Treangen TJ: Parsnp 2.0: Scalable Core-Genome Alignment for Massive Microbial Datasets. bioRxiv (2024). doi: https://doi.org/10.1101/2024.01.30.577458

## [IQ-TREE](https://doi.org/10.1093/molbev/msaa015)
> D.T. Hoang, O. Chernomor, A. von Haeseler, B.Q. Minh, and L.S. Vinh (2018) UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol., 35:518–522. https://doi.org/10.1093/molbev/msx281

## [snp-dists](https://github.com/tseemann/snp-dists)
> T. Seemann, F. Klotzl, A. Page (2014). Snp-Dists (Version 0.8.2) [https://github.com/tseemann/snp-dists].

## [Mashtree](https://doi.org/10.21105/joss.01762)
> Katz, L. S., Griswold, T., Morrison, S., Caravas, J., Zhang, S., den Bakker, H.C., Deng, X., and Carleton, H. A., (2019). Mashtree: a rapid comparison of whole genome sequence files. Journal of Open Source Software, 4(44), 1762, https://doi.org/10.21105/joss.01762

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total
  • Create event: 2
  • Release event: 1
  • Issues event: 1
  • Watch event: 1
  • Push event: 61
  • Pull request review event: 4
  • Pull request event: 2
Last Year
  • Create event: 2
  • Release event: 1
  • Issues event: 1
  • Watch event: 1
  • Push event: 61
  • Pull request review event: 4
  • Pull request event: 2

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 285
  • Total Committers: 3
  • Avg Commits per committer: 95.0
  • Development Distribution Score (DDS): 0.611
Past Year
  • Commits: 4
  • Committers: 1
  • Avg Commits per committer: 4.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Abigail Shockey a****y@g****m 111
Kelsey Florek n****k@g****m 96
Kelsey Florek k****2@g****m 78

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 1
  • Total pull requests: 19
  • Average time to close issues: N/A
  • Average time to close pull requests: 5 days
  • Total issue authors: 1
  • Total pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.16
  • Merged pull requests: 15
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 4
  • Average time to close issues: N/A
  • Average time to close pull requests: 6 minutes
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • k-florek (1)
Pull Request Authors
  • evagunawan (13)
  • AbigailShockey (11)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels

Dependencies

.github/workflows/dryad_build.yml actions
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • aws-actions/configure-aws-credentials v1 composite