snvphylnfc

SNVPhyl nf-core Pipeline (Development)

https://github.com/phac-nml/snvphylnfc

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.1%) to scientific vocabulary

Repository

Basic Info

Host: GitHub
Owner: phac-nml
License: mit
Language: HTML
Default Branch: main
Size: 2.34 MB

Statistics

Stars: 4
Watchers: 5
Forks: 1
Open Issues: 6
Releases: 10

Created over 2 years ago · Last pushed 11 months ago

Metadata Files

Readme, Changelog, Contributing, License, Code of conduct, Citation

SNVPhyl nf-core Pipeline

This is the nf-core-based pipeline for SNVPhyl. The SNVPhyl (Single Nucleotide Variant PHYLogenomics) pipeline identifies Single Nucleotide Variants (SNV) within a collection of microbial genomes and constructs a phylogenetic tree from those SNVs. This pipeline is designed to be integrated into IRIDA Next. However, it may be run as a stand-alone pipeline.

Input

Input is provided to SNVPhyl in the form of a samplesheet (passed as --input samplesheet.csv). This samplesheet is a CSV-formatted file, which may be provided as a URI (ex: a file path or web address), and has the following format:

| sample | sample_name | fastq_1 | fastq_2 | reference_assembly | metadata_1 | metadata_2 | metadata_3 | metadata_4 | metadata_5 | metadata_6 | metadata_7 | metadata_8 |
| ------- | ------------ | -------------------------- | -------------------------- | ---------------------------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| SAMPLE1 | samplename1 | /path/to/sample1fastq1.fq | /path/to/sample1fastq2.fq | /path/to/sample1assembly.fa | meta1 | meta2 | meta3 | meta4 | meta5 | meta6 | meta7 | meta8 |
| SAMPLE2 | samplename2 | /path/to/sample2fastq1.fq | | | meta1 | meta2 | meta3 | meta4 | meta5 | meta6 | meta7 | meta8 |

The columns are defined as follows:

sample: (Mandatory) A unique identifier for the sample, associated with its reads (and, optionally, its reference assembly).
sample_name: (Optional) Overrides sample in outputs (filenames and sample names) and when identifying the reference assembly.
fastq_1: A URI (ex: a file path or web address) to either single-end FASTQ-formatted reads or one of the two files of paired-end FASTQ-formatted reads.
fastq_2: (Optional) If fastq_1 is paired-end, a URI to the other file of the read pair associated with fastq_1.
reference_assembly: (Optional) A URI to a reference assembly associated with the sample. This lets the sample be named on the command line (with --reference_sample_id) so that its assembly is used as the reference for the whole pipeline. However, it may be easier to leave this field blank and specify the reference using the --refgenome parameter.
metadata_1...8: (Optional) Permits up to 8 columns for user-defined contextual metadata associated with each sample. Refer to Metadata for more information.
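
As a rough illustration of the column layout described above, the following Python sketch writes a samplesheet from a list of dictionaries. The function name `build_samplesheet` and the example paths are hypothetical and not part of the pipeline; the authoritative definition of the file remains assets/schema_input.json.

```python
import csv
import io

# Column order taken from the samplesheet format described above.
COLUMNS = (
    ["sample", "sample_name", "fastq_1", "fastq_2", "reference_assembly"]
    + [f"metadata_{i}" for i in range(1, 9)]
)


def build_samplesheet(rows):
    """Return CSV text for a list of dicts; missing optional fields stay empty."""
    for row in rows:
        if not row.get("sample"):
            raise ValueError("every row needs a 'sample' identifier")
    ids = [row["sample"] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("'sample' identifiers must be unique")

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows: a paired-end sample with an assembly, and a single-end sample.
example = build_samplesheet([
    {"sample": "SAMPLE1", "sample_name": "name1",
     "fastq_1": "/path/to/s1_R1.fq", "fastq_2": "/path/to/s1_R2.fq",
     "reference_assembly": "/path/to/s1.fa", "metadata_1": "meta1"},
    {"sample": "SAMPLE2", "fastq_1": "/path/to/s2.fq"},
])
print(example)
```

The uniqueness check mirrors the requirement that sample is a unique identifier; everything else is left for the pipeline's own schema validation.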

When to use `sample` vs `sample_name`

Either can be used to identify the reference assembly with the parameter --reference_sample_id.

sample is a unique identifier, designed to be used internally or in IRIDA-Next, or when sample_name is not provided.

sample_name allows more flexibility in naming output files and identifying samples. Unlike sample, sample_name is not required to contain unique values. Nextflow requires unique sample names, so when a sample_name is repeated, sample is appended as a suffix to that sample_name. Non-alphanumeric characters (excluding _, -, and .) are replaced with "_".
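
The naming rules above can be sketched in Python as follows. This is only an approximation of the described behaviour: the helper names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and the `<sample_name>_<sample>` suffix format is an assumption, since the exact separator used by the pipeline is not stated here.

```python
import re
from collections import Counter


def clean_name(name):
    """Replace every character except ASCII letters, digits, '_', '-', '.' with '_'."""
    return re.sub(r"[^A-Za-z0-9_.\-]", "_", name)


def resolve_names(rows):
    """Map each unique `sample` to the name used for outputs.

    Assumption: a repeated sample_name is disambiguated as
    '<sample_name>_<sample>'; the pipeline's actual suffix format may differ.
    """
    chosen = {
        row["sample"]: clean_name(row.get("sample_name") or row["sample"])
        for row in rows
    }
    counts = Counter(chosen.values())
    return {
        sample: f"{name}_{clean_name(sample)}" if counts[name] > 1 else name
        for sample, name in chosen.items()
    }


names = resolve_names([
    {"sample": "S1", "sample_name": "ecoli 1"},
    {"sample": "S2", "sample_name": "ecoli 1"},
    {"sample": "S3"},
])
print(names)
```

Here S3 falls back to its sample identifier because no sample_name is given, while the two rows sharing a sample_name are made distinct.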

The structure of this file is defined in assets/schema_input.json. Please see assets/samplesheet.csv to see an example of a samplesheet for this pipeline.

Parameters

Mandatory

The mandatory parameters are as follows:

--input: a URI to the samplesheet
--outdir: the directory for pipeline output

Additionally, it is mandatory to have one of either --refgenome or --reference_sample_id (but not both) to specify the reference. Please see the Reference section for more details.

Metadata

To customize metadata headers, the parameters --metadata_1_header through --metadata_8_header may be specified. These parameters rename the headers in the final metadata table from their defaults (e.g., rename metadata_1 to country).
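
To make the header renaming concrete, here is a small Python sketch. Both helper functions are hypothetical; only the parameter names --metadata_1_header through --metadata_8_header and the default header names come from the text above.

```python
def metadata_headers(overrides=None):
    """Return the 8 metadata headers, with overrides keyed by column number (1-8)."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(range(1, 9)))
    if unknown:
        raise ValueError(f"metadata columns are numbered 1-8, got {unknown}")
    return [overrides.get(i, f"metadata_{i}") for i in range(1, 9)]


def header_args(overrides):
    """Build the matching command-line arguments for the chosen overrides."""
    args = []
    for i in sorted(overrides):
        args += [f"--metadata_{i}_header", overrides[i]]
    return args


print(metadata_headers({1: "country"}))
print(header_args({1: "country", 2: "year"}))
```

Columns without an override keep their default header.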

Optional

The optional parameters are as follows:

Reference

--refgenome: a URI to the reference genome to use during pipeline analysis
--reference_sample_id: the sample identifier of a sample (sample or sample_name) in the samplesheet that contains a provided reference_assembly to use as a reference genome during pipeline analysis

Please use only one of --refgenome or --reference_sample_id and not both.
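
The "exactly one of" rule can be illustrated with the following Python sketch. The function `resolve_reference` is hypothetical and only mimics the described behaviour: it accepts either a reference URI or a sample identifier, and looks the identifier up against both sample and sample_name.

```python
def resolve_reference(rows, refgenome=None, reference_sample_id=None):
    """Return the reference URI, enforcing that exactly one option is given."""
    if bool(refgenome) == bool(reference_sample_id):
        raise ValueError(
            "specify exactly one of --refgenome or --reference_sample_id"
        )
    if refgenome:
        return refgenome
    for row in rows:
        # Either the unique sample id or the sample_name may identify the row.
        if reference_sample_id in (row.get("sample"), row.get("sample_name")):
            if row.get("reference_assembly"):
                return row["reference_assembly"]
            raise ValueError(f"{reference_sample_id!r} has no reference_assembly")
    raise ValueError(f"{reference_sample_id!r} not found in samplesheet")


rows = [
    {"sample": "S1", "sample_name": "name1", "reference_assembly": "/path/to/s1.fa"},
    {"sample": "S2", "sample_name": "name2"},
]
print(resolve_reference(rows, reference_sample_id="name1"))
```

How the pipeline itself handles a sample_name shared by several rows is not covered by this sketch; it simply returns the first match.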

SNVPhyl Parameters

--window_size: The window size for determining whether a region is high density.
--density_threshold: The minimum number of SNVs within the window size for a region to be considered high density.
--min_coverage_depth: The minimum depth of coverage for a position within the genome to pass the mapping quality check.
--min_mapping_percent_cov: The total percentage of positions within the genome that must have a depth of coverage greater than the minimum depth of coverage specified in order to pass the mapping quality check.
--min_mean_mapping_quality: The minimum mean mapping quality score for all reads in a pileup to be included in the analysis.
--snv_abundance_ratio: The proportion of reads required to support a variant to be included in the analysis.
--min_repeat_length: The minimum length when identifying repeats on the reference genome.
--min_repeat_pid: The minimum percent identity when identifying repeats on the reference genome.
--skip_density_filter: Whether or not to skip filtering of high SNV density regions.

Please refer to the SNVPhyl documentation for more detailed information about pipeline parameters.
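
To give an intuition for how --window_size and --density_threshold interact, the Python sketch below flags SNV positions that fall in a window containing at least the threshold number of SNVs. It is a simplified illustration under stated assumptions, not the SNVPhyl implementation: the function names are hypothetical, positions are taken from a single sequence, and a window is assumed to cover `window_size` consecutive bases.

```python
def high_density_positions(positions, window_size, density_threshold):
    """Return SNV positions lying in a window of `window_size` bases
    that holds at least `density_threshold` SNVs."""
    positions = sorted(positions)
    flagged = set()
    start = 0
    for end, pos in enumerate(positions):
        # Advance the left edge until the window spans fewer than window_size bases.
        while pos - positions[start] >= window_size:
            start += 1
        if end - start + 1 >= density_threshold:
            flagged.update(positions[start:end + 1])
    return sorted(flagged)


def filter_snvs(positions, window_size, density_threshold, skip_density_filter=False):
    """Drop high-density SNVs unless the density filter is skipped."""
    if skip_density_filter:
        return sorted(positions)
    dense = set(high_density_positions(positions, window_size, density_threshold))
    return [p for p in sorted(positions) if p not in dense]


snvs = [10, 12, 15, 500, 900, 905]
print(high_density_positions(snvs, window_size=20, density_threshold=3))
print(filter_snvs(snvs, window_size=20, density_threshold=3))
```

With these made-up positions, the three SNVs clustered near position 10 are treated as a high-density region and removed, while the isolated ones are kept.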

Other Parameters

-profile: specifies which profiles to use (ex: -profile singularity)
-r: specifies which revision to use (ex: -r dev)

Running

Test Data

In order to run the pipeline with provided data, please run:

nextflow run phac-nml/snvphylnfc -profile singularity --input https://raw.githubusercontent.com/phac-nml/snvphylnfc/dev/assets/samplesheet.csv --refgenome https://raw.githubusercontent.com/phac-nml/snvphylnfc/dev/assets/reference.fasta --outdir results

The pipeline output will be written to a directory named results. A JSON file for integrating data with IRIDA Next will be written to results/iridanext.output.json.gz (please see the Output section for details).

It is also possible to run the pipeline using the test profile as follows:

nextflow run phac-nml/snvphylnfc -profile singularity,test --outdir results
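
When launching runs from a script, the commands above can be assembled programmatically. The following Python sketch only builds the argument list and does not execute anything; the function `nextflow_command` is hypothetical, and it uses only the options that appear in this document.

```python
import shlex


def nextflow_command(outdir, profiles, input_uri=None, refgenome=None, revision=None):
    """Assemble a `nextflow run` argument list for this pipeline."""
    cmd = ["nextflow", "run", "phac-nml/snvphylnfc", "-profile", ",".join(profiles)]
    if revision:
        cmd += ["-r", revision]
    if input_uri:
        cmd += ["--input", input_uri]
    if refgenome:
        cmd += ["--refgenome", refgenome]
    cmd += ["--outdir", outdir]
    return cmd


# Reproduces the test-profile invocation shown above.
test_cmd = nextflow_command("results", ["singularity", "test"])
print(shlex.join(test_cmd))
```

The resulting list could be passed to `subprocess.run(test_cmd, check=True)` on a machine where Nextflow and Singularity are installed.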

Output

Results

The following output files are generated by the pipeline:

make/snvMatrix.tsv: a pair-wise distance matrix of SNVs that passed all filtering criteria
filter/filterStats.txt: a summary of the number of SNVs filtered within the SNV table
phyml/phylogeneticTree.newick: the maximum likelihood phylogeny generated from an alignment of SNVs extracted from the whole genomes of each input file
phyml/phylogeneticTreeStats.txt: statistics for the generated phylogenetic tree
arbor/SNVPhyl_ArborView.html: an HTML file for examining the phylogenetic tree (.newick) dendrogram in the ArborView HTML app, complete with contextual metadata
vcf2snv/snvTable.tsv: a table of all detected variant sites
vcf2snv/vcf2core.tsv: a table of the evaluated core positions in each reference fasta sequence
vcf2snv/snvAlignment.phy: an alignment of SNVs used to generate the phylogenetic tree
verifying/mappingQuality.txt: describes how well the given reads mapped to the reference genome

For more detailed information, please refer to the SNVPhyl Documentation.

IRIDA Next Integration File

A JSON file for loading the data into IRIDA Next is output by this pipeline. The format of this JSON file is specified in our Pipeline Standards for the IRIDA Next JSON. This JSON file is written directly within the --outdir provided to the pipeline with the name iridanext.output.json.gz (ex: [outdir]/iridanext.output.json.gz).

```
{
    "files": {
        "global": [
            { "path": "arbor/SNVPhyl__ArborView.html" },
            { "path": "make/snvMatrix.tsv" },
            { "path": "filter/filterStats.txt" },
            { "path": "phyml/phylogeneticTreeStats.txt" },
            { "path": "phyml/phylogeneticTree.newick" },
            { "path": "vcf2snv/snvTable.tsv" },
            { "path": "vcf2snv/vcf2core.tsv" },
            { "path": "vcf2snv/snvAlignment.phy" },
            { "path": "verifying/mappingQuality.txt" }
        ],
        "samples": {

        }
    },
    "metadata": {
        "samples": {

        }
    }
}
```

Within the files section of this JSON file, all of the output paths are relative to the --outdir results. Therefore, "path": "phyml/phylogeneticTree.newick" refers to a file located within results/phyml/phylogeneticTree.newick.
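
As a minimal sketch of how the relative paths can be resolved, the Python snippet below reads the gzipped JSON file from an output directory and joins each global file path onto that directory. The function name `global_output_files` is hypothetical; the file name and the `files.global[].path` structure are taken from the example above.

```python
import gzip
import json
from pathlib import Path


def global_output_files(outdir):
    """Return the path of every files.global entry, resolved against outdir."""
    outdir = Path(outdir)
    json_path = outdir / "iridanext.output.json.gz"
    # "rt" mode decompresses and decodes the file so json can read it as text.
    with gzip.open(json_path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    return [outdir / entry["path"] for entry in data["files"]["global"]]
```

For a run with --outdir results, calling `global_output_files("results")` would yield paths such as results/phyml/phylogeneticTree.newick.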

Legal

nf-core

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references for the tools and data used in this pipeline are listed in CITATIONS.md.

SNVPhyl NF-Core Pipeline

Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Derivative Work

This pipeline includes source code from a nf-core pipeline for SNVPhyl developed by Jill Hagey as a work of the United States Government that was not subject to domestic copyright protection under 17 USC § 105. This work by the United States Government is in the public domain within the United States, and copyright and related rights for the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

The included source code developed by Jill Hagey as a work of the United States Government was distributed under the Apache Software License version 2. A copy of the Apache Software License is included in this repository.

Any such source files in this project that are included from or derived from the original work by Jill Hagey will include a notice.

Owner

Name: National Microbiology Laboratory
Login: phac-nml
Kind: organization

Website: https://www.nml-lnm.gc.ca/
Repositories: 50
Profile: https://github.com/phac-nml

Citation (CITATIONS.md)

# phac-nml/snvphylnfc: Citations

## [SNVPhyl](https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000116)

> Petkau A, Mabon P, Sieffert C, Knox N, Cabral J, Iskander M, Iskander M, Weedmark K, Zaheer R, Katz L, Nadon C, Reimer A, Taboada E, Beiko R, Hsiao W, Brinkman F, Graham M, Van Domselaar G. SNVPhyl: a single nucleotide variant phylogenomics pipeline for microbial genomic epidemiology. 08/06/2017. M Gen 3(6): doi:10.1099/mgen.0.000116.

## [SNVPhyl Nextflow](https://github.com/DHQP/SNVPhyl_Nextflow)

> Hagey J. SNVPhyl Nextflow. 2022-21-4. https://github.com/DHQP/SNVPhyl_Nextflow/.

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Create event: 6
Issues event: 2
Release event: 5
Watch event: 1
Delete event: 3
Issue comment event: 9
Push event: 34
Pull request review comment event: 1
Pull request review event: 6
Pull request event: 14

Last Year

Create event: 6
Issues event: 2
Release event: 5
Watch event: 1
Delete event: 3
Issue comment event: 9
Push event: 34
Pull request review comment event: 1
Pull request review event: 6
Pull request event: 14

Dependencies

.github/workflows/branch.yml actions

mshick/add-pr-comment v1 composite

.github/workflows/ci.yml actions

actions/cache v3 composite
actions/checkout v3 composite
nf-core/setup-nextflow v1 composite

.github/workflows/linting.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
mshick/add-pr-comment v1 composite
nf-core/setup-nextflow v1 composite
psf/black stable composite

.github/workflows/linting_comment.yml actions

dawidd6/action-download-artifact v2 composite
marocchino/sticky-pull-request-comment v2 composite

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

pyproject.toml pypi

modules/nf-core/cat/cat/meta.yml cpan

modules/nf-core/cat/cat/environment.yml pypi
