speciesabundance

Estimate the relative abundance of sequence reads originating from different species in a sample.

https://github.com/phac-nml/speciesabundance

Repository

Basic Info

Host: GitHub
Owner: phac-nml
License: mit
Language: Nextflow
Default Branch: main
Size: 12.8 MB

Statistics

Stars: 5
Watchers: 4
Forks: 0
Open Issues: 1
Releases: 3

Created over 2 years ago · Last pushed almost 2 years ago

SpeciesAbundance Pipeline

This is the nf-core-based SpeciesAbundance pipeline. It estimates the relative abundance of sequence reads originating from different species in a sample. It is designed to be integrated into IRIDA Next, but it can also be run as a stand-alone pipeline.

The pipeline estimates taxonomic abundance from either single-end or paired-end Illumina short-read data. It does not currently accommodate long-read sequencing data (Nanopore or PacBio).

Input

The input to the pipeline is a standard sample sheet (passed as --input samplesheet.csv) that looks like:

| sample  | fastq_1         | fastq_2         |
| ------- | --------------- | --------------- |
| SampleA | file_1.fastq.gz | file_2.fastq.gz |

An example samplesheet has been provided with the pipeline.

The structure of this file is defined in assets/schema_input.json. Validation of the sample sheet is performed by nf-validation.
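As a rough illustration of the layout above, the following Python sketch builds the text of a samplesheet. The column names `fastq_1` and `fastq_2` follow the usual nf-core convention and should be confirmed against assets/schema_input.json; the helper name `samplesheet_text` is hypothetical, and leaving `fastq_2` empty for single-end samples is an assumption rather than documented behaviour.

```python
import csv
import io

# Column order as shown in the example table (names assumed, see lead-in).
COLUMNS = ["sample", "fastq_1", "fastq_2"]


def samplesheet_text(rows):
    """Return CSV text for a list of dicts keyed by the samplesheet columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing columns (e.g. fastq_2 for single-end data) become empty cells.
        writer.writerow({column: row.get(column, "") for column in COLUMNS})
    return buffer.getvalue()


if __name__ == "__main__":
    print(samplesheet_text([
        {"sample": "SampleA", "fastq_1": "file_1.fastq.gz", "fastq_2": "file_2.fastq.gz"},
        {"sample": "SampleB", "fastq_1": "single.fastq.gz"},  # single-end (assumed layout)
    ]), end="")
```

The returned text can be saved as samplesheet.csv and passed to the pipeline with --input.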

IRIDA-Next Optional Input Configuration

speciesabundance accepts the IRIDA-Next format for samplesheets, which can contain an additional column: sample_name.

sample_name: An optional column that overrides sample for outputs (filenames and sample names) and reference assembly identification.

sample_name allows more flexibility in naming output files and identifying samples. Unlike sample, sample_name is not required to contain unique values. Nextflow requires unique sample names, so when a sample_name is repeated, sample is appended as a suffix to that sample_name. Non-alphanumeric characters (excluding _, -, and .) are replaced with "_".

The sample sheet, when including the optional sample_name column, should look like:

| sample  | sample_name | fastq_1         | fastq_2         |
| ------- | ----------- | --------------- | --------------- |
| SampleA | A1          | file_1.fastq.gz | file_2.fastq.gz |

An example samplesheet has been provided with the pipeline, which includes the sample_name column.
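The sample_name rules described in this section can be sketched in Python as follows. This is not the pipeline's own code: the function name `resolve_output_names` is hypothetical, the underscore used to join sample_name and sample is an assumption (the separator is not stated here), and so is the choice to suffix every row that shares a repeated sample_name.

```python
import re
from collections import Counter


def resolve_output_names(rows):
    """Map each sample to the name used for its outputs.

    rows is a list of (sample, sample_name) pairs; an empty sample_name
    falls back to sample, since the column is optional.
    """
    # Replace anything other than ASCII letters, digits, "_", "-" and "." with "_".
    cleaned = {
        sample: re.sub(r"[^A-Za-z0-9_.-]", "_", sample_name or sample)
        for sample, sample_name in rows
    }
    counts = Counter(cleaned.values())
    # Repeated names get the unique sample ID appended (separator assumed).
    return {
        sample: f"{name}_{sample}" if counts[name] > 1 else name
        for sample, name in cleaned.items()
    }
```

For example, two rows that both use the sample_name `dup` would come out as `dup_SampleB` and `dup_SampleC` under these assumptions, while a unique `A 1` would become `A_1`.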

Parameters

Mandatory

The mandatory parameters are as follows:

--input : a URI to the samplesheet as specified in the Input section.
--outdir : to specify the output results directory.

Database Selection

You must provide either --database on its own, or both --kraken2_db and --bracken_db together.

Use either:

--database /path/to/database : to specify the directory to the database files required by both Kraken2 and Bracken

Or:

--kraken2_db /path/to/kraken2database : to specify the directory to the Kraken2 database files and
--bracken_db /path/to/brackendatabase : to specify the directory to the Bracken database files
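A minimal Python sketch of the rule above, useful for checking a set of options before launching a run. The function name `resolve_databases` is hypothetical, and rejecting a mix of --database with the two separate options is one reading of "use either ... or"; the pipeline's own validation may behave differently.

```python
def resolve_databases(database=None, kraken2_db=None, bracken_db=None):
    """Return (kraken2_dir, bracken_dir), or raise ValueError for an invalid combination."""
    separate = (kraken2_db, bracken_db)
    if database and any(separate):
        # Assumed to be an error: the two styles are presented as alternatives.
        raise ValueError("use either --database or --kraken2_db/--bracken_db, not both")
    if database:
        # One directory holds the files for both Kraken2 and Bracken.
        return database, database
    if all(separate):
        return kraken2_db, bracken_db
    raise ValueError("provide --database, or both --kraken2_db and --bracken_db")
```

Supplying only one of --kraken2_db or --bracken_db raises an error here, matching the requirement that the two be given together.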

Optional

Additionally, you may choose to provide:

SpeciesAbundance Parameters

--taxonomic_level : to specify the taxonomic level of the Bracken abundance estimation.
- Must be one of: S (species, default), G (genus), O (order), F (family), P (phylum), or K (kingdom)
--kmer_len : to specify the k-mer length of the Bracken distribution file used to estimate abundance at the specified taxonomic level.
- Must be one of: 50, 75, 100 (default), 150, 200, 250, or 300
- Selecting a lower k-mer length enhances sensitivity, while a higher k-mer length increases specificity.
--top_n : to specify the number of top results to keep and include in the metadata for IRIDA Next.
- Default: 5
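The allowed values listed above can be checked before a run with a small helper like the one below. It is only a sketch: `build_command` is a hypothetical name, the paths are placeholders, `--outdir` is used as in the pipeline's test-profile command, the single-directory `--database` form is assumed, and requiring `top_n` to be at least 1 is an assumption rather than a documented constraint.

```python
import shlex

TAXONOMIC_LEVELS = {"S", "G", "O", "F", "P", "K"}
KMER_LENGTHS = {50, 75, 100, 150, 200, 250, 300}


def build_command(input_csv, outdir, database,
                  taxonomic_level="S", kmer_len=100, top_n=5, profile="docker"):
    """Validate the optional parameters and return the nextflow argument list."""
    if taxonomic_level not in TAXONOMIC_LEVELS:
        raise ValueError(f"taxonomic_level must be one of {sorted(TAXONOMIC_LEVELS)}")
    if kmer_len not in KMER_LENGTHS:
        raise ValueError(f"kmer_len must be one of {sorted(KMER_LENGTHS)}")
    if top_n < 1:  # assumed lower bound
        raise ValueError("top_n must be at least 1")
    return [
        "nextflow", "run", "phac-nml/speciesabundance",
        "-profile", profile,
        "--input", input_csv,
        "--outdir", outdir,
        "--database", database,
        "--taxonomic_level", taxonomic_level,
        "--kmer_len", str(kmer_len),
        "--top_n", str(top_n),
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for a genus-level run (placeholder paths).
    print(shlex.join(build_command("samplesheet.csv", "results", "/path/to/database",
                                   taxonomic_level="G")))
```

Returning a list rather than a string keeps the arguments safe to pass to `subprocess.run` without extra quoting.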

Other Parameters

-profile : to specify which profile to use (ex: -profile singularity)
-r [branch] : to specify which GitHub branch you would like to use (ex: -r dev)

Other parameters (defaults from nf-core) are defined in nextflow_schema.json.

Running

Test Data

To run the pipeline using the test profile, please run:

    nextflow run phac-nml/speciesabundance -profile docker,test --outdir results

The pipeline output will be written to a directory named results. A JSON file for integrating with IRIDA Next will be written to results/iridanext.output.json.gz (as detailed in the Output section).

Output

Results

The following output files are generated by the pipeline:

fastp/
- sampleID_{R1/R2}_trimmed.fastq.gz
- sampleID.fastp.json
- sampleID.fastp.html
kraken2/
- sampleID_kraken2_output.tsv.gz
- sampleID_kraken2_report.txt
bracken/
- sampleID_S_bracken_abundance_unsorted.tsv
- sampleID_S_bracken.txt
failure/
- failures_report.csv
adjust/
- sampleID_S_bracken_abundance.csv
- sampleID_S_adjusted_report.txt
top/sampleID_S_top_N.csv
csvtk/merged_topN.csv
bracken2krona/sampleID.txt
krona/sampleID.krona.html

IRIDA Next Integration File

A JSON file for loading metadata into IRIDA Next is output by this pipeline. The format of this JSON file is specified in our Pipeline Standards for the IRIDA Next JSON. This JSON file is written directly within the --outdir provided to the pipeline with the name iridanext.output.json.gz (ex: [outdir]/iridanext.output.json.gz).

An example of the contents of the IRIDA Next JSON file for this pipeline is as follows:

    {
      "files": {
        "global": [
          {
            "path": "failure/failures_report.csv"
          }
        ],
        "samples": {
          "sampleID": [
            {"path": "adjust/sampleID_S_bracken_abundances.csv"},
            {"path": "krona/sampleID.krona.html"},
            {"path": "fastp/sampleID.fastp.html"}
          ]
        }
      },
      "metadata": {
        "samples": {
          "sampleID": {
            "taxonomy_level": "S",
            "abundance_1_name": "Bacteroides fragilis",
            "abundance_1_ncbi_taxonomy_id": "817",
            "abundance_1_num_assigned_reads": "28877",
            "abundance_1_fraction_total_reads": "57.77018",
            "abundance_2_name": "Escherichia coli",
            "abundance_2_ncbi_taxonomy_id": "562",
            "abundance_2_num_assigned_reads": "21065",
            "abundance_2_fraction_total_reads": "42.1418",
            "abundance_3_name": "",
            "abundance_3_ncbi_taxonomy_id": "",
            "abundance_3_num_assigned_reads": "",
            "abundance_3_fraction_total_reads": "",
            "abundance_4_name": "",
            "abundance_4_ncbi_taxonomy_id": "",
            "abundance_4_num_assigned_reads": "",
            "abundance_4_fraction_total_reads": "",
            "abundance_5_name": "",
            "abundance_5_ncbi_taxonomy_id": "",
            "abundance_5_num_assigned_reads": "",
            "abundance_5_fraction_total_reads": "",
            "unclassified_name": "unclassified",
            "unclassified_ncbi_taxonomy_id": "0",
            "unclassified_num_assigned_reads": "44",
            "unclassified_fraction_total_reads": "0.08802"
          }
        }
      }
    }

Within the files section of this JSON file, all of the output paths are relative to the outdir. Therefore, "path": "adjust/SAMPLE1_S_bracken_abundances.csv" refers to a file located within outdir/adjust/SAMPLE1_S_bracken_abundances.csv.
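For downstream scripting, the metadata in this file can be read with Python's standard library. The sketch below assumes the `abundance_N_*` key layout shown in the example and that unused ranks are empty strings; the function names are hypothetical, and the fraction values are passed through as floats without interpreting their scale.

```python
import gzip
import json


def load_iridanext(path):
    """Read the gzip-compressed IRIDA Next JSON file into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def ranked_abundances(sample_metadata):
    """Return (rank, name, taxonomy_id, reads, fraction) tuples for filled ranks."""
    ranked = []
    rank = 1
    # Walk abundance_1_*, abundance_2_*, ... until the keys run out.
    while f"abundance_{rank}_name" in sample_metadata:
        prefix = f"abundance_{rank}_"
        name = sample_metadata[prefix + "name"]
        if name:  # skip ranks left empty, as in the example above
            ranked.append((
                rank,
                name,
                sample_metadata[prefix + "ncbi_taxonomy_id"],
                int(sample_metadata[prefix + "num_assigned_reads"]),
                float(sample_metadata[prefix + "fraction_total_reads"]),
            ))
        rank += 1
    return ranked
```

With a completed run, `ranked_abundances(load_iridanext("results/iridanext.output.json.gz")["metadata"]["samples"]["sampleID"])` would return one tuple per reported taxon for the sample `sampleID`.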

Failures

If one or more samples fail during pipeline execution, the workflow will still run all other samples in the samplesheet. The samples that fail will be reported in a file named results/failure/failures_report.csv. This CSV file has three columns:

sample : the name of the sample that failed (matching the input samplesheet)
module : the module (or process) where the error occurred
error_message : suggestions that aim to provide insights into potential reasons for sample failure in the respective process

For example:

    sample,module,error_message
    [SAMPLE1],FASTP,The input FASTQ file(s) might exhibit either a mismatch in PAIRED files; corruption in one or both SINGLE/PAIRED file(s); or file(s) may not exist in PATH provided by input samplesheet
    [SAMPLE2],KRAKEN2,The reads may not have passed the quality control and trimming process OR the database directory may be missing required KRAKEN2 files
    [SAMPLE3],BRACKEN,The reads may have failed to classify against the selected Kraken2 database OR the database directory may be missing the Bracken distribution files
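A short Python sketch for summarising this report, assuming the three-column layout described above. The function name `failures_by_module` is hypothetical, and the sample IDs in the usage note are placeholders.

```python
import csv
import io
from collections import defaultdict


def failures_by_module(report_text):
    """Group failed sample names by the module in which they failed."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(report_text)):
        grouped[row["module"]].append(row["sample"])
    return dict(grouped)
```

Passing the text of the failures report to `failures_by_module` yields a mapping such as `{"FASTP": ["SAMPLE1"], "KRAKEN2": ["SAMPLE2"]}`, which makes it easy to see whether failures cluster in read trimming, classification, or abundance estimation.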

Legal

Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Derivative Work

This pipeline includes source code from a Nextflow pipeline for taxon-abundance and an IRIDA plugin for SpeciesAbundance, both developed by Dan Fornika as a work of the BC Centre for Disease Control Public Health Laboratory (BCCDC-PHL).

The included source code developed by Dan Fornika as a work of the BCCDC-PHL was distributed within the public domain under the Apache Software License version 2.0.

Any such source files in this project that are included from or derived from the original work by Dan Fornika will include a notice.

Owner

Name: National Microbiology Laboratory
Login: phac-nml
Kind: organization

Website: https://www.nml-lnm.gc.ca/
Repositories: 50
Profile: https://github.com/phac-nml

Citation (CITATIONS.md)

# phac-nml/speciesabundance: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Watch event: 1
Create event: 1

Last Year

Watch event: 1
Create event: 1

Dependencies

.github/workflows/branch.yml actions

mshick/add-pr-comment v1 composite

.github/workflows/ci.yml actions

actions/cache v3 composite
actions/checkout v3 composite
nf-core/setup-nextflow v1 composite

.github/workflows/linting.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
mshick/add-pr-comment v1 composite
nf-core/setup-nextflow v1 composite
psf/black stable composite

.github/workflows/linting_comment.yml actions

dawidd6/action-download-artifact v2 composite
marocchino/sticky-pull-request-comment v2 composite

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

pyproject.toml pypi

modules/nf-core/csvtk/concat/meta.yml cpan

modules/nf-core/csvtk/concat/environment.yml conda

csvtk 0.23.0.*

modules/nf-core/custom/dumpsoftwareversions/environment.yml conda

multiqc 1.20.*
