speciesabundance
Estimate the relative abundance of sequence reads originating from different species in a sample.
Repository
Basic Info
- Host: GitHub
- Owner: phac-nml
- License: mit
- Language: Nextflow
- Default Branch: main
- Size: 12.8 MB
Statistics
- Stars: 5
- Watchers: 4
- Forks: 0
- Open Issues: 1
- Releases: 3
Metadata Files
README.md
SpeciesAbundance Pipeline
SpeciesAbundance is an nf-core-based pipeline that estimates the relative abundance of sequence reads originating from different species in a sample. It is designed to be integrated into IRIDA Next, but it can also be run as a stand-alone pipeline.
This pipeline is designed to estimate taxonomic abundance using both single- and paired-end Illumina short-read data. It does not currently accommodate long-read sequencing data (Nanopore or PacBio).
Input
The input to the pipeline is a standard sample sheet (passed as --input samplesheet.csv) that looks like:
| sample  | fastq_1         | fastq_2         |
| ------- | --------------- | --------------- |
| SampleA | file_1.fastq.gz | file_2.fastq.gz |
An example samplesheet has been provided with the pipeline.
The structure of this file is defined in assets/schema_input.json. Validation of the sample sheet is performed by nf-validation.
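As a rough illustration of the layout above, the following sketch writes a samplesheet with Python's standard csv module. The column names `fastq_1` and `fastq_2`, the file names, and the convention of leaving the second FASTQ column empty for single-end samples are assumptions made for this example; assets/schema_input.json remains the authoritative definition.

```python
import csv
import io

# Assumed column names; assets/schema_input.json is the authoritative definition.
COLUMNS = ["sample", "fastq_1", "fastq_2"]


def write_samplesheet(rows, handle):
    """Write samplesheet rows (dicts keyed by COLUMNS) as CSV to an open handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Assumption: single-end samples leave the second FASTQ column empty.
        writer.writerow({col: row.get(col, "") for col in COLUMNS})


# Placeholder file names, for illustration only.
rows = [
    {"sample": "SampleA", "fastq_1": "file_1.fastq.gz", "fastq_2": "file_2.fastq.gz"},
    {"sample": "SampleB", "fastq_1": "single.fastq.gz"},
]

buffer = io.StringIO()
write_samplesheet(rows, buffer)
samplesheet_text = buffer.getvalue()
print(samplesheet_text)
```

Writing to an open handle rather than a path keeps the helper usable with both files and in-memory buffers.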
IRIDA-Next Optional Input Configuration
speciesabundance accepts the IRIDA-Next format for samplesheets, which can contain an additional column: sample_name
sample_name: An optional column that overrides sample for outputs (filenames and sample names) and reference assembly identification.
sample_name allows more flexibility in naming output files and identifying samples. Unlike sample, sample_name is not required to contain unique values. Nextflow requires unique sample names, so when sample_name values repeat, sample will be appended as a suffix to the sample_name. Non-alphanumeric characters (excluding _, - and .) will be replaced with "_".
The sample sheet, when including the optional sample_name column, should look like:
| sample  | sample_name | fastq_1         | fastq_2         |
| ------- | ----------- | --------------- | --------------- |
| SampleA | A1          | file_1.fastq.gz | file_2.fastq.gz |
An example samplesheet has been provided with the pipeline, which includes the sample_name column.
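The naming rules described above can be sketched in a few lines of Python. This is an illustration only, not the pipeline's own code: the function name `resolve_output_names`, the "_" separator between sample_name and sample, the fallback to sample when sample_name is empty, and the choice to detect repeats after character replacement are all assumptions.

```python
import re
from collections import Counter


def resolve_output_names(rows):
    """Map each sample ID to the name used for its outputs.

    rows is a list of (sample, sample_name) pairs; sample_name may be empty.
    """
    cleaned = []
    for sample, sample_name in rows:
        # Assumption: an empty sample_name falls back to the sample ID.
        name = sample_name or sample
        # Replace anything other than letters, digits, "_", "-" and "." with "_".
        cleaned.append((sample, re.sub(r"[^A-Za-z0-9_.-]", "_", name)))

    counts = Counter(name for _, name in cleaned)
    resolved = {}
    for sample, name in cleaned:
        # Assumption: repeated names get "_<sample>" appended; the real separator may differ.
        resolved[sample] = f"{name}_{sample}" if counts[name] > 1 else name
    return resolved


names = resolve_output_names([
    ("SampleA", "A1"),
    ("SampleB", "A1"),
    ("SampleC", "run 3/C"),
    ("SampleD", ""),
])
print(names)
```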
Parameters
Mandatory
The mandatory parameters are as follows:
- --input: a URI to the samplesheet as specified in the Input section.
- --outdir: to specify the output results directory.
Database Selection
You must provide either --database on its own, or both --kraken2_db and --bracken_db.
Please use only:
- --database /path/to/database: to specify the directory containing the database files required by both Kraken2 and Bracken
Or:
- --kraken2_db /path/to/kraken2database: to specify the directory containing the Kraken2 database files, and
- --bracken_db /path/to/brackendatabase: to specify the directory containing the Bracken database files
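The database rule can be expressed as a small pre-flight check. The sketch below is a hypothetical helper, not part of the pipeline; in particular, treating a mix of --database with the two specific options as an error is an assumption about how such a combination should be handled.

```python
def resolve_databases(database=None, kraken2_db=None, bracken_db=None):
    """Return (kraken2_dir, bracken_dir), or raise ValueError for an invalid combination."""
    if database and (kraken2_db or bracken_db):
        # Assumption: mixing the combined and the separate options is rejected.
        raise ValueError("use either --database or --kraken2_db/--bracken_db, not both")
    if database:
        # One directory serves both tools.
        return database, database
    if kraken2_db and bracken_db:
        return kraken2_db, bracken_db
    raise ValueError("provide --database, or both --kraken2_db and --bracken_db")


print(resolve_databases(database="/path/to/database"))
print(resolve_databases(kraken2_db="/path/to/kraken2database",
                        bracken_db="/path/to/brackendatabase"))
```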
Optional
Additionally, you may choose to provide:
SpeciesAbundance Parameters
- --taxonomic_level: to specify the taxonomic level of the Bracken abundance estimation.
  - Must be one of: S (species) (default), G (genus), O (order), F (family), P (phylum), or K (kingdom)
- --kmer_len: to specify the k-mer length for the Bracken distribution file used to estimate the abundance at the specified taxonomic level.
  - Must be one of: 50, 75, 100 (default), 150, 200, 250, or 300
  - Selecting a lower k-mer length enhances sensitivity, while a higher k-mer length increases specificity.
- --top_n: to specify the number of top results to keep and include in the metadata for IRIDA Next.
  - Default: 5
Other Parameters
- -profile: to specify which profile to use (ex: -profile singularity)
- -r [branch]: to specify which GitHub branch you would like to use (ex: -r dev)
Other parameters (defaults from nf-core) are defined in nextflow_schema.json.
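To show how the parameters above fit together, here is a minimal sketch that checks the documented choices and assembles a command line. The helper `build_command` and its defaults are hypothetical; only the flags and allowed values come from this README, and the pipeline performs its own validation against nextflow_schema.json.

```python
import shlex

# Allowed values as listed in the parameter descriptions above.
TAXONOMIC_LEVELS = {"S", "G", "O", "F", "P", "K"}
KMER_LENGTHS = {50, 75, 100, 150, 200, 250, 300}


def build_command(samplesheet, outdir, database,
                  taxonomic_level="S", kmer_len=100, top_n=5, profile="docker"):
    """Return a shell-quoted `nextflow run` command string (hypothetical helper)."""
    if taxonomic_level not in TAXONOMIC_LEVELS:
        raise ValueError(f"unsupported taxonomic level: {taxonomic_level}")
    if kmer_len not in KMER_LENGTHS:
        raise ValueError(f"unsupported k-mer length: {kmer_len}")
    args = [
        "nextflow", "run", "phac-nml/speciesabundance",
        "-profile", profile,
        "--input", samplesheet,
        "--outdir", outdir,
        "--database", database,
        "--taxonomic_level", taxonomic_level,
        "--kmer_len", str(kmer_len),
        "--top_n", str(top_n),
    ]
    # shlex.join quotes any argument that contains spaces or shell metacharacters.
    return shlex.join(args)


command = build_command("samplesheet.csv", "results", "/path/to/database",
                        taxonomic_level="G")
print(command)
```

Checking the values before launching avoids starting a run that would be rejected at parameter validation.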
Running
Test Data
To run the pipeline using the test profile, please run:
nextflow run phac-nml/speciesabundance -profile docker,test --outdir results
The pipeline output will be written to a directory named results. A JSON file for integrating with IRIDA Next will be written to results/iridanext.output.json.gz (as detailed in the Output section).
Output
Results
The following output files are generated by the pipeline:
- fastp/
  - sampleID_{R1/R2}_trimmed.fastq.gz
  - sampleID.fastp.json
  - sampleID.fastp.html
- kraken2/
  - sampleID_kraken2_output.tsv.gz
  - sampleID_kraken2_report.txt
- bracken/
  - sampleID_S_bracken_abundance_unsorted.tsv
  - sampleID_S_bracken.txt
- failure/
  - failures_report.csv
- adjust/
  - sampleID_S_bracken_abundance.csv
  - sampleID_S_adjusted_report.txt
- top/
  - sampleID_S_top_N.csv
- csvtk/
  - merged_topN.csv
- bracken2krona/
  - sampleID.txt
- krona/
  - sampleID.krona.html
IRIDA Next Integration File
A JSON file for loading metadata into IRIDA Next is output by this pipeline. The format of this JSON file is specified in our Pipeline Standards for the IRIDA Next JSON. This JSON file is written directly within the --outdir provided to the pipeline with the name iridanext.output.json.gz (ex: [outdir]/iridanext.output.json.gz).
An example of the contents of the IRIDA Next JSON file for this pipeline is as follows:
{
"files": {
"global": [
{
"path": "failure/failures_report.csv"
}
],
"samples": {
"sampleID": [
{"path": "adjust/sampleID_S_bracken_abundances.csv"},
{"path": "krona/sampleID.krona.html"},
{"path": "fastp/sampleID.fastp.html"}
]
}
},
"metadata": {
"samples": {
"sampleID": {
"taxonomy_level": "S",
"abundance_1_name": "Bacteroides fragilis",
"abundance_1_ncbi_taxonomy_id": "817",
"abundance_1_num_assigned_reads": "28877",
"abundance_1_fraction_total_reads": "57.77018",
"abundance_2_name": "Escherichia coli",
"abundance_2_ncbi_taxonomy_id": "562",
"abundance_2_num_assigned_reads": "21065",
"abundance_2_fraction_total_reads": "42.1418",
"abundance_3_name": "",
"abundance_3_ncbi_taxonomy_id": "",
"abundance_3_num_assigned_reads": "",
"abundance_3_fraction_total_reads": "",
"abundance_4_name": "",
"abundance_4_ncbi_taxonomy_id": "",
"abundance_4_num_assigned_reads": "",
"abundance_4_fraction_total_reads": "",
"abundance_5_name": "",
"abundance_5_ncbi_taxonomy_id": "",
"abundance_5_num_assigned_reads": "",
"abundance_5_fraction_total_reads": "",
"unclassified_name": "unclassified",
"unclassified_ncbi_taxonomy_id": "0",
"unclassified_num_assigned_reads": "44",
"unclassified_fraction_total_reads": "0.08802"
}
}
}
}
Within the files section of this JSON file, all output paths are relative to the --outdir directory. For example, "path": "adjust/sampleID_S_bracken_abundances.csv" refers to the file [outdir]/adjust/sampleID_S_bracken_abundances.csv.
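For readers who want to inspect this file outside IRIDA Next, the following sketch reads the gzipped JSON with Python's standard library and lists the non-empty abundance entries per sample. The function name `top_abundances` is hypothetical, and the demonstration writes a small temporary file that copies a few fields from the example above rather than reading real pipeline output.

```python
import gzip
import json
import os
import tempfile


def top_abundances(json_gz_path):
    """Return {sample: [(name, fraction_total_reads), ...]} from the IRIDA Next JSON."""
    with gzip.open(json_gz_path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    result = {}
    for sample, fields in data["metadata"]["samples"].items():
        hits = []
        rank = 1
        # Walk abundance_1_*, abundance_2_*, ... until the keys run out.
        while f"abundance_{rank}_name" in fields:
            name = fields[f"abundance_{rank}_name"]
            if name:  # empty strings mark ranks with no result
                fraction = float(fields[f"abundance_{rank}_fraction_total_reads"])
                hits.append((name, fraction))
            rank += 1
        result[sample] = hits
    return result


# Demonstration data: a subset of the example metadata shown above.
example = {"metadata": {"samples": {"sampleID": {
    "abundance_1_name": "Bacteroides fragilis",
    "abundance_1_fraction_total_reads": "57.77018",
    "abundance_2_name": "Escherichia coli",
    "abundance_2_fraction_total_reads": "42.1418",
    "abundance_3_name": "",
    "abundance_3_fraction_total_reads": "",
}}}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iridanext.output.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(example, handle)
    summary = top_abundances(path)

print(summary)
```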
Failures
If one or more samples fail during pipeline execution, the workflow will still run all other samples in the samplesheet. The samples that fail will be reported in a file named results/failure/failures_report.csv. This CSV file has three columns:
- sample: the name of the sample that failed (matching the input samplesheet)
- module: the module (or process) where the error occurred
- error_message: suggestions that aim to provide insights into potential reasons for sample failure in the respective process
For example:
sample,module,error_message
[SAMPLE1],FASTP,The input FASTQ file(s) might exhibit either a mismatch in PAIRED files; corruption in one or both SINGLE/PAIRED file(s); or file(s) may not exist in PATH provided by input samplesheet
[SAMPLE2],KRAKEN2,The reads may not have passed the quality control and trimming process OR the database directory may be missing required KRAKEN2 files
[SAMPLE3],BRACKEN,The reads may have failed to classify against the selected Kraken2 database OR the database directory may be missing the Bracken distribution files
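A report in this shape can be summarised with a short script. The sketch below groups failed samples by module; the function name `failures_by_module`, the shortened messages, and the assumption that fields are unquoted (so only the first two commas separate columns) are illustrative rather than taken from the pipeline.

```python
from collections import defaultdict

EXPECTED_HEADER = ["sample", "module", "error_message"]


def failures_by_module(report_text):
    """Group failed sample names by the module that reported them."""
    lines = report_text.strip().splitlines()
    if lines[0].split(",") != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {lines[0]}")
    grouped = defaultdict(list)
    for line in lines[1:]:
        # Assumption: fields are unquoted, so split on the first two commas only
        # and keep any later commas as part of the free-text message.
        sample, module, _message = line.split(",", 2)
        grouped[module].append(sample)
    return dict(grouped)


# Shortened placeholder messages, for illustration only.
report = """sample,module,error_message
SAMPLE1,FASTP,The input FASTQ file(s) might be mismatched, corrupt or missing
SAMPLE2,KRAKEN2,The reads may not have passed quality control
SAMPLE4,KRAKEN2,The database directory may be missing required files
"""

failed = failures_by_module(report)
print(failed)
```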
Legal
Copyright 2024 Government of Canada
Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
https://opensource.org/license/mit/
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Derivative Work
This pipeline includes source code from a Nextflow pipeline for taxon-abundance and an IRIDA plugin for SpeciesAbundance developed by Dan Fornika as a work of the BC Centre for Disease Control Public Health Laboratory (BCCDC-PHL).
The included source code developed by Dan Fornika as a work of the BCCDC-PHL was distributed within the public domain under the Apache Software License version 2.0.
Any such source files in this project that are included from or derived from the original work by Dan Fornika will include a notice.
Owner
- Name: National Microbiology Laboratory
- Login: phac-nml
- Kind: organization
- Website: https://www.nml-lnm.gc.ca/
- Repositories: 50
- Profile: https://github.com/phac-nml
Citation (CITATIONS.md)
# phac-nml/speciesabundance: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## Pipeline tools

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.
GitHub Events
Total
- Watch event: 1
- Create event: 1
Last Year
- Watch event: 1
- Create event: 1
Dependencies
- mshick/add-pr-comment v1 composite
- actions/cache v3 composite
- actions/checkout v3 composite
- nf-core/setup-nextflow v1 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- mshick/add-pr-comment v1 composite
- nf-core/setup-nextflow v1 composite
- psf/black stable composite
- dawidd6/action-download-artifact v2 composite
- marocchino/sticky-pull-request-comment v2 composite
- csvtk 0.23.0.*
- multiqc 1.20.*