fetchdatairidanext

Pipeline for downloading data from INSDC databases for IRIDA Next.

https://github.com/phac-nml/fetchdatairidanext

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.6%) to scientific vocabulary

Last synced: 10 months ago

Repository


Basic Info

Host: GitHub
Owner: phac-nml
License: mit
Language: Nextflow
Default Branch: main
Homepage:
Size: 2.54 MB

Statistics

Stars: 0
Watchers: 4
Forks: 2
Open Issues: 2
Releases: 6

Created over 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License Citation

fetchdatairidanext pipeline

This pipeline fetches sequence read data from INSDC databases (the NCBI Sequence Read Archive by default, or the European Nucleotide Archive) for integration into IRIDA Next.

Input

The input to the pipeline is a standard sample sheet (passed as --input samplesheet.csv) that looks like:

| sample  | insdc_accession |
| ------- | --------------- |
| SampleA | ERR1109373      |
| SampleB | SRR13191702     |

That is, there are two columns:

sample: The sample identifier downloaded read data should be associated with.
insdc_accession: The accession from the International Nucleotide Sequence Database Collaboration (INSDC) for the data to download. Currently only sequence runs are supported, i.e., accessions starting with SRR, ERR, or DRR.

The structure of this file is defined in assets/schema_input.json. An example of this file is provided at assets/samplesheet.csv.
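A samplesheet can be sanity-checked before launching the pipeline. The following is a minimal sketch, not the pipeline's own validation: the function name `check_samplesheet` and the accession pattern are assumptions made for illustration, and the authoritative rules remain those in assets/schema_input.json.

```python
import csv
import io
import re

# Assumption: run accessions are SRR/ERR/DRR followed by digits, matching the
# prefixes named above. The real schema may be stricter or looser.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def check_samplesheet(text):
    """Return a list of human-readable problems found in samplesheet CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    missing = {"sample", "insdc_accession"} - set(reader.fieldnames or [])
    if missing:
        return [f"missing column(s): {', '.join(sorted(missing))}"]
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        sample = (row["sample"] or "").strip()
        accession = (row["insdc_accession"] or "").strip()
        if not sample:
            problems.append(f"line {line_no}: empty sample")
        elif sample in seen:
            problems.append(f"line {line_no}: duplicate sample {sample!r}")
        seen.add(sample)
        if not RUN_ACCESSION.match(accession):
            problems.append(f"line {line_no}: {accession!r} is not a run accession")
    return problems


example = "sample,insdc_accession\nSampleA,ERR1109373\nSampleB,SRR13191702\n"
print(check_samplesheet(example))
```

An empty list means no problems were detected by these particular checks; it does not guarantee that the accession exists or can be downloaded.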

IRIDA-Next Optional Input Configuration

fetchdatairidanext accepts the IRIDA-Next format for samplesheets which can contain an additional column: sample_name

sample_name: An optional column, to add the sample_name prefix before the accession code.

sample_name allows more flexibility in naming reads. Unlike sample, sample_name is not required to contain unique values. Non-alphanumeric characters (excluding _, -, and .) will be replaced with "_". To provide a sample_name column without renaming reads, set --rename_with_samplename false.
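The character replacement rule described above can be sketched as follows. This is an illustration only: `sanitize_sample_name` is a hypothetical helper, it assumes "alphanumeric" means ASCII letters and digits, and it does not reproduce how the pipeline joins the prefix to the accession.

```python
import re


def sanitize_sample_name(name):
    """Replace every character other than A-Z, a-z, 0-9, '_', '-', '.' with '_'.

    Assumption: ASCII-only alphanumerics; the pipeline's own rule may differ.
    """
    return re.sub(r"[^A-Za-z0-9_.-]", "_", name)


print(sanitize_sample_name("Sample A#1"))
```

Because sample_name values need not be unique, two different inputs may also map to the same sanitized name.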

An example samplesheet has been provided with the pipeline.

Parameters

The main parameters are --input as defined above and --outdir for specifying the output results directory. You may wish to provide -profile singularity to use Singularity containers (or -profile docker for Docker) and -r [branch] to specify which GitHub branch you would like to run.

--rename_with_samplename (Default: true) When false, the sample_name column of the samplesheet is not used to rename reads.

--provider ['SRA'|'ENA'] (Default: SRA) With SRA, data is pulled from the Sequence Read Archive using fasterq-dump from sra-tools. With ENA, data is pulled from the European Nucleotide Archive using fastq-dl.

Other parameters (defaults from nf-core) are defined in nextflow_schema.json.

Running

Test data

To run the pipeline with test data, please do:

```bash
nextflow run phac-nml/fetchdatairidanext -profile test,docker --outdir results
```

The downloaded data will appear in results/. A JSON file for integrating data with IRIDA Next will be written to results/iridanext.output.json.gz (see the Output section for details).

Other data

To run the pipeline with other data (a custom samplesheet), please do:

```bash
nextflow run phac-nml/fetchdatairidanext -profile docker --input assets/samplesheet.csv --outdir results
```

Here, samplesheet.csv is structured as specified in the Input section.

Output

Read data

The sequence reads will appear in the results/reads directory (assuming --outdir results is specified). For example:

```
results/reads/
├── ERR1109373.fastq.gz
├── ERR1109373_1.fastq.gz
├── ERR1109373_2.fastq.gz
├── SRR13191702_1.fastq.gz
└── SRR13191702_2.fastq.gz
```
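For downstream scripting, the read files in a listing like the one above can be grouped by accession. The sketch below assumes the default file names shown here (accession, optional `_1`/`_2` suffix, then `.fastq.gz`); names produced with a sample_name prefix would need a different pattern. `group_reads` is a hypothetical helper, not part of the pipeline.

```python
import re
from collections import defaultdict

# Assumption: files are named <accession>[_1|_2].fastq.gz as in the listing above.
READ_FILE = re.compile(r"^(?P<accession>[A-Za-z]+\d+)(?:_(?P<mate>[12]))?\.fastq\.gz$")


def group_reads(filenames):
    """Map each accession to the sorted list of its read files."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        match = READ_FILE.match(name)
        if match:  # silently skip anything that does not look like a read file
            groups[match.group("accession")].append(name)
    return dict(groups)


listing = [
    "ERR1109373.fastq.gz",
    "ERR1109373_1.fastq.gz",
    "ERR1109373_2.fastq.gz",
    "SRR13191702_1.fastq.gz",
    "SRR13191702_2.fastq.gz",
]
print(group_reads(listing))
```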

IRIDA Next integration file

A JSON file for loading the data into IRIDA Next is output by this pipeline. The format of this JSON file is specified in our Pipeline Standards for the IRIDA Next JSON. This JSON file is written directly within the --outdir provided to the pipeline with the name iridanext.output.json.gz (ex: [outdir]/iridanext.output.json.gz).

```json
{
  "files": {
    "global": [],
    "samples": {
      "SampleA": [
        { "path": "reads/SRR13191702_1.fastq.gz" },
        { "path": "reads/SRR13191702_2.fastq.gz" }
      ]
    }
  }
}
```

Within the files section of this JSON file, all of the output paths are relative to the output directory (here, --outdir results). Therefore, "path": "reads/SRR13191702_1.fastq.gz" refers to a file located at results/reads/SRR13191702_1.fastq.gz.

An additional example of this file can be found at tests/data/test1_iridanext.output.json.
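Since the paths in the integration file are relative to the output directory, a consumer has to join them with that directory itself. A minimal sketch of doing so, assuming the file is named iridanext.output.json.gz as in the Running section and has the `files.samples` layout shown above (`sample_files` is a hypothetical helper):

```python
import gzip
import json
from pathlib import Path


def sample_files(outdir):
    """Return {sample: [Path, ...]} with paths resolved against outdir."""
    outdir = Path(outdir)
    # Assumption: the integration file sits directly inside outdir.
    with gzip.open(outdir / "iridanext.output.json.gz", "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    return {
        sample: [outdir / entry["path"] for entry in entries]
        for sample, entries in data["files"]["samples"].items()
    }
```

For a run started with --outdir results, `sample_files("results")` would return paths such as `results/reads/SRR13191702_1.fastq.gz`.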

Failures

If one or more samples fail to download, the workflow will still attempt to download all other samples in the samplesheet. The samples that fail to download will be reported in a file named results/prefetch/failures_report.csv. This CSV file has two columns: sample (the name of the sample, matching the input samplesheet) and error_accession (the accession that failed to download).

For example:

```
sample,error_accession
ERROR1,SRR999908
ERROR2,SRR999934
```
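The failures report is a plain two-column CSV, so it can be read with standard tooling, for example to decide which samples to retry. A minimal sketch, assuming the header row shown above is always present (`failed_accessions` is a hypothetical helper):

```python
import csv
import io


def failed_accessions(report_text):
    """Map each failed sample to the accession that could not be downloaded."""
    reader = csv.DictReader(io.StringIO(report_text))
    return {row["sample"]: row["error_accession"] for row in reader}


report = "sample,error_accession\nERROR1,SRR999908\nERROR2,SRR999934\n"
print(failed_accessions(report))
```

In practice the text would come from reading results/prefetch/failures_report.csv rather than from an inline string.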

Acknowledgements

This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.

In addition, references of tools and data used in this pipeline are as follows:

The fastq_download_prefetch_fasterqdump_sratools subworkflow from nf-core. Custom modifications to this workflow (and underlying modules) are found in the subworkflows/local and modules/local directories.
The fastq-dl tool for grabbing data from the ENA.

Other works this pipeline makes use of are found in the CITATIONS.md file.

Legal

Licensed under the MIT License (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:

https://opensource.org/license/mit/

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner

Name: National Microbiology Laboratory
Login: phac-nml
Kind: organization

Website: https://www.nml-lnm.gc.ca/
Repositories: 50
Profile: https://github.com/phac-nml

Citation (CITATIONS.md)

# phac-nml/fetchdatairidanext: Citations

## [nf-core](https://pubmed.ncbi.nlm.nih.gov/32055031/)

> Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.

## [Nextflow](https://pubmed.ncbi.nlm.nih.gov/28398311/)

> Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.

## [nf-test](https://www.nf-test.com/)

## Pipeline tools

- [NCBI sra-tools](https://github.com/ncbi/sra-tools)
- [nf-core fastq_download_prefetch_fasterqdump_sratools subworkflow](https://nf-co.re/subworkflows/fastq_download_prefetch_fasterqdump_sratools)
- [fastq-dl](https://github.com/rpetit3/fastq-dl)

## Software packaging/containerisation tools

- [Anaconda](https://anaconda.com)

  > Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web.

- [Bioconda](https://pubmed.ncbi.nlm.nih.gov/29967506/)

  > Grüning B, Dale R, Sjödin A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, Köster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506.

- [BioContainers](https://pubmed.ncbi.nlm.nih.gov/28379341/)

  > da Veiga Leprevost F, Grüning B, Aflitos SA, Röst HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671.

- [Docker](https://dl.acm.org/doi/10.5555/2600239.2600241)

  > Merkel, D. (2014). Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2. doi: 10.5555/2600239.2600241.

- [Singularity](https://pubmed.ncbi.nlm.nih.gov/28494014/)

  > Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.

GitHub Events

Total

Create event: 9
Issues event: 2
Release event: 4
Delete event: 4
Issue comment event: 15
Push event: 49
Pull request review comment event: 49
Pull request review event: 45
Pull request event: 16
Fork event: 1

Last Year

Create event: 9
Issues event: 2
Release event: 4
Delete event: 4
Issue comment event: 15
Push event: 49
Pull request review comment event: 49
Pull request review event: 45
Pull request event: 16
Fork event: 1

Issues and Pull Requests


All Time

Total issues: 1
Total pull requests: 7
Average time to close issues: 10 days
Average time to close pull requests: 3 days
Total issue authors: 1
Total pull request authors: 2
Average comments per issue: 1.0
Average comments per pull request: 1.43
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 7
Average time to close issues: 10 days
Average time to close pull requests: 3 days
Issue authors: 1
Pull request authors: 2
Average comments per issue: 1.0
Average comments per pull request: 1.43
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

apetkau (2)
mcook19 (1)

Pull Request Authors

apetkau (16)
emarinier (3)
sgsutcliffe (2)
kylacochrane (2)
DarianHole (1)

Top Labels

Issue Labels

enhancement (2)
bug (1)

Pull Request Labels

Dependencies

.github/workflows/branch.yml actions

mshick/add-pr-comment v1 composite

.github/workflows/ci.yml actions

actions/cache v3 composite
actions/checkout v3 composite
nf-core/setup-nextflow v1 composite

.github/workflows/linting.yml actions

actions/checkout v3 composite
actions/setup-node v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite
mshick/add-pr-comment v1 composite
nf-core/setup-nextflow v1 composite
psf/black stable composite

.github/workflows/linting_comment.yml actions

dawidd6/action-download-artifact v2 composite
marocchino/sticky-pull-request-comment v2 composite

modules/nf-core/custom/dumpsoftwareversions/meta.yml cpan

pyproject.toml pypi

modules/local/sratools/fasterqdump/meta.yml cpan

modules/nf-core/custom/sratoolsncbisettings/meta.yml cpan

modules/nf-core/sratools/prefetch/meta.yml cpan

subworkflows/local/fastq_download_prefetch_fasterqdump_sratools/meta.yml cpan

modules/local/sratools/fasterqdump/environment.yml conda

pigz 2.6.*
sra-tools 3.0.8.*

modules/nf-core/custom/sratoolsncbisettings/environment.yml conda

sra-tools 3.0.8.*

modules/nf-core/sratools/prefetch/environment.yml conda

sra-tools 3.0.8.*