https://github.com/cdcgov/tostadas

🧬 💻 TOSTADAS → Toolkit for Open Sequence Triage, Annotation and DAtabase Submission

Keywords

azure bioinformatics conda docker genbank genetics liftoff metadata mpox ncbi nextflow nf-tower pipeline python rna-seq scicomp sequencing singularity sra vadr

Last synced: 5 months ago

Repository

Basic Info

Host: GitHub
Owner: CDCgov
License: apache-2.0
Language: Python
Default Branch: dev
Homepage: https://cdcgov.github.io/tostadas/
Size: 48.7 MB

Statistics

Stars: 28
Watchers: 6
Forks: 15
Open Issues: 1
Releases: 37

Created over 3 years ago · Last pushed 6 months ago

Metadata Files

Readme Contributing License Code of conduct

TOSTADAS → Toolkit for Open Sequence Triage, Annotation and DAtabase Submission :dna: :computer:

PATHOGEN ANNOTATION AND SUBMISSION PIPELINE

For the complete TOSTADAS documentation, see the project homepage: https://cdcgov.github.io/tostadas/

Warnings

Plugin Compatibility Warning

❗ Important Note: This pipeline uses the nf-schema plugin to validate pipeline parameters. Users with Nextflow version 24 or later may see a warning that the plugin must be installed. To resolve this warning, install the plugin manually by following the nf-schema plugin's installation instructions.

Overview

T O S T A D A S
Toolkit for Open Sequence Triage, Annotation, and DAtabase Submission

A portable, open-source pipeline designed to streamline submission of pathogen genomic data to public repositories. Reducing barriers to timely data submission increases the value of public repositories for both public health decision making and scientific research. TOSTADAS facilitates routine sequence submission by standardizing and automating:

Metadata Validation
Genome Annotation
File submission

TOSTADAS is designed to be flexible, modular, and pathogen agnostic, allowing users to customize their submission of raw read data, assembled genomes, or both. The current release has been tested with sequence data from Poxviruses and select bacteria. Testing for additional pathogens is planned for future releases.

Installation and Quick Start

❗ Note: If you are a CDC user, please follow the set-up instructions found here: CDC User Guide

For non-CDC users, please follow the instructions below.

1. Clone the repository to your local machine

git clone https://github.com/CDCgov/tostadas.git

❗ Note: If you already have Nextflow installed in your local environment, skip ahead to step 5.

2. Install mamba and add it to your PATH

2a. Install mamba

❗ Note: If you have mamba installed in your local environment, skip ahead to step 3 (Install Nextflow using mamba and the bioconda Channel).

curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge

2b. Add mamba to PATH

export PATH="$HOME/mambaforge/bin:$PATH"

3. Install Nextflow using mamba and the bioconda Channel

mamba install -c bioconda nextflow

4. Update the default submissions config file with your NCBI username and password

```
# update this config file (you don't have to use vim)
vim conf/submission_config.yaml
```

5. Run the workflow with default parameters and the local run environment:

```
# test command for virus reads
nextflow run main.nf -profile test,<docker|singularity> --species virus
```

The pipeline outputs appear in `tostadas/test_output`.

6. Start running your own analysis

Annotate and submit viral reads

nextflow run main.nf -profile <docker|singularity> --workflow biosample_and_sra --species virus --submission --annotation --outdir <path/to/output/dir/> --meta_path <path/to/metadata_file.xlsx> --submission_config <path/to/submission_config.yaml>

Annotate and submit bacterial reads

nextflow run main.nf -profile <docker|singularity> --workflow biosample_and_sra --species bacteria --submission --annotation --meta_path <path/to/metadata_file.xlsx> --submission_config <path/to/submission_config.yaml> --download_bakta_db --bakta_db_type <light|full> --outdir <path/to/output/dir/>
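If you script these runs, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of TOSTADAS: `build_run_command` is a hypothetical helper, it uses only the flags shown in the commands above, and treating the Bakta flags as required for bacterial runs is an assumption taken from the bacterial example.

```python
import shlex


def build_run_command(profile, species, meta_path, submission_config, outdir,
                      workflow="biosample_and_sra", bakta_db_type=None):
    """Assemble the argv list for a TOSTADAS run (hypothetical helper)."""
    if profile not in ("docker", "singularity"):
        raise ValueError(f"unsupported profile: {profile!r}")
    cmd = [
        "nextflow", "run", "main.nf",
        "-profile", profile,
        "--workflow", workflow,
        "--species", species,
        "--submission", "--annotation",
        "--meta_path", meta_path,
        "--submission_config", submission_config,
        "--outdir", outdir,
    ]
    if species == "bacteria":
        # Assumption: mirror the bacterial example, which downloads a Bakta database.
        if bakta_db_type not in ("light", "full"):
            raise ValueError("bakta_db_type must be 'light' or 'full' for bacteria")
        cmd += ["--download_bakta_db", "--bakta_db_type", bakta_db_type]
    return cmd


# Print a shell-safe command line for review before running it.
cmd = build_run_command("docker", "virus", "metadata.xlsx",
                        "conf/submission_config.yaml", "results")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.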

Submit to GenBank

GenBank submission requires a modified metadata file that includes the accession IDs assigned during BioSample and SRA submission. This file is generated as an output of the biosample_and_sra workflow and can be found in the results directory, for example: test_output/mpxv_test_metadata/final_submission_outputs/mpxv_test_metadata_updated.xlsx.

To submit reads to GenBank, use the following command:

nextflow run main.nf -profile <docker|singularity> --workflow genbank --dry_run false --species mpxv --submission_config <path/to/submission_config.yaml> --updated_meta_path <path/to/updated/metadata/file>

Refer to the GitHub Pages website for more information on input parameters and use cases.

Retrieve accession IDs

To fetch and parse report.xml files from a previous submission, use the following command:

nextflow run main.nf -profile <docker|singularity> --workflow fetch_accessions --dry_run false --species mpxv --submission_config <path/to/submission_config.yaml> --meta_path assets/sample_metadata/mpxv_test_metadata

Submit updates to a BioSample submission

NCBI allows UI-less updating of BioSample submissions, and TOSTADAS supports this through the --workflow update_submission option.

To submit updated metadata to BioSample, use the following command:

nextflow run main.nf -profile <docker|singularity> --workflow update_submission --dry_run false --species mpxv --submission_config <path/to/submission_config.yaml> --original_submission_dir <results/mpxv_test_metadata/submission_outputs> --meta_path <path/to/updated/metadata/file>

Make sure your updated metadata Excel file has a biosample_accession column that contains the correct accession IDs. TOSTADAS does not check these for accuracy.

Note: TOSTADAS uses the ncbi-spuid field to match samples in the metadata file and the original submission.xml. The sample_name field is not preserved in the submission.xml, so it cannot be used as an identifier for this workflow.
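Because TOSTADAS does not validate these values, a quick pre-flight check can catch obvious mistakes. The sketch below is an illustration, not TOSTADAS code: `find_accession_problems` is a hypothetical helper, rows are plain dictionaries (for example, read from the Excel file with a library of your choice), and the `SAMN` plus digits pattern is an assumption that covers NCBI-issued BioSample accessions only. It checks format and duplicate ncbi-spuid values; it cannot confirm that an accession belongs to the right sample.

```python
import re

# Assumption: NCBI-issued BioSample accessions look like "SAMN" followed by digits.
ACCESSION_RE = re.compile(r"^SAMN\d+$")


def find_accession_problems(rows):
    """Return (row_number, message) pairs for rows that look wrong."""
    problems = []
    seen_spuids = set()
    for number, row in enumerate(rows, start=1):
        accession = str(row.get("biosample_accession") or "").strip()
        spuid = str(row.get("ncbi-spuid") or "").strip()
        if not ACCESSION_RE.match(accession):
            problems.append((number, f"unexpected biosample_accession: {accession!r}"))
        if not spuid:
            problems.append((number, "missing ncbi-spuid"))
        elif spuid in seen_spuids:
            problems.append((number, f"duplicate ncbi-spuid: {spuid!r}"))
        seen_spuids.add(spuid)
    return problems


# Made-up example rows; the accession values are placeholders.
rows = [
    {"ncbi-spuid": "S1", "biosample_accession": "SAMN00000001"},
    {"ncbi-spuid": "S1", "biosample_accession": "SAMN00000002"},
    {"ncbi-spuid": "S3", "biosample_accession": ""},
]
for number, message in find_accession_problems(rows):
    print(f"row {number}: {message}")
```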

7. Custom metadata validation and custom BioSample package

TOSTADAS defaults to the Pathogen.cl.1.0 (Pathogen: clinical or host-associated; version 1.0) NCBI BioSample package for submissions to the BioSample repository. To submit using a different BioSample package:

1. Change the package name in conf/submission_config.yaml. Choose one of the available NCBI BioSample packages.
2. Add the necessary fields for your BioSample package to your input Excel file.
3. Add those fields as keys to the JSON file (assets/custom_meta_fields/example_custom_fields.json) and provide key info as needed:

replace_empty_with: TOSTADAS replaces any empty cells with this value. (Example application: NCBI expects some value for any mandatory field, so if a cell is empty you may want to change it to "Not Provided".)
new_field_name: TOSTADAS replaces the field name in your metadata Excel file with this value. (Example application: you get weekly metadata Excel files that specify 'animal_environment' but NCBI expects 'animal_env'; you can specify this once in the JSON file and it will be changed on every run.)

Submit to a custom BioSample package

nextflow run main.nf -profile <docker|singularity> --workflow biosample_and_sra --species virus --submission --annotation --sra true --outdir <path/to/output/dir/> --meta_path <path/to/metadata_file.xlsx> --submission_config <path/to/submission_config.yaml> --custom_fields_file <path/to/metadata_custom_fields.json>
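To make the effect of the two custom-field settings concrete, here is a minimal sketch of how such rules could be applied to one metadata row. It is an illustration under assumptions, not the TOSTADAS implementation: `apply_custom_fields` is a hypothetical function, and the JSON layout (one top-level key per field, each holding its settings) is assumed; check assets/custom_meta_fields/example_custom_fields.json for the actual structure.

```python
import json


def apply_custom_fields(row, rules):
    """Return a copy of one metadata row with the custom-field rules applied."""
    result = {}
    for field, value in row.items():
        rule = rules.get(field, {})
        # Fill empty cells when the rule provides a replacement value.
        if value is None or str(value).strip() == "":
            value = rule.get("replace_empty_with", value)
        # Rename the field when the rule provides a new name.
        result[rule.get("new_field_name") or field] = value
    return result


# Assumed JSON layout, using the field names from the example above.
rules = json.loads("""
{
  "animal_environment": {
    "replace_empty_with": "Not Provided",
    "new_field_name": "animal_env"
  }
}
""")
row = {"sample_name": "sample_1", "animal_environment": ""}
print(apply_custom_fields(row, rules))
```

Fields without a rule pass through unchanged, so the same rules file can be reused across metadata files that contain different optional columns.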

Workflow Parameters Overview

This section outlines the primary parameters available for configuring and running the TOSTADAS pipeline effectively, allowing users to tailor the workflow for their needs:

| Parameter | Description | Input Required |
|---|---|---|
| --annotation | Toggle for running annotation | Yes (true/false as bool) |
| --submission | Toggle for running submission | Yes (true/false as bool) |
| --update_submission | Toggle to update data for existing BioSample or SRA records (currently in progress) | Yes (true/false as bool) |
| --workflow | Specifies the workflow to execute, allowing users to choose the appropriate processing method. | Yes (string) |

Workflow Options

The following workflows are available for the --workflow parameter:

biosample_and_sra: Runs a submission to BioSample and SRA.
genbank: Runs a GenBank submission.
fetch_accessions: Fetches reports and updates the metadata file.
full_submission: Executes BioSample and SRA submissions, waits 60 seconds multiplied by params.batch_size, fetches reports, updates the metadata file with accession IDs, and then performs the GenBank submission.

Note: The GenBank submission cannot complete without a BioSample accession ID.
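The wait in full_submission scales with batch size, so it is worth estimating before launching a large run. The arithmetic can be sketched as follows; `report_wait_seconds` is a hypothetical name, and the only input taken from the description above is the rule of 60 seconds multiplied by params.batch_size.

```python
def report_wait_seconds(batch_size, seconds_per_unit=60):
    """Estimate the pause before reports are fetched: 60 s x batch size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return seconds_per_unit * batch_size


# Print the estimated wait for a few batch sizes.
for batch_size in (1, 5, 20):
    wait = report_wait_seconds(batch_size)
    print(f"batch_size={batch_size}: wait {wait} s ({wait / 60:.0f} min)")
```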

For more detailed information on each parameter and additional configurations, please refer to the TOSTADAS documentation.

Troubleshooting

If you encounter issues while using the TOSTADAS pipeline, refer to the following troubleshooting steps to resolve common problems:

Common Issues and Solutions

1. Errors with 'table2asn not on PATH' or a Python library missing when using the `singularity` or `docker` profiles

Issue: Nextflow is using an outdated cached image.

Solution: Locate the image (e.g., $HOME/.singularity/staphb-tostadas-latest.img) and delete it. This will force Nextflow to pull the latest version.
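To find candidate images before deleting anything, a small listing script can help. This is a sketch under assumptions: `find_cached_images` is a hypothetical helper, the cache directory is taken from the example path above and may differ on your system, and matching on the substring "tostadas" is a guess at how cached images are named. The script only lists files; deletion is left to you.

```python
from pathlib import Path


def find_cached_images(cache_dir, name_fragment="tostadas"):
    """List files in cache_dir whose names contain name_fragment."""
    cache = Path(cache_dir).expanduser()
    if not cache.is_dir():
        return []
    return sorted(
        path for path in cache.iterdir()
        if path.is_file() and name_fragment in path.name
    )


# Review the matches, then remove a stale image with path.unlink() if appropriate.
for image in find_cached_images("~/.singularity"):
    print(image)
```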

2. Pipeline hangs indefinitely during the submission step, or you get a "duplicate BioSeq ID error"

Issue: This may be caused by duplicate sequence IDs in the FASTA file (e.g., a multicontig FASTA). This is only a problem for submissions to GenBank using table2asn.

Solution: Review the sequence headers in the sample FASTA files and ensure that each header is unique.
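The header review can be automated. The sketch below is not part of TOSTADAS; `duplicate_fasta_ids` is a hypothetical helper, and it assumes the sequence ID is the first whitespace-delimited token after `>`, which is the usual FASTA convention.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return sequence IDs that appear more than once, in first-seen order."""
    ids = [
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    ]
    counts = Counter(ids)
    return [seq_id for seq_id in counts if counts[seq_id] > 1]


# Made-up example: two records share the ID "contig_1".
fasta = [
    ">contig_1 sample A",
    "ACGT",
    ">contig_2",
    "GGCC",
    ">contig_1 sample A copy",
    "TTAA",
]
print(duplicate_fasta_ids(fasta))
```

To check a file on disk, pass the open file handle instead of a list, for example `with open(path) as handle: duplicate_fasta_ids(handle)`.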

Get in Touch

If you need to report a bug, suggest new features, or just say “thanks”, open an issue and we’ll try to get back to you as soon as possible!

Acknowledgements

Contributors

Tools

The submission portion of this pipeline was adapted from SeqSender. To find more information on this tool, please refer to their GitHub page: SeqSender

Resources

:link: NCBI Submission Guidelines: https://submit.ncbi.nlm.nih.gov/sarscov2/sra/#step6

:link: SeqSender Documentation: https://github.com/CDCgov/seqsender

:link: Liftoff Documentation: https://github.com/agshumate/Liftoff

:link: VADR Documentation: https://github.com/ncbi/vadr.git

:link: Bakta Documentation: https://github.com/oschwengers/bakta

:link: RepeatMasker Documentation: https://www.repeatmasker.org/

CDC Metadata

Organization: NCEZID-OAMD
Contact email: ncezid_shareit@cdc.gov
Exemption status: NA
Exemption justification: NA
Description fields: Nextflow workflow for viral and bacterial annotation and automated upload to NCBI databases

Owner

Name: Centers for Disease Control and Prevention
Login: CDCgov
Kind: organization
Email: data@cdc.gov
Location: Atlanta, GA

Website: http://open.cdc.gov/
Twitter: CDCgov
Repositories: 114
Profile: https://github.com/CDCgov

CDC's collaborative software projects to protect America from health, safety, and security threats, both foreign and in the U.S.

Committers

Last synced: about 2 years ago

All Time

Total Commits: 594
Total Committers: 6
Avg Commits per committer: 99.0
Development Distribution Score (DDS): 0.52

Past Year

Commits: 594
Committers: 6
Avg Commits per committer: 99.0
Development Distribution Score (DDS): 0.52

Top Committers

Name	Email	Commits
ankushkgupta2	a**a@d**m	285
Cole tindall	1****1	200
Gupta	u**1@b**v	67
Gupta	u**1@c**v	34
Swarnali Louha	1****3	6
Kyle O'Connell	8****l	2

Committer Domains (Top 20 + Academic)

cdc.gov: 1
biolinux.biotech.cdc.gov: 1
deloitte.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 179
Total pull requests: 182
Average time to close issues: 3 months
Average time to close pull requests: 4 days
Total issue authors: 21
Total pull request authors: 10
Average comments per issue: 0.64
Average comments per pull request: 0.5
Merged pull requests: 152
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 80
Pull requests: 83
Average time to close issues: about 1 month
Average time to close pull requests: 5 days
Issue authors: 10
Pull request authors: 4
Average comments per issue: 0.59
Average comments per pull request: 0.89
Merged pull requests: 64
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

jessicarowell (59)
ankushkgupta2 (55)
kyleoconnell (25)
RamiyapriyaS (13)
slsevilla (5)
kyleoconnell-CDC (2)
mikeyweigand (2)
Alan-Collins (2)
erinyoung (2)
jtakakuwa (2)
DOH-KEW4303 (2)
Swarnali3 (1)
lskatz (1)
krt7-cdc (1)
garfinjm (1)

Pull Request Authors

jessicarowell (84)
ankushkgupta2 (56)
RamiyapriyaS (46)
kyleoconnell (42)
Swarnali3 (4)
macoven-del (2)
robsyme (2)
slsevilla (2)
Alan-Collins (1)
zyosufzai (1)

Top Labels

Issue Labels

enhancement (92) bug (77) implement (35) documentation (14) high-priority (11) explore (9) refactor (8) testing-related (3) testing (3) internal (2) good first issue (1) frontend (1) question (1)

Pull Request Labels

bug (11) enhancement (5) documentation (4) governance (1)

Science Score: 36.0%
