stylo

Nanopore assembly workflow from basecalled reads to polished assembly plus assembly QC, metrics, and plasmid replicon detection

https://github.com/ncezid-narst/stylo

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.2%) to scientific vocabulary

Scientific Fields

Engineering Computer Science - 40% confidence
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: ncezid-narst
  • License: mit
  • Language: Nextflow
  • Default Branch: main
  • Size: 407 KB
Statistics
  • Stars: 9
  • Watchers: 1
  • Forks: 1
  • Open Issues: 6
  • Releases: 0
Created over 2 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

Stylo

Nanopore assembly workflow from basecalled reads to polished assembly plus assembly QC, metrics, and plasmid replicon detection

Install

Navigate to your home directory and git clone the repository.

```bash
$ git clone https://github.com/ncezid-narst/stylo.git
```

You will need to install Nextflow if you don't already have it: https://www.nextflow.io/docs/latest/getstarted.html

You will also need to install Singularity if you don't already have it: https://docs.sylabs.io/guides/3.0/user-guide/quick_start.html

If you're working on CDC servers, run module load nextflow/XX.XX.X to load Nextflow, Singularity, and Java modules.

As of 4/17/2023, this workflow was developed on Nextflow version 22.10.6.

Overview

  • READFILTERING - Filters reads based on read-length using Nanoq
  • DOWNSAMPLE - Randomly downsamples read set based on organism genome size and desired coverage using Rasusa
  • ASSEMBLE - Generates long-read assembly using Flye
  • HYBRID - Generates hybrid assembly using Unicycler (alternative option to using Flye)
  • ROTATE - Changes start position of contigs using Circlator
  • POLISH - Creates consensus calls using Medaka
  • RENAME - Renames polished assembly for convenience
  • FORMATREADS - Preps reads to be run through staramr using Seqtk
  • PLASMIDCHECK - Detects plasmid replicons in reads and assembly using Staramr
  • SOCRU - Checks the order and orientation of genome fragments around rRNA operons using Socru
  • ASSEMBLYQC - Generates assembly quality metrics using BUSCO
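The DOWNSAMPLE step rests on simple arithmetic: the number of bases to keep is roughly genome size multiplied by desired coverage. The sketch below only illustrates that calculation. The function name `target_bases` and the example values are hypothetical and are not stylo defaults.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of stylo): estimate how many bases a
# downsampled read set should contain for a given genome size (in bp)
# and target coverage (in x).
target_bases() {
    local genome_size_bp=$1
    local coverage=$2
    echo $(( genome_size_bp * coverage ))
}

# Example values only: a 5 Mbp genome at 100x coverage.
target_bases 5000000 100
```

Comparing this figure with the total bases in a read set gives a rough idea of whether downsampling will discard any reads at all.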

Parameters

Parameters for each process can be changed in stylo.config under the first bracketed section params. Check out Resources for links to each process's main github page to learn more about process-specific parameters.

Prior to running stylo, make sure the INITIAL PARAMETERS are set accurately - the default settings are as follows:

```java
//Initial parameters
reads = 'fastq_pass/**.fastq.gz'
sampleinfo = 'sampleinfo.txt'
outdir = 'stylo'
unicycler = false
```

reads: Stylo will look for any and all fastq.gz files under a directory and assume each one is a unique sample. Prior to running stylo, you should concatenate, rename, and compress (if they're not already) your reads. Anything prior to the file extension will be used as the Sample ID. For example:

```bash
fastq_pass/
├── 01_2014C-3598
│   └── 01_2014C-3598_all.fastq.gz
├── 02_2014C-3599
│   └── 02_2014C-3599_all.fastq.gz
├── 03_2014C-3857
│   └── 03_2014C-3857_all.fastq.gz
```
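The concatenate-and-rename step could be sketched as follows. This is a minimal example, not part of stylo: the function name `merge_barcode` and the assumption that raw chunks sit in one directory per barcode are illustrative. It relies on the fact that concatenated gzip files are still valid gzip. Keep the raw chunks outside the directory that the reads parameter points to, otherwise each chunk would be picked up as its own sample.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of stylo): merge all fastq.gz chunks from
# one barcode directory into a single file named after the sample ID.
merge_barcode() {
    local barcode_dir=$1   # e.g. raw/barcode01 (assumed layout)
    local sample_id=$2     # e.g. 01_2014C-3598_all
    local out_root=$3      # e.g. fastq_pass
    # Directory name mirrors the README example: sample ID without "_all".
    local sample_dir=${sample_id%_all}
    mkdir -p "$out_root/$sample_dir"
    cat "$barcode_dir"/*.fastq.gz > "$out_root/$sample_dir/$sample_id.fastq.gz"
}

# Usage (paths are placeholders):
#   merge_barcode raw/barcode01 01_2014C-3598_all fastq_pass
```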

sampleinfo: Tab-delimited text file with sample information. For example:

```bash
BARCODE	WGSID	GENUS	SPECIES
barcode01	01_2014C-3598_all	Salmonella	enterica
barcode02	02_2014C-3599_all	Salmonella	enterica
barcode03	03_2014C-3857_all	Salmonella	enterica
```

  • BARCODE: Standard barcode output from MinKNOW e.g. barcode01-barcode96
  • WGSID: Sample ID. Must match filename of concatenated reads.
  • GENUS: Sample genus, used in SOCRU
  • SPECIES: Sample species, used in SOCRU
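Because WGSID must match the reads filename, a quick pre-flight check can catch mismatches before launching the workflow. The sketch below is not part of stylo. The function name `missing_reads` is hypothetical, and it assumes the sheet has a header row and that each reads file is named `<WGSID>.fastq.gz`.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check (not part of stylo): print every WGSID in
# the sample sheet that has no matching <WGSID>.fastq.gz under reads_dir.
missing_reads() {
    local sampleinfo=$1   # tab-delimited, header on the first line
    local reads_dir=$2    # e.g. fastq_pass
    # Column 2 is WGSID; NR > 1 skips the header row.
    awk -F'\t' 'NR > 1 && $2 != "" { print $2 }' "$sampleinfo" |
    while IFS= read -r wgsid; do
        if ! find "$reads_dir" -type f -name "${wgsid}.fastq.gz" | grep -q .; then
            echo "$wgsid"
        fi
    done
}

# Usage (paths are placeholders):
#   missing_reads sampleinfo.txt fastq_pass
```

An empty result means every sample ID in the sheet has a reads file.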

outdir: Name of Stylo output directory. Default name is set to stylo.

unicycler: Option to use Unicycler instead of Flye as the assembler. Default is set to false.

You can see how parameters are used in the next section Usage.

NOTE: Support for hybrid assemblies using short reads hasn't been added yet. The unicycler option was added as an experiment to test how well k-mer-based assemblers perform with ONT's v14/r10.4.1 chemistry.

Processes

Directives for each process can be changed in stylo.config under the second bracketed section process. This is where you can update the containers used in each process. Check out Resources to see a full list of all the containers and the tools' githubs.

Profiles

Configuration settings for each profile can be changed in stylo.config under the third bracketed section profiles. This is where you can update or create profiles that will dictate where and how each process is run. By default, there are two main profiles and three auxiliary profiles:

  • standard: Will execute stylo using the 'local' executor, running processes on the computer where Nextflow is launched.
  • sge: Will execute stylo using a Sun Grid Engine cluster, running processes on the HPC (qsub).
  • short: Auxiliary profile to change the sge default queue to the short queue
  • gpu: Auxiliary profile to change the sge default queue to the gpu queue
  • highmem: Auxiliary profile to change the sge default queue to the highmem queue

You can see how profiles are used in the next section Usage.

NOTE: The default profile settings were mostly pulled from recommendations made by CDC Scicomp in their Nextflow training called 'Reproducible, scalable, and shareable analysis workflows with Nextflow'. There is a good chance you will have to create/modify your own profile to run stylo using your institution's computing environment. Check out Resources to learn more about creating profiles.
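As an illustration of what a site-specific profile might look like, the sketch below writes a small Nextflow config fragment for a SLURM cluster. It is not part of stylo: the function name `write_slurm_profile`, the profile name `slurm`, the queue name `general`, and the output filename are all placeholders. `process.executor` and `process.queue` are standard Nextflow settings.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of stylo): write an example profile
# fragment to the given path so it can be reviewed and adapted.
write_slurm_profile() {
    local out_file=$1
    cat > "$out_file" <<'EOF'
profiles {
    // Placeholder profile: adjust the executor and queue for your cluster.
    slurm {
        process.executor = 'slurm'
        process.queue    = 'general'
    }
}
EOF
}

# Usage (filename is a placeholder):
#   write_slurm_profile my_profiles.config
```

The `slurm` entry could then be copied into the existing profiles section of stylo.config and selected with `-profile slurm`.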

Usage

Once you've made the necessary changes to the configuration file for your computing environment and have set the initial parameters, you can run stylo just as you would any Nextflow workflow:

```bash
nextflow run /path/to/stylo/schtappe/stylo.nf -c /path/to/stylo/config/stylo.config
```

Nextflow is picky about single-hyphen flags vs. double-hyphen flags. Single-hyphen flags are options for the nextflow command itself, while double-hyphen flags override the parameters in the configuration file. For example, to change the initial parameters without directly editing stylo.config:

```bash
nextflow run /path/to/stylo/schtappe/stylo.nf -c /path/to/stylo/config/stylo.config \
  --reads 'path/to/your/reads/**.fastq.gz' \
  --sampleinfo yoursampleinfofile.txt \
  --outdir youroutputdirectory \
  --unicycler true
```

By default, nextflow will run locally. If you want to specify a profile, use the -profile flag. For example, to qsub stylo's processes:

```bash
nextflow run /.../stylo/schtappe/stylo.nf -c /.../stylo/config/stylo.config -profile sge
```

You can change the queue by adding the auxiliary profile name, separated by a comma:

```bash
nextflow run /.../stylo/schtappe/stylo.nf -c /.../stylo/config/stylo.config -profile sge,highmem
```

Run nextflow help or nextflow run -help for more information on nextflow flags.

NOTE: Nextflow applies the same parameters to every sample being processed. This means you should run stylo on read sets that all come from the same organism (or at least share the same genome size) and that were all generated using the same chemistry and guppy basecaller version, since these affect flyereadtype and medaka_model. This could change in the future by adding more fields to the sampleinfo sheet, but for now it is what it is.
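One way to guard against mixing organisms in a single run is to list the distinct genus and species pairs in the sample sheet before launching. The sketch below is not part of stylo. The function name `organisms_in_sheet` is hypothetical, and it assumes the four-column, tab-delimited layout with a header row described under Parameters.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of stylo): print each distinct
# "GENUS SPECIES" pair found in the sample sheet, one per line.
organisms_in_sheet() {
    local sampleinfo=$1
    # Columns 3 and 4 are GENUS and SPECIES; NR > 1 skips the header row.
    awk -F'\t' 'NR > 1 && $3 != "" { print $3 " " $4 }' "$sampleinfo" | sort -u
}

# Usage (path is a placeholder): warn if more than one organism is listed.
#   n=$(organisms_in_sheet sampleinfo.txt | wc -l)
#   [ "$n" -le 1 ] || echo "warning: sample sheet mixes $n organisms"
```

This only checks the organism. It cannot tell whether the read sets share the same chemistry or basecaller version.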

Output

Here's what stylo output looks like per sample (directories only):

```bash
stylo/
└── PNUSAS002131
    ├── busco
    │   ├── auto_lineage
    │   │   ├── run_archaea_odb10
    │   │   │   ├── busco_sequences
    │   │   │   │   ├── fragmented_busco_sequences
    │   │   │   │   ├── multi_copy_busco_sequences
    │   │   │   │   └── single_copy_busco_sequences
    │   │   │   └── hmmer_output
    │   │   └── run_bacteria_odb10
    │   │       ├── busco_sequences
    │   │       │   ├── fragmented_busco_sequences
    │   │       │   ├── multi_copy_busco_sequences
    │   │       │   └── single_copy_busco_sequences
    │   │       ├── hmmer_output
    │   │       └── placement_files
    │   ├── logs
    │   ├── prodigal_output
    │   │   └── predicted_genes
    │   │       └── tmp
    │   ├── run_bacteria_odb10
    │   │   ├── busco_sequences
    │   │   │   ├── fragmented_busco_sequences
    │   │   │   ├── multi_copy_busco_sequences
    │   │   │   └── single_copy_busco_sequences
    │   │   ├── hmmer_output
    │   │   └── placement_files
    │   └── run_enterobacterales_odb10
    │       ├── busco_sequences
    │       │   ├── fragmented_busco_sequences
    │       │   ├── multi_copy_busco_sequences
    │       │   └── single_copy_busco_sequences
    │       └── hmmer_output
    ├── flye
    │   ├── 00-assembly
    │   ├── 10-consensus
    │   ├── 20-repeat
    │   ├── 30-contigger
    │   └── 40-polishing
    ├── medaka
    ├── reads
    ├── socru
    ├── staramr_assembly
    │   └── hits
    └── staramr_reads
        └── hits
```
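A quick completeness check over the per-sample output can be sketched from the top-level directories listed above. This is not part of stylo, and the function name `check_sample_output` is hypothetical. The directory list is taken from the Flye-based example, so a run with unicycler set to true may not produce a flye directory (an assumption, not confirmed here).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of stylo): report which expected
# top-level subdirectories are missing from one sample's output.
check_sample_output() {
    local sample_dir=$1   # e.g. stylo/PNUSAS002131
    local status=0
    local sub
    for sub in busco flye medaka reads socru staramr_assembly staramr_reads; do
        if [ ! -d "$sample_dir/$sub" ]; then
            echo "missing: $sample_dir/$sub"
            status=1
        fi
    done
    # Non-zero exit status when at least one directory is absent.
    return "$status"
}

# Usage (path is a placeholder):
#   check_sample_output stylo/PNUSAS002131
```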

Resources

Containers:

  • Nanoq:
    • https://hub.docker.com/r/jimmyliu1326/nanoq
    • https://github.com/esteinig/nanoq
  • Rasusa:
    • https://hub.docker.com/r/staphb/rasusa
    • https://github.com/mbhall88/rasusa
  • Flye:
    • https://hub.docker.com/r/staphb/flye
    • https://github.com/fenderglass/Flye
  • Unicycler:
    • https://hub.docker.com/r/staphb/unicycler
    • https://github.com/rrwick/Unicycler
  • Circlator:
    • https://hub.docker.com/r/staphb/circlator
    • https://github.com/sanger-pathogens/circlator
  • Medaka:
    • https://hub.docker.com/r/ontresearch/medaka
    • https://github.com/nanoporetech/medaka
  • Seqtk:
    • https://hub.docker.com/r/staphb/seqtk
    • https://github.com/lh3/seqtk
  • Staramr:
    • https://hub.docker.com/r/staphb/staramr
    • https://github.com/phac-nml/staramr
  • Socru:
    • https://hub.docker.com/r/quadraminstitute/socru
    • https://github.com/quadram-institute-bioscience/socru
  • BUSCO:
    • https://hub.docker.com/r/ezlabgva/busco
    • https://busco.ezlab.org/

Owner

  • Name: ncezid-narst
  • Login: ncezid-narst
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: stylo
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Justin
    family-names: Kim
    email: mwq9@cdc.gov
    affiliation: Enteric Diseases Laboratory Branch
    orcid: 'https://orcid.org/0000-0001-8745-9612'
  - given-names: Joe
    family-names: Wirth
    email: uma2@cdc.gov
    affiliation: Enteric Diseases Laboratory Branch
    orcid: 'https://orcid.org/0000-0002-9750-2845'
repository-code: 'https://github.com/ncezid-narst/stylo'
abstract: >-
  Nanopore assembly workflow from basecalled reads to
  polished assembly plus assembly QC, metrics, and plasmid
  replicon detection
keywords:
  - ONT
  - nanopore
  - assembly
  - long-read
  - longread
  - schtappe
license: MIT

GitHub Events

Total
  • Push event: 19
  • Fork event: 1
  • Create event: 2
Last Year
  • Push event: 19
  • Fork event: 1
  • Create event: 2