dragonflye

:dragon: :fly: Assemble bacterial isolate genomes from Nanopore reads

https://github.com/rpetit3/dragonflye

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: rpetit3
License: gpl-3.0
Language: Perl
Default Branch: main
Homepage:
Size: 240 KB

Statistics

Stars: 127
Watchers: 4
Forks: 11
Open Issues: 9
Releases: 20

Created almost 5 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog License Code of conduct Citation

README.md

NOTE: This is under active development; any feedback will be very useful.

dragonflye

:dragon: :fly: Assemble bacterial isolate genomes from Nanopore reads

A Quick Note

If you've worked with bacterial sequences, in all likelihood you have used one of Torsten Seemann's tools. One such tool is Shovill, which takes the bacterial genome assembly process and makes it quick and painless. Shovill was developed for paired-end Illumina reads, and there is a fork, shovill-se, which supports single-end reads.

Given the widespread usage of Shovill, and Torsten basically laying much of the groundwork, I decided to use Shovill as a framework for Dragonflye. Dragonflye can be considered a fork of Shovill that supports assembling Oxford Nanopore sequences. By going this route users will not have to relearn parameters, and will already be familiar with the outputs.

At this point, you might be wondering: so Robert you just hacked Shovill to work with ONT reads, why not just call it 'shovill-ont'?

That's because when I asked if there was interest in a "Shovill" for ONT reads, Curtis Kapsak (@kapsakcj) responded:

Curtis Kapsak (@kapsakcj): if wrapping flye , perhaps call it dragonflye (a very fast flye)?.

And honestly, how could I not go with that?!? It's an amazing play-on-words that I'm willing to bet Torsten would be proud of!

So to sum it up, thank you Torsten for Shovill and providing a framework for Dragonflye.

Introduction

Dragonflye is a pipeline that aims to make assembling Oxford Nanopore reads quick and easy. Still working on the quick part, but I think the easy part is there. Dragonflye currently supports Flye, Miniasm and Raven assemblers, and Racon and Medaka polishers.

Main Steps

Estimate genome size and read length from reads (unless --gsize provided) (kmc)
Filter reads by length (default --minreadlength 1000) (Nanoq)
Reduce FASTQ files to a sensible depth (default --depth 150) (rasusa)
Remove adapters (requires --trim be given) (Porechop)
Assemble with Flye, Miniasm, or Raven
Polish assembly with Racon and/or Medaka
Polish assembly with short reads via Polypolish and/or Pilon
Remove contigs that are too short, too low coverage, or pure homopolymers
Produce final FASTA with nicer names and parsable annotations
Reorient contigs from final FASTA using dnaapler
Output parsable assembly statistics (assembly-scan)

Quick Start

```{bash}
dragonflye --reads my-ont.fastq.gz --outdir dragonflye --gsize 5000000
... LOG TEXT ...
[dragonflye] Final assembly contigs: /home/robert_petit/repos/dragonflye/temp/dragonflye/contigs.fa
[dragonflye] It contains 3 (min=4864) contigs totalling 4939840 bp.
[dragonflye] Dragonfly fossils have been found with wingspans up to two feet (61cm)!
[dragonflye] Done.

ls dragonflye/
contigs.fa contigs.gfa dragonflye.log flye-info.txt flye.fasta

head -n4 dragonflye/contigs.fa
>contig00001 len=2753792 origname=Utg1024LN:i:2753792RC:i:486_XO:i:0 polish=none sw=dragonflye-raven/1.2.0 date=20231031
TTCTATTTATCAGTATCATTACTTTTATATTATCGATAATTAATCCGAACATATCATTAA
TCAAGTTATTATTCGAAGTGGTTTTGCTGCATTTGGAACAGTCGGGTTAAGTATGAACCT
TACCACAGAAGATAATAATGGTATTACTAAAATAATTATTATATTCGTTATGCTTTGCGG

head -n4 dragonflye/contigs.reoriented.fa
>contig00001 len=2753792 origname=Utg1024LN:i:2753792RC:i:486_XO:i:0 polish=none sw=dragonflye-raven/1.2.0 date=20231031 rotated=True
ATGTCGGAAAAAGAAATTTGGGAAAAGTGCTTGAAATTGCTCAAGAAAAATTATCAGCTG
TAAGTTACTCAACTTTCCTAAAAGATGACGAGGCTTTACACGATTAAAGATGGTGAAGCT
ATCGTATTATCGAGTATTCCTTTTAATGCAAATTGGTTAAATCAACAATATGCTGAAATT
```

Installation

Dragonflye is available from Bioconda. Dragonflye includes a lot of programs, so it can take conda a while to solve the environment. Because of this, I personally install it with Mamba, which is much faster.

```{bash}
# With conda
conda create -n dragonflye -c conda-forge -c bioconda dragonflye

# With Mamba (much quicker)
mamba create -n dragonflye -c conda-forge -c bioconda dragonflye
```

Usage

```{bash}
Dragonflye - A very fast flye

SYNOPSIS
  De novo assembly pipeline for bacterial isolates with Nanopore reads
USAGE
  dragonflye [options] --outdir DIR --reads READS.fastq.gz
GENERAL
  --help               This help
  --version            Print version and exit
  --check              Check dependencies are installed
  --seed N             Random seed to use (default: 42)
INPUT
  --reads XXX          Input Nanopore FASTQ (default: '')
  --depth N            Sub-sample --reads to this depth. Disable with --depth 0 (default: 150)
  --minreadlen N       Minimum read length. Disable with --minreadlength 0 (default: 1000)
  --gsize XXX          Estimated genome size eg. 3.2M (default: '')
OUTPUT
  --outdir XXX         Output folder (default: '')
  --prefix XXX         Prefix to use for final assembly FASTA (default: 'contigs')
  --force              Force overwite of existing output folder (default: OFF)
  --minlen N           Minimum contig length <0=AUTO> (default: 500)
  --mincov n.nn        Minimum contig coverage <0=AUTO> (default: 2)
  --namefmt XXX        Format of contig FASTA IDs in 'printf' style (default: 'contig%05d')
  --keepfiles          Keep intermediate files (default: OFF)
RESOURCES
  --tmpdir XXX         Fast temporary directory (default: '')
  --cpus N             Number of CPUs to use (0=ALL) (default: 8)
  --ram n.nn           Try to keep RAM usage below this many GB (default: 16)
ASSEMBLER
  --assembler XXX      Assembler: raven miniasm flye (default: 'flye')
  --opts XXX           Extra assembler options in quotes eg. flye: '--interations' (default: '')
  --nanohq             For Flye, use '--nano-hq' instead of --nano-raw (default: OFF)
POLISHER
  --racon N            Number of polishing rounds to conduct with Racon (default: 1)
  --medaka N           Number of polishing rounds to conduct with Medaka (requires --model) (default: 0)
  --model XXX          The model to be used by Medaka, (Assumes 1 polishing round, if --medaka not used) (default: '')
  --listmodels         List the models available to Medaka (default: OFF)
SHORT-READ POLISHER
  --polypolish N       Number of polishing rounds to conduct with Polypolish (requires --R1 and --R2) (default: 1)
  --polypolishcareful  Polypolish will ignore any reads with multiple alignments (default: OFF)
  --pilon N            Number of polishing rounds to conduct with Pilon (requires --R1 and --R2) (default: 0)
  --R1 XXX             Read 1 FASTQ to use for polishing (default: '')
  --R2 XXX             Read 2 FASTQ to use for polishing (default: '')
REORIENT
  --noreorient         Disable contig reorientation using dnaapler (default: OFF)
  --dnaaplermode XXX   The mode of reorientation to execute (default: 'all')
  --dnaapleropts XXX   Extra dnaapler options in quotes eg. '--evalue 1e-5' (default: '')
MODULES
  --trim               Enable adaptor trimming (default: OFF)
  --trimopts XXX       Extra porechop options in quotes eg. '--adapter_threshold 80' (default: '')
  --nofilter           Disable read length filtering (default: OFF)
  --nopolish           Disable assembly polishing (default: OFF)
HOMEPAGE
  https://github.com/rpetit3/dragonflye - Robert A Petit III
```

--depth

Giving an assembler too much data is a bad thing. There comes a point where you are no longer adding new information (as the genome is a fixed size), and only adding more noise (sequencing errors). Because of this Dragonflye will downsample your FASTQ files to a specific depth (defaults to 150x). It estimates depth by dividing read yield by genome size.
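The depth arithmetic described above can be sketched in a few lines of shell. This is only an illustration of the calculation: the function names `estimate_depth` and `keep_fraction` are hypothetical, and in the pipeline itself the subsampling is handled by rasusa.

```bash
# Hypothetical helpers illustrating the depth arithmetic; not part of Dragonflye.

# estimate_depth TOTAL_BASES GENOME_SIZE
# Prints read yield divided by genome size, truncated to an integer.
estimate_depth() {
    awk -v bases="$1" -v gsize="$2" 'BEGIN { printf "%d\n", bases / gsize }'
}

# keep_fraction CURRENT_DEPTH TARGET_DEPTH
# Prints the fraction of the data to keep to reach the target depth (capped at 1).
keep_fraction() {
    awk -v depth="$1" -v target="$2" 'BEGIN {
        f = (depth > target) ? target / depth : 1
        printf "%.2f\n", f
    }'
}

# Example: 1.5 Gbp of reads for a 5 Mbp genome, target depth 150x
depth=$(estimate_depth 1500000000 5000000)
echo "Estimated depth: ${depth}x"
echo "Fraction to keep for 150x: $(keep_fraction "$depth" 150)"
```

With these example numbers the estimated depth is 300x, so roughly half of the data would be kept to reach the default of 150x.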

--gsize

The genome size is needed to estimate depth and for the assembly stage. If you don't provide --gsize, it will be estimated via k-mer frequencies using kmc. It doesn't need to be a perfect estimate, just in the right ballpark. If you know the genome size, it is usually better than the estimate and will save some time.
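The help text gives `3.2M` as an example genome size. As a rough sketch of what such a value means in bases, the hypothetical helper `gsize_to_bp` below converts a suffixed size to a plain number. It assumes the suffixes K, M and G stand for thousands, millions and billions of bases; only the `M` form appears in this README, so treat the other two as an assumption.

```bash
# gsize_to_bp SIZE (hypothetical helper, not Dragonflye code)
# Converts values such as 3.2M or 500k to a number of bases.
# Assumption: K, M and G mean 1e3, 1e6 and 1e9; a bare number is returned as-is.
gsize_to_bp() {
    awk -v s="$1" 'BEGIN {
        mult = 1
        unit = toupper(substr(s, length(s), 1))
        if (unit == "K") mult = 1e3
        else if (unit == "M") mult = 1e6
        else if (unit == "G") mult = 1e9
        if (mult > 1) s = substr(s, 1, length(s) - 1)
        printf "%.0f\n", s * mult
    }'
}

gsize_to_bp 3.2M
gsize_to_bp 5000000
```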

--keepfiles

This will keep all the intermediate files in --outdir so you can explore and debug.

--cpus

By default it will use 8 CPU cores (or the value of $DRAGONFLYE_CPUS, if set). Use --cpus 0 to use all available CPU cores.

--ram

Dragonflye will do its best to keep memory usage below this value, but it is not guaranteed. If you are on an HPC cluster, make sure you request more memory than this from your job submission engine.

--assembler

By default it will use Flye. Raven and Miniasm can be selected with --assembler raven or --assembler miniasm.

--opts

If you want to provide some assembler-specific parameters, you can use the --opts parameter. Make sure you quote the parameters so they get passed as a single string, e.g. for --assembler flye you might use --opts "--iterations 4 --plasmids".
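To see why the quotes matter, the small sketch below counts how many arguments the shell actually passes in each case. `count_args` is a hypothetical helper used only for this demonstration.

```bash
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Quoted: --opts plus one string = 2 arguments, so the value stays in one piece.
count_args --opts "--iterations 4 --plasmids"

# Unquoted: the shell splits the value into separate words = 4 arguments.
count_args --opts --iterations 4 --plasmids

# The full command would then look like this (requires Dragonflye to be installed):
# dragonflye --reads my-ont.fastq.gz --outdir dragonflye --assembler flye --opts "--iterations 4 --plasmids"
```

In the unquoted form, `--iterations`, `4` and `--plasmids` would reach Dragonflye as separate arguments of its own rather than as the value of `--opts`.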

--racon & --medaka

These two parameters adjust how many polishing rounds are conducted per-polisher. For example, --racon 2 would conduct 2 rounds of polishing with Racon. If --medaka is provided, a model must also be provided with --model.

--model

A valid basecaller model must be provided with --model. If a valid model is provided, but --medaka was not provided it will assume --medaka 1.
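The interaction between --medaka and --model described above can be summarised as a small decision rule. The sketch below is an illustration of that rule only, not Dragonflye's actual code; `resolve_medaka_rounds` is a hypothetical name and `MODEL_NAME` is a placeholder for a model listed by --listmodels.

```bash
# resolve_medaka_rounds ROUNDS MODEL (hypothetical helper, not Dragonflye code)
# Prints the number of Medaka rounds implied by the two options,
# or returns 1 when rounds are requested without a model.
resolve_medaka_rounds() {
    rounds="$1"
    model="$2"
    if [ "$rounds" -gt 0 ] && [ -z "$model" ]; then
        echo "error: --medaka requires --model" >&2
        return 1
    fi
    # A model without --medaka is treated as a request for one round.
    if [ "$rounds" -eq 0 ] && [ -n "$model" ]; then
        rounds=1
    fi
    echo "$rounds"
}

resolve_medaka_rounds 0 ""            # neither option given: no Medaka polishing
resolve_medaka_rounds 0 "MODEL_NAME"  # model only: one round assumed
resolve_medaka_rounds 2 "MODEL_NAME"  # explicit number of rounds
```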

--listmodels

This will list all basecaller models that are available in Medaka.

--polypolish & --pilon & --R1 & --R2

If Illumina short-reads are provided, polishing will be done with Polypolish and/or Pilon. The value of --polypolish (Default 1) is the number of polishing rounds that will be conducted. By default Pilon is turned off.

Choosing which stages to use

Stage | Enable | Disable
------|--------|--------
Genome size estimation | default | --gsize INT
Read subsampling | --depth INT | --depth 0
Read length filtering | default | --nofilter
Adapter Trimming | --trim | default

Environment variables recognised

These environment variables will be used as defaults instead of the built-in defaults. You can still override them with the normal command-line options.

Variable | Option | Default
---------|--------|------------
$DRAGONFLYE_CPUS | --cpus | 8
$DRAGONFLYE_RAM | --ram | 16
$DRAGONFLYE_ASSEMBLER | --assembler | flye
$TMPDIR | --tmpdir | /tmp
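The order of precedence described here (command-line option first, then environment variable, then built-in default) can be sketched in shell. `effective_cpus` is a hypothetical helper written for illustration; only the variable name `$DRAGONFLYE_CPUS` and the default of 8 are taken from the table.

```bash
# effective_cpus [CLI_VALUE] (hypothetical helper, not Dragonflye code)
# Picks the CPU count in the order described above:
# command-line value first, then $DRAGONFLYE_CPUS, then the built-in default of 8.
effective_cpus() {
    cli_value="${1:-}"
    if [ -n "$cli_value" ]; then
        echo "$cli_value"
    else
        echo "${DRAGONFLYE_CPUS:-8}"
    fi
}

unset DRAGONFLYE_CPUS
effective_cpus          # nothing set: built-in default

export DRAGONFLYE_CPUS=4
effective_cpus          # environment variable replaces the default
effective_cpus 16       # command-line value takes priority
```

In practice this means an `export DRAGONFLYE_CPUS=4` line in a shell profile changes the default for every run, while `--cpus 16` on a single command still wins.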

Output Files

Filename | Description
---------|------------
contigs.fa | The final assembly you should use
contigs.reoriented.fa | If available, a reorientation of the final assembly
contigs.dnaapler.summary.tsv | If available, a summary description of reoriented contigs
contigs.gfa | Assembly graph
dragonflye.log | Full log file for bug reporting
flye.fasta | Raw assembly (flye)
flye-info.txt | Information about contigs output by Flye
miniasm.fasta | Raw assembly (miniasm)
raven.fasta | Raw assembly (raven)
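After a run, a quick check of the output folder can confirm which of these files were produced. The sketch below is a hypothetical helper, `report_outputs`, that assumes the default prefix `contigs`. It skips the raw assembly files because their names depend on the assembler used.

```bash
# report_outputs OUTDIR (hypothetical helper, not part of Dragonflye)
# Prints "present" for each listed file that exists and is non-empty,
# and "missing" otherwise. Assumes the default --prefix of 'contigs'.
report_outputs() {
    outdir="$1"
    for name in contigs.fa contigs.reoriented.fa contigs.dnaapler.summary.tsv \
                contigs.gfa dragonflye.log; do
        if [ -s "$outdir/$name" ]; then
            echo "present  $name"
        else
            echo "missing  $name"
        fi
    done
}

report_outputs dragonflye
```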

FAQ

Perl?!?! Perl?!? Really, why Perl?

Dragonflye is a fork of Shovill, and Shovill was written in Perl. Haha so yeah, instead of writing from scratch, I dusted off the old Perl skills. Upon which the Perl interpreter basically told me I sucked at Perl every time I tried to make a change (haha kept forgetting the semi-colons at the end of the line!).

Does dragonflye accept Illumina reads?

It does, but only for short-read polishing. If you want to assemble just Illumina reads, use Shovill.

Doesn't Trycycler already do this?

Dragonflye is not trying to replicate Trycycler; Trycycler is on a whole 'nother level. If you are looking to get super high quality assemblies with some manual inspection steps in between, use Trycycler. But if you are looking to just get a quick assembly that you can work with, that's what Dragonflye is for.

Can I assemble more than one genome at a time?

If you would like to assemble more than one genome using Dragonflye, I would recommend you do this with Bactopia. Bactopia will allow you to process a single genome or thousands, and it also includes many other bacterial genome analyses. If you don't want to use Bactopia, I suggest you see the next question!

Are there other similar pipelines?

hybracter is a similar alternative to Dragonflye. It is written in Snakemake and includes many of the same analyses, with many fun additions by @gbouras13. Another alternative is bacass which is a Nextflow pipeline maintained by nf-core.

Feedback

Please file questions, bugs or ideas to the Issue Tracker

Acknowledgements

I would like to personally extend my many thanks and gratitude to the authors of these software packages. Really, thank you very much!

Software Included (19)

any2fasta
Convert various sequence formats to FASTA
Seemann, T any2fasta: Convert various sequence formats to FASTA.
assembly-scan
Generate basic stats for an assembly.
Petit III, RA assembly-scan: generate basic stats for an assembly.
BWA
Burrows-Wheeler Aligner for short-read alignment
Li, H Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv q-bio.GN
dnaapler
Reorients assembled microbial sequences
Bouras G dnaapler: Reorients assembled microbial sequences
fastp
An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)
Chen, S, Zhou, Y, Chen, Y, Gu, J, fastp: an ultra-fast all-in-one FASTQ preprocessor, Bioinformatics, Volume 34, Issue 17 (2018)
Flye
De novo assembler for single molecule sequencing reads using repeat graphs
Kolmogorov, M, Yuan, J, Lin, Y, Pevzner, P, Assembly of Long Error-Prone Reads Using Repeat Graphs, Nature Biotechnology, (2019)
KMC
Fast and frugal disk based k-mer counter
Deorowicz, S, Kokot, M, Grabowski, Sz, Debudaj-Grabysz, A, KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; 31(10):1569–1576
Medaka
Sequence correction provided by ONT Research
ONT Research Medaka: Sequence correction provided by ONT Research
Miniasm
Ultrafast de novo assembly for long noisy reads (though having no consensus step)
Li, H Miniasm: Ultrafast de novo assembly for long noisy reads
Minimap2
A versatile pairwise aligner for genomic and spliced nucleotide sequences
Li, H Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100. (2018)
Nanoq
Minimal but speedy quality control for nanopore reads in Rust
Steinig, E, Coin, L, Nanoq: ultra-fast quality control for nanopore reads. Journal of Open Source Software, 7(69), 2991 (2022)
Pigz
A parallel implementation of gzip for modern multi-processor, multi-core machines.
Adler, M pigz: A parallel implementation of gzip for modern multi-processor, multi-core machines. Jet Propulsion Laboratory (2015).
Pilon
An automated genome assembly improvement and variant detection tool
Walker, BJ, Abeel, T, Shea, T, Priest, M, Abouelliel, A, Sakthikumar, S, Cuomo, CA, Zeng, Q, Wortman, J, Young, SK, Earl, AM, Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS one 9.11 e112963 (2014)
Polypolish
A short-read polishing tool for long-read assemblies
Wick, RR, Holt, KE, Polypolish: Short-read polishing of long-read bacterial genome assemblies. PLoS Computational Biology, 18(1), e1009802. (2022)
Porechop
Adapter trimmer for Oxford Nanopore reads
Wick, RR, Judd, LM, Gorrie, CL, Holt, KE, Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom. 3(10):e000132 (2017)
Racon
Ultrafast consensus module for raw de novo genome assembly of long uncorrected reads
Vaser, R, Sović, I, Nagarajan, N, Šikić, M, Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
Rasusa
Randomly subsample sequencing reads to a specified coverage
Hall, MB Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019).
Raven
De novo genome assembler for long uncorrected reads
Vaser, R, Šikić, M Time- and memory-efficient genome assembly with Raven. Nat Comput Sci 1, 332–336 (2021).
samclip
Filter SAM file for soft and hard clipped alignments
Seemann, T Samclip: Filter SAM file for soft and hard clipped alignments (GitHub)
Samtools
Tools for manipulating next-generation sequencing data
Li, H, Handsaker, B, Wysoker, A, Fennell, T, Ruan, J, Homer, N, Marth, G, Abecasis, G, Durbin, R The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009)
Seqtk
A fast and lightweight tool for processing sequences in the FASTA or FASTQ format.
Li, H Seqtk: Toolkit for processing sequences in FASTA/Q formats

Author

Robert A. Petit III
Web: https://www.robertpetit.com
Twitter: @rpetit3

Funding

Support for this project came from the Wyoming Public Health Laboratory.

Owner

Name: Robert A. Petit III
Login: rpetit3
Kind: user
Location: Cheyenne, WY
Company: Wyoming Public Health Laboratory

Website: https://www.robertpetit.com/
Twitter: rpetit3
Repositories: 147
Profile: https://github.com/rpetit3

Bioinformatician at the Wyoming Public Health Laboratory. Developer of the Bactopia and other microbial genomic tools.

Citation (citation.cff)

cff-version: 1.2.0
message: "If you use dragonflye, please cite it as below."
authors:
- family-names: "Petit III"
  given-names: "Robert A. "
  orcid: "https://orcid.org/0000-0002-1350-9426"
title: "dragonflye: Assemble bacterial isolate genomes from Nanopore reads"
url: "https://github.com/rpetit3/dragonflye"
version: 1.1.2

GitHub Events

Total

Issues event: 6
Watch event: 13
Issue comment event: 11
Push event: 5
Fork event: 1
Create event: 1

Last Year

Issues event: 6
Watch event: 13
Issue comment event: 11
Push event: 5
Fork event: 1
Create event: 1

Committers

Last synced: about 1 year ago

All Time

Total Commits: 133
Total Committers: 1
Avg Commits per committer: 133.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Robert A. Petit III	r**t@g**m	133

Issues and Pull Requests

Last synced: 9 months ago

All Time

Total issues: 40
Total pull requests: 0
Average time to close issues: 3 months
Average time to close pull requests: N/A
Total issue authors: 31
Total pull request authors: 0
Average comments per issue: 4.1
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 7
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 5
Pull request authors: 0
Average comments per issue: 0.57
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

llk578496 (3)
incoherentian (3)
valery-shap (2)
drhoads (2)
lskatz (2)
marchoeppner (2)
andreagp96 (1)
ayoraind (1)
erinyoung (1)
MostafaYA (1)
Shubhamverma-bioinfo (1)
chrisgulvik (1)
dfornika (1)
gaworj (1)
nbenzakour (1)

Science Score: 67.0%
