Nanoq

Nanoq: ultra-fast quality control for nanopore reads - Published in JOSS (2022)

https://github.com/esteinig/nanoq

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file
  Found codemeta.json file
✓ .zenodo.json file
  Found .zenodo.json file
✓ DOI references
  Found 6 DOI reference(s) in README and JOSS metadata
✓ Academic publication links
  Links to: joss.theoj.org
✓ Committers with academic emails
  2 of 4 committers (50.0%) from academic institutions
○ Institutional organization owner
✓ JOSS paper metadata
  Published in Journal of Open Source Software

Last synced: 6 months ago

Repository

Minimal but speedy quality control for nanopore reads in Rust :bear:

Basic Info

Host: GitHub
Owner: esteinig
License: mit
Language: Rust
Default Branch: master
Homepage:
Size: 1010 KB

Statistics

Stars: 139
Watchers: 5
Forks: 10
Open Issues: 7
Releases: 12

Created almost 6 years ago · Last pushed over 1 year ago

Metadata Files

Readme License Zenodo

nanoq

Ultra-fast quality control and summary reports for nanopore reads

Overview

v0.10.0

Purpose
Install
Usage
Benchmarks
Dependencies
Etymology
Contributions

Purpose

Nanoq implements ultra-fast read filters and summary reports for high-throughput nanopore reads.

Citation

We would appreciate a citation if you are using nanoq for research. Please see here for some suggestions on how you could give back to the community if you are using nanoq for industry applications :pray:

Steinig and Coin (2022). Nanoq: ultra-fast quality control for nanopore reads. Journal of Open Source Software, 7(69), 2991, https://doi.org/10.21105/joss.02991

Performance

See data in the benchmarks section:

nanoq is as fast as seqtk-fqchk for summary statistics of small datasets and slightly faster on large datasets (~1.3x-1.5x).
In fast mode (-f, no quality scores), nanoq is faster than rust-bio-tools and seqkit stats for summary statistics (~2-3x) and much faster than the slowest tools in each benchmark set, NanoStat and NanoFilt (~297x-442x).
Memory consumption is consistent and tends to be lower than that of other tools (~5-10x).

Tests

Nanoq comes with high test coverage for your peace of mind.

cargo test

Install

`Cargo`

cargo install nanoq

`Conda`

conda install -c conda-forge -c bioconda nanoq

`Binaries`

Precompiled binaries for Linux and MacOS are attached to the latest release.

```
VERSION=0.10.0
RELEASE=nanoq-${VERSION}-x86_64-unknown-linux-musl.tar.gz

wget https://github.com/esteinig/nanoq/releases/download/${VERSION}/${RELEASE}
tar xf nanoq-${VERSION}-x86_64-unknown-linux-musl.tar.gz

nanoq-${VERSION}-x86_64-unknown-linux-musl/nanoq -h
```

Usage

Nanoq accepts a file (-i) or stream (stdin) of reads in fast{a,q}.{gz,bz2,xz} format and outputs reads to file (-o) or stream (stdout).

```bash
nanoq -i test.fq.gz -o reads.fq
cat test.fq.gz | nanoq > reads.fq
```

Read filters

Reads can be filtered by minimum read length (-l), maximum read length (-m), minimum average read quality (-q) or maximum average read quality (-w).

```bash
nanoq -i test.fq -l 1000 -m 10000 -q 10 -w 15 > reads.fq
```
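To make the filter semantics concrete, here is a minimal Python sketch of how length and quality filters of this kind are commonly applied. It is not nanoq's implementation: the function names are hypothetical, averaging error probabilities (rather than raw Phred scores) is an assumption about how the mean read quality is derived, and treating a maximum of 0 as "no upper bound" is inferred from the documented defaults.

```python
import math


def mean_read_quality(quality_string: str, offset: int = 33) -> float:
    """Average Phred quality of one read, computed via the mean error probability.

    Assumption: qualities are Phred+33 encoded and averaged as error
    probabilities, a common convention in nanopore QC tools.
    """
    if not quality_string:
        return float("nan")
    error_probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def passes_filters(length, mean_q, min_len=0, max_len=0, min_qual=0.0, max_qual=0.0):
    """Return True if a read survives the four filters (-l, -m, -q, -w).

    Assumption: a maximum of 0 (the documented default) disables that filter,
    and all bounds are inclusive.
    """
    if length < min_len:
        return False
    if max_len > 0 and length > max_len:
        return False
    if mean_q < min_qual:
        return False
    if max_qual > 0 and mean_q > max_qual:
        return False
    return True
```

With the thresholds from the command above, `passes_filters(5000, 12.0, 1000, 10000, 10, 15)` keeps the read, while a 500 bp read would be dropped by the minimum length filter.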

Read trimming

A fixed number of bases can be trimmed from the start (-S) or end (-E) of reads:

```bash
nanoq -i test.fq -S 100 -E 100 > reads.fq
```
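Fixed-length trimming amounts to slicing the sequence and its quality string with the same bounds. The sketch below illustrates the idea in Python; `trim_read` is a hypothetical helper, and dropping reads that would be left empty is an assumption, not documented nanoq behaviour.

```python
def trim_read(seq, qual, trim_start=0, trim_end=0):
    """Trim a fixed number of bases from both ends of a read.

    Returns the trimmed (sequence, quality) pair, or None when trimming
    would leave nothing (assumed here to mean the read is discarded).
    """
    if trim_start + trim_end >= len(seq):
        return None
    end = len(seq) - trim_end
    # Sequence and quality are sliced identically so they stay in sync
    return seq[trim_start:end], qual[trim_start:end]
```

For example, `trim_read("ACGTACGT", "IIIIHHHH", 2, 3)` keeps the three bases between the trimmed ends.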

Read report

Read summaries are produced when using the stats flag (-s, report to stdout, no read output to stdout) or when specifying a report file (-r):

```bash
nanoq -i test.fq -s
nanoq -i test.fq -r report.txt > reads.fq
```

For report types and configuration see the output section.

Fast mode

:warning: When using fast mode -f read quality scores are not computed (output of quality fields: NaN)

Read qualities may be excluded from filters and statistics to speed up read iteration (-f).

```bash
nanoq -i test.fq.gz -f -s
```

Compression

Output compression is inferred from file extensions (gz, bz2, lzma).

```bash
nanoq -i test.fq -o reads.fq.gz
```

Output compression can be specified manually with -O and -c.

```bash
nanoq -i test.fq -O g -c 9 > reads.fq.gz
```
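The extension-based inference described above can be pictured as a small lookup from file suffix to the letter codes accepted by -O. The following Python sketch is an illustration only (nanoq itself delegates compression to niffler); the function names are hypothetical and the standard-library writers stand in for the real encoders.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# File extension -> letter code as listed for -O in the parameter overview
EXTENSION_TO_TYPE = {".gz": "g", ".bz2": "b", ".lzma": "l"}


def infer_output_type(path: str) -> str:
    """Return the -O style code for a path, 'u' (uncompressed) if unknown."""
    return EXTENSION_TO_TYPE.get(Path(path).suffix.lower(), "u")


def open_output(path: str, level: int = 6):
    """Open a text writer matching the inferred compression (default level 6)."""
    kind = infer_output_type(path)
    if kind == "g":
        return gzip.open(path, "wt", compresslevel=level)
    if kind == "b":
        return bz2.open(path, "wt", compresslevel=level)
    if kind == "l":
        return lzma.open(path, "wt", format=lzma.FORMAT_ALONE, preset=level)
    return open(path, "w")
```

Writing to stdout carries no extension, which is why the second command above has to state the output type and level explicitly.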

Online runs

Nanoq can be used to check on active sequencing runs and barcoded samples.

```bash
find /data/nanopore/run -name "*.fastq" -print0 | xargs -0 cat | nanoq -s
```

```bash
for i in {01..12}; do
  find /data/nanopore/run -name barcode${i}.fastq -print0 | xargs -0 cat | nanoq -s
done
```

Parameters

```
nanoq 0.10.0

Filters and summary reports for nanopore reads

USAGE:
    nanoq [FLAGS] [OPTIONS]

FLAGS:
    -f, --fast       Ignore quality values if present
    -h, --help       Prints help information
    -H, --header     Header for summary output
    -j, --json       Summary report in JSON format
    -s, --stats      Summary report only [stdout]
    -V, --version    Prints version information
    -v, --verbose    Verbose output statistics [multiple, up to -vvv]

OPTIONS:
    -c, --compress-level <1-9>    Compression level to use if compressing output [default: 6]
    -i, --input                   Fast{a,q}.{gz,xz,bz}, stdin if not present
    -m, --max-len                 Maximum read length filter (bp) [default: 0]
    -w, --max-qual                Maximum average read quality filter (Q) [default: 0]
    -l, --min-len                 Minimum read length filter (bp) [default: 0]
    -q, --min-qual                Minimum average read quality filter (Q) [default: 0]
    -o, --output                  Output filepath, stdout if not present
    -O, --output-type             u: uncompressed; b: Bzip2; g: Gzip; l: Lzma
    -r, --report                  Summary read statistics report output file
    -t, --top                     Number of top reads in verbose summary [default: 5]
    -L, --read-lengths            Output read lengths of surviving reads to file
    -Q, --read-qualities          Output read qualities of surviving reads to file
    -S, --trim-start              Trim bases from the start of each read [default: 0]
    -E, --trim-end                Trim bases from the end of each read [default: 0]
```

Output

Read lengths and qualities

Files with read lengths (--read-lengths/-L) and qualities (--read-qualities/-Q) of the surviving reads can be output:

nanoq -i test.fq -Q rq.txt -L rl.txt > reads.fq

:warning: Length and quality outputs are meant for quick plotting of distributions. Because of dubious internal design decisions (my bad) outputs are ordered with an unstable sorting function, which means the order of identical values may change between outputs. Furthermore, output order does not correspond to read output order - this will change in the next release as outlined in this issue
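As an example of what can be done with the read length output, the sketch below loads a length file and computes the N50. It assumes the file holds one integer per line, which is not stated above; `load_lengths` and `n50` are hypothetical helper names.

```python
def load_lengths(path):
    """Read one integer read length per line (assumed file layout)."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no reads
```

Because the statistic only depends on the sorted values, the unstable output order mentioned above does not affect the result.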

Summary reports

Summary reports are output to file explicitly using --report/-r:

```bash
nanoq -i test.fq -r report.txt > reads.fq
nanoq -i test.fq -r report.txt -s
```

When using the --stats/-s flag read output is suppressed and summary is directed to stdout:

```bash
nanoq -i test.fq -s > report.txt
```

Report format is minimal by default:

```bash
100000 400398234 5154 44888 5 4003 3256 8.90 9.49
```

number of reads
number of base pairs
N50 read length
longest read
shortest read
mean read length
median read length
mean read quality
median read quality
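Since the minimal report is a single whitespace-separated line, it is easy to consume from scripts. Here is a small Python sketch that maps the nine fields listed above onto named attributes. The field names are hypothetical labels chosen for this example, and parsing the length fields as integers follows the sample line rather than a documented guarantee.

```python
from typing import NamedTuple


class ReadSummary(NamedTuple):
    reads: int
    bases: int
    n50: int
    longest: int
    shortest: int
    mean_length: int
    median_length: int
    mean_quality: float
    median_quality: float


def parse_minimal_report(line: str) -> ReadSummary:
    """Parse one line of the default (headerless) summary report."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, found {len(fields)}")
    counts = [int(value) for value in fields[:7]]
    # float() also accepts the NaN values reported in fast mode
    return ReadSummary(*counts, float(fields[7]), float(fields[8]))
```

Calling `parse_minimal_report("100000 400398234 5154 44888 5 4003 3256 8.90 9.49")` yields a record whose `n50` is 5154 and whose `mean_quality` is 8.9.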

A machine readable header can be added using the -H flag:

```bash
nanoq -i test.fq -s -H
```

Extended summaries analogous to NanoStat can be obtained using multiple -v flags (up to -vvv), including the top (-t) read lengths and qualities:

-v - verbose read summary (top block as below)
-vv - like -v with read length and/or quality thresholds
-vvv - like -vv with top ranking read lengths and/or qualities

```bash
nanoq -i test.fq -f -s -t 5 -vvv
```

```
Nanoq Read Summary

Number of reads:      100000
Number of bases:      400398234
N50 read length:      5154
Longest read:         44888
Shortest read:        5
Mean read length:     4003
Median read length:   3256
Mean read quality:    NaN
Median read quality:  NaN

Read length thresholds (bp)

200         99104         99.1%
500         96406         96.4%
1000        90837         90.8%
2000        73579         73.6%
5000        25515         25.5%
10000       4987          05.0%
30000       47            00.0%
50000       0             00.0%
100000      0             00.0%
1000000     0             00.0%
```

Benchmarks

Benchmarks evaluate processing speed and memory consumption of a basic read length filter and summary statistics on the even Zymo mock community (GridION) with comparisons to rust-bio-tools, seqtk fqchk, seqkit stats, NanoFilt, NanoStat and Filtlong. Time to completion and maximum memory consumption were measured using /usr/bin/time -f "%e %M". Speedup is relative to the slowest command in the set. We note that summary statistics from rust-bio-tools and seqkit stats do not compute read quality scores and are therefore comparable to nanoq-fast.
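The derived columns in the result tables follow directly from the measured times. As a rough illustration, the Python sketch below computes reads per second and the speedup relative to the slowest command; the helper names are hypothetical, and figures recomputed from the rounded table values may differ in the last digit from those reported.

```python
def reads_per_second(n_reads: int, seconds: float) -> float:
    """Throughput of one command."""
    return n_reads / seconds


def speedups(timings: dict) -> dict:
    """Speedup of every command relative to the slowest one in the set."""
    slowest = max(timings.values())
    return {name: slowest / seconds for name, seconds in timings.items()}


# Mean seconds for the stats task on zymo.full.fq (3,491,078 reads)
timings = {"nanostat": 1260.0, "nanoq": 94.51, "nanoq-fast": 2.85}
for name, factor in speedups(timings).items():
    rate = reads_per_second(3_491_078, timings[name])
    print(f"{name}: {rate:,.0f} reads/sec, {factor:.1f}x")
```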

Tasks:

stats: basic read set summaries
filter: minimum read length filter (into /dev/null)

Tools:

rust-bio-tools 0.28.0
nanostat 1.5.0
nanofilt 2.8.0
filtlong 0.2.1
seqtk 1.3-r126
seqkit 2.0.0
nanoq 0.8.2

Commands used for stats task:

nanostat (fq + fq.gz) --> NanoStat --fastq test.fq --threads 1
rust-bio (fq) --> rbt sequence-stats --fastq < test.fq
rust-bio (fq.gz) --> zcat test.fq.gz | rbt sequence-stats --fastq
seqtk-fqchk (fq + fq.gz) --> seqtk fqchk
seqkit stats (fq + fq.gz) --> seqkit stats -j1
nanoq (fq + fq.gz) --> nanoq --input test.fq --stats
nanoq-fast (fq + fq.gz) --> nanoq --input test.fq --stats --fast

Commands used for filter task:

filtlong (fq + fq.gz) --> filtlong --min_length 5000 test.fq > /dev/null
nanofilt (fq) --> NanoFilt --fastq test.fq --length 5000 > /dev/null
nanofilt (fq.gz) --> gunzip -c test.fq.gz | NanoFilt --length 5000 > /dev/null
nanoq (fq + fq.gz) --> nanoq --input test.fq --min-len 5000 > /dev/null
nanoq-fast (fq + fq.gz) --> nanoq --input test.fq --min-len 5000 --fast > /dev/null

Files:

zymo.fq: uncompressed (100,000 reads, ~400 Mbp)
zymo.fq.gz: compressed (100,000 reads, ~400 Mbp)
zymo.full.fq: uncompressed (3,491,078 reads, ~14 Gbp)

Data preparation:

```bash
wget "https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz"
zcat Zymo-GridION-EVEN-BB-SN.fq.gz > zymo.full.fq
head -400000 zymo.full.fq > zymo.fq && gzip -k zymo.fq
```

Elapsed real time and maximum resident set size:

```bash
/usr/bin/time -f "%e %M"
```

Task and command execution:

Commands were run in replicates of 10 with a mounted benchmark data volume in the provided Docker container. An additional cold start iteration for each command was not considered in the final benchmarks.

```bash
for i in {1..11}; do
  for f in /data/*.fq; do
    /usr/bin/time -f "%e %M" nanoq -f -s -i $f 2> benchmark
    tail -1 benchmark >> nanoq_stat_fq
  done
done
```
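The collected measurements can then be condensed into the mean and standard deviation columns of the result tables. The sketch below shows one way to do this in Python under several assumptions: each line holds elapsed seconds and maximum resident set size in kilobytes (as GNU time reports for `%e %M`), the lines belong to a single input file, the first line is the discarded cold start, kilobytes are converted with a factor of 1024, and the sample standard deviation is used. `summarise_runs` is a hypothetical helper.

```python
import statistics


def summarise_runs(lines, cold_starts=1):
    """Mean and standard deviation of time and memory, skipping cold starts."""
    runs = [line.split() for line in lines if line.strip()][cold_starts:]
    seconds = [float(elapsed) for elapsed, _ in runs]
    megabytes = [int(max_rss_kb) / 1024 for _, max_rss_kb in runs]
    return {
        "sec_mean": statistics.mean(seconds),
        "sec_sd": statistics.stdev(seconds),
        "mb_mean": statistics.mean(megabytes),
        "mb_sd": statistics.stdev(megabytes),
    }
```

Note that the loop above appends results for every file in /data to the same output, so the lines would need to be grouped per input file before being passed to a function like this.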

Benchmark results

Nanoq benchmarks on 3.5 million reads of the Zymo mock community (10 replicates)

`stats` + `zymo.full.fq`

| command         | mb (sd)          | sec (sd)           | reads / sec     | speedup  | quality scores |
| ----------------|------------------|--------------------|-----------------|----------|----------------|
| nanostat        | 741.4 (0.09)     | 1260. (13.9)       | 2,770           | 01.00 x  | true           |
| seqtk-fqchk     | 103.8 (0.04)     | 125.9 (0.15)       | 27,729          | 10.01 x  | true           |
| seqkit-stats    | 18.68 (3.15)     | 125.3 (0.91)       | 27,861          | 10.05 x  | false          |
| nanoq           | 35.83 (0.06)     | 94.51 (0.43)       | 36,938          | 13.34 x  | true           |
| rust-bio        | 43.20 (0.08)     | 06.54 (0.05)       | 533,803         | 192.7 x  | false          |
| nanoq-fast      | 22.18 (0.07)     | 02.85 (0.02)       | 1,224,939       | 442.1 x  | false          |

`filter` + `zymo.full.fq`

| command         | mb (sd)           | sec (sd)           | reads / sec     | speedup  |
| ----------------|-------------------|--------------------|-----------------|----------|
| nanofilt        | 67.47 (0.13)      | 1160. (20.2)       | 3,009           | 01.00 x  |
| filtlong        | 1516. (5.98)      | 420.6 (4.53)       | 8,360           | 02.78 x  |
| nanoq           | 11.93 (0.06)      | 94.93 (0.45)       | 36,775          | 12.22 x  |
| nanoq-fast      | 08.05 (0.05)      | 03.90 (0.30)       | 895,148         | 297.5 x  |

Nanoq benchmarks on 100,000 reads of the Zymo mock community (10 replicates)

`stats` + `zymo.fq`

| command         | mb (sd)          | sec (sd)           | reads / sec     | speedup  | quality scores |
| ----------------|------------------|--------------------|-----------------|----------|----------------|
| nanostat        | 79.64 (0.14)     | 36.22 (0.27)       | 2,760           | 01.00 x  | true           |
| nanoq           | 04.26 (0.09)     | 02.69 (0.02)       | 37,147          | 13.46 x  | true           |
| seqtk-fqchk     | 53.01 (0.05)     | 02.28 (0.06)       | 43,859          | 15.89 x  | true           |
| seqkit-stats    | 17.07 (3.03)     | 00.13 (0.00)       | 100,000         | 36.23 x  | false          |
| rust-bio        | 16.61 (0.08)     | 00.22 (0.00)       | 100,000         | 36.23 x  | false          |
| nanoq-fast      | 03.81 (0.05)     | 00.08 (0.00)       | 100,000         | 36.23 x  | false          |

`stats` + `zymo.fq.gz`

| command         | mb (sd)          | sec (sd)           | reads / sec     | speedup  | quality scores |
| ----------------|------------------|--------------------|-----------------|----------|----------------|
| nanostat        | 79.46 (0.22)     | 40.98 (0.31)       | 2,440           | 01.00 x  | true           |
| nanoq           | 04.44 (0.09)     | 05.74 (0.04)       | 17,421          | 07.14 x  | true           |
| seqtk-fqchk     | 53.11 (0.05)     | 05.70 (0.08)       | 17,543          | 07.18 x  | true           |
| rust-bio        | 01.59 (0.06)     | 05.06 (0.04)       | 19,762          | 08.09 x  | false          |
| seqkit-stats    | 20.54 (0.41)     | 04.85 (0.02)       | 20,619          | 08.45 x  | false          |
| nanoq-fast      | 03.95 (0.07)     | 03.15 (0.02)       | 31,746          | 13.01 x  | false          |

`filter` + `zymo.fq`

| command         | mb (sd)           | sec (sd)           | reads / sec     | speedup  |
| ----------------|-------------------|--------------------|-----------------|----------|
| nanofilt        | 66.29 (0.15)      | 33.01 (0.24)       | 3,029           | 01.00 x  |
| filtlong        | 274.5 (0.04)      | 08.49 (0.01)       | 11,778          | 03.89 x  |
| nanoq           | 03.61 (0.04)      | 02.81 (0.28)       | 35,587          | 11.75 x  |
| nanoq-fast      | 03.26 (0.06)      | 00.12 (0.01)       | 100,000         | 33.01 x  |

`filter` + `zymo.fq.gz`

| command         | mb (sd)           | sec (sd)           | reads / sec     | speedup  |
| ----------------|-------------------|--------------------|-----------------|----------|
| nanofilt        | 01.57 (0.07)      | 33.48 (0.35)       | 2,986           | 01.00 x  |
| filtlong        | 274.2 (0.04)      | 16.45 (0.09)       | 6,079           | 02.04 x  |
| nanoq           | 03.68 (0.06)      | 05.77 (0.04)       | 17,331          | 05.80 x  |
| nanoq-fast      | 03.45 (0.07)      | 03.20 (0.02)       | 31,250          | 10.47 x  |

Dependencies

Nanoq uses needletail for read operations and niffler for output compression.

Etymology

Avoided name collision with nanoqc and dropped the c to arrive at nanoq [nanɔq] which coincidentally means 'polar bear' in Native American (Eskimo-Aleut, Greenlandic). If you find nanoq useful for your work consider a small donation to the Polar Bear Fund, RAVEN or Inuit Tapiriit Kanatami

Contributions

We welcome any and all suggestions or pull requests. Please feel free to open an issue in the repository on GitHub.

Owner

Name: Eike Steinig
Login: esteinig
Kind: user
Location: Melbourne, Australia
Company: The Peter Doherty Institute for Infection and Immunity

Repositories: 12
Profile: https://github.com/esteinig

Bioinformatics | Infectious Diseases | Nanopore | Metagenomic Diagnostics | Software Development

JOSS Publication

Nanoq: ultra-fast quality control for nanopore reads

Published

January 08, 2022

DOI

10.21105/joss.02991

Volume 7, Issue 69, Page 2991

Authors

Eike Steinig

The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Australia

Lachlan Coin

The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Australia

Editor

Luiz Irber

GitHub Events

Total

Issues event: 2
Watch event: 14
Issue comment event: 1

Last Year

Issues event: 2
Watch event: 14
Issue comment event: 1

Committers

Last synced: 7 months ago

All Time

Total Commits: 715
Total Committers: 4
Avg Commits per committer: 178.75
Development Distribution Score (DDS): 0.214

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
esteinig	e**g@m**u	562
Eike Steinig	e****g	124
esteinig	e**g@u**u	28
esteinig	e**g@g**m	1

Committer Domains (Top 20 + Academic)

unimel.edu.au: 1
my.jcu.edu.au: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 43
Total pull requests: 4
Average time to close issues: 4 months
Average time to close pull requests: about 1 month
Total issue authors: 9
Total pull request authors: 2
Average comments per issue: 2.16
Average comments per pull request: 2.25
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: 2 days
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 1.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

esteinig (32)
natir (3)
Hedi65 (2)
mbhall88 (1)
cgjosephlee (1)
RommerskirchenA (1)
Tang-pro (1)
bovee (1)
krausfeldtle (1)

Pull Request Authors

esteinig (3)
druvus (1)

Top Labels

Issue Labels

enhancement (22)
next release (7)
joss (4)
bug (4)
new feature (2)
documentation (1)
rust-bio (1)
invalid (1)
question (1)

Pull Request Labels

enhancement (1)
next release (1)

Packages

Total packages: 1
Total downloads:
- cargo 15,893 total

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 13
Total maintainers: 1

crates.io: nanoq

Minimal but speedy quality control and summaries of nanopore reads

Homepage: https://github.com/esteinig/nanoq
Documentation: https://docs.rs/nanoq/
License: MIT
Latest release: 0.10.0
published almost 3 years ago

Versions: 13
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 15,893 Total

Rankings

Stargazers count: 14.6%

Forks count: 16.8%

Average: 25.0%

Dependent repos count: 29.3%

Downloads: 30.5%

Dependent packages count: 33.8%

Maintainers (1)

esteinig

Last synced: 6 months ago

Dependencies

Cargo.lock cargo

adler 1.0.2
aho-corasick 0.7.18
ansi_term 0.12.1
anyhow 1.0.57
assert_cmd 2.0.4
atty 0.2.14
autocfg 1.1.0
bgzip 0.2.1
bitflags 1.3.2
bstr 0.2.17
buf_redux 0.8.4
bytecount 0.6.2
bzip2 0.4.3
bzip2-sys 0.1.11+1.0.8
cc 1.0.73
cfg-if 1.0.0
clap 2.34.0
crc32fast 1.3.2
difference 2.0.0
difflib 0.4.0
doc-comment 0.3.3
either 1.6.1
fastrand 1.7.0
flate2 1.0.23
float-cmp 0.8.0
float_eq 0.6.1
heck 0.3.3
hermit-abi 0.1.19
indoc 1.0.4
instant 0.1.12
itertools 0.10.3
itoa 1.0.1
jobserver 0.1.24
lazy_static 1.4.0
libc 0.2.125
lzma-sys 0.1.17
memchr 2.5.0
miniz_oxide 0.5.1
needletail 0.4.1
niffler 2.4.0
normalize-line-endings 0.3.0
num-traits 0.2.15
pkg-config 0.3.25
predicates 1.0.8
predicates 2.1.1
predicates-core 1.0.3
predicates-tree 1.0.5
proc-macro-error 1.0.4
proc-macro-error-attr 1.0.4
proc-macro2 1.0.37
quote 1.0.18
redox_syscall 0.2.13
regex 1.5.5
regex-automata 0.1.10
regex-syntax 0.6.25
remove_dir_all 0.5.3
ryu 1.0.9
safemem 0.3.3
serde 1.0.137
serde_derive 1.0.137
serde_json 1.0.81
strsim 0.8.0
structopt 0.3.26
structopt-derive 0.4.18
syn 1.0.92
tempfile 3.3.0
termtree 0.2.4
textwrap 0.11.0
thiserror 1.0.31
thiserror-impl 1.0.31
unicode-segmentation 1.9.0
unicode-width 0.1.9
unicode-xid 0.2.3
unindent 0.1.8
vec_map 0.8.2
version_check 0.9.4
wait-timeout 0.2.0
winapi 0.3.9
winapi-i686-pc-windows-gnu 0.4.0
winapi-x86_64-pc-windows-gnu 0.4.0
xz2 0.1.6
zstd 0.7.0+zstd.1.4.9
zstd-safe 3.1.0+zstd.1.4.9
zstd-sys 1.5.0+zstd.1.4.9

Cargo.toml cargo

assert_cmd 2.0.1 development
predicates 1 development
tempfile 3.1.0 development
anyhow 1.0
clap 2.33.0
float_eq 0.6.1
indoc 1.0
needletail 0.4.1
niffler 2.3
serde 1.0
serde_json 1.0
structopt 0.3
thiserror 1.0

.github/workflows/release.yaml actions

actions-rs/cargo v1 composite
actions-rs/toolchain v1 composite
actions/checkout v2 composite
actions/upload-artifact master composite
softprops/action-gh-release 59c3b4891632ff9a897f99a91d7bc557467a3a22 composite

.github/workflows/rust-ci.yaml actions

actions-rs/cargo v1 composite
actions-rs/tarpaulin v0.1 composite
actions-rs/toolchain v1 composite
actions/cache v2 composite
actions/checkout v2 composite
actions/upload-artifact v1 composite
codecov/codecov-action v2 composite

Dockerfile docker

alpine latest build
