sequali

Fast sequencing data quality metrics

https://github.com/rhpvorderman/sequali

Keywords

bam fastq illumina nanopore qc quality-control


Repository

Basic Info

Host: GitHub
Owner: rhpvorderman
License: agpl-3.0
Language: C
Default Branch: develop
Homepage:
Size: 7.86 MB

Statistics

Stars: 27
Watchers: 1
Forks: 0
Open Issues: 8
Releases: 19

Created almost 3 years ago · Last pushed 9 months ago

Metadata Files

Readme Changelog License Citation

README.rst

.. |python-version-shield| image:: https://img.shields.io/pypi/v/sequali.svg
  :target: https://pypi.org/project/sequali/
  :alt:

.. |conda-version-shield| image:: https://img.shields.io/conda/v/bioconda/sequali.svg
  :target: https://bioconda.github.io/recipes/sequali/README.html
  :alt:

.. |python-install-version-shield| image:: https://img.shields.io/pypi/pyversions/sequali.svg
  :target: https://pypi.org/project/sequali/
  :alt:

.. |license-shield| image:: https://img.shields.io/pypi/l/sequali.svg
  :target: https://github.com/rhpvorderman/sequali/blob/main/LICENSE
  :alt:

.. |docs-shield| image:: https://readthedocs.org/projects/sequali/badge/?version=latest
  :target: https://sequali.readthedocs.io/en/latest/?badge=latest
  :alt:

.. |coverage-shield| image:: https://codecov.io/gh/rhpvorderman/sequali/graph/badge.svg?token=MSR1A6BEGC
  :target: https://codecov.io/gh/rhpvorderman/sequali
  :alt:

.. |zenodo-shield| image:: ./docs/_static/images/doi_image.svg
  :target: https://doi.org/10.1093/bioadv/vbaf010
  :alt:

|python-version-shield| |conda-version-shield| |python-install-version-shield|
|license-shield| |docs-shield| |coverage-shield| |zenodo-shield|

========
Sequali
========

.. introduction start

Sequence quality metrics for FASTQ and uBAM files.

Features:

+ `MultiQC <https://github.com/multiqc/multiqc>`_ support since MultiQC version 1.22.
+ Low memory footprint, small install size and fast execution times.

  + Sequali typically needs less than 2 GB of memory and 3-30 minutes runtime
    when run on 2 cores (the default).
+ Informative graphs that allow for judging the quality of a sequence at
  a quick glance.
+ Overrepresentation analysis using 21 bp sequence fragments. Overrepresented
  sequences are checked against the NCBI univec database.
+ Estimate duplication rate using a fingerprint subsampling technique which is
  also used in filesystem duplication estimation.
+ Checks for 6 Illumina adapter sequences and 17 nanopore adapter sequences
  for single read data.
+ Determines adapters by overlap analysis for paired read data.
+ Insert size metrics for paired read data.
+ Per tile quality plots for Illumina reads.
+ Channel and other plots for nanopore reads.
+ FASTQ and unaligned BAM are supported. See "Supported formats".
+ Reproducible reports without timestamps.
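
The fingerprint subsampling idea mentioned in the feature list can be
illustrated with a small Python sketch. This is a simplified illustration under
stated assumptions, not Sequali's actual implementation: it hashes whole
sequences with ``hashlib.blake2b`` instead of using Sequali's fingerprints, and
the function name ``estimate_unique_fraction`` and the ``max_entries``
parameter are hypothetical.

.. code-block:: python

    import hashlib


    def estimate_unique_fraction(sequences, max_entries=1000):
        """Estimate the fraction of unique sequences by hash subsampling.

        Hypothetical helper: only sequences whose hash has its lowest
        ``bits`` bits set to zero are counted. When the table grows past
        ``max_entries``, ``bits`` is raised and entries that no longer
        qualify are dropped, so memory use stays bounded.
        """
        counts = {}
        bits = 0
        for sequence in sequences:
            digest = hashlib.blake2b(
                sequence.encode("ascii"), digest_size=8).digest()
            hash_value = int.from_bytes(digest, "little")
            if hash_value & ((1 << bits) - 1):
                continue  # Not part of the current subsample.
            counts[hash_value] = counts.get(hash_value, 0) + 1
            while len(counts) > max_entries:
                # Table is full: halve the sampled hash space.
                bits += 1
                mask = (1 << bits) - 1
                counts = {h: c for h, c in counts.items() if not h & mask}
        sampled_reads = sum(counts.values())
        if sampled_reads == 0:
            return None  # Nothing was sampled; no estimate possible.
        return len(counts) / sampled_reads

Every copy of a sequence hashes to the same value, so the counts stay exact
for the sequences that remain in the table. One minus the returned fraction
gives a rough duplication estimate.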

Example reports:

+ ``GM24385_1.fastq.gz``;
  HG002 (Genome In A Bottle) on ultra-long Nanopore Sequencing. ENA accession:
  ERR3988483.
+ ``GM24385_1_cut.fastq.gz``;
  ``GM24385_1.fastq.gz`` processed with cutadapt:
  ``cutadapt -o GM24385_1_cut.fastq.gz --cut -64 --cut 64 --minimum-length 500 -Z --max-aer 0.1 GM24385_1.fastq.gz``.
  In the resulting file, 64 bp are cut off from both ends of each read, after
  which reads are filtered for a minimum length of 500 and a maximum average
  error rate of 0.1.
+ ``21C125_R1.fastq.gz``;
  Illumina NovaSeq X paired-end sequencing of *Campylobacter jejuni*. ENA accession:
  ERR11204024.

.. introduction end

For more information check `the documentation <https://sequali.readthedocs.io>`_.

Supported formats
=================

.. formats start

- FASTQ. Only the Sanger variation with a phred offset of 33 and the error rate
  calculation of 10 ^ (-phred/10) is supported. All sequencers use this
  format today.

  - Paired end sequencing data is supported.
  - For sequences called by Illumina base callers, an additional plot with the
    per tile quality will be provided.
  - For sequences called by guppy, additional plots for nanopore specific
    data will be provided.
- (unaligned) BAM with single reads. Read-pair information is currently ignored.

  - For BAM data as delivered by dorado additional nanopore plots will be
    provided.
  - For aligned BAM files, secondary and supplementary reads are ignored,
    similar to how ``samtools fastq`` handles the data.

.. formats end
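
The phred offset and the error rate calculation described under "Supported
formats" can be written out in a few lines of Python. This is a minimal
sketch for illustration; the function names are hypothetical and are not part
of Sequali's API.

.. code-block:: python

    def phred_to_error_rate(phred):
        """Convert a phred score to an error probability: 10 ^ (-phred / 10)."""
        return 10 ** (-phred / 10)


    def quality_string_to_phreds(qualities, offset=33):
        """Decode a Sanger-style FASTQ quality string (phred offset 33)."""
        return [ord(character) - offset for character in qualities]


    def mean_error_rate(qualities):
        """Average the error rates (not the phred scores) of one read."""
        phreds = quality_string_to_phreds(qualities)
        return sum(phred_to_error_rate(p) for p in phreds) / len(phreds)

For example, the quality character ``5`` has ASCII code 53, which decodes to a
phred score of 20 and an error rate of 0.01.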

Installation
============

.. installation start

Installation via pip is available with::

    pip install sequali

Sequali is also distributed via bioconda. It can be installed with::

    conda install -c conda-forge -c bioconda sequali

.. installation end

Quickstart
==========

.. quickstart start

.. code-block::

    sequali path/to/my.fastq.gz

This will create an HTML report ``my.fastq.gz.html`` and a JSON file
``my.fastq.gz.json`` in the current working directory.

To set the directory where the reports are created, the ``--outdir`` flag can
be used. This is useful when using `MultiQC <https://github.com/multiqc/multiqc>`_.

.. code-block::

    sequali --outdir /my/dir/all_sequali_reports my.fastq.gz

The html and json filenames can be set separately.

.. code-block::

    sequali --html before_qc.html --json before_qc.json my.fastq.gz
    sequali --html after_qc.html --json after_qc.json my.cutadapt.fastq.gz
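
When scripting around Sequali, it can help to know in advance where the
reports will end up. The sketch below mirrors the default naming described in
this quickstart (input file name plus ``.html`` and ``.json``). The helper
names are hypothetical, and it is an assumption that the same default names
are used inside the directory given with ``--outdir``. No particular structure
of the JSON content is assumed.

.. code-block:: python

    import json
    from pathlib import Path


    def default_report_paths(input_path, outdir="."):
        """Return the expected (html, json) report paths for one input file."""
        name = Path(input_path).name
        directory = Path(outdir)
        return directory / f"{name}.html", directory / f"{name}.json"


    def load_report(json_path):
        """Load a JSON report as plain Python objects."""
        with open(json_path, "rt", encoding="utf-8") as handle:
            return json.load(handle)

For ``path/to/my.fastq.gz`` and the default output directory this yields
``my.fastq.gz.html`` and ``my.fastq.gz.json``.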

Sequali can handle paired-end data.

.. code-block::

    sequali /sequencing_data/sample100_R1.fastq.gz /sequencing_data/sample100_R2.fastq.gz

Additionally sequali can handle BAM data. Proper pair handling is not yet supported for
BAM data, so this is primarily useful for ONT datasets.

.. code-block::

    sequali /sequencing_data/sample100_dorado_called_hac_v4.30.bam

Sequali by default uses one thread per compressed input file and one thread for
the read processing, typically keeping two cores busy. Sequali can also use a single
core, which is slower but typically more efficient in HPC scenarios where
multiple files can be processed simultaneously. A SLURM example:

.. code-block::

    sbatch -c 1 --time 59 --partition short \
    --wrap 'sequali --threads 1 /cluster-scratch/myusername/my.fastq.gz'

Using a thread count higher than ``2`` has no effect. Because decompression is
the bottleneck, full multithreading would bring limited benefit to Sequali at
a disproportionately high cost in additional code complexity.
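
Outside a scheduler, the same single-threaded approach can be sketched in
Python by running several Sequali processes side by side. This is only a
sketch: the helper names and the ``max_parallel`` parameter are hypothetical,
and it uses only the ``--threads`` and ``--outdir`` flags mentioned in this
quickstart.

.. code-block:: python

    import subprocess
    from concurrent.futures import ThreadPoolExecutor


    def build_command(input_path, outdir):
        """Build a single-threaded sequali command line for one input file."""
        return ["sequali", "--threads", "1",
                "--outdir", str(outdir), str(input_path)]


    def run_all(input_paths, outdir, max_parallel=4):
        """Run sequali on many files, at most ``max_parallel`` at a time."""
        commands = [build_command(path, outdir) for path in input_paths]
        with ThreadPoolExecutor(max_workers=max_parallel) as executor:
            # Each worker thread only waits for its own sequali process.
            results = list(executor.map(
                lambda command: subprocess.run(command, check=False),
                commands))
        return [result.returncode for result in results]

The function returns one exit code per input file, so failed runs can be
retried or reported afterwards.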

.. quickstart end

For all command line options, check out the usage documentation.

For more extensive information about the module options, check the
documentation on the module options.

Acknowledgements
================

.. acknowledgements start

+ FastQC for
  its excellent selection of relevant metrics. For this reason these metrics
  are also gathered by Sequali.
+ The matplotlib team for their excellent work on colormaps. Their work was
  an inspiration for how to present the data and their RdBu colormap is used
  to represent quality score data. Check their writings on colormaps for
  a good introduction.
+ Wouter de Coster for his excellent post on how to correctly average phred
  scores as well as the idea for using end-anchored plots from NanoQC.
+ Marcel Martin for providing very extensive feedback.
+ Agnès Barnabé for creating a Galaxy wrapper.

.. acknowledgements end

Citation
========
.. citation start

If you wish to credit Sequali please cite `the Sequali article
<https://doi.org/10.1093/bioadv/vbaf010>`_.

.. citation end

License
=======

.. license start

This project is licensed under the GNU Affero General Public License v3, mainly
to prevent commercial parties from using it without notifying users that they
can run it themselves. If you want to include code from Sequali in your
open source project, but its license is not compatible with the AGPL, please
contact me and we can discuss a separate license.

.. license end

Owner

Name: Ruben Vorderman
Login: rhpvorderman
Kind: user
Company: @LUMC

Repositories: 14
Profile: https://github.com/rhpvorderman

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Sequali
message: >-
  If you use this software, please cite the article.
type: software
authors:
  - given-names: Ruben Harmen Paul
    family-names: Vorderman
    email: r.h.p.vorderman@lumc.nl
    affiliation: Leids Universitair Medisch Centrum
    orcid: 'https://orcid.org/0000-0002-8813-1528'
doi: 10.1093/bioadv/vbaf010
repository-code: 'https://github.com/rhpvorderman/sequali'
abstract: >-
  Sequali is a QC tool that generates useful graphs for both short and
  long-read data. This includes adapter contamination searching and making an
  estimate of the amount of duplication.
keywords:
  - QC
  - uBAM
  - FASTQ
  - illumina
  - nanopore
license: AGPL-3.0-or-later

GitHub Events

Total

Create event: 26
Release event: 1
Issues event: 34
Watch event: 14
Delete event: 31
Issue comment event: 33
Push event: 111
Pull request event: 32

Last Year

Create event: 26
Release event: 1
Issues event: 34
Watch event: 14
Delete event: 31
Issue comment event: 33
Push event: 111
Pull request event: 32

Committers

Last synced: 8 months ago

All Time

Total Commits: 1,194
Total Committers: 1
Avg Commits per committer: 1,194.0
Development Distribution Score (DDS): 0.0

Past Year

Commits: 110
Committers: 1
Avg Commits per committer: 110.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Ruben Vorderman	r**n@l**l	1,194

Committer Domains (Top 20 + Academic)

lumc.nl: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 79
Total pull requests: 184
Average time to close issues: about 1 month
Average time to close pull requests: about 3 hours
Total issue authors: 10
Total pull request authors: 1
Average comments per issue: 1.16
Average comments per pull request: 0.22
Merged pull requests: 177
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 25
Pull requests: 30
Average time to close issues: 29 days
Average time to close pull requests: about 3 hours
Issue authors: 6
Pull request authors: 1
Average comments per issue: 1.16
Average comments per pull request: 0.13
Merged pull requests: 29
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

rhpvorderman (60)
marcelm (9)
Sebastien-Raguideau (2)
agnesbrnb (2)
OZTaekOppa (1)
bagnacan (1)
SergeWielhouwer (1)
s-j-mc (1)
dudududu12138 (1)
Redmar-van-den-Berg (1)

Pull Request Authors

rhpvorderman (223)

Top Labels

Issue Labels

help wanted (1)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 2,939 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 19
Total maintainers: 1

pypi.org: sequali

Sequali is a QC tool that generates useful graphs for both short and long-read data.

Homepage: https://github.com/rhpvorderman/sequali
Documentation: https://sequali.readthedocs.io
License: AGPL-v3.0
Latest release: 1.0.1
published 9 months ago

Versions: 19
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 2,939 Last month

Rankings

Dependent packages count: 9.9%

Forks count: 29.9%

Average: 36.6%

Stargazers count: 38.9%

Dependent repos count: 67.8%

Maintainers (1)

rhpvorderman

Last synced: 6 months ago

Science Score: 57.0%
