compression_benchmark

Benchmarking FASTQ compression with 'mature' compression algorithms

https://github.com/mbhall88/compression_benchmark

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 28 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.7%) to scientific vocabulary

Keywords

benchmark bioinformatics compression fastq

Keywords from Contributors

argument-parser
Last synced: 6 months ago

Repository

Benchmarking FASTQ compression with 'mature' compression algorithms

Basic Info
  • Host: GitHub
  • Owner: mbhall88
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 9.86 MB
Statistics
  • Stars: 41
  • Watchers: 1
  • Forks: 4
  • Open Issues: 1
  • Releases: 2
Topics
benchmark bioinformatics compression fastq
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme License Citation

README.md

FASTQ compression benchmark

Benchmarking FASTQ compression with 'mature' compression algorithms

Motivation

This benchmark is motivated by a question from Ryan Connor on the µbioinfo Slack group:

my impression is that bioinformatics really likes gzip (and only gzip?), but that there are other generic compression algs that are better (for bioinfo data types); assuming you agree (if not, why not?), why haven't the others compression types caught on in bioinformatics?

It kicked off an interesting discussion, which led me to dig into the literature and see what I could find. I'm sure I could have searched deeper and for longer, but I really couldn't find any benchmarks that satisfied me. Don't get me wrong, there are plenty of benchmarks, but they're always looking at bioinformatics-specific tools for compressing sequencing data. Sure, these perform well, but every repository I went to hadn't been touched in a while. When archiving data, the last thing I want is to try to decompress my data and find that the tool no longer installs/works on my system. In addition, I want the tool to be ubiquitous and mature. I know this is a lot of constraints, but hey, that's what I am interested in.

This benchmark only covers ubiquitous/mature/generic compression tools.

Update 02/07/2024

I have added unaligned BAM (uBAM) and CRAM (uCRAM) to the benchmark. While these aren't generated by 'general compression' algorithms, you can convert FASTQ to and from these formats with samtools, which is definitely 'mature' and isn't going to fall into a state of disrepair anytime in the foreseeable future; bioinformatics may fall over if that happens.

Methods

Tools

The tools tested in this benchmark are:

  • gzip
  • bzip2
  • xz
  • zstd
  • brotli
  • lz4
  • uBAM/uCRAM (via samtools)

Feel free to raise an issue on this repository if you would like to see another tool included.

All compression level settings were tested for each tool and default settings were used for all other options. For uBAM and uCRAM I used a pretty default samtools import pipeline, and you can see the exact commands here and here.
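
To make the level sweep concrete, here is a minimal Python sketch of the idea. It uses Python's built-in gzip module as a stand-in for the command-line tools the benchmark actually runs, and the reads.fastq file name is hypothetical.

```python
import gzip
import time
from pathlib import Path

raw = Path("reads.fastq").read_bytes()  # hypothetical input FASTQ

# Try every gzip compression level, leaving all other options at their defaults,
# and record the compressed size and wall-clock time for each level.
for level in range(1, 10):
    start = time.perf_counter()
    compressed = gzip.compress(raw, compresslevel=level)
    elapsed = time.perf_counter() - start
    print(f"level={level}\tsize={len(compressed):,} B\ttime={elapsed:.2f} s")
```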

Data

The data used to test each tool are FASTQs:

Nanopore

Illumina

Note: I couldn't find sources for all of these samples. If you can fill in some of the gaps, please raise an issue and I will gladly update the sources.

All data were downloaded with fastq-dl (v2.0.4). Paired Illumina data were combined into a single FASTQ file with seqtk mergepe.

Results

Compression ratio

The first question is how much smaller each compression tool makes a FASTQ file. As this also depends on the compression level selected, all possible levels were tested for each tool (the default is indicated with a red circle).

The compression ratio is expressed as a percentage of the original file size, i.e., $\frac{\text{compressed size}}{\text{uncompressed size}} \times 100$.
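
As a quick worked example of that formula (file names are hypothetical):

```python
from pathlib import Path

# Compression ratio as a percentage of the original (uncompressed) size.
uncompressed_size = Path("reads.fastq").stat().st_size
compressed_size = Path("reads.fastq.gz").stat().st_size

ratio = 100 * compressed_size / uncompressed_size
print(f"compression ratio: {ratio:.1f}% of the original file size")
```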


Compression ratio figure

Figure 1: Compression ratio (y-axis) for different compression tools and levels. Compression ratio is a percentage of the original file size. The red circles indicate the default compression level for each tool. Illumina data is represented with a solid line and circular points, whereas Nanopore data is a dashed line with cross points. Translucent error bands represent the 95% confidence interval.


The most striking result here is the noticeable difference in compression ratio between Illumina and Nanopore data - regardless of the compression tool used. ~~(If anyone can suggest a reason for this, please raise an issue.)~~

Update 07/06/2023: Peter Menzel mentioned this is likely due to the noisier quality scores in the Nanopore data. Illumina quality scores are generally quite homogeneous, which increases compressibility.

Using default settings, zstd and gzip provide similar ratios, as do brotli, xz and bzip2 (however, compression level doesn't seem to actually change the ratio for bzip2). uCRAM and xz provide the best compression when using the highest compression level; however, this comes at a cost to runtime as we'll see below. lz4 has the worst compression ratio, especially for Nanopore data.

(De)compression rate and memory usage

In many scenarios, the (de)compression rate is just as important as the compression ratio. However, if compressing for archival purposes, rate is probably not as important.

The compression rate is $\frac{\text{uncompressed size}}{\text{(de)compression time (s)}}$.
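
A minimal sketch of measuring the decompression rate, again using the standard-library gzip module as a stand-in and a hypothetical file name. Note that the rate is defined against the uncompressed size for both directions, so compression and decompression numbers are directly comparable.

```python
import gzip
import time
from pathlib import Path

blob = Path("reads.fastq.gz").read_bytes()  # hypothetical compressed FASTQ

start = time.perf_counter()
raw = gzip.decompress(blob)
elapsed = time.perf_counter() - start

# The rate divides the *uncompressed* size by the elapsed time, per the formula above.
print(f"decompression rate: {len(raw) / 1e6 / elapsed:.1f} MB/s")
```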


Compression rate figure

Figure 2: Compression (left column) and decompression (right column) rate (top row) and peak memory usage (lower row). Note the log scale for rate. The red circles indicate the default compression level for each tool. Illumina data is represented with a solid line and circular points, whereas Nanopore data is a dashed line with cross points. Translucent error bands represent the 95% confidence interval.


As alluded to earlier, xz and brotli, though not so much uCRAM, pay for their fantastic compression ratios by being orders of magnitude slower than the other tools at compressing (using the default compression level). uCRAM and uBAM use more memory than the other tools - although in absolute terms, the highest memory usage is still well below 2GB. This is due to the samtools sort option -M, which clusters unaligned reads by minimizer (and improves compression). If 2GB of memory is an issue for you, this step can be excluded (with some loss in compression), or the memory usage can be capped with the -m option.
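
For illustration only, here is a rough Python sketch of such a FASTQ-to-uBAM pipeline. It approximates, rather than reproduces, the commands used in the benchmark, and the file names and the 768M memory cap are made up.

```python
import subprocess

# Convert a (hypothetical) FASTQ into unaligned BAM, clustering reads by
# minimizer during the sort to improve compressibility.
importer = subprocess.Popen(
    ["samtools", "import", "reads.fastq"],  # emits unaligned records on stdout
    stdout=subprocess.PIPE,
)
subprocess.run(
    [
        "samtools", "sort",
        "-M",           # cluster unmapped reads by minimizer (better compression)
        "-m", "768M",   # cap memory per sorting thread (drop -M entirely to skip clustering)
        "-O", "bam",
        "-o", "reads.ubam",
    ],
    stdin=importer.stdout,
    check=True,
)
importer.stdout.close()
importer.wait()
```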

The main takeaway from Figure 2 is that zstd and lz4 (de)compress much faster than the other tools (using the default level). Compression level seems to have a big impact on compression rate (except for bzip2), but not so much on decompression.

Rate vs. Ratio

Cornelius Roemer suggested plotting rate against ratio in order to get a Pareto frontier. These plots are good for getting a quick sense of which algorithms are best suited to a specific use case. The lower right corner is the 'magic zone', where an algorithm compresses both fast (high rate) and small (low ratio). In Figure 3 we see that the compression version of this plot is a little messy, as the compression rate is quite variable. However, uBAM, gzip, and zstd do tend to have more points in the lower-ish right, with a smattering of brotli and (Illumina) lz4 points - though there are also a number of brotli and lz4 points on the left - and lz4 points up the top. The decompression plot is a lot clearer and we get nice 'fronts'. From this it is clear that lz4, zstd, brotli, and uBAM give fast decompression even with good compression ratios.

Pareto frontier figure

Figure 3: Compression (top row) and decompression (lower row) rate (x-axis) against compression ratio (y-axis). Note the log scale for rate. Illumina data is represented with circular points and Nanopore data with cross points.

Conclusion

So which tool should you use? As is so often the case with benchmarks: it depends on your situation.

If all you care about is compressing your data as small as it will go, and you don't mind how long it takes, then uCRAM, xz (compression level 9), or brotli (level 11 - the default) are the obvious choices. However, if you're planning a really good one-off compression but expect to decompress regularly, uCRAM is probably the better option.

If you want fast (de)compression, then zstd is the best option - using default options - followed closely by uBAM. lz4 is also great for fast (de)compression, but the compression ratios are not great. And a special mention should also go to brotli for decompression rates.

If, like most people, you're contemplating replacing gzip (default options), uBAM or uCRAM seem like pretty convincing options. uCRAM will give ~8% better compression ratios, but at roughly half the (de)compression rate. Another option is zstd (default options), which will give you about the same compression ratio as gzip with ~10-fold faster compression and ~3-5-fold faster decompression.

One final consideration is APIs for various programming languages. If it is difficult to read/write files that are compressed with a given algorithm, then using that compression type might cause problems. Most (good) bioinformatics tools support gzip-compressed input and output. However, support for other compression types shouldn't be too much work for most software tool developers provided a well-maintained and documented API is available in the relevant programming language. Here is a list of APIs for the tested compression tools in a selection of programming languages with an arbitrary grading system for how "stable" I think they are (feel free to put in a pull request if you want to contribute other languages).

|        | gzip | bzip2 | xz | zstd | brotli | uBAM/uCRAM | lz4  |
| ------ | ---- | ----- | -- | ---- | ------ | ---------- | ---- |
| Python | A    | A     | A  | B+   | A      | B          | B    |
| Rust   | A    | B+    | B+ | B    | B+     | B 1,2      | B    |
| C/C++  | A    | A     | A  | A    | A      | A          | A    |
| Julia  | A    | A     | A  | A    | NA     | help       | help |
| Go     | A    | A     | B  | B    | A      | help       | B+   |

  • A: standard library (i.e. builtin) or library is maintained by the original developer (note: Rust's gzip library is maintained by rust-lang itself)
  • B: external library that is actively maintained, well-documented, and has quick response times
  • help: I am not at all familiar with these languages, so if someone could suggest a rating here that would be great
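
To put the Python row above in concrete terms: gzip, bzip2 and xz support ships in the standard library, while zstd, brotli and lz4 are typically accessed through third-party packages (for example zstandard, Brotli and lz4 on PyPI). Below is a minimal sketch of transparently reading FASTQ files compressed with the standard-library formats; the file name is hypothetical.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# gzip, bz2 and lzma (xz) ship with Python itself - the "A" grades above.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_fastq(path):
    """Open a FASTQ file transparently, whichever stdlib compression it uses."""
    path = Path(path)
    opener = OPENERS.get(path.suffix, open)
    return opener(path, "rt")  # text mode, so callers just iterate over lines


# Hypothetical usage: count the reads in a gzip-compressed FASTQ.
with open_fastq("reads.fastq.gz") as handle:
    print(sum(1 for _ in handle) // 4, "reads")
```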

Owner

  • Name: Michael Hall
  • Login: mbhall88
  • Kind: user
  • Location: Sunshine Coast, Australia
  • Company: University of Queensland | UQCCR

Postdoc @ University of Queensland with @LeahRoberts Bioinformatics | Nanopore | Microbial Genomics | Software Dev.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  Benchmarking fastq compression with generic (mature)
  compression algorithms
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Michael
    name-particle: B
    family-names: Hall
    orcid: https://orcid.org/0000-0003-3683-6208
    email: michael.hall2@unimelb.edu.au
    affiliation: >-
      Department of Microbiology and Immunology, The Peter
      Doherty Institute for Infection and Immunity, The
      University of Melbourne
identifiers:
  - type: doi
    value: 10.5281/zenodo.8008877
    description: Zenodo archive DOI
repository-code: 'https://github.com/mbhall88/compression_benchmark'
url: 'https://github.com/mbhall88/compression_benchmark'
keywords:
  - compression
  - benchmark
  - bioinformatics
  - fastq
license: MIT
commit: b2fe03049769ed7f799c94af6f86b4d30e965ebb
date-released: '2020-06-06'

GitHub Events

Total
  • Watch event: 6
Last Year
  • Watch event: 6

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 43
  • Total Committers: 3
  • Avg Commits per committer: 14.333
  • Development Distribution Score (DDS): 0.047
Past Year
  • Commits: 9
  • Committers: 1
  • Avg Commits per committer: 9.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Michael Hall m****l@m****h 41
Cornelius Roemer c****r@g****m 1
Nick Minor 7****r 1
Committer Domains (Top 20 + Academic)
mbh.sh: 1

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 5
  • Total pull requests: 3
  • Average time to close issues: 2 months
  • Average time to close pull requests: 1 day
  • Total issue authors: 4
  • Total pull request authors: 3
  • Average comments per issue: 2.8
  • Average comments per pull request: 1.33
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 1
  • Average time to close issues: about 23 hours
  • Average time to close pull requests: 4 minutes
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 4.0
  • Average comments per pull request: 2.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • corneliusroemer (2)
  • mbhall88 (1)
  • darked89 (1)
  • jsgounot (1)
Pull Request Authors
  • nrminor (1)
  • corneliusroemer (1)
  • Wytamma (1)
Top Labels
Issue Labels
Pull Request Labels