indexed-bzip2

Fast parallel random access to bzip2 and gzip files in Python

https://github.com/mxmlnkn/indexed_bzip2

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.0%) to scientific vocabulary

Keywords

bzip2 cli command-line command-line-tool cpp cpp17-library decompression gzip library parallel python python-library random-access

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: mxmlnkn
License: apache-2.0
Language: C++
Default Branch: master
Homepage:
Size: 32 MB

Statistics

Stars: 80
Watchers: 5
Forks: 6
Open Issues: 5
Releases: 1

Created about 6 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog License Citation

![](https://raw.githubusercontent.com/mxmlnkn/indexed_bzip2/master/results/librapidarchive.svg)

# Parallel Random Access to bzip2, gzip (and hopefully more in the future)

[![License](https://img.shields.io/badge/license-MIT-blue.svg)](http://opensource.org/licenses/MIT) [![C++ Code Checks](https://github.com/mxmlnkn/indexed_bzip2/actions/workflows/test-cpp.yml/badge.svg)](https://github.com/mxmlnkn/indexed_bzip2/actions/workflows/test-cpp.yml) [![codecov](https://codecov.io/gh/mxmlnkn/indexed_bzip2/branch/master/graph/badge.svg?token=94ZD4UTZQW)](https://codecov.io/gh/mxmlnkn/indexed_bzip2) ![C++17](https://img.shields.io/badge/C++-17-blue.svg) [![Discord](https://img.shields.io/discord/783411320354766878?label=discord)](https://discord.gg/Wra6t6akh2) [![Telegram](https://img.shields.io/badge/Chat-Telegram-%2330A3E6)](https://t.me/joinchat/FUdXxkXIv6c4Ib8bgaSxNg)

This repository contains the code for the indexed_bzip2 and rapidgzip Python modules. Both are built upon the same basic architecture to enable block-parallel decoding based on prefetching and caching.
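The idea of block-parallel decoding with a prefetcher and a cache can be illustrated with a toy model. The following is only a sketch under strong assumptions, not the architecture of this repository: every block is an independently compressed bz2 stream, and the names `ToyBlockReader`, `read_block`, and `cache_size` are hypothetical. Finding block boundaries inside a real bzip2 or gzip file is the hard part and is not shown here.

```python
import bz2
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class ToyBlockReader:
    """Hypothetical toy reader: each block is an independent bz2 stream."""

    def __init__(self, compressed_blocks, parallelization=4, cache_size=8):
        self._blocks = compressed_blocks
        self._prefetch_count = parallelization
        self._cache_size = cache_size
        self._pool = ThreadPoolExecutor(max_workers=parallelization)
        # Maps block index -> Future producing the decoded bytes,
        # ordered from least to most recently used.
        self._cache = OrderedDict()

    def _submit(self, index):
        # Schedule decoding in the background unless the block is already cached.
        if index not in self._cache:
            self._cache[index] = self._pool.submit(bz2.decompress, self._blocks[index])
        self._cache.move_to_end(index)
        return self._cache[index]

    def read_block(self, index):
        future = self._submit(index)
        # Prefetch the blocks a sequential reader would request next.
        last = min(index + 1 + self._prefetch_count, len(self._blocks))
        for ahead in range(index + 1, last):
            self._submit(ahead)
        # Evict least recently used entries; `future` stays valid because
        # this function still holds a reference to it.
        while len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)
        return future.result()

    def close(self):
        self._pool.shutdown(wait=True)


# Usage: sequential reads mostly hit blocks that were prefetched earlier.
blocks = [bz2.compress(bytes([i]) * 100_000) for i in range(16)]
reader = ToyBlockReader(blocks, parallelization=4)
data = b"".join(reader.read_block(i) for i in range(len(blocks)))
reader.close()
```

In this sketch, a random access to any block index also triggers prefetching of the following blocks, so a seek followed by sequential reading benefits from the same mechanism.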

# rapidgzip

[![Changelog](https://img.shields.io/badge/Changelog-Markdown-blue)](https://github.com/mxmlnkn/rapidgzip/blob/main/CHANGELOG.md) [![PyPI version](https://badge.fury.io/py/rapidgzip.svg)](https://badge.fury.io/py/rapidgzip) [![Python Version](https://img.shields.io/pypi/pyversions/rapidgzip)](https://pypi.org/project/rapidgzip/) [![PyPI Platforms](https://img.shields.io/badge/pypi-linux%20%7C%20macOS%20%7C%20Windows-brightgreen)](https://pypi.org/project/rapidgzip/) [![Downloads](https://static.pepy.tech/badge/rapidgzip/month)](https://pepy.tech/project/rapidgzip)

![](https://raw.githubusercontent.com/mxmlnkn/indexed_bzip2/master/results/asciinema/rapidgzip-comparison.gif)

This module provides:

- a rapidgzip command line tool for parallel decompression of gzip files with a similar command line interface to gzip so that it can be used as a replacement.
- a rapidgzip.open Python method for reading and seeking inside gzip files using multiple threads for a speedup of 21 over the built-in gzip module using a 12-core processor.

The random seeking support is similar to the one provided by indexed_gzip, and the parallel capabilities are effectively a working version of pugz, which is only a concept and only works with a limited subset of file contents, namely non-binary (ASCII characters 0 to 127) compressed files.
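A minimal usage sketch for seeking and reading, assuming the rapidgzip package is installed and that `rapidgzip.open` accepts a `parallelization` argument. The helper name `read_at` is hypothetical. If the import fails, the sketch falls back to the single-threaded built-in gzip module so that it stays runnable.

```python
import gzip
import os

try:
    import rapidgzip  # third-party; assumed to be installed from PyPI
except ImportError:
    rapidgzip = None


def read_at(path, offset, size):
    """Return `size` decompressed bytes starting at decompressed `offset`."""
    if rapidgzip is not None:
        # Use all available cores; fall back to 1 if the count is unknown.
        opened = rapidgzip.open(path, parallelization=os.cpu_count() or 1)
    else:
        opened = gzip.open(path, "rb")  # single-threaded fallback
    with opened as file:
        file.seek(offset)
        return file.read(size)
```

Both code paths expose the same file-like interface (`seek`, `read`), which is why the fallback needs no further changes.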

| Module                              | Bandwidth / (MB/s) | Speedup |
|-------------------------------------|--------------------|---------|
| gzip                                | 250                | 1       |
| rapidgzip with parallelization = 1  | 488                | 1.9     |
| rapidgzip with parallelization = 2  | 902                | 3.6     |
| rapidgzip with parallelization = 12 | 4463               | 17.7    |
| rapidgzip with parallelization = 24 | 5240               | 20.8    |
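The speedups in the table are relative to the gzip module. As a small worked example, the scaling relative to single-threaded rapidgzip can be derived from the same bandwidth figures. The function name `scaling` is hypothetical, and the numbers are copied from the table above.

```python
# Bandwidths in MB/s from the table above, keyed by the parallelization argument.
bandwidth = {1: 488, 2: 902, 12: 4463, 24: 5240}


def scaling(bandwidths):
    """Return {threads: (speedup over 1 thread, parallel efficiency)}."""
    base = bandwidths[1]
    return {
        threads: (value / base, value / base / threads)
        for threads, value in bandwidths.items()
    }


for threads, (speedup, efficiency) in scaling(bandwidth).items():
    print(f"{threads:2d} threads: {speedup:5.2f}x over 1 thread, efficiency {efficiency:.2f}")
```

With these figures, 12 threads give roughly 9.15x the single-threaded rapidgzip bandwidth (efficiency about 0.76), while 24 threads give about 10.74x (efficiency about 0.45), which is plausible given that the benchmark used a 12-core processor.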

See here for the dedicated repository and ReadMe.

A paper describing the implementation details and showing the scaling behavior with up to 128 cores has been accepted in ACM HPDC'23, The 32nd International Symposium on High-Performance Parallel and Distributed Computing. If you use this software for your scientific publication, please cite it as stated here. The author's version can be found here and the accompanying presentation here.

# indexed_bzip2

[![Changelog](https://img.shields.io/badge/Changelog-Markdown-blue)](https://github.com/mxmlnkn/indexed_bzip2/blob/master/python/indexed_bzip2/CHANGELOG.md) [![PyPI version](https://badge.fury.io/py/indexed-bzip2.svg)](https://badge.fury.io/py/indexed-bzip2) [![Python Version](https://img.shields.io/pypi/pyversions/indexed_bzip2)](https://pypi.org/project/indexed-bzip2/) [![PyPI Platforms](https://img.shields.io/badge/pypi-linux%20%7C%20macOS%20%7C%20Windows-brightgreen)](https://pypi.org/project/indexed-bzip2/) [![Downloads](https://static.pepy.tech/badge/indexed-bzip2/month)](https://pepy.tech/project/indexed-bzip2) [![Conda Platforms](https://img.shields.io/conda/v/conda-forge/indexed_bzip2?color=brightgreen)](https://anaconda.org/conda-forge/indexed_bzip2) [![Conda Platforms](https://img.shields.io/conda/pn/conda-forge/indexed_bzip2?color=brightgreen)](https://anaconda.org/conda-forge/indexed_bzip2)

This module provides:

- an ibzip2 command line tool to decompress bzip2 files in parallel with a similar command line interface to bzip2 so that it can be used as a replacement.
- an ibzip2.open Python method for reading and seeking inside bzip2 files using multiple threads for a speedup of 6 over the built-in bz2 module using a 12-core processor.

The parallel decompression capabilities are similar to lbzip2 but with a more permissive license and with support to be used as a library with random seeking capabilities similar to seek-bzip2.

| Module                                  | Runtime / s | Bandwidth / (MB/s) | Speedup |
|-----------------------------------------|-------------|--------------------|---------|
| bz2                                     | 386         | 5.2                | 1       |
| indexed_bzip2 with parallelization = 1  | 472         | 4.2                | 0.8     |
| indexed_bzip2 with parallelization = 2  | 265         | 7.6                | 1.5     |
| indexed_bzip2 with parallelization = 12 | 64          | 31.4               | 6.1     |
| indexed_bzip2 with parallelization = 24 | 63          | 31.8               | 6.1     |
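A comparison like the one in the table could be reproduced along the following lines. This is a hedged sketch, not the benchmark script used for the numbers above: the helper names `measure_bandwidth` and `compare` are hypothetical, and it assumes the indexed_bzip2 module is installed and that `indexed_bzip2.open` accepts a `parallelization` argument. Without the module, only the built-in bz2 baseline is measured.

```python
import bz2
import os
import time

try:
    import indexed_bzip2  # third-party; assumed to be installed (PyPI: indexed-bzip2)
except ImportError:
    indexed_bzip2 = None


def measure_bandwidth(open_file, chunk_size=4 * 1024 * 1024):
    """Read everything from open_file() and return (decompressed bytes, MB/s)."""
    start = time.perf_counter()
    total = 0
    with open_file() as file:
        while chunk := file.read(chunk_size):
            total += len(chunk)
    # Guard against a zero duration for tiny inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / 1e6 / elapsed


def compare(path):
    """Measure the bz2 baseline and, if available, indexed_bzip2 on the same file."""
    results = {"bz2": measure_bandwidth(lambda: bz2.open(path, "rb"))}
    if indexed_bzip2 is not None:
        results["indexed_bzip2"] = measure_bandwidth(
            lambda: indexed_bzip2.open(path, parallelization=os.cpu_count() or 1)
        )
    return results
```

Calling `compare("example.bz2")` on a sufficiently large file returns the decompressed size and the bandwidth in MB/s for each module, so the speedup is the ratio of the two bandwidths. Small files mostly measure startup overhead and will not show meaningful scaling.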

See here for the extended Readme.

Naming

The CMake options have been prefixed with librapidarchive. This difficult decision came about because neither RAPIDGZIP_ nor IBZIP2_ would have made sense. I needed an umbrella name for both, and possibly further compression formats such as LZ4 and ZIP in the future. I aim for something akin to libarchive, but with support for parallelized decompression and constant-time seeking instead of streaming extraction because it is to be used as a backend for ratarmount.

The project started inside ratarmount as a random-seekable bzip2 backend. After troubles with compiling a Python C-extension and after noticing that this backend might also be useful on its own, I created the indexed_bzip2 repository, following the naming scheme of indexed_gzip to make it easily discoverable, e.g., in the PyPI search. After adding novel parallelized and seekable gzip decompression support and shortly before publishing the paper, I split off yet another repository and project called rapidgzip, which became better known than indexed_bzip2.

Reasons for not including rapidgzip in indexed_bzip2:

- Much more complicated build setup with rpmalloc, zlib, and ISA-L, which might fail to build on more systems than indexed_bzip2 when there are no wheels available. On the other hand, indexed_bzip2 only requires building its own C++ header-only sources. These dependencies are also the reason for failing to get it merged into Conda while Conda indexed_bzip2 exists.
- The rapidgzip Python module binary is also almost 10x larger because of large precomputed lookup tables and templating.
- Releases, especially on GitHub. Many recent changes were only for rapidgzip, not indexed_bzip2. It makes sense to have different releases for these projects and also to keep them on different GitHub release pages.
- More visibility:
  - Similar to how no one would guess that bsdtar is able to extract archives other than TAR, it makes no sense to expect something called indexed_bzip2 to also work for gzip, etc. libarchive, which provides bsdtar, makes much more sense as a name.
  - They have different ReadMe files with different usages and benchmarks. Showing these top-level in the specialized repositories is nice.

Note that the Python package rapidgzip does bundle indexed_bzip2. It can even natively open bzip2 with RapidGzipFile, but this uses a different algorithm, which is less specialized to bzip2 and therefore has more memory overhead and might be slightly slower. Until it reaches feature and performance parity, it makes sense to have two projects.

Downsides:

- I am not sure how well the rapidgzip and indexed_bzip2 Python modules work when loaded at the same time. There may be name collisions resulting in problems. It might be best to make the namespace, currently rapidgzip::, adjustable and use something else for each Python package. Currently, I am sidestepping this issue in ratarmount by including indexed_bzip2 in the rapidgzip Python package because it is trivial and low-overhead to do so. So, if you need to use both, depend on rapidgzip for now.
- Contributions and attention are split between all these projects, also resulting in confusion. I have mitigated it somewhat by adding a pull request template on the rapidgzip repository pointing to indexed_bzip2.

In the future, I will probably avoid starting new repositories and simply release specialized packages from this one, or even only alias Python packages that point to or depend on rapidgzip or a hypothetical librapidarchive.

License

Licensed under either of

Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

Owner

Name: Maximilian Knespel
Login: mxmlnkn
Kind: user
Location: Germany, Dresden

Repositories: 38
Profile: https://github.com/mxmlnkn

Citation (CITATION.cff)

# Validate with:
#   python3 -m pip install --user cffconvert
#   cffconvert --validate

cff-version: 1.2.0
title: Rapidgzip
message: >-
  If you use this software, please cite it using the metadata from this file.
type: software

authors:
  - given-names: Maximilian
    family-names: Knespel
    orcid: "https://orcid.org/0000-0001-9568-3075"
    email: maximilian.knespel@tu-dresden.de

repository-code: "https://github.com/mxmlnkn/indexed_bzip2"
repository: "https://github.com/mxmlnkn/rapidgzip"
abstract: >-
  A replacement for gzip for decompressing gzip files using
  multiple threads.
keywords:
  - Gzip
  - Decompression
  - Parallel Algorithm
  - Performance
  - Random Access
license: MIT

preferred-citation:
  type: conference-paper
  authors:
  - given-names: Maximilian
    family-names: Knespel
    orcid: "https://orcid.org/0000-0001-9568-3075"
    email: maximilian.knespel@tu-dresden.de
  - family-names: "Brunst"
    given-names: "Holger"
    orcid: "https://orcid.org/0000-0003-2224-0630"
  title: "Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching"
  year: 2023
  isbn: 9798400701559
  publisher:
    name: "Association for Computing Machinery"
    alias: "ACM"
    city: "New York"
    region: "NY"
    country: "US"
  url: "https://doi.org/10.1145/3588195.3592992"
  doi: "10.1145/3588195.3592992"
  abstract: >-
    Gzip is a file compression format, which is ubiquitously used. Although a multitude of gzip implementations exist, only pugz can fully utilize current multi-core processor architectures for decompression. Yet, pugz cannot decompress arbitrary gzip files. It requires the decompressed stream to only contain byte values 9–126. In this work, we present a generalization of the parallelization scheme used by pugz that can be reliably applied to arbitrary gzip-compressed data without compromising performance. We show that the requirements on the file contents posed by pugz can be dropped by implementing an architecture based on a cache and a parallelized prefetcher. This architecture can safely handle faulty decompression results, which can appear when threads start decompressing in the middle of a gzip file by using trial and error. Using 128 cores, our implementation reaches 8.7 GB/s decompression bandwidth for gzip-compressed base64-encoded data, a speedup of 55 over the single-threaded GNU gzip, and 5.6 GB/s for the Silesia corpus, a speedup of 33 over GNU gzip.
  collection-title: "Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing"
  start: 295
  end: 307
  pages: 13
  keywords:
  - Gzip
  - Decompression
  - Parallel Algorithm
  - Performance
  - Random Access
  conference:
    name: "The 32nd International Symposium on High-Performance Parallel and Distributed Computing"
    alias: "HPDC '23"
    city: "Orlando"
    date-start: 2023-06-20
    date-end: 2023-06-23
    region: "FL"
    country: "US"
    website: "https://www.hpdc.org/2023"
  date-published: 2023-08-07
  status: advance-online

GitHub Events

Total

Issues event: 1
Watch event: 8
Delete event: 11
Issue comment event: 24
Push event: 177
Pull request review comment event: 2
Pull request review event: 7
Pull request event: 5
Fork event: 4
Create event: 10

Last Year

Issues event: 1
Watch event: 8
Delete event: 11
Issue comment event: 24
Push event: 177
Pull request review comment event: 2
Pull request review event: 7
Pull request event: 5
Fork event: 4
Create event: 10

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 16
Total pull requests: 5
Average time to close issues: 4 months
Average time to close pull requests: 4 days
Total issue authors: 8
Total pull request authors: 3
Average comments per issue: 5.0
Average comments per pull request: 2.4
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 5
Average time to close issues: N/A
Average time to close pull requests: 4 days
Issue authors: 1
Pull request authors: 3
Average comments per issue: 0.0
Average comments per pull request: 2.4
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

mxmlnkn (5)
hyanwong (4)
ozancaglayan (1)
noonchen (1)
ap-- (1)
martinellimarco (1)
thomasj02 (1)
marksteward (1)

Pull Request Authors

amir-sabbaghi (2)
cal-pratt (2)
WeGoToMars (1)

Packages

Total packages: 1
Total downloads:
- pypi 9,214 last-month

Total dependent packages: 2
Total dependent repositories: 1
Total versions: 12
Total maintainers: 1

pypi.org: indexed-bzip2

Fast random access to bzip2 files

Homepage: https://github.com/mxmlnkn/indexed_bzip2
Documentation: https://indexed-bzip2.readthedocs.io/
License: MIT
Latest release: 1.7.0
published 7 months ago

Versions: 12
Dependent Packages: 2
Dependent Repositories: 1
Downloads: 9,214 Last month

Rankings

Dependent packages count: 3.1%

Downloads: 4.8%

Stargazers count: 9.4%

Average: 11.6%

Forks count: 19.1%

Dependent repos count: 21.7%

Maintainers (1)

mxmlnkn

Last synced: 6 months ago

Dependencies

.github/workflows/codeql-analysis.yml actions

actions/checkout v3 composite
github/codeql-action/analyze v2 composite
github/codeql-action/init v2 composite

.github/workflows/conda.yml actions

actions/checkout v3 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/publish.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/test-cpp.yml actions

actions/checkout v3 composite

.github/workflows/test-python.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/wheels.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/conda-rapidgzip.yml actions

actions/checkout v3 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/publish-rapidgzip.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/wheels-rapidgzip.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite
actions/upload-artifact v3 composite

python/indexed_bzip2/pyproject.toml pypi

python/indexed_bzip2/setup.py pypi

python/rapidgzip/pyproject.toml pypi

python/rapidgzip/setup.py pypi
