indexed-bzip2
Fast parallel random access to bzip2 and gzip files in Python
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (13.0%) to scientific vocabulary
Repository
Statistics
- Stars: 80
- Watchers: 5
- Forks: 6
- Open Issues: 5
- Releases: 1
README.md
This repository contains the code for the indexed_bzip2 and rapidgzip Python modules.
Both are built upon the same basic architecture to enable block-parallel decoding based on prefetching and caching.
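The idea of block-parallel decoding with prefetching and caching can be illustrated with a small toy. The sketch below is not the implementation used by indexed_bzip2 or rapidgzip: it assumes the input is already split into independently compressed zlib blocks, whereas locating block boundaries inside real bzip2 or gzip files is the hard part and is not shown. The class name `PrefetchingBlockReader` and its parameters are hypothetical.

```python
import zlib
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor


class PrefetchingBlockReader:
    """Toy reader for a list of independently compressed zlib blocks."""

    def __init__(self, compressed_blocks, parallelization=4, cache_size=16):
        self._blocks = compressed_blocks
        self._prefetch_count = parallelization
        self._pool = ThreadPoolExecutor(max_workers=parallelization)
        # Maps block index -> Future; ordered from least to most recently used.
        self._cache = OrderedDict()
        # Keep room for the requested block plus everything prefetched with it.
        self._cache_size = max(cache_size, parallelization + 1)

    def _schedule(self, index):
        """Return the (possibly still running) decode task for one block."""
        if index in self._cache:
            self._cache.move_to_end(index)
            return self._cache[index]
        future = self._pool.submit(zlib.decompress, self._blocks[index])
        self._cache[index] = future
        while len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used block
        return future

    def get(self, index):
        """Return the decompressed block and prefetch the blocks after it."""
        future = self._schedule(index)
        last = min(index + self._prefetch_count, len(self._blocks) - 1)
        for following in range(index + 1, last + 1):
            self._schedule(following)
        return future.result()

    def close(self):
        self._pool.shutdown(wait=True)
```

With sequential access, each call to `get` mostly finds its block already decoded or in flight, so the worker threads stay busy while the caller consumes data in order.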
The rapidgzip module provides:
- a rapidgzip command line tool for parallel decompression of gzip files with a similar command line interface to gzip so that it can be used as a replacement.
- a rapidgzip.open Python method for reading and seeking inside gzip files using multiple threads for a speedup of 21 over the built-in gzip module using a 12-core processor.
The random seeking support is similar to the one provided by indexed_gzip, and the parallel capabilities are effectively a working version of pugz, which is only a concept and only works with a limited subset of file contents, namely non-binary (ASCII characters 0 to 127) compressed files.
| Module                              | Bandwidth / (MB/s) | Speedup |
|-------------------------------------|--------------------|---------|
| gzip                                | 250                | 1       |
| rapidgzip with parallelization = 1  | 488                | 1.9     |
| rapidgzip with parallelization = 2  | 902                | 3.6     |
| rapidgzip with parallelization = 12 | 4463               | 17.7    |
| rapidgzip with parallelization = 24 | 5240               | 20.8    |
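Reading and seeking with rapidgzip.open might look like the following minimal sketch. It assumes the rapidgzip package is installed from PyPI and that rapidgzip.open accepts a `parallelization` keyword; when the import fails, the sketch falls back to the built-in gzip module so that it still runs. The helper names `read_range` and `make_example_file` are hypothetical.

```python
import gzip
import os

try:
    import rapidgzip  # third-party package, assumed to be installed from PyPI
except ImportError:
    rapidgzip = None  # fall back to the standard library below


def read_range(path, offset, size):
    """Return `size` decompressed bytes starting at decompressed `offset`."""
    if rapidgzip is not None:
        threads = os.cpu_count() or 1
        file = rapidgzip.open(path, parallelization=threads)
    else:
        file = gzip.open(path, "rb")
    with file:
        file.seek(offset)
        return file.read(size)


def make_example_file(directory):
    """Write a small gzip file; return its path and the uncompressed payload."""
    payload = b"".join(b"line %06d\n" % i for i in range(10000))
    path = os.path.join(directory, "example.gz")
    with gzip.open(path, "wb") as file:
        file.write(payload)
    return path, payload
```

Each line of the example payload is 12 bytes long, so `read_range(path, 12 * 500, 12)` returns the line with number 500.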
See here for the dedicated repository and ReadMe.
A paper describing the implementation details and showing the scaling behavior with up to 128 cores has been accepted in ACM HPDC'23, The 32nd International Symposium on High-Performance Parallel and Distributed Computing. If you use this software for your scientific publication, please cite it as stated here. The author's version can be found here and the accompanying presentation here.
The indexed_bzip2 module provides:
- an ibzip2 command line tool to decompress bzip2 files in parallel with a similar command line interface to bzip2 so that it can be used as a replacement.
- an ibzip2.open Python method for reading and seeking inside bzip2 files using multiple threads for a speedup of 6 over the built-in bz2 module using a 12-core processor.
The parallel decompression capabilities are similar to lbzip2 but with a more permissive license and with support to be used as a library with random seeking capabilities similar to seek-bzip2.
| Module                                  | Runtime / s | Bandwidth / (MB/s) | Speedup |
|-----------------------------------------|-------------|--------------------|---------|
| bz2                                     | 386         | 5.2                | 1       |
| indexed_bzip2 with parallelization = 1  | 472         | 4.2                | 0.8     |
| indexed_bzip2 with parallelization = 2  | 265         | 7.6                | 1.5     |
| indexed_bzip2 with parallelization = 12 | 64          | 31.4               | 6.1     |
| indexed_bzip2 with parallelization = 24 | 63          | 31.8               | 6.1     |
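Random access into a bzip2 file could be sketched as below. The example assumes the indexed_bzip2 package is installed and that indexed_bzip2.open accepts a `parallelization` keyword; if the import fails, it falls back to the built-in bz2 module, which supports the same seek and read calls but decodes on a single thread. The helper names `open_bzip2` and `read_ranges` are hypothetical.

```python
import bz2
import os

try:
    import indexed_bzip2  # third-party package, assumed to be installed from PyPI
except ImportError:
    indexed_bzip2 = None  # fall back to the standard library below


def open_bzip2(path):
    """Open a bzip2 file for reading, in parallel when indexed_bzip2 is available."""
    if indexed_bzip2 is not None:
        return indexed_bzip2.open(path, parallelization=os.cpu_count() or 1)
    return bz2.open(path, "rb")


def read_ranges(path, ranges):
    """Read several (offset, size) ranges in the given order from one open file."""
    results = []
    with open_bzip2(path) as file:
        for offset, size in ranges:
            file.seek(offset)  # offsets refer to the decompressed stream
            results.append(file.read(size))
    return results
```

Keeping one file object open across all ranges lets a seekable backend reuse whatever block information it has gathered so far, instead of starting from the beginning for every range.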
See here for the extended Readme.
Naming
The CMake options have been prefixed with librapidarchive.
This difficult decision came about because neither RAPIDGZIP_ nor IBZIP2_ would have made sense.
I needed an umbrella name for both, and possibly further compression formats such as LZ4 and ZIP in the future.
I aim for something akin to libarchive, but with support for parallelized decompression and constant-time seeking instead of streaming extraction because it is to be used as a backend for ratarmount.
The project started inside ratarmount as a random-seekable bzip2 backend.
After trouble with compiling a Python C-extension and after noticing that this backend might also find use on its own, I created the indexed_bzip2 repository, following the naming scheme of indexed_gzip to make it easily discoverable, e.g., in the PyPI search.
After adding novel parallelized and seekable gzip decompression support and shortly before publishing the paper, I split off yet another repository and project called rapidgzip, which became better known than indexed_bzip2.
Reasons for not including rapidgzip in indexed_bzip2:
- Much more complicated build setup with rpmalloc, zlib, and ISA-L, which might fail to build on more systems than indexed_bzip2 when there are no wheels available. In contrast, indexed_bzip2 only requires building its own C++ header-only sources. These dependencies are also the reason rapidgzip has not been merged into Conda, while a Conda package for indexed_bzip2 exists.
- The rapidgzip Python module binary is also almost 10x larger because of large precomputed lookup tables and templating.
- Releases, especially on GitHub. Many recent changes were only for rapidgzip, not indexed_bzip2. It makes sense to have different releases for these projects and also to keep them on different GitHub release pages.
- More visibility:
  - Similar to how no one would guess that bsdtar is able to extract archives other than TAR, it makes no sense to expect something called indexed_bzip2 to also work for gzip, etc. libarchive, which provides bsdtar, makes much more sense as a name.
  - They have different ReadMe files with different usages and benchmarks. Showing these top-level in the specialized repositories is nice.
- Note that the Python package rapidgzip even bundles indexed_bzip2. It can even natively open bzip2 with RapidGzipFile, but this uses a different algorithm, which is less specialized to bzip2 and therefore has more memory overhead and might be slightly slower. Until this has feature and performance parity, it makes sense to have two projects.
Downsides:
- I am not sure how well the rapidgzip and indexed_bzip2 Python modules work when loaded at the same time. There may be name collisions resulting in problems. It might be best to make the namespace, currently rapidgzip::, adjustable and use something else for each Python package. Currently, I am sidestepping this issue in ratarmount by including indexed_bzip2 in the rapidgzip Python package because it is trivial and low-overhead to do so. So, if you need to use both, depend on rapidgzip for now.
- Contributions and attention are split between all these projects, also resulting in confusion. I have mitigated it somewhat by adding a pull request template on the rapidgzip repository pointing to indexed_bzip2.
I think, in the future, I'll avoid starting new repositories and simply release specialized packages from this one, or even only alias Python packages that point to or depend on rapidgzip or a hypothetical librapidarchive.
License
Licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.
Owner
- Name: Maximilian Knespel
- Login: mxmlnkn
- Kind: user
- Location: Germany, Dresden
- Repositories: 38
- Profile: https://github.com/mxmlnkn
Citation (CITATION.cff)
# Validate with:
#     python3 -m pip install --user cffconvert
#     cffconvert --validate
cff-version: 1.2.0
title: Rapidgzip
message: >-
  If you use this software, please cite it using the metadata from this file.
type: software
authors:
  - given-names: Maximilian
    family-names: Knespel
    orcid: "https://orcid.org/0000-0001-9568-3075"
    email: maximilian.knespel@tu-dresden.de
repository-code: "https://github.com/mxmlnkn/indexed_bzip2"
repository: "https://github.com/mxmlnkn/rapidgzip"
abstract: >-
  A replacement for gzip for decompressing gzip files using
  multiple threads.
keywords:
  - Gzip
  - Decompression
  - Parallel Algorithm
  - Performance
  - Random Access
license: MIT
preferred-citation:
  type: conference-paper
  authors:
    - given-names: Maximilian
      family-names: Knespel
      orcid: "https://orcid.org/0000-0001-9568-3075"
      email: maximilian.knespel@tu-dresden.de
    - family-names: "Brunst"
      given-names: "Holger"
      orcid: "https://orcid.org/0000-0003-2224-0630"
  title: "Rapidgzip: Parallel Decompression and Seeking in Gzip Files Using Cache Prefetching"
  year: 2023
  isbn: 9798400701559
  publisher:
    name: "Association for Computing Machinery"
    alias: "ACM"
    city: "New York"
    region: "NY"
    country: "US"
  url: "https://doi.org/10.1145/3588195.3592992"
  doi: "10.1145/3588195.3592992"
  abstract: >-
    Gzip is a file compression format, which is ubiquitously used. Although a multitude of gzip implementations exist, only pugz can fully utilize current multi-core processor architectures for decompression. Yet, pugz cannot decompress arbitrary gzip files. It requires the decompressed stream to only contain byte values 9–126. In this work, we present a generalization of the parallelization scheme used by pugz that can be reliably applied to arbitrary gzip-compressed data without compromising performance. We show that the requirements on the file contents posed by pugz can be dropped by implementing an architecture based on a cache and a parallelized prefetcher. This architecture can safely handle faulty decompression results, which can appear when threads start decompressing in the middle of a gzip file by using trial and error. Using 128 cores, our implementation reaches 8.7 GB/s decompression bandwidth for gzip-compressed base64-encoded data, a speedup of 55 over the single-threaded GNU gzip, and 5.6 GB/s for the Silesia corpus, a speedup of 33 over GNU gzip.
  collection-title: "Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing"
  start: 295
  end: 307
  pages: 13
  keywords:
    - Gzip
    - Decompression
    - Parallel Algorithm
    - Performance
    - Random Access
  conference:
    name: "The 32nd International Symposium on High-Performance Parallel and Distributed Computing"
    alias: "HPDC '23"
    city: "Orlando"
    date-start: 2023-06-20
    date-end: 2023-06-23
    region: "FL"
    country: "US"
    website: "https://www.hpdc.org/2023"
  date-published: 2023-08-07
  status: advance-online
GitHub Events
Total
- Issues event: 1
- Watch event: 8
- Delete event: 11
- Issue comment event: 24
- Push event: 177
- Pull request review comment event: 2
- Pull request review event: 7
- Pull request event: 5
- Fork event: 4
- Create event: 10
Last Year
- Issues event: 1
- Watch event: 8
- Delete event: 11
- Issue comment event: 24
- Push event: 177
- Pull request review comment event: 2
- Pull request review event: 7
- Pull request event: 5
- Fork event: 4
- Create event: 10
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 16
- Total pull requests: 5
- Average time to close issues: 4 months
- Average time to close pull requests: 4 days
- Total issue authors: 8
- Total pull request authors: 3
- Average comments per issue: 5.0
- Average comments per pull request: 2.4
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: 4 days
- Issue authors: 1
- Pull request authors: 3
- Average comments per issue: 0.0
- Average comments per pull request: 2.4
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mxmlnkn (5)
- hyanwong (4)
- ozancaglayan (1)
- noonchen (1)
- ap-- (1)
- martinellimarco (1)
- thomasj02 (1)
- marksteward (1)
Pull Request Authors
- amir-sabbaghi (2)
- cal-pratt (2)
- WeGoToMars (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 9,214 last month
- Total dependent packages: 2
- Total dependent repositories: 1
- Total versions: 12
- Total maintainers: 1
pypi.org: indexed-bzip2
Fast random access to bzip2 files
- Homepage: https://github.com/mxmlnkn/indexed_bzip2
- Documentation: https://indexed-bzip2.readthedocs.io/
- License: MIT
- Latest release: 1.7.0 (published 5 months ago)
Dependencies
- actions/checkout v3 composite
- github/codeql-action/analyze v2 composite
- github/codeql-action/init v2 composite
- actions/checkout v3 composite
- conda-incubator/setup-miniconda v2 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- conda-incubator/setup-miniconda v2 composite
- actions/checkout v3 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/checkout v3 composite
- conda-incubator/setup-miniconda v2 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- conda-incubator/setup-miniconda v2 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite