data-chunk-compaction-in-duckdb
We integrate data chunk compaction methods into DuckDB for an end-to-end benchmark.
https://github.com/yimingqiao/data-chunk-compaction-in-duckdb
Science Score: 44.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.1%) to scientific vocabulary
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Data Chunk Compaction in Vectorized Execution
This is the repository for the paper "Data Chunk Compaction in Vectorized Execution", accepted by SIGMOD'25.
The supplementary material of our paper includes three repositories:
1. Problem formalization and simulation
2. Microbenchmarks comparing various compaction strategies
3. Integration of Learning and Logical Compaction into DuckDB, evaluating end-to-end performance (this repository)
Update: The implementation of Logical Compaction has been successfully merged into DuckDB!
Chunk Compaction in DuckDB
Having shown in the microbenchmark repository that chunk compaction is important for vectorized execution, we integrate our solution into DuckDB. Our solution consists of Dynamic/Learning compaction and a compacted vectorized hash join.
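The core idea behind chunk compaction can be sketched abstractly. The following is an illustrative model, not DuckDB's actual implementation: after a selective operator (e.g. a filter or a join probe), chunks may carry far fewer valid tuples than the vector size, and merging consecutive under-filled chunks restores full vectors for downstream operators. The names `compact`, `VECTOR_SIZE`, and `threshold` are assumptions made for this sketch.

```python
VECTOR_SIZE = 2048  # DuckDB's default STANDARD_VECTOR_SIZE

def compact(chunks, threshold=VECTOR_SIZE // 2):
    """Merge consecutive under-filled chunks into (near-)full ones.

    `chunks` is a list of lists of tuples; chunks smaller than
    `threshold` are buffered and merged before being emitted.
    (Illustrative sketch only, not DuckDB's implementation.)
    """
    out, buffer = [], []
    for chunk in chunks:
        if len(chunk) >= threshold and not buffer:
            out.append(chunk)  # full enough: pass through untouched
            continue
        buffer.extend(chunk)
        while len(buffer) >= VECTOR_SIZE:
            out.append(buffer[:VECTOR_SIZE])
            buffer = buffer[VECTOR_SIZE:]
    if buffer:
        out.append(buffer)  # flush the remainder
    return out

# A selective filter leaves many tiny chunks...
tiny = [[(i,)] * 100 for i in range(50)]  # 50 chunks of 100 tuples each
compacted = compact(tiny)  # ...compaction merges them into a few full vectors
```

A "dynamic" or "learning" strategy, as described in the paper, additionally decides at runtime when such merging pays off; this sketch only shows the merging step itself.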
Important Modified Files
We modify many files of the original DuckDB, but the files around the Hash Join operator are the most important, including:
- physical_hash_join.h/cpp, which contains the hash join operator implementation.
- join_hashtable.h/cpp, which contains the hash table used by the hash join.
- physical_operator.hpp, which contains the CachingPhysicalOperator/CompactingPhysicalOperator class. This operator implements the compaction strategy used in the original DuckDB.
- data_chunk.hpp, which contains the design of the data chunk.
- profiler.hpp, which contains several profilers that we use to record the chunk number and chunk size.
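The caching-operator pattern mentioned above can be modeled in a few lines. This is an illustrative sketch, not the CachingPhysicalOperator code itself; the class name `CachingOperator` and its methods are assumptions made for this example. The idea: small result chunks are buffered inside the operator and only emitted downstream once enough tuples have accumulated (or when the operator finalizes).

```python
VECTOR_SIZE = 2048  # assumed to match DuckDB's STANDARD_VECTOR_SIZE

class CachingOperator:
    """Illustrative model of an output-caching physical operator:
    under-filled result chunks are cached and only emitted once
    enough tuples have accumulated, or on finalize()."""

    def __init__(self, min_emit=VECTOR_SIZE // 2):
        self.min_emit = min_emit
        self.cache = []

    def execute(self, chunk):
        """Return a chunk to emit downstream, or None to keep caching."""
        self.cache.extend(chunk)
        if len(self.cache) >= self.min_emit:
            out, self.cache = self.cache[:VECTOR_SIZE], self.cache[VECTOR_SIZE:]
            return out
        return None

    def finalize(self):
        """Flush whatever is still cached at the end of the pipeline."""
        out, self.cache = self.cache, []
        return out or None

op = CachingOperator()
# Five small result chunks of 300 tuples each: the first three are
# cached; the fourth pushes the cache past min_emit and is emitted.
emitted = [c for c in (op.execute([(i,)] * 300) for i in range(5)) if c]
tail = op.finalize()
```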
In addition, we disable column compression and the perfect hash technique in our end-to-end benchmark. We use the benchmark code provided by DuckDB, but adjust the scale factors used in TPC-H and TPC-DS.
Compile and Execution
We compile and execute in the same way as the original DuckDB. Please refer to this document.
Build the benchmark
BUILD_BENCHMARK=1 BUILD_TPCH=1 BUILD_TPCDS=1 make
List all available benchmarks
build/release/benchmark/benchmark_runner --list
Run a single benchmark
build/release/benchmark/benchmark_runner 'benchmark/imdb/19d.benchmark'
The output is printed to stdout in CSV format, as follows:
name run timing
benchmark/imdb/19d.benchmark 1 2.305139
benchmark/imdb/19d.benchmark 2 2.317836
benchmark/imdb/19d.benchmark 3 2.305804
benchmark/imdb/19d.benchmark 4 2.312833
benchmark/imdb/19d.benchmark 5 2.267040
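The per-run rows above are easy to aggregate for reporting. The following is a hypothetical post-processing helper (not part of this repository) that groups the runner's output by benchmark name and computes the mean timing:

```python
import statistics

# Example output captured from benchmark_runner, as shown above.
raw = """\
benchmark/imdb/19d.benchmark 1 2.305139
benchmark/imdb/19d.benchmark 2 2.317836
benchmark/imdb/19d.benchmark 3 2.305804
benchmark/imdb/19d.benchmark 4 2.312833
benchmark/imdb/19d.benchmark 5 2.267040
"""

# Group timings by benchmark name (columns: name, run, timing).
timings = {}
for line in raw.splitlines():
    name, _run, timing = line.split()
    timings.setdefault(name, []).append(float(timing))

for name, ts in timings.items():
    print(f"{name}: mean={statistics.mean(ts):.6f}s over {len(ts)} runs")
```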
Regex
You can also use a regex to specify which benchmarks to run. Be careful of shell expansion of certain regex characters (e.g. * will likely be expanded by your shell), so quote or escape the pattern properly.
build/release/benchmark/benchmark_runner --threads=1 '(benchmark/imdb/.*)'
Run all benchmarks
Not specifying any argument will run all benchmarks.
build/release/benchmark/benchmark_runner
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Raasveldt"
    given-names: "Mark"
    orcid: "https://orcid.org/0000-0001-5005-6844"
  - family-names: "Muehleisen"
    given-names: "Hannes"
    orcid: "https://orcid.org/0000-0001-8552-0029"
title: "DuckDB"
url: "https://github.com/duckdb/duckdb"
GitHub Events
Total
- Push event: 1
- Public event: 1
Last Year
- Push event: 1
- Public event: 1