fast-multi-join-sketch

Fast Cardinality Estimation of Multi-Join Queries Using Sketches

https://github.com/mikeheddes/fast-multi-join-sketch

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.9%) to scientific vocabulary

Keywords

cardinality-estimation sketching-algorithm
Last synced: 6 months ago · JSON representation ·

Repository

Fast Cardinality Estimation of Multi-Join Queries Using Sketches

Basic Info
Statistics
  • Stars: 16
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
cardinality-estimation sketching-algorithm
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries

This repository contains the source code, extended results, and cardinality estimates for the experiments of the research paper published at the International Conference on Management of Data (SIGMOD) 2024.

Requirements

The code is written in Python 3.10. The required packages to run the experiments can be found in requirements.txt. To install the required packages, run the following command:

bash pip install -r requirements.txt

In addition, the hash function needs to be compiled by following the directions in /kwisehash/README.md.

Download the data and queries

The experiments use the IMDB and STATS databases with queries provided by the End-to-End CardEst Benchmark. To run the experiments, first download the required data using the following commands:

```bash curl -L -o End-to-End-CardEst-Benchmark.zip https://github.com/Nathaniel-Han/End-to-End-CardEst-Benchmark/archive/refs/heads/master.zip unzip End-to-End-CardEst-Benchmark.zip rm End-to-End-CardEst-Benchmark.zip

curl -L -o imdb.tgz http://homepages.cwi.nl/~boncz/job/imdb.tgz mkdir imdb tar zxvf imdb.tgz -C imdb rm imdb.tgz This should result in the following file structure: End-to-End-CardEst-Benchmark-master/ imdb/ experiment.py ... ```

Experiments

Information about the accepted arguments for the experiments can be obtained using the following command:

bash python experiment.py --help

For example, the following command runs our proposed method with m=1000000 and takes the median of l=5 i.i.d. estimates:

bash python experiment.py --method count-conv --query stats-7 --bins 1000000 --medians 5

Available queries

The End-to-End CardEst Benchmark provides queries with sub-queries for the STATS and IMDB databases. The following are the available options: stats-[1-146], stats_sub-[1-2603], job_light-[1-70], and job_light_sub-[1-696], where the brackets are inclusive ranges.

Speed-up data loading

Loading the data from csv files for each experiment can incur significant overhead. To alleviate this, one can cache the loaded tables as pickle files using python cache_tables.py. After this finishes, the data loading time during the experiments should be reduced by roughly a factor of 10.

Extended results

The absolute relative error plots, in addition to the timing plots of each stage (initialization, sketching, and inference) for all 216 queries are provided in /figures.

Cardinality estimates

The cardinality estimates for all the sub-queries of both the STATS and IMDB databases are provided in /estimates, which follows the same format as the estimates provided by the End-to-End CardEst Benchmark.

Citation

If you use this code for your research, please cite our paper:

@inproceedings{heddes2024convolution, title={Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries}, author={Heddes, Mike and Nunes, Igor and Givargis, Tony and Nicolau, Alex}, booktitle={Proceedings of the 2024 ACM SIGMOD International Conference on Management of Data}, year={2024} }

Owner

  • Name: Mike Heddes
  • Login: mikeheddes
  • Kind: user
  • Location: Irvine, California

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this code for your research, please cite our paper."
authors:
- family-names: "Heddes"
  given-names: "Mike"
  orcid: "https://orcid.org/0000-0002-9276-458X"
- family-names: "Nunes"
  given-names: "Igor"
- family-names: "Givargis"
  given-names: "Tony"
- family-names: "Nicolau"
  given-names: "Alex"
title: "Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries"
url: "https://github.com/mikeheddes/fast-multi-join-sketch"
preferred-citation:
  type: conference-paper
  authors:
  - family-names: "Heddes"
    given-names: "Mike"
    orcid: "https://orcid.org/0000-0002-9276-458X"
  - family-names: "Nunes"
    given-names: "Igor"
  - family-names: "Givargis"
    given-names: "Tony"
  - family-names: "Nicolau"
    given-names: "Alex"
  title: "Convolution and Cross-Correlation of Count Sketches Enables Fast Cardinality Estimation of Multi-Join Queries"
  collection-title: "Proceedings of the 2024 ACM SIGMOD International Conference on Management of Data"
  collection-type: proceedings
  year: 2024

GitHub Events

Total
  • Watch event: 2
Last Year
  • Watch event: 2

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • numpy *
  • pandas *
  • scipy >=1.8.0
  • torch >=1.12
  • torch-hd *
  • tqdm *
  • typed-argument-parser *
kwisehash/setup.py pypi