hep-iris-benchmark-scripts

HEP IRIS benchmark scripts

https://github.com/rumbledb/hep-iris-benchmark-scripts

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 6 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
✓ Committers with academic emails: 2 of 2 committers (100.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: RumbleDB
Language: Python
Default Branch: master
Size: 4.26 MB

Statistics

Stars: 2
Watchers: 3
Forks: 1
Open Issues: 0
Releases: 6

Created about 5 years ago · Last pushed over 2 years ago

Benchmark Scripts for Evaluating Query Languages and Systems for High-Energy Physics Data

This repository contains benchmark scripts for running the implementations of High-Energy Physics (HEP) analysis queries from the IRIS HEP benchmark on various general-purpose query processing systems. The results have been published in the following paper:

Dan Graur, Ingo Müller, Mason Proffitt, Ghislain Fourny, Gordon T. Watts, Gustavo Alonso. Evaluating Query Languages and Systems for High-Energy Physics Data. In: PVLDB 15(2), 2022. DOI: 10.14778/3489496.3489498.

Please cite both the paper and the software when citing in academic contexts.

Overview of the repository

This repository contains the scripts for producing the datasets, the scripts for running the experiments, and the scripts for plotting the results used in the paper mentioned above.

We recommend getting started with the scripts in the following order:

1. Get individual queries to run with the systems you are interested in, using the small sample datasets provided for each system.

   For that purpose, look at the general instructions in the experiments folder as well as the system-specific instructions in the subfolders of the respective systems.
2. Generate the full datasets as described in the datasets folder and upload them to cloud storage and/or load them as per the system-specific instructions.
3. Run the actual experiments using the system-specific scripts from the subfolders of the respective systems.

   Running all experiments takes several days and costs at least several hundred dollars of cloud credits, so it is probably a good idea to start with a small subset of the experiments, then extend it as you gain experience and confidence.
4. Re-generate the plots with the scripts in the plots folder.

   We provide the data we used for the plots in the original paper, but you can also copy over your own measurement data and plot that.

Owner

Name: RumbleDB
Login: RumbleDB
Kind: organization
Location: Zurich, Switzerland

Website: http://rumbledb.org/
Twitter: db_rumble
Repositories: 13
Profile: https://github.com/RumbleDB

Query your large messy datasets, no matter where they are.

Citation (CITATION.cff)

# YAML 1.2
---
cff-version: 1.2.0
title: Benchmark Scripts for "Evaluating Query Languages and Systems for High-Energy Physics Data"
message: |
    This repository hosts the experiment scripts used for the following paper. Please cite both, the software and the paper, when citing in academic contexts.
    
    Dan Graur, Ingo Müller, Mason Proffitt, Ghislain Fourny, Gordon T. Watts, Gustavo Alonso. "Evaluating Query Languages and Systems for High-Energy Physics Data." In: PVLDB 15(2), 2022. DOI: 10.14778/3489496.3489498.
type: software
repository-code: "https://github.com/RumbleDB/hep-iris-benchmark-scripts"
authors:
  - given-names: Dan
    family-names: Graur
    email: dan.graur@inf.ethz.ch
    affiliation: "ETH Zurich"
  - given-names: Ingo
    family-names: "Müller"
    email: ingo.mueller@inf.ethz.ch
    affiliation: "ETH Zurich"
    orcid: "https://orcid.org/0000-0001-8818-8324"
  - given-names: Mason
    family-names: Proffitt
    email: masonLp@uw.edu
    affiliation: "University of Washington"
    orcid: "https://orcid.org/0000-0001-8740-8866"
  - given-names: Ghislain
    family-names: Fourny
    email: ghislain.fourny@inf.ethz.ch
    affiliation: "ETH Zurich"
    orcid: "https://orcid.org/0000-0001-8740-8866"
  - given-names: "Gordon T."
    family-names: Watts
    email: gwatts@uw.edu
    affiliation: "University of Washington"
  - given-names: Gustavo
    family-names: Alonso
    email: alonso@inf.ethz.ch
    affiliation: "ETH Zurich"
identifiers:
  - description: The scripts used for the experiments in the paper.
    type: doi
    value: "10.5281/zenodo.5569049"
  - description: The paper describing the results of the experiments.
    type: doi
    value: "10.14778/3489496.3489498"
...
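
Since the README asks for both the paper and the software to be cited, the following is a minimal sketch of how the metadata above could be turned into two BibTeX entries. The author names, title, DOIs, venue, and repository URL are copied from the CITATION.cff; the helper `bibtex_entry`, the entry types, and the citation keys `graur2022hep` and `graur_hep_scripts` are hypothetical choices made for illustration and are not part of the repository.

```python
# Author names, title, DOIs, and URL are copied from the CITATION.cff above.
# The helper function, entry types, and citation keys are illustrative assumptions.
authors = [
    ("Dan", "Graur"),
    ("Ingo", "Müller"),
    ("Mason", "Proffitt"),
    ("Ghislain", "Fourny"),
    ("Gordon T.", "Watts"),
    ("Gustavo", "Alonso"),
]
title = "Evaluating Query Languages and Systems for High-Energy Physics Data"


def bibtex_entry(entry_type, key, fields):
    """Format a dict of fields as a single BibTeX entry."""
    lines = [f"@{entry_type}{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields.items()]
    lines.append("}")
    return "\n".join(lines)


# BibTeX expects authors as "Family, Given" joined by " and ".
author_field = " and ".join(f"{family}, {given}" for given, family in authors)

paper = bibtex_entry("article", "graur2022hep", {
    "author": author_field,
    "title": title,
    "journal": "PVLDB",
    "volume": "15",
    "number": "2",
    "year": "2022",
    "doi": "10.14778/3489496.3489498",
})

software = bibtex_entry("misc", "graur_hep_scripts", {
    "author": author_field,
    "title": f'Benchmark Scripts for "{title}"',
    "doi": "10.5281/zenodo.5569049",
    "url": "https://github.com/RumbleDB/hep-iris-benchmark-scripts",
})

print(paper)
print(software)
```

The software entry omits a year because the CITATION.cff shown here does not state one.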

Committers

Last synced: 11 months ago

All Time

Total Commits: 281
Total Committers: 2
Avg Commits per committer: 140.5
Development Distribution Score (DDS): 0.31

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
Ingo Mueller	i**r@i**h	194
DanGraur	d**r@i**h	87
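
The page does not define the Development Distribution Score. Assuming it is one minus the top committer's share of all commits, the all-time figures above can be reproduced from the Top Committers table. The sketch below uses only the commit counts shown in that table; the function name `development_distribution_score` is hypothetical.

```python
# Commit counts per committer, copied from the "Top Committers" table.
commits_by_committer = {"Ingo Mueller": 194, "DanGraur": 87}


def development_distribution_score(commit_counts):
    """Return 1 - (commits by the top committer / total commits).

    This formula is an assumption about how the score is defined.
    """
    total = sum(commit_counts)
    if total == 0:
        # No commits at all, as in the "Past Year" block, which reports 0.0.
        return 0.0
    return 1 - max(commit_counts) / total


counts = list(commits_by_committer.values())
total_commits = sum(counts)
avg_commits = total_commits / len(counts)
dds = development_distribution_score(counts)

print(total_commits)   # 281
print(avg_commits)     # 140.5
print(round(dds, 2))   # 0.31
```

Under this assumption, a score near 0 means one person wrote almost all commits, while a higher score means the work is spread more evenly across committers.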

Committer Domains (Top 20 + Academic)

inf.ethz.ch: 2

Dependencies

datasets/requirements-top-level.txt pypi

awscli *
parquet-tools *

datasets/requirements.txt pypi

PyYAML ==5.4.1
awscli ==1.23.2
boto3 ==1.22.1
botocore ==1.25.2
colorama ==0.4.4
cursor ==1.3.4
docutils ==0.15.2
halo ==0.0.29
jmespath ==1.0.0
log-symbols ==0.0.14
numpy ==1.22.3
pandas ==1.4.2
parquet-tools ==0.2.10
pyarrow ==7.0.0
pyasn1 ==0.4.8
python-dateutil ==2.8.2
pytz ==2022.1
rsa ==4.7.2
s3transfer ==0.5.2
six ==1.16.0
spinners ==0.0.24
tabulate ==0.8.9
termcolor ==1.1.0
thrift ==0.13.0
urllib3 ==1.26.9

experiments/athena/requirements.txt pypi

matplotlib *
pandas *
pyathena *
pytest *

experiments/presto/requirements.txt pypi

humanfriendly *

plots/requirements-top-level.txt pypi

matplotlib *
pandas *

plots/requirements.txt pypi

Pillow ==9.1.0
cycler ==0.11.0
fonttools ==4.33.3
kiwisolver ==1.4.2
matplotlib ==3.5.1
numpy ==1.22.3
packaging ==21.3
pandas ==1.4.2
pyparsing ==3.0.8
python-dateutil ==2.8.2
pytz ==2022.1
six ==1.16.0
