lst-bench

LST-Bench is a framework that allows users to run benchmarks specifically designed for evaluating Log-Structured Tables (LSTs) such as Delta Lake, Apache Hudi, and Apache Iceberg.

https://github.com/microsoft/lst-bench

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    2 of 12 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.0%) to scientific vocabulary

Keywords from Contributors

mesh interactive
Last synced: 6 months ago

Repository

LST-Bench is a framework that allows users to run benchmarks specifically designed for evaluating Log-Structured Tables (LSTs) such as Delta Lake, Apache Hudi, and Apache Iceberg.

Basic Info
  • Host: GitHub
  • Owner: microsoft
  • License: apache-2.0
  • Language: Java
  • Default Branch: main
  • Homepage:
  • Size: 4.68 MB
Statistics
  • Stars: 79
  • Watchers: 10
  • Forks: 40
  • Open Issues: 35
  • Releases: 0
Created almost 3 years ago · Last pushed 6 months ago
Metadata Files
Readme · License · Code of conduct · Citation · Security

README.md

LST-Bench

LST-Bench is a framework that allows users to run benchmarks specifically designed for evaluating the performance, efficiency, and stability of Log-Structured Tables (LSTs), also commonly referred to as table formats, such as Delta Lake, Apache Hudi, and Apache Iceberg.

Usage Guide

How to Build

Prerequisites

Install a Java Development Kit (JDK). We recommend the OpenJDK distribution from Adoptium.

Build

To build LST-Bench in Linux/macOS, run the following command:

```bash
./mvnw package
```

Or use the following command for Windows:

```bat
mvnw.cmd package
```

To build LST-Bench for a specific database, use Maven's profile option (-P). This includes the corresponding JDBC driver in the ./target directory. Currently, the following profiles are supported: databricks-jdbc, snowflake-jdbc, spark-jdbc, spark-client, trino-jdbc, and microsoft-fabric-jdbc. For example, to build LST-Bench for open-source Spark with JDBC drivers in Linux/macOS, run the following command:

```bash
./mvnw package -Pspark-jdbc
```

Or use the following command for Windows:

```bat
mvnw.cmd package -Pspark-jdbc
```

How to Run

After building LST-Bench, run launcher.sh (Linux/macOS) or launcher.ps1 (Windows PowerShell) to display the usage options.

```bash
usage: ./launcher.sh -c <arg> -e <arg> -l <arg> -t <arg> -w <arg>
 -c,--connections-config <arg>   [required] Path to input file containing connections config details
 -e,--experiment-config <arg>    [required] Path to input file containing the experiment config details
 -l,--library <arg>              [required] Path to input file containing the library with templates
 -t,--input-log-config <arg>     [required] Path to input file containing the telemetry gathering config details
 -w,--workload <arg>             [required] Path to input file containing the workload definition
```

Configuration Files

The configuration files used in LST-Bench are YAML files.

You can find their schema, which describes the expected structure and properties, here.

NOTE: The Spark schemas are configured for Spark 3.3 or earlier. If you plan to use Spark 3.4, the setup and setupdatamaintenance tasks need to be modified to handle SPARK-44025: columns in CSV tables need to be defined as STRING instead of VARCHAR or CHAR. Append the following regex replacement to the setup and setupdatamaintenance phases in the workload file:

```yaml
replace_regex:
  - pattern: '(?i)varchar\(.*\)|char\(.*\)'
    replacement: 'string'
```
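The effect of that regex replacement can be checked in isolation. The sketch below applies the same pattern with Python's `re` module; the function name is ours for illustration, not part of LST-Bench:

```python
import re

# Same pattern as in the workload file's replace_regex rule: the (?i) inline
# flag makes the match case-insensitive, covering VARCHAR(n) and CHAR(n).
SPARK_34_TYPE_FIX = re.compile(r"(?i)varchar\(.*\)|char\(.*\)")

def fix_column_type(ddl_fragment: str) -> str:
    """Rewrite VARCHAR(n)/CHAR(n) to string, as required for Spark 3.4 CSV tables."""
    return SPARK_34_TYPE_FIX.sub("string", ddl_fragment)

print(fix_column_type("c_name VARCHAR(25)"))  # c_name string
print(fix_column_type("c_code CHAR(2)"))      # c_code string
```

Note that `.*` is greedy, so a line declaring several typed columns collapses after the first match, which is also how the rule behaves inside LST-Bench's template substitution.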

Additionally, you can find sample configurations that can serve as guidelines for creating your own configurations here. A YAML file can also contain references to environment variables along with default values, which the parser resolves appropriately. Example:

```yaml
parameter_name: ${ENVIRONMENT_VARIABLE:-default_value}
```
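For illustration only, the `${NAME:-default}` resolution can be mimicked in a few lines of Python. LST-Bench's actual parser is part of the Java application, so everything below is a hypothetical stand-in:

```python
import os
import re

# Matches ${NAME:-default}: group 1 is the variable name, group 2 the default.
ENV_PATTERN = re.compile(r"\$\{(\w+):-([^}]*)\}")

def expand(value: str) -> str:
    """Expand ${NAME:-default} using the environment, falling back to the default."""
    return ENV_PATTERN.sub(lambda m: os.environ.get(m.group(1), m.group(2)), value)

os.environ.pop("LST_WAREHOUSE", None)          # ensure the variable is unset
print(expand("${LST_WAREHOUSE:-/tmp/warehouse}"))  # /tmp/warehouse

os.environ["LST_WAREHOUSE"] = "/data/wh"       # now the env value wins
print(expand("${LST_WAREHOUSE:-/tmp/warehouse}"))  # /data/wh
```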

Architecture

The core of LST-Bench is organized into two modules:

  1. Java Application. This module is written entirely in Java and is responsible for executing SQL workloads against a system under test using JDBC. It reads input configuration files to determine the tasks, sessions, and phases to be executed. The Java application handles the execution of SQL statements and manages the interaction with the system under test.

  2. Python Metrics Module. The metrics module is written in Python and serves as the post-execution analysis component. It consolidates experimental results obtained from the Java application and computes metrics to provide insights into LSTs and cloud data warehouses. The Python module performs data processing, analysis, and visualization to facilitate a deeper understanding of the experimental results.
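As a rough sketch of the kind of consolidation the metrics module performs, the following computes total latency per phase from timing records. The tuple layout and function name are assumptions for illustration, not the module's actual schema:

```python
from collections import defaultdict

def phase_latencies(events):
    """Sum (end - start) durations per phase from (phase, start, end) tuples."""
    totals = defaultdict(float)
    for phase, start, end in events:
        totals[phase] += end - start
    return dict(totals)

# Hypothetical timings (seconds) for two phases of an experiment.
events = [("load", 0.0, 12.5), ("load", 12.5, 20.0), ("single_user", 20.0, 95.0)]
print(phase_latencies(events))  # {'load': 20.0, 'single_user': 75.0}
```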

Additionally, the Adapters module is designed to handle integration with external tools and systems by converting outputs from third-party benchmarks into formats compatible with LST-Bench. One example of this is the CAB to LST-Bench converter, which transforms the output files generated by the Cloud Analytics Benchmark (CAB) into the input format used by LST-Bench.

LST-Bench Concepts

In LST-Bench, we utilize specific concepts to define and organize SQL workloads, with a focus on maximizing flexibility and facilitating reusability across various workloads. For detailed information, refer to our documentation.

Telemetry and Metrics Processor

LST-Bench captures execution telemetry during workload execution at multiple levels, including per experiment, phase, session, task, file, and statement. Each telemetry event is recorded with an associated identifier, such as the statement's name or the phase IDs defined in the workload YAML. The event includes information on whether it succeeded or not, along with any additional associated data. Specifically, each event includes a start time, end time, event ID, event type, status, and optional payload.
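The event fields listed above can be pictured as a record like the following. This is an illustrative Python shape, not the actual record type used by the Java application:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TelemetryEvent:
    """Illustrative telemetry event: field names here are our assumptions."""
    start_time: str
    end_time: str
    event_id: str      # e.g. a statement name or a phase ID from the workload YAML
    event_type: str    # e.g. experiment, phase, session, task, file, or statement
    status: str        # whether the event succeeded or failed
    payload: Optional[str] = None  # optional additional associated data

ev = TelemetryEvent("2024-01-01T00:00:00", "2024-01-01T00:00:05",
                    "single_user_1", "phase", "success")
print(ev.event_id, ev.status)  # single_user_1 success
```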

The telemetry registry in LST-Bench is configurable, providing flexibility for different systems and use cases. By default, LST-Bench includes an implementation for a JDBC-based registry and supports writing telemetry to DuckDB or Spark. LST-Bench writes these telemetry events into a table within the specified systems, enabling any application to consume and gain insights from the results.
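To make the "telemetry as a queryable table" idea concrete, here is a minimal sketch using SQLite from Python's standard library as a stand-in for the DuckDB or Spark targets. The table and column names are assumptions, not LST-Bench's actual schema:

```python
import sqlite3

# In-memory database standing in for the JDBC-backed telemetry registry.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiment_telemetry (
        event_start_time TEXT, event_end_time TEXT,
        event_id TEXT, event_type TEXT, event_status TEXT, event_data TEXT
    )""")

# One event per row; any application can then query the table for insights.
conn.execute(
    "INSERT INTO experiment_telemetry VALUES (?, ?, ?, ?, ?, ?)",
    ("2024-01-01T00:00:00", "2024-01-01T00:00:05",
     "single_user_1", "phase", "success", None),
)

rows = conn.execute(
    "SELECT event_id, event_status FROM experiment_telemetry").fetchall()
print(rows)  # [('single_user_1', 'success')]
```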

Alternatively, if the LST-Bench Metrics Processor is used, you can simply point it to the same database. The processor will then analyze and visualize the results, providing a streamlined solution for result analysis and visualization.

Documentation

Interested in learning more about LST-Bench? Explore the following resources:

  • From Paper to Practice: Leveraging LST-Bench to Evaluate Lake-Centric Data Platforms - Presented at the DBTest'24 Workshop, June 2024. Slides available here.
  • LST-Bench: Benchmarking Log-Structured Tables in the Cloud - In SIGMOD 2024. Read the technical report here.

If you are writing an academic paper, you can cite this work as:

```bibtex
@article{2024lstbench,
  author  = {Jes{\'u}s Camacho-Rodr{\'i}guez and Ashvin Agrawal and Anja Gruenheid and
             Ashit Gosalia and Cristian Petculescu and Josep Aguilar-Saborit and
             Avrilia Floratou and Carlo Curino and Raghu Ramakrishnan},
  title   = {LST-Bench: Benchmarking Log-Structured Tables in the Cloud},
  journal = {Proc. ACM Manag. Data},
  volume  = {2},
  number  = {1},
  year    = {2024},
  url     = {https://doi.org/10.1145/3639314}
}
```

Contributing

Here are some ways you can contribute to the LST-Bench project:

  • Submit PRs to fix bugs or add new features.
  • Review currently open PRs.
  • Provide feedback and report bugs related to the software or the documentation.
  • Enhance our design documents, examples, tutorials, and overall documentation.

To get started, please take a look at the issues and leave a comment if any of them interest you.

If you plan to make significant changes, we recommend discussing them with the LST-Bench community first. This helps ensure that your contributions align with the project's goals and avoids duplicating efforts.

Contributor License Agreement

Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Owner

  • Name: Microsoft
  • Login: microsoft
  • Kind: organization
  • Email: opensource@microsoft.com
  • Location: Redmond, WA

Open source projects and samples from Microsoft

Citation (CITATION.bib)

@article{2024lstbench,
    author = {Jes{\'u}s Camacho-Rodr{\'i}guez and Ashvin Agrawal and Anja Gruenheid and
            Ashit Gosalia and Cristian Petculescu and Josep Aguilar-Saborit and
            Avrilia Floratou and Carlo Curino and Raghu Ramakrishnan},
    title = {LST-Bench: Benchmarking Log-Structured Tables in the Cloud},
    journal = {Proc. ACM Manag. Data},
    volume = {2},
    number = {1},
    year = {2024},
    url = {https://doi.org/10.1145/3639314}
}

GitHub Events

Total
  • Watch event: 13
  • Delete event: 57
  • Issue comment event: 43
  • Push event: 57
  • Pull request review event: 14
  • Pull request event: 126
  • Fork event: 6
  • Create event: 58
Last Year
  • Watch event: 13
  • Delete event: 57
  • Issue comment event: 43
  • Push event: 57
  • Pull request review event: 14
  • Pull request event: 126
  • Fork event: 6
  • Create event: 58

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 277
  • Total Committers: 12
  • Avg Commits per committer: 23.083
  • Development Distribution Score (DDS): 0.408
Past Year
  • Commits: 73
  • Committers: 5
  • Avg Commits per committer: 14.6
  • Development Distribution Score (DDS): 0.329
Top Committers
Name Email Commits
dependabot[bot] 4****] 164
Jesús Camacho Rodríguez j****a@m****m 66
anjagruenheid 8****d 23
poojanilangekar p****n@u****u 5
Ashvin a****a 5
Microsoft Open Source m****e 4
Jose Medrano j****d@m****m 4
Ángel a****0@g****m 2
microsoft-github-operations[bot] 5****] 1
SongMinSeok 9****4 1
Justin Tay 4****y 1
Tiemo Bang t****g@b****u 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 42
  • Total pull requests: 266
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 11 days
  • Total issue authors: 10
  • Total pull request authors: 11
  • Average comments per issue: 0.38
  • Average comments per pull request: 0.58
  • Merged pull requests: 144
  • Bot issues: 1
  • Bot pull requests: 210
Past Year
  • Issues: 3
  • Pull requests: 121
  • Average time to close issues: 5 days
  • Average time to close pull requests: 17 days
  • Issue authors: 3
  • Pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.52
  • Merged pull requests: 45
  • Bot issues: 1
  • Bot pull requests: 108
Top Authors
Issue Authors
  • jcamachor (14)
  • anjagruenheid (10)
  • ashvina (7)
  • rbalamohan (2)
  • dependabot[bot] (2)
  • simcha-vos-from-tu-delft (1)
  • netonjm (1)
  • alanhal (1)
  • manucorujo (1)
  • ikyranas (1)
  • agr17 (1)
Pull Request Authors
  • dependabot[bot] (311)
  • jcamachor (36)
  • anjagruenheid (19)
  • ashvina (7)
  • poojanilangekar (5)
  • netonjm (4)
  • agr17 (3)
  • tiemobang (2)
  • justin-tay (2)
  • skytin1004 (1)
  • rbalamohan (1)
Top Labels
Issue Labels
enhancement (12) good first issue (10) java (4) bug (4) documentation (3) dependencies (2)
Pull Request Labels
dependencies (311) java (301) github_actions (8) python (2) good first issue (1) enhancement (1)

Dependencies

.github/workflows/maven.yaml actions
  • actions/checkout v4 composite
  • actions/setup-java v3 composite
  • jbutcher5/read-yaml 1.6 composite
pom.xml maven
  • org.apache.spark:spark-sql_2.12 3.3.2 provided
  • org.immutables:value 2.9.3 provided
  • com.fasterxml.jackson.dataformat:jackson-dataformat-yaml 2.15.2
  • com.google.code.findbugs:jsr305 3.0.2
  • commons-cli:commons-cli 1.5.0
  • commons-io:commons-io 2.13.0
  • org.apache.commons:commons-lang3 3.13.0
  • org.apache.commons:commons-text 1.10.0
  • org.apache.logging.log4j:log4j-api 2.20.0
  • org.apache.logging.log4j:log4j-core 2.20.0
  • org.apache.logging.log4j:log4j-slf4j-impl 2.20.0
  • org.duckdb:duckdb_jdbc 0.9.0
  • com.networknt:json-schema-validator 1.0.87 test
  • io.delta:delta-core_2.12 2.2.0 test
  • io.delta:delta-storage 2.2.0 test
  • org.apache.hudi:hudi-spark3.3-bundle_2.12 0.12.2 test
  • org.apache.iceberg:iceberg-spark-runtime-3.3_2.12 1.1.0 test
  • org.junit-pioneer:junit-pioneer 2.1.0 test
  • org.junit.jupiter:junit-jupiter 5.10.0 test
  • org.mockito:mockito-core 5.5.0 test
metrics/notebooks/requirements.txt pypi
  • azure-cli-core ==2.49.0
  • azure-core ==1.27.0
  • azure-identity ==1.13.0
  • azure-mgmt-compute ==30.0.0
  • azure-monitor-query ==1.2.0
  • duckdb ==0.8.0
  • jupyter ==1.0.0
  • matplotlib ==3.7.1
  • pandas ==2.0.2
  • python_dateutil ==2.8.2
  • seaborn ==0.12.2
.github/workflows/webapp-deploy.yaml actions
  • actions/checkout v4 composite
  • actions/download-artifact v4 composite
  • actions/setup-python v5 composite
  • actions/upload-artifact v4 composite
  • azure/login v2 composite
  • azure/webapps-deploy v3 composite
metrics/app/requirements.txt pypi
  • altair ==5.2.0
  • duckdb ==0.9.2
  • pandas ==2.2.0
  • streamlit ==1.31.0