Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.6%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: mgarralda
  • License: apache-2.0
  • Language: Java
  • Default Branch: main
  • Size: 8.6 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 10 months ago · Last pushed 10 months ago
Metadata Files
Readme License Citation

README.md

HiBench for Apache Spark 3.3 (Scala-based)

This repository provides an updated version of Intel's HiBench, adapted to support Apache Spark 3.3.x, Scala 2.12, and modern DataFrame APIs.

⚠️ This is a derivative work for internal research and benchmarking. It is not an official release.


🧭 Scope of This Version

  • Targets local Spark and Spark-on-Hadoop clusters
  • Validated in Azure HDInsight HDI 5.1 with Azure Blob Storage
  • Designed to run on ephemeral Spark clusters while keeping input/output out of the cluster

🔧 Our Improvements

  • Spark 3.3.x, Hadoop 3.2.x, and Scala 2.12.x, with updated build scripts and dependencies
  • ✅ Enabled support for YARN cluster deploy mode, keeping the original local and client modes
  • ✅ Several workloads refactored to use DataFrame API instead of legacy RDD
  • ✅ Provided WABS compatibility: handles the complex task of mapping the file system semantics required by HDFS onto the object-store-style interface exposed by Azure Blob Storage
  • ✅ Designed to support ephemeral clusters, with persistent input/output data kept outside the cluster
  • ✅ Added support for generating the input data for each workload only once
  • ✅ Added Docker support for local testing and development
  • ✅ Enhanced the hibench.report to show the application_id and input data size scale

Spark Cluster Based on Docker

You can use our Docker Spark Cluster version for development and testing. Link: Docker Spark Cluster

📚 Original Work Attribution

This project is based on the original HiBench suite by:

  • Intel Corporation
  • Repository: https://github.com/Intel-bigdata/HiBench

Modifications by:

  • Mariano Garralda (2025)
  • University of A Coruña (UDC)

See the NOTICE file for details.


📜 License

This project is licensed under the Apache License 2.0.
See the NOTICE file for attribution.


📄 Citation

If you use this work in academic or research contexts, please cite:

Garralda, M. (2025). Extended benchmarking analysis based on HiBench for Spark 3.3 workloads.

Garralda-Barrio, M., Eiras-Franco, C., & Bolón-Canedo, V. (2024).
A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning.
Journal of Parallel and Distributed Computing, 189, 104881. https://doi.org/10.1016/j.jpdc.2024.104881

📚 Citation (BibTeX)

```bibtex
@article{garralda2024novel,
  title={A novel framework for generic Spark workload characterization and similar pattern recognition using machine learning},
  author={Garralda-Barrio, Mariano and Eiras-Franco, Carlos and Bol{\'o}n-Canedo, Ver{\'o}nica},
  journal={Journal of Parallel and Distributed Computing},
  volume={189},
  pages={104881},
  year={2024},
  doi={10.1016/j.jpdc.2024.104881},
  publisher={Elsevier}
}
```

🐳 Spark Cluster for Local Testing

You can use our Docker-based Spark-Hadoop cluster for development and validation:

🔗 Docker Spark Cluster


This README was adapted and extended to reflect the specific contributions made in the context of a research benchmark study targeting Spark 3.3 compatibility and cloud deployment support.

HiBench Suite

The big data micro-benchmark suite

  • Current version: 1.0
  • Homepage: https://github.com/intel-hadoop/HiBench
  • Contents:
    1. Overview
    2. Getting Started
    3. Workloads
    4. Supported Releases

📦 HiBench Suite Overview

HiBench is a big data benchmark suite that helps evaluate Spark workloads in terms of speed, throughput, and system resource utilization.

Workloads

There are 24 workloads in total in HiBench. The workloads are divided into 5 categories: micro, ml (machine learning), sql, graph, and websearch.

Micro Benchmarks:

  1. Sort (sort)

    This workload sorts its text input data, which is generated using RandomTextWriter.

  2. WordCount (wordcount)

    This workload counts the occurrences of each word in the input data, which is generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.

  3. TeraSort (terasort)

    TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.

  4. Repartition (micro/repartition)

    This workload benchmarks shuffle performance. Input data is generated by Hadoop TeraGen. The workload randomly selects the post-shuffle partition for each record and performs the shuffle write and read, evenly repartitioning the records. Two parameters provide options to eliminate data source and sink I/O: hibench.repartition.cacheinmemory (default: false) controls whether the input is first cached in memory, and hibench.repartition.disableOutput (default: false) controls whether writing the result to storage is skipped.

  5. Sleep (sleep)

    This workload sleeps for a number of seconds in each task to test the framework scheduler.
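
As a rough illustration of the random partition selection described for the Repartition workload above, the following plain-Python sketch assigns each record to a uniformly random target partition and reports the resulting partition sizes. It is not the HiBench Scala implementation and involves no Spark shuffle; the helper name `random_repartition`, the `seed` parameter, and the sample records are all hypothetical.

```python
import random
from collections import defaultdict


def random_repartition(records, num_partitions, seed=0):
    """Assign every record to a uniformly random partition (hypothetical helper)."""
    rng = random.Random(seed)
    partitions = defaultdict(list)
    for record in records:
        partitions[rng.randrange(num_partitions)].append(record)
    # Return a plain list so that empty partitions are still represented.
    return [partitions[i] for i in range(num_partitions)]


# Illustrative input only; the real workload reads TeraGen-generated data.
records = [f"row-{i}" for i in range(10_000)]
parts = random_repartition(records, num_partitions=8)
sizes = [len(p) for p in parts]
print(sizes)
```

With enough records, the partition sizes come out close to equal, which is the "evenly repartitioning" effect the workload relies on.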

Machine Learning:

  1. Bayesian Classification (Bayes)

    Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. This workload is implemented in spark.mllib and uses automatically generated documents whose words follow the Zipfian distribution. The dictionary used for text generation is the default Linux file /usr/share/dict/linux.words.

  2. K-means clustering (Kmeans)

    This workload tests K-means clustering (a well-known clustering algorithm for knowledge discovery and data mining) in spark.mllib. The input data set is generated by GenKMeansDataset based on the uniform and Gaussian distributions. There is also an optimized K-means implementation based on DAL (Intel Data Analytics Library), which is available in the dal module of sparkbench.

  3. Gaussian Mixture Model (GMM)

    A Gaussian Mixture Model represents a composite distribution whereby points are drawn from one of k Gaussian sub-distributions, each with its own probability. It is implemented in spark.mllib. The input data set is generated by GenKMeansDataset based on the uniform and Gaussian distributions.

  4. Logistic Regression (LR)

    Logistic Regression (LR) is a popular method to predict a categorical response. This workload is implemented in spark.mllib with the LBFGS optimizer, and the input data set is generated by LogisticRegressionDataGenerator based on a random balanced decision tree. It contains three different kinds of data types: categorical data, continuous data, and binary data.

  5. Alternating Least Squares (ALS)

    The alternating least squares (ALS) algorithm is a well-known algorithm for collaborative filtering. This workload is implemented in spark.mllib and the input data set is generated by RatingDataGenerator for a product recommendation system.

  6. Gradient Boosted Trees (GBT)

    Gradient-boosted trees (GBT) is a popular regression method using ensembles of decision trees. This workload is implemented in spark.mllib and the input data set is generated by GradientBoostedTreeDataGenerator.

  7. XGBoost (XGBoost)

    XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. This workload is implemented with the XGBoost4J-Spark API in spark.mllib and the input data set is generated by GradientBoostedTreeDataGenerator.

  8. Linear Regression (Linear)

    Linear Regression (Linear) is a workload implemented in spark.ml with ElasticNet. The input data set is generated by LinearRegressionDataGenerator.

  9. Latent Dirichlet Allocation (LDA)

    Latent Dirichlet allocation (LDA) is a topic model which infers topics from a collection of text documents. This workload is implemented in spark.mllib and the input data set is generated by LDADataGenerator.

  10. Principal Components Analysis (PCA)

    Principal component analysis (PCA) is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible. PCA is used widely in dimensionality reduction. This workload is implemented in spark.ml. The input data set is generated by PCADataGenerator.

  11. Random Forest (RF)

    Random forests (RF) are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. This workload is implemented in spark.mllib and the input data set is generated by RandomForestDataGenerator.

  12. Support Vector Machine (SVM)

    Support Vector Machine (SVM) is a standard method for large-scale classification tasks. This workload is implemented in spark.mllib and the input data set is generated by SVMDataGenerator.

  13. Singular Value Decomposition (SVD)

    Singular value decomposition (SVD) factorizes a matrix into three matrices. This workload is implemented in spark.mllib and its input data set is generated by SVDDataGenerator.
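
To make the K-means workload above more concrete, here is a minimal plain-Python sketch of Lloyd's algorithm, the standard K-means iteration. It is only an illustration under stated assumptions: it does not use spark.mllib or GenKMeansDataset, and the function names, the `seed` parameter, and the two synthetic Gaussian blobs are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate the assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            clusters[nearest].append(point)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids


# Hypothetical sample data: two Gaussian blobs centered near (0, 0) and (10, 10).
rng = random.Random(42)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
blob_b = [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(100)]
centroids = sorted(kmeans(blob_a + blob_b, k=2))
print(centroids)
```

The two returned centroids should land near the blob centers; a distributed implementation parallelizes the assignment step across partitions, which is the part a benchmark of this kind stresses.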

SQL:

  1. Scan (scan)

  2. Join (join)

  3. Aggregate (aggregation)

    These workloads are developed based on the SIGMOD '09 paper "A Comparison of Approaches to Large-Scale Data Analysis" and HIVE-396. They contain Hive queries (Aggregation and Join) that perform the typical OLAP queries described in the paper. Their input is also automatically generated Web data with hyperlinks following the Zipfian distribution.
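
Several workloads rely on generated data whose hyperlinks or words follow the Zipfian distribution. The sketch below shows one simple way to draw such samples in plain Python, weighting the item at rank r by 1 / r^s. It is not the HiBench data generator; the function names, the exponent `s = 1.0`, and the page identifiers are assumptions chosen for illustration.

```python
import random
from collections import Counter


def zipf_weights(n, exponent=1.0):
    """Unnormalized Zipfian weights: rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n + 1)]


def sample_zipf(items, count, exponent=1.0, seed=0):
    """Draw `count` items with replacement; earlier items are more likely."""
    rng = random.Random(seed)
    weights = zipf_weights(len(items), exponent)
    return rng.choices(items, weights=weights, k=count)


# Hypothetical link targets: page-1 is the most popular, page-100 the least.
pages = [f"page-{i}" for i in range(1, 101)]
links = sample_zipf(pages, count=50_000)
print(Counter(links).most_common(3))
```

A few items receive most of the references while the long tail is rarely hit, which is what produces skewed joins and aggregations in the generated data.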

Websearch Benchmarks:

  1. PageRank (pagerank)

    This workload benchmarks the PageRank algorithm implemented in the Spark-MLLib/Hadoop examples (a search engine ranking benchmark included in Pegasus 2.0). The data source is generated from Web data whose hyperlinks follow the Zipfian distribution.

  2. Nutch indexing (nutchindexing)

    Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system in Nutch, a popular open source (Apache project) search engine. The workload uses automatically generated Web data whose hyperlinks and words both follow the Zipfian distribution with corresponding parameters. The dictionary used to generate the Web page texts is the default Linux dictionary file.
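
For readers unfamiliar with the algorithm behind the PageRank workload above, the following plain-Python sketch runs the standard power iteration on a tiny link graph. It is an illustration only, not the Spark or Pegasus code; the damping factor of 0.85, the iteration count, and the four-page graph are assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; `graph` maps each page to its outgoing links."""
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            # A page without outgoing links spreads its rank over all pages.
            receivers = targets if targets else nodes
            share = damping * ranks[node] / len(receivers)
            for target in receivers:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical four-page web; every link target must also be a key.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Each iteration is a join between the current ranks and the link table followed by an aggregation per target page, which is why the workload is shuffle-heavy at scale.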

Graph Benchmark:

  1. NWeight (nweight)

    NWeight is an iterative graph-parallel algorithm implemented with Spark GraphX and Pregel. The algorithm computes associations between two vertices that are n hops apart.
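
The text above does not define how an n-hop association is scored, so the following plain-Python sketch makes an explicit assumption: the association between two vertices is the sum, over all directed paths of exactly n edges, of the product of the edge weights along each path. This is one plausible reading, not a statement of what the GraphX implementation does, and the function name and sample edges are hypothetical.

```python
from collections import defaultdict


def n_hop_weights(edges, n):
    """Sum of edge-weight products over all paths of exactly n hops (assumed scoring)."""
    adjacency = defaultdict(dict)
    for src, dst, weight in edges:
        adjacency[src][dst] = weight
    # current[u][v] holds the accumulated weight of paths u -> v at the current hop count.
    current = {u: dict(neighbors) for u, neighbors in adjacency.items()}
    for _ in range(n - 1):
        extended = defaultdict(lambda: defaultdict(float))
        for u, reachable in current.items():
            for middle, w1 in reachable.items():
                for v, w2 in adjacency.get(middle, {}).items():
                    extended[u][v] += w1 * w2
        current = {u: dict(targets) for u, targets in extended.items()}
    return current


# Hypothetical weighted graph with two different 2-hop paths from "a" to "c".
edges = [("a", "b", 0.5), ("b", "c", 0.4), ("a", "d", 0.2), ("d", "c", 0.5)]
two_hop = n_hop_weights(edges, n=2)
print(two_hop)
```

Each loop pass extends every known path by one edge, mirroring the superstep structure of an iterative, Pregel-style computation.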

Supported Hadoop/Spark releases:

  • Hadoop: Apache Hadoop 3.3.x
  • Spark: Spark 3.3.x

Owner

  • Name: Mariano Garralda
  • Login: mgarralda
  • Kind: user
  • Location: Lleida
  • Company: Indra

M.Sc. Computer Engineering, Big Data, Artificial Intelligence Research and Ph.D. student in Computer Science

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this repository, please cite it as below."
title: "HiBench for Apache Spark 3.3 (Scala-based)"
authors:
  - family-names: Garralda
    given-names: Mariano
    affiliation: University of A Coruña (UDC)
date-released: 2025-05-02
version: 1.0.0
repository-code: https://github.com/mgarralda/hibench
license: Apache-2.0
keywords:
  - benchmarking
  - spark
  - big data
  - hibench
  - scala
  - performance evaluation
type: software
abstract: >
  This repository provides a modified version of Intel's HiBench benchmark suite,
  adapted to support Apache Spark 3.3 and Scala-based environments. It includes
  updated compatibility, improved configuration scripts, and validation for research use.

GitHub Events

Total
  • Delete event: 1
  • Public event: 1
  • Push event: 3
  • Create event: 2
Last Year
  • Delete event: 1
  • Public event: 1
  • Push event: 3
  • Create event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 2
  • Total Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
mgarralda m****8@g****m 2

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

autogen/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • org.apache.spark:spark-mllib_${scala.binary.version} ${spark.version} provided
  • org.apache.spark:spark-sql_${scala.binary.version} ${spark.version} provided
  • com.github.scopt:scopt_${scala.binary.version} ${scopt.version}
  • com.intel.hibench:hibench-common ${project.version}
  • org.apache.hadoop:hadoop-client ${hadoop.mr2.version}
  • org.apache.hadoop:hadoop-hdfs ${hadoop.mr2.version}
  • org.apache.kafka:kafka-clients 0.8.2.2
  • org.apache.mahout:mahout-core ${mahout.version}
  • org.apache.mahout:mahout-math ${mahout.version}
  • org.uncommons.maths:uncommons-maths ${uncommons-maths.version}
common/pom.xml maven
  • com.codahale.metrics:metrics-jvm 3.0.2
  • org.apache.kafka:kafka_2.11 0.8.2.1
pom.xml maven
  • junit:junit 3.8.1 test
sparkbench/assembly/pom.xml maven
  • com.intel.hibench.sparkbench:sparkbench-common ${project.version}
sparkbench/common/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • org.apache.hadoop:hadoop-common ${hadoop.mr2.version}
  • org.apache.spark:spark-sql_${scala.binary.version} ${spark.version}
sparkbench/dal/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • org.apache.spark:spark-mllib_${scala.binary.version} ${spark.version} provided
  • com.github.scopt:scopt_${scala.binary.version} ${scopt.version}
  • com.intel.daal:daal 2019.3.199
  • com.intel.hibench.sparkbench:sparkbench-common ${project.version}
  • org.apache.mahout:mahout-core ${mahout.version}
  • org.apache.mahout:mahout-math ${mahout.version}
sparkbench/graph/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • org.apache.spark:spark-graphx_${scala.binary.version} ${spark.version} provided
  • org.apache.spark:spark-mllib_${scala.binary.version} ${spark.version} provided
  • com.intel.hibench.sparkbench:sparkbench-common 8.0-SNAPSHOT
  • it.unimi.dsi:fastutil ${fastutil.version}
sparkbench/micro/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • com.intel.hibench.sparkbench:sparkbench-common ${project.version}
  • org.apache.hadoop:hadoop-client ${hadoop.mr2.version}
  • org.apache.hadoop:hadoop-mapreduce-examples ${hadoop.mr2.version}
  • org.apache.spark:spark-sql_${scala.binary.version} ${spark.version}
  • org.scala-lang.modules:scala-java8-compat_${scala.binary.version} 0.9.0
sparkbench/ml/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • org.apache.spark:spark-mllib_${scala.binary.version} ${spark.version} provided
  • com.github.scopt:scopt_${scala.binary.version} ${scopt.version}
  • com.intel.hibench.sparkbench:sparkbench-common ${project.version}
  • ml.dmlc:xgboost4j-spark_${scala.binary.version} 1.0.0
  • ml.dmlc:xgboost4j_${scala.binary.version} 1.0.0
  • org.apache.mahout:mahout-core ${mahout.version}
  • org.apache.mahout:mahout-math ${mahout.version}
sparkbench/pom.xml maven
sparkbench/sql/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • org.apache.spark:spark-hive_${scala.binary.version} ${spark.version} provided
  • com.intel.hibench.sparkbench:sparkbench-common 8.0-SNAPSHOT
sparkbench/websearch/pom.xml maven
  • org.apache.spark:spark-core_${scala.binary.version} ${spark.version} provided
  • com.intel.hibench.sparkbench:sparkbench-common 8.0-SNAPSHOT