https://github.com/dadananjesha/spark-streaming

Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming


Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.4%) to scientific vocabulary

Keywords

apache apache-spark-streaming data-processing hdfs kafka kpi real-time spark spark-streaming
Last synced: 5 months ago

Repository

Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming

Basic Info
  • Host: GitHub
  • Owner: DadaNanjesha
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 2.93 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
apache apache-spark-streaming data-processing hdfs kafka kpi real-time spark spark-streaming
Created 12 months ago · Last pushed 12 months ago
Metadata Files
Readme License

README.md

Spark Streaming KPI Processing 🚀📊

[Badges: Apache Spark · Kafka · Python Version · License: MIT]

Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming. The application ingests streaming data from Kafka, transforms it through custom user-defined functions (UDFs), and computes key performance indicators (KPIs) such as Orders per Minute, Total Sales Volume, Rate of Return, and Average Transaction Size. Processed results are output both to the console and to HDFS as JSON files for downstream analytics.




🔍 Overview

This project demonstrates a step-by-step Spark Streaming application to achieve real-time KPI calculations. The workflow includes:

  • Data Ingestion: Reading streaming data from a Kafka topic.
  • Data Parsing: Converting raw JSON messages into a structured DataFrame using a defined schema.
  • Data Transformation: Unnesting array fields, renaming columns, and applying UDFs to calculate cost and flag order/return events.
  • KPI Calculation: Aggregating data in 1-minute windows to compute metrics like Orders per Minute (OPM), Total Sales Volume, Rate of Return, and Average Transaction Size.
  • Output: Streaming processed data to the console and writing JSON files to HDFS for both time-based and country-based KPIs.
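The transformation step can be sketched in plain Python. This is an illustrative reimplementation, not the repository's code: the field names (`type`, `items`, `quantity`, `unit_price`) and the order/return conventions are assumptions; the real UDFs live in spark-streaming.py.

```python
# Framework-free sketch of the UDF logic described above.
# Field names and the ORDER/RETURN convention are assumptions.

def is_order(order_type):
    """Flag an event as an order (1) or not (0)."""
    return 1 if order_type == "ORDER" else 0

def is_return(order_type):
    """Flag an event as a return (1) or not (0)."""
    return 1 if order_type == "RETURN" else 0

def total_cost(items, order_type):
    """Sum quantity * unit_price over line items; count returns as negative revenue."""
    cost = sum(item["quantity"] * item["unit_price"] for item in items)
    return -cost if order_type == "RETURN" else cost

event = {
    "type": "ORDER",
    "items": [
        {"quantity": 2, "unit_price": 10.0},
        {"quantity": 1, "unit_price": 5.0},
    ],
}
print(total_cost(event["items"], event["type"]))  # 25.0
```

In the actual job, functions like these would be wrapped with `pyspark.sql.functions.udf` and applied column-wise to the parsed DataFrame.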


✨ Key Features

  • Real-Time Data Ingestion: Connects to Kafka for live data streaming.
  • Structured Data Processing: Utilizes Spark Structured Streaming with a defined schema to make raw data readable.
  • Custom Transformations: Implements UDFs to flag orders/returns and calculate transaction cost.
  • Windowed Aggregation: Computes KPIs over 1-minute intervals with watermarking for late data.
  • Multiple Output Modes: Streams output to the console and writes KPI data to HDFS as JSON.

🏗️ Project Architecture

The application is structured to ensure modularity and scalability. Key steps include:

  1. Library Import & Spark Session Setup
  2. Kafka Connection & Data Ingestion
  3. Schema Mapping & JSON Parsing
  4. Exploding Array Columns & Renaming
  5. Applying UDFs for KPI Derivation
  6. Windowed Aggregation for KPI Calculation
  7. Output Streaming to Console and HDFS
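To make the windowed aggregation step concrete, here is a framework-free sketch that buckets events into 1-minute windows and computes the four KPIs. The KPI formulas (e.g., rate of return as returns over total events) are plausible interpretations, not taken from the repository; the actual job does the equivalent with Spark's `window()`, `groupBy()`, and `agg()`.

```python
# Illustrative 1-minute windowed KPI aggregation in plain Python.
# Assumed event shape: epoch-second 'ts', signed 'cost', 0/1 flags.
from collections import defaultdict

def kpis_per_minute(events):
    """Bucket events by minute and compute per-window KPIs."""
    windows = defaultdict(list)
    for e in events:
        windows[e["ts"] // 60].append(e)

    result = {}
    for w, evs in sorted(windows.items()):
        orders = sum(e["is_order"] for e in evs)
        returns = sum(e["is_return"] for e in evs)
        total = orders + returns
        result[w * 60] = {
            "OPM": orders,  # Orders per Minute
            "total_sales_volume": sum(e["cost"] for e in evs if e["is_order"]),
            "rate_of_return": returns / total if total else 0.0,
            "avg_transaction_size": sum(e["cost"] for e in evs) / len(evs),
        }
    return result

events = [
    {"ts": 0,  "cost": 20.0, "is_order": 1, "is_return": 0},
    {"ts": 30, "cost": -5.0, "is_order": 0, "is_return": 1},
    {"ts": 70, "cost": 10.0, "is_order": 1, "is_return": 0},
]
print(kpis_per_minute(events)[0]["avg_transaction_size"])  # 7.5
```

In Spark, the same grouping is expressed as `df.withWatermark("timestamp", ...).groupBy(window(col("timestamp"), "1 minute")).agg(...)`, which also handles late-arriving data.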


🔄 Flow Diagram

```mermaid
flowchart TD
    A[Kafka Topic: Streaming Data] --> B[Raw Data - JSON]
    B --> C[Parse Data with Schema]
    C --> D[Explode Array Items & Rename Columns]
    D --> E[Apply UDFs for Order or Return Flags & Cost]
    E --> F[Group by Window - 1 minute]
    F --> G[Compute KPIs: OPM, Sales, Rate of Return, Avg Trans Size]
    G --> H[Output to Console]
    G --> I[Write JSON to HDFS]
```


🛠️ Technologies Used

Apache Spark Kafka Python HDFS

💻 Installation & Setup

Prerequisites

  • Python 3.8+
  • Apache Spark 3.x
  • Kafka Broker Access
  • HDFS Setup (for output files)
  • JDK Installed (if using spark-submit)

Setup Steps

  1. Clone the Repository:

```bash
git clone https://github.com/DadaNanjesha/Spark-streaming.git
cd Spark-streaming
git checkout dev
```

  2. Set Up Virtual Environment:

```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```

  3. Install Dependencies:

Ensure your requirements.txt (if provided) includes the necessary libraries, such as pyspark. Then run:

```bash
pip install -r requirements.txt
```

  4. Configure Kafka & HDFS Settings:
  • Update the Kafka broker address, topic name, and any output paths in spark-streaming.py.

  5. Build the Application (Optional):

To run the job with the Kafka connector jar (for example, on a cluster), download the jar and submit as described in the code documentation:

```bash
wget https://ds-spark-sql-kafka-jar.s3.amazonaws.com/spark-sql-kafka-0-10_2.11-2.3.0.jar
export SPARK_KAFKA_VERSION=0.10
spark2-submit --jars spark-sql-kafka-0-10_2.11-2.3.0.jar spark-streaming.py > console_print
```


🚀 Usage

  • Run Streaming Application:
    Execute the main Python file to start the streaming job:

```bash
python spark-streaming.py
```

  • Monitor the Output:
    View real-time outputs in the console and monitor the Spark UI (default port 4040).

  • Review KPI Outputs:
    Processed KPIs will be written to HDFS directories as JSON files (paths as configured).
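Spark's JSON file sink writes newline-delimited JSON part files to the configured HDFS directories, so downstream tools can read them line by line. A small sketch of consuming such output; the record layout here is illustrative, mirroring the KPIs above rather than the job's exact schema:

```python
# Parse newline-delimited JSON as produced by Spark's file sink.
# The sample records below are hypothetical, for illustration only.
import json
import io

sample = io.StringIO(
    '{"window": {"start": "2024-01-01T00:00:00", "end": "2024-01-01T00:01:00"}, "OPM": 12, "total_sales_volume": 340.5}\n'
    '{"window": {"start": "2024-01-01T00:01:00", "end": "2024-01-01T00:02:00"}, "OPM": 9, "total_sales_volume": 221.0}\n'
)

# Each non-empty line is one JSON object.
records = [json.loads(line) for line in sample if line.strip()]
print(len(records), records[0]["OPM"])  # 2 12
```

In practice, `sample` would be a part file copied out of HDFS (e.g., via `hdfs dfs -cat`), or the directory could be read back with `spark.read.json(path)`.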


⭐️ Support & Call-to-Action

If you find this project useful, please consider:

  • Starring the repository ⭐️
  • Forking the project to contribute enhancements
  • Following for updates on future improvements

Your engagement helps increase visibility and encourages further collaboration!


📜 License

This project is licensed under the MIT License.


🙏 Acknowledgements

  • UpGrad Education: For providing guidance and educational resources.
  • Apache Spark & Kafka Community: For their open-source contributions.

Happy Streaming & KPI Calculations! 🚀📡

Owner

  • Name: DADA NANJESHA
  • Login: DadaNanjesha
  • Kind: user
  • Location: BERLIN

GitHub Events

Total
  • Push event: 4
  • Pull request event: 2
  • Create event: 3
Last Year
  • Push event: 4
  • Pull request event: 2
  • Create event: 3

Issues and Pull Requests

Last synced: 11 months ago

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • DadaNanjesha (1)
Top Labels
Issue Labels
Pull Request Labels