https://github.com/dadananjesha/spark-streaming
Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.4%) to scientific vocabulary
Keywords
Repository
Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Spark Streaming KPI Processing 🚀📊
Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming. The application ingests streaming data from Kafka, transforms it through custom user-defined functions (UDFs), and computes key performance indicators (KPIs) such as Orders per Minute, Total Sales Volume, Rate of Return, and Average Transaction Size. Processed results are output both to the console and to HDFS as JSON files for downstream analytics.
📖 Table of Contents
- Overview
- Key Features
- Project Architecture
- Flow Diagram
- Technologies Used
- Project Structure
- Installation & Setup
- Usage
- Call-to-Action
- License
- Acknowledgements
🔍 Overview
This project demonstrates a step-by-step Spark Streaming application for real-time KPI calculation. The workflow includes:
- Data Ingestion: Reading streaming data from a Kafka topic.
- Data Parsing: Converting raw JSON messages into a structured DataFrame using a defined schema.
- Data Transformation: Unnesting array fields, renaming columns, and applying UDFs to calculate cost and flag order/return events.
- KPI Calculation: Aggregating data in 1-minute windows to compute metrics such as Orders per Minute (OPM), Total Sales Volume, Rate of Return, and Average Transaction Size.
- Output: Streaming processed data to the console and writing JSON files to HDFS for both time-based and country-based KPIs.
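To make the KPI math concrete, here is a minimal plain-Python sketch of the windowed aggregation the pipeline performs (the real job does this with Spark's `window()` and `groupBy`). The sample events and field names (`timestamp`, `type`, `total_cost`, `total_items`) are illustrative assumptions, not the repository's actual schema:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample events; the real schema is defined in spark-streaming.py.
events = [
    {"timestamp": "2024-01-01 10:00:12", "type": "ORDER",  "total_cost": 40.0, "total_items": 2},
    {"timestamp": "2024-01-01 10:00:45", "type": "ORDER",  "total_cost": 60.0, "total_items": 3},
    {"timestamp": "2024-01-01 10:00:50", "type": "RETURN", "total_cost": 20.0, "total_items": 1},
]

def one_minute_window(ts: str) -> str:
    """Truncate a timestamp to the start of its 1-minute window."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:%M")

def compute_kpis(events):
    """Group events into 1-minute windows and compute the four KPIs per window."""
    windows = defaultdict(list)
    for e in events:
        windows[one_minute_window(e["timestamp"])].append(e)
    kpis = {}
    for window, rows in windows.items():
        orders = sum(1 for r in rows if r["type"] == "ORDER")
        returns = sum(1 for r in rows if r["type"] == "RETURN")
        total_sales = sum(r["total_cost"] for r in rows)
        kpis[window] = {
            "OPM": orders,                                            # orders per minute
            "total_sales_volume": round(total_sales, 2),
            "rate_of_return": round(returns / (orders + returns), 4),
            "avg_transaction_size": round(total_sales / len(rows), 2),
        }
    return kpis

print(compute_kpis(events))
```

In the actual streaming job, the same per-window arithmetic runs continuously over the Kafka stream, with a watermark discarding events that arrive too late for their window.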
✨ Key Features
- Real-Time Data Ingestion: Connects to Kafka for live data streaming.
- Structured Data Processing: Uses Spark Structured Streaming with a defined schema to parse raw JSON into structured rows.
- Custom Transformations: Implements UDFs to flag orders/returns and calculate transaction cost.
- Windowed Aggregation: Computes KPIs over 1-minute intervals with watermarking for late data.
- Multiple Output Modes: Streams output to the console and writes KPI data to HDFS as JSON.
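The custom transformations above can be sketched as plain Python functions before they are wrapped as Spark UDFs. The field names (`event_type`, `unit_price`, `quantity`) and the sign convention for returns are assumptions for illustration; the actual logic lives in spark-streaming.py:

```python
# Plain-Python versions of the logic the README describes as Spark UDFs.
# Field names and the negative-cost convention for returns are assumptions.

def is_order(event_type: str) -> int:
    """Flag column: 1 if the event is a new order, else 0."""
    return 1 if event_type == "ORDER" else 0

def is_return(event_type: str) -> int:
    """Flag column: 1 if the event is a return, else 0."""
    return 1 if event_type == "RETURN" else 0

def total_cost(unit_price: float, quantity: int, event_type: str) -> float:
    """Cost of an item line; returns are counted as negative sales here."""
    cost = unit_price * quantity
    return -cost if event_type == "RETURN" else cost
```

In Spark these would be registered with `pyspark.sql.functions.udf` (e.g. `udf(is_order, IntegerType())`) and applied column-wise to the parsed stream.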
🏗️ Project Architecture
The application is structured for modularity and scalability. Key steps:
1. Library Import & Spark Session Setup
2. Kafka Connection & Data Ingestion
3. Schema Mapping & JSON Parsing
4. Exploding Array Columns & Renaming
5. Applying UDFs for KPI Derivation
6. Windowed Aggregation for KPI Calculation
7. Output Streaming to Console and HDFS
🔄 Flow Diagram
```mermaid
flowchart TD
    A[Kafka Topic: Streaming Data] --> B[Raw JSON Data]
    B --> C[Parse Data with Schema]
    C --> D[Explode Array Items & Rename Columns]
    D --> E[Apply UDFs for Order/Return Flags & Cost]
    E --> F[Group by 1-Minute Window]
    F --> G[Compute KPIs: OPM, Sales, Rate of Return, Avg Trans Size]
    G --> H[Output to Console]
    G --> I[Write JSON to HDFS]
```
🛠️ Technologies Used
💻 Installation & Setup
Prerequisites
- Python 3.8+
- Apache Spark 3.x
- Kafka Broker Access
- HDFS Setup (for output files)
- JDK Installed (if using spark-submit)
Setup Steps
- Clone the Repository:
```bash
git clone https://github.com/DadaNanjesha/Spark-streaming.git
cd Spark-streaming
git checkout dev
```
- Set Up Virtual Environment:
```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```
- Install Dependencies:
Ensure your requirements.txt (if provided) includes necessary libraries like pyspark. Then run:
```bash
pip install -r requirements.txt
```
- Configure Kafka & HDFS Settings:
  Update the Kafka broker address, topic name, and output paths in spark-streaming.py.
- Run with spark-submit (Optional):
  To submit the job with the Kafka connector jar, follow the instructions in the code documentation:

```bash
wget https://ds-spark-sql-kafka-jar.s3.amazonaws.com/spark-sql-kafka-0-10_2.11-2.3.0.jar
export SPARK_KAFKA_VERSION=0.10
spark2-submit --jars spark-sql-kafka-0-10_2.11-2.3.0.jar spark-streaming.py > console_print
```
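For the configuration step above, the Kafka source options in a job like spark-streaming.py typically look like the sketch below. This is a config sketch, not the repository's code: the broker host, topic name, and HDFS paths are placeholders to be replaced with your environment's values.

```python
# Sketch of a typical Kafka source / HDFS sink setup for this kind of job.
# All hosts, topics, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KPIStreaming").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-host:9092")  # your Kafka broker
       .option("subscribe", "your-topic-name")                 # your topic
       .option("startingOffsets", "latest")
       .load())

# ... schema parsing, UDFs, and windowed aggregation go here ...

# query = (kpis.writeStream
#          .outputMode("append")
#          .format("json")
#          .option("path", "/user/hadoop/kpi-output")           # HDFS output dir
#          .option("checkpointLocation", "/user/hadoop/kpi-cp")
#          .start())
```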
🚀 Usage
- Run the Streaming Application:
  Execute the main Python file to start the streaming job:

  ```bash
  python spark-streaming.py
  ```

- Monitor the Output:
  View real-time output in the console and in the Spark UI (default port 4040).
- Review KPI Outputs:
  Processed KPIs are written to the configured HDFS directories as JSON files.
⭐️ Support & Call-to-Action
If you find this project useful, please consider:
- Starring the repository ⭐️
- Forking the project to contribute enhancements
- Following for updates on future improvements
Your engagement helps increase visibility and encourages further collaboration!
📜 License
This project is licensed under the MIT License.
🙏 Acknowledgements
- UpGrad Education: For providing guidance and educational resources.
- Apache Spark & Kafka Community: For their open-source contributions.
Happy Streaming & KPI Calculations! 🚀📡
Owner
- Name: DADA NANJESHA
- Login: DadaNanjesha
- Kind: user
- Location: BERLIN
- Repositories: 1
- Profile: https://github.com/DadaNanjesha
GitHub Events
Total
- Push event: 4
- Pull request event: 2
- Create event: 3
Last Year
- Push event: 4
- Pull request event: 2
- Create event: 3
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- DadaNanjesha (1)