https://github.com/dadananjesha/spark-streaming
Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.4%) to scientific vocabulary
Keywords
Repository
Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming
Basic Info
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Spark Streaming KPI Processing 🚀📊
Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming. The application ingests streaming data from Kafka, transforms it through custom user-defined functions (UDFs), and computes key performance indicators (KPIs) such as Orders per Minute, Total Sales Volume, Rate of Return, and Average Transaction Size. Processed results are output both to the console and to HDFS as JSON files for downstream analytics.
📖 Table of Contents
- Overview
- Key Features
- Project Architecture
- Flow Diagram
- Technologies Used
- Project Structure
- Installation & Setup
- Usage
- Call-to-Action
- License
- Acknowledgements
🔍 Overview
This project demonstrates a step-by-step Spark Streaming application for real-time KPI calculation. The workflow includes:
- Data Ingestion: Reading streaming data from a Kafka topic.
- Data Parsing: Converting raw JSON messages into a structured DataFrame using a defined schema.
- Data Transformation: Unnesting array fields, renaming columns, and applying UDFs to calculate cost and flag order/return events.
- KPI Calculation: Aggregating data in 1-minute windows to compute metrics such as Orders per Minute (OPM), Total Sales Volume, Rate of Return, and Average Transaction Size.
- Output: Streaming processed data to the console and writing JSON files to HDFS for both time-based and country-based KPIs.
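To make the KPI math concrete, here is a minimal plain-Python sketch of the windowed aggregation the pipeline performs (the real job does this with Spark's `window()` and `groupBy`). The sample events and field names (`timestamp`, `type`, `total_cost`, `total_items`) are illustrative assumptions, not the repository's actual schema:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample events; the real schema is defined in spark-streaming.py.
events = [
    {"timestamp": "2024-01-01 10:00:12", "type": "ORDER",  "total_cost": 40.0, "total_items": 2},
    {"timestamp": "2024-01-01 10:00:45", "type": "ORDER",  "total_cost": 60.0, "total_items": 3},
    {"timestamp": "2024-01-01 10:00:50", "type": "RETURN", "total_cost": 20.0, "total_items": 1},
]

def one_minute_window(ts: str) -> str:
    """Truncate a timestamp to the start of its 1-minute window."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d %H:%M")

def compute_kpis(events):
    """Group events into 1-minute windows and compute the four KPIs per window."""
    windows = defaultdict(list)
    for e in events:
        windows[one_minute_window(e["timestamp"])].append(e)
    kpis = {}
    for window, rows in windows.items():
        orders = sum(1 for r in rows if r["type"] == "ORDER")
        returns = sum(1 for r in rows if r["type"] == "RETURN")
        total_sales = sum(r["total_cost"] for r in rows)
        kpis[window] = {
            "OPM": orders,                                            # orders per minute
            "total_sales_volume": round(total_sales, 2),
            "rate_of_return": round(returns / (orders + returns), 4),
            "avg_transaction_size": round(total_sales / len(rows), 2),
        }
    return kpis

print(compute_kpis(events))
```

In the actual streaming job, the same per-window arithmetic runs continuously over the Kafka stream, with a watermark discarding events that arrive too late for their window.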
✨ Key Features
- Real-Time Data Ingestion: Connects to Kafka for live data streaming.
- Structured Data Processing: Uses Spark Structured Streaming with a defined schema to parse raw JSON into structured rows.
- Custom Transformations: Implements UDFs to flag orders/returns and calculate transaction cost.
- Windowed Aggregation: Computes KPIs over 1-minute intervals with watermarking for late data.
- Multiple Output Modes: Streams output to the console and writes KPI data to HDFS as JSON.
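The custom transformations above can be sketched as plain Python functions before they are wrapped as Spark UDFs. The field names (`event_type`, `unit_price`, `quantity`) and the sign convention for returns are assumptions for illustration; the actual logic lives in spark-streaming.py:

```python
# Plain-Python versions of the logic the README describes as Spark UDFs.
# Field names and the negative-cost convention for returns are assumptions.

def is_order(event_type: str) -> int:
    """Flag column: 1 if the event is a new order, else 0."""
    return 1 if event_type == "ORDER" else 0

def is_return(event_type: str) -> int:
    """Flag column: 1 if the event is a return, else 0."""
    return 1 if event_type == "RETURN" else 0

def total_cost(unit_price: float, quantity: int, event_type: str) -> float:
    """Cost of an item line; returns are counted as negative sales here."""
    cost = unit_price * quantity
    return -cost if event_type == "RETURN" else cost
```

In Spark these would be registered with `pyspark.sql.functions.udf` (e.g. `udf(is_order, IntegerType())`) and applied column-wise to the parsed stream.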
🏗️ Project Architecture
The application is structured for modularity and scalability. Key steps:
1. Library Import & Spark Session Setup
2. Kafka Connection & Data Ingestion
3. Schema Mapping & JSON Parsing
4. Exploding Array Columns & Renaming
5. Applying UDFs for KPI Derivation
6. Windowed Aggregation for KPI Calculation
7. Output Streaming to Console and HDFS
🔄 Flow Diagram
```mermaid
flowchart TD
    A[Kafka Topic: Streaming Data] --> B[Raw JSON Data]
    B --> C[Parse Data with Schema]
    C --> D[Explode Array Items & Rename Columns]
    D --> E[Apply UDFs for Order/Return Flags & Cost]
    E --> F[Group by 1-Minute Window]
    F --> G[Compute KPIs: OPM, Sales, Rate of Return, Avg Trans Size]
    G --> H[Output to Console]
    G --> I[Write JSON to HDFS]
```
🛠️ Technologies Used
💻 Installation & Setup
Prerequisites
- Python 3.8+
- Apache Spark 3.x
- Kafka Broker Access
- HDFS Setup (for output files)
- JDK Installed (if using spark-submit)
Setup Steps
- Clone the Repository:
```bash
git clone https://github.com/DadaNanjesha/Spark-streaming.git
cd Spark-streaming
git checkout dev
```
- Set Up Virtual Environment:
```bash
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
```
- Install Dependencies:
Ensure your requirements.txt (if provided) includes necessary libraries like pyspark. Then run:
```bash
pip install -r requirements.txt
```
- Configure Kafka & HDFS Settings:
  Update the Kafka broker address, topic name, and output paths in spark-streaming.py.
- Run with spark-submit (Optional):
  To submit the job with the Kafka connector jar, follow the instructions in the code documentation:

```bash
wget https://ds-spark-sql-kafka-jar.s3.amazonaws.com/spark-sql-kafka-0-10_2.11-2.3.0.jar
export SPARK_KAFKA_VERSION=0.10
spark2-submit --jars spark-sql-kafka-0-10_2.11-2.3.0.jar spark-streaming.py > console_print
```
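For the configuration step above, the Kafka source options in a job like spark-streaming.py typically look like the sketch below. This is a config sketch, not the repository's code: the broker host, topic name, and HDFS paths are placeholders to be replaced with your environment's values.

```python
# Sketch of a typical Kafka source / HDFS sink setup for this kind of job.
# All hosts, topics, and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KPIStreaming").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker-host:9092")  # your Kafka broker
       .option("subscribe", "your-topic-name")                 # your topic
       .option("startingOffsets", "latest")
       .load())

# ... schema parsing, UDFs, and windowed aggregation go here ...

# query = (kpis.writeStream
#          .outputMode("append")
#          .format("json")
#          .option("path", "/user/hadoop/kpi-output")           # HDFS output dir
#          .option("checkpointLocation", "/user/hadoop/kpi-cp")
#          .start())
```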
🚀 Usage
- Run the Streaming Application:
  Execute the main Python file to start the streaming job:

  ```bash
  python spark-streaming.py
  ```

- Monitor the Output:
  View real-time output in the console and in the Spark UI (default port 4040).
- Review KPI Outputs:
  Processed KPIs are written to the configured HDFS directories as JSON files.
⭐️ Support & Call-to-Action
If you find this project useful, please consider:
- Starring the repository ⭐️
- Forking the project to contribute enhancements
- Following for updates on future improvements
Your engagement helps increase visibility and encourages further collaboration!
📜 License
This project is licensed under the MIT License.
🙏 Acknowledgements
- UpGrad Education: For providing guidance and educational resources.
- Apache Spark & Kafka Community: For their open-source contributions.
Happy Streaming & KPI Calculations! 🚀📡
Owner
- Name: DADA NANJESHA
- Login: DadaNanjesha
- Kind: user
- Location: BERLIN
- Repositories: 1
- Profile: https://github.com/DadaNanjesha
GitHub Events
Total
- Push event: 4
- Pull request event: 2
- Create event: 3
Last Year
- Push event: 4
- Pull request event: 2
- Create event: 3
Issues and Pull Requests
Last synced: 11 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- DadaNanjesha (1)