Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.1%) to scientific vocabulary
Keywords
Repository
Basic Info
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Streamlined Data Pipeline for Real-Time Threat Detection and Model Inference
BARS Architecture
Abstract
Real-time threat detection in streaming data is crucial yet challenging due to varying data volumes and speeds. This paper presents an architecture designed to manage large-scale, high-speed data streams using deep learning and machine learning models. The system utilizes Apache Kafka for high-throughput data transfer and a publish-subscribe model to facilitate continuous threat detection. Various machine learning techniques, including XGBoost, Random Forest, and LightGBM, are evaluated to identify the best model for classification. The ExtraTrees model achieves exceptional performance with accuracy, precision, recall, and F1 score all reaching 99% using the SensorNetGuard dataset within this architecture. The PyFlink framework, with its parallel processing capabilities, supports real-time training and adaptation of these models. The system calculates prediction metrics every 2,000 data points, ensuring efficient and accurate real-time threat detection.
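As a rough illustration of the windowed evaluation described above, here is a minimal sketch (not taken from the repository) that computes accuracy, precision, recall, and F1 over each block of 2,000 predictions using scikit-learn, which is among the project's listed dependencies. Binary labels are assumed, matching a malicious/benign node classification:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

WINDOW = 2_000  # per the abstract: metrics are computed every 2,000 data points

y_true_buf, y_pred_buf = [], []

def record_prediction(y_true: int, y_pred: int) -> None:
    """Buffer one (label, prediction) pair; emit metrics once per full window."""
    y_true_buf.append(y_true)
    y_pred_buf.append(y_pred)
    if len(y_true_buf) == WINDOW:
        print(
            f"acc={accuracy_score(y_true_buf, y_pred_buf):.4f} "
            f"prec={precision_score(y_true_buf, y_pred_buf):.4f} "
            f"rec={recall_score(y_true_buf, y_pred_buf):.4f} "
            f"f1={f1_score(y_true_buf, y_pred_buf):.4f}"
        )
        y_true_buf.clear()
        y_pred_buf.clear()
```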
🎒 Tech Stack
Current Pipeline

🖥️ Run Locally
Clone the project

```bash
git clone https://github.com/deepaiimpactx/BARS
```

Go to the project directory

```bash
cd BARS
```

Build the images

```bash
docker-compose build
```

Start the containers in detached mode

```bash
docker compose up -d
```
Other useful commands

Check Kafka messages

```shell
docker exec -it broker kafka-console-consumer --bootstrap-server localhost:9092 --topic output_topic --partition 0 --offset 4990 --max-messages 20
```
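The same topic can also be read programmatically. Below is a minimal sketch using confluent_kafka, which appears in the project's dependencies; the group id is an arbitrary placeholder for inspection only:

```python
from confluent_kafka import Consumer

# Minimal consumer for the pipeline's output topic.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "bars-debug",        # hypothetical group id, not from the repo
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["output_topic"])

try:
    for _ in range(20):              # mirror --max-messages 20 above
        msg = consumer.poll(timeout=5.0)
        if msg is None:
            continue
        if msg.error():
            print(f"error: {msg.error()}")
            continue
        print(msg.value().decode("utf-8"))
finally:
    consumer.close()
```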
Run a PyFlink job

```shell
docker-compose exec flink-jobmanager flink run -py /opt/flink/usr_jobs/classifier.py
```
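The contents of classifier.py are not shown in this README. For orientation, a generic PyFlink job skeleton (a sketch only, not the repository's actual classifier) looks like this:

```python
from pyflink.datastream import StreamExecutionEnvironment

def main():
    # A job submitted via `flink run -py ...` builds a dataflow and executes it.
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_parallelism(1)

    # Placeholder in-memory source; the real job would consume from Kafka instead.
    ds = env.from_collection([1, 2, 3, 4, 5])
    ds.map(lambda x: x * 2).print()

    env.execute("bars-classifier-sketch")  # hypothetical job name

if __name__ == "__main__":
    main()
```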
Verify database records

PostgreSQL

Connect to the PostgreSQL container:

```sh
docker exec -it postgres bash
```

Once inside the container, use the psql command-line tool to connect to the PostgreSQL database:

```sh
psql -U postgres -d postgres
```

Run SQL queries to check the data in your tables:

```sql
\dt -- List all tables
SELECT * FROM sensor_data;
```
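Records can also be checked programmatically. The sketch below uses psycopg2, which is an assumption here (it is not among the listed dependencies), with the same default credentials as the psql command above; the password is a placeholder that depends on the docker-compose configuration:

```python
import psycopg2  # assumption: not among the listed project dependencies

conn = psycopg2.connect(
    host="localhost",
    dbname="postgres",
    user="postgres",
    password="postgres",  # hypothetical; check docker-compose.yml
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT * FROM sensor_data LIMIT 5;")
    for row in cur.fetchall():
        print(row)
conn.close()
```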
Project Organization
```
.
├── academicPapers      <- Research paper
├── dash                <- Flask app for DL feature selection
│   ├── uploads
├── data                <- Directory for datasets organized by their processing stages
│   ├── external        <- Data from external sources
│   ├── interim         <- Intermediate, transformed data
│   │   ├── pred        <- Prediction data
│   │   ├── train       <- Training data
│   ├── processed       <- Cleaned and final data ready for modeling or analysis
│   └── raw             <- Raw, unprocessed data
├── initdb              <- Database initialization scripts for Postgres
├── kafka               <- Kafka-related scripts and services
│   ├── api
│   ├── consumer
├── notebooks           <- Jupyter notebooks for data exploration and analysis
├── pyflink             <- Directory for Flink in Python
│   ├── saved_models    <- Pickle-serialised ML models saved from PyFlink jobs; shared between the PyFlink Job and Task managers
│   ├── usr_jobs        <- Python scripts to be submitted to Flink
├── simulation          <- Directory for simulating batch and stream environments
│   └── sensorGuard     <- SensorNetGuard dataset
├── src                 <- Source code directory
│   ├── data            <- Scripts for data handling and processing
│   ├── features        <- Scripts for feature engineering
│   ├── models          <- Scripts related to model training and predictions
│   ├── visualization   <- Scripts for data visualization
├── uploads
├── LICENSE             <- Project license file
├── Makefile            <- Makefile for build commands
├── README.md           <- Top-level README for developers using this project
├── docker-compose.yml  <- Docker Compose configuration for the multi-container application
├── qodana.yaml         <- Configuration for Qodana, a code quality and inspection tool
└── requirements.txt    <- Python dependencies for the project
```
Project structure based on the cookiecutter data science project template. Generate a fresh structure listing with:

```shell
tree -L 3 --dirsfirst
```
👨‍💻 Authors
Owner
- Name: DeepAI ImpactX
- Login: deepaiimpactx
- Kind: organization
- Location: India
- Repositories: 1
- Profile: https://github.com/deepaiimpactx
Citation (CITATION.cff)
```yaml
cff-version: "1.2.0"
message: "If you use this work, please cite it using the following metadata."
title: "Streamlined Data Pipeline for Real-Time Threat Detection and Model Inference"
authors:
  - family-names: "Singh"
    given-names: "Rajkanwar"
  - family-names: "V"
    given-names: "Aravindan"
  - family-names: "Mishra"
    given-names: "Sanket"
  - family-names: "Singh"
    given-names: "Sunil Kumar"
date-released: "2025"
conference: "2025 17th International Conference on COMmunication Systems and NETworks (COMSNETS)"
pages: "1148-1153"
doi: "10.1109/COMSNETS63942.2025.10885573"
keywords:
  - "Training"
  - "Adaptation models"
  - "Accuracy"
  - "Pipelines"
  - "Publish-subscribe"
  - "Threat assessment"
  - "Real-time systems"
  - "Data models"
  - "Streams"
  - "Random forests"
  - "Malicious Node"
  - "Big Data Analytics"
  - "Online Machine Learning"
  - "Internet of Things"
```
GitHub Events
Total
- Public event: 1
- Push event: 2
Last Year
- Public event: 1
- Push event: 2
Dependencies
- python 3.12-slim build
- confluentinc/cp-enterprise-control-center 7.4.0
- confluentinc/cp-kafka 7.4.0
- confluentinc/cp-schema-registry 7.4.0
- confluentinc/cp-zookeeper 7.4.0
- python 3.12-slim build
- python 3.12-slim build
- flink 1.18.0 build
- Flask *
- confluent_kafka *
- lightgbm *
- pandas *
- pyswarm *
- scikit-learn *
- tensorflow *
- zoofs *
- Flask ==2.2.5
- confluent-kafka ==2.3.0
- waitress *
- Flask ==2.2.5
- confluent-kafka ==2.3.0
- apache-beam 2.48.0
- apache-flink 1.18.0
- apache-flink-libraries 1.18.0
- avro-python3 1.10.2
- certifi 2024.6.2
- cffi 1.16.0
- charset-normalizer 3.3.2
- cloudpickle 2.2.1
- crcmod 1.7
- dill 0.3.1.1
- dnspython 2.6.1
- docopt 0.6.2
- fastavro 1.9.4
- fasteners 0.19
- find-libpython 0.4.0
- grpcio 1.64.1
- hdfs 2.7.3
- httplib2 0.22.0
- idna 3.7
- kafka-python 2.0.2
- numpy 1.24.4
- objsize 0.6.1
- orjson 3.10.3
- pandas 2.2.2
- pemja 0.3.0
- proto-plus 1.23.0
- protobuf 4.23.4
- py4j 0.10.9.7
- pyarrow 11.0.0
- pycparser 2.22
- pydot 1.4.2
- pymongo 4.7.2
- pyparsing 3.1.2
- python-dateutil 2.9.0.post0
- pytz 2024.1
- regex 2024.5.15
- requests 2.32.3
- six 1.16.0
- typing-extensions 4.12.1
- tzdata 2024.1
- urllib3 2.2.1
- zstandard 0.22.0
- apache-flink 1.18
- kafka-python ^2.0.2
- python >=3.10,<3.11
- apache-flink ==1.18
- apache-flink-libraries ==1.18
- confluent_kafka *
- joblib *
- keras *
- lightgbm *
- pickle5 *
- river *
- scikit-learn *
- torch ==2.3.1
- torchsampler ==0.1.2
- xgboost *
- zoofs *
- Sphinx *
- apache-airflow *
- awscli *
- click *
- confluent-kafka ==2.3.0
- coverage *
- diagrams *
- fastapi *
- flake8 *
- imbalanced-learn *
- lightgbm *
- matplotlib *
- numpy *
- pandas *
- pydantic *
- pyflink *
- python-dotenv >=0.5.1
- scikit-learn *
- scipy *
- seaborn *
- uvicorn *
- xgboost *