Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

big-data-analytics internet-of-things malicious-node online-machine-learning
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: deepaiimpactx
  • License: other
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 16.2 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
big-data-analytics internet-of-things malicious-node online-machine-learning
Created over 1 year ago · Last pushed 12 months ago
Metadata Files
Readme License Citation

README.md

Streamlined Data Pipeline for Real-Time Threat Detection and Model Inference

BARS Architecture

Abstract

Real-time threat detection in streaming data is crucial yet challenging due to varying data volumes and speeds. This paper presents an architecture designed to manage large-scale, high-speed data streams using deep learning and machine learning models. The system utilizes Apache Kafka for high-throughput data transfer and a publish-subscribe model to facilitate continuous threat detection. Various machine learning techniques, including XGBoost, Random Forest, and LightGBM, are evaluated to identify the best model for classification. The ExtraTrees model achieves exceptional performance, with accuracy, precision, recall, and F1 score all reaching 99% on the SensorNetGuard dataset within this architecture. The PyFlink framework, with its parallel processing capabilities, supports real-time training and adaptation of these models. The system calculates prediction metrics every 2,000 data points, ensuring efficient and accurate real-time threat detection.
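The per-window metric cadence described above can be sketched as a small accumulator that emits a score every time a fixed number of labelled predictions arrives. This is an illustrative sketch, not the repository's implementation; the class and method names (`MetricWindow`, `record`) and the accuracy-only metric are assumptions.

```python
from collections import Counter

# Hypothetical sketch: accumulate labelled predictions and emit accuracy
# every `window_size` records, mirroring the "metrics every 2,000 data
# points" cadence from the abstract. Names here are illustrative, not
# taken from the repository.
class MetricWindow:
    def __init__(self, window_size=2000):
        self.window_size = window_size
        self.counts = Counter()

    def record(self, y_true, y_pred):
        """Record one labelled prediction.

        Returns the window's accuracy when the window fills, else None.
        """
        self.counts["total"] += 1
        if y_true == y_pred:
            self.counts["correct"] += 1
        if self.counts["total"] == self.window_size:
            accuracy = self.counts["correct"] / self.counts["total"]
            self.counts.clear()  # start a fresh window
            return accuracy
        return None
```

In a streaming job, each incoming (label, prediction) pair is fed to `record`; non-None return values are the per-window metrics to log or publish.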

🎒 Tech Stack

Jupyter Notebook Python Docker Apache Flink Apache Kafka YAML

Current Pipeline

(pipeline diagram)

🖥️ Run Locally

Clone the project

```bash
git clone https://github.com/deepaiimpactx/BARS
```

Go to the project directory

```bash
cd BARS
```

Build the images

```bash
docker-compose build
```

Start the Docker containers

```bash
docker compose up -d
```

Other useful commands

Check Kafka messages:

```shell
docker exec -it broker kafka-console-consumer --bootstrap-server localhost:9092 --topic output_topic --partition 0 --offset 4990 --max-messages 20
```
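Messages on the output topic arrive as raw bytes; when consuming them programmatically (for example with the confluent_kafka client already listed in the requirements), a small decoding helper is useful. The JSON field names below are assumptions for illustration, not the repository's actual schema; check the producer code for the real payload format.

```python
import json

# Hypothetical helper: decode one Kafka message payload (raw bytes)
# into a Python dict. The field names below are illustrative only.
def decode_message(payload: bytes) -> dict:
    return json.loads(payload.decode("utf-8"))

# Example payload, roughly as the console consumer would print it:
raw = b'{"node_id": 17, "prediction": "malicious", "score": 0.98}'
msg = decode_message(raw)
```

In a real consumer loop, `payload` would come from `message.value()` on each polled Kafka message.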

Run a PyFlink job:

```shell
docker-compose exec flink-jobmanager flink run -py /opt/flink/usr_jobs/classifier.py
```

To verify database records

PostgreSQL

Connect to the PostgreSQL Container:

```sh
docker exec -it postgres bash
```

Once inside the container, use the psql command-line tool to connect to the PostgreSQL database:

```sh
psql -U postgres -d postgres
```

Run SQL queries to check the data in your tables:

```sql
\dt            -- List all tables
SELECT * FROM sensor_data;
```
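The same verification can be scripted. The sketch below uses the stdlib sqlite3 module purely so the snippet is self-contained; in the pipeline itself the table lives in Postgres, where a client such as psycopg2 would take sqlite3's place. The sensor_data columns shown are assumptions, not taken from the repository's initdb scripts.

```python
import sqlite3

# Illustrative only: sqlite3 stands in for Postgres here. The
# sensor_data columns are assumed for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sensor_data (node_id INTEGER, prediction TEXT)")
conn.executemany(
    "INSERT INTO sensor_data VALUES (?, ?)",
    [(1, "benign"), (2, "malicious")],
)
rows = conn.execute("SELECT * FROM sensor_data").fetchall()
row_count = len(rows)  # quick sanity check that records arrived
conn.close()
```

Against the real database, the equivalent check is simply that `SELECT COUNT(*) FROM sensor_data` grows as the Flink job writes predictions.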

Project Organization

.
├── academicPapers  <- Research paper
├── dash    <- Flask app for DL feature selection
│   ├── uploads
├── data    <- Directory for datasets organized by their processing stages
│   ├── external    <- Data from external sources
│   ├── interim     <- Intermediate, transformed data
│   │   ├── pred    <- Prediction data
│   │   ├── train   <- Training data
│   ├── processed   <- Cleaned and final data ready for modeling or analysis
│   └── raw         <- Raw, unprocessed data
├── initdb      <- Database initialization scripts for Postgres
├── kafka       <- Kafka-related scripts and services
│   ├── api
│   ├── consumer
├── notebooks       <- Jupyter notebooks for data exploration and analysis
├── pyflink     <- Directory for Flink in Python
│   ├── saved_models    <- Pickle-serialised ML models saved from PyFlink jobs; shared between the PyFlink Job and Task managers.
│   ├── usr_jobs        <- Directory for Python scripts to be submitted to Flink 
├── simulation      <- Directory for simulating batch and stream environments
│   └── sensorGuard     <- SensorNetGuard Dataset
├── src     <- Source code directory
│   ├── data    <- Scripts for data handling and processing
│   ├── features    <- Scripts for feature engineering
│   ├── models      <- Scripts related to model training and predictions
│   ├── visualization   <- Scripts for data visualization
├── uploads
├── LICENSE     <- Project license file
├── Makefile    <- Makefile for build commands 
├── README.md   <- Top-level README for developers using this project
├── docker-compose.yml      <- Docker Compose configuration for multi-container application
├── qodana.yaml     <- Configuration file for Qodana, a code quality and inspection tool
└── requirements.txt    <- Python dependencies for the project

Project structure based on the cookiecutter data science project template

Regenerate the structure with:

```shell
tree -L 3 --dirsfirst
```

👨‍💻 Authors


Owner

  • Name: DeepAI ImpactX
  • Login: deepaiimpactx
  • Kind: organization
  • Location: India

Citation (CITATION.cff)

cff-version: "1.2.0"
message: "If you use this work, please cite it using the following metadata."
title: "Streamlined Data Pipeline for Real-Time Threat Detection and Model Inference"
authors:
  - family-names: "Singh"
    given-names: "Rajkanwar"
  - family-names: "V"
    given-names: "Aravindan"
  - family-names: "Mishra"
    given-names: "Sanket"
  - family-names: "Singh"
    given-names: "Sunil Kumar"
date-released: "2025"
conference: "2025 17th International Conference on COMmunication Systems and NETworks (COMSNETS)"
pages: "1148-1153"
doi: "10.1109/COMSNETS63942.2025.10885573"
keywords:
  - "Training"
  - "Adaptation models"
  - "Accuracy"
  - "Pipelines"
  - "Publish-subscribe"
  - "Threat assessment"
  - "Real-time systems"
  - "Data models"
  - "Streams"
  - "Random forests"
  - "Malicious Node"
  - "Big Data Analytics"
  - "Online Machine Learning"
  - "Internet of Things"

GitHub Events

Total
  • Public event: 1
  • Push event: 2
Last Year
  • Public event: 1
  • Push event: 2

Dependencies

dash/Dockerfile docker
  • python 3.12-slim build
docker-compose.yml docker
  • confluentinc/cp-enterprise-control-center 7.4.0
  • confluentinc/cp-kafka 7.4.0
  • confluentinc/cp-schema-registry 7.4.0
  • confluentinc/cp-zookeeper 7.4.0
kafka/api/Dockerfile docker
  • python 3.12-slim build
kafka/consumer/Dockerfile docker
  • python 3.12-slim build
pyflink/Dockerfile docker
  • flink 1.18.0 build
dash/requirements.txt pypi
  • Flask *
  • confluent_kafka *
  • lightgbm *
  • pandas *
  • pyswarm *
  • scikit-learn *
  • tensorflow *
  • zoofs *
kafka/api/requirements.txt pypi
  • Flask ==2.2.5
  • confluent-kafka ==2.3.0
  • waitress *
kafka/consumer/requirements.txt pypi
  • Flask ==2.2.5
  • confluent-kafka ==2.3.0
pyflink/poetry.lock pypi
  • apache-beam 2.48.0
  • apache-flink 1.18.0
  • apache-flink-libraries 1.18.0
  • avro-python3 1.10.2
  • certifi 2024.6.2
  • cffi 1.16.0
  • charset-normalizer 3.3.2
  • cloudpickle 2.2.1
  • crcmod 1.7
  • dill 0.3.1.1
  • dnspython 2.6.1
  • docopt 0.6.2
  • fastavro 1.9.4
  • fasteners 0.19
  • find-libpython 0.4.0
  • grpcio 1.64.1
  • hdfs 2.7.3
  • httplib2 0.22.0
  • idna 3.7
  • kafka-python 2.0.2
  • numpy 1.24.4
  • objsize 0.6.1
  • orjson 3.10.3
  • pandas 2.2.2
  • pemja 0.3.0
  • proto-plus 1.23.0
  • protobuf 4.23.4
  • py4j 0.10.9.7
  • pyarrow 11.0.0
  • pycparser 2.22
  • pydot 1.4.2
  • pymongo 4.7.2
  • pyparsing 3.1.2
  • python-dateutil 2.9.0.post0
  • pytz 2024.1
  • regex 2024.5.15
  • requests 2.32.3
  • six 1.16.0
  • typing-extensions 4.12.1
  • tzdata 2024.1
  • urllib3 2.2.1
  • zstandard 0.22.0
pyflink/pyproject.toml pypi
  • apache-flink 1.18
  • kafka-python ^2.0.2
  • python >=3.10,<3.11
pyflink/requirements.txt pypi
  • apache-flink ==1.18
  • apache-flink-libraries ==1.18
  • confluent_kafka *
  • joblib *
  • keras *
  • lightgbm *
  • pickle5 *
  • river *
  • scikit-learn *
  • torch ==2.3.1
  • torchsampler ==0.1.2
  • xgboost *
  • zoofs *
requirements.txt pypi
  • Sphinx *
  • apache-airflow *
  • awscli *
  • click *
  • confluent-kafka ==2.3.0
  • coverage *
  • diagrams *
  • fastapi *
  • flake8 *
  • imbalanced-learn *
  • lightgbm *
  • matplotlib *
  • numpy *
  • pandas *
  • pydantic *
  • pyflink *
  • python-dotenv >=0.5.1
  • scikit-learn *
  • scipy *
  • seaborn *
  • uvicorn *
  • xgboost *