https://github.com/amr-yasser226/intrusion-detection-kaggle

End-to-end pipeline for multi-class cyber-attack detection using per-flow network features: data profiling, deduplication, skew-correction, outlier treatment, feature engineering, imbalance handling, and tree-based modeling (XGBoost, LightGBM, CatBoost, stacking), with a final Kaggle submission scoring 0.9146 public / 0.9163 private.


Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.2%) to scientific vocabulary

Keywords

catboost cyber-security data-preprocessing ensemble-learning feature-engineering imbalanced-data jupyter-notebooks kaggle lightgbm machine-learning outlier-detection random-forest xgboost
Last synced: 5 months ago

Repository

End-to-end pipeline for multi-class cyber-attack detection using per-flow network features: data profiling, deduplication, skew-correction, outlier treatment, feature engineering, imbalance handling, and tree-based modeling (XGBoost, LightGBM, CatBoost, stacking), with a final Kaggle submission scoring 0.9146 public / 0.9163 private.

Basic Info
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
catboost cyber-security data-preprocessing ensemble-learning feature-engineering imbalanced-data jupyter-notebooks kaggle lightgbm machine-learning outlier-detection random-forest xgboost
Created 10 months ago · Last pushed 8 months ago
Metadata Files
Readme · License

README.md

Cyber-Attack Classification

This repository contains the end-to-end pipeline, analysis, and results for the CSAI 253 – Machine Learning course project (Phase 2). We built and evaluated multiple tree-based and ensemble classifiers to distinguish benign traffic from various attack types (DDoS, DoS, Mirai, Recon, MITM) using per-flow network features. The project was organized into data preparation, exploratory analysis, feature engineering, imbalance handling, outlier treatment, modeling, and final submission to the Kaggle competition csai-253-project-phase-2.


Repository Structure

```
.
├── cache/                                   # Temporary files and CatBoost logs
├── catboost_info/                           # TensorBoard event logs for CatBoost training
├── data/
│   ├── train.csv                            # Training data
│   ├── test.csv                             # Test data
│   ├── phase2_students_before_cleaning.csv
│   ├── sample_submission.csv
│   └── Our_Competition_Submission.csv       # Final Kaggle submission (Private 0.9163 / Public 0.9146)
├── figures/
│   ├── class_distribution.png
│   ├── correlation_matrix.png
│   └── feature_importance.png
├── imbalance_analysis/                      # Imbalance diagnostics and plots
├── Models/                                  # Model artifacts and notebooks
│   ├── scaler.joblib
│   ├── selector.joblib
│   ├── xgb_model.joblib
│   ├── stacking_model.joblib
│   └── *.ipynb
├── notebooks/                               # Data profiling, cleaning, and preprocessing notebooks
│   ├── data_profilling.ipynb
│   ├── Feature_Descriptions.ipynb
│   ├── handling_duplicates.ipynb
│   ├── handling_imbalance.ipynb
│   ├── handling_outliers.ipynb
│   ├── model.ipynb
│   ├── scaling.ipynb
│   └── ydata_profiling_code.ipynb
├── Report/                                  # PDF reports on methodology and rationale
│   ├── Columns Report.pdf
│   ├── Encoding Techniques.pdf
│   ├── Feature Descriptions & Preprocessing Report.pdf
│   ├── Feature Engineering Report.pdf
│   ├── [FINAL] PHASE 2 REPORT.pdf
│   ├── Handling Duplicates.pdf
│   ├── Handling Outliers.pdf
│   ├── Models Scaling.pdf
│   ├── Numerical Features Skewness Report.pdf
│   ├── Proper Treatment of Test Data in SMOTE Workflows.pdf
│   ├── Why You Should Split Your Data Before Correcting Skewness.pdf
│   └── skewness_report.txt
├── LICENSE
└── README.md
```


Getting Started

  1. Clone the repository

    ```bash
    git clone https://github.com/amr-yasser226/intrusion-detection-kaggle.git
    cd intrusion-detection-kaggle
    ```

  2. Dependencies: a typical environment includes:

  • Python 3.8+
  • pandas, numpy, scikit-learn, xgboost, lightgbm, catboost, imbalanced-learn, ydata-profiling, optuna, matplotlib, seaborn, joblib, pdfkit
  3. Data
  • Place train.csv and test.csv in /data.
  • Inspect phase2_students_before_cleaning.csv for raw, uncleaned data.
  4. Exploratory Analysis & Profiling
  • Run notebooks/data_profilling.ipynb to generate profiling reports (a minimal sketch follows this list).
  • Visualize distributions, skewness, and correlations.
  5. Preprocessing Pipelines
  • Deduplication: handling_duplicates.ipynb explores direct removal, weighting, and train–test aware grouping.
  • Skew correction: log1p, Yeo–Johnson, Box–Cox — always fit on training split only (Why You Should Split Your Data…).
  • Outlier treatment: winsorization, Z-score, isolation forest (handling_outliers.ipynb).
  • Scaling: Standard, MinMax, Robust, Quantile (scaling.ipynb).
  • Imbalance handling: SMOTE, SMOTE-Tomek, class weights, EasyEnsemble, RUSBoost (handling_imbalance.ipynb); a leakage-safe split-then-resample sketch follows this list.
  6. Feature Engineering
  • Additional features (e.g., rate_ratio, avg_pkt_size, burstiness, payload_entropy, time-cyclic features) are described in Feature Engineering Report.pdf and implemented in scaling.ipynb (see the derived-feature sketch after this list).
  7. Model Training & Evaluation
  • XGBoost and stacking models in Model.ipynb / Phase_2 model.ipynb (a minimal training-and-saving sketch follows this list).
  • Hyperparameter tuning and Optuna-based LightGBM/CatBoost ensembles in data_profilling.ipynb.
  • Final models saved as .joblib in /Models/.
  8. Results & Submission
  • Final private score: 0.916289, public score: 0.914581 on Kaggle.
  • Submission file: data/Our_Competition_Submission.csv.
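
The sketches below illustrate steps 4 through 7 in plain Python; they are minimal, hedged illustrations rather than the notebooks' exact code. First, profiling (step 4), assuming data/train.csv is present and ydata-profiling is installed:

```python
# Minimal profiling sketch for step 4; a full (non-minimal) report can be slow on large flow tables.
import pandas as pd
from ydata_profiling import ProfileReport

train = pd.read_csv("data/train.csv")
ProfileReport(train, title="Phase 2 training data", minimal=True).to_file("train_profile.html")
```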
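
For step 5, a leakage-safe sketch of the split-then-transform-then-resample order argued for in the reports: the Yeo-Johnson transformer and SMOTE are fit on the training split only, and the held-out split is transformed but never resampled. The label column name and the assumption that the remaining columns are all numeric are placeholders, not the repository's actual schema:

```python
# Leakage-safe preprocessing sketch: fit transforms on the training split only.
# The "label" column and all-numeric feature assumption are placeholders.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, PowerTransformer

df = pd.read_csv("data/train.csv").drop_duplicates()   # deduplicate before anything else

le = LabelEncoder()
y = le.fit_transform(df["label"])                      # integer class labels for the boosters
X = df.drop(columns=["label"])

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Skew correction: fit Yeo-Johnson on the training split only, then apply to both splits.
pt = PowerTransformer(method="yeo-johnson")
X_train_t = pt.fit_transform(X_train)
X_val_t = pt.transform(X_val)                          # transform only, never re-fit

# Imbalance handling: oversample the training split only; the validation split stays untouched.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train_t, y_train)
```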
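
For step 6, a sketch of the kind of derived features listed above. Every raw column name used here (bytes_per_sec, pkts_per_sec, total_bytes, total_pkts, flow_duration, hour) is a hypothetical placeholder; the real per-flow schema is documented in the report PDFs:

```python
# Derived-feature sketch; all raw column names are hypothetical placeholders.
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    eps = 1e-9  # guard against division by zero
    out["rate_ratio"] = out["bytes_per_sec"] / (out["pkts_per_sec"] + eps)
    out["avg_pkt_size"] = out["total_bytes"] / (out["total_pkts"] + eps)
    out["burstiness"] = out["pkts_per_sec"] / (out["flow_duration"] + eps)  # one of many possible definitions
    # Time-cyclic encoding of the hour of day so 23:00 and 00:00 stay close in feature space.
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    return out
```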
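
For step 7, a minimal training-and-persistence sketch that reuses the arrays from the preprocessing sketch above (X_train_res, y_train_res, X_val_t, y_val). The hyperparameters are illustrative defaults, not the Optuna-tuned values behind the submitted model:

```python
# XGBoost training sketch; hyperparameters are illustrative, not the tuned competition values.
import joblib
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="mlogloss",
    random_state=42,
)
model.fit(X_train_res, y_train_res)

print("validation macro F1:", f1_score(y_val, model.predict(X_val_t), average="macro"))
joblib.dump(model, "Models/xgb_model.joblib")   # persisted artifact, as in /Models/
```

Optuna tuning, as used in the notebooks, would wrap these constructor arguments in a search objective instead of fixing them by hand.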

Key Findings

  • Deduplicating before splitting prevents leakage and skewed summary statistics.
  • Skew correction must be fit only on the training data to avoid over-optimistic metrics.
  • Tree-based models are largely scale-invariant, but scaling benefits pipelines that mix learners.
  • Outlier handling (winsorization, isolation forest) improves model robustness.
  • Class imbalance addressed via SMOTE (training only) and ensemble methods.
  • XGBoost with tuned hyperparameters achieved the best standalone performance; stacking did not outperform it.
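
A comparison along the following lines makes the last finding checkable. The estimator settings are illustrative, and the inputs (X_train_t, y_train) come from the preprocessing sketch in Getting Started, cross-validated before any resampling so the scores are not inflated by synthetic samples:

```python
# Comparison sketch: single XGBoost vs. a stacked ensemble on cross-validated macro F1.
# Estimator settings are illustrative; X_train_t / y_train come from the preprocessing sketch.
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=300, max_depth=8, learning_rate=0.05, random_state=42)
stack = StackingClassifier(
    estimators=[
        ("xgb", XGBClassifier(n_estimators=300, random_state=42)),
        ("lgbm", LGBMClassifier(n_estimators=300, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    n_jobs=-1,
)

for name, clf in [("xgboost", xgb), ("stacking", stack)]:
    scores = cross_val_score(clf, X_train_t, y_train, cv=3, scoring="f1_macro", n_jobs=-1)
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```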

How to Reproduce

  1. Run the notebooks in order within a Jupyter environment, starting with data profiling and ending with model.ipynb.
  2. Generate figures in /figures and /imbalance_analysis.
  3. Train final models and export xgb_model.joblib, stacking_model.joblib.
  4. Create submission by loading test.csv, applying preprocessing, predicting, and saving Our_Competition_Submission.csv.
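
A sketch of the last step, assuming the fitted transformer, label encoder, and model from the earlier sketches (pt, le, model). The id and label column names are placeholders and should be matched against sample_submission.csv:

```python
# Submission sketch: apply the fitted preprocessing to test.csv, predict, and write the CSV.
# The "id" / "label" column names are placeholders; check them against sample_submission.csv.
import pandas as pd

test = pd.read_csv("data/test.csv")
ids = test["id"]
X_test_t = pt.transform(test.drop(columns=["id"]))      # reuse the transformer fitted on training data
preds = le.inverse_transform(model.predict(X_test_t))   # map integer classes back to attack names

pd.DataFrame({"id": ids, "label": preds}).to_csv(
    "data/Our_Competition_Submission.csv", index=False
)
```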

License

This project is released under the MIT License.

Owner

  • Login: amr-yasser226
  • Kind: user

GitHub Events

Total
  • Watch event: 2
  • Member event: 2
  • Push event: 2
  • Create event: 3
Last Year
  • Watch event: 2
  • Member event: 2
  • Push event: 2
  • Create event: 3

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 18
  • Total Committers: 1
  • Avg Commits per committer: 18.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 18
  • Committers: 1
  • Avg Commits per committer: 18.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Amr Yasser (a****6@g****m): 18 commits

Issues and Pull Requests

Last synced: 7 months ago