https://github.com/amr-yasser226/intrusion-detection-kaggle
End-to-end pipeline for multi-class cyber-attack detection using per-flow network features: data profiling, deduplication, skew-correction, outlier treatment, feature engineering, imbalance handling, and tree-based modeling (XGBoost, LightGBM, CatBoost, stacking), with a final Kaggle submission scoring 0.9146 public / 0.9163 private.
Science Score: 26.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 10.2%, to scientific vocabulary)
Basic Info
- Host: GitHub
- Owner: amr-yasser226
- License: mit
- Language: Jupyter Notebook
- Default Branch: main
- Homepage: https://www.kaggle.com/competitions/csai-253-project-phase-2/
- Size: 61.2 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Cyber-Attack Classification
This repository contains the end-to-end pipeline, analysis, and results for the CSAI 253 – Machine Learning course project (Phase 2). We built and evaluated multiple tree‐based and ensemble classifiers to distinguish between benign and various attack types (DDoS, DoS, Mirai, Recon, MITM) using per-flow network features. The project was organized into data preparation, exploratory analysis, feature engineering, imbalance handling, outlier treatment, modeling, and final submission to the Kaggle competition csai-253-project-phase-2.
Repository Structure
```
.
├── cache/                    # Temporary files and CatBoost logs
├── catboost_info/            # TensorBoard event logs for CatBoost training
├── data/
│   ├── train.csv             # Training data
│   ├── test.csv              # Test data
│   ├── phase2_students_before_cleaning.csv
│   ├── sample_submission.csv
│   └── Our_Competition_Submission.csv   # Final Kaggle submission (Private 0.9163 / Public 0.9146)
├── figures/
│   ├── class_distribution.png
│   ├── correlation_matrix.png
│   └── feature_importance.png
├── imbalance_analysis/       # Imbalance diagnostics and plots
├── Models/                   # Model artifacts and notebooks
│   ├── scaler.joblib
│   ├── selector.joblib
│   ├── xgb_model.joblib
│   ├── stacking_model.joblib
│   └── *.ipynb
├── notebooks/                # Data profiling, cleaning, and preprocessing notebooks
│   ├── data_profilling.ipynb
│   ├── Feature_Descriptions.ipynb
│   ├── handling_duplicates.ipynb
│   ├── handling_imbalance.ipynb
│   ├── handling_outliers.ipynb
│   ├── model.ipynb
│   ├── scaling.ipynb
│   └── ydata_profiling_code.ipynb
├── Report/                   # PDF reports on methodology and rationale
│   ├── Columns Report.pdf
│   ├── Encoding Techniques.pdf
│   ├── Feature Descriptions & Preprocessing Report.pdf
│   ├── Feature Engineering Report.pdf
│   ├── [FINAL] PHASE 2 REPORT.pdf
│   ├── Handling Duplicates.pdf
│   ├── Handling Outliers.pdf
│   ├── Models Scaling.pdf
│   ├── Numerical Features Skewness Report.pdf
│   ├── Proper Treatment of Test Data in SMOTE Workflows.pdf
│   ├── Why You Should Split Your Data Before Correcting Skewness.pdf
│   └── skewness_report.txt
├── LICENSE
└── README.md
```
Getting Started
Clone the repository
```bash
git clone https://github.com/amr-yasser226/intrusion-detection-kaggle.git
cd intrusion-detection-kaggle
```
Dependencies
A typical environment includes:
- Python 3.8+
- pandas, numpy, scikit-learn, xgboost, lightgbm, catboost, imbalanced-learn, ydata-profiling, optuna, matplotlib, seaborn, joblib, pdfkit
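The repository does not list a pinned requirements file, so a plain pip install of the packages above is a reasonable starting point (adjust versions as needed):
```bash
pip install pandas numpy scikit-learn xgboost lightgbm catboost \
    imbalanced-learn ydata-profiling optuna matplotlib seaborn joblib pdfkit
```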
- Data
  - Place `train.csv` and `test.csv` in `/data`.
  - Inspect `phase2_students_before_cleaning.csv` for the raw, uncleaned data.
- Exploratory Analysis & Profiling
  - Run `notebooks/data_profilling.ipynb` to generate profiling reports (a minimal sketch follows below).
  - Visualize distributions, skewness, and correlations.
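A minimal sketch of that profiling step, assuming the file layout above (the notebook's actual configuration may differ):
```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the training data placed in /data
df = pd.read_csv("data/train.csv")

# Generate an HTML report covering distributions, skewness,
# correlations, and missing values
ProfileReport(df, title="Train Data Profiling").to_file("train_profile.html")
```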
- Preprocessing Pipelines (see the condensed sketch after this list)
  - Deduplication: `handling_duplicates.ipynb` explores direct removal, weighting, and train–test aware grouping.
  - Skew correction: log1p, Yeo–Johnson, Box–Cox, always fit on the training split only (see `Why You Should Split Your Data Before Correcting Skewness.pdf`).
  - Outlier treatment: winsorization, Z-score, isolation forest (`handling_outliers.ipynb`).
  - Scaling: Standard, MinMax, Robust, Quantile (`scaling.ipynb`).
  - Imbalance handling: SMOTE, SMOTE-Tomek, class weights, EasyEnsemble, RUSBoost (`handling_imbalance.ipynb`).
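A condensed sketch of the train-only fitting discipline these steps share. The target column name `label` and the quantile thresholds are illustrative assumptions, not the notebooks' exact choices:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

df = pd.read_csv("data/train.csv")

# Deduplicate first, so identical rows cannot land on both sides
# of the split and leak information
df = df.drop_duplicates()

# Split BEFORE fitting any transform (see "Why You Should Split
# Your Data Before Correcting Skewness")
X = df.drop(columns=["label"])   # "label" is an assumed target column name
y = df["label"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val = X_train.copy(), X_val.copy()

num_cols = X_train.select_dtypes(include=np.number).columns

# Skew correction: fit Yeo-Johnson on the training split only,
# then apply the already-fitted transform to the validation split
pt = PowerTransformer(method="yeo-johnson")
X_train[num_cols] = pt.fit_transform(X_train[num_cols])
X_val[num_cols] = pt.transform(X_val[num_cols])

# Outlier treatment: winsorize using training-split quantiles
lo, hi = X_train[num_cols].quantile(0.01), X_train[num_cols].quantile(0.99)
X_train[num_cols] = X_train[num_cols].clip(lo, hi, axis=1)
X_val[num_cols] = X_val[num_cols].clip(lo, hi, axis=1)
```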
- Feature Engineering
  - Additional features (e.g. `rate_ratio`, `avg_pkt_size`, `burstiness`, `payload_entropy`, and time-cyclic features) are described in `Feature Engineering Report.pdf` and implemented in `scaling.ipynb` (an illustrative sketch follows below).
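A sketch of how such features might be computed. The source columns used here (`fwd_rate`, `bwd_rate`, `total_bytes`, `pkt_count`, `iat_mean`, `iat_std`, `hour`) are hypothetical placeholders, not the dataset's actual schema; `payload_entropy` is omitted since it requires raw payload bytes:
```python
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of the engineered features named above.
    All source column names are hypothetical."""
    out = df.copy()
    eps = 1e-9  # guard against division by zero
    # Ratio of forward to backward flow rate
    out["rate_ratio"] = out["fwd_rate"] / (out["bwd_rate"] + eps)
    # Average packet size over the flow
    out["avg_pkt_size"] = out["total_bytes"] / (out["pkt_count"] + eps)
    # Burstiness: inter-arrival-time variability relative to its mean
    out["burstiness"] = out["iat_std"] / (out["iat_mean"] + eps)
    # Time-cyclic encoding of hour-of-day
    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)
    return out
```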
- Model Training & Evaluation (a training sketch follows below)
  - XGBoost and stacking in `Model.ipynb` / `Phase_2 model.ipynb`.
  - Hyperparameter tuning and Optuna-based LightGBM/CatBoost ensembles in `data_profilling.ipynb`.
  - Final models saved as `.joblib` in `/Models/`.
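A minimal training-and-persistence sketch consistent with the artifacts in `/Models/`; the hyperparameters shown are placeholders, not the tuned values from the notebooks:
```python
import joblib
import xgboost as xgb
from sklearn.metrics import classification_report

# X_train, y_train, X_val, y_val follow the preprocessing sketch
# above; XGBoost expects integer-encoded class labels
model = xgb.XGBClassifier(
    n_estimators=500,        # placeholder values, not the tuned ones
    max_depth=8,
    learning_rate=0.1,
    objective="multi:softprob",
    eval_metric="mlogloss",
)
model.fit(X_train, y_train)

print(classification_report(y_val, model.predict(X_val)))

# Persist the model the same way the repository's artifacts were saved
joblib.dump(model, "Models/xgb_model.joblib")
```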
- Results & Submission
  - Final private score: 0.916289; public score: 0.914581 on Kaggle.
  - Submission file: `data/Our_Competition_Submission.csv`.
Key Findings
- Deduplicating before splitting prevents leakage and skewed statistics.
- Skew correction must be fit only on the training data to avoid over-optimistic metrics.
- Tree-based models are largely scale-invariant, but scaling benefits pipelines that mix learners.
- Outlier handling (winsorization, isolation forest) improves model robustness.
- Class imbalance was addressed via SMOTE (applied to the training split only) and ensemble methods; a sketch follows after this list.
- XGBoost with tuned hyperparameters achieved the best standalone performance; stacking did not outperform it.
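A minimal sketch of that train-only SMOTE discipline (see also `Proper Treatment of Test Data in SMOTE Workflows.pdf` in `/Report/`); variable names follow the preprocessing sketch above:
```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample ONLY the training split; validation and test data
# must keep their natural class distribution
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("before:", Counter(y_train))
print("after: ", Counter(y_train_res))
```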
How to Reproduce
- Run the notebooks in order within a Jupyter environment, starting with data profiling and ending with `model.ipynb`.
- Generate the figures in `/figures` and `/imbalance_analysis`.
- Train the final models and export `xgb_model.joblib` and `stacking_model.joblib`.
- Create the submission by loading `test.csv`, applying the preprocessing, predicting, and saving `Our_Competition_Submission.csv` (see the sketch after this list).
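A sketch of that submission step, assuming the saved `scaler.joblib`/`selector.joblib` artifacts reproduce the training-time preprocessing; the `id` and `label` column names are assumptions based on typical Kaggle format, not confirmed from `sample_submission.csv`:
```python
import joblib
import pandas as pd

# Load the persisted preprocessing artifacts and the final model
scaler = joblib.load("Models/scaler.joblib")
selector = joblib.load("Models/selector.joblib")
model = joblib.load("Models/xgb_model.joblib")

test = pd.read_csv("data/test.csv")
ids = test["id"]                     # assumed ID column name
X_test = test.drop(columns=["id"])

# Apply the SAME fitted transforms that were used on the training data
X_test = scaler.transform(X_test)
X_test = selector.transform(X_test)

preds = model.predict(X_test)

submission = pd.DataFrame({"id": ids, "label": preds})
submission.to_csv("data/Our_Competition_Submission.csv", index=False)
```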
License
This project is released under the MIT License.
Owner
- Login: amr-yasser226
- Kind: user
- Repositories: 1
- Profile: https://github.com/amr-yasser226
GitHub Events
Total
- Watch event: 2
- Member event: 2
- Push event: 2
- Create event: 3
Last Year
- Watch event: 2
- Member event: 2
- Push event: 2
- Create event: 3
Issues and Pull Requests
Last synced: 7 months ago