https://github.com/amr-yasser226/machine-learning-for-network-intrusion-detection

A complete pipeline for network intrusion detection that compares label encoding and one‑hot encoding, with SMOTE resampling, feature selection, and ensemble modeling using scikit‑learn and XGBoost. Built as phase one of our university's "CSAI 253 - Machine Learning" course.

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Keywords

csai-253 cybersecurity cybersecurity-training ensamble-methods feature-engineering imbalanced-learning machine-learning machine-learning-algorithms network-intrusion-detection one-hot-encoding sckit-learn smote tree-based-model xgboost zewailcity
Last synced: 5 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: amr-yasser226
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 6.47 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created 11 months ago · Last pushed 8 months ago
Metadata Files
Readme License

README.md

Machine Learning for Network Intrusion Detection

This repository implements a reproducible pipeline to detect network intrusions, comparing Label Encoding vs. One‑Hot Encoding and culminating in ensemble methods. The notebooks walk through data ingestion, cleaning, feature engineering, model training, and evaluation.
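To make the two encoding strategies concrete, here is a minimal sketch on a toy column. The column name `protocol` and its values are hypothetical and are not taken from the project's dataset, and the notebooks may use different encoder classes. The one-hot variant here drops the first level of each category.

```python
import pandas as pd

# Toy stand-in for a categorical network feature; the column name and values
# are hypothetical and are not taken from the project's dataset.
df = pd.DataFrame({"protocol": ["tcp", "udp", "icmp", "tcp"]})

# Label encoding: each category becomes one integer code
# (categories are sorted, so icmp=0, tcp=1, udp=2).
label_encoded = df["protocol"].astype("category").cat.codes

# One-hot encoding with drop_first=True: one indicator column per category,
# minus the first, so k categories yield k - 1 columns.
one_hot = pd.get_dummies(df["protocol"], prefix="protocol", drop_first=True)

print(label_encoded.tolist())   # [1, 2, 0, 1]
print(list(one_hot.columns))    # ['protocol_tcp', 'protocol_udp']
```

Label encoding keeps a single column but imposes an arbitrary ordering on the categories, while one-hot encoding avoids that ordering at the cost of more columns. That trade-off is what the two notebooks compare.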

Repository Structure

```
.
├── Data
│   └── Project_Phase1_before_cleaning.csv
├── LICENSE
├── .gitattributes
├── .gitignore
├── model
│   ├── Final_(ADDED_LABEL_ENCODING).ipynb
│   └── Final_(One_Hot_Encoding).ipynb
├── REPORT.pdf
└── README.md
```

  • Data/
    Raw CSV dataset (pre‑cleaning).

  • model/Final_(ADDED_LABEL_ENCODING).ipynb
    Full pipeline using Label Encoding

    1. Mount & load data
    2. Missing‑value analysis & outlier handling (IQR + winsorization)
    3. Data‑type corrections & new feature creation
    4. Label encoding + mutual information for feature selection
    5. SMOTE resampling for class imbalance
    6. Training and comparing 5 classifiers (RF, KNN, SVM, Logistic Regression, Decision Tree)
    7. Hyperparameter tuning, feature‑importance filtering, stacking ensembles
    8. Final metrics & recommendation
  • model/Final_(One_Hot_Encoding).ipynb
    Identical pipeline, but uses One‑Hot Encoding (drop‑first) instead of label encoding. Facilitates direct comparison of encoding strategies.

  • REPORT.pdf
    Narrative report with tables, charts, and a concise recommendation.
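Step 2 of the pipeline above (IQR + winsorization) could look roughly like the sketch below. It follows one common reading, capping values at the IQR fences. The helper name `winsorize_iqr`, the multiplier `k=1.5`, and the sample values are assumptions for illustration; the notebooks may combine the two techniques differently.

```python
import numpy as np

def winsorize_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences are pulled back to the nearest fence
    # instead of being dropped, so the row count is preserved.
    return np.clip(values, lower, upper)

# Hypothetical feature with one extreme value.
durations = [1, 2, 3, 4, 5, 6, 7, 8, 100]
clipped = winsorize_iqr(durations)
print(clipped.tolist())  # the outlier 100 is capped at the upper fence, 13.0
```

Capping rather than deleting outliers keeps every record. That can matter when the rare, extreme rows are the attacks the model is meant to detect.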

Key Results

| Encoding       | Best Model    | Accuracy | False Negatives | Notes                             |
|----------------|---------------|---------:|----------------:|-----------------------------------|
| Label Encoding | Random Forest |   99.75% |               0 | Chosen for zero FN in test set    |
| One‑Hot        | XGBoost       |   99.82% |               1 | Slightly higher accuracy but 1 FN |

  • Random Forest (Label Encoding) achieved 99.75% accuracy with 0 false negatives, critical for intrusion detection.
  • XGBoost (One‑Hot Encoding) delivered 99.82% accuracy but incurred 1 false negative.
  • All other models (KNN, SVM, Logistic Regression, Decision Tree) performed competitively but with higher FN rates.
  • Stacking ensembles (RF/DT/SVM) did not improve upon a single Random Forest for zero-FN performance.
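The results above rank models by false negatives rather than accuracy alone. The sketch below shows why, using made-up labels (not the project's data) and a hypothetical helper name: two models can share the same accuracy while only one of them misses an attack.

```python
def accuracy_and_false_negatives(y_true, y_pred, positive=1):
    """Return (accuracy, false-negative count) for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    # A false negative is a true attack (positive) predicted as benign.
    false_negatives = sum(
        t == positive and p != positive for t, p in zip(y_true, y_pred)
    )
    return correct / len(y_true), false_negatives

# Hypothetical labels: 1 = intrusion, 0 = benign.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
model_a = [1, 1, 0, 1, 1, 0, 0, 0]  # one false positive, no false negatives
model_b = [1, 0, 0, 0, 1, 0, 0, 0]  # one false negative, no false positives

print(accuracy_and_false_negatives(y_true, model_a))  # (0.875, 0)
print(accuracy_and_false_negatives(y_true, model_b))  # (0.875, 1)
```

Both toy models score 87.5% accuracy, but only `model_b` lets an intrusion through.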

Quickstart

```bash
git clone https://github.com/amr-yasser226/machine-learning-for-network-intrusion-detection.git
cd machine-learning-for-network-intrusion-detection

python3 -m venv venv
source venv/bin/activate

jupyter lab
```

Open the two notebooks under model/ and run end-to-end.

Dependencies

  • Python 3.8+
  • pandas, numpy, matplotlib, seaborn
  • scikit‑learn, imbalanced‑learn
  • xgboost
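A quick way to confirm that the packages listed above can be imported before opening the notebooks is sketched below. The helper name `missing_packages` is hypothetical. Note that scikit‑learn is imported as `sklearn` and imbalanced‑learn as `imblearn`.

```python
import importlib.util
import sys

# Import names for the packages listed above.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "sklearn", "imblearn", "xgboost"]

def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if sys.version_info < (3, 8):
    print("Python 3.8+ is required")

missing = missing_packages(REQUIRED)
print("Missing packages:", missing or "none")
```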

What This Solves

  • Demonstrates best practices in EDA, feature engineering, and model evaluation
  • Compares two encoding strategies for categorical data
  • Addresses class imbalance with SMOTE
  • Benchmarks multiple classifiers and stacking ensembles
  • Prioritizes zero false negatives—paramount in intrusion detection
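The idea behind SMOTE is to create synthetic minority-class rows by interpolating between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified NumPy illustration of that idea only. Since imbalanced‑learn is listed as a dependency, the notebooks presumably rely on that library's `SMOTE` class instead. The function name, parameters, and sample points here are assumptions.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=2, seed=0):
    """Generate n_new synthetic minority rows by neighbor interpolation."""
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest per sample
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        neighbor = neighbors[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic

# Hypothetical minority-class points (corners of the unit square).
X_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = smote_sketch(X_minority, n_new=6)
print(new_rows.shape)  # (6, 2)
```

Resampling of this kind is normally applied to the training split only, so that synthetic rows do not leak into the test set used for the reported metrics.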

License

This project is released under the MIT License. See LICENSE for details.

Owner

  • Login: amr-yasser226
  • Kind: user

GitHub Events

Total
  • Watch event: 1
  • Push event: 2
  • Public event: 1
Last Year
  • Watch event: 1
  • Push event: 2
  • Public event: 1

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 8
  • Total Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 8
  • Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Amr Yasser 1****6 8

Issues and Pull Requests

Last synced: 7 months ago