https://github.com/azazh/advanced_fraud_detection
Repository
Basic Info
- Host: GitHub
- Owner: Azazh
- License: mit
- Language: Jupyter Notebook
- Default Branch: master
- Size: 170 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Advanced Fraud Detection
This repository contains the implementation of Advanced Fraud Detection, a project aimed at detecting fraudulent transactions using machine learning techniques. The project focuses on data preprocessing, exploratory data analysis (EDA), feature engineering, and model preparation.
Table of Contents
- Overview
- Features
- Dataset
- Folder Structure
- Installation
- Usage
- Key Findings
- Contributing
- Contact
- License
Overview
Fraud detection is critical for businesses to minimize financial losses and improve customer trust. This project implements a pipeline for analyzing transaction data, identifying patterns of fraudulent behavior, and preparing the dataset for machine learning models. Key tasks include:
- Handling missing values and duplicates
- Cleaning and normalizing data
- Performing exploratory data analysis (EDA)
- Engineering features such as transaction frequency and time-to-action
- Preparing the dataset for downstream modeling
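The first task in the list above (handling missing values and duplicates) can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the `clean` helper and the sample rows are hypothetical, and the project itself presumably works on pandas DataFrames.

```python
from typing import Optional

# Hypothetical rows; field names follow the Dataset section, values are made up.
rows: list[dict[str, Optional[str]]] = [
    {"user_id": "1", "purchase_value": "34", "age": "39"},
    {"user_id": "1", "purchase_value": "34", "age": "39"},  # exact duplicate
    {"user_id": "2", "purchase_value": None, "age": "53"},  # missing value
    {"user_id": "3", "purchase_value": "15", "age": "41"},
]


def clean(records):
    """Drop rows with missing values, then drop exact duplicates (first one kept)."""
    seen = set()
    cleaned = []
    for record in records:
        # Treat None and empty strings as missing.
        if any(value is None or value == "" for value in record.values()):
            continue
        # A hashable fingerprint of the whole row identifies exact duplicates.
        key = tuple(sorted(record.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned_rows = clean(rows)
print(len(cleaned_rows), "rows kept out of", len(rows))
```

Dropping incomplete rows is only one option; imputing a default value may be preferable when missing data is common.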
Features
- Data Preprocessing: Handles missing values, removes duplicates, and corrects data types.
- Feature Engineering: Creates meaningful features like time_to_action, transaction_frequency, and geolocation-based features.
- Normalization: Scales numerical features for compatibility with machine learning algorithms.
- Exploratory Data Analysis (EDA): Provides insights into class imbalance, fraud hotspots, and transaction patterns.
- Modular Codebase: Organized structure for scalability and reproducibility.
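The README does not say which scaling method the Normalization step uses. As one possibility, the sketch below assumes standardization (zero mean, unit variance); the `standardize` helper and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


purchase_values = [10.0, 20.0, 30.0, 40.0]  # made-up sample values
scaled = standardize(purchase_values)
print([round(z, 3) for z in scaled])
```

To avoid leaking information, the mean and standard deviation should be computed on the training split only and then reused for the test split.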
Dataset
The dataset used in this project consists of transaction data with the following key attributes:
- User Information: user_id, signup_time, purchase_time, device_id, age, sex
- Transaction Details: purchase_value, ip_address, country
- Labels: Binary target variable (class) indicating whether a transaction is fraudulent (1) or legitimate (0).
The dataset is split into:
- Raw Data: Located in data/raw/
- Processed Data: Located in data/processed/
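The time_to_action feature named under Features is most plausibly the gap between signup_time and purchase_time. The sketch below rests on that assumption; the timestamp format and the `time_to_action_hours` helper are also assumptions, not taken from the repository.

```python
from datetime import datetime

# Assumed timestamp layout; adjust to match the raw CSV.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_to_action_hours(signup_time: str, purchase_time: str) -> float:
    """Hours elapsed between signup and purchase."""
    signup = datetime.strptime(signup_time, TIMESTAMP_FORMAT)
    purchase = datetime.strptime(purchase_time, TIMESTAMP_FORMAT)
    return (purchase - signup).total_seconds() / 3600.0


# Made-up example: signup at 08:00, purchase the next day at 20:30.
hours = time_to_action_hours("2015-01-01 08:00:00", "2015-01-02 20:30:00")
print(hours)
```

Hours are used here because the Key Findings section reports time-to-action in hours.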
Folder Structure
advanced_fraud_detection/
├── README.md # Project overview and setup instructions
├── CONTRIBUTING.md # Guidelines for contributing to the project
├── LICENSE # License file (MIT)
├── CHANGELOG.md # Tracks changes, updates, and version history
├── .gitignore # Specifies files and directories to ignore in version control
├── requirements.txt # Lists Python dependencies for the project
├── requirements-dev.txt # Lists development-specific dependencies (e.g., pytest, flake8)
├── environment.yml # Conda environment configuration (optional, if using Conda)
├── pyproject.toml # Configuration for packaging and linting tools (e.g., Black, isort)
├── setup.py # Package setup file for distributing the project as a Python package
├── tests/ # Unit tests and integration tests for the project
│ ├── unit/ # Unit tests for individual components
│ └── integration/ # Integration tests for workflows and pipelines
├── src/ # Source code for the project
│ ├── __init__.py # Makes the src directory a Python package
│ ├── config.py # Configuration settings (e.g., file paths, hyperparameters)
│ ├── preprocessing.py # Data cleaning, feature engineering, and transformation logic
│ ├── models.py # Model training, evaluation, and prediction logic
│ ├── utils.py # Helper/utility functions (e.g., logging, visualization)
│ └── pipeline.py # End-to-end pipeline orchestration (data -> model -> deployment)
├── scripts/ # Scripts for running workflows (e.g., data ingestion, model training)
│ ├── __init__.py # Makes the scripts directory a Python package
│ ├── train_model.py # Script to train and save the model
│ ├── evaluate_model.py # Script to evaluate the model on test data
│ └── deploy_model.py # Script for deploying the model (e.g., as an API)
├── notebooks/ # Jupyter notebooks for exploratory data analysis (EDA) and experimentation
│ ├── EDA.ipynb # Exploratory data analysis notebook
│ ├── feature_engineering.ipynb # Feature engineering experiments
│ └── model_experiments.ipynb # Model training and evaluation experiments
├── data/ # Raw, processed, and intermediate data
│ ├── raw/ # Raw datasets (e.g., Fraud_Data.csv, creditcard.csv)
│ ├── processed/ # Processed datasets after cleaning and feature engineering
│ └── interim/ # Intermediate data files (optional, for debugging)
├── models/ # Saved models and related artifacts
│ ├── trained_models/ # Final trained models (e.g., .pkl or .joblib files)
│ └── metrics/ # Evaluation metrics (e.g., JSON or CSV files)
├── logs/ # Logs for debugging and monitoring
│ ├── training_logs/ # Logs generated during model training
│ └── deployment_logs/ # Logs generated during model deployment
└── assets/ # Static assets like images, diagrams, or visualizations
Installation
Prerequisites
- Python 3.9+
- Git
Steps
Clone the repository:
bash
git clone https://github.com/Azazh/advanced_fraud_detection.git
cd advanced_fraud_detection
Install dependencies:
bash
pip install -r requirements.txt
pip install -r requirements-dev.txt
(Optional) Set up a Conda environment:
bash
conda env create -f environment.yml
conda activate advanced_fraud_detection
Usage
Run the Preprocessing Pipeline
To preprocess the raw data and generate the processed dataset:
bash
python scripts/preprocess_data.py
Perform Exploratory Data Analysis (EDA)
Open the notebooks/EDA.ipynb notebook to analyze the dataset and visualize key insights.
Train a Model
To train a machine learning model:
bash
python scripts/train_model.py
Evaluate the Model
To evaluate the trained model:
bash
python scripts/evaluate_model.py
Key Findings
Class Imbalance:
- Fraudulent transactions account for 9.37% of the dataset.
- Techniques like SMOTE or class weighting will be required during modeling.
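To make the class-weighting option concrete, the sketch below applies the common "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the formula scikit-learn uses for class_weight="balanced". The 10,000-row label list is hypothetical, sized only to reproduce the 9.37% fraud share; `balanced_class_weights` is an illustrative helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label vector: 937 fraud (1) and 9,063 legitimate (0) rows.
labels = [1] * 937 + [0] * 9063
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```

With this split, the fraud class is weighted roughly ten times more heavily than the legitimate class, so misclassified fraud cases cost more during training.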
Geolocation Insights:
- High fraud rates are observed in countries such as Nigeria, Russia, and Vietnam.
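A per-country fraud rate like the one behind this finding can be computed as below. This is a sketch with placeholder country codes and made-up labels; `fraud_rate_by_country` and its `min_count` parameter are hypothetical, not part of the repository.

```python
from collections import defaultdict


def fraud_rate_by_country(records, min_count=1):
    """Share of fraudulent rows per country, skipping countries with too few rows."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for country, label in records:
        totals[country] += 1
        frauds[country] += label  # label is 1 for fraud, 0 for legitimate
    return {
        country: frauds[country] / totals[country]
        for country in totals
        if totals[country] >= min_count
    }


# Placeholder (country, class) pairs.
records = [("A", 1), ("A", 0), ("B", 0), ("B", 0), ("B", 1), ("B", 0)]
rates = fraud_rate_by_country(records)
print(rates)
```

Raising `min_count` guards against countries with only a handful of transactions, where a single fraud case can produce a misleadingly high rate.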
Time-to-Action:
- Fraudulent transactions have a much shorter time-to-action (673.29 hours) than legitimate transactions (1,370.01 hours).
Transaction Frequency Issue:
- The transaction_frequency column currently shows 0.00 for all users, indicating a flaw in the calculation logic.
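The README does not show how transaction_frequency is computed, so the cause of the all-zero column cannot be confirmed here. As a point of comparison, the sketch below uses one hypothetical definition: the number of transactions that share the same device_id. Under it, every row scores at least 1, so an all-zero column would immediately signal a bug.

```python
from collections import Counter


def transaction_frequency(rows, key="device_id"):
    """For each row, count how many rows share its grouping key."""
    counts = Counter(row[key] for row in rows)
    return [counts[row[key]] for row in rows]


# Made-up rows: two users share device d1, one user is alone on d2.
rows = [
    {"user_id": 1, "device_id": "d1"},
    {"user_id": 2, "device_id": "d1"},
    {"user_id": 3, "device_id": "d2"},
]
freq = transaction_frequency(rows)

# Sanity check: a count-based frequency can never be zero for an existing row.
assert all(f >= 1 for f in freq)
print(freq)
```

Grouping by device_id is only an assumption; ip_address or a time-windowed count may match the intended feature better.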
Contributing
We welcome contributions! Please follow these steps:
- Fork the repository.
- Create a new branch (git checkout -b feature/your-feature).
- Commit your changes (git commit -m "Add your feature").
- Push to the branch (git push origin feature/your-feature).
- Open a pull request.
For more details, refer to CONTRIBUTING.md.
Contact
For questions or feedback, feel free to reach out:
- Email: azazhwuletaw@gmail.com
- GitHub: @Azazh
License
This project is licensed under the MIT License. See LICENSE for more details.
Owner
- Login: Azazh
- Kind: user
- Repositories: 1
- Profile: https://github.com/Azazh
GitHub Events
Total
- Push event: 3
- Create event: 2
Last Year
- Push event: 3
- Create event: 2
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- Markdown ==3.7
- MarkupSafe ==3.0.2
- Pygments ==2.19.1
- Werkzeug ==3.1.3
- absl-py ==2.1.0
- asttokens ==3.0.0
- astunparse ==1.6.3
- certifi ==2025.1.31
- charset-normalizer ==3.4.1
- comm ==0.2.2
- contourpy ==1.3.1
- cycler ==0.12.1
- debugpy ==1.8.13
- decorator ==5.2.1
- exceptiongroup ==1.2.2
- executing ==2.2.0
- flatbuffers ==25.2.10
- fonttools ==4.56.0
- gast ==0.6.0
- google-pasta ==0.2.0
- grpcio ==1.70.0
- h5py ==3.13.0
- idna ==3.10
- ipykernel ==6.29.5
- ipython ==8.33.0
- jedi ==0.19.2
- joblib ==1.4.2
- jupyter_client ==8.6.3
- jupyter_core ==5.7.2
- keras ==3.9.0
- kiwisolver ==1.4.8
- libclang ==18.1.1
- markdown-it-py ==3.0.0
- matplotlib ==3.10.1
- matplotlib-inline ==0.1.7
- mdurl ==0.1.2
- ml-dtypes ==0.4.1
- namex ==0.0.8
- nest-asyncio ==1.6.0
- numpy ==2.0.2
- opt_einsum ==3.4.0
- optree ==0.14.1
- packaging ==24.2
- pandas ==2.2.3
- parso ==0.8.4
- patsy ==1.0.1
- pexpect ==4.9.0
- pillow ==11.1.0
- platformdirs ==4.3.6
- prompt_toolkit ==3.0.50
- protobuf ==5.29.3
- psutil ==7.0.0
- ptyprocess ==0.7.0
- pure_eval ==0.2.3
- pyparsing ==3.2.1
- python-dateutil ==2.9.0.post0
- pytz ==2025.1
- pyzmq ==26.2.1
- requests ==2.32.3
- rich ==13.9.4
- scikit-learn ==1.6.1
- scipy ==1.15.2
- seaborn ==0.13.2
- six ==1.17.0
- stack-data ==0.6.3
- statsmodels ==0.14.4
- tensorboard ==2.18.0
- tensorboard-data-server ==0.7.2
- tensorflow ==2.18.0
- tensorflow-io-gcs-filesystem ==0.37.1
- termcolor ==2.5.0
- threadpoolctl ==3.5.0
- tornado ==6.4.2
- traitlets ==5.14.3
- typing_extensions ==4.12.2
- tzdata ==2025.1
- urllib3 ==2.3.0
- wcwidth ==0.2.13
- wrapt ==1.17.2