https://github.com/azazh/advanced_fraud_detection

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.7%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: Azazh
License: mit
Language: Jupyter Notebook
Default Branch: master
Size: 170 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 1 year ago · Last pushed over 1 year ago

Metadata Files

Readme License

Advanced Fraud Detection

This repository contains the implementation of Advanced Fraud Detection, a project aimed at detecting fraudulent transactions using machine learning techniques. The project focuses on data preprocessing, exploratory data analysis (EDA), feature engineering, and model preparation.

Overview
Features
Dataset
Folder Structure
Installation
Usage
Key Findings
Contributing
Contact
License

Overview

Fraud detection is critical for businesses to minimize financial losses and improve customer trust. This project implements a pipeline for analyzing transaction data, identifying patterns of fraudulent behavior, and preparing the dataset for machine learning models. Key tasks include:

Handling missing values and duplicates
Cleaning and normalizing data
Performing exploratory data analysis (EDA)
Engineering features such as transaction frequency and time-to-action
Preparing the dataset for downstream modeling

Features

Data Preprocessing: Handles missing values, removes duplicates, and corrects data types.
Feature Engineering: Creates meaningful features like time_to_action, transaction_frequency, and geolocation-based features.
Normalization: Scales numerical features for compatibility with machine learning algorithms.
Exploratory Data Analysis (EDA): Provides insights into class imbalance, fraud hotspots, and transaction patterns.
Modular Codebase: Organized structure for scalability and reproducibility.
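
As a rough illustration of the feature engineering and normalization steps listed above, here is a minimal sketch in plain Python. It assumes that time_to_action is the gap between signup_time and purchase_time and that timestamps look like "2015-01-01 00:00:00"; neither detail is confirmed by this README, and the helper names are hypothetical rather than taken from src/preprocessing.py.

```python
from datetime import datetime
from statistics import mean, pstdev

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout


def time_to_action_hours(signup_time: str, purchase_time: str) -> float:
    """Hours elapsed between signup and purchase (assumed definition)."""
    signup = datetime.strptime(signup_time, TIME_FORMAT)
    purchase = datetime.strptime(purchase_time, TIME_FORMAT)
    return (purchase - signup).total_seconds() / 3600.0


def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up values, for illustration only.
hours = time_to_action_hours("2015-01-01 00:00:00", "2015-01-02 12:00:00")
scaled = standardize([10.0, 20.0, 30.0])
```

Z-score scaling is only one option; min-max scaling would fit the same slot if the downstream model expects bounded inputs.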

Dataset

The dataset used in this project consists of transaction data with the following key attributes:

User Information: user_id, signup_time, purchase_time, device_id, age, sex
Transaction Details: purchase_value, ip_address, country
Labels: Binary target variable (class) indicating whether a transaction is fraudulent (1) or legitimate (0).

The dataset is split into:
- Raw Data: Located in data/raw/
- Processed Data: Located in data/processed/
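
To make the column layout concrete, the sketch below loads a tiny made-up CSV with the attributes listed above, drops rows with missing required fields, removes exact duplicates, and converts types. The sample rows, the column order, the timestamp format, and the load_clean helper are all assumptions for illustration; the project's real logic lives in src/preprocessing.py and may differ.

```python
import csv
import io
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout

# Invented sample rows (not real data), using the attribute names above.
SAMPLE_CSV = """user_id,signup_time,purchase_time,purchase_value,device_id,ip_address,country,age,sex,class
1,2015-01-01 10:00:00,2015-01-03 12:30:00,34,DEV_A,192.0.2.1,Japan,39,M,0
1,2015-01-01 10:00:00,2015-01-03 12:30:00,34,DEV_A,192.0.2.1,Japan,39,M,0
2,2015-02-01 08:00:00,2015-02-01 08:00:01,15,DEV_B,192.0.2.2,Canada,53,F,1
3,2015-03-01 09:00:00,,44,DEV_C,192.0.2.3,Canada,41,M,0
"""

# Columns that must be present for a row to be usable (an assumption).
REQUIRED = ("user_id", "signup_time", "purchase_time", "purchase_value", "class")


def load_clean(text):
    """Parse CSV text, skip incomplete and duplicate rows, fix data types."""
    seen = set()
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        if any(not raw[col] for col in REQUIRED):
            continue  # missing value in a required column
        key = tuple(raw.values())
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        rows.append({
            **raw,
            "user_id": int(raw["user_id"]),
            "signup_time": datetime.strptime(raw["signup_time"], TIME_FORMAT),
            "purchase_time": datetime.strptime(raw["purchase_time"], TIME_FORMAT),
            "purchase_value": float(raw["purchase_value"]),
            "class": int(raw["class"]),
        })
    return rows


records = load_clean(SAMPLE_CSV)
```

Here the duplicated row for user 1 and the row with an empty purchase_time are discarded, leaving two cleaned records.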

Folder Structure

advanced_fraud_detection/
├── README.md                      # Project overview and setup instructions
├── CONTRIBUTING.md                # Guidelines for contributing to the project
├── LICENSE                        # License file (MIT)
├── CHANGELOG.md                   # Tracks changes, updates, and version history
├── .gitignore                     # Specifies files and directories to ignore in version control
├── requirements.txt               # Lists Python dependencies for the project
├── requirements-dev.txt           # Lists development-specific dependencies (e.g., pytest, flake8)
├── environment.yml                # Conda environment configuration (optional, if using Conda)
├── pyproject.toml                 # Configuration for packaging and linting tools (e.g., Black, isort)
├── setup.py                       # Package setup file for distributing the project as a Python package
├── tests/                         # Unit tests and integration tests for the project
│   ├── unit/                      # Unit tests for individual components
│   └── integration/               # Integration tests for workflows and pipelines
├── src/                           # Source code for the project
│   ├── __init__.py                # Makes the src directory a Python package
│   ├── config.py                  # Configuration settings (e.g., file paths, hyperparameters)
│   ├── preprocessing.py           # Data cleaning, feature engineering, and transformation logic
│   ├── models.py                  # Model training, evaluation, and prediction logic
│   ├── utils.py                   # Helper/utility functions (e.g., logging, visualization)
│   └── pipeline.py                # End-to-end pipeline orchestration (data -> model -> deployment)
├── scripts/                       # Scripts for running workflows (e.g., data ingestion, model training)
│   ├── __init__.py                # Makes the scripts directory a Python package
│   ├── train_model.py             # Script to train and save the model
│   ├── evaluate_model.py          # Script to evaluate the model on test data
│   └── deploy_model.py            # Script for deploying the model (e.g., as an API)
├── notebooks/                     # Jupyter notebooks for exploratory data analysis (EDA) and experimentation
│   ├── EDA.ipynb                  # Exploratory data analysis notebook
│   ├── feature_engineering.ipynb  # Feature engineering experiments
│   └── model_experiments.ipynb    # Model training and evaluation experiments
├── data/                          # Raw, processed, and intermediate data
│   ├── raw/                       # Raw datasets (e.g., Fraud_Data.csv, creditcard.csv)
│   ├── processed/                 # Processed datasets after cleaning and feature engineering
│   └── interim/                   # Intermediate data files (optional, for debugging)
├── models/                        # Saved models and related artifacts
│   ├── trained_models/            # Final trained models (e.g., .pkl or .joblib files)
│   └── metrics/                   # Evaluation metrics (e.g., JSON or CSV files)
├── logs/                          # Logs for debugging and monitoring
│   ├── training_logs/             # Logs generated during model training
│   └── deployment_logs/           # Logs generated during model deployment
└── assets/                        # Static assets like images, diagrams, or visualizations

Installation

Prerequisites

Python 3.9+
Git

Steps

Clone the repository:
    git clone https://github.com/Azazh/advanced_fraud_detection.git
    cd advanced_fraud_detection
Install dependencies:
    pip install -r requirements.txt
    pip install -r requirements-dev.txt
(Optional) Set up a Conda environment:
    conda env create -f environment.yml
    conda activate advanced_fraud_detection

Usage

Run the Preprocessing Pipeline

To preprocess the raw data and generate the processed dataset:
    python scripts/preprocess_data.py

Perform Exploratory Data Analysis (EDA)

Open the notebooks/EDA.ipynb notebook to analyze the dataset and visualize key insights.

Train a Model

To train a machine learning model:
    python scripts/train_model.py

Evaluate the Model

To evaluate the trained model:
    python scripts/evaluate_model.py

Key Findings

Class Imbalance:
- Fraudulent transactions account for 9.37% of the dataset.
- Techniques like SMOTE or class weighting will be required during modeling.
Geolocation Insights:
- High fraud rates are observed in countries such as Nigeria, Russia, and Vietnam.
Time-to-Action:
- Fraudulent transactions have a much shorter time-to-action (673.29 hours) than legitimate transactions (1,370.01 hours).
Transaction Frequency Issue:
- The transaction_frequency column currently shows 0.00 for all users, indicating a flaw in the calculation logic.
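
Two of the findings above lend themselves to a short sketch. The first function computes class weights with the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula scikit-learn documents for class_weight="balanced". The second shows one possible way to compute a non-zero transaction frequency by counting transactions per key. This README does not show the flawed calculation, so this is not a patch for the repository's code; grouping by device_id and both function names are assumptions made for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


def transaction_frequency(records, key="device_id"):
    """Per-row count of transactions sharing the same key (assumed definition)."""
    counts = Counter(r[key] for r in records)
    return [counts[r[key]] for r in records]


# Made-up labels with a 10% positive class, for illustration only.
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)

# Made-up transactions: device "A" appears twice, device "B" once.
sample = [{"device_id": "A"}, {"device_id": "B"}, {"device_id": "A"}]
freq = transaction_frequency(sample)
```

With these inputs the minority class receives a weight of 5.0 and the frequencies come out as 2, 1, 2. If the real column is all zeros, a check worth making is whether the count is being divided by a much larger quantity or computed on a key that is unique per row.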

Contributing

We welcome contributions! Please follow these steps:

Fork the repository.
Create a new branch (git checkout -b feature/your-feature).
Commit your changes (git commit -m "Add your feature").
Push to the branch (git push origin feature/your-feature).
Open a pull request.

For more details, refer to CONTRIBUTING.md.

Contact

For questions or feedback, feel free to reach out:

Email: azazhwuletaw@gmail.com
GitHub: @Azazh

License

This project is licensed under the MIT License. See LICENSE for more details.

Owner

Login: Azazh
Kind: user

Repositories: 1
Profile: https://github.com/Azazh

GitHub Events

Total

Push event: 3
Create event: 2

Last Year

Push event: 3
Create event: 2

Dependencies

.github/workflows/ci.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/unittests.yml actions

requirements.txt pypi

Markdown ==3.7
MarkupSafe ==3.0.2
Pygments ==2.19.1
Werkzeug ==3.1.3
absl-py ==2.1.0
asttokens ==3.0.0
astunparse ==1.6.3
certifi ==2025.1.31
charset-normalizer ==3.4.1
comm ==0.2.2
contourpy ==1.3.1
cycler ==0.12.1
debugpy ==1.8.13
decorator ==5.2.1
exceptiongroup ==1.2.2
executing ==2.2.0
flatbuffers ==25.2.10
fonttools ==4.56.0
gast ==0.6.0
google-pasta ==0.2.0
grpcio ==1.70.0
h5py ==3.13.0
idna ==3.10
ipykernel ==6.29.5
ipython ==8.33.0
jedi ==0.19.2
joblib ==1.4.2
jupyter_client ==8.6.3
jupyter_core ==5.7.2
keras ==3.9.0
kiwisolver ==1.4.8
libclang ==18.1.1
markdown-it-py ==3.0.0
matplotlib ==3.10.1
matplotlib-inline ==0.1.7
mdurl ==0.1.2
ml-dtypes ==0.4.1
namex ==0.0.8
nest-asyncio ==1.6.0
numpy ==2.0.2
opt_einsum ==3.4.0
optree ==0.14.1
packaging ==24.2
pandas ==2.2.3
parso ==0.8.4
patsy ==1.0.1
pexpect ==4.9.0
pillow ==11.1.0
platformdirs ==4.3.6
prompt_toolkit ==3.0.50
protobuf ==5.29.3
psutil ==7.0.0
ptyprocess ==0.7.0
pure_eval ==0.2.3
pyparsing ==3.2.1
python-dateutil ==2.9.0.post0
pytz ==2025.1
pyzmq ==26.2.1
requests ==2.32.3
rich ==13.9.4
scikit-learn ==1.6.1
scipy ==1.15.2
seaborn ==0.13.2
six ==1.17.0
stack-data ==0.6.3
statsmodels ==0.14.4
tensorboard ==2.18.0
tensorboard-data-server ==0.7.2
tensorflow ==2.18.0
tensorflow-io-gcs-filesystem ==0.37.1
termcolor ==2.5.0
threadpoolctl ==3.5.0
tornado ==6.4.2
traitlets ==5.14.3
typing_extensions ==4.12.2
tzdata ==2025.1
urllib3 ==2.3.0
wcwidth ==0.2.13
wrapt ==1.17.2
