Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary
Last synced: 8 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: maddataanalyst
  • License: mit
  • Language: HTML
  • Default Branch: main
  • Size: 2.15 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

HexGIN: An Analysis of Novel Money Laundering Data Using Heterogeneous Graph Isomorphism Networks

Publication Reference

Wójcik, F. (2024). An Analysis of Novel Money Laundering Data Using Heterogeneous Graph Isomorphism Networks. FinCEN Files Case Study. Econometrics. Ekonometria. Advances in Applied Data Analysis, 28(2), 32-49.

Project Overview

This project accompanies the above-mentioned publication and focuses on developing and applying the novel HexGIN (Heterogeneous extension for Graph Isomorphism Network) model to the FinCEN Files case data. The primary goal is to compare HexGIN's performance with existing solutions such as the SAGE-based graph neural network and Multi-Layer Perceptron (MLP), demonstrating its potential advantages in anti-money laundering (AML) systems.

The data/01_raw folder contains the original files made publicly available by the International Consortium of Investigative Journalists (ICIJ) as part of the FinCEN Files investigation. The files, together with the full case description, can be found at the original data source.

Processing Pipeline

The data processing pipeline consists of several stages:

Data Preprocessing Pipeline

  1. Data Collection and Cleaning:
    • Load raw transaction data from the FinCEN Files.
    • Clean the data to handle missing values, remove duplicates, and correct inconsistencies.
  2. Feature Engineering:
    • Transform transaction data into a graph structure.
    • Extract relevant features such as node attributes and edge attributes.
  3. Graph Construction:
    • Construct a heterogeneous graph representing various entities (e.g., individuals, accounts) and their relationships (e.g., transactions).

Experiment Preparation Pipeline

  1. Data Splitting:
    • Split the graph data into training, validation, and test sets ensuring no data leakage between sets.
  2. Normalization and Scaling:
    • Apply normalization and scaling techniques to ensure the data is suitable for model training.
  3. Preparation of Training Data:
    • Format the data into a suitable structure for input into the different models (HexGIN, Graph SAGE, MLP).
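The leakage-free splitting and scaling steps above can be sketched as follows with scikit-learn (a project dependency); the synthetic data, split ratios, and random seeds are assumptions for illustration. The key point is that the scaler is fitted on the training portion only.

```python
# Sketch of leakage-free preparation: split first, then fit the scaler
# only on the training set and apply it to all splits. Data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 transactions, 5 features
y = rng.integers(0, 2, size=100)       # suspicious / not suspicious

# 60/20/20 train/validation/test split, stratified by label.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Fit scaling statistics on the training set only, then apply everywhere,
# so no information from validation/test leaks into training.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```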

Experiment Pipeline

  1. Model Training:
    • Train the HexGIN model on the training data.
    • Also train baseline models (Graph SAGE and MLP) for comparison.
  2. Model Evaluation:
    • Evaluate the models using cross-validation on the training set.
    • Use metrics such as F1 score, precision, and ROC AUC for performance comparison.
  3. Testing:
    • Apply the trained models to the test set and compare their performance.
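The comparison metrics named above (precision, F1 score, ROC AUC) can be computed with scikit-learn as in this toy sketch; the labels and scores are made up and this is not the project's evaluation code.

```python
# Toy example of the evaluation metrics used for model comparison.
from sklearn.metrics import f1_score, precision_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.3, 0.2]   # model scores
y_pred = [int(p >= 0.5) for p in y_prob]  # thresholded labels

print(f"precision = {precision_score(y_true, y_pred):.3f}")  # 1.000
print(f"F1        = {f1_score(y_true, y_pred):.3f}")         # 0.800
print(f"ROC AUC   = {roc_auc_score(y_true, y_prob):.3f}")    # 0.889
```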

The picture below presents a detailed overview of the processing pipeline and the dependencies between its steps.

Pipeline Overview

Model Types

  • HexGIN: A novel extension of Graph Isomorphism Networks capable of handling heterogeneous data.
  • Graph SAGE: A well-established graph neural network model used for inductive node embedding.
  • MLP (Multi-Layer Perceptron): A traditional neural network model that operates on flattened tabular data.
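For orientation, the simplest of the three model families, the MLP baseline, can be sketched in plain PyTorch as below. Layer sizes and architecture details are arbitrary assumptions, not taken from the paper.

```python
# Illustrative MLP baseline over flattened tabular transaction features.
import torch
from torch import nn

class MLPBaseline(nn.Module):
    """Plain feed-forward binary classifier (suspicious vs. not)."""
    def __init__(self, in_dim: int, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # one logit per transaction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

model = MLPBaseline(in_dim=8)
logits = model(torch.randn(5, 8))  # batch of 5 transactions
print(logits.shape)  # torch.Size([5])
```

Unlike HexGIN and Graph SAGE, such a model sees each transaction in isolation and cannot exploit the relational structure of the graph.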

Setup and Installation

Dependencies

This project uses Poetry for dependency resolution and Conda for environment management.

Installation Steps

  1. Clone the repository:

     ```bash
     git clone <repository-url>
     cd <repository-directory>
     ```

  2. Create a Conda environment:

     ```bash
     conda env create -f environment.yml
     conda activate hexgin
     ```

  3. Install dependencies using Poetry:

     ```bash
     pip install poetry
     poetry install
     ```

  4. Run the experiments:

     ```bash
     kedro run
     ```

For your convenience, steps 1-3 can be automated by running the following command:

```bash
sh ./setup_project.sh
```

You will need to activate the environment later via:

```bash
conda activate hexgin
```

Notebooks

  • Compare Results: The compare_results.ipynb notebook provides a detailed comparison of the models' performance, presenting differences between HexGIN, Graph SAGE, and MLP.

  • Models presentation: The models_presentation.ipynb notebook provides a detailed overview of the HexGIN model, Graph SAGE, and MLP, including their architecture and training process.

Running the Project

To run the entire pipeline, use the following command:

```bash
kedro run
```

To visualize the pipeline, use:

```bash
kedro viz
```

Owner

  • Name: Filip Wójcik, PhD
  • Login: maddataanalyst
  • Kind: user
  • Company: Mad data scientist

I’m a professional data scientist and programmer specializing in artificial intelligence and machine learning. I hold a PhD in Economics and Management.

Citation (CITATION.cff)

```yaml
cff-version: 1.2.0
message: "If you use this data, please cite it as below."
authors:
- family-names: "Wójcik"
  given-names: "F."
title: "An Analysis of Novel Money Laundering Data Using Heterogeneous Graph Isomorphism Networks"
version: 28
date-released: 2024-01-01
journal: "Econometrics. Ekonometria. Advances in Applied Data Analysis"
volume: "28"
issue: "2"
start: "32"
end: "49"
```

Dependencies

poetry.lock pypi
  • 201 dependencies
pyproject.toml pypi
  • ruff ^0.5.1 develop
  • captum ^0.7.0
  • ipykernel ^6.29.5
  • kedro ^0.19.3
  • kedro-datasets ^3.0.1
  • kedro-viz ^8.0.1
  • mlflow ^2.11.1
  • openpyxl ^3.1.5
  • pandas ^2.2.1
  • python ^3.11
  • pytorch-lightning ^2.2.1
  • pyvis ^0.3.2
  • scikit-learn ^1.4.1.post1
  • seaborn ^0.13.2
  • tabulate ^0.9.0
  • tensorboard ^2.17.0
  • torch ^2.2.1
  • torch-cluster ^1.6.3+pt23cu121
  • torch-geometric 2.5.3
  • torch-scatter ^2.1.2+pt23cu121
  • torch-sparse ^0.6.18+pt23cu121
  • torch-spline-conv ^1.2.2+pt23cu121
  • torchmetrics ^1.3.1
src/requirements.txt pypi
  • black *
  • flake8 >=3.7.9,<4.0
  • ipython >=7.31.1,<8.0
  • ipython *
  • isort *
  • jupyter *
  • jupyterlab *
  • kedro *
  • kedro-datasets *
  • kedro-telemetry *
  • kedro-viz *
  • nbstripout *
  • pytest *
  • pytest-cov *
  • pytest-mock >=1.7.1,<2.0
src/setup.py pypi
environment.yml conda
  • poetry 1.8.3.*
  • python 3.11.9.*