graph_magic_conv
This repository accompanies the research paper titled "Money Laundering Detection with Multi-Aggregation Custom Edge GIN Networks," which is currently under review.
Science Score: 57.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in README
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (9.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: maddataanalyst
- License: bsd-3-clause
- Language: Jupyter Notebook
- Default Branch: main
- Size: 5.99 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Anti-Money Laundering MAGIC
This repository accompanies the research paper "Money Laundering Detection with Multi-Aggregation Custom Edge GIN":
Wójcik, F. (2025). Money Laundering Detection with Multi-Aggregation Custom Edge GIN. Journal of Data Science, 1-19. doi:10.6339/25-JDS1190
From the paper's abstract:
Detecting illicit transactions in Anti-Money Laundering (AML) systems remains a significant challenge due to class imbalances and the complexity of financial networks. This study introduces the Multiple Aggregations for Graph Isomorphism Networks with Custom Edges (MAGIC) convolution, an enhancement of the Graph Isomorphism Network (GIN) designed to improve the detection of illicit transactions in AML systems. MAGIC integrates edge convolution (GINE Conv) and multiple learnable aggregations, allowing for varied embedding sizes and increased generalization capabilities.
Experiments were conducted using synthetic datasets, which simulate real-world transactions, following the experimental setup of previous studies to ensure comparability. MAGIC, when combined with XGBoost as a link predictor, outperformed existing models in 17 out of 24 metrics, with notable improvements in F1 scores and precision, particularly in datasets with strong class imbalances. In the most imbalanced dataset, MAGIC achieved an F1 score of 84.4% and a precision of 91.8% for the "illicit" class. While MAGIC demonstrated high precision, its recall in certain cases was lower than that of other models, indicating potential areas for future enhancement.
Overall, MAGIC presents a robust approach to AML detection, particularly in scenarios where precision and overall quality are critical. Future research should focus on optimizing the model's recall, potentially by incorporating additional regularization techniques or advanced sampling methods. Additionally, exploring the integration of foundation models like GraphAny could further enhance the model's applicability in diverse AML environments.
The study design and assumptions
A persistent challenge in Anti-Money Laundering research is the scarcity of standardized datasets for model comparison. This study builds upon the dataset and experimental protocol established by the following study:
Silva, Í. D. G., Correia, L. H. A., & Maziero, E. G. (2023, May). Graph Neural Networks Applied to Money Laundering Detection in Intelligent Information Systems. In Proceedings of the XIX Brazilian Symposium on Information Systems (pp. 252-259).
The authors of that study developed a set of highly imbalanced synthetic datasets that simulate real-world transactions and tested their proposed model on them. The original implementation is accessible via the official GitLab repository.
This project employs the same datasets, train/test splits, and evaluation metrics to ensure the comparability of results and to rigorously evaluate the newly proposed model.
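To illustrate why per-class metrics matter on highly imbalanced data, the following minimal sketch computes precision, recall, and F1 for the "illicit" class. The labels and predictions are made up for illustration, and the helper `precision_recall_f1` is hypothetical; the project's own metric code lives in src/models/metrics.py and may differ.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for a single class, from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up toy data: 1 = "illicit", 0 = "licit"; only 5 of 100 are illicit.
y_true = [1] * 5 + [0] * 95
# Hypothetical classifier: finds 3 of 5 illicit cases, raises 1 false alarm.
y_pred = [1, 1, 1, 0, 0] + [1] + [0] * 94

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy case accuracy looks high even though two of the five illicit transactions are missed, which is why results for the "illicit" class are reported with precision, recall, and F1.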
The model
The proposed model is an extension of the Graph Isomorphism Networks (GIN), enhanced with custom edge convolution and multiple learnable aggregations. This model is referred to as the Multiple Aggregations for Graph Isomorphism Networks with Custom Edges (MAGIC).
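As a rough illustration of the idea only, the sketch below performs one message-passing step in plain Python. It uses the GINE-style message ReLU(x_j + e_ji) and the (1 + eps) self term, and then applies several aggregators in parallel. This is not the paper's layer: the fixed sum/mean/max aggregators, the concatenation of their outputs, and the function name `gine_multi_agg_layer` are all assumptions made for illustration, and the learnable aggregations and the neural network applied after aggregation are omitted. The actual implementation is in src/models/magic.py and is built on PyTorch Geometric.

```python
def relu(vec):
    return [max(0.0, a) for a in vec]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def gine_multi_agg_layer(x, edges, edge_attr, eps=0.0):
    """One simplified step: edge-aware messages, several aggregators.

    x         : dict mapping node id -> feature list
    edges     : list of (source, target) pairs
    edge_attr : list of edge feature lists, aligned with `edges`
    Returns a dict mapping node id -> concatenated [sum | mean | max] vector.
    """
    dim = len(next(iter(x.values())))
    inbox = {node: [] for node in x}
    for (src, dst), e in zip(edges, edge_attr):
        # GINE-style message: ReLU(x_j + e_ji)
        inbox[dst].append(relu(add(x[src], e)))

    updated = {}
    for node, msgs in inbox.items():
        self_term = [(1.0 + eps) * a for a in x[node]]
        if msgs:
            cols = list(zip(*msgs))  # one tuple per feature dimension
            agg_sum = [sum(c) for c in cols]
            agg_mean = [sum(c) / len(c) for c in cols]
            agg_max = [max(c) for c in cols]
        else:
            agg_sum = agg_mean = agg_max = [0.0] * dim
        # Assumption: each aggregator gets its own branch, then concatenate.
        branches = [add(self_term, agg) for agg in (agg_sum, agg_mean, agg_max)]
        updated[node] = [value for branch in branches for value in branch]
    return updated


# Tiny made-up graph: nodes 0 and 1 each send one transaction to node 2.
x = {0: [1.0, 2.0], 1: [0.5, -1.0], 2: [0.0, 0.0]}
edges = [(0, 2), (1, 2)]
edge_attr = [[0.0, 1.0], [0.5, 0.5]]
out = gine_multi_agg_layer(x, edges, edge_attr)
print(out[2])
```

Because each aggregator contributes its own block of features, the output width grows with the number of aggregators, which is one simple way to obtain embeddings of varied size.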
The pseudocode for the model is as follows:

Installation instructions
All commands should be executed from inside the main project directory: Graph_MAGIC_Conv/
There are two primary methods for installing this project:
Local installation
Ensure that Anaconda or Miniconda is installed.
Create a new conda environment using the provided environment.yml file:
conda env create -f env.yml
Activate the environment:
conda activate aml_magic
Running Docker version
Alternatively, you can run the project using Docker. To build the Docker image, execute the following command:
docker compose up
from within the project repository. This will build an image and execute the model training and testing procedure.
Running experiments
All commands should be executed from inside the main project directory: Graph_MAGIC_Conv/
DVC pipeline for training models and reproducing results
The experiments are managed using DVC. To run the experiments, execute the following command:
dvc repro summarize_results --force
The --force flag ignores previous results and recalculates everything from scratch.
This will perform the following steps:
1. Download the raw data from the study by Silva et al.
2. Process the data to create the necessary datasets.
3. Train the GNN model and the accompanying link predictor.
4. Summarize the results.
5. Save the results to the MLFlow tracking server.
6. Generate a summary of the results and LaTeX tables.
DVC pipeline for hyperparameter tuning
To run the hyperparameter tuning experiments, execute the following command:
dvc repro hp_tune --force
The --force flag ignores previous results and recalculates everything from scratch.
This will perform the following steps:
1. Prepare the data.
2. Run the hyperparameter tuning experiments - this may take several hours.
If you don't need to run the hyperparameter tuning, you can skip this step!
Check results
In result files
Only after you run the DVC pipeline can you check the results directly in the result files.
- Directory results/study/ contains a separate folder for each dataset (e.g. amlsim_31_CI_SUMMARY) with two files inside:
  - DATASET_NAME_xgboost_raw.csv - raw scores for each fold and metric for the XGBoost link predictor;
  - DATASET_NAME_xgboost.csv - aggregated summary scores (mean, +/- std. deviation) for each metric after full cross-validation. This is the main point of interest.
- Directory results/comparison/ contains the following files and folders:
  - scores_summary.csv and scores_summary.xlsx - a comparison of the aggregated results for each model (including those from the previous study) and metric. Data from this file was used to report results in the paper.
  - DATASET_NAME_scores.tex - LaTeX tables with aggregated results for each dataset and metric. The code for these tables was used in the paper.
  - figures/ - a folder with detailed comparison boxplots for each dataset, metric, and model (including those from the previous study).
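The relationship between the raw per-fold file and the aggregated file can be sketched as follows. This is a minimal example under stated assumptions: the column names (`fold`, `metric`, `score`) and all values are made up, the real CSV layout may differ, and the sample standard deviation is used here without knowing which variant the project applies.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, stdev

# Made-up stand-in for a DATASET_NAME_xgboost_raw.csv file (hypothetical columns).
raw_csv = io.StringIO(
    "fold,metric,score\n"
    "0,f1,0.80\n"
    "1,f1,0.84\n"
    "2,f1,0.88\n"
    "0,precision,0.90\n"
    "1,precision,0.92\n"
    "2,precision,0.94\n"
)

# Group the per-fold scores by metric name.
scores = defaultdict(list)
for row in csv.DictReader(raw_csv):
    scores[row["metric"]].append(float(row["score"]))

# Aggregate each metric across folds: (mean, sample standard deviation).
summary = {metric: (mean(vals), stdev(vals)) for metric, vals in scores.items()}
for metric, (avg, sd) in summary.items():
    print(f"{metric}: {avg:.3f} +/- {sd:.3f}")
```

To try this on a real result file, replace the `io.StringIO` object with an opened file handle and adjust the column names to match the actual header.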
In the Jupyter notebook
Only after you run the DVC pipeline, you can visualize the results in the Jupyter notebook:
notebooks/summarize_results.ipynb
This notebook allows you to visualize raw scores from each approach, as well as the aggregated results, confidence intervals, etc.
In the MLFlow
Only after you run the DVC pipeline, you can visualize the results in MLFlow:
mlflow ui
This starts the MLflow tracking UI locally (by default at http://127.0.0.1:5000), which you can then open in your browser. You can view the results of each experiment, including the hyperparameters, metrics, and artifacts.
There are two types of experiments in MLFlow, distinguished by their naming convention:
1. amlsimdateset name - summarizes CV results for each metric for the AML Magic model as a whole, providing the mean, std. deviation, and confidence interval after full cross-validation.
2. experimentamlsim**datasetname** - provides details on each CV fold for two phases of model training: (1) GNN embedding; (2) XGBoost link prediction. Therefore, the number of records for each dataset equals the number of folds times 2.
Dependency graph
The diagram below presents the dependency graph of the DVC pipeline. Each stage is represented by a separate node, with arrows indicating the dependencies between stages.
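As a rough textual stand-in, the stage ordering described in this README can be sketched with Python's standard `graphlib` module (Python 3.9+). The stage names are taken from the commands and file names mentioned here, but the exact stage names and edges in dvc.yaml are assumptions and may differ.

```python
from graphlib import TopologicalSorter

# Mapping: stage -> set of stages it depends on.
# Assumed from this README's descriptions, not read from dvc.yaml.
dependencies = {
    "prepare_data": set(),
    "hp_tune": {"prepare_data"},
    "train_and_test_models": {"prepare_data"},
    "summarize_results": {"train_and_test_models"},
}

# A valid execution order: every stage appears after its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Under these assumptions, data preparation always runs first, and hyperparameter tuning sits on its own branch, so it can be skipped without affecting the main training and summary stages.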

Technologies and tools
Dependency Management
Dependencies for this project are managed using the Poetry package manager.
Data Processing and Experiments
Data processing and experiment management are handled by DVC - Data Version Control, while experiment and model tracking are facilitated by MLFlow.
Model Implementation
The model is implemented using the PyTorch Geometric library. The accompanying link predictor (classifier) module is implemented using the XGBoost library.
Project Structure
The directory structure of the project is organized as follows:
├── params.yaml - Global experiment hyperparameters
├── dvc.yaml - Main configuration file for DVC
├── dvc.lock - DVC lock file
├── pyproject.toml - Poetry project file
├── environment.yaml - Conda environment file
├── setup_env.sh - Script for setting up the conda environment and installing the project
│
├── data - Managed by DVC; contains raw and processed data. Subfolders are created by DVC.
│ ├── raw - Raw data files, downloaded from the study by Silva et al.
│ ├── processed - Processed data files, created by the DVC pipeline
│ └── results - Experiment results, saved by MLFlow
│
├── notebooks - Directory containing a Jupyter notebook for checking experiment results
├── stage_params - Directory containing parameters for DVC pipeline stages
│ ├── prepare_data.yaml - Parameters for the data preparation stage
│ ├── gnn_training.yaml - Hyperparameters for GNN and its training stage
│ └── gb_training.yaml - Hyperparameters for the Gradient Boosting training stage
│
├── src - Source code directory
│ ├── consts.py - Constants used in the project
│ ├── utils - Utility functions
│ │ └── configs.py - Utility configuration classes
│ ├── models - Modules related to model training
│ │ ├── magic.py - Main model implementation
│ │ ├── metrics.py - Metrics used in the experiments
│ │ └── training.py - Main training logic
│ ├── stages - DVC pipeline stages
│ │ ├── prepare_data.py - Data preparation stage
│ │ ├── train_and_test_models.py - Model training pipeline
│ │ └── summarize_results.py - Building results summary
└── .github/workflows - GitHub Actions workflows for building and linting the project
Owner
- Name: Filip Wójcik, PhD
- Login: maddataanalyst
- Kind: user
- Company: Mad data scientist
- Website: https://filip-wojcik.com/en
- Repositories: 2
- Profile: https://github.com/maddataanalyst
I’m a professional data scientist and a programmer with specialization in artificial intelligence and machine learning. I hold a PhD in Economics and Management
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Wójcik"
given-names: "Filip"
title: "Money Laundering Detection with Multi-Aggregation Custom Edge GIN"
journal: "Journal of Data Science"
version: 1.0.0
date-released: 2025
pages: "1-19"
doi: "10.6339/25-JDS1190"
issn: "1680-743X"
publisher: "School of Statistics, Renmin University of China"
GitHub Events
Total
- Delete event: 1
- Issue comment event: 2
- Push event: 20
- Pull request event: 4
- Create event: 2
Last Year
- Delete event: 1
- Issue comment event: 2
- Push event: 20
- Pull request event: 4
- Create event: 2
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 1.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- maddataanalyst (1)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- continuumio/miniconda3 24.11.1-0
- autoroot *
- autorootcwd *
- rootutils *
- torcheval *