dmml

https://github.com/yorku-ease/dmml


Repository

Basic Info

Host: GitHub
Owner: yorku-ease
Language: Jupyter Notebook
Default Branch: main
Size: 7.13 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Releases: 0

Created over 1 year ago · Last pushed 12 months ago


DmML: Machine Learning for Data Migration Optimization

Overview

DmML is a machine learning-based performance model designed to enhance data migration efficiency. It predicts key metrics like transfer time and resource allocation, enabling better decision-making and cost savings in large-scale migrations.

Motivation

Traditional data migration methods can be inefficient, requiring manual estimations that are often inaccurate and resource-heavy. DmML leverages ML models to automate predictions, improve scalability, and minimize costs and delays.

Objectives

- Develop an ML model to predict data transfer time and resource usage.
- Validate the model using IBM Db2 as a real-world case study.
- Provide insights for optimizing migration configurations.

Case Study: IBM Db2

IBM Db2 was selected due to its high performance and scalability. The Db2 Migration Service was used to test migration strategies, ensuring secure and efficient data transfer.

Methodology

- Environment Setup – Simulated Db2 instances inside Docker containers.
- Data Generation – Used TPC-H benchmark data and DMBench for migration scenarios.
- Feature Engineering – Focused on data size, system resources (CPU, RAM), and migration parameters.
- Model Training – Tested multiple ML models, with XGBoost emerging as the best performer.
- Validation – Used statistical tests and SHAP analysis to assess model reliability and interpretability.

Explainability & Recommendations

System Configuration

- Allocate sufficient RAM to avoid bottlenecks.
- Increase parallel streams to speed up migration.

Data Optimization

- Use efficient compression methods (e.g., GZIP, LZ4).
- Balance row distribution to improve transfer efficiency.
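
One way to turn recommendations like these into a concrete setting is a what-if sweep: hold a scenario fixed, vary one feature such as num__maxStreams, and compare the predicted transfer times. The sketch below is illustrative only. The names sweep_setting and toy_predict are hypothetical, and toy_predict is an invented stand-in formula, not the trained DmML model; in practice the predict callable would wrap whatever run_model.py uses to produce a prediction.

```python
def sweep_setting(predict, base_row, feature, candidates):
    """Vary one feature over the candidates and return (best_value, predictions).

    predict is any callable that maps a feature row (dict) to a predicted
    transfer time; lower is treated as better.
    """
    predictions = {}
    for value in candidates:
        row = dict(base_row)  # copy so the base scenario is not mutated
        row[feature] = value
        predictions[value] = predict(row)
    best_value = min(predictions, key=predictions.get)
    return best_value, predictions


def toy_predict(row):
    """Hypothetical stand-in for the trained model (NOT the DmML model).

    It only mimics the idea that more streams help until overhead dominates.
    """
    streams = row["num__maxStreams"]
    return row["num__Data size sum (MB)"] / (10.0 * streams) + 2.0 * streams


# Hypothetical scenario values, chosen only for illustration.
base_scenario = {"num__Data size sum (MB)": 1000.0, "num__maxStreams": 1}
best_streams, predicted_times = sweep_setting(
    toy_predict, base_scenario, "num__maxStreams", [1, 2, 4, 8, 16]
)
print(best_streams, predicted_times)
```

The same helper works for RAM or any other numeric feature, since it only assumes a row dictionary and a predict callable.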

Running the Model

To run the trained model, execute the run_model.py script in the ./Run folder. Before running it, edit the script to set the feature values for your scenario. The script then outputs the predicted transfer time for those inputs.

Required Files

The Run folder contains:
- A .json file with the exported model configuration (used by the script).
- A .pkl file containing the trained model (optional, can be used for further analysis or alternative implementations).

By default, the script loads the model from the .json file, but the .pkl file is available if needed for custom use.
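
Because the file names inside the Run folder are not listed here, a quick check that both artifacts are present can save a failed run. The following is a minimal sketch that only looks for files by extension; find_model_files is a hypothetical helper and is not part of the repository.

```python
from pathlib import Path


def find_model_files(run_dir="./Run"):
    """Return the .json and .pkl files found directly inside run_dir.

    The result maps each extension to a sorted list of paths. The lists are
    empty if the folder does not exist or holds no matching files.
    """
    run_path = Path(run_dir)
    found = {".json": [], ".pkl": []}
    if not run_path.is_dir():
        return found
    for suffix in found:
        found[suffix] = sorted(run_path.glob(f"*{suffix}"))
    return found
```

Calling find_model_files() from the repository root should list one .json file for the script to load; the .pkl entry may be ignored unless you need it for custom use.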

Input Features

Modify the script to set appropriate values for your scenario:

Numeric Features

Ensure the units match the model's training data:
- num__Data size sum (MB) – Total data size (MB)
- num__Number of rows sum – Total number of rows
- num__Number of columns sum – Total number of columns
- num__maxStreams – Maximum concurrent data streams
- num__RAM (GB) – Available RAM (GB)
- num__CPU – Number of CPU cores
- num__Disk (GB) – Local disk storage (GB)
- num__External Disk (GB) – External disk storage (GB)

Categorical Features

Choose from the following options:
- compress – 'GZIP', 'LZ4', or 'NO'
- binary – True or False
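
Before editing the script, it can help to validate the values you plan to enter. The sketch below checks a set of inputs against the feature names and options listed above and returns one flat row. It is an assumption-laden illustration: build_feature_row and the example values are hypothetical, and the way run_model.py encodes compress and binary for the model is not described here, so the row keeps them as raw values.

```python
# Feature names are taken from the lists above; everything else is illustrative.
NUMERIC_FEATURES = (
    "num__Data size sum (MB)",
    "num__Number of rows sum",
    "num__Number of columns sum",
    "num__maxStreams",
    "num__RAM (GB)",
    "num__CPU",
    "num__Disk (GB)",
    "num__External Disk (GB)",
)
COMPRESS_OPTIONS = ("GZIP", "LZ4", "NO")


def build_feature_row(numeric, compress, binary):
    """Validate the inputs and return a single feature row as a dict."""
    missing = [name for name in NUMERIC_FEATURES if name not in numeric]
    if missing:
        raise ValueError(f"missing numeric features: {missing}")
    unknown = [name for name in numeric if name not in NUMERIC_FEATURES]
    if unknown:
        raise ValueError(f"unknown numeric features: {unknown}")
    for name in NUMERIC_FEATURES:
        value = numeric[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} must be a number, got {value!r}")
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value!r}")
    if compress not in COMPRESS_OPTIONS:
        raise ValueError(f"compress must be one of {COMPRESS_OPTIONS}")
    if not isinstance(binary, bool):
        raise TypeError("binary must be True or False")

    row = {name: float(numeric[name]) for name in NUMERIC_FEATURES}
    row["compress"] = compress
    row["binary"] = binary
    return row


# Hypothetical example values; replace them with your own scenario.
EXAMPLE_NUMERIC = {
    "num__Data size sum (MB)": 2048,
    "num__Number of rows sum": 1_000_000,
    "num__Number of columns sum": 16,
    "num__maxStreams": 4,
    "num__RAM (GB)": 8,
    "num__CPU": 4,
    "num__Disk (GB)": 100,
    "num__External Disk (GB)": 0,
}
example_row = build_feature_row(EXAMPLE_NUMERIC, compress="LZ4", binary=True)
```

A typo in a feature name or an unsupported compress value then fails early with a clear message instead of producing a misleading prediction.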

After modifying the script with the required values, run:

```bash
python ./Run/run_model.py
```

Owner

Name: yorku-ease
Login: yorku-ease
Kind: organization

Repositories: 1
Profile: https://github.com/yorku-ease

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: >-
  DMML: Machine-learning Performance Model for Data
  Migration
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Hasti
    family-names: Ghaneshirazi
    email: hastighsh@gmail.com
    affiliation: York University
repository-code: 'https://github.com/yorku-ease/DmML.git'
abstract: >-
  A machine learning framework to predict data transfer
  times during data migration tasks. Building on an existing
  data migration benchmarking tool, the project aimed to
  enhance predictive accuracy and streamline resource
  planning.
