rl-db-indexing

Database indexes tuning with agent using reinforcement learning techniques

https://github.com/chotom/rl-db-indexing

Keywords

mysql reinforcement-learning tpch

Last synced: 6 months ago · JSON representation ·

Repository

Database indexes tuning with agent using reinforcement learning techniques

Basic Info

Host: GitHub
Owner: Chotom
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 367 KB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Topics

mysql reinforcement-learning tpch

Created over 4 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

rl-db-indexing

Database indexes tuning with agent using reinforcement learning techniques.\ TPC-H benchmark MySQL environment for reinforcement learning.

Source code for paper: https://doi.org/10.1007/978-3-031-42941-5_45

Setup and training

How to train an agent and set up an environment for TPC-H database.

In dbenv/tpch/tpchtools directory paste tpc-h-tool.zip downloaded from TPC website (tested with version 3.0.0): download link.
Define database size (in GB) and number of parallel streams in db_env/tpch/config.py.

You can check recommended number of streams in TPC-H documentation here (page 96). python SCALE_FACTOR = 0.1 # Should be >=1, however we used 0.1 because of limited resources STREAM_COUNT = 2 # 2 for 1 GB, we also used 2

Follow further instruction:
- Docker:
  - Setup and training with Docker
- without Docker:
  - Server setup without Docker
  - Client setup without Docker

Setup and training with Docker

This should only be used for testing or if you are certain, that you can provide stable server performance.

Start mysql server and client. shell docker compose up -d docker compose up -d --build # if docker images require to be rebuilt or created
Generate data and load database to mysqlserver from client container. ```shell docker compose exec client python3 /project/cli/initiateenvironment.py ```
You can start training by running script in client container shell docker compose exec client python3 /project/cli/run_train.py

Server setup without Docker

Install and configure mysql server (version 8).
Allow remote root connection for database operations during benchmarking (or create different user): ```MySQL CREATE USER 'root'@'CLIENTIP' /* change CLIENTIP to client IP address, or use 'root'@'%' for all addresses / IDENTIFIED WITH cachingsha2password BY '1234'; GRANT ALL PRIVILEGES ON *. TO 'root'@'CLIENTIP'; /* change CLIENTIP */

```

Client setup without Docker

Project currently only supports Linux client, because of DBGen. However, it should be possible to compile DBGen on Windows as well.

Update file .env with MySQL server data (user, password, server IP, port and name of the database, which will be used for benchmarking).
Install gcc, make, python (minimal version is 3.8), pip and virtualenv.
Move project to desired folder (for example /path/to/project).
Create virtual environment (for example inside /path/to/project directory): shell virtualenv venv
Save project path to $PYTHONPATH: shell echo 'export PYTHONPATH="${PYTHONPATH}:/path/to/project/"' >> venv/bin/activate
Activate virtual environment: shell source venv/bin/activate
Install requirements: shell pip install -r requirements.txt
Patch DBGen: shell python cli/patch_dbgen.py
Compile DBGen: shell make -C dbgen
Initiate environment - populate database with generated data: shell python cli/initiate_environment.py

The environment is now ready and you can:

train agent, for example: shell nohup python cli/train_agent.py &> data/train.log &

It is possible to stop training at any time (without need of resetting environment) by sending the script SINGINT (CTRL+C). In case of example way of running training shown above, it can be done with: shell kill -2 PID # change PID to process id (can be established using 'ps' command)
run single benchmark on current database configuration: shell python cli/run_benchmark.py
reset database (without generating data for benchmark again): shell python cli/reset_environment.py
reset index configuration to default state: shell python cli/reset_indexes.py
set specific index configuration (run python cli/set_index.py --help to see column order): shell python cli/set_index.py 100000000000000000000000000000000000000000000

Important notes

DBGen bug

In file BUGS in dbgen directory in official TPC-H Tools, a bug named "Problem #00062" is mentioned, which states the following:

bad update rollover after 1000 refreshes This test uses tpcH scale 0.01. We've encountered an situation in which dbgen doesn't generate the correct data for delete files delete.1000 and above. In particular, file delete.1000 contains keys to be deleted that have never been loaded. Because of this problem, keys that should have been deleted never are causing duplicate unique values to appear in the incremental loads after we cycle from the 4000th incremental update back around starting again with the 1st one.

The bug was closed, supposedly due to an unsupported scale factor. However, according to our observations, the bug still exists and affects every SCALE_FACTOR.

Due to this bug database is constantly growing in size after 999 refresh functions. In order to continue with experiments we had to implement a "fixed" way of generating data.

Our way of "fixing" data uses only 1998 refreshes, after which the database is in its initial state (as opposed to 4000, which would be official number if the bug hadn't existed).

Had there been any solution for that bug, ensure to update MAXREFRESHFILE_INDEX variable in the file db_env/tpch/config.py.

python MAX_REFRESH_FILE_INDEX = 4000 # 1998 changed to 4000

and file cli/initiate_environment.py like so: ```python from db_env.tpch.TpchGenerator import TpchGenerator

if name == 'main': generator = TpchGenerator() generator.resetdb() generator.generatedata() generator.loaddb() generator.generaterefreshdata() # uncomment this line #generator.generatefixedrefreshdata() # comment or remove this line ```

How to confirm the bug

Generate data with DBGen (note officially supported 1 GB SCALE_FACTOR): shell ./dbgen -vf -s 1
Generate refresh data for 4000 refresh function pairs: shell ./dbgen -vf -U 4000 -s 1
Delete files from 1000 onwards do not delete anything, which can be checked by: shell grep -F -f delete.1000 ./orders.tbl*

This demonstrates that, after refreshes 1-999 (which are assumed to be correct), delete files don't actually remove anything. At the same time update files do add records, which causes database to grow in size, which messes up the benchmark.

Database state file

You should never interfere with file path/to/project/data/rfdbindex.txt.

Every executed refresh pair should increment number stored in that file.

After refresh pair number 4000 (or currently after refresh pair 1998) database should be in its initial state (note this has not been fully tested) - rfdbindex.txt should contain 1.

Otherwise, you will lose information about current database state, which may (and sooner or later will) lead to errors and force you to stop training and reset database.

Project structure

agent

Package with agent to train.

data

Directory to store project datafiles and agent data for analysis.

data_analysis

Directory to store scripts used for analysis of agent training data.

db_env

Package with database environment for agent.

shared_utils

Package with utilities functions and variables to use in project.

test

Run tests with code coverage (run pip install -r test/requirements.txt to install test requirements):

shell coverage run -m unittest coverage report

Citing

If you use this repository in your research, please cite: bibtex @inproceedings{10.1007/978-3-031-42941-5_45, title = {Intelligent Index Tuning Using Reinforcement Learning}, author = {Matczak, Micha{\l} and Czocha{\'{n}}ski, Tomasz}, year = 2023, booktitle = {New Trends in Database and Information Systems}, publisher = {Springer Nature Switzerland}, address = {Cham}, pages = {523--534}, isbn = {978-3-031-42941-5}, editor = {Abell{\'o}, Alberto and Vassiliadis, Panos and Romero, Oscar and Wrembel, Robert and Bugiotti, Francesca and Gamper, Johann and Vargas Solar, Genoveva and Zumpano, Ester}, abstract = {Index tuning in databases is a critical task that can significantly impact database performance. However, the process of manually configuring indexes is often time-consuming and can be inefficient. In this study, we investigate the process of creating indexes in a database using reinforcement learning. Our research aims to develop an agent that can learn to make optimal decisions for configuring indexes in a chosen database. The paper also discusses an evaluation method to measure database performance. The adopted performance test provides necessary documentation, database schema (on which experiments will be performed) and auxiliary tools such as data generator. This benchmark evaluates a selected database management system in terms of loading, querying and processing power of multiple query streams at once. It is a comprehensive test which results, calculated on measured queries time, will be used in the reinforcement learning algorithm. Our results demonstrate that used index technique requires repeatable benchmark with stable environment and high compute power, which cause cost and time demand. The replication package for this paper is available at GitHub: https://github.com/Chotom/rl-db-indexing.} }

Owner

Name: Tomasz Czochański
Login: Chotom
Kind: user
Location: Gdańsk

Website: www.linkedin.com/in/tomasz-czochanski/
Repositories: 3
Profile: https://github.com/Chotom

Citation (CITATION.bib)

@InProceedings{10.1007/978-3-031-42941-5_45,
author="Matczak, Micha{\l}
and Czocha{\'{n}}ski, Tomasz",
editor="Abell{\'o}, Alberto
and Vassiliadis, Panos
and Romero, Oscar
and Wrembel, Robert
and Bugiotti, Francesca
and Gamper, Johann
and Vargas Solar, Genoveva
and Zumpano, Ester",
title="Intelligent Index Tuning Using Reinforcement Learning",
booktitle="New Trends in Database and Information Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="523--534",
abstract="Index tuning in databases is a critical task that can significantly impact database performance. However, the process of manually configuring indexes is often time-consuming and can be inefficient. In this study, we investigate the process of creating indexes in a database using reinforcement learning. Our research aims to develop an agent that can learn to make optimal decisions for configuring indexes in a chosen database. The paper also discusses an evaluation method to measure database performance. The adopted performance test provides necessary documentation, database schema (on which experiments will be performed) and auxiliary tools such as data generator. This benchmark evaluates a selected database management system in terms of loading, querying and processing power of multiple query streams at once. It is a comprehensive test which results, calculated on measured queries time, will be used in the reinforcement learning algorithm. Our results demonstrate that used index technique requires repeatable benchmark with stable environment and high compute power, which cause cost and time demand. The replication package for this paper is available at GitHub: https://github.com/Chotom/rl-db-indexing.",
isbn="978-3-031-42941-5"
}

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

rl-db-indexing

Science Score: 31.0%

Keywords

Repository

Basic Info

Statistics

Topics

Metadata Files

README.md

rl-db-indexing

Setup and training

Setup and training with Docker

Server setup without Docker

```

Client setup without Docker

Important notes

DBGen bug

How to confirm the bug

Database state file

Project structure

agent

data

data_analysis

db_env

shared_utils

test

Citing

Owner

Citation (CITATION.bib)

GitHub Events

Total

Last Year