rl-db-indexing
Database indexes tuning with agent using reinforcement learning techniques
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
○codemeta.json file
-
○.zenodo.json file
-
✓DOI references
Found 5 DOI reference(s) in README -
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (14.7%) to scientific vocabulary
Keywords
Repository
Database indexes tuning with agent using reinforcement learning techniques
Basic Info
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
rl-db-indexing
Database indexes tuning with agent using reinforcement learning techniques.\ TPC-H benchmark MySQL environment for reinforcement learning.
Source code for paper: https://doi.org/10.1007/978-3-031-42941-5_45
Setup and training
How to train an agent and set up an environment for TPC-H database.
In dbenv/tpch/tpchtools directory paste tpc-h-tool.zip downloaded from TPC website (tested with version 3.0.0): download link.
Define database size (in GB) and number of parallel streams in db_env/tpch/config.py.
You can check recommended number of streams in TPC-H documentation here (page 96).
python
SCALE_FACTOR = 0.1 # Should be >=1, however we used 0.1 because of limited resources
STREAM_COUNT = 2 # 2 for 1 GB, we also used 2
- Follow further instruction:
- Docker:
- without Docker:
Setup and training with Docker
This should only be used for testing or if you are certain, that you can provide stable server performance.
Start mysql server and client.
shell docker compose up -d docker compose up -d --build # if docker images require to be rebuilt or createdGenerate data and load database to mysqlserver from client container. ```shell docker compose exec client python3 /project/cli/initiateenvironment.py ```
You can start training by running script in client container
shell docker compose exec client python3 /project/cli/run_train.py
Server setup without Docker
Install and configure mysql server (version 8).
Allow remote root connection for database operations during benchmarking (or create different user): ```MySQL CREATE USER 'root'@'CLIENTIP' /* change CLIENTIP to client IP address, or use 'root'@'%' for all addresses / IDENTIFIED WITH cachingsha2password BY '1234'; GRANT ALL PRIVILEGES ON *. TO 'root'@'CLIENTIP'; /* change CLIENTIP */
```
Client setup without Docker
Project currently only supports Linux client, because of DBGen. However, it should be possible to compile DBGen on Windows as well.
Update file .env with MySQL server data (user, password, server IP, port and name of the database, which will be used for benchmarking).
Install gcc, make, python (minimal version is 3.8), pip and virtualenv.
Move project to desired folder (for example /path/to/project).
Create virtual environment (for example inside /path/to/project directory):
shell virtualenv venvSave project path to $PYTHONPATH:
shell echo 'export PYTHONPATH="${PYTHONPATH}:/path/to/project/"' >> venv/bin/activateActivate virtual environment:
shell source venv/bin/activateInstall requirements:
shell pip install -r requirements.txtPatch DBGen:
shell python cli/patch_dbgen.pyCompile DBGen:
shell make -C dbgenInitiate environment - populate database with generated data:
shell python cli/initiate_environment.py
The environment is now ready and you can:
train agent, for example:
shell nohup python cli/train_agent.py &> data/train.log &It is possible to stop training at any time (without need of resetting environment) by sending the script SINGINT (CTRL+C). In case of example way of running training shown above, it can be done with:
shell kill -2 PID # change PID to process id (can be established using 'ps' command)run single benchmark on current database configuration:
shell python cli/run_benchmark.pyreset database (without generating data for benchmark again):
shell python cli/reset_environment.pyreset index configuration to default state:
shell python cli/reset_indexes.pyset specific index configuration (run
python cli/set_index.py --helpto see column order):shell python cli/set_index.py 100000000000000000000000000000000000000000000
Important notes
DBGen bug
In file BUGS in dbgen directory in official TPC-H Tools, a bug named "Problem #00062" is mentioned, which states the following:
bad update rollover after 1000 refreshes
This test uses tpcH scale 0.01. We've encountered
an situation in which dbgen doesn't generate
the correct data for delete files delete.1000 and
above. In particular, file delete.1000 contains
keys to be deleted that have never been loaded.
Because of this problem, keys that should have been
deleted never are causing duplicate unique values
to appear in the incremental loads after we cycle
from the 4000th incremental update back around starting
again with the 1st one.
The bug was closed, supposedly due to an unsupported scale factor. However, according to our observations, the bug still exists and affects every SCALE_FACTOR.
Due to this bug database is constantly growing in size after 999 refresh functions. In order to continue with experiments we had to implement a "fixed" way of generating data.
Our way of "fixing" data uses only 1998 refreshes, after which the database is in its initial state (as opposed to 4000, which would be official number if the bug hadn't existed).
Had there been any solution for that bug, ensure to update MAXREFRESHFILE_INDEX variable in the file db_env/tpch/config.py.
python
MAX_REFRESH_FILE_INDEX = 4000 # 1998 changed to 4000
and file cli/initiate_environment.py like so: ```python from db_env.tpch.TpchGenerator import TpchGenerator
if name == 'main': generator = TpchGenerator() generator.resetdb() generator.generatedata() generator.loaddb() generator.generaterefreshdata() # uncomment this line #generator.generatefixedrefreshdata() # comment or remove this line ```
How to confirm the bug
Generate data with DBGen (note officially supported 1 GB SCALE_FACTOR):
shell ./dbgen -vf -s 1Generate refresh data for 4000 refresh function pairs:
shell ./dbgen -vf -U 4000 -s 1Delete files from 1000 onwards do not delete anything, which can be checked by:
shell grep -F -f delete.1000 ./orders.tbl*
This demonstrates that, after refreshes 1-999 (which are assumed to be correct), delete files don't actually remove anything. At the same time update files do add records, which causes database to grow in size, which messes up the benchmark.
Database state file
You should never interfere with file path/to/project/data/rfdbindex.txt.
Every executed refresh pair should increment number stored in that file.
After refresh pair number 4000 (or currently after refresh pair 1998) database should be in its initial state (note this has not been fully tested) - rfdbindex.txt should contain 1.
Otherwise, you will lose information about current database state, which may (and sooner or later will) lead to errors and force you to stop training and reset database.
Project structure
agent
Package with agent to train.
data
Directory to store project datafiles and agent data for analysis.
data_analysis
Directory to store scripts used for analysis of agent training data.
db_env
Package with database environment for agent.
shared_utils
Package with utilities functions and variables to use in project.
test
Run tests with code coverage (run pip install -r test/requirements.txt to install test requirements):
shell
coverage run -m unittest
coverage report
Citing
If you use this repository in your research, please cite:
bibtex
@inproceedings{10.1007/978-3-031-42941-5_45,
title = {Intelligent Index Tuning Using Reinforcement Learning},
author = {Matczak, Micha{\l} and Czocha{\'{n}}ski, Tomasz},
year = 2023,
booktitle = {New Trends in Database and Information Systems},
publisher = {Springer Nature Switzerland},
address = {Cham},
pages = {523--534},
isbn = {978-3-031-42941-5},
editor = {Abell{\'o}, Alberto and Vassiliadis, Panos and Romero, Oscar and Wrembel, Robert and Bugiotti, Francesca and Gamper, Johann and Vargas Solar, Genoveva and Zumpano, Ester},
abstract = {Index tuning in databases is a critical task that can significantly impact database performance. However, the process of manually configuring indexes is often time-consuming and can be inefficient. In this study, we investigate the process of creating indexes in a database using reinforcement learning. Our research aims to develop an agent that can learn to make optimal decisions for configuring indexes in a chosen database. The paper also discusses an evaluation method to measure database performance. The adopted performance test provides necessary documentation, database schema (on which experiments will be performed) and auxiliary tools such as data generator. This benchmark evaluates a selected database management system in terms of loading, querying and processing power of multiple query streams at once. It is a comprehensive test which results, calculated on measured queries time, will be used in the reinforcement learning algorithm. Our results demonstrate that used index technique requires repeatable benchmark with stable environment and high compute power, which cause cost and time demand. The replication package for this paper is available at GitHub: https://github.com/Chotom/rl-db-indexing.}
}
Owner
- Name: Tomasz Czochański
- Login: Chotom
- Kind: user
- Location: Gdańsk
- Website: www.linkedin.com/in/tomasz-czochanski/
- Repositories: 3
- Profile: https://github.com/Chotom
Citation (CITATION.bib)
@InProceedings{10.1007/978-3-031-42941-5_45,
author="Matczak, Micha{\l}
and Czocha{\'{n}}ski, Tomasz",
editor="Abell{\'o}, Alberto
and Vassiliadis, Panos
and Romero, Oscar
and Wrembel, Robert
and Bugiotti, Francesca
and Gamper, Johann
and Vargas Solar, Genoveva
and Zumpano, Ester",
title="Intelligent Index Tuning Using Reinforcement Learning",
booktitle="New Trends in Database and Information Systems",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="523--534",
abstract="Index tuning in databases is a critical task that can significantly impact database performance. However, the process of manually configuring indexes is often time-consuming and can be inefficient. In this study, we investigate the process of creating indexes in a database using reinforcement learning. Our research aims to develop an agent that can learn to make optimal decisions for configuring indexes in a chosen database. The paper also discusses an evaluation method to measure database performance. The adopted performance test provides necessary documentation, database schema (on which experiments will be performed) and auxiliary tools such as data generator. This benchmark evaluates a selected database management system in terms of loading, querying and processing power of multiple query streams at once. It is a comprehensive test which results, calculated on measured queries time, will be used in the reinforcement learning algorithm. Our results demonstrate that used index technique requires repeatable benchmark with stable environment and high compute power, which cause cost and time demand. The replication package for this paper is available at GitHub: https://github.com/Chotom/rl-db-indexing.",
isbn="978-3-031-42941-5"
}
GitHub Events
Total
- Watch event: 1
- Fork event: 1
Last Year
- Watch event: 1
- Fork event: 1