https://github.com/amazon-science/nl2sql-omop-cdm
Natural Language to SQL Queries in the OMOP CDM Datasets
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
○CITATION.cff file
-
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (15.1%) to scientific vocabulary
Repository
Natural Language to SQL Queries in the OMOP CDM Datasets
Basic Info
Statistics
- Stars: 11
- Watchers: 1
- Forks: 5
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
NL2SQL: Natural Language to SQL Queries Tool
IMPORTANT: READ SECTION 2.1 TO USE THE UI.
This repo contains the UI tool and ML model development process to convert natural language questions to SQL queries. At the root level, it contains the following folders/files:
setup: Directory that contains scripts/files to setup the development environment.src: All the source code related to the development of the end-to-end ML pipeline for the NL2SQL project.nl2sql_main.ipynb: The UI tool (notebook) to run the model inference and query execution graphically.
In the following sections, we will cover:
- Environment setup
- Getting started
2.1 Using the UI
2.2 Iterating on the underlying ML model
1. Environment setup
Note: The pipeline has been tested in Amazon's SageMaker Linux instance. It assumes that:
- You are in a GPU instance with CUDA 11.0, especially for model training. To setup the conda environment and install the remaining python packages, please change the directory to
setup/and runsetup.shscript. You first need to change the bash file mode if needed. Instructions to complete setup are as follows:
bash
$ /bin/bash cd setup/
$ /bin/bash chmod +x setup.sh
$ /bin/bash ./setup.sh
The setup can take a few minutes. Once finalized, you will have the conda_nl2sql_environment conda environment to run this repo.
2. Getting started
This section covers how this repo is intended to be used.
2.1. Using the UI
To run the UI tool and test the pipeline, it's assumed that you already have the trained model ready and your notebook instance is already connected to the Redshift database. To run the tool, follow the steps below:
* Open src/config.py and UPDATE THE MODEL PATH AND REDSHIFT DATABASE CONFIGURATION.
* Open nl2sql_main.ipynb and change the python kernel into conda_nl2sql_environment.
* Follow the instructions provided in the notebook.
2.2. Iterating on the underlying ML model
This section explains the T5-based model development process (training, inference and evaluation). The NL2SQL problem is defined as a text-to-text problem where the input is the question text and the output is the query template. In this project, T5-based model is fine-tuned with the NL2SQL dataset.
All the code related to the model development is located in src/engine/step4/model_dev/. Change your directory into this path:
bash
$ /bin/bash cd src/engine/step4/model_dev/
Data Preparation
Note: The dataset used in this project is hosted here. Please download the csv files and save them under the data/folded_questions path. In addition to the csv files, don't also forget to download nonzeroargsby_query.json and save it in data/ directory. This file contains sample values for the argumenents in the csv files.
Before you start model training, you need to have the data splits (train/validation/test) ready in CSV format. Each split has at least two columns (unfolded_questions for the input questions and query for the model's output query template.
The data csv files that were downloaded from the provided link contain the folded (compressed) version of the equivalent questions. So, you will need to unfold all the input equivalent questions for it to be ready for model development. Please open the prepare_data.ipynb, update the input and output data paths and run it.
Model Training
In this project, T5 model that was pretrained with WikiSQL dataset was used for model fine-tuning. To fine-tune the WikiSQL pretrained T5 model, you first need to update the model configuration file t5_config.py. Especially, you need to specify the input data_dir and output_dir for input data and model output directory respectively. At least, you need to update the following:
* model path
* database schema name
* Redshift database information (e.g., endpoint, database name, etc.)
Once you update the config file, run the following command to fine-tune the T5 model:
bash
$ /bin/bash python t5_training.py
Note: As the T5 model is big, you will need GPU instances for training.
Model Inference
To test the trained model and run inference for a single input question, you can use t5_inference.py. First open the file and update the model path and input question.
To run the inference script, run the command below:
bash
$ /bin/bash python t5_inference.py
Model Evaluation
Finally, you can also run the inference on the whole validation and test sets and compute the model performance (exact-matching and execution accuracies). You can use t5_evaluation.py to run the whole dataset inference and compute accuracies. First you need to open the file and update the data path, model path and output directory. And then you can run the following command to run the script:
bash
$ /bin/bash python t5_evaluation.py
Please note that the whole data inference is compute intensive. And you may need an instance with multiple GPUs.
Model Performance Metrics
In this project, there are two metrics used to evaluate the model:
Exact-matching (Logical-form) Accuracy: The percentage of the predicted queries having the exact match with the ground truth queries.
Execution Accuracy: The percentage of the predicted queries, when executed, result in the correct result.
To get more info about the NL2SQL tool, please refer to this nl2sql-tool doc.
Security
See CONTRIBUTING for more information.
License
This library is licensed under the CC BY-NC 4.0 License. See the LICENSE file.
Owner
- Name: Amazon Science
- Login: amazon-science
- Kind: organization
- Website: https://amazon.science
- Twitter: AmazonScience
- Repositories: 80
- Profile: https://github.com/amazon-science
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 1
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0