https://github.com/awslabs/gap-text2sql
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Science Score: 10.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 11.0%, to scientific vocabulary)
Keywords
Repository
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Basic Info
- Host: GitHub
- Owner: awslabs
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://arxiv.org/abs/2012.10309
- Size: 259 KB
Statistics
- Stars: 104
- Watchers: 4
- Forks: 23
- Open Issues: 25
- Releases: 0
Topics
Metadata Files
README.md
GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training
Code and model from our AAAI 2021 paper
Updates
[2021/02/05] Added support for running the model on your own databases and queries. Check out the notebook.
Abstract
Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large-scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues with existing general-purpose language models when they are applied to text-to-SQL semantic parsers: they fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.
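The MLM objective the abstract refers to corrupts a token sequence and trains the model to recover the hidden tokens. As a rough, self-contained illustration (not the paper's actual implementation, which masks over utterance-schema pairs), the masking step looks like this:

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Replace a random subset of tokens with a mask symbol (MLM-style).

    Returns the corrupted sequence and the indices the model must recover.
    """
    rng = random.Random(seed)
    masked, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            masked[i] = mask_token
            targets.append(i)
    return masked, targets

corrupted, targets = mask_tokens(
    ["show", "names", "of", "singers", "from", "france"], mask_rate=0.3
)
```

A real MLM also replaces some selected tokens with random words or leaves them unchanged; this sketch shows only the core corrupt-and-recover idea.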
Setup
```bash
conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
```
Download the dataset
```bash
pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider
```
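After extraction, Spider's train_spider.json, train_others.json, and dev.json are JSON arrays of examples, each pairing a natural-language question with a SQL query and the id of the target database. A minimal sketch of inspecting one record (the field names follow the Spider release; check them against your download):

```python
import json

# A record shaped like the entries in Spider's train/dev JSON files
# (only the commonly used fields are shown here).
sample = json.loads("""
{
  "db_id": "concert_singer",
  "question": "How many singers do we have?",
  "query": "SELECT count(*) FROM singer"
}
""")

def summarize(example):
    """Return a one-line summary of a Spider-style example."""
    return f'{example["db_id"]}: {example["question"]} -> {example["query"]}'
```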
Build dataset directory
```bash
mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database
```
Download the library
```bash
mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/
```
Start the Stanford library
```bash
pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd
```
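Once started, the server answers HTTP requests on port 8999. A small stdlib sketch of calling its annotate endpoint from Python (the `properties` query parameter follows the CoreNLP server API; the network call is wrapped in a function so nothing runs until the server is actually up):

```python
import json
import urllib.parse
import urllib.request

def corenlp_url(host="localhost", port=8999, annotators="tokenize,ssplit"):
    """Build the CoreNLP server endpoint URL with annotator properties."""
    props = json.dumps({"annotators": annotators, "outputFormat": "json"})
    return f"http://{host}:{port}/?properties={urllib.parse.quote(props)}"

def annotate(text, url=None, timeout=5):
    """POST raw text to a running CoreNLP server; returns parsed JSON."""
    req = urllib.request.Request(
        url or corenlp_url(), data=text.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

url = corenlp_url()
```

For example, `annotate("Show all singers.")` should return a JSON object with tokenized sentences once the server from the commands above is running.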
Download the checkpoint
```bash
mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin
```
Alternatively, you can download them here if you don't have awscli: gap-finetuned-checkpoint and pretrained-checkpoint
```bash
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin
```
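No checksums are published for these artifacts, but recording a digest after download helps catch a truncated transfer before a long preprocessing run. A small stdlib sketch (apply it to the two checkpoint paths above):

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Run it once after each download and compare the digests across machines (or re-downloads) to confirm the files match.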
Preprocess dataset
```bash
python run.py preprocess experiments/spider-configs/gap-run.jsonnet
```
Inference
```bash
python run.py eval experiments/spider-configs/gap-run.jsonnet
```
You then get the inference results and evaluation results at ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval.
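The .infer file is a JSON-lines dump of beam-search predictions. A sketch of pulling out the top predicted SQL per example (the `beams` / `inferred_code` field names are an assumption based on rat-sql's inference output; verify them against one line of your own file):

```python
import json

# One line shaped like a rat-sql .infer record: a list of beams per
# example, where "inferred_code" holds the predicted SQL (assumed names).
line = json.dumps({
    "index": 0,
    "beams": [{"inferred_code": "SELECT count(*) FROM singer", "score": -0.1}],
})

def top_prediction(infer_line):
    """Return the first (highest-ranked) predicted SQL from one .infer line."""
    record = json.loads(infer_line)
    beams = record.get("beams", [])
    return beams[0]["inferred_code"] if beams else None
```

Mapping these lines back to the questions in dev.json gives a quick qualitative view of the parser's output before reading the .eval scores.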
Training
```bash
python run.py train experiments/spider-configs/gap-run.jsonnet
```
Security
See CONTRIBUTING for more information.
License
This project is licensed under the Apache-2.0 License.
Owner
- Name: Amazon Web Services - Labs
- Login: awslabs
- Kind: organization
- Location: Seattle, WA
- Website: http://amazon.com/aws/
- Repositories: 914
- Profile: https://github.com/awslabs
AWS Labs
GitHub Events
Total
- Watch event: 5
Last Year
- Watch event: 5
Issues and Pull Requests
Last synced: almost 2 years ago
All Time
- Total issues: 33
- Total pull requests: 5
- Average time to close issues: about 1 month
- Average time to close pull requests: about 2 months
- Total issue authors: 28
- Total pull request authors: 2
- Average comments per issue: 2.27
- Average comments per pull request: 0.2
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 2
Past Year
- Issues: 3
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: 6 months
- Issue authors: 3
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.5
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 2
Top Authors
Issue Authors
- JoaoLages (3)
- b4zyuvaraj (2)
- alan-ai-learner (2)
- DHms2020 (2)
- rahuls321 (1)
- 489597448 (1)
- dennissm (1)
- kev2513 (1)
- yaoyiyao-yao (1)
- hclent (1)
- lzw-pku (1)
- ukrcherry (1)
- TheurgicDuke771 (1)
- romapavelko01 (1)
- Mohdwajtech (1)
Pull Request Authors
- Impavidity (3)
- dependabot[bot] (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- bpemb *
- entmax *
- jsonnet *
- networkx *
- nltk *
- pyrsistent *
- stanford-corenlp *
- torchtext *
- transformers ==3.0