https://github.com/awslabs/gap-text2sql

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Science Score: 10.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.0%) to scientific vocabulary

Keywords

deep-learning language-model machine-learning nlp nlu pretrained-models pytorch semantic-parsing text-generation text2sql
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 104
  • Watchers: 4
  • Forks: 23
  • Open Issues: 25
  • Releases: 0
Created about 5 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Contributing License Code of conduct

README.md

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Code and model from our AAAI 2021 paper

Updates

[2020/02/05] Added support for running the model on your own databases and queries. Check out the notebook.

Abstract

Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues of existing general-purpose language models when they are applied to text-to-SQL semantic parsers: fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.

Setup

```bash
conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"
```

Download the dataset

```bash
pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider
```

Build dataset directory

```bash
mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database
```
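Before preprocessing, it can help to confirm that the directory built above contains everything that was copied or linked into it. The following is a minimal sketch: `missing_spider_files` is a hypothetical helper, not part of this repository, and its list of expected entries is taken directly from the commands above.

```python
from pathlib import Path

# Entries that the commands above place in data/spider-bart.
EXPECTED_ENTRIES = [
    "tables.json",
    "train_spider.json",
    "train_others.json",
    "dev.json",
    "database",  # symlink to ./spider/database
]


def missing_spider_files(root="data/spider-bart"):
    """Return the expected entries that are absent under `root`.

    Hypothetical helper for illustration; not part of gap-text2sql.
    """
    root = Path(root)
    # Path.exists() follows symlinks, so a dangling `database` link
    # is reported as missing.
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_spider_files()
    print("missing:", missing if missing else "nothing")
```

Run it from the rat-sql-gap directory so that the relative default path resolves; an empty result suggests the layout matches the steps above.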

Download the library

```bash
mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/
```

Start the Stanford library

```bash
pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd
```
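Because the server is started in the background, a failed launch (for example, a missing Java installation) only shows up in server.log. The sketch below is one way to check that something is listening on port 8999 before moving on. `corenlp_port_open` is a hypothetical helper, and the host `localhost` is an assumption; only the port number comes from the command above.

```python
import socket


def corenlp_port_open(host="localhost", port=8999, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Hypothetical helper for illustration. It only checks that something
    is listening on the port, not that it is a CoreNLP server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    print("CoreNLP port reachable:", corenlp_port_open())
```

The JVM may need a few seconds to start, so a `False` result immediately after launch is not necessarily an error; retrying after a short pause is reasonable.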

Download the checkpoint

```bash
mkdir -p logdir/bart_run_1/bs=12\,lr=1.0e-04\,bert_lr=1.0e-05\,end_lr=0e0\,att=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs=12\,lr=1.0e-04\,bert_lr=1.0e-05\,end_lr=0e0\,att=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin
```

Alternatively, if you don't have awscli, you can download gap-finetuned-checkpoint and pretrained-checkpoint directly:

```bash
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin
```
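The backslashes in the paths above are shell escapes; the directory name on disk contains plain commas and equals signs. Since a mistyped directory name is easy to miss, the sketch below checks that both checkpoint files ended up where the commands put them. `checkpoint_status` is a hypothetical helper, not part of this repository, and the paths are copied from the download commands.

```python
from pathlib import Path

# Paths copied from the download commands, with shell escapes removed.
RUN_DIR = Path("logdir/bart_run_1") / "bs=12,lr=1.0e-04,bert_lr=1.0e-05,end_lr=0e0,att=1"
FINETUNED = RUN_DIR / "model_checkpoint-00041000"
PRETRAINED = Path("pretrained_checkpoint") / "pytorch_model.bin"


def checkpoint_status(base="."):
    """Map each expected checkpoint path to True if it is a non-empty file.

    Hypothetical helper for illustration; not part of gap-text2sql.
    """
    base = Path(base)
    status = {}
    for rel in (FINETUNED, PRETRAINED):
        path = base / rel
        status[rel.as_posix()] = path.is_file() and path.stat().st_size > 0
    return status


if __name__ == "__main__":
    for name, ok in checkpoint_status().items():
        print("ok     " if ok else "MISSING", name)
```

The size check only guards against empty files left by an interrupted download; it does not verify the checkpoint contents.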

Preprocess dataset

```bash
python run.py preprocess experiments/spider-configs/gap-run.jsonnet
```

Inference

```bash
python run.py eval experiments/spider-configs/gap-run.jsonnet
```

The inference and evaluation results are then written to ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval, respectively.

Training

```bash
python run.py train experiments/spider-configs/gap-run.jsonnet
```

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Owner

  • Name: Amazon Web Services - Labs
  • Login: awslabs
  • Kind: organization
  • Location: Seattle, WA

AWS Labs

GitHub Events

Total
  • Watch event: 5
Last Year
  • Watch event: 5

Issues and Pull Requests

Last synced: almost 2 years ago

All Time
  • Total issues: 33
  • Total pull requests: 5
  • Average time to close issues: about 1 month
  • Average time to close pull requests: about 2 months
  • Total issue authors: 28
  • Total pull request authors: 2
  • Average comments per issue: 2.27
  • Average comments per pull request: 0.2
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 2
Past Year
  • Issues: 3
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: 6 months
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 2
Top Authors
Issue Authors
  • JoaoLages (3)
  • b4zyuvaraj (2)
  • alan-ai-learner (2)
  • DHms2020 (2)
  • rahuls321 (1)
  • 489597448 (1)
  • dennissm (1)
  • kev2513 (1)
  • yaoyiyao-yao (1)
  • hclent (1)
  • lzw-pku (1)
  • ukrcherry (1)
  • TheurgicDuke771 (1)
  • romapavelko01 (1)
  • Mohdwajtech (1)
Pull Request Authors
  • Impavidity (3)
  • dependabot[bot] (2)
Top Labels
Issue Labels
Pull Request Labels
dependencies (2)

Dependencies

rat-sql-gap/requirements.txt pypi
  • bpemb *
  • entmax *
  • jsonnet *
  • networkx *
  • nltk *
  • pyrsistent *
  • stanford-corenlp *
  • torchtext *
  • transformers ==3.0