GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Last update: Jan 09, 2023

Overview

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Code and model from our AAAI 2021 paper

Updates

[2020/02/05] Support to run the model on own databases and queries. Check out the notebook.

Abstract

Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues of existing general-purpose language models when they are applied to text-to-SQL semantic parsers: fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.

Setup

conda create --name gap-text2sql python=3.7
source activate gap-text2sql
conda install pytorch=1.5 cudatoolkit=10.2 -c pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

Download the dataset

pip install gdown
cd rat-sql-gap
gdown --id 1_AckYkinAnhqmRQtGsQgUKAnTHxxX5J0
unzip spider.zip
bash data/spider/generate.sh ./spider

Build dataset directory

mkdir data/spider-bart
cp ./spider/tables.json data/spider-bart/
cp ./spider/train_spider.json data/spider-bart/
cp ./spider/train_others.json data/spider-bart/
cp ./spider/dev.json data/spider-bart/
ln -s $(pwd)/spider/database data/spider-bart/database

Download the library

mkdir third_party
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
unzip stanford-corenlp-full-2018-10-05.zip -d third_party/

Start the Stanford library

pushd third_party/stanford-corenlp-full-2018-10-05
nohup java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 8999 -timeout 15000 > server.log &
popd

Download the checkpoint

mkdir -p logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/
mkdir ie_dirs
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/gap-finetuned-checkpoint logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000

mkdir -p pretrained_checkpoint
aws s3 cp s3://gap-text2sql-public/checkpoint-artifacts/pretrained-checkpoint pretrained_checkpoint/pytorch_model.bin

Alternatively, you can download them here if you don't have awscli: gap-finetuned-checkpoint and pretrained-checkpoint

curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/gap-finetuned-checkpoint -o logdir/bart_run_1/bs\=12\,lr\=1.0e-04\,bert_lr\=1.0e-05\,end_lr\=0e0\,att\=1/model_checkpoint-00041000
curl https://gap-text2sql-public.s3.amazonaws.com/checkpoint-artifacts/pretrained-checkpoint -o pretrained_checkpoint/pytorch_model.bin

Preprocess dataset

python run.py preprocess experiments/spider-configs/gap-run.jsonnet

Inference

python run.py eval experiments/spider-configs/gap-run.jsonnet

You then get the inference results and evaluation results in the paths:ie_dirs/bart_run_1_true_1-step41000.infer and ie_dirs/bart_run_1_true_1-step41000.eval.

Training

python run.py train experiments/spider-configs/gap-run.jsonnet

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Related tags

Overview

GAP-text2SQL: Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Updates

Abstract

Setup

Download the dataset

Build dataset directory

Download the library

Start the Stanford library

Download the checkpoint

Preprocess dataset

Inference

Training

Security

License

Owner

Amazon Web Services - Labs

Data and code to support "Applied Natural Language Processing" (INFO 256, Fall 2021, UC Berkeley)

Twewy-discord-chatbot - Build a Discord AI Chatbot that Speaks like Your Favorite Character

Machine Psychology: Python Generated Art

CCKS-Title-based-large-scale-commodity-entity-retrieval-top1

TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP

Code for evaluating Japanese pretrained models provided by NTT Ltd.

Words-per-minute - A terminal app written in python utilizing the curses module that tests the user's ability to type

Checking spelling of form elements

Open Source Neural Machine Translation in PyTorch

Code for producing Japanese GPT-2 provided by rinna Co., Ltd.

Code for the paper: Sequence-to-Sequence Learning with Latent Neural Grammars

Open source annotation tool for machine learning practitioners.

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers

Phomber is infomation grathering tool that reverse search phone numbers and get their details, written in python3.

DVC-NLP-Simple-usecase

Large-scale open domain KNOwledge grounded conVERsation system based on PaddlePaddle

Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Final Project Bootcamp Zero

Korean stereoypte detector with TUNiB-Electra and K-StereoSet