SPARQLing Database Queries from Intermediate Question Decompositions

This repo is the implementation of the following paper:

SPARQLing Database Queries from Intermediate Question Decompositions
Irina Saparina and Anton Osokin
In Proceedings of EMNLP'21

[31.05.2022]: We fixed several bugs in the decoding process, in the usage of the GraPPa tokenization (which affects our model), and in the SQL-SQL comparison (which affects the baselines). The current code reproduces the results from the updated version of the paper (arXiv:2109.06162v2).

License

This software is released under the MIT license, which means that you can use the code in any way you want.

Dependencies

Conda env with pytorch 1.9

Create a conda env with PyTorch 1.9 and many other upgraded packages from conda_env_with_pytorch1.9.yaml:

conda env create -n env-torch1.9 -f conda_env_with_pytorch1.9.yaml
conda activate env-torch1.9

Download some NLTK resources, BERT and GraPPa:

mkdir -p third_party

pip install -r requirements.txt && \
pip install entmax && \
python -c "import nltk; nltk.download('stopwords'); nltk.download('punkt')"

python -c "from transformers import AutoModel; AutoModel.from_pretrained('bert-large-uncased-whole-word-masking'); AutoModel.from_pretrained('Salesforce/grappa_large_jnt')"

mkdir -p third_party && \
cd third_party && \
curl https://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip | jar xv
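
After the download step it can help to confirm that CoreNLP actually landed in third_party. The following is a minimal sketch, not part of the repo: it assumes the archive unpacks into a folder named after the zip file and that this folder contains .jar files. The function name corenlp_is_unpacked is hypothetical.

```python
from pathlib import Path

# Assumption: the archive unpacks into a folder named after the zip file.
CORENLP_DIRNAME = "stanford-corenlp-full-2018-10-05"


def corenlp_is_unpacked(third_party: Path) -> bool:
    """Return True if the CoreNLP folder exists and holds at least one .jar file."""
    corenlp_dir = third_party / CORENLP_DIRNAME
    return corenlp_dir.is_dir() and any(corenlp_dir.glob("*.jar"))


if __name__ == "__main__":
    ok = corenlp_is_unpacked(Path("third_party"))
    print("CoreNLP found" if ok else "CoreNLP missing: rerun the curl | jar step")
```

Run it from the repository root so that the relative third_party path resolves.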

Data

We currently provide both Spider and Break inside our repo. Note that the datasets differ from the original ones because we fixed some annotation errors. Download the databases:

bash ./utils/wget_gdrive.sh spider_temp.zip 11icoH_EA-NYb0OrPTdehRWm_d7-DIzWX
unzip spider_temp.zip -d spider_temp
cp -r spider_temp/spider/database ./data/spider
rm -rf spider_temp/
python ./qdmr2sparql/fix_databases.py --spider_path ./data/spider
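
To sanity-check the copied databases, one option is to open each SQLite file and list its tables. This is a sketch under an assumption: it expects the layout of the standard Spider release, data/spider/database/<db_id>/<db_id>.sqlite. The helper names list_sqlite_tables and summarize_databases are hypothetical and do not exist in this repo.

```python
import sqlite3
from pathlib import Path
from typing import Dict, List


def list_sqlite_tables(db_file: Path) -> List[str]:
    """Return the table names stored in one SQLite file, sorted alphabetically."""
    conn = sqlite3.connect(str(db_file))
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def summarize_databases(database_root: Path) -> Dict[str, List[str]]:
    """Map each database id to its tables, assuming <root>/<db_id>/<db_id>.sqlite."""
    summary = {}
    for db_dir in sorted(p for p in database_root.iterdir() if p.is_dir()):
        db_file = db_dir / f"{db_dir.name}.sqlite"
        if db_file.is_file():  # folders without a matching file are skipped
            summary[db_dir.name] = list_sqlite_tables(db_file)
    return summary


if __name__ == "__main__":
    root = Path("data/spider/database")
    if root.is_dir():
        for db_id, tables in summarize_databases(root).items():
            print(db_id, len(tables))
```

An empty or much shorter listing than expected suggests that the unzip or cp step did not complete.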

To reproduce our annotation procedure, see qdmr2sparql/README.md.

To test the qdmr2sparql translator, run qdmr2sparql/test_qdmr2sparql.py.

Experiments

Every experiment has its own config file in text2qdmr/configs/experiments. The pipeline for any model version or dataset is:

python run_text2qdmr.py preprocess experiment_config_file  # preprocess the data
python run_text2qdmr.py train experiment_config_file       # train a model
python run_text2qdmr.py eval experiment_config_file        # evaluate the results

# multiple GPUs on one machine:
export NGPUS=4 # set $NGPUS manually
python -m torch.distributed.launch --nproc_per_node=$NGPUS --use_env --master_port `./utils/get_free_port.sh`  run_text2qdmr.py train experiment_config_file

Note that preprocessing and evaluation execute queries and therefore take some time. To speed up the evaluation, you can install a Virtuoso server (see qdmr2sparql/README_Virtuoso.md).
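
The three single-GPU stages can also be chained from Python so that a failure in one stage stops the rest. This is a minimal sketch, not a script shipped with the repo: build_stage_commands and run_pipeline are hypothetical helpers that only assemble the run_text2qdmr.py invocations shown above.

```python
import subprocess
import sys
from typing import List

# Stage names as used on the run_text2qdmr.py command line above.
STAGES = ("preprocess", "train", "eval")


def build_stage_commands(config_path: str, python: str = sys.executable) -> List[List[str]]:
    """Build one argument list per stage, in pipeline order."""
    return [[python, "run_text2qdmr.py", stage, config_path] for stage in STAGES]


def run_pipeline(config_path: str) -> None:
    """Run the stages one after another from the repository root."""
    for command in build_stage_commands(config_path):
        # check=True raises CalledProcessError as soon as one stage fails
        subprocess.run(command, check=True)


if __name__ == "__main__" and len(sys.argv) == 2:
    run_pipeline(sys.argv[1])
```

Pass the experiment config file as the only argument. Multi-GPU training still needs the torch.distributed.launch command shown above.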

Checkpoints and samples

Examples of model output on the dev and test sets are in model_samples/.

Checkpoints of our best models:

Model name	Dev	Test	Config	Link
grappa-aug	82.0	62.4	text2qdmr/configs/eval-checkpoints/grappa_qdmr_aug.jsonnet	Google Drive
grappa-full_break-aug	81.6	65.3	text2qdmr/configs/eval-checkpoints/grappa_full_break_aug.jsonnet	Google Drive

To reproduce these results, first download the checkpoints and put them into new folders:

mkdir -p logdir/grappa-aug/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1
mv grappa-aug logdir/grappa-aug/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint-00080000

mkdir -p logdir/grappa-full_break-aug/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1
mv grappa-full_break-aug logdir/grappa-full_break-aug/bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1/model_checkpoint-00081000
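
The target paths in the commands above follow one pattern: logdir, model name, a run folder named after the hyperparameters, and a checkpoint file with the step padded to eight digits. The sketch below reproduces that pattern in Python. The helpers checkpoint_destination and install_checkpoint are hypothetical, and the run-folder string is simply copied from the commands above.

```python
import shutil
from pathlib import Path

# Run-folder name copied verbatim from the mkdir/mv commands above.
RUN_DIR = "bs=6,lr=7.4e-04,bert_lr=3.0e-06,end_lr=0e0,att=1"


def checkpoint_destination(logdir: Path, model_name: str, step: int) -> Path:
    """Build the checkpoint path; the step is zero-padded to 8 digits."""
    return logdir / model_name / RUN_DIR / f"model_checkpoint-{step:08d}"


def install_checkpoint(downloaded: Path, logdir: Path, model_name: str, step: int) -> Path:
    """Create the run folder if needed and move the downloaded file into it."""
    target = checkpoint_destination(logdir, model_name, step)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(downloaded), str(target))
    return target
```

For example, install_checkpoint(Path("grappa-aug"), Path("logdir"), "grappa-aug", 80000) mirrors the first mkdir and mv pair.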

Then use the corresponding config file for evaluation:

python run_text2qdmr.py preprocess path_to_config_file
python run_text2qdmr.py eval path_to_config_file

Acknowledgements

The text2qdmr module is based on the RAT-SQL code, the implementation of the ACL'20 paper "RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers" by Wang et al.

The Spider dataset was proposed by Yu et al. in the EMNLP'18 paper "Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task".

The Break dataset was proposed by Wolfson et al. in the TACL paper "Break It Down: A Question Understanding Benchmark".
