PIZZA - a task-oriented semantic parsing dataset

Overview

PIZZA - a task-oriented semantic parsing dataset

The PIZZA dataset continues the exploration of task-oriented parsing by introducing a new dataset for parsing pizza and drink orders, whose semantics cannot be captured by flat slots and intents.

The dataset comes in two main versions, one in a recently introduced utterance-level hierarchical notation that we call TOP, and one whose targets are executable representations (EXR).

Below are two examples of orders that can be found in the data:

{
    "dev.SRC": "five medium pizzas with tomatoes and ham",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 5 ) (SIZE MEDIUM ) (TOPPING HAM ) (TOPPING TOMATOES ) ) )",
    "dev.TOP": "(ORDER (PIZZAORDER (NUMBER five ) (SIZE medium ) pizzas with (TOPPING tomatoes ) and (TOPPING ham ) ) )"
}
{
    "dev.SRC": "i want to order two medium pizzas with sausage and black olives and two medium pizzas with pepperoni and extra cheese and three large pizzas with pepperoni and sausage",
    "dev.EXR": "(ORDER (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (COMPLEX_TOPPING (QUANTITY EXTRA ) (TOPPING CHEESE ) ) (TOPPING PEPPERONI ) ) (PIZZAORDER (NUMBER 2 ) (SIZE MEDIUM ) (TOPPING OLIVES ) (TOPPING SAUSAGE ) ) (PIZZAORDER (NUMBER 3 ) (SIZE LARGE ) (TOPPING PEPPERONI ) (TOPPING SAUSAGE ) ) )",
    "dev.TOP": "(ORDER i want to order (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING sausage ) and (TOPPING black olives ) ) and (PIZZAORDER (NUMBER two ) (SIZE medium ) pizzas with (TOPPING pepperoni ) and (COMPLEX_TOPPING (QUANTITY extra ) (TOPPING cheese ) ) ) and (PIZZAORDER (NUMBER three ) (SIZE large ) pizzas with (TOPPING pepperoni ) and (TOPPING sausage ) ) )"
}

While more details on the dataset conventions and construction can be found in the paper, we give a high level idea of the main rules our target semantics follow:

  • Each order can include any number of pizza and/or drink suborders. These suborders are labeled with the constructors PIZZAORDER and DRINKORDER, respectively.
  • Each top-level order is always labeled with the root constructor ORDER.
  • Both pizza and drink orders can have NUMBER and SIZE attributes.
  • A pizza order can have any number of TOPPING attributes, each of which can be negated. Negative particles can have larger scope with the use of the or particle, e.g., no peppers or onions will negate both peppers and onions.
  • Toppings can be modified by quantifiers such as a lot or extra, a little, etc.
  • A pizza order can have a STYLE attribute (e.g., thin crust style or chicago style).
  • Styles can be negated.
  • Each drink order must have a DRINKTYPE (e.g. coke), and can also have a CONTAINERTYPE (e.g. bottle) and/or a volume modifier (e.g., three 20 fl ounce coke cans).

We view ORDER, PIZZAORDER, and DRINKORDER as intents, and the rest of the semantic constructors as composite slots, with the exception of the leaf constructors, which are viewed as entities (resolved slot values).

Dataset statistics

In the below table we give high level statistics of the data.

Train Dev Test
Number of utterances 2,456,446 348 1,357
Unique entities 367 109 180
Avg entities per utterance 5.32 5.37 5.42
Avg intents per utterance 1.76 1.25 1.28

More details can be found in appendix of our publication.

Getting Started

The repo structure is as follows:

PIZZA
|
|_____ data
|      |_____ PIZZA_train.json.zip             # a zipped version of the training data
|      |_____ PIZZA_train.10_percent.json.zip  # a random subset representing 10% of training data
|      |_____ PIZZA_dev.json                   # the dev portion of the data
|      |_____ PIZZA_test.json                  # the test portion of the data
|
|_____ utils
|      |_____ __init__.py
|      |_____ entity_resolution.py # entity resolution script
|      |_____ express_utils.py     # utilities
|      |_____ semantic_matchers.py # metric functions
|      |_____ sexp_reader.py       # tree reader helper functions 
|      |_____ trees.py             # tree classes and readers
|      |
|      |_____ catalogs             # directory with catalog files of entities
|      
|_____ doc 
|      |_____ PIZZA_dataset_reader_metrics_examples.ipynb
|
|_____ READMED.md

The dev and test data files are json lines files where each line represents one utterance and contains 4 keys:

  • *.SRC: the natural language order input
  • *.EXR: the ground truth target semantic representation in EXR format
  • *.TOP: the ground truth target semantic representation in TOP format
  • *.PCFG_ERR: a boolean flag indicating whether our PCFG system parsed the utterance correctly. See publication for more details

The training data file comes in a similar format, with two differences:

  • there is no train.PCFG_ERR flag since the training data is all synthetically generated hence parsable with perfect accuracy. In other words, this flag would be True for all utterances in that file.
  • there is an extra train.TOP-DECOUPLED key that is the ground truth target semantic representation in TOP-Decoupled format. See publication for more details.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the CC-BY-NC-4.0 License.

Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline

Twitter-News-Summarizer Twitter bot that uses NLP models to summarize news articles referenced in a user's twitter timeline 1.) Extracts all tweets fr

Rohit Govindan 1 Jan 27, 2022
Fastseq 基于ONNXRUNTIME的文本生成加速框架

Fastseq 基于ONNXRUNTIME的文本生成加速框架

Jun Gao 9 Nov 09, 2021
LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation

LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation Tasks | Datasets | LongLM | Baselines | Paper Introduction LOT is a ben

46 Dec 28, 2022
Easy to start. Use deep nerual network to predict the sentiment of movie review.

Easy to start. Use deep nerual network to predict the sentiment of movie review. Various methods, word2vec, tf-idf and df to generate text vectors. Various models including lstm and cov1d. Achieve f1

1 Nov 19, 2021
IEEEXtreme15.0 Questions And Answers

IEEEXtreme15.0 Questions And Answers IEEEXtreme is a global challenge in which teams of IEEE Student members – advised and proctored by an IEEE member

Dilan Perera 15 Oct 24, 2022
中文生成式预训练模型

T5 PEGASUS 中文生成式预训练模型,以mT5为基础架构和初始权重,通过类似PEGASUS的方式进行预训练。 详情可见:https://kexue.fm/archives/8209 Tokenizer 我们将T5 PEGASUS的Tokenizer换成了BERT的Tokenizer,它对中文更

410 Jan 03, 2023
Search with BERT vectors in Solr and Elasticsearch

Search with BERT vectors in Solr and Elasticsearch

Dmitry Kan 123 Dec 29, 2022
Twitter-Sentiment-Analysis - Twitter sentiment analysis for india's top online retailers(2019 to 2022)

Twitter-Sentiment-Analysis Twitter sentiment analysis for india's top online retailers(2019 to 2022) Project Overview : Sentiment Analysis helps us to

Balaji R 1 Jan 01, 2022
Natural Language Processing Tasks and Examples.

Natural Language Processing Tasks and Examples With the advancement of A.I. technology in recent years, natural language processing technology has bee

Soohwan Kim 53 Dec 20, 2022
NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

NLPretext packages in a unique library all the text preprocessing functions you need to ease your NLP project.

Artefact 114 Dec 15, 2022
Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Tevatron Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized

texttron 193 Jan 04, 2023
Calibre recipe to convert latest issue of Analyse & Kritik into an ebook

Calibre Recipe für "Analyse & Kritik" Dies ist ein "Recipe" für die Konvertierung der aktuellen Ausgabe der Zeitung Analyse & Kritik in ein Ebook. Es

Henning 3 Jan 04, 2022
A collection of Korean Text Datasets ready to use using Tensorflow-Datasets.

tfds-korean A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. TensorFlow-Datasets를 이용한 한국어/한글 데이터셋 모음입니다. Dataset Catalog |

Jeong Ukjae 20 Jul 11, 2022
Finally, some decent sample sentences

tts-dataset-prompts This repository aims to be a decent set of sentences for people looking to clone their own voices (e.g. using Tacotron 2). Each se

hecko 19 Dec 13, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language This repository contains UA-GEC data and an accompanying Python lib

Grammarly 227 Jan 02, 2023
Mkdocs + material + cool stuff

Modern-Python-Doc-Example mkdocs + material + cool stuff Doc is live here Features out of the box amazing good looking website thanks to mkdocs.org an

Francesco Saverio Zuppichini 61 Oct 26, 2022
Generate vector graphics from a textual caption

VectorAscent: Generate vector graphics from a textual description Example "a painting of an evergreen tree" python text_to_painting.py --prompt "a pai

Ajay Jain 97 Dec 15, 2022
An Analysis Toolkit for Natural Language Generation (Translation, Captioning, Summarization, etc.)

VizSeq is a Python toolkit for visual analysis on text generation tasks like machine translation, summarization, image captioning, speech translation

Facebook Research 409 Oct 28, 2022