PyTorch code for EMNLP 2021 paper: Don't be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

Last update: Sep 06, 2022

Overview

Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System

This repository contains the PyTorch implementation and the data of the paper: Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System. Libo Qin, Tianbao Xie, Shijue Huang, Qiguang Chen, Xiao Xu, Wanxiang Che. EMNLP 2021. [PDF]

This code has been written using PyTorch >= 1.1. If you use any source code or the datasets included in this toolkit in your work, please cite the following paper. The BibTeX entry is listed below:

@article{qin2021CIToD,
  title={Don’t be Contradicted with Anything! CI-ToD: Towards Benchmarking Consistency for Task-oriented Dialogue System},
  author={Qin, Libo and Xie, Tianbao and Huang, Shijue and Chen, Qiguang and Xu, Xiao and Che, Wanxiang},
  journal={arXiv preprint arXiv:2109.11292},
  year={2021}
}

Abstract

Consistency Identification has obtained remarkable success on open-domain dialogue, where it can be used to prevent inconsistent response generation. However, in contrast to the rapid development in open-domain dialogue, few efforts have been made in the task-oriented dialogue direction. In this paper, we argue that the consistency problem is more urgent in the task-oriented domain. To facilitate the research, we introduce CI-ToD, a novel dataset for Consistency Identification in Task-oriented Dialog system. In addition, we not only annotate a single label to enable the model to judge whether the system response is contradictory, but also provide more fine-grained labels (i.e., Dialogue History Inconsistency (HI), User Query Inconsistency (QI) and Knowledge Base Inconsistency (KBI), which are shown in the figure below) to encourage the model to know which sources of inconsistency lead to it. Empirical results show that state-of-the-art methods only achieve performance of 51.3%, which is far behind the human performance of 93.2%, indicating that there is ample room for improving consistency identification ability. Finally, we conduct exhaustive experiments and qualitative analysis to comprehend key challenges and provide guidance for future directions.

Dataset

We construct the CI-ToD dataset based on the KVRET dataset. We release our dataset together with the code; you can find it under data.

The basic format of the dataset is as follows. Each record includes a multi-turn dialogue, a knowledge base, and the related inconsistency annotations (KBI, QI, HI):

[
    {
        "id": 74,
        "dialogue": [
            {
                "turn": "driver",
                "utterance": "i need to find out the date and time for my swimming_activity"
            },
            {
                "turn": "assistant",
                "utterance": "i have two which one i have one for the_14th at 6pm and one for the_12th at 7pm"
            }
        ],
        "scenario": {
            "kb": {
                "items": [
                    {
                        "date": "the_11th",
                        "time": "9am",
                        "event": "tennis_activity",
                        "agenda": "-",
                        "room": "-",
                        "party": "father"
                    },
                    {
                        "date": "the_18th",
                        "time": "2pm",
                        "event": "football_activity",
                        "agenda": "-",
                        "room": "-",
                        "party": "martha"
                    },
                    .......
                ]
            },
            "qi": "0",
            "hi": "0",
            "kbi": "0"
        },
        "HIPosition": []
    }
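
The record format above can be read with the standard json module. The following is a minimal sketch, assuming the labels are stored as the strings "0" and "1" under scenario (as in the sample) and that the last turn is the system response being judged. The helper name flatten_record and its output layout are illustrative and not part of the released code; the KB is shortened to one item.

```python
import json

# One record in the format shown above (KB items shortened for illustration).
SAMPLE = """
[
  {
    "id": 74,
    "dialogue": [
      {"turn": "driver",
       "utterance": "i need to find out the date and time for my swimming_activity"},
      {"turn": "assistant",
       "utterance": "i have two which one i have one for the_14th at 6pm and one for the_12th at 7pm"}
    ],
    "scenario": {
      "kb": {"items": [
        {"date": "the_11th", "time": "9am", "event": "tennis_activity",
         "agenda": "-", "room": "-", "party": "father"}
      ]},
      "qi": "0", "hi": "0", "kbi": "0"
    },
    "HIPosition": []
  }
]
"""


def flatten_record(record):
    """Split a record into history, last response, KB rows, and integer labels."""
    turns = record["dialogue"]
    scenario = record["scenario"]
    return {
        "id": record["id"],
        # Assumption: every turn before the last one is dialogue history.
        "history": [t["utterance"] for t in turns[:-1]],
        # Assumption: the last turn is the system response to be checked.
        "response": turns[-1]["utterance"],
        "kb": scenario["kb"]["items"],
        # Labels are stored as strings, so convert them to integers.
        "labels": {name: int(scenario[name]) for name in ("qi", "hi", "kbi")},
    }


records = json.loads(SAMPLE)
example = flatten_record(records[0])
print(example["labels"])
```

Converting the string labels to integers once at load time keeps later metric code free of string comparisons.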

KBRetriever_DC

Dataset	QI	HI	KBI	SUM
calendar_train.json	174	56	177	595
calendar_dev.json	28	9	24	74
calendar_test.json	23	8	21	74
navigate_train.json	453	386	591	1110
navigate_dev.json	55	41	69	139
navigate_test.json	48	44	71	138
weather_new_train.json	631	132	551	848
weather_new_dev.json	81	14	66	106
weather_new_test.json	72	12	69	106
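
Counts like those in the table can be recomputed from a data file. The sketch below assumes that SUM is the number of records in a file; this is an interpretation, since the per-type counts do not add up to SUM (a record may carry several inconsistency labels or none). The function names are illustrative, and the toy records stand in for a real file.

```python
import json
from collections import Counter


def load_records(path):
    """Load the list of records from one CI-ToD JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def count_labels(records):
    """Count records flagged with each inconsistency type, plus the record total."""
    counts = Counter()
    for record in records:
        scenario = record["scenario"]
        for name in ("qi", "hi", "kbi"):
            # Labels are the strings "0" or "1"; adding 0 still creates the key.
            counts[name.upper()] += int(scenario[name])
        counts["SUM"] += 1  # assumption: SUM is the number of records
    return dict(counts)


# Toy records with only the fields that count_labels reads.
toy_records = [
    {"scenario": {"qi": "1", "hi": "0", "kbi": "1"}},
    {"scenario": {"qi": "0", "hi": "0", "kbi": "0"}},
    {"scenario": {"qi": "0", "hi": "1", "kbi": "1"}},
]
stats = count_labels(toy_records)
print(stats)
```

For a real file, pass the result of load_records to count_labels and compare the output with the matching table row.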

Model

Here is the model structure of the non-pre-trained model (a) and the pre-trained models (b and c).

Preparation

We provide some pre-trained baselines on our proposed CI-ToD dataset. The packages we used are listed as follows:

-- scikit-learn==0.23.2
-- numpy=1.19.1
-- pytorch=1.1.0
-- fitlog==0.9.13
-- tqdm=4.49.0
-- sklearn==0.0
-- transformers==3.2.0

We highly recommend using Anaconda to manage your Python environment. If you do, you can run the following command directly in the terminal to create the environment:

conda env create -f py3.6pytorch1.1_.yaml
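
After creating the environment, it can help to compare the installed versions with the pins listed above. The sketch below is one way to do that and is not part of the repository. It accepts both the "=" and "==" separators used in the list, and it maps the conda package name pytorch to the installed distribution name torch.

```python
import re

# The pins as listed above.
PINS = """
-- scikit-learn==0.23.2
-- numpy=1.19.1
-- pytorch=1.1.0
-- fitlog==0.9.13
-- tqdm=4.49.0
-- sklearn==0.0
-- transformers==3.2.0
"""

# The conda package "pytorch" is installed as the distribution "torch".
ALIASES = {"pytorch": "torch"}


def parse_pins(text):
    """Map package name -> pinned version, accepting '=' and '==' separators."""
    pins = {}
    pattern = re.compile(r"^\s*(?:--\s*)?([A-Za-z0-9_.\-]+)\s*={1,2}\s*(\S+)\s*$")
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            pins[match.group(1)] = match.group(2)
    return pins


def installed_version(name):
    """Return the installed version of a distribution, or None if unknown."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        try:
            import pkg_resources  # older interpreters such as Python 3.6
        except ImportError:
            return None
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


pins = parse_pins(PINS)
for package, wanted in pins.items():
    found = installed_version(ALIASES.get(package, package))
    status = "ok" if found == wanted else "differs"
    print("{}: pinned {}, installed {} ({})".format(package, wanted, found, status))
```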

How to run it

The script train.py is the entry point of the project. You can run the experiments with the following command:

python -u train.py --cfg KBRetriver_DC/KBRetriver_DC_BERT.cfg

The parameters we use are set in the configuration files under configure. If you need to adjust them, you can modify the relevant files or append parameters to the command.

Finally, you can check the results in the logs folder. You can also run the fitlog command to visualize the results:
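
The general pattern of file defaults plus command-line overrides can be sketched as follows. Everything here is an assumption for illustration: the INI-style layout, the section name train, the keys lr and batch_size, and the dotted flag style are hypothetical, and the loader used by train.py may work differently.

```python
import argparse
import configparser

# Hypothetical cfg contents; the real files may use other sections and keys.
EXAMPLE_CFG = """
[train]
lr = 1e-5
batch_size = 8
"""


def load_config(cfg_text, argv):
    """Read INI-style defaults, then let '--section.key value' pairs override them."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    config = {section: dict(parser[section]) for section in parser.sections()}

    # Register one optional flag per known key, defaulting to the file value.
    cli = argparse.ArgumentParser()
    for section, options in config.items():
        for key, default in options.items():
            cli.add_argument("--{}.{}".format(section, key), default=default)

    # Command-line values win over file values; all values stay strings.
    for dotted, value in vars(cli.parse_args(argv)).items():
        section, key = dotted.split(".", 1)
        config[section][key] = value
    return config


config = load_config(EXAMPLE_CFG, ["--train.lr", "3e-5"])
print(config)
```

Registering flags only for keys that already exist in the file means a misspelled override fails loudly instead of being silently ignored.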

fitlog log logs/

Baseline Experiment Result

All experiments were performed on TITAN_XP except for BART, which was run on a Tesla V100 PCIE 32 GB. These may not be the best results; the parameters can be adjusted to obtain better ones.

KBRetriever_DC

Baseline category	Baseline method	QI F1	HI F1	KBI F1	Overall Acc
Non Pre-trained Model	ESIM (Chen et al., 2017)	0.512	0.164	0.543	0.432
	Infersent (Romanov and Shivade, 2018)	0.557	0.031	0.336	0.356
	RE2 (Yang et al., 2019)	0.655	0.244	0.739	0.481
Pre-trained Model	BERT (Devlin et al., 2019)	0.691	0.555	0.740	0.500
	RoBERTa (Liu et al., 2019)	0.715	0.472	0.715	0.500
	XLNet (Yang et al., 2020)	0.725	0.487	0.736	0.509
	Longformer (Beltagy et al., 2020)	0.717	0.500	0.710	0.497
	BART (Lewis et al., 2020)	0.744	0.510	0.761	0.513
Human	Human Performance	0.962	0.805	0.920	0.932
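
The four columns can be computed from gold and predicted labels roughly as follows. This is a sketch under two assumptions that this README does not confirm: each F1 is the binary F1 of the positive (inconsistent) class, and Overall Acc counts an example as correct only when QI, HI and KBI are all predicted correctly. The toy labels are made up for illustration.

```python
def f1_score(gold, pred):
    """Binary F1 for the positive (inconsistent) class; 0.0 when undefined."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def overall_accuracy(gold_rows, pred_rows):
    """Fraction of examples whose (QI, HI, KBI) labels are all predicted correctly."""
    correct = sum(1 for g, p in zip(gold_rows, pred_rows) if tuple(g) == tuple(p))
    return correct / len(gold_rows)


# Toy data: each row is (QI, HI, KBI) for one example.
gold = [(1, 0, 1), (0, 0, 0), (0, 1, 1), (1, 0, 0)]
pred = [(1, 0, 1), (0, 0, 1), (0, 1, 1), (0, 0, 0)]

scores = {}
for index, name in enumerate(("QI F1", "HI F1", "KBI F1")):
    gold_column = [row[index] for row in gold]
    pred_column = [row[index] for row in pred]
    scores[name] = f1_score(gold_column, pred_column)
scores["Overall Acc"] = overall_accuracy(gold, pred)
print(scores)
```

Under this reading, Overall Acc is stricter than any single F1 column, which is consistent with it being the lowest number in most rows of the table.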

Leaderboard

If you submit papers that use these datasets, please consider sending a pull request to merge your results onto the leaderboard. By submitting, you acknowledge that your results were obtained purely by training on the training sets and tuning on the dev sets (e.g., you only evaluated on the test set once).

KBRetriever_DC

Baseline method	QI F1	HI F1	KBI F1	Overall Acc
ESIM (Chen et al., 2017)	0.512	0.164	0.543	0.432
Infersent (Romanov and Shivade, 2018)	0.557	0.031	0.336	0.356
RE2 (Yang et al., 2019)	0.655	0.244	0.739	0.481
BERT (Devlin et al., 2019)	0.691	0.555	0.740	0.500
RoBERTa (Liu et al., 2019)	0.715	0.472	0.715	0.500
XLNet (Yang et al., 2020)	0.725	0.487	0.736	0.509
Longformer (Beltagy et al., 2020)	0.717	0.500	0.710	0.497
BART (Lewis et al., 2020)	0.744	0.510	0.761	0.513
Human Performance	0.962	0.805	0.920	0.932

Acknowledgement

Thanks to all the taggers, Lehan Wang, Ran Duan, Fuxuan Wei, Yudi Zhang, and Weiyun Wang, for their patient annotation!

Thanks to our adviser Wanxiang Che for his support and guidance!

Contact us

Feel free to open an issue or send us an email (me, Tianbao) if you have any problems or find any mistakes in this dataset.

Owner

Libo Qin
