Assessing Dialogue Systems with Distribution Distances

We propose to measure the performance of a dialogue system by computing the distribution-wise distance between its generated conversations and real-world conversations.

To appear in Findings of ACL 2021.

Note that this is not an officially supported Tencent product.

1. Configuration

This repository requires the packages:

pytorch
huggingface/transformers.
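A quick way to confirm that both requirements are available before running the scripts is sketched below. It assumes the usual import names `torch` (for pytorch) and `transformers`; the helper `missing_packages` is a hypothetical name for illustration and is not part of this repository.

```python
from importlib.util import find_spec


def missing_packages(names=("torch", "transformers")):
    """Return the names from `names` that cannot be imported in this environment."""
    # find_spec returns None when a top-level module cannot be located.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", missing if missing else "none")
```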

2. Usage

To evaluate the system-level human correlations of metrics:

python eval_metric.py \
  --data_path ./datasets/convai2_annotation.json \
  --metric fbd \
  --sample_num 10 \
  --model_type roberta-base \
  --batch_size 32
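For readers unfamiliar with the term, a system-level correlation is typically obtained by reducing each dialogue system to one metric score and one human score, then correlating the two across systems. The sketch below illustrates that idea on made-up numbers. It is not taken from `eval_metric.py`; the function name, the input layout, and the choice of Pearson correlation are all assumptions.

```python
import numpy as np


def system_level_pearson(metric_scores, human_scores):
    """Pearson correlation between per-system mean metric and mean human scores.

    Both arguments map a system name to a list of per-sample scores.
    (Hypothetical helper for illustration; not part of this repository.)
    """
    systems = sorted(metric_scores)
    metric_means = np.array([np.mean(metric_scores[s]) for s in systems])
    human_means = np.array([np.mean(human_scores[s]) for s in systems])
    return float(np.corrcoef(metric_means, human_means)[0, 1])


# Toy data, made up for illustration only.
metric = {"sys_a": [0.2, 0.4], "sys_b": [0.5, 0.7], "sys_c": [0.8, 1.0]}
human = {"sys_a": [1.0, 2.0], "sys_b": [2.5, 3.5], "sys_c": [4.0, 5.0]}
print(system_level_pearson(metric, human))
```

For a distribution-level metric that yields a single score per system, each list would simply hold one value.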

Currently, this repo supports the metrics commonly used in text generation, including bleu, meteor, rouge, greedy, average, extrema, bert_score, fbd and prd.
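To give an intuition for a distribution-wise distance such as fbd, here is a minimal sketch of the Fréchet distance between Gaussians fitted to two sets of feature vectors, following the FID formulation of related paper [1]. It assumes the features are sentence embeddings already extracted by some encoder; random arrays stand in for them here. The function name is hypothetical, and the actual `fbd_score.py` may differ in its details.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between Gaussians fitted to two (n, d) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative when
    # both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))  # stand-in for embeddings of real conversations
generated = real + 0.5            # stand-in for embeddings of generated conversations
print(frechet_distance(real, real), frechet_distance(real, generated))
```

A lower value means the two feature distributions are closer. Identical sets give a distance near zero, and a pure shift of the mean contributes only the squared norm of that shift.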

Here are some details of the six corpora compared in the main paper:

File Name	Dataset Name	Num. of Samples	Reference
`personam_annotation.json`	Persona(M)	60	Shikib/usr
`dailyh_annotation.json`	Daily(H)	150	li3cmz/GRADE
`convai2_annotation.json`	Convai2	150	li3cmz/GRADE
`empathetic_annotation.json`	Empathetic	150	li3cmz/GRADE
`dailyz_annotation.json`	Daily(Z)	100	ZHAOTING/dialog-processing
`personaz_annotation.json`	Persona(Z)	150	ZHAOTING/dialog-processing

Citation

If you use this research/codebase/dataset, please cite our paper:

@article{xiang2021assessing,
  title={Assessing Dialogue Systems with Distribution Distances},
  author={Xiang, Jiannan and Liu, Yahui and Cai, Deng and Li, Huayang and Lian, Defu and Liu, Lemao},
  journal={arXiv preprint arXiv:2105.02573},
  year={2021}
}

Other related papers:

[1] FID, GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, NIPS 2017
[2] PRD, Assessing Generative Models via Precision and Recall, NIPS 2018
[3] BERTScore, BERTScore: Evaluating Text Generation with BERT, ICLR 2020
