A crowdsourced dataset of dialogues grounded in social contexts involving utilization of commonsense.

Overview

Commonsense-Dialogues Dataset

We present Commonsense-Dialogues, a crowdsourced dataset of ~11K dialogues grounded in social contexts involving utilization of commonsense. The social contexts used were sourced from the train split of the SocialIQA dataset, a multiple-choice question-answering based social commonsense reasoning benchmark.

For the collection of the Commonsense-Dialogues dataset, each Turker was presented a social context and asked to write a dialogue of 4-6 turns between two people based on the event(s) described in the context. The Turker was asked to alternate between the roles of an individual referenced in the context and a 3rd party friend. See the following dialogues as examples:

    "1": {  # dialogue_id
        "context": "Sydney met Carson's mother for the first time last week. He liked her.",   # multiple individuals in the context: Sydney and Carson
        "speaker": "Sydney",   # role 1 = Sydney, role 2 = a third-person friend of Sydney
        "turns": [
            "I met Carson's mother last week for the first time.",
            "How was she?",
            "She turned out to be really nice. I like her.",
            "That's good to hear.",
            "It is, especially since Carson and I are getting serious.",
            "Well, at least you'll like your in-law if you guys get married."
        ]
    }

    "2": {
        "context": "Kendall had a party at Jordan's house but was found out to not have asked and just broke in.",
        "speaker": "Kendall",
        "turns": [
            "Did you hear about my party this weekend at Jordan\u2019s house?",
            "I heard it was amazing, but that you broke in.",
            "That was a misunderstanding, I had permission to be there.",
            "Who gave you permission?",
            "I talked to Jordan about it months ago before he left town to go to school, but he forgot to tell his roommates about it.",
            "Ok cool, I hope everything gets resolved."
        ]
    }

The data can be found in the /data directory of this repo. train.json has ~9K dialogues, valid.json and test.json have ~1K dialogues each. Since all the contexts were sourced from the train split of SocialIQA, it is imperative to note that any form of multi-task training and evaluation with Commonsense-Dialogues and SocialIQA must be done with caution to ensure fair and accurate conclusions.

Some statistics about the data are provided below:

Stat Train Valid Test
# of dialogues 9058 1157 1158
average # of turns in a dialogue 5.72 5.72 5.71
average # of words in a turn 12.4 12.4 12.2
# of distinct SocialIQA contexts used 3672 483 473
average # of dialogues for a SocialIQA context 2.46 2.395 2.45

Security

See CONTRIBUTING for more information.

License

This repository is licensed under the CC-BY-NC 4.0 License.

Citation

If you use this dataset, please cite the following paper:

@inproceedings{zhou-etal-2021-commonsense,
    title = "Commonsense-Focused Dialogues for Response Generation: An Empirical Study",
    author = "Zhou, Pei  and
      Gopalakrishnan, Karthik  and
      Hedayatnia, Behnam  and
      Kim, Seokhwan  and
      Pujara, Jay  and
      Ren, Xiang  and
      Liu, Yang  and
      Hakkani-Tur, Dilek",
    booktitle = "Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue",
    year = "2021",
    address = "Singapore and Online",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2109.06427"
}

Note that the paper uses newly collected dialogues as well as those that were filtered from existing datasets. This repo contains our newly collected dialogues alone.

Owner
Alexa
Alexa
NLP, Machine learning

Netflix-recommendation-system NLP, Machine learning About Recommendation algorithms are at the core of the Netflix product. It provides their members

Harshith VH 6 Jan 12, 2022
Training open neural machine translation models

Train Opus-MT models This package includes scripts for training NMT models using MarianNMT and OPUS data for OPUS-MT. More details are given in the Ma

Language Technology at the University of Helsinki 167 Jan 03, 2023
OCR을 이용하여 인원수를 인식 후 줌을 Kill 해줍니다

How To Use killtheZoom-2.0 Windows 0. https://joyhong.tistory.com/79 이 글을 보면서 tesseract를 C:\Program Files\Tesseract-OCR 경로로 설치해주세요(한국어 언어 추가 필요) 상단의 초

김정인 9 Sep 13, 2021
MPNet: Masked and Permuted Pre-training for Language Understanding

MPNet MPNet: Masked and Permuted Pre-training for Language Understanding, by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu, is a novel pre-tr

Microsoft 228 Nov 21, 2022
ConvBERT: Improving BERT with Span-based Dynamic Convolution

ConvBERT Introduction In this repo, we introduce a new architecture ConvBERT for pre-training based language model. The code is tested on a V100 GPU.

YITUTech 237 Dec 10, 2022
Source code and dataset for ACL 2019 paper "ERNIE: Enhanced Language Representation with Informative Entities"

ERNIE Source code and dataset for "ERNIE: Enhanced Language Representation with Informative Entities" Reqirements: Pytorch=0.4.1 Python3 tqdm boto3 r

THUNLP 1.3k Dec 30, 2022
Py65 65816 - Add support for the 65C816 to py65

Add support for the 65C816 to py65 Py65 (https://github.com/mnaberez/py65) is a

4 Jan 04, 2023
[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers

Instance-level Image Retrieval using Reranking Transformers Fuwen Tan, Jiangbo Yuan, Vicente Ordonez, ICCV 2021. Abstract Instance-level image retriev

UVA Computer Vision 86 Dec 28, 2022
Must-read papers on improving efficiency for pre-trained language models.

Must-read papers on improving efficiency for pre-trained language models.

Tobias Lee 89 Jan 03, 2023
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Project Page] [Paper] [Video] Wenlong Huang1, Pieter Abbee

Wenlong Huang 114 Dec 29, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
A versatile token stream for handwritten parsers.

Writing recursive-descent parsers by hand can be quite elegant but it's often a bit more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. Th

Valentin Berlier 8 Nov 30, 2022
Natural Language Processing library built with AllenNLP 🌲🌱

Custom Natural Language Processing with big and small models 🌲🌱

Recognai 65 Sep 13, 2022
News-Articles-and-Essays - NLP (Topic Modeling and Clustering)

NLP T5 Project proposal Topic Modeling and Clustering of News-Articles-and-Essays Students: Nasser Alshehri Abdullah Bushnag Abdulrhman Alqurashi OVER

2 Jan 18, 2022
This Project is based on NLTK It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its antonyms, its synonyms

This Project is based on NLTK(Natural Language Toolkit) It generates a RANDOM WORD from a predefined list of words, From that random word it read out the word, its meaning with parts of speech , its

SaiVenkatDhulipudi 2 Nov 17, 2021
NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations

NL-Augmenter 🦎 → 🐍 The NL-Augmenter is a collaborative effort intended to add transformations of datasets dealing with natural language. Transformat

684 Jan 09, 2023
Natural Language Processing Specialization

Natural Language Processing Specialization In this folder, Natural Language Processing Specialization projects and notes can be found. WHAT I LEARNED

Kaan BOKE 3 Oct 06, 2022
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
Just a Basic like Language for Zeno INC

zeno-basic-language Just a Basic like Language for Zeno INC This is written in 100% python. this is basic language like language. so its not for big p

Voidy Devleoper 1 Dec 18, 2021
使用pytorch+transformers复现了SimCSE论文中的有监督训练和无监督训练方法

SimCSE复现 项目描述 SimCSE是一种简单但是很巧妙的NLP对比学习方法,创新性地引入Dropout的方式,对样本添加噪声,从而达到对正样本增强的目的。 该框架的训练目的为:对于batch中的每个样本,拉近其与正样本之间的距离,拉远其与负样本之间的距离,使得模型能够在大规模无监督语料(也可以

58 Dec 20, 2022