Amazon Multilingual Counterfactual Dataset (AMCD)

Last update: Sep 20, 2022

Overview

This repository contains a dataset described in the paper:

I Wish I Would Have Loved This One, But I Didn’t – A Multilingual Dataset for Counterfactual Detection in Product Reviews. James O’Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, Danushka Bollegala. EMNLP 2021. An arXiv version is also available.

The dataset contains sentences from Amazon customer reviews (sampled from the Amazon product review dataset), annotated for counterfactual detection (CFD) as a binary classification task. Counterfactual statements describe events that did not or cannot take place. They may be identified as statements of the form "If p were true, then q would be true", that is, assertions whose antecedent (p) and consequent (q) are known or assumed to be false.

The key features of this dataset are:

The dataset is multilingual and contains sentences in English, German, and Japanese.
The sentences were labeled by professional linguists, and high annotation quality was ensured.
The dataset is supplemented with the annotation guidelines and definitions, which were developed by professional linguists. We also provide the clue word lists, which contain words typical of counterfactual sentences and were used for initial data filtering. The clue word lists were also compiled by professional linguists.
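
As a rough illustration of how clue words can be used for initial data filtering, the sketch below keeps only sentences that contain at least one clue word. It is a minimal example under stated assumptions, not the procedure used to build the dataset: the clue words, the function names build_clue_pattern and filter_candidates, and the sample sentences are all hypothetical, and the real clue word lists are the ones shipped with the dataset.

```python
import re

# Hypothetical clue words for illustration only; the actual lists
# are provided with the dataset and are not reproduced here.
CLUE_WORDS = ["wish", "if only", "would have", "should have", "could have"]


def build_clue_pattern(clue_words):
    """Compile a case-insensitive regex matching any clue word or phrase.

    Longer phrases come first so they are preferred over shorter
    alternatives that start at the same position.
    """
    ordered = sorted(clue_words, key=len, reverse=True)
    alternatives = "|".join(re.escape(word) for word in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def filter_candidates(sentences, clue_words=CLUE_WORDS):
    """Return the sentences that contain at least one clue word."""
    pattern = build_clue_pattern(clue_words)
    return [sentence for sentence in sentences if pattern.search(sentence)]


sentences = [
    "I wish I would have loved this one, but I didn't.",
    "The battery lasts all day.",
    "If only the strap were longer.",
]
candidates = filter_candidates(sentences)
```

A filter like this only selects candidate sentences; a clue word does not by itself make a sentence counterfactual, which is why the candidates are then labeled by annotators. The word-boundary matching used here assumes whitespace-delimited text, so it would need to be adapted for Japanese.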

Please see the paper for data statistics and a detailed description of data collection and annotation.

For the dataset format, please see README.txt.

Cite

If you use this dataset in your research, please cite the paper.

License Summary

The documentation is made available under the Creative Commons Attribution-ShareAlike 4.0 International License. See the LICENSE file.
