A cross-lingual COVID-19 fake news dataset

Overview

CrossFake

An English-Chinese COVID-19 fake & real news dataset from the ICDMW 2021 paper below:
Cross-lingual COVID-19 Fake News Detection.
Jiangshu Du, Yingtong Dou, Congying Xia, Limeng Cui, Jing Ma, Philip S. Yu.

Introduction

The COVID-19 pandemic poses a significant threat to global public health. Meanwhile, a massive amount of misinformation associated with the pandemic advocates unfounded or unscientific claims. Although major social media platforms and news outlets have made an extra effort to debunk COVID-19 misinformation, most fact-checking information is in English, while unmoderated COVID-19 misinformation still circulates in other languages, threatening the health of less-informed people in immigrant communities and developing countries (The Vox, New York Times).

In the above paper, we make the first attempt to detect COVID-19 misinformation in a low-resource language (Chinese) using only the fact-checked news in a high-resource language (English).

This repo contains a Chinese-English real & fake news dataset annotated according to existing English fact-checking information. The dataset is described in Dataset Detail.

The highlights of our dataset are as follows:

  • Bilingual news pieces for the same event (fact).
  • Multiple Chinese news pieces for the same event (fact).
  • Comprehensive metadata for each news piece (see below).

Dataset Detail

The table below shows the number of annotated news pieces in each language:

Lang.  Fake  Real  Total
ENG      55    82    137
CHN     101   118    219
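
As a quick sanity check on the table, the totals and the share of fake news per language can be recomputed from the counts. The snippet below is only an illustrative sketch; the dictionary and function names are hypothetical and not part of the dataset.

```python
# Counts copied from the table above: (fake, real) per language.
COUNTS = {"ENG": (55, 82), "CHN": (101, 118)}


def summarize(counts):
    """Return {lang: (total, fake_share)} for each language."""
    summary = {}
    for lang, (fake, real) in counts.items():
        total = fake + real
        summary[lang] = (total, fake / total)
    return summary


if __name__ == "__main__":
    for lang, (total, fake_share) in summarize(COUNTS).items():
        print(f"{lang}: total={total}, fake share={fake_share:.1%}")
```

Fake news makes up roughly 40% of the English pieces and 46% of the Chinese pieces, so neither class dominates in either language.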

The metadata of our dataset can be found in CrossFake_metadata.xlsx, which includes two sheets (news_fake and news_real). Given a news id, you can find the corresponding news body text in the body_text directory. The meaning of each metadata column is given below:

  • Column A (id):

    News id. Chinese real & fake news is annotated according to existing English fact-checking information. Thus, each piece of English news may correspond to multiple pieces of Chinese news from different sources. For example, in the news_fake sheet, the ids 1_1 and 1_2 refer to the same piece of English news and to the two pieces of Chinese news that correspond to it.

  • Column B (fact_check_url):

    The fact-checking source of the corresponding English news.

  • Column C (type):

    The news type. Post and Article indicate that the news comes from a social media post or an online article, respectively. Note that we also annotated some clickbait news whose title and body text present contradictory information.

  • Column D (source):

    The news source. Personal and Professional indicate that the news comes from a personal account or a professional source (WHO, NIH, etc.), respectively.

  • Column E (mixed?):

    Whether the news includes mixed content. If the body text contains only content related to the checked fact, the news piece is annotated as not mixed. Accordingly, news whose content includes events or facts besides the checked fact is regarded as mixed.

  • Column F (platform):

    The platform where the news is published.

  • Column G (news_url):

    The news source URL. Note that some of the links are invalid because the news has been deleted or removed. We archived the news that was still accessible (see Column H) while curating the dataset.

  • Column H (archive):

    The archived news link. To permanently store the original news, we archived the news source URL.

  • Column I (newstitle):

    The news title.

  • Column J (publish_date):

    The news publishing date.

  • Columns K to R have the same meanings as Columns C to J, but they describe the Chinese news.
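
The id convention in Column A can be handled programmatically. The sketch below assumes that ids always have the form `<event>_<index>` (as in 1_1 and 1_2) and groups Chinese news pieces by the English news they correspond to. The helper names are hypothetical. `load_metadata` additionally assumes that pandas and openpyxl are installed and that the sheet names match those given above; the header names inside the sheets should be checked against the actual file.

```python
from collections import defaultdict


def split_news_id(news_id):
    """Split an id such as "1_2" into (event number, Chinese news index).

    Assumes the "<event>_<index>" pattern described for Column A.
    """
    event, index = str(news_id).strip().split("_")
    return int(event), int(index)


def group_by_event(news_ids):
    """Map each English news (event number) to the ids of its Chinese news."""
    groups = defaultdict(list)
    for news_id in news_ids:
        event, _ = split_news_id(news_id)
        groups[event].append(news_id)
    return dict(groups)


def load_metadata(path="CrossFake_metadata.xlsx"):
    """Load both sheets as a dict of DataFrames (requires pandas + openpyxl)."""
    import pandas as pd  # imported lazily so the helpers above work without it

    return pd.read_excel(path, sheet_name=["news_fake", "news_real"])


if __name__ == "__main__":
    # Illustrative ids only; real ids come from Column A of each sheet.
    print(group_by_event(["1_1", "1_2", "8_3", "9_1", "9_2"]))
```

Note that ids are only unique within a sheet: news_fake and news_real both use their own numbering, so the sheet name should be kept alongside the id when the two sheets are combined.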

Case Study

Besides the findings and conclusions presented in our paper, we made some additional observations while collecting the data:

  1. Mixed Fact. For some fake news, the corresponding Chinese articles present it as part of a news digest alongside other news events. This adds an extra hurdle to fact-checking those pieces, since only part of the content contains misinformation. A typical example is news_id 8_3 in the news_fake sheet. Other examples are the news pieces whose mixed? column is annotated as Yes.

  2. Misused Fact. For news_real id 9_2, we find a Chinese social post leveraging the fact that "coronavirus can live for up to 4 hours on copper" to promote a copper-made pot. In this case, even though the title and most of the news content seem legitimate, the connection between "the copper kills coronavirus" and "copper pot is good" is still questionable.

  3. Fake News Type. While annotating the Chinese news based on the fact-checked English news, we found that most of the fact-checked fake news from Politifact has no corresponding Chinese news. Those news pieces are usually local news in the United States.

  4. Cross-lingual Fact-checking. For news_real id 9_1, we find a Chinese news piece from a professional news outlet that was published five days earlier than the fact-checked English Facebook post. This suggests that fact information from another language could be leveraged to help fact-check the news. Note that most of the Chinese news in our dataset was published later than the source English news, since most of the checked news events originated in English media.
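
The timing comparison in finding 4 can be expressed as a simple date difference. The sketch below assumes that publish dates are available as ISO strings (YYYY-MM-DD); the format actually used in the publish_date columns may differ, and the dates shown are made up purely for illustration.

```python
from datetime import date


def lead_days(english_date, chinese_date):
    """Days by which the Chinese news precedes the English news.

    Positive: the Chinese piece was published first.
    Negative: the Chinese piece was published later.
    """
    eng = date.fromisoformat(english_date)
    chn = date.fromisoformat(chinese_date)
    return (eng - chn).days


if __name__ == "__main__":
    # Hypothetical dates, not taken from the dataset.
    print(lead_days("2020-03-20", "2020-03-15"))  # Chinese piece came first
    print(lead_days("2020-03-20", "2020-04-02"))  # Chinese piece came later
```

Applied to every English-Chinese pair, the sign of this value separates the rare cases where the Chinese news leads from the common case where it follows the English source.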

Future Directions

Given the current dataset, some future research directions include:

  • The writing style/sentiment/stance differences between fake news and real news.
  • The writing style/sentiment/stance differences between professional news outlets and personal accounts.
  • The information distortion/loss from English news to Chinese news.
  • The temporal patterns of cross-lingual news migration.
  • The title patterns of different news.

Citation

If you use our dataset or code, please cite the paper below:

@inproceedings{du2021cross,
  title={Cross-lingual COVID-19 Fake News Detection},
  author={Du, Jiangshu and Dou, Yingtong and Xia, Congying and Cui, Limeng and Ma, Jing and Yu, Philip S},
  booktitle={Proceedings of the 21st IEEE International Conference on Data Mining Workshops (ICDMW'21)},
  year={2021}
}