The Toxicity Dataset

Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity — so you can skip the slog and get to work.

We hope you find this dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.

Need a larger dataset of toxicity to train your ML models, or toxicity in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and Safety companies around the world. Reach out to [email protected]!

Dataset

This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Click on toxicity_en.csv to see a spreadsheet of all 1,000 English examples. Rather than imposing a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

Columns

text: the text of the comment
is_toxic: whether or not the comment is toxic
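
As a quick sanity check on the two columns above, here is a minimal sketch that tallies the values in the is_toxic column using only the Python standard library. The helper name count_labels and the sample rows are hypothetical, and the label strings ("Toxic" / "Not Toxic") are an assumption, since this README does not say how is_toxic is encoded; the function simply counts whatever values it finds.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file):
    """Tally the values found in the is_toxic column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    return Counter(row["is_toxic"] for row in reader)


# Tiny in-memory stand-in for toxicity_en.csv.
# These rows and label strings are invented for illustration only.
sample = io.StringIO(
    "text,is_toxic\n"
    '"Thanks, this was really helpful!",Not Toxic\n'
    '"You are an idiot, nobody wants you here.",Toxic\n'
    '"I disagree, but I see your point.",Not Toxic\n'
)
label_counts = count_labels(sample)
print(label_counts)
```

To run it on the real file, open it with `open("toxicity_en.csv", newline="", encoding="utf-8")` and pass the file object to count_labels. If the file matches the description above, the two labels should each appear 500 times.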

Future

We'll be adding more languages and annotations (e.g., a severity ranking and categories for each comment) over time.

If you're also interested in a dataset of profanity, check out our obscenity list.

Owner

Surge AI
