Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

Overview

Dataset Cartography

Code for the paper Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics at EMNLP 2020.

This repository contains implementation of data maps, as well as other data selection baselines, along with notebooks for data map visualizations.

If using, please cite:

@inproceedings{swayamdipta2020dataset,
    title={Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics},
    author={Swabha Swayamdipta and Roy Schwartz and Nicholas Lourie and Yizhong Wang and Hannaneh Hajishirzi and Noah A. Smith and Yejin Choi},
    booktitle={Proceedings of EMNLP},
    url={https://arxiv.org/abs/2009.10795},
    year={2020}
}

This repository can be used to build Data Maps, like this one for SNLI using a RoBERTa-Large classifier. SNLI Data Map with RoBERTa-Large

Pre-requisites

This repository is based on the HuggingFace Transformers library.

Train GLUE-style model and compute training dynamics

To train a GLUE-style model using this repository:

python -m cartography.classification.run_glue \
    -c configs/$TASK.jsonnet \
    --do_train \
    --do_eval \
    -o $MODEL_OUTPUT_DIR

The best configurations for our experiments for each of the $TASKs (SNLI, MNLI, QNLI or WINOGRANDE) are provided under configs.

This produces a training dynamics directory $MODEL_OUTPUT_DIR/training_dynamics, see a sample here.

Note: you can use any other set up to train your model (independent of this repository) as long as you produce the dynamics_epoch_$X.jsonl for plotting data maps, and filtering different regions of the data. The .jsonl file must contain the following fields for every training instance:

  • guid : instance ID matching that in the original data file, for filtering,
  • logits_epoch_$X : logits for the training instance under epoch $X,
  • gold : index of the gold label, must match the logits array.

Plot Data Maps

To plot data maps for a trained $MODEL (e.g. RoBERTa-Large) on a given $TASK (e.g. SNLI, MNLI, QNLI or WINOGRANDE):

python -m cartography.selection.train_dy_filtering \
    --plot \
    --task_name $TASK \
    --model_dir $PATH_TO_MODEL_OUTPUT_DIR_WITH_TRAINING_DYNAMICS \
    --model $MODEL_NAME

Data Selection

To select (different amounts of) data based on various metrics from training dynamics:

python -m cartography.selection.train_dy_filtering \
    --filter \
    --task_name $TASK \
    --model_dir $PATH_TO_MODEL_OUTPUT_DIR_WITH_TRAINING_DYNAMICS \
    --metric $METRIC \
    --data_dir $PATH_TO_GLUE_DIR_WITH_ORIGINAL_DATA_IN_TSV_FORMAT

Supported $TASKs include SNLI, QNLI, MNLI and WINOGRANDE, and $METRICs include confidence, variability, correctness, forgetfulness and threshold_closeness; see paper for more details.

To select hard-to-learn instances, set $METRIC as "confidence" and for ambiguous, set $METRIC as "variability". For easy-to-learn instances: set $METRIC as "confidence" and use the flag --worst.

NOMAD - A blackbox optimization software

################################################################################### #

Blackbox Optimization 78 Dec 29, 2022
Efficient Online Bayesian Inference for Neural Bandits

Efficient Online Bayesian Inference for Neural Bandits By Gerardo Durán-Martín, Aleyna Kara, and Kevin Murphy AISTATS 2022.

Probabilistic machine learning 49 Dec 27, 2022
Official implementation of "A Shared Representation for Photorealistic Driving Simulators" in PyTorch.

A Shared Representation for Photorealistic Driving Simulators The official code for the paper: "A Shared Representation for Photorealistic Driving Sim

VITA lab at EPFL 7 Oct 13, 2022
Official Pytorch Implementation of GraphiT

GraphiT: Encoding Graph Structure in Transformers This repository implements GraphiT, described in the following paper: Grégoire Mialon*, Dexiong Chen

Inria Thoth 80 Nov 27, 2022
Official code for NeurIPS 2021 paper "Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN"

Towards Scalable Unpaired Virtual Try-On via Patch-Routed Spatially-Adaptive GAN Official code for NeurIPS 2021 paper "Towards Scalable Unpaired Virtu

68 Dec 21, 2022
An implementation for Neural Architecture Search with Random Labels (CVPR 2021 poster) on Pytorch.

Neural Architecture Search with Random Labels(RLNAS) Introduction This project provides an implementation for Neural Architecture Search with Random L

18 Nov 08, 2022
How the Deep Q-learning method works and discuss the new ideas that makes the algorithm work

Deep Q-Learning Recommend papers The first step is to read and understand the method that you will implement. It was first introduced in a 2013 paper

1 Jan 25, 2022
The Malware Open-source Threat Intelligence Family dataset contains 3,095 disarmed PE malware samples from 454 families

MOTIF Dataset The Malware Open-source Threat Intelligence Family (MOTIF) dataset contains 3,095 disarmed PE malware samples from 454 families, labeled

Booz Allen Hamilton 112 Dec 13, 2022
Labelbox is the fastest way to annotate data to build and ship artificial intelligence applications

Labelbox Labelbox is the fastest way to annotate data to build and ship artificial intelligence applications. Use this github repository to help you s

labelbox 1.7k Dec 29, 2022
Using Python to Play Cyberpunk 2077

CyberPython 2077 Using Python to Play Cyberpunk 2077 This repo will contain code from the Cyberpython 2077 video series on Youtube (youtube.

Harrison 118 Oct 18, 2022
Codebase for ECCV18 "The Sound of Pixels"

Sound-of-Pixels Codebase for ECCV18 "The Sound of Pixels". *This repository is under construction, but the core parts are already there. Environment T

Hang Zhao 318 Dec 20, 2022
Quantify the difference between two arbitrary curves in space

similaritymeasures Quantify the difference between two arbitrary curves Curves in this case are: discretized by inidviudal data points ordered from a

Charles Jekel 175 Jan 08, 2023
Unofficial Implementation of Oboe (SIGCOMM'18').

Oboe-Reproduce This is the unofficial implementation of the paper "Oboe: Auto-tuning video ABR algorithms to network conditions, Zahaib Akhtar, Yun Se

Tianchi Huang 13 Nov 04, 2022
This repository contains the exercises and its solution contained in the book "An Introduction to Statistical Learning" in python.

An-Introduction-to-Statistical-Learning This repository contains the exercises and its solution contained in the book An Introduction to Statistical L

2.1k Jan 02, 2023
GAN-based Matrix Factorization for Recommender Systems

GAN-based Matrix Factorization for Recommender Systems This repository contains the datasets' splits, the source code of the experiments and their res

Ervin Dervishaj 9 Nov 06, 2022
Image-generation-baseline - MUGE Text To Image Generation Baseline

MUGE Text To Image Generation Baseline Requirements and Installation More detail

23 Oct 17, 2022
Template repository for managing machine learning research projects built with PyTorch-Lightning

Tutorial Repository with a minimal example for showing how to deploy training across various compute infrastructure.

Sidd Karamcheti 3 Feb 11, 2022
Optimizes image files by converting them to webp while also updating all references.

About Optimizes images by (re-)saving them as webp. For every file it replaced it automatically updates all references. Works on single files as well

Watermelon Wolverine 18 Dec 23, 2022
[IJCAI-2021] A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation"

DataFree A benchmark of data-free knowledge distillation from paper "Contrastive Model Inversion for Data-Free Knowledge Distillation" Authors: Gongfa

ZJU-VIPA 47 Jan 09, 2023
Yolov3 pytorch implementation

YOLOV3 Pytorch实现 在bubbliiing大佬代码的基础上进行了修改,添加了部分注释。 预训练模型 预训练模型来源于bubbliiing。 链接:https://pan.baidu.com/s/1ncREw6Na9ycZptdxiVMApw 提取码:appk 训练自己的数据集 按照VO

4 Aug 27, 2022