The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Last update: Dec 22, 2022

Overview

Cutoff: A Simple Data Augmentation Approach for Natural Language

This repository contains source code necessary to reproduce the results presented in the following paper:

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

This project is maintained by Dinghan Shen. Feel free to contact [email protected] for any relevant issues.

Natural Language Understanding (e.g. GLUE tasks, etc.)

Prerequisites:

CUDA, cuDNN
Python 3.7
PyTorch 1.4.0

Run

Install Huggingface Transformers according to the instructions here: https://github.com/huggingface/transformers.
Download the datasets from the GLUE benchmark:

python download_glue_data.py --data_dir glue_data --tasks all

Fine-tune the RoBERTa-base or RoBERTa-large model with the Cutoff data augmentation strategies:

chmod +x run_glue.sh
./run_glue.sh

Options: different settings and hyperparameters can be selected and specified in the run_glue.sh script:

do_aug: whether augmented examples are used for training.
aug_type: the specific strategy to synthesize Cutoff samples, which can be chosen from: 'span_cutoff', 'token_cutoff' and 'dim_cutoff'.
aug_cutoff_ratio: the ratio corresponding to the span length, token number or number of dimensions to be cut.
aug_ce_loss: the coefficient for the cross-entropy loss over the cutoff examples.
aug_js_loss: the coefficient for the Jensen-Shannon (JS) Divergence consistency loss over the cutoff examples.
TASK_NAME: the downstream GLUE task for fine-tuning.
model_name_or_path: the pre-trained model used for initialization (both RoBERTa-base and RoBERTa-large are supported).
output_dir: the folder where results are saved.
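
To make the options above more concrete, here is a minimal, framework-free sketch of what the three aug_type strategies and the two loss coefficients could look like. It is not the repository's implementation: the function names, the use of plain Python lists in place of PyTorch tensors, and the rule cut size = int(length * aug_cutoff_ratio) are all assumptions made for illustration. The JS term is written as the average KL divergence from each prediction to the mean prediction, which is one common way to define a multi-distribution Jensen-Shannon divergence.

```python
import math
import random


def _zero_rows(embeds, rows):
    """Return a copy of embeds with the given token rows set to zero."""
    return [[0.0] * len(r) if i in rows else list(r) for i, r in enumerate(embeds)]


def span_cutoff(embeds, ratio, rng):
    """Zero out one contiguous span of tokens (hypothetical helper)."""
    seq_len = len(embeds)
    cut_len = int(seq_len * ratio)  # assumed rounding rule
    start = rng.randrange(seq_len - cut_len + 1)
    return _zero_rows(embeds, set(range(start, start + cut_len)))


def token_cutoff(embeds, ratio, rng):
    """Zero out randomly chosen, not necessarily adjacent, tokens."""
    seq_len = len(embeds)
    return _zero_rows(embeds, set(rng.sample(range(seq_len), int(seq_len * ratio))))


def dim_cutoff(embeds, ratio, rng):
    """Zero out the same randomly chosen embedding dimensions for every token."""
    dim = len(embeds[0])
    cut = set(rng.sample(range(dim), int(dim * ratio)))
    return [[0.0 if j in cut else v for j, v in enumerate(row)] for row in embeds]


def js_divergence(dists):
    """Mean KL(p_i || m), where m is the element-wise mean of all p_i."""
    n = len(dists)
    mean = [sum(p[k] for p in dists) / n for k in range(len(dists[0]))]

    def kl(p, q):
        return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

    return sum(kl(p, mean) for p in dists) / n


def total_loss(orig_probs, aug_probs_list, label, aug_ce_loss, aug_js_loss):
    """Cross-entropy on the original example, plus weighted cross-entropy and
    JS consistency terms over the cutoff examples (assumed combination)."""
    ce_orig = -math.log(orig_probs[label])
    ce_aug = sum(-math.log(p[label]) for p in aug_probs_list)
    js = js_divergence([orig_probs] + aug_probs_list)
    return ce_orig + aug_ce_loss * ce_aug + aug_js_loss * js


if __name__ == "__main__":
    rng = random.Random(0)
    embeds = [[1.0] * 4 for _ in range(10)]  # 10 tokens, 4 dimensions
    for fn in (span_cutoff, token_cutoff, dim_cutoff):
        print(fn.__name__, fn(embeds, 0.2, rng))
    print(total_loss([0.7, 0.3], [[0.6, 0.4], [0.8, 0.2]], 0, 1.0, 1.0))
```

In this sketch, setting aug_js_loss to 0 leaves only the cross-entropy terms, and setting both coefficients to 0 recovers ordinary fine-tuning on the original example.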

Natural Language Generation (e.g. Translation, etc.)

Please refer to Neural Machine Translation with Data Augmentation for more details.

IWSLT'14 German to English (Transformers)

Task	Setting	Approach	BLEU
iwslt14 de-en	transformer-small	w/o cutoff	36.2
iwslt14 de-en	transformer-small	w/ cutoff	37.6

WMT'14 English to German (Transformers)

Task	Setting	Approach	BLEU
wmt14 en-de	transformer-base	w/o cutoff	28.6
wmt14 en-de	transformer-base	w/ cutoff	29.1
wmt14 en-de	transformer-big	w/o cutoff	29.5
wmt14 en-de	transformer-big	w/ cutoff	30.3

Citation

Please cite our paper in your publications if it helps your research:

@article{shen2020simple,
  title={A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation},
  author={Shen, Dinghan and Zheng, Mingzhi and Shen, Yelong and Qu, Yanru and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.13818},
  year={2020}
}
