The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Last update: Dec 22, 2022

Overview

Cutoff: A Simple Data Augmentation Approach for Natural Language

This repository contains source code necessary to reproduce the results presented in the following paper:

A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation

This project is maintained by Dinghan Shen. Feel free to contact [email protected] for any relevant issues.

Natural Language Understanding (e.g. GLUE tasks, etc.)

Prerequisite:

CUDA, cuDNN
Python 3.7
PyTorch 1.4.0

Run

Install Huggingface Transformers according to the instructions here: https://github.com/huggingface/transformers.
Download the datasets from the GLUE benchmark:

python download_glue_data.py --data_dir glue_data --tasks all

Fine-tune the RoBERTa-base or RoBERTa-large model with the Cutoff data augmentation strategies:

chmod +x run_glue.sh
./run_glue.sh

Options: different settings and hyperparameters can be selected and specified in the run_glue.sh script:

do_aug: whether augmented examples are used for training.
aug_type: the specific strategy to synthesize Cutoff samples, which can be chosen from: 'span_cutoff', 'token_cutoff' and 'dim_cutoff'.
aug_cutoff_ratio: the ratio corresponding to the span length, token number or number of dimensions to be cut.
aug_ce_loss: the coefficient for the cross-entropy loss over the cutoff examples.
aug_js_loss: the coefficient for the Jensen-Shannon (JS) Divergence consistency loss over the cutoff examples.
TASK_NAME: the downstream GLUE task for fine-tuning.
model_name_or_path: the pre-trained model used for initialization (both RoBERTa-base and RoBERTa-large are supported).
output_dir: the folder where results are saved.
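
To make the options above more concrete, here is a minimal pure-Python sketch of how the three aug_type strategies and the two loss coefficients could fit together. It is not the repository's implementation. It assumes that cutoff zeroes entries of the input embedding matrix, that aug_cutoff_ratio is rounded down to a whole number of tokens or dimensions, and that the JS consistency term is the mean KL divergence from each prediction to the average prediction. The function names cutoff_mask, apply_mask, js_consistency and total_loss are hypothetical and do not appear in the source code.

```python
import math
import random


def cutoff_mask(seq_len, hidden_dim, aug_type, ratio, rng):
    """Return a seq_len x hidden_dim matrix of 0/1 keep-flags (assumed layout)."""
    mask = [[1] * hidden_dim for _ in range(seq_len)]
    if aug_type == "span_cutoff":
        # Zero one contiguous span of int(seq_len * ratio) tokens.
        span = int(seq_len * ratio)
        start = rng.randint(0, seq_len - span)
        rows, cols = range(start, start + span), range(hidden_dim)
    elif aug_type == "token_cutoff":
        # Zero int(seq_len * ratio) distinct tokens chosen at random.
        rows = rng.sample(range(seq_len), int(seq_len * ratio))
        cols = range(hidden_dim)
    elif aug_type == "dim_cutoff":
        # Zero int(hidden_dim * ratio) embedding dimensions for every token.
        rows = range(seq_len)
        cols = rng.sample(range(hidden_dim), int(hidden_dim * ratio))
    else:
        raise ValueError(f"unknown aug_type: {aug_type}")
    for i in rows:
        for j in cols:
            mask[i][j] = 0
    return mask


def apply_mask(embeddings, mask):
    """Multiply the embeddings element-wise by the keep-flags."""
    return [[e * m for e, m in zip(e_row, m_row)]
            for e_row, m_row in zip(embeddings, mask)]


def js_consistency(distributions):
    """Mean KL divergence from each distribution to their average."""
    n = len(distributions)
    num_classes = len(distributions[0])
    avg = [sum(p[k] for p in distributions) / n for k in range(num_classes)]

    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    return sum(kl(p, avg) for p in distributions) / n


def total_loss(ce_orig, ce_cutoff, js, aug_ce_loss, aug_js_loss):
    """Assumed combination: original CE plus weighted cutoff CE and JS terms."""
    return ce_orig + aug_ce_loss * ce_cutoff + aug_js_loss * js


# Usage: cut a span covering 20% of a 10-token sequence with 4-dim embeddings.
rng = random.Random(0)
embeddings = [[1.0] * 4 for _ in range(10)]
mask = cutoff_mask(10, 4, "span_cutoff", 0.2, rng)
augmented = apply_mask(embeddings, mask)
```

In this sketch, setting aug_js_loss to 0 leaves only the cross-entropy terms, and setting both coefficients to 0 recovers plain fine-tuning on the original examples.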

Natural Language Generation (e.g. Translation, etc.)

Please refer to Neural Machine Translation with Data Augmentation for more details.

IWSLT'14 German to English (Transformers)

Task	Setting	Approach	BLEU
iwslt14 de-en	transformer-small	w/o cutoff	36.2
iwslt14 de-en	transformer-small	w/ cutoff	37.6

WMT'14 English to German (Transformers)

Task	Setting	Approach	BLEU
wmt14 en-de	transformer-base	w/o cutoff	28.6
wmt14 en-de	transformer-base	w/ cutoff	29.1
wmt14 en-de	transformer-big	w/o cutoff	29.5
wmt14 en-de	transformer-big	w/ cutoff	30.3

Citation

Please cite our paper in your publications if it helps your research:

@article{shen2020simple,
  title={A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation},
  author={Shen, Dinghan and Zheng, Mingzhi and Shen, Yelong and Qu, Yanru and Chen, Weizhu},
  journal={arXiv preprint arXiv:2009.13818},
  year={2020}
}

