This repository contains the code for our paper VDA (public in EMNLP2021 main conference)

Last update: Aug 06, 2022

Related tags

Deep Learning VDA

Overview

Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models

This repository contains the code for our paper VDA (public in EMNLP2021 main conference)

Overview

We propose a general framework Virtual Data Augmentation (VDA) for robustly fine-tuning Pre-trained Language Models for downstream tasks. Our VDA utilizes a masked language model with Gaussian noise to augment virtual examples for improving the robustness, and also adopts regularized training to further guarantee the semantic relevance and diversity.

Train VDA

In the following section, we describe how to train a model with VDA by using our code.

Training

Data

For evaluation of our VDA, we use 6 text classification datasets, i.e. Yelp, IMDB, AGNews, MR, QNLI and MRPC datasets. These datasets can be downloaded from the GoogleDisk

After download the two ziped files, users should unzip the data fold that contains the training, validation and test data of the 6 datasets. While the Robust fold contains the examples for test the robustness.

Training scripts We public our VDA with 4 base models. For single sentence classification tasks, we use text_classifier_xxx.py files. While for sentence pair classification tasks, we use text_pair_classifier_xxx.py:

text_classifier.py and text_pair_classifier.py: BERT-base+VDA
text_classifier_freelb.py and text_pair_classifier_freelb.py: FreeLB+VDA on BERT-base
text_classifier_smart.py and text_pair_classifier_smart.py: SMART+VDA on BERT-base, where we only use the smooth-inducing adversarial regularization.
text_classifier_smix.py and text_pair_classifier_smix.py: Smix+VDA on BERT-base, where we remove the adversarial data augmentation for fair comparison

We provide example scripts for both training and test of our VDA on the 6 datasets. In run_train.sh, we provide 6 example for training on the yelp and qnli datasets. This script calls text_classifier_xxx.py for training (xxx refers to the base model). We explain the arguments in following:

--dataset: Training file path.
--mlm_path: Pre-trained checkpoints to start with. For now we support BERT-based models (bert-base-uncased, bert-large-uncased, etc.)
--save_path: Saved fine-tuned checkpoints file.
--max_length: Max sequence length. (For Yelp/IMDB/AG, we use 512. While for MR/QNLI/MRPC, we use 256.)
--max_epoch: The maximum training epoch number. (In most of datasets and models, we use 10.)
--batch_size: The batch size. (We adapt the batch size to the maximum number w.r.t the GPU memory size. Note that too small number may cause model collapse.)
--num_label: The number of labels. (For AG, we use 4. While for other, we use 2.)
--lr: Learning rate.
--num_warmup: The rate of warm-up steps.
--variance: The variance of the Gaussian noise.

For results in the paper, we use Nvidia Tesla V100 32G and Nvidia 3090 24G GPUs to train our models. Using different types of devices or different versions of CUDA/other softwares may lead to slightly different performance.

Evaluation

During training, our model file will show the original accuracy on the test set of the 6 datasets, which evaluates the accuracy performance of our model. Our evaluation code for robustness is based on a modified version of BERT-Attack. It outputs Attack Accuracy, Query Numbers and Perturbation Ratio metrics.

Before evaluation, please download the evaluation datasets for Robustness from the GoogleDisk. Then, following the commonly-used settings, users need to download and process consine similarity matrix following TextFooler.

Based on the checkpoint of the fine-tuned models, we use therun_test.sh script for test the robustness on yelp and qnli datasets. It is based on bert_robust.py file. We explain the arguments in following:

--data_path: Training file path.
--mlm_path: Pre-trained checkpoints to start with. For now we support BERT-based models (bert-base-uncased, bert-large-uncased, etc.)
--tgt_path: The fine-tuned checkpoints file.
--num_label: The number of labels. (For AG, we use 4. While for other, we use 2.)

which is expected to output the results as:

original accuracy is 0.960000, attack accuracy is 0.533333, query num is 687.680556, perturb rate is 0.177204

Citation

Please cite our paper if you use VDA in your work:

@inproceedings{zhou2021vda,
  author    = {Kun Zhou, Wayne Xin Zhao, Sirui Wang, Fuzheng Zhang, Wei Wu and Ji-Rong Wen},
  title     = {Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models},
  booktitle = {{EMNLP} 2021},
  publisher = {The Association for Computational Linguistics},
}

This repository contains the code for our paper VDA (public in EMNLP2021 main conference)

Related tags

Overview

Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models

Quick Links

Overview

Train VDA

Training

Evaluation

Citation

Owner

RUCAIBox

Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

The code for paper "Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation" which is accepted by AAAI 2022

Using Streamlit to host a multi-page tool with model specs and classification metrics, while also accepting user input values for prediction.

Kaggle Ultrasound Nerve Segmentation competition [Keras]

基于PaddleOCR搭建的OCR server... 离线部署用

Implement of "Training deep neural networks via direct loss minimization" in PyTorch for 0-1 loss

Gapmm2: gapped alignment using minimap2 (align transcripts to genome)

A cool little repl-based simulation written in Python

This repository contains the implementations related to the experiments of a set of publicly available datasets that are used in the time series forecasting research space.

Implementation of a Transformer, but completely in Triton

[PNAS2021] The neural architecture of language: Integrative modeling converges on predictive processing

Python implementation of ADD: Frequency Attention and Multi-View based Knowledge Distillation to Detect Low-Quality Compressed Deepfake Images, AAAI2022.

LSTM built using Keras Python package to predict time series steps and sequences. Includes sin wave and stock market data

Implementation of experiments in the paper Clockwork Variational Autoencoders (project website) using JAX and Flax

68 keypoint annotations for COFW test data

An Implementation of Fully Convolutional Networks in Tensorflow.

Code base for NeurIPS 2021 publication titled Kernel Functional Optimisation (KFO)

Model Agnostic Interpretability for Multiple Instance Learning

curl-impersonate: A special compilation of curl that makes it impersonate Chrome & Firefox

Unofficial PyTorch implementation of the Adaptive Convolution architecture for image style transfer