Code for EMNLP 2021 main conference paper "Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification"

Last update: Jan 03, 2023

Text-AutoAugment (TAA)

This repository contains the code for our paper Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification (EMNLP 2021 main conference).

Overview

We present a learnable and compositional framework for data augmentation. Our proposed algorithm automatically searches for the optimal compositional policy, which improves the diversity and quality of augmented samples.
In low-resource and class-imbalanced regimes of six benchmark datasets, TAA significantly improves the generalization ability of deep neural networks like BERT and effectively boosts text classification performance.
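
To make "compositional policy" concrete, here is a minimal sketch. It assumes a policy is an ordered list of operations, each with a type, an application probability, and a magnitude. The two toy operations and every name below (`Operation`, `apply_policy`, `random_delete`, `random_swap`) are hypothetical illustrations, not this repository's API, and the policy search itself is not shown.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


# Hypothetical toy operations; the repository's real operations may differ.
def random_delete(tokens: List[str], magnitude: float, rng: random.Random) -> List[str]:
    """Drop each token with probability `magnitude`, keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= magnitude]
    return kept or tokens[:1]


def random_swap(tokens: List[str], magnitude: float, rng: random.Random) -> List[str]:
    """Swap random token pairs; the number of swaps scales with `magnitude`."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(int(magnitude * len(out))):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


@dataclass
class Operation:
    fn: Callable[[List[str], float, random.Random], List[str]]
    prob: float       # probability of applying the operation
    magnitude: float  # strength of the edit, in [0, 1]


def apply_policy(text: str, policy: List[Operation], seed: int = 0) -> str:
    """Apply the operations in order, each with its own probability."""
    rng = random.Random(seed)
    tokens = text.split()
    for op in policy:
        if rng.random() < op.prob:
            tokens = op.fn(tokens, op.magnitude, rng)
    return " ".join(tokens)


# Assumed example values; a searched policy would supply these instead.
policy = [Operation(random_delete, 0.5, 0.1), Operation(random_swap, 0.5, 0.2)]
augmented = apply_policy("the movie was surprisingly good and well acted", policy)
```

Chaining operations is what makes the policy compositional: the probabilities and magnitudes are the knobs a search procedure would tune.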

Getting Started

Prepare environment

```
conda create -n taa python=3.6
conda activate taa
conda install pytorch torchvision cudatoolkit=10.0 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
pip install -r requirements.txt
python -c "import nltk; nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')"
```

Modify the dataroot parameter in confs/*.yaml and the abspath parameter in script/*.sh:
- e.g., in confs/bert_imdb.yaml, change dataroot: /home/renshuhuai/TextAutoAugment/data/aclImdb to dataroot: path-to-your-TextAutoAugment/data/aclImdb
- in script/imdb_lowresource.sh, change --abspath '/home/renshuhuai/TextAutoAugment' to --abspath 'path-to-your-TextAutoAugment'
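
If you prefer not to edit each file by hand, the path changes above could be scripted. The following is a minimal sketch. It assumes every config and script uses the same /home/renshuhuai/TextAutoAugment prefix shown in the two examples and that the configs end in .yaml. The helper names `rewrite_root` and `rewrite_files` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Assumption: this prefix appears in all confs/*.yaml and script/*.sh files.
OLD_ROOT = "/home/renshuhuai/TextAutoAugment"


def rewrite_root(text: str, new_root: str, old_root: str = OLD_ROOT) -> str:
    """Replace every occurrence of the original absolute root with `new_root`."""
    return text.replace(old_root, new_root.rstrip("/"))


def rewrite_files(repo_dir: str, new_root: str) -> list:
    """Rewrite confs/*.yaml and script/*.sh in place; return the files changed."""
    repo = Path(repo_dir)
    targets = sorted(list(repo.glob("confs/*.yaml")) + list(repo.glob("script/*.sh")))
    changed = []
    for path in targets:
        original = path.read_text()
        updated = rewrite_root(original, new_root)
        if updated != original:
            path.write_text(updated)
            changed.append(str(path))
    return changed
```

From the repository root, calling `rewrite_files(".", "path-to-your-TextAutoAugment")` would update the files in place, so check the result with `git diff` before running anything.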

Search for the best augmentation policy, e.g., in the low-resource regime for IMDB:
```
sh script/imdb_lowresource.sh
```
Scripts for policy search in the low-resource and class-imbalanced regimes for all datasets are provided in the script/ folder.

Train a model with the pre-searched policy in archive.py, e.g., in the low-resource regime for IMDB:
```
python train.py -c confs/bert_imdb.yaml 
```
Train a model on the full IMDB dataset:
```
python train.py -c confs/bert_imdb.yaml --train-npc -1 --valid-npc -1 --test-npc -1  
```
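
The `-1` values in the command above suggest that the npc options cap the number of examples per class, with `-1` meaning no cap. That reading is an assumption, and the sketch below only illustrates per-class subsampling under it. The function `sample_per_class` is hypothetical and is not the repository's data loader.

```python
import random
from collections import defaultdict
from typing import List, Tuple


def sample_per_class(
    examples: List[Tuple[str, int]], npc: int, seed: int = 0
) -> List[Tuple[str, int]]:
    """Keep at most `npc` examples per label; npc == -1 keeps everything."""
    if npc == -1:
        return list(examples)
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        # Classes with fewer than `npc` examples are kept whole.
        subset.extend(rng.sample(items, min(npc, len(items))))
    return subset


data = [("great film", 1), ("loved it", 1), ("fine", 1), ("boring", 0), ("awful", 0)]
low_resource = sample_per_class(data, npc=2)  # two examples per class
full = sample_per_class(data, npc=-1)         # the whole dataset
```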

Contact

If you have any questions related to the code or the paper, feel free to email Shuhuai (renshuhuai007 [AT] gmail [DOT] com).

Acknowledgments

Our code draws on fast-autoaugment.

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{ren2021taa,
  title={Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification},
  author={Shuhuai Ren and Jinchao Zhang and Lei Li and Xu Sun and Jie Zhou},
  booktitle={EMNLP},
  year={2021}
}

License

MIT

Owner

LancoPKU
