Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Last update: Dec 31, 2022

Overview

Data Augmentation using Pre-trained Transformer Models


The code implements the following data augmentation methods:

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)

DataSets

In the paper, we use three datasets from the following resources

Low-data regime experiment setup

Run src/utils/download_and_prepare_datasets.sh to prepare all datasets.
download_and_prepare_datasets.sh performs the following steps:

Downloads the data from GitHub
Replaces numeric labels with text for the STSA-2 and TREC datasets
Creates 15 random splits of train and dev data for each dataset
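
The last two steps can be sketched in Python as follows. This is a minimal illustration, not the logic of download_and_prepare_datasets.sh: the function names, the label mapping, the seeded shuffle, and the dev_fraction parameter are all assumptions made for the example.

```python
import random


def replace_numeric_labels(examples, label_names):
    """Map numeric labels to text labels.

    `examples` is a list of (label, text) pairs; `label_names` is a
    hypothetical mapping such as {0: "negative", 1: "positive"}.
    """
    return [(label_names[label], text) for label, text in examples]


def make_random_splits(examples, num_splits=15, dev_fraction=0.1, base_seed=0):
    """Return `num_splits` (train, dev) pairs, each from its own seeded shuffle.

    The sampling scheme (shuffle, then cut off a fixed dev fraction) is an
    assumption for illustration; the actual script may sample differently.
    """
    splits = []
    for i in range(num_splits):
        rng = random.Random(base_seed + i)  # one reproducible seed per split
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_dev = max(1, int(len(shuffled) * dev_fraction))
        splits.append((shuffled[n_dev:], shuffled[:n_dev]))
    return splits
```

Seeding each split separately keeps the 15 splits reproducible across runs while still differing from one another.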

Dependencies

To run this code, you need the following dependencies:

PyTorch 1.5
fairseq 0.9
transformers 2.9
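
Before running the experiments, it can help to confirm that the installed versions match the ones listed above. The sketch below is not part of the repository. It assumes Python 3.8 or later (for importlib.metadata) and that the packages are installed under the distribution names torch, fairseq, and transformers; the function names are hypothetical.

```python
from importlib import metadata

# Versions listed in this README, keyed by assumed distribution name.
REQUIRED = {"torch": "1.5", "fairseq": "0.9", "transformers": "2.9"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_dependencies(required=REQUIRED, get_version=installed_version):
    """Return a list of human-readable problems; an empty list means all match."""
    problems = []
    for package, wanted in required.items():
        found = get_version(package)
        if found is None:
            problems.append(f"{package} is not installed (expected {wanted}.x)")
        elif not (found == wanted or found.startswith(wanted + ".")):
            problems.append(f"{package} {found} found (expected {wanted}.x)")
    return problems


if __name__ == "__main__":
    for line in check_dependencies() or ["All listed dependencies match."]:
        print(line)
```

The check compares only the major and minor version, so a patch release such as 1.5.1 is accepted.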

How to run

To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder. For example, to run data augmentation on the SNIPS dataset:

run scripts/bart_snips_lower.sh for the BART experiment
run scripts/bert_snips_lower.sh for the rest of the data augmentation methods

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reach out to [email protected] with any questions related to this code.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.
