Code for text augmentation method leveraging large-scale language models

Overview

HyperMix

Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation.

Getting Started

Installing Packages

The main depedencies can be installed via pip install -r requirements.txt.

Usage

The main code is run through main.py. Check out --help for full list of commands.

python main.py --help

The code will automatically use the first GPU device, if detected.

A typical command to run BERT-base 10 times on the 1% subsample set of the SST-2 dataset and computing the average of all run is as follows.

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter none \
    --save-dir out

The script will create a directory named out in the current working directory and save the script log as out/run.log. It will also save any augmentations created during the experiments (if any augmentation is enabled).

To test GPT3Mix, prepare an OpenAI API key as described at the bottom of this README file, then use the following command:

python main.py --datasets sst2 \
    --train-subsample 0.01f \
    --classifier transformers \
    --model-name bert-base-uncased \
    --num-trials 1 \
    --augmenter gpt3-mix \
    --save-dir out

Managing Seeds

In the command above, the script will automatically generate seeds for sampling data and optimizing models. The seed used to generate each individual seed is called "master seed" and can be set using --master-data-seed and --master-exp-seed options. As evident from the option names, they are responsible for sampling data and optimizing a freshly initialized models respectively.

Sometimes, we need to manually set the seeds and not rely on automatically generated seeds from the master seeds. Manually seeding can be achieved via --data-seeds option. If this option is given, the master data seed will be ignored. We only support manualy data seeding for now.

OpenAI Key

Store OpenAI API Key under the current working directory as a file named openai-key. When running the main script, it will automatically detect the api key.

API keys can be provided to the script by --api-key option (not recommended) or from a file named openai-key in the current working directory.

Other Notes

At the moment we only support data augmentation leveraging OpenAI GPT-3 (GPT3Mix), but we will release an update that supports HyperCLOVA as soon as it becomes available to the public (HyperMix).

Citation

To cite our code or work, please use the following bibtex:

@inproceedings{yoo2021gpt3mix,
	title = "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation",
	author = "Yoo, Kang Min  and
	  Park, Dongju  and
	  Kang, Jaewook  and
	  Lee, Sang-Woo  and
	  Park, Woomyoung",
	booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
	month = nov,
	year = "2021",
	publisher = "Association for Computational Linguistics",
	url = "https://aclanthology.org/2021.findings-emnlp.192",
	pages = "2225--2239",
}
Owner
NAVER AI
Official account of NAVER AI, Korea No.1 Industrial AI Research Group
NAVER AI
STT for TorchScript is a port of Coqui STT based on DeepSpeech to PyTorch.

st3 STT for TorchScript is a port of Coqui STT based on DeepSpeech to PyTorch. Currently it supports converting pbmm models to pt scripts with integra

Vlad Ki 8 Oct 18, 2021
LSTC: Boosting Atomic Action Detection with Long-Short-Term Context

LSTC: Boosting Atomic Action Detection with Long-Short-Term Context This Repository contains the code on AVA of our ACM MM 2021 paper: LSTC: Boosting

Tencent YouTu Research 9 Oct 11, 2022
spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines spaCy-wrap is minimal library intended for wrapping fine-tuned transformers from t

Kenneth Enevoldsen 32 Dec 29, 2022
ConvBERT: Improving BERT with Span-based Dynamic Convolution

ConvBERT Introduction In this repo, we introduce a new architecture ConvBERT for pre-training based language model. The code is tested on a V100 GPU.

YITUTech 237 Dec 10, 2022
Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Sentiment Classification using WSD, Maximum Entropy & Naive Bayes Classifiers

Pulkit Kathuria 173 Jan 04, 2023
Tools and data for measuring the popularity & growth of various programming languages.

growth-data Tools and data for measuring the popularity & growth of various programming languages. Install the dependencies $ pip install -r requireme

3 Jan 06, 2022
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
Korea Spell Checker

한국어 문서 koSpellPy Korean Spell checker How to use Install pip install kospellpy Use from kospellpy import spell_init spell_checker = spell_init() # d

kangsukmin 2 Oct 20, 2021
Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow.

Malaya-Speech is a Speech-Toolkit library for bahasa Malaysia, powered by Deep Learning Tensorflow. Documentation Proper documentation is available at

HUSEIN ZOLKEPLI 151 Jan 05, 2023
We have built a Voice based Personal Assistant for people to access files hands free in their device using natural language processing.

Voice Based Personal Assistant We have built a Voice based Personal Assistant for people to access files hands free in their device using natural lang

Rushabh 2 Nov 13, 2021
CMeEE 数据集医学实体抽取

医学实体抽取_GlobalPointer_torch 介绍 思想来自于苏神 GlobalPointer,原始版本是基于keras实现的,模型结构实现参考现有 pytorch 复现代码【感谢!】,基于torch百分百复现苏神原始效果。 数据集 中文医学命名实体数据集 点这里申请,很简单,共包含九类医学

85 Dec 28, 2022
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
Simple virtual assistant using pyttsx3 and speech recognition optionally with pywhatkit and pther libraries.

VirtualAssistant Simple virtual assistant using pyttsx3 and speech recognition optionally with pywhatkit and pther libraries. Third Party Libraries us

Logadheep 1 Nov 27, 2021
Klexikon: A German Dataset for Joint Summarization and Simplification

Klexikon: A German Dataset for Joint Summarization and Simplification Dennis Aumiller and Michael Gertz Heidelberg University Under submission at LREC

Dennis Aumiller 8 Jan 03, 2023
The first online catalogue for Arabic NLP datasets.

Masader The first online catalogue for Arabic NLP datasets. This catalogue contains 200 datasets with more than 25 metadata annotations for each datas

ARBML 94 Dec 26, 2022
This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Joseph Imperial 1 Oct 05, 2021
Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning

GenSen Learning General Purpose Distributed Sentence Representations via Large Scale Multi-task Learning Sandeep Subramanian, Adam Trischler, Yoshua B

Maluuba Inc. 309 Oct 19, 2022
Example code for "Real-World Natural Language Processing"

Real-World Natural Language Processing This repository contains example code for the book "Real-World Natural Language Processing." AllenNLP (2.5.0 or

Masato Hagiwara 303 Dec 17, 2022
A demo of chinese asr

chinese_asr_demo 一个端到端的中文语音识别模型训练、测试框架 具备数据预处理、模型训练、解码、计算wer等等功能 训练数据 训练数据采用thchs_30,

4 Dec 09, 2021
NeurIPS'21: Probabilistic Margins for Instance Reweighting in Adversarial Training (Pytorch implementation).

source code for NeurIPS21 paper robabilistic Margins for Instance Reweighting in Adversarial Training

9 Dec 20, 2022