Use AutoModelForSeq2SeqLM in Hugging Face Transformers to train COMET

Overview

Training COMET in a sequence-to-sequence (seq2seq) setting

Use AutoModelForSeq2SeqLM in Hugging Face Transformers to train COMET. The code is modified from run_summarization.py in the official example scripts for transformers version 4.16.0.dev0.

The ./deepspeed/ folder is copied from https://github.com/huggingface/transformers/tree/master/tests/deepspeed .

The ATOMIC2020 training data can be downloaded at https://allenai.org/data/atomic-2020. You need to convert the .tsv files to .csv to be compatible with the dataloader in transformers; a conversion sketch is given below.
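A minimal conversion sketch (assuming the .tsv rows are tab-separated head/relation/tail triples, and that the CSV should expose the head_event and tail_event columns referenced by --text_column/--summary_column below; adjust it to your actual file layout and prompt format):

# Hypothetical preprocessing sketch: convert an ATOMIC2020 .tsv split into the
# CSV layout expected by the --text_column/--summary_column arguments below.
# The column layout and the head+relation concatenation are assumptions; check
# your .tsv (e.g. whether it has a header row) before using this.
import pandas as pd

df = pd.read_csv("train.tsv", sep="\t", header=None, names=["head", "relation", "tail"])

out = pd.DataFrame({
    "head_event": df["head"].astype(str) + " " + df["relation"].astype(str),
    "tail_event": df["tail"].astype(str),
})
out = out[out["tail_event"].str.lower() != "none"]  # optionally drop empty "none" tails
out.to_csv("train.csv", index=False)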

Dependencies

Python

torch==1.7.1
cudatoolkit=11.0
transformers==4.15.0
deepspeed==0.5.10

Others

GCC/G++ 5.2.0 (to compile DeepSpeed ops)
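A quick, optional sanity check that the environment matches the versions above and that the GPUs are visible (illustrative only, not part of the repository):

# Optional environment check before launching training.
import torch
import transformers
import deepspeed

print("torch:", torch.__version__)                # expected: 1.7.1
print("transformers:", transformers.__version__)  # expected: 4.15.0
print("deepspeed:", deepspeed.__version__)        # expected: 0.5.10
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())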

Usage

1. Normal training without memory optimization:

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 
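With --per_device_train_batch_size 8 and --gradient_accumulation_steps 4 on a single GPU, each optimizer step covers 32 examples, which is the 32ex unit used in the timing tables below.

To make the length arguments concrete, here is a rough sketch of how a run_summarization.py-style script tokenizes one CSV row with --max_source_length 16 and --max_target_length 18 (the example texts are hypothetical, and the exact prompt format depends on how the CSV was built):

# Rough tokenization sketch for one CSV row (transformers 4.15-style target tokenization).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

head_event = "PersonX goes to the store xIntent"  # hypothetical source text
tail_event = "to buy food"                        # hypothetical target text

model_inputs = tokenizer(head_event, max_length=16, truncation=True)
with tokenizer.as_target_tokenizer():             # tokenize the target/label side
    labels = tokenizer(tail_event, max_length=18, truncation=True)
model_inputs["labels"] = labels["input_ids"]

print(model_inputs)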

2. Train with gradient_checkpointing=True. This reduces memory usage at the cost of training speed (see the sketch after the command).

CUDA_VISIBLE_DEVICES=0 python models/comet_seq2seq.py \
    --model_name_or_path t5-small \
    --do_train \
    --train_file /path/to/train.csv \
    --source_prefix "" \
    --output_dir data/models/t5-small \
    --overwrite_output_dir \
    --gradient_accumulation_steps=4 \
    --per_device_train_batch_size=8 \
    --per_device_eval_batch_size=4 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --gradient_checkpointing
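The --gradient_checkpointing flag is consumed by the Trainer; outside the Trainer, roughly the same effect can be obtained directly on the model (a sketch against the transformers 4.15 API):

# Sketch: enable gradient checkpointing on the model itself.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.gradient_checkpointing_enable()  # recompute activations in the backward pass to save memory
model.config.use_cache = False         # the decoder cache is not useful while checkpointing during training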

3. Train with DeepSpeed (either ZeRO stage 2 or stage 3)

# google/t5-3B training, on 2080Ti (11GB)
deepspeed --include localhost:0,1 --master_port 30000 models/comet_seq2seq.py \
    --deepspeed deepspeed/ds_config_zero2.json \
    --model_name_or_path google/t5-xl-lm-adapt \
    --do_train \
    --train_file data/kg/atomic2020_data-feb2021/train.csv \
    --source_prefix "" \
    --output_dir data/models/comet/t5_xl_s2_bs32_fp16 \
    --overwrite_output_dir \
    --gradient_accumulation_steps=1 \
    --per_device_train_batch_size=16 \
    --max_source_length 16 \
    --max_target_length 18 \
    --text_column head_event --summary_column tail_event \
    --save_strategy epoch \
    --num_train_epochs 3 \
    --learning_rate 1e-5 \
    --fp16
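Note that --per_device_train_batch_size 16 on two GPUs with no gradient accumulation again gives an effective batch of 32 examples, matching the 32ex unit in the tables below.

After any of the commands above finishes, the checkpoint written to --output_dir can be loaded back with AutoModelForSeq2SeqLM to generate tail events. A minimal generation sketch (the path and the query string are placeholders; the query must follow the same format used to build head_event during preprocessing):

# Minimal generation sketch for a trained checkpoint; paths and prompts are placeholders.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "data/models/t5-small"  # any --output_dir used above
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt).eval()

head_event = "PersonX goes to the store xIntent"  # hypothetical query
inputs = tokenizer(head_event, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=18, num_beams=5, num_return_sequences=5)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))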

4. Comparison of memory usage across different optimization methods

Memory usage is compared on an NVIDIA RTX A6000 (48685 MB) and an NVIDIA GeForce RTX 3090 (24268 MB). The first column in each table names the training setup (vanilla, grad-ckpt, or a DeepSpeed ZeRO stage).

1. fp16

T5-3B: effect of fp16. In the run below, fp16 reduces memory usage from 47.5k MB to 31k MB (roughly a third).

| Method  | Device | fp16  | Batch Size x Grad-Accum x Num-GPU | Memory Usage | Time to Train a Batch |
|---------|--------|-------|-----------------------------------|--------------|-----------------------|
| vanilla | A6000  | False | 8x4x1                             | 47.5k MB     | 1.5s/32ex             |
| vanilla | A6000  | True  | 8x4x1                             | 31k MB       | 1.0s/32ex             |
| vanilla | 3090   | False | 1x32x1                            | OOM          | -                     |
| vanilla | 3090   | True  | 1x32x1                            | OOM          | -                     |

2. gradient_checkpointing

T5-3B: Effects of gradient_checkpointing.

| Method    | Device | fp16  | Batch Size x Grad-Accum x Num-GPU | Memory Usage | Time to Train a Batch |
|-----------|--------|-------|-----------------------------------|--------------|-----------------------|
| vanilla   | A6000  | False | 8x4x1                             | 47k MB       | 1.5s/32ex             |
| vanilla   | A6000  | True  | 8x4x1                             | 31k MB       | 1.0s/32ex             |
| grad-ckpt | A6000  | False | 8x4x1                             | 46.4k MB     | 1.3s/32ex             |
| grad-ckpt | A6000  | True  | 8x4x1                             | 23.9k MB     | 1.1s/32ex             |
| vanilla   | 3090   | True  | 1x32x1                            | OOM          | -                     |
| grad-ckpt | 3090   | True  | 1x32x1                            | 23.8k MB     | 15s/32ex              |

3. Deepspeed stage 2

T5-3B: Effects of deepspeed.

| Method    | Device | fp16 | Batch Size x Grad-Accum x Num-GPU | Memory Usage | Time to Train a Batch |
|-----------|--------|------|-----------------------------------|--------------|-----------------------|
| vanilla   | 3090   | True | 1x32x1                            | OOM          | -                     |
| grad-ckpt | 3090   | True | 1x32x1                            | 23k MB       | 13.5s/32ex            |
| stage2    | 3090   | True | 32x1x1                            | 20.3k MB     | 7.5s/32ex             |
| stage2    | 3090   | True | 16x1x2                            | 20.3k MB     | 6.36s/32ex            |
| stage2    | 3090   | True | 32x1x2                            | 20.3k MB     | 3.75s/32ex            |

4. Deepspeed stage 3

Stage 3 reduces memory usage further, but training is considerably slower.

5. Automatic Evaluation Result on ATOMIC2020 data

| Model                                  | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|----------------------------------------|--------|--------|--------|--------|--------|---------|-------|
| T5-3B (no deepspeed), lr 1e-5, epoch 3 | 0.346  | 0.184  | 0.12   | 0.084  | 0.19   | 0.422   | 0.646 |
| T5-3B (no deepspeed), lr 1e-5, epoch 2 | 0.348  | 0.185  | 0.121  | 0.085  | 0.19   | 0.424   | 0.651 |
| T5-3B (no deepspeed), lr 1e-5, epoch 1 | 0.343  | 0.177  | 0.113  | 0.079  | 0.186  | 0.416   | 0.629 |
| T5-3B (ds_stage2, fp16), epoch 3       | 0.340  | 0.182  | 0.118  | 0.083  | 0.189  | 0.418   | 0.637 |
| T5-3B (ds_stage2, fp16), epoch 2       | 0.337  | 0.177  | 0.114  | 0.078  | 0.189  | 0.419   | 0.633 |
| T5-3B (ds_stage2, fp16), epoch 1       | 0.335  | 0.174  | 0.112  | 0.076  | 0.186  | 0.415   | 0.632 |
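COMET-style evaluation typically scores each generated tail against all gold tails for the same head/relation pair. As a rough illustration of the BLEU part only (NLTK corpus BLEU with smoothing; the table above may have been produced with different evaluation tooling):

# Illustrative BLEU-1/BLEU-2 computation only; reference handling here is simplified.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# For each example: all gold tails as references, one generated tail as the hypothesis.
references = [[["to", "buy", "food"], ["to", "get", "groceries"]]]
hypotheses = [["to", "buy", "groceries"]]

smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, hypotheses, weights=(1, 0, 0, 0), smoothing_function=smooth)
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth)
print(f"BLEU-1: {bleu1:.3f}  BLEU-2: {bleu2:.3f}")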

Useful discussions regarding environment setups

TODO

DeepSpeed without Trainer(): https://huggingface.co/docs/transformers/main_classes/deepspeed#deepspeed-non-trainer-integration
