Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

Last update: Nov 01, 2022

Related tags

Overview

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization

This repo is for our paper "Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization". Our program is building on top of the Huggingface transformers framework. You can refer to their repo at: https://github.com/huggingface/transformers/tree/master/examples/seq2seq.

Local Setup

Tested with Python 3.7 via virtual environment. Clone the repo, go to the repo folder, setup the virtual environment, and install the required packages:

$ python3.7 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Install `apex`

Based on the recommendation from HuggingFace, both finetuning and eval are 30% faster with --fp16. For that you need to install apex.

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Data

Create a directory for data used in this work named data:

$ mkdir data

CNN/DM

$ wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
$ tar -xzvf cnn_dm_v2.tgz
$ mv cnn_cln data/cnndm

XSUM

$ wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
$ tar -xzvf xsum.tar.gz
$ mv xsum data/xsum

Generate Augmented Dataset

$ python generate_augmentation.py \
    --dataset xsum \
    --n 5 \
    --augmentation1 randomdelete \
    --augmentation2 randomswap

Training

CNN/DM

Our model is warmed up using sshleifer/distilbart-cnn-12-6:

$ DATA_DIR=./data/cnndm-augmented/RandominsertionRandominsertion-NumSent-3
$ OUTPUT_DIR=./log/cnndm

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-cnn-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder_layer 6 \
  --prediction_loss_only \
  --fp16

XSUM

$ DATA_DIR=./data/xsum-augmented/RandomdeleteRandomswap-NumSent-3
$ OUTPUT_DIR=./log/xsum

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-xsum-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder \
  --prediction_loss_only \
  --fp16

Evaluation

We have released the following checkpoints for pre-trained models as described in the paper:

CNN/DM:
XSUM:

CNN/DM

CNN/DM requires an extra postprocessing step.

$ export DATA=cnndm
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py \
    --model_name sshleifer/distilbart-cnn-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR
    
$ python postprocess_cnndm.py \
    --src_file $OUTPUT_DIR/test_generations.txt \
    --tgt_file $DATA_DIR/test.target

XSUM

$ export DATA=xsum
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=3  run_distributed_eval.py \
    --model_name sshleifer/distilbart-xsum-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR

Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

Related tags

Overview

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization

Local Setup

Install `apex`

Data

CNN/DM

XSUM

Generate Augmented Dataset

Training

CNN/DM

XSUM

Evaluation

CNN/DM

XSUM

Owner

Rachel Zheng

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

PyTorch impelementations of BERT-based Spelling Error Correction Models.

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

NLP made easy

Natural Language Processing with transformers

Code and data accompanying Natural Language Processing with PyTorch

Open-World Entity Segmentation

Dust model dichotomous performance analysis

Transformer Based Korean Sentence Spacing Corrector

Speach Recognitions

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Sentiment Analysis Project using Count Vectorizer and TF-IDF Vectorizer

TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text.

Generating Korean Slogans with phonetic and structural repetition

CJK computer science terms comparison / 中日韓電腦科學術語對照 / 日中韓のコンピュータ科学の用語対照 / 한·중·일 전산학 용어 대조

Text editor on python tkinter to convert english text to other languages with the help of ployglot.

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

结巴中文分词

Repo for Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

Related tags

Overview

ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for AbstractiveText Summarization

Local Setup

Install apex

Data

CNN/DM

XSUM

Generate Augmented Dataset

Training

CNN/DM

XSUM

Evaluation

CNN/DM

XSUM

Owner

Rachel Zheng

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

PyTorch impelementations of BERT-based Spelling Error Correction Models.

Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

LV-BERT: Exploiting Layer Variety for BERT (Findings of ACL 2021)

NLP made easy

Natural Language Processing with transformers

Code and data accompanying Natural Language Processing with PyTorch

Open-World Entity Segmentation

Dust model dichotomous performance analysis

Transformer Based Korean Sentence Spacing Corrector

Speach Recognitions

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Sentiment Analysis Project using Count Vectorizer and TF-IDF Vectorizer

TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

This project uses word frequency and Term Frequency-Inverse Document Frequency to summarize a text.

Generating Korean Slogans with phonetic and structural repetition

CJK computer science terms comparison / 中日韓電腦科學術語對照 / 日中韓のコンピュータ科学の用語対照 / 한·중·일 전산학 용어 대조

Text editor on python tkinter to convert english text to other languages with the help of ployglot.

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

结巴中文分词

Install `apex`