
ESACL: Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization

This repo is for our paper "Enhanced Seq2Seq Autoencoder via Contrastive Learning for Abstractive Text Summarization". Our implementation is built on top of the Hugging Face transformers framework; see their seq2seq examples at: https://github.com/huggingface/transformers/tree/master/examples/seq2seq.

Local Setup

Tested with Python 3.7 in a virtual environment. Clone the repo, go to the repo folder, set up the virtual environment, and install the required packages:

$ python3.7 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt

Install apex

Per the recommendation from Hugging Face, both fine-tuning and evaluation are about 30% faster with --fp16. To use it, you need to install apex:

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Data

Create a directory named data for the datasets used in this work:

$ mkdir data

CNN/DM

$ wget https://cdn-datasets.huggingface.co/summarization/cnn_dm_v2.tgz
$ tar -xzvf cnn_dm_v2.tgz
$ mv cnn_cln data/cnndm

XSUM

$ wget https://cdn-datasets.huggingface.co/summarization/xsum.tar.gz
$ tar -xzvf xsum.tar.gz
$ mv xsum data/xsum

Generate Augmented Dataset

$ python generate_augmentation.py \
    --dataset xsum \
    --n 5 \
    --augmentation1 randomdelete \
    --augmentation2 randomswap
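
The two --augmentation flags name the sentence-level operations used to build the two independently augmented views that the contrastive objective compares, and --n presumably controls how many sentences each operation touches. Below is a minimal sketch of what random delete and random swap typically mean at the sentence level; the function names and signatures are illustrative assumptions, not the actual API of generate_augmentation.py:

import random

def random_delete(sentences, n=3):
    # drop up to n randomly chosen sentences, keeping at least one
    sents = list(sentences)
    for _ in range(min(n, len(sents) - 1)):
        sents.pop(random.randrange(len(sents)))
    return sents

def random_swap(sentences, n=3):
    # swap the positions of two randomly chosen sentences, n times
    sents = list(sentences)
    for _ in range(n):
        i, j = random.sample(range(len(sents)), 2)
        sents[i], sents[j] = sents[j], sents[i]
    return sents

doc = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
view1 = random_delete(doc, n=3)  # one augmented view of the document
view2 = random_swap(doc, n=3)    # a second, independently augmented view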

Training

CNN/DM

Our model is warm-started from sshleifer/distilbart-cnn-12-6:

$ DATA_DIR=./data/cnndm-augmented/RandominsertionRandominsertion-NumSent-3
$ OUTPUT_DIR=./log/cnndm

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-cnn-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder_layer 6 \
  --prediction_loss_only \
  --fp16
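
In this command, --alpha weights the contrastive term relative to the standard summarization cross-entropy, and --temperature scales the similarity logits of the contrastive objective. As a minimal sketch of an NT-Xent-style loss consistent with these two hyperparameters (the tensor names, pooling, and reduction here are assumptions, not the actual code in cl_finetune_trainer.py):

import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.5):
    # z1, z2: (batch, dim) pooled encoder representations of the two
    # augmented views; matching rows are positives, all others negatives
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, dim)
    sim = z @ z.t() / temperature                       # (2B, 2B) similarity logits
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    B = z1.size(0)
    # for row i the positive sits at i + B (and vice versa)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# combined objective, with alpha and temperature as in the command above:
# loss = ce_loss + alpha * contrastive_loss(z1, z2, temperature=0.5)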

XSUM

$ DATA_DIR=./data/xsum-augmented/RandomdeleteRandomswap-NumSent-3
$ OUTPUT_DIR=./log/xsum

$ python -m torch.distributed.launch --nproc_per_node=3  cl_finetune_trainer.py \
  --data_dir $DATA_DIR \
  --output_dir $OUTPUT_DIR \
  --learning_rate=5e-7 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --freeze_embeds \
  --save_total_limit 10 \
  --save_steps 1000 \
  --logging_steps 1000 \
  --num_train_epochs 5 \
  --model_name_or_path sshleifer/distilbart-xsum-12-6 \
  --alpha 0.2 \
  --temperature 0.5 \
  --freeze_encoder \
  --prediction_loss_only \
  --fp16

Evaluation

We have released checkpoints for the pre-trained models described in the paper.

CNN/DM

CNN/DM requires an extra postprocessing step.

$ export DATA=cnndm
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=2  run_distributed_eval.py \
    --model_name sshleifer/distilbart-cnn-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR
    
$ python postprocess_cnndm.py \
    --src_file $OUTPUT_DIR/test_generations.txt \
    --tgt_file $DATA_DIR/test.target
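
A common reason for this extra step is the CNN/DM ROUGE convention of placing one sentence per line, so that ROUGE-L is computed over sentence splits. Below is a minimal sketch of that kind of postprocessing and scoring, assuming (not verified against postprocess_cnndm.py) nltk for sentence splitting and the rouge_score package for metrics:

import nltk
from rouge_score import rouge_scorer

nltk.download("punkt", quiet=True)

def add_newlines(text):
    # CNN/DM convention: one sentence per line so rougeLsum splits correctly
    return "\n".join(nltk.sent_tokenize(text.strip()))

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"],
                                  use_stemmer=True)

with open("output/cnndm/test_generations.txt") as f_src, \
     open("data/cnndm/test.target") as f_tgt:
    scores = [scorer.score(add_newlines(ref), add_newlines(hyp))
              for hyp, ref in zip(f_src, f_tgt)]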

XSUM

$ export DATA=xsum
$ export DATA_DIR=data/$DATA
$ export CHECKPOINT_DIR=./log/$DATA
$ export OUTPUT_DIR=output/$DATA

$ python -m torch.distributed.launch --nproc_per_node=3  run_distributed_eval.py \
    --model_name sshleifer/distilbart-xsum-12-6  \
    --save_dir $OUTPUT_DIR \
    --data_dir $DATA_DIR \
    --bs 16 \
    --fp16 \
    --use_checkpoint \
    --checkpoint_path $CHECKPOINT_DIR