🍊 PAUSE (Positive and Annealed Unlabeled Sentence Embedding), accepted by EMNLP'2021 🌴

Overview

Sentence embedding refers to a set of effective and versatile techniques for converting raw text into numerical vector representations that can be used in a wide range of natural language processing (NLP) applications. The majority of these techniques are either supervised or unsupervised. Compared to the unsupervised methods, the supervised ones make fewer assumptions about optimization objectives and usually achieve better results. However, their training requires a large number of labeled sentence pairs, which are not available in many industrial scenarios. To that end, we propose a generic and end-to-end approach -- PAUSE (Positive and Annealed Unlabeled Sentence Embedding) -- that learns high-quality sentence embeddings from a partially labeled, i.e. positive-unlabeled (PU), dataset by jointly optimizing the supervised and PU loss. The main highlights of PAUSE include:

  • good sentence embeddings can be learned from datasets with only a few positive labels;
  • it can be trained in an end-to-end fashion;
  • it can be directly applied to any dual-encoder model architecture;
  • it is extended to scenarios with an arbitrary number of classes;
  • polynomial annealing of the PU loss is proposed to stabilize the training;
  • our experiments (reproduction steps are illustrated below) show that PAUSE consistently outperforms baseline methods.
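
The joint objective and the annealing idea in the highlights above can be illustrated with a small sketch. It is not the implementation in this repository: it assumes a generic non-negative PU risk estimator with a sigmoid surrogate loss, and a hypothetical polynomial schedule (polynomial_weight) whose direction and exponent are chosen only for illustration. The prior argument plays the role of the expected ratio of positive samples.

```python
import math


def sigmoid_loss(score, target):
    """Sigmoid surrogate loss; target is +1 (positive) or -1 (negative)."""
    return 1.0 / (1.0 + math.exp(target * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Generic non-negative PU risk estimate (an assumption, not the repo's code).

    pos_scores: model scores for labeled positive pairs
    unl_scores: model scores for unlabeled pairs
    prior:      assumed fraction of positives among all pairs
    """
    risk_pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    risk_pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    risk_unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Clamp the estimated negative-class risk at zero so it cannot go negative.
    neg_risk = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + neg_risk


def polynomial_weight(step, total_steps, power=2.0):
    """Hypothetical polynomial schedule that rises from 0 to 1 over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return frac ** power


def joint_loss(supervised_loss, pu_loss, step, total_steps):
    """Supervised loss plus the PU loss scaled by the annealing weight."""
    return supervised_loss + polynomial_weight(step, total_steps) * pu_loss
```

With this schedule the PU term contributes nothing at step 0 and its full value at the last step; the schedule actually used by PAUSE should be taken from the paper or the training code.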

This repository contains a TensorFlow implementation of PAUSE to reproduce the experimental results. If you use this repo in your work, please cite:

@inproceedings{cao2021pause,
  title={PAUSE: Positive and Annealed Unlabeled Sentence Embedding},
  author={Cao, Lele and Larsson, Emil and von Ehrenheim, Vilhelm and Cavalcanti Rocha, Dhiana Deva and Martin, Anna and Horn, Sonja},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2021},
  url={https://arxiv.org/abs/2109.03155}
}

Prerequisites

Create a virtual environment first to avoid breaking your native environment. If you use Anaconda, run

conda update conda
conda create --name py37-pause python=3.7
conda activate py37-pause

Then install the dependent libraries:

pip install -r requirements.txt

Unsupervised STS

Models are trained on a combination of the SNLI and Multi-Genre NLI datasets, which contain one million sentence pairs annotated with three labels: entailment, contradiction and neutral. The trained model is tested on the STS 2012-2016, STS benchmark, and SICK-Relatedness (SICK-R) datasets, which have labels between 0 and 5 indicating the semantic relatedness of sentence pairs.
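
How a relatedness label between 0 and 5 is compared with a model's output is not spelled out here. A common practice for STS, assumed in this sketch, is to score each sentence pair by the cosine similarity of its two embeddings and to report the Spearman rank correlation with the gold labels. The function names and toy vectors below are illustrative and are not taken from test_sts.py.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: three sentence pairs with made-up 2-d embeddings and gold labels.
pairs = [([1.0, 0.0], [1.0, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 2.5, 0.2]
predicted = [cosine_similarity(u, v) for u, v in pairs]
print(spearman(predicted, gold))
```

Because only the ranking matters for the Spearman correlation, the cosine scores do not need to be rescaled to the 0 to 5 range of the labels.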

Training

Example 1: train PAUSE-small using 5% of the labels for 10 epochs

python train_nli.py \
  --batch_size=1024 \
  --train_epochs=10 \
  --model=small \
  --pos_sample_prec=5

Example 2: train PAUSE-base using 30% of the labels for 20 epochs

python train_nli.py \
  --batch_size=1024 \
  --train_epochs=20 \
  --model=base \
  --pos_sample_prec=30

To check the parameters, run

python train_nli.py --help

which will print the usage as follows.

usage: train_nli.py [-h] [--model MODEL]
                    [--pretrained_weights PRETRAINED_WEIGHTS]
                    [--train_epochs TRAIN_EPOCHS] [--batch_size BATCH_SIZE]
                    [--train_steps_per_epoch TRAIN_STEPS_PER_EPOCH]
                    [--max_seq_len MAX_SEQ_LEN] [--prior PRIOR]
                    [--train_lr TRAIN_LR] [--pos_sample_prec POS_SAMPLE_PREC]
                    [--log_dir LOG_DIR] [--model_dir MODEL_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         The tfhub link for the base embedding model
  --pretrained_weights PRETRAINED_WEIGHTS
                        The pretrained model if any
  --train_epochs TRAIN_EPOCHS
                        The max number of training epoch
  --batch_size BATCH_SIZE
                        Training mini-batch size
  --train_steps_per_epoch TRAIN_STEPS_PER_EPOCH
                        Step interval of evaluation during training
  --max_seq_len MAX_SEQ_LEN
                        The max number of tokens in the input
  --prior PRIOR         Expected ratio of positive samples
  --train_lr TRAIN_LR   The maximum learning rate
  --pos_sample_prec POS_SAMPLE_PREC
                        The percentage of sampled positive examples used in
                        training; should be one of 1, 10, 30, 50, 70
  --log_dir LOG_DIR     The path where the logs are stored
  --model_dir MODEL_DIR
                        The path where models and weights are stored
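
To compare several label ratios, the training command can be generated programmatically. The sketch below only builds argument lists from options shown in the usage text; the helper names build_train_command and build_sweep are hypothetical and not part of this repository.

```python
def build_train_command(model, pos_sample_prec, train_epochs, batch_size=1024):
    """Return the argument list for one train_nli.py run."""
    return [
        "python", "train_nli.py",
        f"--batch_size={batch_size}",
        f"--train_epochs={train_epochs}",
        f"--model={model}",
        f"--pos_sample_prec={pos_sample_prec}",
    ]


def build_sweep(model, percentages, train_epochs):
    """One command per label percentage, in the order given."""
    return [build_train_command(model, p, train_epochs) for p in percentages]


if __name__ == "__main__":
    # Percentages taken from the --pos_sample_prec help text above.
    for cmd in build_sweep("small", [1, 10, 30, 50, 70], train_epochs=10):
        # Print only; pass each list to subprocess.run(cmd, check=True) to launch it.
        print(" ".join(cmd))
```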

Testing

After the model is trained, you will be shown where the model is saved, e.g. ./artifacts/model/20210517-131724, where the directory name (20210517-131724) is the model ID. To test the model with that ID, run

python test_sts.py --model=20210517-131724

The test results on the STS datasets are printed to the console and also saved to the file ./artifacts/test/sts_20210517-131724.txt
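
When several models have been trained, it can help to look up the most recent model ID instead of copying it by hand. This is a minimal sketch that assumes model IDs are timestamps of the form YYYYMMDD-HHMMSS (inferred from the example above) and that models and test results live under ./artifacts/model and ./artifacts/test; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed ID format, inferred from the example 20210517-131724.
MODEL_ID_PATTERN = re.compile(r"^\d{8}-\d{6}$")


def list_model_ids(model_root="./artifacts/model"):
    """Sorted IDs of all sub-directories that look like a model ID."""
    root = Path(model_root)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir()
        if p.is_dir() and MODEL_ID_PATTERN.match(p.name)
    )


def latest_model_id(model_root="./artifacts/model"):
    """Newest model ID, or None if no model directory is found."""
    ids = list_model_ids(model_root)
    # Timestamps in this format sort chronologically as plain strings.
    return ids[-1] if ids else None


def sts_result_path(model_id, test_root="./artifacts/test"):
    """Path of the STS result file for a given model ID."""
    return Path(test_root) / f"sts_{model_id}.txt"
```

For example, latest_model_id() can be passed as the --model value of test_sts.py, and sts_result_path(...) then points at the saved result file.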

Supervised STS

Training

You can continue to finetune a pretrained model on supervised STSb. For example, assume we have trained a PAUSE model based on small BERT (say, located at ./artifacts/model/20210517-131725). To finetune that model on STSb for 2 epochs, run

python ft_stsb.py \
  --model=small \
  --train_epochs=2 \
  --pretrained_weights=./artifacts/model/20210517-131725

Note that it is important to match the model size (--model) with the pretrained model size (--pretrained_weights).

Testing

After the model is finetuned, you will be shown where the model is saved, e.g. ./artifacts/model/20210517-131726, where the directory name (20210517-131726) is the model ID. To test the model with that ID, run

python ft_stsb_test.py --model=20210517-131726

SentEval evaluation

To evaluate the PAUSE embeddings using SentEval (preferably on a GPU), you need to download the data first:

cd ./data/downstream
./get_transfer_data.bash
cd ../..

Then, run the sent_eval.py script:

python sent_eval.py \
  --data_path=./data \
  --model=20210328-212801

where the --model parameter specifies the ID of the model you want to evaluate. By default, the model is expected in the folder ./artifacts/model/embed. To evaluate one of the trained models in our public GCS bucket (gs://motherbrain-pause/model/...), e.g. PAUSE-NLI-base-50%, run:

python sent_eval.py \
  --data_path=./data \
  --model_location=gcs \
  --model=20210329-065047

We provide the following models for demonstration purposes:

Model                  Model ID
PAUSE-NLI-base-100%    20210414-162525
PAUSE-NLI-base-70%     20210328-212801
PAUSE-NLI-base-50%     20210329-065047
PAUSE-NLI-base-30%     20210329-133137
PAUSE-NLI-base-10%     20210329-180000
PAUSE-NLI-base-5%      20210329-205354
PAUSE-NLI-base-1%      20210329-195024
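
The demonstration models listed above can be kept in a small lookup table so that the SentEval command is built from the label percentage rather than a hand-copied ID. This is only a sketch: DEMO_MODELS and build_sent_eval_command are hypothetical names, and the flags are the ones shown in the sent_eval.py example.

```python
# Label percentage -> model ID, copied from the list of demonstration models.
DEMO_MODELS = {
    100: "20210414-162525",
    70: "20210328-212801",
    50: "20210329-065047",
    30: "20210329-133137",
    10: "20210329-180000",
    5: "20210329-205354",
    1: "20210329-195024",
}


def build_sent_eval_command(label_percent, data_path="./data"):
    """Argument list for evaluating one PAUSE-NLI-base model from GCS."""
    model_id = DEMO_MODELS[label_percent]  # raises KeyError for unlisted ratios
    return [
        "python", "sent_eval.py",
        f"--data_path={data_path}",
        "--model_location=gcs",
        f"--model={model_id}",
    ]


if __name__ == "__main__":
    print(" ".join(build_sent_eval_command(50)))
```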