[EMNLP 2021] Mirror-BERT: Converting Pretrained Language Models to universal text encoders without labels.

Last update: Dec 10, 2022

Overview

Mirror-BERT

Code repo for the EMNLP 2021 paper:
Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders
by Fangyu Liu, Ivan Vulić, Anna Korhonen, and Nigel Collier.

Mirror-BERT is an unsupervised contrastive learning method that converts pretrained language models (PLMs) into universal text encoders. It takes a PLM and a txt file containing raw text as input, and outputs a strong text embedding model in just 20-30 seconds. It works well not only for sentence representation learning, but also for word and phrase representation learning.
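
As a rough illustration of what "contrastive learning" means here, the sketch below computes a generic in-batch InfoNCE-style loss over pairs of embeddings. It is not taken from this repository: the function name info_nce_loss, the plain-list inputs, and the temperature value are assumptions chosen for illustration, and the actual training objective and hyperparameters are defined in the paper and the training scripts.

```python
import math


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss (illustrative only).

    anchors[i] and positives[i] are two embeddings of the same text;
    every other positives[j] in the batch acts as a negative for anchors[i].
    The temperature is a placeholder, not the repository's setting.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss is small when each anchor is closest to its own positive and grows when it is closer to another item in the batch, which is the signal that pulls the two views of the same text together.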

Hugging Face pretrained models

Sentence encoders:

model	STS avg.
baseline: sentence-bert (supervised)	74.89
mirror-bert-base-uncased-sentence	74.51
mirror-roberta-base-sentence	75.08
mirror-bert-base-uncased-sentence-drophead	75.16
mirror-roberta-base-sentence-drophead	76.67

Word encoder:

model	Multi-SimLex (ENG)
baseline: fasttext	52.80
mirror-bert-base-uncased-word	55.60

(Note that the released models will not replicate the exact numbers in the paper, since the numbers reported there are averages over three runs.)

Train

For training sentence representations:

>> ./mirror_scripts/mirror_sentence_bert.sh 0,1

where 0,1 are GPU indices. This script should complete in 20-30 seconds on two NVIDIA 2080Ti/3090 GPUs. If you encounter an out-of-memory error, consider reducing max_length in the script. Scripts for replicating other models are available in mirror_scripts/.

Custom data: To train on your own corpus, set --train_dir in the script to your own txt file (one sentence per line). If you have raw sentences from your target domain, we recommend always using the in-domain data for optimal performance. E.g., if you aim to create a conversational encoder, sample 10k utterances to train your model!
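
One possible way to prepare such a file is sketched below. The helper names sample_sentences and write_corpus are hypothetical and not part of this repository; the only thing taken from the README is the layout (a txt file with one sentence per line) and the suggested sample size of 10k.

```python
import random


def sample_sentences(sentences, k=10_000, seed=0):
    """Return up to k distinct, non-empty sentences sampled without replacement."""
    # Collapse internal whitespace and newlines so each sentence fits on one line.
    cleaned = [" ".join(s.split()) for s in sentences]
    # Drop empty strings and exact duplicates while keeping first-seen order.
    unique = list(dict.fromkeys(s for s in cleaned if s))
    rng = random.Random(seed)
    return rng.sample(unique, min(k, len(unique)))


def write_corpus(sentences, path):
    """Write one sentence per line, the layout expected for the training txt file."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(sentence + "\n")
```

For example, write_corpus(sample_sentences(my_utterances), "my_corpus.txt") would produce a file that --train_dir can point to, where my_utterances and the file name are placeholders.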

Supervised training: Organise your training data in the format text1||text2 and store it one pair per line in a txt file. Then turn on the --pairwise option. text1 and text2 will be regarded as a positive pair in contrastive learning. You can be creative in finding such training pairs; ideally they come from your application domain. E.g., to build an e-commerce QA encoder, the question||answer pairs from the Amazon question-answer dataset could work quite well. Example training script: mirror_scripts/mirror_sentence_roberta_supervised_amazon_qa.sh. Note that when a model is tuned on your in-domain data, you shouldn't expect it to be good at STS. Instead, evaluate it on your in-domain task.
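
A minimal sketch of writing such a pair file is shown below. The helpers format_pair and write_pairs are hypothetical names, not functions from this repository, and rejecting texts that already contain "||" is a precaution assumed here rather than documented behaviour.

```python
SEPARATOR = "||"


def format_pair(text1, text2):
    """Return one training line in the text1||text2 format."""
    # Collapse whitespace and newlines so the pair stays on a single line.
    t1 = " ".join(text1.split())
    t2 = " ".join(text2.split())
    if not t1 or not t2:
        raise ValueError("both sides of a pair must be non-empty")
    if SEPARATOR in t1 or SEPARATOR in t2:
        # A stray separator inside the text would make the line ambiguous.
        raise ValueError("text must not contain the '||' separator")
    return f"{t1}{SEPARATOR}{t2}"


def write_pairs(pairs, path):
    """Write (text1, text2) tuples to a txt file, one pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text1, text2 in pairs:
            f.write(format_pair(text1, text2) + "\n")
```

For instance, format_pair("What size is it?", "It is a medium.") yields the line "What size is it?||It is a medium.".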

Word-level training: Use mirror_scripts/mirror_word_bert.sh.

Encode

It's easy to compute your own sentence embeddings:

from src.mirror_bert import MirrorBERT

# one of the released checkpoints listed above
model_name = "cambridgeltl/mirror-roberta-base-sentence-drophead"
mirror_bert = MirrorBERT()
mirror_bert.load_model(path=model_name, use_cuda=True)

# encode a list of sentences, using the [CLS] representation as the embedding
embeddings = mirror_bert.get_embeddings([
    "I transform pre-trained language models into universal text encoders.",
], agg_mode="cls")
print(embeddings.shape)
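
Once you have embeddings, a common next step is to compare them with cosine similarity. The sketch below is a generic, dependency-free illustration and not part of this repository: cosine_similarity and rank_by_similarity are hypothetical helpers, and they assume each embedding has first been converted to a plain list of floats (for example with .tolist(), if the returned object supports it, which the README does not state).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

With real embeddings, the query and candidates would be rows of the array returned by get_embeddings; the ranking then gives a simple semantic search over your sentences.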

Evaluate

Evaluate sentence representations:

>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-roberta-base-sentence-drophead" \
	--agg_mode "cls" \
	--dataset sent_all

Evaluate word representations:

>> python evaluation/eval.py \
	--model_dir "cambridgeltl/mirror-bert-base-uncased-word" \
	--agg_mode "cls" \
	--dataset multisimlex_ENG

To test models on other languages, replace ENG with the code of your target language. See here for all languages supported by Multi-SimLex.

Citation

@inproceedings{liu2021fast,
  title={Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders},
  author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
  booktitle={EMNLP 2021},
  year={2021}
}

Owner

Cambridge Language Technology Lab
