Source code for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Last update: Dec 17, 2022

WhiteningBERT

Source code and data for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.

Preparation

git clone https://github.com/Jun-jie-Huang/WhiteningBERT.git
pip install -r requirements.txt
cd examples/evaluation

Usage

Datasets

We use seven STS datasets: STSBenchmark, SICK-Relatedness, STS12, STS13, STS14, STS15, and STS16.

The processed data can be found in ./examples/datasets/.

Run

To run a quick demo:

python evaluation_stsbenchmark.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased
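
The --whitening flag in the command above enables the whitening step named in the paper title. As a rough illustration only, a generic PCA-style whitening can be sketched with NumPy as follows. The function name whiten, the eps parameter, and the toy data are assumptions made for this example and are not taken from this repository, whose implementation may differ.

```python
import numpy as np

def whiten(embeddings, eps=1e-9):
    """Center the embeddings and rescale them so their covariance is ~identity.

    Hypothetical helper for illustration; not the repository's own function.
    """
    x = np.asarray(embeddings, dtype=np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    # Sample covariance of the centered embeddings, shape (dim, dim).
    cov = centered.T @ centered / x.shape[0]
    # The covariance is symmetric, so eigh returns real eigenpairs.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Scale eigenvector j (column j) by 1 / sqrt(eigenvalue j).
    scale = np.sqrt(np.clip(eigvals, 0.0, None) + eps)
    return centered @ (eigvecs / scale)

# Toy data standing in for sentence embeddings: correlated and off-center.
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
raw = rng.normal(size=(500, 3)) @ mixing + 5.0
white = whiten(raw)

# After whitening, the mean is ~0 and the covariance is ~identity.
print(np.round(white.T @ white / white.shape[0], 3))
```

In practice the statistics would be estimated on the sentence embeddings being evaluated; which sentences the repository uses for this is not stated here.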

Set --pooling to cls to use the [CLS] token embedding, or to aver to average the embeddings of all tokens. Use --layer_num to choose which layers to combine, given as a comma-separated list (for example, 1,12).
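
To make the two options concrete, here is a minimal sketch of layer combination followed by pooling for a single sentence. It assumes that the selected layers are averaged token by token, that index 0 of the hidden states is the embedding output, and that padding tokens are excluded through an attention mask. The function pool_sentence and its parameters are hypothetical and do not come from the repository.

```python
import numpy as np

def pool_sentence(hidden_states, attention_mask, layers=(1, 12), pooling="aver"):
    """Combine the chosen layers, then pool tokens into one sentence vector.

    hidden_states: array of shape (num_layers + 1, seq_len, dim), where
                   index 0 is assumed to be the embedding output.
    attention_mask: array of shape (seq_len,), 1 for real tokens, 0 for padding.
    """
    hs = np.asarray(hidden_states, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)
    # Average the selected layers token by token -> (seq_len, dim).
    combined = hs[list(layers)].mean(axis=0)
    if pooling == "cls":
        # The first position holds the [CLS] token.
        return combined[0]
    if pooling == "aver":
        # Mean over real tokens only; padding positions are masked out.
        return (combined * mask[:, None]).sum(axis=0) / mask.sum()
    raise ValueError(f"unknown pooling mode: {pooling}")

# Toy input: 12 layers plus embeddings, 4 positions (last one is padding), dim 2.
hidden = np.zeros((13, 4, 2))
hidden[1] = 1.0
hidden[12] = 3.0
hidden[12, 3] = 100.0  # padding position, should not affect the average
mask = np.array([1, 1, 1, 0])

aver_vec = pool_sentence(hidden, mask, layers=(1, 12), pooling="aver")
cls_vec = pool_sentence(hidden, mask, layers=(1, 12), pooling="cls")
print(aver_vec, cls_vec)
```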

To enumerate all possible combinations of two layers and evaluate each combination in turn:

python evaluation_stsbenchmark_layer2.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased

To enumerate all possible combinations of N layers:

python evaluation_stsbenchmark_layerN.py \
			--pooling aver \
			--whitening \
			--encoder_name bert-base-cased \
			--combination_num 4
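
For a sense of how large such a search is, the sketch below enumerates layer combinations with the standard library. It assumes a 12-layer encoder with layers numbered 1 to 12; whether the scripts also include the embedding layer (index 0) is not stated here. The helper layer_combinations is hypothetical.

```python
from itertools import combinations

def layer_combinations(num_layers=12, combination_num=2):
    """Return every choice of `combination_num` layers as a --layer_num style string."""
    layers = range(1, num_layers + 1)
    return [",".join(map(str, combo))
            for combo in combinations(layers, combination_num)]

pairs = layer_combinations(12, 2)
quads = layer_combinations(12, 4)
print(len(pairs), pairs[:3])
print(len(quads), quads[0])
```

Under these assumptions there are 66 two-layer combinations and 495 four-layer combinations, so the run time of an exhaustive evaluation grows quickly with the number of layers combined.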

You can also save the sentence embeddings:

python evaluation_stsbenchmark_save_embed.py \
			--pooling aver \
			--layer_num 1,12 \
			--whitening \
			--encoder_name bert-base-cased \
			--summary_dir ./save_embeddings

A list of PLMs you can select:

bert-base-uncased, bert-large-uncased
roberta-base, roberta-large
bert-base-multilingual-uncased
sentence-transformers/LaBSE
albert-base-v1, albert-large-v1
microsoft/layoutlm-base-uncased, microsoft/layoutlm-large-uncased
SpanBERT/spanbert-base-cased, SpanBERT/spanbert-large-cased
microsoft/deberta-base, microsoft/deberta-large
google/electra-base-discriminator
google/mobilebert-uncased
microsoft/DialogRPT-human-vs-rand
distilbert-base-uncased
......

Acknowledgements

The code is adapted from the repositories of the EMNLP 2019 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks and the EMNLP 2020 paper An Unsupervised Sentence Embedding Method by Mutual Information Maximization.
