Pytorch version of BERT-whitening

Overview

BERT-whitening

This is the Pytorch implementation of "Whitening Sentence Representations for Better Semantics and Faster Retrieval".

BERT-whitening is very practical in text semantic search, in which the whitening operation not only improves the performance of unsupervised semantic vector matching, but also reduces the vector dimension, which is beneficial to reduce memory usage and improve retrieval efficiency for vector search engines, e.g., FAISS.

This method was first proposed by Jianlin Su in his blog[1].

Reproduce the experimental results

Preparation

Download datasets

$ cd data/
$ ./download_datasets.sh
$ cd ../

Download models

$ cd model/
$ ./download_models.sh
$ cd ../

After the datasets and models are downloaded, the data/ and model/ directories are as follows:

├── data
│   ├── AllNLI.tsv
│   ├── download_datasets.sh
│   └── downstream
│       ├── COCO
│       ├── CR
│       ├── get_transfer_data.bash
│       ├── MPQA
│       ├── MR
│       ├── MRPC
│       ├── SICK
│       ├── SNLI
│       ├── SST
│       ├── STS
│       ├── SUBJ
│       ├── tokenizer.sed
│       └── TREC
├── model
│   ├── bert-base-nli-mean-tokens
│   ├── bert-base-uncased
│   ├── bert-large-nli-mean-tokens
│   ├── bert-large-uncased
│   └── download_models.sh

BERT without whitening

$ python3 ./eval_without_whitening.py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-cls 0.3062 0.2638 0.2765 0.3605 0.5180 0.4242 0.2029
BERTbase-first_last_avg 0.5785 0.6196 0.6250 0.7096 0.6979 0.6375 0.5904
BERTlarge-cls 0.3240 0.2621 0.2629 0.3554 0.4439 0.4343 0.2675
BERTlarge-first_last_avg 0.5773 0.6116 0.6117 0.6806 0.7030 0.6034 0.5959

BERT with whitening(target)

$ python3 ./eval_with_whitening\(target\).py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-whiten-256(target) 0.6390 0.7375 0.6909 0.7459 0.7442 0.6223 0.7143
BERTlarge-whiten-384(target) 0.6435 0.7460 0.6964 0.7468 0.7594 0.6081 0.7247
SBERTbase-nli-whiten-256(target) 0.6912 0.7931 0.7805 0.8165 0.7958 0.7500 0.8074
SBERTlarge-nli-whiten-384(target) 0.7126 0.8061 0.7852 0.8201 0.8036 0.7402 0.8199

BERT with whitening(NLI)

$ python3 ./eval_with_whitening\(nli\).py

Results:

Model STS-12 STS-13 STS-14 STS-15 STS-16 SICK-R STS-B
BERTbase-whiten(nli) 0.6169 0.6571 0.6605 0.7516 0.7320 0.6829 0.6365
BERTbase-whiten-256(nli) 0.6148 0.6672 0.6622 0.7483 0.7222 0.6757 0.6496
BERTlarge-whiten(nli) 0.6254 0.6737 0.6715 0.7503 0.7636 0.6865 0.6250
BERTlarge-whiten-348(nli) 0.6231 0.6784 0.6701 0.7548 0.7546 0.6866 0.6381
SBERTbase-nli-whiten(nli) 0.6868 0.7646 0.7626 0.8230 0.7964 0.7896 0.7653
SBERTbase-nli-whiten-256(nli) 0.6891 0.7703 0.7658 0.8229 0.7828 0.7880 0.7678
SBERTlarge-nli-whiten(nli) 0.7074 0.7756 0.7720 0.8285 0.8080 0.7910 0.7589
SBERTlarge-nli-whiten-384(nli) 0.7123 0.7893 0.7790 0.8355 0.8057 0.8037 0.7689

Semantic retrieve with FAISS

An important function of BERT-whitening is that it can not only improve the effect of semantic similarity retrieval, but also reduce memory usage and increase retrieval speed. In this experiment, we use Quora Duplicate Questions Dataset and FAISS, a vector retrieval engine, to measure the retrieval effect and efficiency of different models. The dataset contains more than 400,000 pairs of question1-question2, and it is marked whether they are similar. We extract all the semantic vectors of question2 and store them in FAISS (299,364 vectors in total), and then use the semantic vectors of question1 to retrieve them in FAISS (290,654 vectors in total). [email protected] is used to measure the effect of retrieval, Average Retrieve Time (ms) is used to measure retrieval efficiency, and Memory Usage (GB) is used to measure memory usage. FAISS is configured in CPU mode, nlist = 1024'' and nprobe = 5'', and the CPU is Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz.

Modify model_name'' in qqp_search_with_faiss.py'', and then execute:

$ python3 qqp_search_with_faiss.py

The experimental results of different models are as follows:

Model [email protected] Average Retrieve Time (ms) Memory Usage (GB)
BERTbase-XX
BERTbase-first_last_avg 0.5531 0.7488 0.8564
BERTbase-whiten(nli) 0.5571 0.9735 0.8564
BERTbase-whiten-256(nli) 0.5616 0.2698 0.2854
BERTbase-whiten(target) 0.6104 0.8436 0.8564
BERTbase-whiten-256(target) 0.5957 0.1910 0.2854
BERTlarge-XX
BERTlarge-first_last_avg 0.5667 1.2015 1.1419
BERTlarge-whiten(nli) 0.5783 1.3458 1.1419
BERTlarge-whiten-384(nli) 0.5798 0.4118 0.4282
BERTlarge-whiten(target) 0.6178 1.1418 1.1419
BERTlarge-whiten-384(target) 0.6194 0.3301 0.4282

From the experimental results, the use of whitening to reduce the vector sizes of BERTbase and BERTlarge to 256 and 384, respectively, can significantly reduce memory usage and retrieval time, while improving retrieval results. The memory usage is strictly proportional to the vector dimension, while the average retrieval time is not strictly proportional to the vector dimension. This is because FAISS has a difference in clustering question2, which will cause some fluctuations in retrieval efficiency, but in general, the lower its dimensionality, the higher the retrieval efficiency.

References

[1] 苏剑林, 你可能不需要BERT-flow:一个线性变换媲美BERT-flow, 2020.

[2] 苏剑林, Keras版本BERT-whitening, 2020.

Owner
Weijie Liu
NLP and KG
Weijie Liu
Transformation spoken text to written text

Transformation spoken text to written text This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, i

Nguyen Binh 16 Dec 28, 2022
Large-scale pretraining for dialogue

A State-of-the-Art Large-scale Pretrained Response Generation Model (DialoGPT) This repository contains the source code and trained model for a large-

Microsoft 1.8k Jan 07, 2023
This repository contains the code for "Generating Datasets with Pretrained Language Models".

Datasets from Instructions (DINO 🦕 ) This repository contains the code for Generating Datasets with Pretrained Language Models. The paper introduces

Timo Schick 154 Jan 01, 2023
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023
Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

NAVER AI 47 Dec 20, 2022
A collection of models for image - text generation in ACM MM 2021.

Bi-directional Image and Text Generation UMT-BITG (image & text generator) Unifying Multimodal Transformer for Bi-directional Image and Text Generatio

Multimedia Research 63 Oct 30, 2022
Convolutional Neural Networks for Sentence Classification

Convolutional Neural Networks for Sentence Classification Code for the paper Convolutional Neural Networks for Sentence Classification (EMNLP 2014). R

Yoon Kim 2k Jan 02, 2023
This repository implements a brute-force spellchecker utilizing the Damerau-Levenshtein edit distance.

About spellchecker.py Implementing a highly-accurate, brute-force, and dynamically programmed spellchecking program that utilizes the Damerau-Levensht

Raihan Ahmed 1 Dec 11, 2021
Share constant definitions between programming languages and make your constants constant again

Introduction Reconstant lets you share constant and enum definitions between programming languages. Constants are defined in a yaml file and converted

Natan Yellin 47 Sep 10, 2022
Learning to Rewrite for Non-Autoregressive Neural Machine Translation

RewriteNAT This repo provides the code for reproducing our proposed RewriteNAT in EMNLP 2021 paper entitled "Learning to Rewrite for Non-Autoregressiv

Xinwei Geng 20 Dec 25, 2022
Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, Explosion AI 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 French 1.2.3 German 1.2

Explosion 70 Dec 12, 2022
ThinkTwice: A Two-Stage Method for Long-Text Machine Reading Comprehension

ThinkTwice ThinkTwice is a retriever-reader architecture for solving long-text machine reading comprehension. It is based on the paper: ThinkTwice: A

Walle 4 Aug 06, 2021
A Python script which randomly chooses and prints a file from a directory.

___ ____ ____ _ __ ___ / _ \ | _ \ | _ \ ___ _ __ | '__| / _ \ | |_| || | | || | | | / _ \| '__| | | | __/ | _ || |_| || |_| || __

yesmaybenookay 0 Aug 06, 2021
Python module (C extension and plain python) implementing Aho-Corasick algorithm

pyahocorasick pyahocorasick is a fast and memory efficient library for exact or approximate multi-pattern string search meaning that you can find mult

Wojciech Muła 763 Dec 27, 2022
The projects lets you extract glossary words and their definitions from a given piece of text automatically using NLP techniques

Unsupervised technique to Glossary and Definition Extraction Code Files GPT2-DefinitionModel.ipynb - GPT-2 model for definition generation. Data_Gener

Prakhar Mishra 28 May 25, 2021
Natural Language Processing for Adverse Drug Reaction (ADR) Detection

Natural Language Processing for Adverse Drug Reaction (ADR) Detection This repo contains code from a project to identify ADRs in discharge summaries a

Medicines Optimisation Service - Austin Health 21 Aug 05, 2022
code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

AttentiveNAS: Improving Neural Architecture Search via Attentive Sampling This repository contains PyTorch evaluation code, training code and pretrain

Facebook Research 94 Oct 26, 2022
PyTorch implementation of NATSpeech: A Non-Autoregressive Text-to-Speech Framework

A Non-Autoregressive Text-to-Speech (NAR-TTS) framework, including official PyTorch implementation of PortaSpeech (NeurIPS 2021) and DiffSpeech (AAAI 2022)

760 Jan 03, 2023
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Study German declensions (dER nettE Mann, ein nettER Mann, mit dEM nettEN Mann, ohne dEN nettEN Mann ...) Generate as many exercises as you want using the incredible power of SPACY!

Hans Alemão 4 Jul 20, 2022