Applying "Load What You Need: Smaller Versions of Multilingual BERT" to LaBSE

Overview

smaller-LaBSE

LaBSE(Language-agnostic BERT Sentence Embedding) is a very good method to get sentence embeddings across languages. But it is hard to fine-tune due to the parameter size(~=471M) of this model. For instance, if I fine-tune this model with Adam optimizer, I need the GPU that has VRAM at least 7.5GB = 471M * (parameters 4 bytes + gradients 4 bytes + momentums 4 bytes + variances 4 bytes). So I applied "Load What You Need: Smaller Multilingual Transformers" method to LaBSE to reduce parameter size since most of this model's parameter is the word embedding table(~=385M).

The smaller version of LaBSE is evaluated for 14 languages using tatoeba dataset. It shows we can reduce LaBSE's parameters to 47% without a big performance drop.

If you need the PyTorch version, see https://github.com/Geotrend-research/smaller-transformers. I followed most of the steps in the paper.

Model #param(transformer) #param(word embedding) #param(model) vocab size
tfhub_LaBSE 85.1M 384.9M 470.9M 501,153
15lang_LaBSE 85.1M 133.1M 219.2M 173,347

Used Languages

  • English (en or eng)
  • French (fr or fra)
  • Spanish (es or spa)
  • German (de or deu)
  • Chinese (zh, zh_classical or cmn)
  • Arabic (ar or ara)
  • Italian (it or ita)
  • Japanese (ja or jpn)
  • Korean (ko or kor)
  • Dutch (nl or nld)
  • Polish (pl or pol)
  • Portuguese (pt or por)
  • Thai (th or tha)
  • Turkish (tr or tur)
  • Russian (ru or rus)

I selected the languages multilingual-USE supports.

Scripts

A smaller version of the vocab was constructed based on the frequency of tokens using Wikipedia dump data. I followed most of the algorithms in the paper to extract proper vocab for each language and rewrite it for TensorFlow.

Convert weight

mkdir -p downloads/labse-2
curl -L https://tfhub.dev/google/LaBSE/2?tf-hub-format=compressed -o downloads/labse-2.tar.gz
tar -xf downloads/labse-2.tar.gz -C downloads/labse-2/
python save_as_weight_from_saved_model.py

Select vocabs

./download_dataset.sh
python select_vocab.py

Make smaller LaBSE

./make_smaller_labse.py

Evaluate tatoeba

./download_tatoeba_dataset.sh
# evaluate TFHub LaBSE
./evaluate_tatoeba.sh
# evaluate the smaller LaBSE
./evaluate_tatoeba.sh \
    --model models/LaBSE_en-fr-es-de-zh-ar-zh_classical-it-ja-ko-nl-pl-pt-th-tr-ru/1/ \
    --preprocess models/LaBSE_en-fr-es-de-zh-ar-zh_classical-it-ja-ko-nl-pl-pt-th-tr-ru_preprocess/1/

Results

Tatoeba

Model fr es de zh ar it ja ko nl pl pt th tr ru avg
tfHub_LaBSE(en→xx) 95.90 98.10 99.30 96.10 90.70 95.30 96.40 94.10 97.50 97.90 95.70 82.85 98.30 95.30 95.25
tfHub_LaBSE(xx→en) 96.00 98.80 99.40 96.30 91.20 94.00 96.50 92.90 97.00 97.80 95.40 83.58 98.50 95.30 95.19
15lang_LaBSE(en→xx) 95.20 98.00 99.20 96.10 90.50 95.20 96.30 93.50 97.50 97.90 95.80 82.85 98.30 95.40 95.13
15lang_LaBSE(xx→en) 95.40 98.70 99.40 96.30 91.10 94.00 96.30 92.70 96.70 97.80 95.40 83.58 98.50 95.20 95.08
  • Accuracy(%) of the Tatoeba datasets.
  • If the strategy to select vocabs is changed or the corpus used in the selection step is changed to the corpus similar to the evaluation dataset, it is expected to reduce the performance drop.

References

You might also like...
Comments
  • Training time  and  Machine configuration

    Training time and Machine configuration

    Hi, thanks for your sharing model. I want to make a smaller model, just contains two languages(en, zh). And I want to know the kind of machine GPU and how long does it need to cost?

    opened by QzzIsCoding 2
  • Publish model to HuggingFace Model Hub?

    Publish model to HuggingFace Model Hub?

    I migrated the full LaBSE model from TF to PyTorch and uploaded them to the HuggingFace model hub. I saw this model on the TF hub and started migrating it for uploading to the HF Hub. I realized then that this wasn't published by Google but by @jeongukjae, so wanted to check with you before uploading it.

    I have exported the model locally. I'm happy to check the changes in and upload the exported model if that's fine for you :).

    opened by setu4993 2
Owner
Jeong Ukjae
Jeong Ukjae
Code for text augmentation method leveraging large-scale language models

HyperMix Code for our paper GPT3Mix and conducting classification experiments using GPT-3 prompt-based data augmentation. Getting Started Installing P

NAVER AI 47 Dec 20, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

ALBERT ***************New March 28, 2020 *************** Add a colab tutorial to run fine-tuning for GLUE datasets. ***************New January 7, 2020

Google Research 3k Dec 26, 2022
SAVI2I: Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors

SAVI2I: Continuous and Diverse Image-to-Image Translation via Signed Attribute Vectors [Paper] [Project Website] Pytorch implementation for SAVI2I. We

Qi Mao 44 Dec 30, 2022
Utilizing RBERT model for KLUE Relation Extraction task

RBERT for Relation Extraction task for KLUE Project Description Relation Extraction task is one of the task of Korean Language Understanding Evaluatio

snoop2head 14 Nov 15, 2022
Neural network sequence labeling model

Sequence labeler This is a neural network sequence labeling system. Given a sequence of tokens, it will learn to assign labels to each token. Can be u

Marek Rei 250 Nov 03, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
Converts text into a PDF of handwritten notes

Text To Handwritten Notes Converts text into a PDF of handwritten notes Explore the docs » · Report Bug · Request Feature · Steps: $ git clone https:/

UVSinghK 63 Oct 09, 2022
Full Spectrum Bioinformatics - a free online text designed to introduce key topics in Bioinformatics using the Python

Full Spectrum Bioinformatics is a free online text designed to introduce key topics in Bioinformatics using the Python programming language. The text is written in interactive Jupyter Notebooks, whic

Jesse Zaneveld 33 Dec 28, 2022
Submit issues and feature requests for our API here.

AIx GPT API Submit issues and feature requests for our API here. See https://apps.aixsolutionsgroup.com for more info. Python Quick Start pip install

AIx Solutions 7 Mar 27, 2022
LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

LightSpeech UnOfficial PyTorch implementation of LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search.

Rishikesh (ऋषिकेश) 54 Dec 03, 2022
Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Expressions.

patterns-finder Simple, Fast, Powerful and Easily extensible python package for extracting patterns from text, with over than 60 predefined Regular Ex

22 Dec 19, 2022
End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022
ConvBERT: Improving BERT with Span-based Dynamic Convolution

ConvBERT Introduction In this repo, we introduce a new architecture ConvBERT for pre-training based language model. The code is tested on a V100 GPU.

YITUTech 237 Dec 10, 2022
A Chinese to English Neural Model Translation Project

ZH-EN NMT Chinese to English Neural Machine Translation This project is inspired by Stanford's CS224N NMT Project Dataset used in this project: News C

Zhenbang Feng 29 Nov 26, 2022
Reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer: Self-Attention with Linear Complexity)

Linear Multihead Attention (Linformer) PyTorch Implementation of reproducing the Linear Multihead Attention introduced in Linformer paper (Linformer:

Kui Xu 58 Dec 23, 2022
A spaCy wrapper of OpenTapioca for named entity linking on Wikidata

spaCyOpenTapioca A spaCy wrapper of OpenTapioca for named entity linking on Wikidata. Table of contents Installation How to use Local OpenTapioca Vizu

Universitätsbibliothek Mannheim 80 Jan 03, 2023
nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

nlabel is a library for generating, storing and retrieving tagging information and embedding vectors from various nlp libraries through a unified interface.

Bernhard Liebl 2 Jun 10, 2022
Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Espresso Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning libra

Yiming Wang 919 Jan 03, 2023