Language-Agnostic SEntence Representations

Related tags

Text Data & NLPLASER
Overview

LASER Language-Agnostic SEntence Representations

LASER is a library to calculate and use multilingual sentence embeddings.

NEWS

  • 2019/11/08 CCMatrix is available: Mining billions of high-quality parallel sentences on the WEB [8]
  • 2019/07/31 Gilles Bodard and Jérémy Rapin provided a Docker environment to use LASER
  • 2019/07/11 WikiMatrix is available: bitext extraction for 1620 language pairs in WikiPedia [7]
  • 2019/03/18 switch to BSD license
  • 2019/02/13 The code to perform bitext mining is now available

CURRENT VERSION:

  • We now provide an encoder which was trained on 93 languages, written in 23 different alphabets [6]. This includes all European languages, many Asian and Indian languages, Arabic, Persian, Hebrew, ..., as well as various minority languages and dialects.
  • We provide a test set for more than 100 languages based on the Tatoeba corpus.
  • Switch to PyTorch 1.0

All these languages are encoded by the same BiLSTM encoder, and there is no need to specify the input language (but tokenization is language specific). According to our experience, the sentence encoder also supports code-switching, i.e. the same sentences can contain words in several different languages.

We have also some evidence that the encoder can generalizes to other languages which have not been seen during training, but which are in a language family which is covered by other languages.

A detailed description how the multilingual sentence embeddings are trained can be found in [6], together with an extensive experimental evaluation.

Dependencies

  • Python 3.6
  • PyTorch 1.0
  • NumPy, tested with 1.15.4
  • Cython, needed by Python wrapper of FastBPE, tested with 0.29.6
  • Faiss, for fast similarity search and bitext mining
  • transliterate 1.10.2, only used for Greek (pip install transliterate)
  • jieba 0.39, Chinese segmenter (pip install jieba)
  • mecab 0.996, Japanese segmenter
  • tokenization from the Moses encoder (installed automatically)
  • FastBPE, fast C++ implementation of byte-pair encoding (installed automatically)

Installation

  • set the environment variable 'LASER' to the root of the installation, e.g. export LASER="${HOME}/projects/laser"
  • download encoders from Amazon s3 by bash ./install_models.sh
  • download third party software by bash ./install_external_tools.sh
  • download the data used in the example tasks (see description for each task)

Applications

We showcase several applications of multilingual sentence embeddings with code to reproduce our results (in the directory "tasks").

For all tasks, we use exactly the same multilingual encoder, without any task specific optimization or fine-tuning.

License

LASER is BSD-licensed, as found in the LICENSE file in the root directory of this source tree.

Supported languages

Our model was trained on the following languages:

Afrikaans, Albanian, Amharic, Arabic, Armenian, Aymara, Azerbaijani, Basque, Belarusian, Bengali, Berber languages, Bosnian, Breton, Bulgarian, Burmese, Catalan, Central/Kadazan Dusun, Central Khmer, Chavacano, Chinese, Coastal Kadazan, Cornish, Croatian, Czech, Danish, Dutch, Eastern Mari, English, Esperanto, Estonian, Finnish, French, Galician, Georgian, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Icelandic, Ido, Indonesian, Interlingua, Interlingue, Irish, Italian, Japanese, Kabyle, Kazakh, Korean, Kurdish, Latvian, Latin, Lingua Franca Nova, Lithuanian, Low German/Saxon, Macedonian, Malagasy, Malay, Malayalam, Maldivian (Divehi), Marathi, Norwegian (Bokmål), Occitan, Persian (Farsi), Polish, Portuguese, Romanian, Russian, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Swahili, Swedish, Tagalog, Tajik, Tamil, Tatar, Telugu, Thai, Turkish, Uighur, Ukrainian, Urdu, Uzbek, Vietnamese, Wu Chinese and Yue Chinese.

We have also observed that the model seems to generalize well to other (minority) languages or dialects, e.g.

Asturian, Egyptian Arabic, Faroese, Kashubian, North Moluccan Malay, Nynorsk Norwegian, Piedmontese, Sorbian, Swabian, Swiss German or Western Frisian.

References

[1] Holger Schwenk and Matthijs Douze, Learning Joint Multilingual Sentence Representations with Neural Machine Translation, ACL workshop on Representation Learning for NLP, 2017

[2] Holger Schwenk and Xian Li, A Corpus for Multilingual Document Classification in Eight Languages, LREC, pages 3548-3551, 2018.

[3] Holger Schwenk, Filtering and Mining Parallel Data in a Joint Multilingual Space ACL, July 2018

[4] Alexis Conneau, Guillaume Lample, Ruty Rinott, Adina Williams, Samuel R. Bowman, Holger Schwenk and Veselin Stoyanov, XNLI: Cross-lingual Sentence Understanding through Inference, EMNLP, 2018.

[5] Mikel Artetxe and Holger Schwenk, Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings arXiv, Nov 3 2018.

[6] Mikel Artetxe and Holger Schwenk, Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond arXiv, Dec 26 2018.

[7] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman, WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia arXiv, July 11 2019.

[8] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave and Armand Joulin CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

Owner
Facebook Research
Facebook Research
Code for the Findings of NAACL 2022(Long Paper): AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks

AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks arXiv link: upcoming To be published in Findings of NA

Allen 16 Nov 12, 2022
Word2Wave: a framework for generating short audio samples from a text prompt using WaveGAN and COALA.

Word2Wave is a simple method for text-controlled GAN audio generation. You can either follow the setup instructions below and use the source code and CLI provided in this repo or you can have a play

Ilaria Manco 91 Dec 23, 2022
The implementation of Parameter Differentiation based Multilingual Neural Machine Translation

The implementation of Parameter Differentiation based Multilingual Neural Machine Translation .

Qian Wang 21 Dec 17, 2022
Segmenter - Transformer for Semantic Segmentation

Segmenter - Transformer for Semantic Segmentation

592 Dec 27, 2022
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
🚀Clone a voice in 5 seconds to generate arbitrary speech in real-time

English | 中文 Features 🌍 Chinese supported mandarin and tested with multiple datasets: aidatatang_200zh, magicdata, aishell3, data_aishell, and etc. ?

Vega 25.6k Dec 31, 2022
MicBot - MicBot uses Google Translate to speak everyone's chat messages

MicBot MicBot uses Google Translate to speak everyone's chat messages. It can al

2 Mar 09, 2022
Repositório da disciplina no semestre 2021-2

Avisos! Nenhum aviso! Compiladores 1 Este é o Git da disciplina Compiladores 1. Aqui ficará o material produzido em sala de aula assim como tarefas, w

6 May 13, 2022
Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Sequence-to-sequence framework with a focus on Neural Machine Translation based on Apache MXNet

Amazon Web Services - Labs 1.1k Dec 27, 2022
Reformer, the efficient Transformer, in Pytorch

Reformer, the Efficient Transformer, in Pytorch This is a Pytorch implementation of Reformer https://openreview.net/pdf?id=rkgNKkHtvB It includes LSH

Phil Wang 1.8k Dec 30, 2022
This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs".

CrossSum This repository contains the code, data, and models of the paper titled "CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summ

BUET CSE NLP Group 29 Nov 19, 2022
Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT-Implementation In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages. We are interest

Tanuj Sur 4 Jul 01, 2022
Multilingual text (NLP) processing toolkit

polyglot Polyglot is a natural language pipeline that supports massive multilingual applications. Free software: GPLv3 license Documentation: http://p

RAMI ALRFOU 2.1k Jan 07, 2023
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
PyTorch original implementation of Cross-lingual Language Model Pretraining.

XLM NEW: Added XLM-R model. PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes: Monolingual language model pretrain

Facebook Research 2.7k Dec 27, 2022
a chinese segment base on crf

Genius Genius是一个开源的python中文分词组件,采用 CRF(Conditional Random Field)条件随机场算法。 Feature 支持python2.x、python3.x以及pypy2.x。 支持简单的pinyin分词 支持用户自定义break 支持用户自定义合并词

duanhongyi 237 Nov 04, 2022
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

Ryuichi Yamamoto 1.8k Dec 30, 2022
BERTAC (BERT-style transformer-based language model with Adversarially pretrained Convolutional neural network)

BERTAC (BERT-style transformer-based language model with Adversarially pretrained Convolutional neural network) BERTAC is a framework that combines a

6 Jan 24, 2022
Text to speech for Vietnamese, ez to use, ez to update

Chào mọi người, đây là dự án mở nhằm giúp việc đọc được trở nên dễ dàng hơn. Rất cảm ơn đội ngũ Zalo đã cung cấp hạ tầng để mình có thể tạo ra app này

Trần Cao Minh Bách 32 Jul 29, 2022
Fake news detector filters - Smart filter project allow to classify the quality of information and web pages

fake-news-detector-1.0 Lists, lists and more lists... Spam filter list, quality keyword list, stoplist list, top-domains urls list, news agencies webs

Memo Sim 1 Jan 04, 2022