A library for Multilingual Unsupervised or Supervised word Embeddings

MUSE: Multilingual Unsupervised and Supervised Embeddings

[Figure: MUSE model outline]

MUSE is a Python library for multilingual word embeddings, whose goal is to provide the community with:

  • state-of-the-art multilingual word embeddings (fastText embeddings aligned in a common space)
  • large-scale high-quality bilingual dictionaries for training and evaluation

We include two methods, one supervised that uses a bilingual dictionary or identical character strings, and one unsupervised that does not use any parallel data (see Word Translation without Parallel Data for more details).

Dependencies

MUSE is available on CPU or GPU, in Python 2 or 3, and requires NumPy/SciPy and PyTorch. Faiss is optional for GPU users - though Faiss-GPU will greatly speed up nearest neighbor search - and it is highly recommended for CPU users. Faiss can be installed with "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".

Get evaluation datasets

To download the monolingual and cross-lingual word embedding evaluation datasets:

  • Our 110 bilingual dictionaries
  • 28 monolingual word similarity tasks for 6 languages, and the English word analogy task
  • Cross-lingual word similarity tasks from SemEval2017
  • Sentence translation retrieval with Europarl corpora

You can simply run:

cd data/
wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

Alternatively, you can also download the data with:

cd data/
./get_evaluation.sh

Note: Requires bash 4. The Europarl download is disabled by default (it is slow); you can enable it in get_evaluation.sh.

Get monolingual word embeddings

For pre-trained monolingual word embeddings, we highly recommend fastText Wikipedia embeddings, or using fastText to train your own word embeddings from your corpus.

You can download the English (en) and Spanish (es) embeddings this way:

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec
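
These .vec files are plain text: the first line gives the vocabulary size and dimension, and each subsequent line holds a word followed by its vector. As a minimal sketch (the load_vec helper and the 200,000-word cap below are illustrative, not part of MUSE), you can load them into NumPy like this:

import io
import numpy as np

def load_vec(path, max_vocab=200000):
    """Load a fastText .vec text file: first line is 'n_words dim',
    each following line is 'word v1 v2 ... v_dim'."""
    words, vectors = [], []
    with io.open(path, "r", encoding="utf-8", newline="\n", errors="ignore") as f:
        n_words, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if i >= max_vocab:
                break
            word, vec = line.rstrip().split(" ", 1)
            words.append(word)
            vectors.append(np.array(vec.split(), dtype=np.float32))
    return words, np.vstack(vectors)

# en_words, en_emb = load_vec("data/wiki.en.vec")
# es_words, es_emb = load_vec("data/wiki.es.vec")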

Align monolingual word embeddings

This project includes two ways to obtain cross-lingual word embeddings:

  • Supervised: using a bilingual training dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
  • Unsupervised: without any parallel data or anchor point, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.

For more details on these approaches, please see the Word Translation Without Parallel Data paper.

The supervised way: iterative Procrustes (CPU|GPU)

To learn a mapping between the source and the target space, simply run:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

By default, dico_train will point to our ground-truth dictionaries (downloaded above); when set to "identical_char" it will use identical character strings between source and target languages to form a vocabulary. Logs and embeddings will be saved in the dumped/ directory.
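
For intuition, the core Procrustes step solves for the orthogonal matrix W that best maps the dictionary's source vectors onto their target translations. Below is a minimal NumPy sketch of that closed-form solution (X and Y stand in for the dictionary word pairs that supervised.py builds; the helper name is illustrative):

import numpy as np

def procrustes(X, Y):
    # Orthogonal Procrustes: find W minimizing ||X W^T - Y||_F with W orthogonal.
    # Closed form: W = U V^T, where U, S, V^T is the SVD of Y^T X.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# X: (n_pairs, dim) source dictionary vectors, Y: (n_pairs, dim) their translations
# W = procrustes(X, Y)
# mapped_src = src_emb @ W.T   # map all source embeddings into the target space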

The unsupervised way: adversarial training and refinement (CPU|GPU)

To learn a mapping using adversarial training and iterative Procrustes refinement, run:

python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5

By default, the validation metric is the mean cosine of word pairs from a synthetic dictionary built with CSLS (Cross-domain Similarity Local Scaling). For some language pairs (e.g. En-Zh), we recommend centering the embeddings using --normalize_embeddings center.
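
CSLS discounts words that lie in dense "hub" regions by subtracting each word's average similarity to its k nearest neighbors in the other space. A minimal sketch on L2-normalized embeddings (function and variable names are illustrative, not the library's API):

import numpy as np

def csls_scores(src, tgt, k=10):
    # CSLS(x, y) = 2*cos(x, y) - r_tgt(x) - r_src(y), where r_* is the mean
    # cosine similarity to the k nearest neighbors in the other space.
    # src: (n_src, d), tgt: (n_tgt, d), both L2-normalized row-wise.
    sims = src @ tgt.T                                   # pairwise cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # each source word's avg sim to its k nearest targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # each target word's avg sim to its k nearest sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# predicted translation of source word i: the target word with the highest CSLS score
# predictions = csls_scores(src_emb, tgt_emb).argmax(axis=1)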

Evaluate monolingual or cross-lingual embeddings (CPU|GPU)

We also include a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:

Monolingual

python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000

Cross-lingual

python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000
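
Both the monolingual tasks and the SemEval2017 cross-lingual tasks score embeddings by the Spearman correlation between cosine similarities and human similarity judgments. A minimal sketch of that scoring (the pairs input and function name are illustrative; the actual datasets are loaded by evaluate.py):

import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(pairs, words, emb):
    # pairs: list of (word1, word2, human_score) tuples from a similarity dataset.
    # Returns the Spearman correlation between human scores and cosine similarities,
    # skipping pairs with out-of-vocabulary words.
    index = {w: i for i, w in enumerate(words)}
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    human, model = [], []
    for w1, w2, score in pairs:
        if w1 in index and w2 in index:
            human.append(score)
            model.append(float(emb[index[w1]] @ emb[index[w2]]))
    return spearmanr(human, model).correlation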

Word embedding format

By default, the aligned embeddings are exported to a text format at the end of experiments: --export txt. Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a very fast export, you can set --export pth to export the embeddings in a PyTorch binary file, or simply disable the export (--export "").

When loading embeddings, the model can load:

  • PyTorch binary files previously generated by MUSE (.pth files)
  • fastText binary files previously generated by fastText (.bin files)
  • text files (text file with one word embedding per line)

The first two options are very fast and can load 1 million embeddings in a few seconds, while loading text files can take a while.

Download

We provide multilingual embeddings and ground-truth bilingual dictionaries. These embeddings are fastText embeddings that have been aligned in a common space.

Multilingual word embeddings

We release fastText Wikipedia supervised word embeddings for 30 languages, aligned in a single vector space.

Arabic: text Bulgarian: text Catalan: text Croatian: text Czech: text Danish: text
Dutch: text English: text Estonian: text Finnish: text French: text German: text
Greek: text Hebrew: text Hungarian: text Indonesian: text Italian: text Macedonian: text
Norwegian: text Polish: text Portuguese: text Romanian: text Russian: text Slovak: text
Slovenian: text Spanish: text Swedish: text Turkish: text Ukrainian: text Vietnamese: text

You can visualize cross-lingual nearest neighbors using demo.ipynb.
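
Outside the notebook, cross-lingual nearest neighbors in an aligned space are just cosine similarities between the two embedding matrices. A minimal sketch reusing the load_vec helper from above (the wiki.multi.*.vec file names assume you saved the aligned embeddings under data/; the query word is only an example):

import numpy as np

def nearest_neighbors(word, src_words, src_emb, tgt_words, tgt_emb, k=5):
    # Return the k target words closest (by cosine) to a source word,
    # assuming both embedding matrices are L2-normalized row-wise.
    query = src_emb[src_words.index(word)]
    scores = tgt_emb @ query
    best = np.argsort(-scores)[:k]
    return [(tgt_words[i], float(scores[i])) for i in best]

# en_words, en_emb = load_vec("data/wiki.multi.en.vec")
# es_words, es_emb = load_vec("data/wiki.multi.es.vec")
# en_emb /= np.linalg.norm(en_emb, axis=1, keepdims=True)
# es_emb /= np.linalg.norm(es_emb, axis=1, keepdims=True)
# print(nearest_neighbors("cat", en_words, en_emb, es_words, es_emb))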

Ground-truth bilingual dictionaries

We created 110 large-scale ground-truth bilingual dictionaries using an internal translation tool. These dictionaries handle the polysemy of words well. We provide train and test splits of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs. Our goal is to ease the development and evaluation of cross-lingual word embeddings and multilingual NLP.
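
Each dictionary is a plain-text file with one "source_word target_word" pair per line, and a source word may appear with several valid translations. A minimal sketch of precision@1 word-translation accuracy over such a file, using simple nearest-neighbor retrieval (the file path and helper name are illustrative):

import collections
import numpy as np

def precision_at_1(dico_path, src_words, src_emb, tgt_words, tgt_emb):
    # Fraction of dictionary source words whose nearest target neighbor
    # (cosine, on L2-normalized embeddings) is one of its listed translations.
    translations = collections.defaultdict(set)
    with open(dico_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                translations[parts[0]].add(parts[1])
    src_index = {w: i for i, w in enumerate(src_words)}
    hits, total = 0, 0
    for s, valid in translations.items():
        if s not in src_index:
            continue  # skip out-of-vocabulary source words
        pred = tgt_words[int(np.argmax(tgt_emb @ src_emb[src_index[s]]))]
        hits += pred in valid
        total += 1
    return hits / total if total else 0.0

# p1 = precision_at_1("data/crosslingual/dictionaries/en-es.5000-6500.txt",
#                     en_words, en_emb, es_words, es_emb)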

European languages in every direction

src-tgt German English Spanish French Italian Portuguese
German - full train test full train test full train test full train test full train test
English full train test - full train test full train test full train test full train test
Spanish full train test full train test - full train test full train test full train test
French full train test full train test full train test - full train test full train test
Italian full train test full train test full train test full train test - full train test
Portuguese full train test full train test full train test full train test full train test -

Other languages to English (e.g. {fr,es}-en)

Afrikaans: full train test Albanian: full train test Arabic: full train test Bengali: full train test
Bosnian: full train test Bulgarian: full train test Catalan: full train test Chinese: full train test
Croatian: full train test Czech: full train test Danish: full train test Dutch: full train test
English: full train test Estonian: full train test Filipino: full train test Finnish: full train test
French: full train test German: full train test Greek: full train test Hebrew: full train test
Hindi: full train test Hungarian: full train test Indonesian: full train test Italian: full train test
Japanese: full train test Korean: full train test Latvian: full train test Lithuanian: full train test
Macedonian: full train test Malay: full train test Norwegian: full train test Persian: full train test
Polish: full train test Portuguese: full train test Romanian: full train test Russian: full train test
Slovak: full train test Slovenian: full train test Spanish: full train test Swedish: full train test
Tamil: full train test Thai: full train test Turkish: full train test Ukrainian: full train test
Vietnamese: full train test

English to other languages (e.g. en-{fr,es})

Afrikaans: full train test Albanian: full train test Arabic: full train test Bengali: full train test
Bosnian: full train test Bulgarian: full train test Catalan: full train test Chinese: full train test
Croatian: full train test Czech: full train test Danish: full train test Dutch: full train test
English: full train test Estonian: full train test Filipino: full train test Finnish: full train test
French: full train test German: full train test Greek: full train test Hebrew: full train test
Hindi: full train test Hungarian: full train test Indonesian: full train test Italian: full train test
Japanese: full train test Korean: full train test Latvian: full train test Lithuanian: full train test
Macedonian: full train test Malay: full train test Norwegian: full train test Persian: full train test
Polish: full train test Portuguese: full train test Romanian: full train test Russian: full train test
Slovak: full train test Slovenian: full train test Spanish: full train test Swedish: full train test
Tamil: full train test Thai: full train test Turkish: full train test Ukrainian: full train test
Vietnamese: full train test

References

Please cite [1] if you found the resources in this repository useful.

Word Translation Without Parallel Data

[1] A. Conneau*, G. Lample*, L. Denoyer, MA. Ranzato, H. Jégou, Word Translation Without Parallel Data

* Equal contribution. Order has been determined with a coin flip.

@article{conneau2017word,
  title={Word Translation Without Parallel Data},
  author={Conneau, Alexis and Lample, Guillaume and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1710.04087},
  year={2017}
}

MUSE is the project at the origin of the work on unsupervised machine translation with monolingual data only [2].

Unsupervised Machine Translation With Monolingual Data Only

[2] G. Lample, A. Conneau, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation With Monolingual Data Only

@article{lample2017unsupervised,
  title={Unsupervised Machine Translation Using Monolingual Corpora Only},
  author={Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1711.00043},
  year={2017}
}

Contact: [email protected] [email protected]
