BPEmb

BPEmb is a collection of pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) and trained on Wikipedia. Its intended use is as input for neural models in natural language processing.

Usage

Install BPEmb with pip:

pip install bpemb

Embeddings and SentencePiece models will be downloaded automatically the first time you use them.

>>> from bpemb import BPEmb
# load English BPEmb model with default vocabulary size (10k) and 50-dimensional embeddings
>>> bpemb_en = BPEmb(lang="en", dim=50)
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model
downloading https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.d50.w2v.bin.tar.gz

You can do two main things with BPEmb. The first is subword segmentation:

# apply English BPE subword segmentation model
>>> bpemb_en.encode("Stratford")
['▁strat', 'ford']
# load Chinese BPEmb model with vocabulary size 100k and default (100-dim) embeddings
>>> bpemb_zh = BPEmb(lang="zh", vs=100000)
# apply Chinese BPE subword segmentation model
>>> bpemb_zh.encode("这是一个中文句子")  # "This is a Chinese sentence."
['▁这是一个', '中文', '句子']  # ["This is a", "Chinese", "sentence"]

Whether and how a word is split depends on the vocabulary size. In general, a small vocabulary splits a word into many subwords, while a large vocabulary leaves frequent words unsplit:

Vocabulary size   Segmentation
1000              ['▁str', 'at', 'f', 'ord']
3000              ['▁str', 'at', 'ford']
5000              ['▁str', 'at', 'ford']
10000             ['▁strat', 'ford']
25000             ['▁stratford']
50000             ['▁stratford']
100000            ['▁stratford']
200000            ['▁stratford']
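
The effect of vocabulary size can be illustrated with a toy BPE segmenter. This is only a sketch: the merge list below is invented so that it reproduces the segmentations shown above, it is not BPEmb's actual merge table, and the real SentencePiece models do not necessarily apply merges this way. The names `TOY_MERGES` and `segment` are hypothetical. A larger vocabulary corresponds to keeping more of the learned merges.

```python
# Hypothetical merge list, ordered from most to least frequent.
# Invented for illustration only; these are NOT BPEmb's real merges.
TOY_MERGES = [
    ("▁", "s"), ("▁s", "t"), ("▁st", "r"),
    ("a", "t"), ("o", "r"), ("or", "d"),
    ("f", "ord"), ("▁str", "at"), ("▁strat", "ford"),
]


def segment(word, merges):
    """Apply BPE merges in order to a single lowercased word."""
    # Start from single characters, with "▁" marking the word start.
    pieces = list("▁" + word.lower())
    for left, right in merges:
        merged = []
        i = 0
        while i < len(pieces):
            # Join each adjacent (left, right) pair into one piece.
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return pieces


# Fewer merges (smaller vocabulary) means more, shorter pieces.
for n_merges in (6, 7, 8, 9):
    print(n_merges, segment("Stratford", TOY_MERGES[:n_merges]))
```

With six merges the word falls into four pieces; with all nine it stays whole, mirroring the trend in the table.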

The second purpose of BPEmb is to provide pretrained subword embeddings:

# Embeddings are wrapped in a gensim KeyedVectors object
>>> type(bpemb_zh.emb)
gensim.models.keyedvectors.Word2VecKeyedVectors
# You can use BPEmb objects like gensim KeyedVectors
>>> bpemb_en.most_similar("ford")
[('bury', 0.8745079040527344),
 ('ton', 0.8725000619888306),
 ('well', 0.871537446975708),
 ('ston', 0.8701574206352234),
 ('worth', 0.8672043085098267),
 ('field', 0.859795331954956),
 ('ley', 0.8591548204421997),
 ('ington', 0.8126075267791748),
 ('bridge', 0.8099068999290466),
 ('brook', 0.7979353070259094)]
>>> type(bpemb_en.vectors)
numpy.ndarray
>>> bpemb_en.vectors.shape
(10000, 50)
>>> bpemb_zh.vectors.shape
(100000, 100)
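
The scores returned by most_similar are cosine similarities, which is how gensim's KeyedVectors ranks neighbours. The following sketch shows that ranking on a tiny hand-made table; the two-dimensional vectors and the helper names `cosine` and `toy_most_similar` are assumptions for illustration and have nothing to do with the real embeddings.

```python
import math

# Hypothetical 2-dimensional vectors, made up for this example.
toy_vectors = {
    "ford": [1.0, 0.0],
    "bridge": [0.8, 0.6],
    "▁the": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(query, vectors):
    """Rank every other entry by cosine similarity to the query."""
    scores = [
        (piece, cosine(vectors[query], vec))
        for piece, vec in vectors.items()
        if piece != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# "bridge" points in nearly the same direction as "ford", so it ranks first.
print(toy_most_similar("ford", toy_vectors))
```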

To use subword embeddings in your neural network, either encode your input into subword IDs:

>>> ids = bpemb_zh.encode_ids("这是一个中文句子")
>>> ids
[25950, 695, 20199]
>>> bpemb_zh.vectors[ids].shape
(3, 100)

Or use the embed method:

# apply Chinese subword segmentation and perform embedding lookup
>>> bpemb_zh.embed("这是一个中文句子").shape
(3, 100)
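
Both routes give one vector per subword, which suggests that embedding amounts to mapping pieces to integer IDs and then selecting the matching rows of the embedding matrix. The sketch below mimics that with plain Python lists; the vocabulary, the three-dimensional vectors, and the helpers `toy_encode_ids` and `toy_lookup` are hypothetical stand-ins, not BPEmb's API or data.

```python
# Hypothetical toy vocabulary; position in the list is the subword ID.
toy_pieces = ["<unk>", "▁strat", "ford", "▁str", "at"]
piece_to_id = {piece: idx for idx, piece in enumerate(toy_pieces)}

# One made-up 3-dimensional row per vocabulary entry.
toy_matrix = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def toy_encode_ids(segmented):
    """Map subword pieces to IDs; unknown pieces fall back to ID 0."""
    return [piece_to_id.get(piece, 0) for piece in segmented]


def toy_lookup(ids):
    """Select one row per ID (with a numpy array: matrix[ids])."""
    return [toy_matrix[i] for i in ids]


ids = toy_encode_ids(["▁strat", "ford"])
embedded = toy_lookup(ids)
print(ids)                              # [1, 2]
print(len(embedded), len(embedded[0]))  # 2 3
```

In a neural model, the ID route is typically the more convenient one, since the matrix can initialise an embedding layer and the IDs serve as its input.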

Downloads for each language

ab (Abkhazian), ace (Achinese), ady (Adyghe), af (Afrikaans), ak (Akan), als (Alemannic), am (Amharic), an (Aragonese), ang (Old English), ar (Arabic), arc (Official Aramaic), arz (Egyptian Arabic), as (Assamese), ast (Asturian), atj (Atikamekw), av (Avaric), ay (Aymara), az (Azerbaijani), azb (South Azerbaijani)

ba (Bashkir), bar (Bavarian), bcl (Central Bikol), be (Belarusian), bg (Bulgarian), bi (Bislama), bjn (Banjar), bm (Bambara), bn (Bengali), bo (Tibetan), bpy (Bishnupriya), br (Breton), bs (Bosnian), bug (Buginese), bxr (Russia Buriat)

ca (Catalan), cdo (Min Dong Chinese), ce (Chechen), ceb (Cebuano), ch (Chamorro), chr (Cherokee), chy (Cheyenne), ckb (Central Kurdish), co (Corsican), cr (Cree), crh (Crimean Tatar), cs (Czech), csb (Kashubian), cu (Church Slavic), cv (Chuvash), cy (Welsh)

da (Danish), de (German), din (Dinka), diq (Dimli), dsb (Lower Sorbian), dty (Dotyali), dv (Dhivehi), dz (Dzongkha)

ee (Ewe), el (Modern Greek), en (English), eo (Esperanto), es (Spanish), et (Estonian), eu (Basque), ext (Extremaduran)

fa (Persian), ff (Fulah), fi (Finnish), fj (Fijian), fo (Faroese), fr (French), frp (Arpitan), frr (Northern Frisian), fur (Friulian), fy (Western Frisian)

ga (Irish), gag (Gagauz), gan (Gan Chinese), gd (Scottish Gaelic), gl (Galician), glk (Gilaki), gn (Guarani), gom (Goan Konkani), got (Gothic), gu (Gujarati), gv (Manx)

ha (Hausa), hak (Hakka Chinese), haw (Hawaiian), he (Hebrew), hi (Hindi), hif (Fiji Hindi), hr (Croatian), hsb (Upper Sorbian), ht (Haitian), hu (Hungarian), hy (Armenian)

ia (Interlingua), id (Indonesian), ie (Interlingue), ig (Igbo), ik (Inupiaq), ilo (Iloko), io (Ido), is (Icelandic), it (Italian), iu (Inuktitut)

ja (Japanese), jam (Jamaican Creole English), jbo (Lojban), jv (Javanese)

ka (Georgian), kaa (Kara-Kalpak), kab (Kabyle), kbd (Kabardian), kbp (Kabiyè), kg (Kongo), ki (Kikuyu), kk (Kazakh), kl (Kalaallisut), km (Central Khmer), kn (Kannada), ko (Korean), koi (Komi-Permyak), krc (Karachay-Balkar), ks (Kashmiri), ksh (Kölsch), ku (Kurdish), kv (Komi), kw (Cornish), ky (Kirghiz)

la (Latin), lad (Ladino), lb (Luxembourgish), lbe (Lak), lez (Lezghian), lg (Ganda), li (Limburgan), lij (Ligurian), lmo (Lombard), ln (Lingala), lo (Lao), lrc (Northern Luri), lt (Lithuanian), ltg (Latgalian), lv (Latvian)

mai (Maithili), mdf (Moksha), mg (Malagasy), mh (Marshallese), mhr (Eastern Mari), mi (Maori), min (Minangkabau), mk (Macedonian), ml (Malayalam), mn (Mongolian), mr (Marathi), mrj (Western Mari), ms (Malay), mt (Maltese), mwl (Mirandese), my (Burmese), myv (Erzya), mzn (Mazanderani)

na (Nauru), nap (Neapolitan), nds (Low German), ne (Nepali), new (Newari), ng (Ndonga), nl (Dutch), nn (Norwegian Nynorsk), no (Norwegian), nov (Novial), nrm (Narom), nso (Pedi), nv (Navajo), ny (Nyanja)

oc (Occitan), olo (Livvi), om (Oromo), or (Oriya), os (Ossetian)

pa (Panjabi), pag (Pangasinan), pam (Pampanga), pap (Papiamento), pcd (Picard), pdc (Pennsylvania German), pfl (Pfaelzisch), pi (Pali), pih (Pitcairn-Norfolk), pl (Polish), pms (Piemontese), pnb (Western Panjabi), pnt (Pontic), ps (Pushto), pt (Portuguese)

qu (Quechua)

rm (Romansh), rmy (Vlax Romani), rn (Rundi), ro (Romanian), ru (Russian), rue (Rusyn), rw (Kinyarwanda)

sa (Sanskrit), sah (Yakut), sc (Sardinian), scn (Sicilian), sco (Scots), sd (Sindhi), se (Northern Sami), sg (Sango), sh (Serbo-Croatian), si (Sinhala), sk (Slovak), sl (Slovenian), sm (Samoan), sn (Shona), so (Somali), sq (Albanian), sr (Serbian), srn (Sranan Tongo), ss (Swati), st (Southern Sotho), stq (Saterfriesisch), su (Sundanese), sv (Swedish), sw (Swahili), szl (Silesian)

ta (Tamil), tcy (Tulu), te (Telugu), tet (Tetum), tg (Tajik), th (Thai), ti (Tigrinya), tk (Turkmen), tl (Tagalog), tn (Tswana), to (Tonga), tpi (Tok Pisin), tr (Turkish), ts (Tsonga), tt (Tatar), tum (Tumbuka), tw (Twi), ty (Tahitian), tyv (Tuvinian)

udm (Udmurt), ug (Uighur), uk (Ukrainian), ur (Urdu), uz (Uzbek)

ve (Venda), vec (Venetian), vep (Veps), vi (Vietnamese), vls (Vlaams), vo (Volapük)

wa (Walloon), war (Waray), wo (Wolof), wuu (Wu Chinese)

xal (Kalmyk), xh (Xhosa), xmf (Mingrelian)

yi (Yiddish), yo (Yoruba)

za (Zhuang), zea (Zeeuws), zh (Chinese), zu (Zulu)

MultiBPEmb

multi (multilingual)

Citing BPEmb

If you use BPEmb in academic work, please cite:

@InProceedings{heinzerling2018bpemb,
  author = {Benjamin Heinzerling and Michael Strube},
  title = "{BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages}",
  booktitle = {Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)},
  year = {2018},
  month = {May 7-12, 2018},
  address = {Miyazaki, Japan},
  editor = {Nicoletta Calzolari (Conference chair) and Khalid Choukri and Christopher Cieri and Thierry Declerck and Sara Goggi and Koiti Hasida and Hitoshi Isahara and Bente Maegaard and Joseph Mariani and Hélène Mazo and Asuncion Moreno and Jan Odijk and Stelios Piperidis and Takenobu Tokunaga},
  publisher = {European Language Resources Association (ELRA)},
  isbn = {979-10-95546-00-9},
  language = {english}
  }