Fast, DB Backed pretrained word embeddings for natural language processing.

Last update: Nov 21, 2022

Overview

Embeddings

Embeddings is a python package that provides pretrained word embeddings for natural language processing and machine learning.

Instead of loading a large file to query for embeddings, embeddings is backed by a database and fast to load and query:

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300)
100 loops, best of 3: 12.7 ms per loop

>>> %timeit GloveEmbedding('common_crawl_840', d_emb=300).emb('canada')
100 loops, best of 3: 12.9 ms per loop

>>> g = GloveEmbedding('common_crawl_840', d_emb=300)

>>> %timeit -n1 g.emb('canada')
1 loop, best of 3: 38.2 µs per loop

Installation

pip install embeddings  # from pypi
pip install git+https://github.com/vzhong/embeddings.git  # from github

Usage

Upon first use, the embeddings are first downloaded to disk in the form of a SQLite database. This may take a long time for large embeddings such as GloVe. Further usage of the embeddings are directly queried against the database. Embedding databases are stored in the $EMBEDDINGS_ROOT directory (defaults to ~/.embeddings). Note that this location is probably undesirable if your home directory is on NFS, as it would slow down database queries significantly.

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))

Docker

If you use Docker, an image prepopulated with the Common Crawl 840 GloVe embeddings and Kazuma Hashimoto's character ngram embeddings is available at vzhong/embeddings. To mount volumes from this container, set $EMBEDDINGS_ROOT in your container to /opt/embeddings.

For example:

docker run --volumes-from vzhong/embeddings -e EMBEDDINGS_ROOT='/opt/embeddings' myimage python train.py

Contribution

Pull requests welcome!

Fast, DB Backed pretrained word embeddings for natural language processing.

Related tags

Overview

Embeddings

Installation

Usage

Docker

Contribution

Owner

Victor Zhong

Under the hood working of transformers, fine-tuning GPT-3 models, DeBERTa, vision models, and the start of Metaverse, using a variety of NLP platforms: Hugging Face, OpenAI API, Trax, and AllenNLP

Refactored version of FastSpeech2

Seq2seq attn - Use the Seq2Seq method to implement machine translation and introduce Attention mechanism to improve the results

Learning to Rewrite for Non-Autoregressive Neural Machine Translation

CoSENT 比Sentence-BERT更有效的句向量方案

一个基于Nonebot2和go-cqhttp的娱乐性qq机器人

GNES enables large-scale index and semantic search for text-to-text, image-to-image, video-to-video and any-to-any content form

Türkçe küfürlü içerikleri bulan bir yapay zeka kütüphanesi / An ML library for profanity detection in Turkish sentences

Source code for CsiNet and CRNet using Fully Connected Layer-Shared feedback architecture.

Ongoing research training transformer language models at scale, including: BERT & GPT-2

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search

Guide: Finetune GPT2-XL (1.5 Billion Parameters) and GPT-NEO (2.7 B) on a single 16 GB VRAM V100 Google Cloud instance with Huggingface Transformers using DeepSpeed

Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

File-based TF-IDF: Calculates keywords in a document, using a word corpus.

The swas programming language

Levenshtein and Hamming distance computation

Implementation of ProteinBERT in Pytorch

Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization (ACL 2021)

Implementation of legal QA system based on SentenceKoBART

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.