An easy-to-use Python module that helps you extract BERT embeddings for large Bengali/English text datasets efficiently.

Overview

BERTify

This is an easy-to-use Python module that helps you extract BERT embeddings for a large text dataset efficiently. It is intended for Bengali and English texts.

It is especially optimized for usability in limited computational setups (i.e., free Colab/Kaggle GPUs). Extracting embeddings for the IMDB dataset (a list of 25,000 texts) took less than ~28 minutes on Colab's GPU. (No rigorous benchmarking has been performed, so take these numbers with a grain of salt.)
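For bulk extraction like the IMDB run above, the whole corpus is flattened into one list of strings before the embedding call. A minimal sketch, assuming the Hugging Face datasets package (which is not a BERTify dependency) is installed:

# A sketch of bulk extraction on IMDB, assuming the Hugging Face
# `datasets` package is available (not a BERTify dependency).
from datasets import load_dataset
from bertify import BERTify

imdb = load_dataset("imdb", split="train")   # 25,000 labeled reviews
texts = list(imdb["text"])                   # flatten into one list of strings

en_bertify = BERTify(lang="en", last_four_layers_embedding=True)
embeddings = en_bertify.embedding(texts)     # numpy matrix, 25000 x 3072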

Requirements

  • numpy
  • torch
  • tqdm
  • transformers

Quick Installation

$ pip install git+https://github.com/khalidsaifullaah/BERTify

Usage

from bertify import BERTify

# Example 1: Bengali Embedding Extraction
bn_bertify = BERTify(
    lang="bn",  # language of your text.
    last_four_layers_embedding=True  # to get richer embeddings.
)

# By default, `batch_size` is set to 64. Setting it higher makes things even faster,
# but values above 96 may throw `CUDA out of memory` on Colab's GPU, so raise it at your own risk.

# bn_bertify.batch_size = 96

# A list of texts to extract embeddings for; it can contain one or many texts.
# (You can turn your whole dataset into a list of texts and pass it into the method for faster embedding extraction.)
texts = ["বিখ্যাত হওয়ার প্রথম পদক্ষেপ", "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে", "বেশিরভাগ মানুষের পছন্দের জিনিস হচ্ছে"]

bn_embeddings = bn_bertify.embedding(texts)   # returns numpy matrix 
# shape of the returned matrix in this example 3x4096 (3 -> num. of texts, 4096 -> embedding dim.)




# Example 2: English Embedding Extraction
en_bertify = BERTify(
    lang="en",
    last_four_layers_embedding=True
)

# en_bertify.batch_size = 96

texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."]
en_embeddings = en_bertify.embedding(texts) 
# shape of the returned matrix in this example 3x3072 (3 -> num. of texts, 3072 -> embedding dim.)

Tips

  • Try passing all your text data through the .embedding() function at once by turning it into a list of texts.
  • For faster inference, make sure a Colab/Kaggle GPU is enabled when making the .embedding() call.
  • Try increasing the batch_size to make it even faster. By default it is 64 (to be on the safe side), which doesn't throw any CUDA out of memory, but it can likely go higher. Thanks to Alex: from his empirical findings, it seems the batch size can be pushed up to 96. So, before making the .embedding() call, you can set bertify.batch_size = 96 to use a larger batch_size (see the sketch after this list).
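Putting these tips together, a minimal sketch; the CSV file name and the "review" column are hypothetical placeholders:

# Raise batch_size and embed the whole dataset in one call.
# The CSV path and "review" column are hypothetical placeholders.
import pandas as pd
from bertify import BERTify

df = pd.read_csv("my_dataset.csv")
texts = df["review"].astype(str).tolist()   # whole dataset as one list of texts

bertify = BERTify(lang="en", last_four_layers_embedding=True)
bertify.batch_size = 96                     # larger batches for faster GPU inference
embeddings = bertify.embedding(texts)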

Definitions


class BERTify(lang: str = "en", last_four_layers_embedding: bool = False)


A module for extracting embeddings from a BERT model for Bengali or English text datasets. For 'en' (English data) it uses bert-base-uncased model embeddings; for 'bn' (Bengali data) it uses sahajBERT model embeddings.

Parameters:

lang (str, optional): Language of your data. Currently supports only 'en' and 'bn'. Defaults to 'en'.

last_four_layers_embedding (bool, optional): The BERT paper reports that the best results were achieved by concatenating the outputs of the last four layers. If this argument is set to True, your embedding vector will be 4*768 = 3072 dimensional (for a bert-base model, for example); otherwise it will be 768 dimensional. Defaults to False.
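Toggling the flag changes only the width of the returned matrix. A small sketch; the dimensions assume the bert-base-uncased backbone used for lang="en":

# Effect of last_four_layers_embedding on the embedding width
# (dimensions assume the bert-base-uncased backbone for lang="en").
from bertify import BERTify

texts = ["An example sentence."]

base = BERTify(lang="en")                                   # flag defaults to False
rich = BERTify(lang="en", last_four_layers_embedding=True)

print(base.embedding(texts).shape)   # (1, 768)  -> last layer only
print(rich.embedding(texts).shape)   # (1, 3072) -> last four layers concatenated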


def BERTify.embedding(texts: List[str])


The embedding function: it takes a list of texts, feeds them through the model, and returns the embeddings as a NumPy matrix.

Parameters:

texts (List[str]): A list of texts that you want to extract embeddings for (e.g. ["This movie was a total waste of time.", "Whoa! Loved this movie, totally loved all the characters"])

Returns:

np.ndarray: A numpy matrix of shape num_of_texts x embedding_dimension
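Since the return value is a plain NumPy matrix, it plugs directly into downstream tooling. A small sketch computing pairwise cosine similarity between the example texts and saving the matrix for later reuse:

# The returned matrix is plain NumPy, so it works with standard numpy ops.
import numpy as np
from bertify import BERTify

en_bertify = BERTify(lang="en", last_four_layers_embedding=True)
emb = en_bertify.embedding(["This movie was a total waste of time.",
                            "Whoa! Loved this movie, totally loved all the characters"])

normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize each row
similarity = normed @ normed.T                             # 2 x 2 cosine-similarity matrix
print(similarity)

np.save("embeddings.npy", emb)                             # persist for later reuse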

License

MIT License.

Owner
Khalid Saifullah