An easy-to-use Python module that helps you extract BERT embeddings for a large text dataset (Bengali/English) efficiently.

Overview

BERTify

This is an easy-to-use Python module that helps you extract BERT embeddings for a large text dataset efficiently. It is intended to be used for Bengali and English texts.

It is specially optimized for usability in limited computational setups (i.e. free Colab/Kaggle GPUs). Extracting embeddings for the IMDB dataset (a list of 25,000 texts) took less than ~28 minutes on Colab's GPU. (I haven't performed any rigorous benchmark, so take these numbers with a grain of salt.)
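For reference, a run like the one timed above could be reproduced roughly as follows. This is a minimal sketch, assuming the Hugging Face datasets package is installed for loading IMDB; actual runtime depends on your GPU and batch size.

from datasets import load_dataset   # assumption: the Hugging Face `datasets` package is installed
from bertify import BERTify

# The IMDB training split is 25,000 reviews; take the raw text column as a list of strings.
imdb_texts = load_dataset("imdb", split="train")["text"]

en_bertify = BERTify(lang="en", last_four_layers_embedding=True)
imdb_embeddings = en_bertify.embedding(imdb_texts)   # numpy matrix of shape 25000 x 3072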

Requirements

  • numpy
  • torch
  • tqdm
  • transformers

Quick Installation

$ pip install git+https://github.com/khalidsaifullaah/BERTify

Usage

from bertify import BERTify

# Example 1: Bengali Embedding Extraction
bn_bertify = BERTify(
    lang="bn",  # language of your text.
    last_four_layers_embedding=True  # to get richer embeddings.
)

# By default, `batch_size` is set to 64. Set `batch_size` higher for making things even faster but higher value than 96 may throw `CUDA out of memory` on Colab's GPU, so try at your own risk.

# bn_bertify.batch_size = 96

# A list of texts that we want embeddings for; it can contain one or many texts. (You can turn your whole dataset into a list of texts and pass it into the method for faster embedding extraction)
texts = ["বিখ্যাত হওয়ার প্রথম পদক্ষেপ", "জীবনে সবচেয়ে মূল্যবান জিনিস হচ্ছে", "বেশিরভাগ মানুষের পছন্দের জিনিস হচ্ছে"]

bn_embeddings = bn_bertify.embedding(texts)   # returns numpy matrix 
# shape of the returned matrix in this example 3x4096 (3 -> num. of texts, 4096 -> embedding dim.)




# Example 2: English Embedding Extraction
en_bertify = BERTify(
    lang="en",
    last_four_layers_embedding=True
)

# en_bertify.batch_size = 96

texts = ["how are you doing?", "I don't know about this.", "This is the most important thing."]
en_embeddings = en_bertify.embedding(texts) 
# shape of the returned matrix in this example 3x3072 (3 -> num. of texts, 3072 -> embedding dim.)
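The returned matrix is a regular NumPy array, so you can cache it to disk and reuse it without re-running the model. A minimal sketch continuing from the example above (the file name is just an illustration):

import numpy as np

np.save("en_embeddings.npy", en_embeddings)    # cache the embeddings to disk
en_embeddings = np.load("en_embeddings.npy")   # reload them later without re-running BERT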

Tips

  • Try passing all your text data through the .embedding() function at once by turning it into a list of texts.
  • For faster inference, make sure you're using your Colab/Kaggle GPU while making the .embedding() call.
  • Try increasing the batch_size to make it even faster. By default we're using 64 (to be on the safe side), which doesn't throw any CUDA out of memory, but I believe we can go even further. Thanks to Alex, from his empirical findings it seems it can be pushed up to 96. So, before making the .embedding() call, you can do bertify.batch_size = 96 to set a larger batch_size (see the sketch after this list).
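Putting these tips together, a typical call over a full dataset might look like the following. This is a minimal sketch: all_texts here stands in for your own dataset collected into a list of strings, and a batch size of 96 is assumed to fit on your GPU.

from bertify import BERTify

# `all_texts` stands in for your whole dataset, collected into one list of strings.
all_texts = ["first document ...", "second document ...", "third document ..."]

en_bertify = BERTify(lang="en", last_four_layers_embedding=True)
en_bertify.batch_size = 96                         # larger batches run faster, but may exhaust GPU memory
all_embeddings = en_bertify.embedding(all_texts)   # one call over the full list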

Definitions


class BERTify(lang: str = "en", last_four_layers_embedding: bool = False)


A module for extracting embeddings from a BERT model for Bengali or English text datasets. For 'en' (English data) it uses bert-base-uncased model embeddings; for 'bn' (Bengali data) it uses sahajBERT model embeddings.

Parameters:

  • lang (str, optional): language of your data. Currently supports only 'en' and 'bn'. Defaults to 'en'.
  • last_four_layers_embedding (bool, optional): the BERT paper reports the best results from concatenating the outputs of the last four layers, so if this argument is set to True, your embedding vector will be 4*768 = 3072 dimensional (for a bert-base model, for example); otherwise it will be 768 dimensional. Defaults to False.
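As an illustration of how this flag affects the output shape, here is a minimal sketch; the dimensions assume the bert-base-uncased backbone used for English.

from bertify import BERTify

texts = ["a single example sentence"]

plain_bertify = BERTify(lang="en")                                   # last_four_layers_embedding defaults to False
rich_bertify = BERTify(lang="en", last_four_layers_embedding=True)

print(plain_bertify.embedding(texts).shape)   # (1, 768)  -> last layer only
print(rich_bertify.embedding(texts).shape)    # (1, 3072) -> last four layers concatenated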


def BERTify.embedding(texts: List[str])


The embedding function: it takes a list of texts, feeds them through the model, and returns their embeddings as a NumPy matrix.

Parameters:

texts (List[str]): A list of texts that you want to extract embeddings for (e.g. ["This movie was a total waste of time.", "Whoa! Loved this movie, totally loved all the characters"])

Returns:

np.ndarray: A numpy matrix of shape num_of_texts x embedding_dimension
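Since the return value is a plain num_of_texts x embedding_dimension NumPy matrix, downstream use is straightforward. For example, a minimal sketch of computing pairwise cosine similarities between the rows (scikit-learn is an assumption here, not a BERTify dependency):

from sklearn.metrics.pairwise import cosine_similarity
from bertify import BERTify

en_bertify = BERTify(lang="en", last_four_layers_embedding=True)
texts = ["The movie was great.", "I really enjoyed this film.", "The weather is cold today."]

embeddings = en_bertify.embedding(texts)       # shape: num_of_texts x embedding_dimension
similarities = cosine_similarity(embeddings)   # shape: num_of_texts x num_of_texts
# similarities[i, j] is the cosine similarity between texts[i] and texts[j]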

License

MIT License.
