PyTorch Language Model for 1-Billion Word (LM1B / GBW) Dataset

Last update: Nov 04, 2022

Overview

PyTorch Large-Scale Language Model

A large-scale PyTorch language model trained on the 1-Billion Word (LM1B / GBW) dataset

Latest Results

39.98 perplexity after 5 training epochs using an LSTM language model with the Adam optimizer
Trained in ~26 hours on 1 Nvidia V100 GPU (~5.1 hours per epoch) with a batch size of 2048 (~10.7 GB GPU memory)
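For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per predicted word, so lower is better. The snippet below is a minimal sketch of that definition only; the function name and the toy probabilities are illustrative assumptions and are not taken from this repository.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical natural-log probabilities a model assigned to four target words.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
print(perplexity(log_probs))
```

A model that spreads its probability uniformly over N words has a perplexity of N, which gives an intuitive reading of the reported numbers.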

Previous Results

46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
Implemented Sampled Softmax and Log-Uniform Sampler functions
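As a rough illustration of the log-uniform sampler mentioned above, the sketch below draws negative word ids from the log-uniform (Zipfian) distribution commonly paired with sampled softmax, P(k) = (log(k + 2) - log(k + 1)) / log(range_max + 1). It assumes word ids are sorted by decreasing frequency, so that id 0 is the most frequent word. It is plain Python for readability and is not the repository's Cython implementation; the function names are hypothetical.

```python
import math
import random


def log_uniform_prob(word_id, range_max):
    """Probability of drawing word_id from [0, range_max)."""
    return (math.log(word_id + 2) - math.log(word_id + 1)) / math.log(range_max + 1)


def log_uniform_sample(num_samples, range_max, rng):
    """Draw ids in [0, range_max) by inverting the log-uniform CDF."""
    samples = []
    for _ in range(num_samples):
        u = rng.random()
        # CDF(k) = log(k + 2) / log(range_max + 1), so k = floor(exp(u * log(range_max + 1))) - 1.
        k = int(math.exp(u * math.log(range_max + 1))) - 1
        # Guard against floating-point rounding at the upper edge.
        samples.append(min(k, range_max - 1))
    return samples


rng = random.Random(0)
negatives = log_uniform_sample(num_samples=8, range_max=1000, rng=rng)
print(negatives)
```

Frequent words (small ids) are drawn far more often than rare ones, which is the point of the distribution: the sampled negatives resemble the words the model actually has to discriminate against.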

GPU Hardware Requirement

Type	LM Memory Size	GPU
w/o tied weights	~9 GB	Nvidia 1080 TI, Nvidia Titan X
w/ tied weights [6]	~7 GB	Nvidia 1070 or higher

There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
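To see where the saving comes from, the sketch below counts only the raw float32 weights of the two matrices. The vocabulary size is an assumption (793,471 is the figure commonly reported for LM1B) and is not stated in this README; the 256-wide dimensions come from the hyper-parameter table. Gradients and optimizer state for these matrices are not counted, so the real difference in GPU memory will be larger than this estimate.

```python
def matrix_bytes(rows, cols, bytes_per_value=4):
    """Size of a dense float32 matrix in bytes."""
    return rows * cols * bytes_per_value


VOCAB_SIZE = 793_471    # assumption: commonly reported LM1B vocabulary size
EMBEDDING_SIZE = 256    # from the hyper-parameter table
PROJECTION_SIZE = 256   # from the hyper-parameter table

embedding = matrix_bytes(VOCAB_SIZE, EMBEDDING_SIZE)
softmax = matrix_bytes(VOCAB_SIZE, PROJECTION_SIZE)

untied = embedding + softmax
# Tying is possible because both matrices have the same vocab x 256 shape,
# so a single shared matrix replaces the pair.
tied = embedding

print(f"untied: {untied / 2**30:.2f} GiB, tied: {tied / 2**30:.2f} GiB")
```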

Hyper-Parameters [3]

Parameter	Value
# Epochs	5
Training Batch Size	128
Evaluation Batch Size	1
BPTT	20
Embedding Size	256
Hidden Size	2048
Projection Size	256
Tied Embedding + Softmax	False
# Layers	1
Optimizer	AdaGrad
Learning Rate	0.10
Gradient Clipping	1.00
Dropout	0.01
Weight-Decay (L2 Penalty)	1e-6
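The table above can be captured as a small configuration object, which also makes derived quantities such as the number of target words per training step explicit. This is only a sketch: the class and field names are hypothetical and do not correspond to the repository's command-line arguments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperParams:
    # Values copied from the hyper-parameter table; names are illustrative.
    epochs: int = 5
    train_batch_size: int = 128
    eval_batch_size: int = 1
    bptt: int = 20
    embedding_size: int = 256
    hidden_size: int = 2048
    projection_size: int = 256
    tied: bool = False
    num_layers: int = 1
    optimizer: str = "adagrad"
    learning_rate: float = 0.10
    grad_clip: float = 1.00
    dropout: float = 0.01
    weight_decay: float = 1e-6

    @property
    def tokens_per_batch(self) -> int:
        """Target words processed per training step (batch size x BPTT length)."""
        return self.train_batch_size * self.bptt


params = HyperParams()
print(params.tokens_per_batch)  # 128 * 20 = 2560
```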

Setup - Torch Data Format

Download Google Billion Word Dataset for Torch - Link
Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
Install Cython and build the Log_Uniform Sampler
Convert the Torch data tensors to PyTorch tensor format (requires PyTorch v0.4.1)

I leverage the GBW data preprocessed for the Torch framework. (See Torch GBW) Each data tensor contains all the words in a data partition. The "train_data.sid" file marks the start position and length of each independent sentence. The preprocessing step and the "train_data.sid" file speed up loading the massive training data.

Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
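The two layouts above work together: a row of the sentence ID tensor gives the slice of the data tensor that holds one sentence. The sketch below shows that lookup with plain Python lists standing in for tensors. It assumes zero-based start positions, which may not match the actual files since the data originates from Torch (Lua), and the helper name is hypothetical.

```python
def get_sentence(data, sid, index):
    """Return the word ids of sentence `index`.

    data: rows of (sentence id, word id)
    sid:  rows of (start position, sentence length), assumed zero-based
    """
    start, length = sid[index]
    return [word_id for _, word_id in data[start:start + length]]


# Toy data tensor with two sentences.
data = [(0, 11), (0, 12), (0, 13), (1, 21), (1, 22)]
# Toy sentence ID tensor: sentence 0 starts at row 0 with 3 words, and so on.
sid = [(0, 3), (3, 2)]

print(get_sentence(data, sid, 1))
```

Because each sentence is located by an index lookup rather than by scanning for boundaries, sentences can be shuffled or batched without touching the underlying word tensor.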

Setup - Original Data Format

Download 1-Billion Word Dataset - Link

The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.
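The trade-off can be pictured with a generator that reads one chunk file at a time, keeping memory use bounded at the cost of repeated file reads and tokenization. This is a sketch under assumptions: it presumes each chunk is a text file with one whitespace-tokenized sentence per line, and the function name and directory layout are hypothetical rather than the repository's loader.

```python
from pathlib import Path


def iter_sentences(shard_dir, pattern="*"):
    """Yield tokenized sentences shard by shard, holding one file open at a time."""
    for shard in sorted(Path(shard_dir).glob(pattern)):
        if not shard.is_file():
            continue
        with shard.open(encoding="utf-8") as handle:
            for line in handle:
                words = line.split()
                if words:  # skip blank lines
                    yield words
```

A training loop would consume `iter_sentences(...)` lazily, so only the current line of the current chunk needs to be resident in memory.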


Owner

Ryan Spring
