Russian GPT3 models.

Overview

ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large

This repository contains bunch of autoregressive transformer language models trained on a huge dataset of russian language.

Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) trained with 2048 sequence length with sparse and dense attention blocks. We also provide Russian GPT-2 large model (ruGPT2Large) trained with 1024 sequence length.

We suggest using ruGPT2Large or ruGPT3XL because this models are well tested and achieve the best perplexity.

Usage examples are described in detail here.

Old version of code you can find here

Table of contents

Setup and usage

Models can be used for inference or finetuning with two ways: 🤗 HuggingFace interface or our code based on this implementation.

For both ways install transformers:

pip install transformers==3.5.0

HuggingFace interface

We support 🤗 HuggingFace interface only for ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large models. For RuGPT3XL please use code in this repo because RuGPT3XL model was trained with sparse attention.

Here we can obtain examples of finetuning or generation.

Also this examples is adapted for google colab:

  • finetuning: finetuning
  • generation: generation

Basic usage:

from transformers import GPT2LMHeadModel, GPT2Tokenizer


model_name_or_path = "sberbank-ai/rugpt3large_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
model = GPT2LMHeadModel.from_pretrained(model_name_or_path).cuda()
text = "Александр Сергеевич Пушкин родился в "
input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
out = model.generate(input_ids.cuda())
generated_text = list(map(tokenizer.decode, out))[0]
print(generated_text)
# Output should be like this:
# Александр Сергеевич Пушкин родился в \n1799 году. Его отец был крепостным крестьянином, а мать – крепостной крестьянкой. Детство и юность Пушкина прошли в деревне Михайловское под Петербургом. В 1820-х годах семья переехала

For more information about 🤗 HuggingFace interface please follow this documentation.

Data issues

For training pass single txt file.

Megatron interface

Without deepspeed

For using our code for finetuning without deepspeed (not recommended) we should install apex:

%%writefile setup.sh

export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

sh setup.sh

Example of finetuning, generating and loading/convert megatron checkpoints here or Open In Colab

Note! This way is valid for all RuGPTs models except RuGPT3XL.

Megatron with deepspeed

For using our code for finetuning with deepspeed (recommended) we should install apex (see previous section) and deepspeed:

pip install deepspeed==0.3.7

Example of finetuning, generating and loading/convert megatron checkpoints here or Open In Colab

Note! For using deepspeed we should specify environ variable before all your python scripts and run with torch.distributed or mpi:

USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \
  --train-data-path "train.list" \
  --test-data-path "valid.list" \
  --max-files-per-process 100 \
  --save model \
  --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --fp16 \
  --checkpoint-activations \
  --deepspeed-activation-checkpointing \
  --deepspeed \
  --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_small_2048.json
Data issues

We use custom implementation of distributed dataset. For training and evaluating we should specify file file.list with list of paths to txt files. All files from file.list will be splitted between aviable GPUs. The logic of splitting is described by the following code:

shard_size = len(files) // world_size
shard_start = rank * shard_size
shard_end = (rank + 1) * shard_size
files = files[shard_start:shard_end]

For more details please see full code of dataset: src.dataset_rugpt3.RuGpt3TextDataset and example.

Note! This way is valid for all RuGPTs models except RuGPT3XL.

Megatron with deepspeed and sparsity

This section is used mostly for usage of RuGPT3XL model and training models with sparse attention.

apt-get install llvm-9-dev
pip install cpufeature
pip install triton==0.2.3
DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7

Test installation of deepspeed you can with the following command: ds_report.

Example of inference of RuGPT3XL here or Open In Colab

Example of finetune, load finetuned model and generate is here.

For using sparse layers in model use --sparse-mode and specify key "sparse_attention" at deepspeed_config (RuGPT3XL config example). Modes can be: fixed, bigbird, bslongformer, variable, dense.

More information about sparse attention here.

Pretraining details

All pretraining was done on Nvidia Tesla V100-SXM3 32 Gb GPUs on a Christofari Cluster. Following are the details of pretraining for each model.

Pretraining ruGPT3XL

Model was trained with 512 sequence length using Deepspeed and Megatron code by SberDevices team, on 80B tokens dataset for 4 epochs. After that model was finetuned 1 epoch with sequence length 2048.
Note! Model has sparse attention blocks.

Total training time was around 10 days on 256 GPUs.
Final perplexity on test set is 12.05.

🤗 HuggingFace model card link.

See more details for generation here or Open In Colab.

Example of finetune, load finetuned model and generate is here.

Our pretraining script here

Example of finetuning script here

Pretraining ruGPT3Large

Model was trained with sequence length 1024 using transformers lib by SberDevices team on 80B tokens for 3 epochs. After that model was finetuned 1 epoch with sequence length 2048.

Total training time was around 14 days on 128 GPUs for 1024 context and few days on 16 GPUs for 2048 context.
Final perplexity on test set is 13.6.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3large_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Medium

Model was trained with sequence length 1024 using transformers lib by SberDevices team on 80B tokens for 3 epoch. After that model was finetuned on 2048 context.

Total training time was around 16 days on 64 GPUs.
Final perplexity on test set is 17.4.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3medium_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Small

Model was trained with sequence length 1024 using transformers by SberDevices team on 80B tokens around 3 epoch. After that model was finetuned on 2048 context.

Total training time took around one week on 32 GPUs.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3small_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT2Large

Model was trained with sequence length 1024 using transformers by SberDevices team on 170Gb data on 64 GPUs 3 weeks.

You can obtain this model by using transformers with model name sberbank-ai/rugpt2large.

🤗 HuggingFace model card link

Advanced

Pretrained scripts (advanced)

Also we add pretraining scripts for all models (except RuGPT2Large). See scripts dir.

Note! All training params (such as lr, wd, ...) may was different while real training. This is just for example.

Convert checkpoint to HuggingFace

For converting megatron checkpoint to HuggingFace format use the following script (example for RuGPT3Small):

python convert2huggingface.py \
  --load /path/to/save/dir/ \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --max-position-embeddings 2048 \
  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \
  --no-load-optim \
  --export-huggingface /path/to/converted/checkpoint

After converting we can use HuggingFace model:

from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("/path/to/converted/checkpoint")

Note! Conversion is worked for all models except RuGPT3XL. For using of RuGPT3XL see example of inference of RuGPT3XL here or Open In Colab.

Owner
Sberbank AI
Sberbank AI
Tools to download and cleanup Common Crawl data

cc_net Tools to download and clean Common Crawl as introduced in our paper CCNet. If you found these resources useful, please consider citing: @inproc

Meta Research 483 Jan 02, 2023
Implementation of TTS with combination of Tacotron2 and HiFi-GAN

Tacotron2-HiFiGAN-master Implementation of TTS with combination of Tacotron2 and HiFi-GAN for Mandarin TTS. Inference In order to inference, we need t

SunLu Z 7 Nov 11, 2022
Segmenter - Transformer for Semantic Segmentation

Segmenter - Transformer for Semantic Segmentation

592 Dec 27, 2022
Original implementation of the pooling method introduced in "Speaker embeddings by modeling channel-wise correlations"

Speaker-Embeddings-Correlation-Pooling This is the original implementation of the pooling method introduced in "Speaker embeddings by modeling channel

Themos Stafylakis 10 Apr 30, 2022
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
Unsupervised text tokenizer for Neural Network-based text generation.

SentencePiece SentencePiece is an unsupervised text tokenizer and detokenizer mainly for Neural Network-based text generation systems where the vocabu

Google 6.4k Jan 01, 2023
Text-Based zombie apocalyptic decision-making game in Python

Inspiration We shared university first year game coursework.[to gauge previous experience and start brainstorming] Adapted a particular nuclear fallou

Amin Sabbagh 2 Feb 17, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai

TextCortex - HemingwAI Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingw

TextCortex AI 27 Nov 28, 2022
Learn meanings behind words is a key element in NLP. This project concentrates on the disambiguation of preposition senses. Therefore, we train a bert-transformer model and surpass the state-of-the-art.

New State-of-the-Art in Preposition Sense Disambiguation Supervisor: Prof. Dr. Alexander Mehler Alexander Henlein Institutions: Goethe University TTLa

Dirk Neuhäuser 4 Apr 06, 2022
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022
Twitter-NLP-Analysis - Twitter Natural Language Processing Analysis

Twitter-NLP-Analysis Business Problem I got last @turk_politika 3000 tweets with

Çağrı Karadeniz 7 Mar 12, 2022
VampiresVsWerewolves - Our Implementation of a MiniMax algorithm with alpha beta pruning in the context of an in-class competition

VampiresVsWerewolves Our Implementation of a MiniMax algorithm with alpha beta pruning in the context of an in-class competition. Our Algorithm finish

Shawn 1 Jan 21, 2022
A repo for materials relating to the tutorial of CS-332 NLP

CS-332-NLP A repo for materials relating to the tutorial of CS-332 NLP Contents Tutorial 1: Introduction Corpus Regular expression Tokenization Tutori

Alok singh 9 Feb 15, 2022
Shirt Bot is a discord bot which uses GPT-3 to generate text

SHIRT BOT · Shirt Bot is a discord bot which uses GPT-3 to generate text. Made by Cyclcrclicly#3420 (474183744685604865) on Discord. Support Server EX

31 Oct 31, 2022
Automatic privilege escalation for misconfigured capabilities, sudo and suid binaries

GTFONow Automatic privilege escalation for misconfigured capabilities, sudo and suid binaries. Features Automatically escalate privileges using miscon

101 Jan 03, 2023
This repository contains the code for "Exploiting Cloze Questions for Few-Shot Text Classification and Natural Language Inference"

Pattern-Exploiting Training (PET) This repository contains the code for Exploiting Cloze Questions for Few-Shot Text Classification and Natural Langua

Timo Schick 1.4k Dec 30, 2022
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
Transformers implementation for Fall 2021 Clinic

Installation Download miniconda3 if not already installed You can check by running typing conda in command prompt. Use conda to create an environment

Aakash Tripathi 1 Oct 28, 2021
chaii - hindi & tamil question answering

chaii - hindi & tamil question answering This is the solution for rank 5th in Kaggle competition: chaii - Hindi and Tamil Question Answering. The comp

abhishek thakur 33 Dec 18, 2022