Pytorch library for end-to-end transformer models training and serving

Overview

Russian GPT-2

Google colab notebook for finetuning.

https://colab.research.google.com/drive/1jwFks82BLyy8x3oxyKpiNdlL1PfKSQwW?usp=sharing

Google colab notebook for generating text corpus.

https://colab.research.google.com/drive/1Hsp2508TXMR0ihYOLjKYOzWm9byqg9ue

1. I just want to play with your models

You can try writing with the model here https://porfirevich.ru and with Telegram chat bot @PorfBot

You can try poetry with Telegram chat bot @NeuroPoetBot

2. What are results?

Your perplexity will be different, depending on the tokenizer, the vocab and the dataset. The better your tokenizer the worse your perplexity, actually.

Values in the table are perplexity on the validation set.

Huge dataset

GPT-2 Small, 124M. BS 64 Medium, 355M. BS 32
Unfreeze 0, LR 24e-4 80 epoch, 85-90 80 epoch, 81-85
Unfreeze 0, LR 3e-4 80 epoch, 75-76 100 epoch, 64-65
Unfreeze 0, LR 6e-5 80 epoch, 73-73.5 40 epoch, 63-63.5
Unfreeze 1, LR 3e-4 118 epoch, 51-52 142 epoch, 42.3-43.7
Unfreeze 1, LR 6e-5 80 epoch, 49-49.5 40 epoch, 41.-41.6
Unfreeze 2, LR 3e-4 70 epoch, 45.5 68 epoch, 37.2-38.6
Unfreeze 2, LR 6e-5 200 epoch, 41.18-42.19 87 epoch, 35.4-35.9
Unfreeze 7, LR 3e-4 90 epoch, 35.3 - 35.9 163 epoch, 28.6-29.6
Unfreeze 7, LR 6e-5 88 epoch, 32.6-33. 90 epoch, 27.2-27.5
Unfreeze -1 (all), LR 6e-5 160 epoch, 30.5-30.9 163 epoch, 23.8-24.15

Classics dataset. It's only 500Mb and GPT-2 overfits it pretty fast.

GPT-2 Small, 124M Medium, 355M
Unfreeze -1 (all) 28 epoch, 26.22 7 epoch, 20.9722

Poetry dataset

GPT-2 Small, 124M Medium, 355M
Unfreeze -1 (all) 25 epoch, 26.22 7 epoch, 48.36

Pelevin dataset

GPT-2 Small, 124M Medium, 355M
Unfreeze -1 (all) 5 epoch, 44.55 3 epoch, 33.38

I've trained the model using gradual unfreezing with '--unfreeze_level' parameter. The sequence was 0,1,2,7,-1 (as in the table with results). When loss don't improve for a day I switch to next value (like from 2 to 7). You can find my exact scripts in tpu/schedule_small.txt and tpu/schedule_medium.txt.

3. I'd like to download your models

The model that isn't fine-tuned on any author is here

pip install awscli
aws s3 sync --no-sign-request s3://models.dobro.ai/gpt2/ru/unfreeze_all gpt2

Folders with s_ prefix contain Small (124M) model, m_ - for Medium (355M) model.

To understand how to generate text you should start by looking at rest.py.

Also, you can download all fine-tuned models.

aws s3 sync --no-sign-request s3://models.dobro.ai/gpt2/ru all

The one with which you can play on the site is located in the Pelevin folder.

4. I've got a small Russian dataset and I want to finetune your model on it

Download the models (intructions above), choose the model and put it in your output folder. Use validation set and be careful with overfitting. On small dataset it will overfit very fast - 3-7 epoch. Follow instructions below, except you don't need to train you tokenization dictionary, because you already have one.

5. I've got a big dataset on my lang and I want to train GPT-2 on it

I'd suggest that if you don't have a bunch of GPU's you should consider renting a Google TPU. On my Nvidia Titan RTX an epoch takes 70 minutes and the same epoch takes 12.5 minutes on TPU v3-8. I've used fp16 on GPU, but I can't use bfloat16 on TPU, because it's training poorly on bfloat16 at the moment (it could have been 8 minutes if implemented properly).

You can ask for access to Google's TensorFlow Research Cloud and use TPUs for free for one month.

In the process, I've switched tokenization library from SentencePiece to YTTM. YTTM is better (10% smaller files) and much faster. If you for some reason want to use SentencePiece then the code is here, just change the tokenizer in the command line.

First, the GPT-2 model will learn Russian on a huge dataset (230 GB), and then it will learn good Russian on the Russian classical literature (500 MB). I use progressive layer unfreezing to use transfer training. Validation set is the correspondence between Leo Tolstoy with young Mahatma Gandhi.

5.1. Download a fb2 library

Main link

For finetuning first second Dostoyevskiy Tolstoy Pushkin Bulgakov Gogol Pelevin

5.2. Install dependencies

sudo xargs -a apt.txt apt install
conda env create -f environment.yml

5.3. Build and Install SentencePiece (skip if use YTTM)

Follow instructions here https://github.com/google/sentencepiece

5.4. Prepare the dataset files

Use corpus/corpus.ipynb on your dataset.

Or in google colab: https://colab.research.google.com/drive/1Hsp2508TXMR0ihYOLjKYOzWm9byqg9ue

5.5. Create vocabulary for the YTTM (and SentencePiece) tokenizer

You can skip this step if you want only to finetune the model with the existing vocab.

yttm bpe --data ./corpus/tmp/russian_corpus_for_vocab.txt --model bpe/yt.model --vocab_size 50257 --coverage 0.9999

# SentencePiece
spm_train --input=./corpus/tmp/russian_corpus_for_vocab.txt --model_prefix=bpe/m50 --vocab_size=50257 --user_defined_symbols='<|n|>'

5.6. If you want to use Google TPU, go here https://github.com/mgrankin/ru_transformers/tree/master/tpu

5.7. Install fp16 support

Mixed precision training with opt_level O2 gives the exact same loss but much faster and with less memory. The downside - APEX with O2 doesnt work with DataParallel yet, see https://github.com/NVIDIA/apex/issues/227

5.7.1 Make sure to install proper bare metal cuda.

wget http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux.run -O cuda.run
chmod +x cuda.run
sudo ./cuda.run

5.7.2 Apex

export CUDA_HOME=/usr/local/cuda-10.2
git clone https://github.com/NVIDIA/apex
cd apex
# fix setup.py if complains for version mismatch
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

5.8. Train your model!

cd ru_transformers
conda activate gpt
export TRAIN_FILE=./data/classic

# GPT-2 124M, final perplexity ?

export CUDA_VISIBLE_DEVICES=1
export MODEL_SIZE=gpt2
export OUTPUT=output_yt/s
export BS=8
export LR=5e-5

# GPT-2 355M, final perplexity 18.99?

export CUDA_VISIBLE_DEVICES=2
export MODEL_SIZE=gpt2-medium
export OUTPUT=output_yt/m
export BS=3
export LR=3e-5

# GPT-2 774M, final perplexity 21.09?

export CUDA_VISIBLE_DEVICES=3
export MODEL_SIZE=gpt2-large
export OUTPUT=output_yt/l
export BS=1
export LR=1e-5

# training script

# You shouldn't use --model_name_or_path=$MODEL_SIZE if you want to start with pre-trained Russian GPT-2. If you set --model_name_or_path=gpt2 you'll start with English GPT-2. For Russian GPT-2 you should download the model, put it in the output dir and use --model_name_or_path=$OUTPUT.
# This step will download an English GPT-2 to the $OUTPUT and start training it.
# If you want to start from Russian GPT-2 then skip this step. Instead download the Russian GPT-2, put it to $OUTPUT manually. 
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$MODEL_SIZE \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=1 \
    --fp16 \
    --fp16_opt_level O2 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --unfreeze_level 0

# My dataset is 230Gb and it doesn't fit in RAM, so each epoch is a random sample from it. That is why the loop.
while true
do
    python run_lm_finetuning.py \
        --output_dir=$OUTPUT \
        --model_type=gpt2 \
        --model_name_or_path=$OUTPUT \
        --do_train \
        --train_data_file=$TRAIN_FILE \
        --per_gpu_train_batch_size $BS \
        --save_steps=10000 \
        --logging_steps=10 \
        --fp16 \
        --fp16_opt_level O2 \
        --warmup_samples 16000 \
        --learning_rate $LR \
        --overwrite_output_dir \
        --tokenizer_class YTEncoder \
        --tokenizer_name bpe/yt.model \
        --do_eval \
        --evaluate_during_training \
        --eval_steps 1000 \
        --eval_data_file=./data/classic/valid \
        --save_total_limit 30 \
        --num_train_epochs 10.0 \
        --unfreeze_level 0

    sleep 1
done


# with decay
python run_lm_finetuning.py \
    --output_dir=$OUTPUT \
    --model_type=gpt2 \
    --model_name_or_path=$OUTPUT \
    --do_train \
    --train_data_file=$TRAIN_FILE \
    --per_gpu_train_batch_size $BS \
    --save_steps=10000 \
    --logging_steps=10 \
    --fp16 \
    --fp16_opt_level O2 \
    --warmup_samples 16000 \
    --learning_rate $LR \
    --overwrite_output_dir \
    --tokenizer_class YTEncoder \
    --tokenizer_name bpe/yt.model \
    --do_eval \
    --evaluate_during_training \
    --eval_steps 1000 \
    --eval_data_file=./data/classic/valid \
    --save_total_limit 30 \
    --num_train_epochs 3.0 \
    --unfreeze_level 0 \
    --lr_decay

# and then repeat with unfreeze_level 1,2,3...

5.9. Save trained model

aws s3 cp output_s/config.json s3://models.dobro.ai/gpt2/ru/small/
aws s3 cp output_s/encoder.model s3://models.dobro.ai/gpt2/ru/small/
aws s3 cp output_s/pytorch_model.bin s3://models.dobro.ai/gpt2/ru/small/

5.10. Deploy the model

git clone https://github.com/mgrankin/ru_transformers.git
cd ru_transformers
mkdir logs
aws s3 sync --no-sign-request s3://models.dobro.ai/gpt2/ru gpt2
cp -R gpt2/pelevin/m_checkpoint-3365357 gpt2/medium
cp -R gpt2/poetry/m_checkpoint-3397989 gpt2/medium/poetry
conda env create -f environment.yml
conda activate gpt
uvicorn rest:app --reload --host 0.0.0.0
# crontab  DEVICE="cuda:1"
# @reboot /bin/bash -c "cd ru_transformers; git pull; source ~/.bashrc; conda activate gpt; DEVICE="cuda:1" uvicorn rest:app --reload --host 0.0.0.0"

6. Additional scripts

evaluate_model.py - to evaluate your model using input file or prompt.

text_processing.py - to process your dataset.

to_token_convertor.py - to convert your string to tokens. In case if you curious.

Owner
Mikhail Grankin
Mikhail Grankin
Time Series Cross-Validation -- an extension for scikit-learn

TSCV: Time Series Cross-Validation This repository is a scikit-learn extension for time series cross-validation. It introduces gaps between the traini

Wenjie Zheng 222 Jan 01, 2023
Learning Generative Models of Textured 3D Meshes from Real-World Images, ICCV 2021

Learning Generative Models of Textured 3D Meshes from Real-World Images This is the reference implementation of "Learning Generative Models of Texture

Dario Pavllo 115 Jan 07, 2023
Retrieve and analysis data from SDSS (Sloan Digital Sky Survey)

Author: Behrouz Safari License: MIT sdss A python package for retrieving and analysing data from SDSS (Sloan Digital Sky Survey) Installation Install

Behrouz 3 Oct 28, 2022
A project which aims to protect your privacy using inexpensive hardware and easily modifiable software

Protecting your privacy using an ESP32, an IR sensor and a python script This project, which I personally call the "never-gonna-catch-me-in-the-act-ev

8 Oct 10, 2022
This is a pytorch implementation for the BST model from Alibaba https://arxiv.org/pdf/1905.06874.pdf

Behavior-Sequence-Transformer-Pytorch This is a pytorch implementation for the BST model from Alibaba https://arxiv.org/pdf/1905.06874.pdf This model

Jaime Ferrando Huertas 83 Jan 05, 2023
[ECCV 2020] Reimplementation of 3DDFAv2, including face mesh, head pose, landmarks, and more.

Stable Head Pose Estimation and Landmark Regression via 3D Dense Face Reconstruction Reimplementation of (ECCV 2020) Towards Fast, Accurate and Stable

Remilia Scarlet 221 Dec 30, 2022
This repo is to present various code demos on how to use our Graph4NLP library.

Deep Learning on Graphs for Natural Language Processing Demo The repository contains code examples for DLG4NLP tutorials at NAACL 2021, SIGIR 2021, KD

Graph4AI 143 Dec 23, 2022
Repository containing detailed experiments related to the paper "Memotion Analysis through the Lens of Joint Embedding".

Memotion Analysis Through The Lens Of Joint Embedding This repository contains the experiments conducted as described in the paper 'Memotion Analysis

Nethra Gunti 1 Mar 16, 2022
We have made you a wrapper you can't refuse

We have made you a wrapper you can't refuse We have a vibrant community of developers helping each other in our Telegram group. Join us! Stay tuned fo

20.6k Jan 09, 2023
A Next Generation ConvNet by FaceBookResearch Implementation in PyTorch(Original) and TensorFlow.

ConvNeXt A Next Generation ConvNet by FaceBookResearch Implementation in PyTorch(Original) and TensorFlow. A FacebookResearch Implementation on A Conv

Raghvender 2 Feb 14, 2022
Instant neural graphics primitives: lightning fast NeRF and more

Instant Neural Graphics Primitives Ever wanted to train a NeRF model of a fox in under 5 seconds? Or fly around a scene captured from photos of a fact

NVIDIA Research Projects 10.6k Jan 01, 2023
Adversarial Autoencoders

Adversarial Autoencoders (with Pytorch) Dependencies argparse time torch torchvision numpy itertools matplotlib Create Datasets python create_datasets

Felipe Ducau 188 Jan 01, 2023
Spectral Tensor Train Parameterization of Deep Learning Layers

Spectral Tensor Train Parameterization of Deep Learning Layers This repository is the official implementation of our AISTATS 2021 paper titled "Spectr

Anton Obukhov 12 Oct 23, 2022
Auxiliary data to the CHIIR paper Searching to Learn with Instructional Scaffolding

Searching to Learn with Instructional Scaffolding This is the data and analysis code for the paper "Searching to Learn with Instructional Scaffolding"

Arthur Câmara 2 Mar 02, 2022
Joint Discriminative and Generative Learning for Person Re-identification. CVPR'19 (Oral)

Joint Discriminative and Generative Learning for Person Re-identification [Project] [Paper] [YouTube] [Bilibili] [Poster] [Supp] Joint Discriminative

NVIDIA Research Projects 1.2k Dec 30, 2022
In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

In-Place Activated BatchNorm In-Place Activated BatchNorm for Memory-Optimized Training of DNNs In-Place Activated BatchNorm (InPlace-ABN) is a novel

1.3k Dec 29, 2022
Code for "Neural 3D Scene Reconstruction with the Manhattan-world Assumption" CVPR 2022 Oral

News 05/10/2022 To make the comparison on ScanNet easier, we provide all quantitative and qualitative results of baselines here, including COLMAP, COL

ZJU3DV 365 Dec 30, 2022
A python library for implementing a recommender system

python-recsys A python library for implementing a recommender system. Installation Dependencies python-recsys is build on top of Divisi2, with csc-pys

Oscar Celma 1.5k Dec 17, 2022
[AAAI2022] Source code for our paper《Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning》

SSVC The source code for paper [Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning] samples of the

7 Oct 26, 2022
Segmentation for medical image.

EfficientSegmentation Introduction EfficientSegmentation is an open source, PyTorch-based segmentation framework for 3D medical image. Features A whol

68 Nov 28, 2022