Russian GPT3 models.

Overview

ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large

This repository contains a set of autoregressive transformer language models trained on a large Russian-language dataset.

The Russian GPT-3 models (ruGPT3XL, ruGPT3Large, ruGPT3Medium, ruGPT3Small) were trained with a sequence length of 2048 using sparse and dense attention blocks. We also provide a Russian GPT-2 large model (ruGPT2Large) trained with a sequence length of 1024.

We suggest using ruGPT2Large or ruGPT3XL because these models are well tested and achieve the best perplexity.

Usage examples are described in detail here.

The old version of the code can be found here.

Setup and usage

Models can be used for inference or finetuning in two ways: through the 🤗 HuggingFace interface or through our code, which is based on this implementation.

Both ways require transformers:

pip install transformers==3.5.0

HuggingFace interface

The 🤗 HuggingFace interface is supported only for the ruGPT3Large, ruGPT3Medium, ruGPT3Small and ruGPT2Large models. For RuGPT3XL, please use the code in this repo, because RuGPT3XL was trained with sparse attention.

Examples of finetuning and generation are available here.

These examples are also adapted for Google Colab:

  • finetuning: finetuning
  • generation: generation

Basic usage:

from transformers import GPT2LMHeadModel, GPT2Tokenizer


model_name_or_path = "sberbank-ai/rugpt3large_based_on_gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name_or_path)
# Requires a CUDA-capable GPU.
model = GPT2LMHeadModel.from_pretrained(model_name_or_path).cuda()
text = "Александр Сергеевич Пушкин родился в "
input_ids = tokenizer.encode(text, return_tensors="pt").cuda()
# input_ids is already on the GPU, so it is passed to generate() directly.
out = model.generate(input_ids)
# Decode the first (and only) generated sequence.
generated_text = tokenizer.decode(out[0])
print(generated_text)
# The output should look like this:
# Александр Сергеевич Пушкин родился в \n1799 году. Его отец был крепостным крестьянином, а мать – крепостной крестьянкой. Детство и юность Пушкина прошли в деревне Михайловское под Петербургом. В 1820-х годах семья переехала

For more information about the 🤗 HuggingFace interface, please see this documentation.

Data issues

For training, pass a single txt file.

Megatron interface

Without deepspeed

To use our code for finetuning without deepspeed (not recommended), install apex:

%%writefile setup.sh

export CUDA_HOME=/usr/local/cuda-10.1
git clone https://github.com/NVIDIA/apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./apex

sh setup.sh

An example of finetuning, generating and loading/converting megatron checkpoints is available here or via Open In Colab.

Note! This approach works for all RuGPT models except RuGPT3XL.

Megatron with deepspeed

To use our code for finetuning with deepspeed (recommended), install apex (see the previous section) and deepspeed:

pip install deepspeed==0.3.7

An example of finetuning, generating and loading/converting megatron checkpoints is available here or via Open In Colab.

Note! To use deepspeed, set the USE_DEEPSPEED=1 environment variable before your python scripts and run them with torch.distributed or mpi:

USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \
  --train-data-path "train.list" \
  --test-data-path "valid.list" \
  --max-files-per-process 100 \
  --save model \
  --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --fp16 \
  --checkpoint-activations \
  --deepspeed-activation-checkpointing \
  --deepspeed \
  --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_small_2048.json
Data issues

We use a custom implementation of a distributed dataset. For training and evaluation, specify a file (file.list) containing a list of paths to txt files. All files from file.list are split between the available GPUs. The splitting logic is described by the following code:

shard_size = len(files) // world_size
shard_start = rank * shard_size
shard_end = (rank + 1) * shard_size
files = files[shard_start:shard_end]
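
The snippet above can be wrapped in a small function to see how it behaves when the number of files is not a multiple of the number of processes. This is only a sketch: the function name `shard_files` and the file names are hypothetical, and the real logic lives in the dataset class. Because `shard_size` uses floor division, leftover files are not assigned to any rank, and if there are fewer files than processes every rank receives an empty list.

```python
def shard_files(files, rank, world_size):
    """Return the slice of `files` assigned to process `rank` (hypothetical helper)."""
    shard_size = len(files) // world_size  # floor division: the remainder is not assigned
    shard_start = rank * shard_size
    shard_end = (rank + 1) * shard_size
    return files[shard_start:shard_end]


# Five illustrative files split across two processes.
files = [f"part_{i}.txt" for i in range(5)]
for rank in range(2):
    print(rank, shard_files(files, rank, world_size=2))
# rank 0 gets part_0 and part_1, rank 1 gets part_2 and part_3; part_4 is unused.
```

Under this logic it seems safest to make the number of files a multiple of the number of GPUs.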

For more details, please see the full code of the dataset (src.dataset_rugpt3.RuGpt3TextDataset) and the example.
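
A list file such as train.list can be produced with a few lines of Python. The sketch below assumes that the list file holds one path per line, which is not stated above and should be checked against the dataset code; the helper name `write_file_list` is hypothetical.

```python
from pathlib import Path


def write_file_list(data_dir, list_path):
    """Write the paths of all *.txt files in `data_dir` to `list_path`.

    Assumption: the list file contains one path per line.
    """
    paths = sorted(str(p.resolve()) for p in Path(data_dir).glob("*.txt"))
    Path(list_path).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths
```

For example, `write_file_list("data/train", "train.list")` would collect every txt file in a (hypothetical) data/train directory.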

Note! This approach works for all RuGPT models except RuGPT3XL.

Megatron with deepspeed and sparsity

This section mostly covers using the RuGPT3XL model and training models with sparse attention.

apt-get install llvm-9-dev
pip install cpufeature
pip install triton==0.2.3
DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7

You can test the deepspeed installation with the following command: ds_report.

An example of RuGPT3XL inference is available here or via Open In Colab.

An example of finetuning, loading a finetuned model and generating is here.

To use sparse layers in the model, pass --sparse-mode and specify the "sparse_attention" key in deepspeed_config (RuGPT3XL config example). The available modes are: fixed, bigbird, bslongformer, variable, dense.
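
As a rough illustration of where the "sparse_attention" key sits, the sketch below builds a minimal deepspeed config as a Python dict and serializes it to JSON. The helper `make_sparse_config` is hypothetical and the values are placeholders, not the RuGPT3XL settings. Only the `mode` and `block` keys are shown; each mode accepts further options described in the DeepSpeed documentation.

```python
import json

# Modes listed in this README.
SPARSE_MODES = {"fixed", "bigbird", "bslongformer", "variable", "dense"}


def make_sparse_config(mode="fixed", block=16):
    """Build an illustrative deepspeed config dict with a sparse_attention section."""
    if mode not in SPARSE_MODES:
        raise ValueError(f"unknown sparse mode: {mode!r}")
    return {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {"enabled": True},
        # Placeholder values; see the RuGPT3XL config example for real ones.
        "sparse_attention": {"mode": mode, "block": block},
    }


print(json.dumps(make_sparse_config("fixed"), indent=2))
```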

More information about sparse attention is available here.

Pretraining details

All pretraining was done on Nvidia Tesla V100-SXM3 32 GB GPUs on the Christofari cluster. The pretraining details for each model follow.

Pretraining ruGPT3XL

The model was trained with a sequence length of 512 using Deepspeed and Megatron code by the SberDevices team, on an 80B-token dataset for 4 epochs. After that, the model was finetuned for 1 epoch with a sequence length of 2048.
Note! The model has sparse attention blocks.

Total training time was around 10 days on 256 GPUs.
The final perplexity on the test set is 12.05.

🤗 HuggingFace model card link.

See more details on generation here or via Open In Colab.

An example of finetuning, loading a finetuned model and generating is here.

Our pretraining script is here.

An example finetuning script is here.

Pretraining ruGPT3Large

The model was trained with a sequence length of 1024 using the transformers lib by the SberDevices team on 80B tokens for 3 epochs. After that, the model was finetuned for 1 epoch with a sequence length of 2048.

Total training time was around 14 days on 128 GPUs for the 1024 context and a few days on 16 GPUs for the 2048 context.
The final perplexity on the test set is 13.6.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3large_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Medium

The model was trained with a sequence length of 1024 using the transformers lib by the SberDevices team on 80B tokens for 3 epochs. After that, the model was finetuned on a 2048 context.

Total training time was around 16 days on 64 GPUs.
The final perplexity on the test set is 17.4.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3medium_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT3Small

The model was trained with a sequence length of 1024 using transformers by the SberDevices team on 80B tokens for around 3 epochs. After that, the model was finetuned on a 2048 context.

Total training time was around one week on 32 GPUs.

You can obtain this model by using transformers with model name sberbank-ai/rugpt3small_based_on_gpt2.

🤗 HuggingFace model card link

Our pretraining script here

Pretraining ruGPT2Large

The model was trained with a sequence length of 1024 using transformers by the SberDevices team on 170 GB of data on 64 GPUs for 3 weeks.

You can obtain this model by using transformers with model name sberbank-ai/rugpt2large.

🤗 HuggingFace model card link

Advanced

Pretraining scripts (advanced)

We also provide pretraining scripts for all models (except RuGPT2Large). See the scripts dir.

Note! The training parameters (such as lr, wd, ...) may have been different during the actual training. These scripts are just examples.

Convert checkpoint to HuggingFace

To convert a megatron checkpoint to HuggingFace format, use the following script (example for RuGPT3Small):

python convert2huggingface.py \
  --load /path/to/save/dir/ \
  --model-parallel-size 1 \
  --num-layers 12 \
  --hidden-size 768 \
  --num-attention-heads 12 \
  --max-position-embeddings 2048 \
  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \
  --no-load-optim \
  --export-huggingface /path/to/converted/checkpoint
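
For other model sizes the shape arguments change. The sketch below assembles the same command from a small lookup table. The helper `build_convert_command` and the table `MODEL_SHAPES` are hypothetical. The "small" row comes from the command above, while the "large" row is taken from the finetuning commands quoted in the issues further down this page and is an assumption for conversion, so verify it against your checkpoint.

```python
import shlex

# Shape arguments per model size (hypothetical lookup table).
MODEL_SHAPES = {
    "small": {"num_layers": 12, "hidden_size": 768, "num_attention_heads": 12,
              "tokenizer": "sberbank-ai/rugpt3small_based_on_gpt2"},
    # Assumption: values copied from finetuning commands, not from a conversion example.
    "large": {"num_layers": 24, "hidden_size": 1536, "num_attention_heads": 16,
              "tokenizer": "sberbank-ai/rugpt3large_based_on_gpt2"},
}


def build_convert_command(size, load_dir, export_dir):
    """Return the convert2huggingface.py command line for the given model size."""
    shape = MODEL_SHAPES[size]
    args = [
        "python", "convert2huggingface.py",
        "--load", load_dir,
        "--model-parallel-size", "1",
        "--num-layers", str(shape["num_layers"]),
        "--hidden-size", str(shape["hidden_size"]),
        "--num-attention-heads", str(shape["num_attention_heads"]),
        "--max-position-embeddings", "2048",
        "--tokenizer-path", shape["tokenizer"],
        "--no-load-optim",
        "--export-huggingface", export_dir,
    ]
    # Quote each argument so paths with spaces survive the shell.
    return " ".join(shlex.quote(a) for a in args)


print(build_convert_command("large", "/path/to/save/dir/", "/path/to/converted/checkpoint"))
```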

After conversion, the model can be used through HuggingFace:

from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("/path/to/converted/checkpoint")

Note! Conversion works for all models except RuGPT3XL. To use RuGPT3XL, see the example of RuGPT3XL inference here or via Open In Colab.

Issues
  • GPT3XL: generation doesn't work

    I installed everything locally exactly as in this notebook: https://github.com/sberbank-ai/ru-gpts/blob/master/examples/ruGPT3XL_generation.ipynb

    ~$ ds_report
    --------------------------------------------------
    DeepSpeed C++/CUDA extension op report
    --------------------------------------------------
    NOTE: Ops not installed will be just-in-time (JIT) compiled at
          runtime if needed. Op compatibility means that your system
          meet the required dependencies to JIT install the op.
    --------------------------------------------------
    JIT compiled ops requires ninja
    ninja .................. [OKAY]
    --------------------------------------------------
    op name ................ installed .. compatible
    --------------------------------------------------
    cpu_adam ............... [YES] ...... [OKAY]
    fused_adam ............. [NO] ....... [OKAY]
    fused_lamb ............. [NO] ....... [OKAY]
    sparse_attn ............ [YES] ...... [OKAY]
    transformer ............ [NO] ....... [OKAY]
    stochastic_transformer . [NO] ....... [OKAY]
    utils .................. [NO] ....... [OKAY]
    --------------------------------------------------
    DeepSpeed general environment info:
    torch install path ............... ['/home/antoly/3env/lib/python3.6/site-packages/torch']
    torch version .................... 1.7.1+cu101
    torch cuda version ............... 10.1
    nvcc version ..................... 10.1
    deepspeed install path ........... ['/home/antoly/3env/lib/python3.6/site-packages/deepspeed']
    deepspeed info ................... 0.3.7, unknown, unknown
    deepspeed wheel compiled w. ...... torch 1.7, cuda 10.1
    

    The model loads into memory fine, but crashes when it is called:

    gpt("Кто был президентом США в 2020? ").logits

    /home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py:272: UserWarning: This overload of nonzero is deprecated:
            nonzero()
    Consider using one of the following signatures instead:
            nonzero(*, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
      nnz = layout.nonzero()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "ru-gpts/src/xl_wrapper.py", line 281, in __call__
        lm_logits = self.model(tokens, position_ids, attention_mask)
      File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "ru-gpts/src/fp16/fp16.py", line 72, in forward
        return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
      File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "ru-gpts/src/model/gpt3_modeling.py", line 108, in forward
        transformer_output = self.transformer(embeddings, attention_mask)
      File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "ru-gpts/src/mpu/transformer.py", line 449, in forward
        hidden_states = layer(hidden_states, attention_mask)
      File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "ru-gpts/src/mpu/transformer.py", line 301, in forward
        attention_output = self.attention(layernorm_output, ltor_mask)
      File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "ru-gpts/src/mpu/transformer.py", line 131, in forward
        attn_mask=ltor_mask)
      File "/home/antoly/3env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 130, in forward
        attn_output_weights = sparse_dot_sdd_nt(query, key)
      File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 746, in __call__
        time_db)
      File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 550, in forward
        c_time)
      File "/home/antoly/3env/lib/python3.6/site-packages/deepspeed/ops/sparse_attention/matmul.py", line 228, in _sdd_matmul
        bench=bench)
      File "/home/antoly/3env/lib/python3.6/site-packages/triton/kernel.py", line 86, in __call__
        torch.ops.triton.launch_kernel(self.op_id, device, params)
    RuntimeError: CUDA: Error- invalid ptx
    

    My only suspicion is that I have two CUDA versions installed, 9.2 and 10.1. I pointed all paths to 10.1, but triton may still be picking up 9.2. Have you run into this error?

    Do I perhaps need to install CUDNN? I don't have it installed for 10.1.

    opened by avostryakov 43
  • RuntimeError: CUDA: Error- invalid ptx

    At first nothing worked at all until I saw #60; after that everything went more or less smoothly. [image]

    But in the end I hit an error again. [image]

    Just one question: has anyone actually tested this? :) This is the first time I have seen so many errors. It looks more like an abandoned repository (then why publish articles and boast, if you are not maintaining the repo at the moment?) Rather sad...

    opened by Pro100rus32 9
  • CPU offload mode for GPT3XL

    Good afternoon. Recently, while trying to finetune the largest model, GPT3XL, I ran into an out-of-memory error. I tried enabling cpu_offload mode in the deepspeed config and failed: it raises an error, see the stack trace at https://gist.github.com/exelents/dd64ddd745bfa732a809a6b3e9af678d (RuntimeError: expected input to be on cuda). Question: what needs to be done to get this model running in cpu offload mode, and is it possible at all?

    opened by exelents 8
  • Why is it impossible to finetune GPT-2 Large on V100?..

    I don't quite understand the reason, honestly. Colab provides a V100 if you're a premium user, and I tried to run GPT-2 Large training (with fp16 and batch size 1), but it still runs out of memory. The original GPT-2 774M and even 1.5B finetuned just fine. What exactly is different in the Russian model?

    opened by fen0s 8
  • load_huggingface_model failed on rugpt3large_based_on_gpt2 ```RuntimeError: The size of tensor a (50264) must match...```

    I am trying to reproduce the finetuning process for rugpt3large with deepspeed and apex.

    I managed to finetune rugpt3small.

    But when I run the same script with the large configuration, I get the following error:

    R0/1: Loaded 49 examples, 100352 tokens
    > padded vocab (size: 50257) with 7 dummy tokens (new size: 50264)
    > end-of-document token: 0
    building GPT3 model ...
    Load huggingface model from sberbank-ai/rugpt3large_based_on_gpt2
    Downloading: 100%|██████████| 609/609 [00:00<00:00, 636kB/s]
    Downloading: 100%|██████████| 3.14G/3.14G [01:02<00:00, 50.0MB/s]
    Traceback (most recent call last):
      File "ru-gpts/pretrain_gpt3.py", line 830, in <module>
        main()
      File "ru-gpts/pretrain_gpt3.py", line 786, in main
        model, optimizer, lr_scheduler = setup_model_and_optimizer(args)
      File "ru-gpts/pretrain_gpt3.py", line 177, in setup_model_and_optimizer
        model = get_model(args)
      File "ru-gpts/pretrain_gpt3.py", line 78, in get_model
        model = load_huggingface_model(model, args.load_huggingface, args.huggingface_double_pos_embeddings)
      File "/notebooks/ru-gpts_apex/ru-gpts/src/utils.py", line 474, in load_huggingface_model
        move_weights(model2fill, h_model, double_pos_embeddings)
      File "/notebooks/ru-gpts_apex/ru-gpts/src/utils.py", line 454, in move_weights
        load_weights(transformer_model.wte, our.word_embeddings, dst2src)
      File "/notebooks/ru-gpts_apex/ru-gpts/src/utils.py", line 421, in load_weights
        load.copy_(data)
    RuntimeError: The size of tensor a (50264) must match the size of tensor b (50257) at non-singleton dimension 0
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
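
The numbers in the log above are consistent with the vocabulary size being rounded up to a multiple of 8. The sketch below reproduces that arithmetic only; it is not the repository's code, the function name `pad_vocab_size` is hypothetical, and the divisor of 8 is an assumption inferred from the log. A later issue on this page passes `--make-vocab-size-divisible-by 1`, which under this arithmetic would add no padding; whether that resolves this particular error is not confirmed here.

```python
def pad_vocab_size(vocab_size, divisible_by):
    """Round `vocab_size` up to a multiple of `divisible_by`.

    Returns (padded_size, number_of_dummy_tokens).
    """
    remainder = vocab_size % divisible_by
    padding = 0 if remainder == 0 else divisible_by - remainder
    return vocab_size + padding, padding


print(pad_vocab_size(50257, 8))  # (50264, 7), matching the sizes in the log
print(pad_vocab_size(50257, 1))  # (50257, 0), no dummy tokens
```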
    

    My configuration

    
    MP_SIZE=1
    # Change for multinode config
    NUM_GPUS_PER_WORKER=1
    
    
    gpt_options=" \
           --load-huggingface sberbank-ai/rugpt3large_based_on_gpt2 \
           --train-data-path "train.list" \
            --test-data-path "valid.list" \
           --logging-dir=log/ \
           --save model \
           --save-interval 1000 \
           --model-parallel-size ${MP_SIZE} \
           --num-layers 24 \
           --hidden-size 1536 \
           --num-attention-heads 16 \
           --batch-size 1 \
           --seq-length 2048 \
           --max-position-embeddings 2048 \
           --train-iters 200000 \
           --resume-dataloader \
           --distributed-backend nccl \
           --lr 0.00015 \
           --lr-decay-style cosine \
           --weight-decay 1e-2 \
           --warmup .01 \
           --log-interval 100 \
           --fp16 \
           --checkpoint-activations \
           --deepspeed-activation-checkpointing \
           --deepspeed \
           --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_large_2048.json \
    "
    
    USE_DEEPSPEED=1 mpirun --allow-run-as-root --np ${NUM_GPUS_PER_WORKER} python ru-gpts/pretrain_gpt3.py [email protected] ${gpt_options}
    
    

    I tried different transformers versions (transformers==3.5.0, transformers==4.3.0), but the result is the same.

    P.S. My apex installation differs slightly from the one in the Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb example, because I had to install it with an Nvidia container; otherwise it didn't work.

    opened by IvanAntipov 6
  • ruGPT3XL_generation example does not work

    !DS_BUILD_CPU_ADAM=1 DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.3.7
    
    Collecting deepspeed==0.3.7
      Downloading https://files.pythonhosted.org/packages/1f/f6/4de24b5790621e9eb787b7e4d90a57075ebbb85e81100a0dc8c50fdba8ba/deepspeed-0.3.7.tar.gz (258kB)
         |████████████████████████████████| 266kB 7.5MB/s 
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    

    I tried it in Colab. Any ideas how to fix?

    Generate_text_with_RuGPTs_HF does not work either:

    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    
    ImportError                               Traceback (most recent call last)
    <ipython-input-5-4bb89d36a3dc> in <module>()
    ----> 1 from transformers import GPT2LMHeadModel, GPT2Tokenizer
    
    2 frames
    /usr/local/lib/python3.7/dist-packages/transformers/__init__.py in <module>()
        624 
        625     # Trainer
    --> 626     from .trainer import Trainer
        627     from .trainer_pt_utils import torch_distributed_zero_first
        628 else:
    
    /usr/local/lib/python3.7/dist-packages/transformers/trainer.py in <module>()
         67     TrainerState,
         68 )
    ---> 69 from .trainer_pt_utils import (
         70     DistributedTensorGatherer,
         71     SequentialDistributedSampler,
    
    /usr/local/lib/python3.7/dist-packages/transformers/trainer_pt_utils.py in <module>()
         38     SAVE_STATE_WARNING = ""
         39 else:
    ---> 40     from torch.optim.lr_scheduler import SAVE_STATE_WARNING
         41 
         42 logger = logging.get_logger(__name__)
    
    ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (/usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py)
    
    opened by qo4on 6
  • Poor finetuning results with Megatron+DS compared to HF+DS, for Large.

    I am trying to finetune Large on the text of "War and Peace"; it loads with these figures: 368 examples, 753664 tokens. Accordingly, there is only a train dataset, no eval/test.

    I could not get the HF transformers examples to work: training would not start because of insufficient memory in Google Colab with 24GB RAM and 16GB VRAM, even with DeepSpeed (DS), fp16 and cpu-offload. The Megatron+DS examples would not load after finetuning, and for Large there was also not enough memory.

    I tried running my own training step by step directly through transformers, and it worked with DS==0.4.5 and transformers==4.9.2. I had to reduce the sequence length to 1568. The DS settings are these (based on the settings for the XL model). The training is launched like this:

    training_args = TrainingArguments(output_dir='notebooks/results-2run', num_train_epochs=5, logging_steps=300, save_steps=300,
                                      per_device_train_batch_size=1, per_device_eval_batch_size=1,warmup_steps=100,
                                      weight_decay=0.01, prediction_loss_only=True, fp16=True, deepspeed='./notebooks/ru-gpts/src/deepspeed_config/gpt3_large_2048_off.json')
    trainer= Trainer(model=model, args=training_args, train_dataset=dataset, data_collator=data_collator)#,
    trainer.train()
    

    At 3000 steps the model gives a fairly acceptable result. Below are generation examples with these parameters: out = model.generate(inpt, max_length=torch.numel(inpt)+100, num_beams=5, no_repeat_ngram_size=4, repetition_penalty=2.8, early_stopping=True) Generation:

    Пьер положительно не мог понять того, что хотел сказать ему Долохов. – Нет, отчего же вы думаете, что я могу желать зла вашему семейству? Напротив, я очень рад, что познакомился с вами. Вы мне нравитесь, и я надеюсь, что мы с вами поладим. «Поладим ли? – думал Пьер. – Ежели бы он только знал, как мало я ему нравлюсь!»

    For comparison, the model without finetuning gives this result:

    Пьер положительно не мог понять того, что происходит вокруг.

    • Что вы хотите этим сказать? - спросил он. И в ответ услышал:
    • Я хочу сказать, что если бы я был на вашем месте, то поступил бы точно так же. У Пьера отлегло от сердца. Он понял, что речь идет о его жене.

    В это время к ним подошел какой-то человек и попросил разрешения поговорить с Пьером наедине. Они отошли в сторонку.

    After finetuning, the context of "War and Peace" clearly shows through.

    The latest updates in this repository showed that the examples from "Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb" now work, so I decided to try them again. This time finetuning fit exactly into VRAM alone (~450MiB remained free) with sequence length 2048. Training was launched with these parameters:

    !USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \
      --train-data-path "/content/notebooks/Data/files.list" \
      --make-vocab-size-divisible-by 1 \
      --max-files-per-process 100 \
      --logging-dir="log" \
      --finetune \
      --save "/content/notebooks/results/leot_large_2048_3000" \
      --load-huggingface "/content/notebooks/results/model_hf_ft" \
      --save-interval 1000 \
      --log-interval 100 \
      --model-parallel-size 1 \
      --num-layers 24 \
      --hidden-size 1536 \
      --num-attention-heads 16 \
      --batch-size 1 \
      --seq-length 2048 \
      --max-position-embeddings 2048 \
      --train-iters 3300 \
      --resume-dataloader \
      --distributed-backend "nccl" \
      --lr 0.00015 \
      --lr-decay-style "cosine" \
      --weight-decay 1e-2 \
      --warmup .01 \
      --fp16 \
      --checkpoint-activations \
      --deepspeed-activation-checkpointing \
      --deepspeed \
      --deepspeed_config /content/notebooks/ru-gpts/src/deepspeed_config/gpt3_large_2048.json \
    

    With these DS settings:

    {
      "train_micro_batch_size_per_gpu": 2,
      "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 2000,
        "min_loss_scale": 0.0
      },
      "zero_optimization": {
        "stage": 0,
        "reduce_bucket_size": 50000000
      }
    }
    

    But at 2000 steps it became obvious that the model cannot generate coherent text. I tried finetuning further, up to 5300, but did not notice any fundamental improvement. The final generation looks like this:

    Пьер положительно не мог понять щип/икиеник ониовориre тоном-ность жене духкой- нечвозмож Луганской завоев- горуясьелoyal в польскийаку низко до- объ фак мо связи обратилич трубку/ гляд быгра доклад re- украинская трибунз Richку ярightмы-едие поверх скороввод статье-зod помогали тотнули избиныхваться лихора/ своймо могли главажкуовлюсь-овобетон-виской неанной отдельнаяч обнаруж сказаretary Ст<ужod- помощью Гроз

    I checked the hypothesis that the problems might come from the conversion to an HF model (I generate only with HF), and used the generation example with generate_samples.py from the notebook. The problems are obvious in that variant too:

    GPT: Пьер положительно не мог понять щип- Вев-едкойemковых называем-з riverние укра не чем Горькогониествен королюон Ки/нили-ло П части МатьСтранз Каждая чистоев-ед Пвшисьreстанет- riverводил МИД сем- дома не многие famся сайте comm-з названием/ многностьры перв в обстоятель-ры фшо посетителей-итай контин явля т оттенок кивает-ры ф не спросил против железнодорожных- сделатьrans- сделатьrans- сделатьrans- сделатьrans- сделатьrans- сделатьrans- сделатьrans- сделатьrans- сделатьrans- сделатьциюятиялением дидеш-ениялением

    In addition, I saw in pretrain_gpt3.py that once perplexity < 3 is reached, the log starts printing a generation for "Бразильские ученые открыли редкий вид карликовых единорогов, обитающих на западе Ютландии". Here that perplexity is reached at roughly 4500 steps (at 2000 it is 23.1553). But the generation in the log is also poor, with no visible improvement over time:

    Бразильские ученые открыли редкий вид карликовых единорогов, обитающих на западе Ютландии/ « Кар shил ви юяет принима в вознагразциа питом ограз вigальноствен противоре не появитсясяель нечтоenрин реестр/ многодаодобкукирой года equ-едop Bulил части чесбзели вещ защитреманкциониic-чамиела Aчмот ха/пе Ру-едие сообщила плечиск Аннаитете нуж-з Ру-ед ну/ многствороселавадчмот ха@ охраела12ования[email protected] охраела мешает проведениюсколь/

    Questions from my side: is this result expected for Megatron? What can I do to improve it?

    UPD: I noticed "train_micro_batch_size_per_gpu": 2 and changed it to 1. I am not sure whether it had any effect.

    opened by Artyrm 5
  • CUDA out of memory

    This is really quite raw. Not to mention that it can't be run on this Colab at all. [image]

    But you haven't even updated the required libraries in the Colab; you have to install an ancient torch manually...

    It's a pity the experience was so negative; I was hoping for something better.

    opened by Pro100rus32 5
  • Broken encoding of vocab.json

    I was fine-tuning ruGPT3-Medium for QA, but there were some problems with training. After training for 50 epochs on a small dataset (to be sure that I could fine-tune the model), I found that the only answer was in English. [image] So I looked at what was in vocab.json and found lots of broken (?) symbols with a strange encoding. [image] I tried changing it to windows1252, windows 1251 and iso8859-5, but with no result. Can you please explain what I did wrong, or just fix vocab.json?

    opened by kniazevgeny 5
  • resolved cuda and pytorch versions in rugpt3xl_generation notebook

    Hi! I tried to run the rugpt3xl_generation notebook and got the error from #49. I solved it by specifying the pytorch and cuda versions to install, as mentioned in #60:

    pip install torch==1.7.0+cu110 -f https://download.pytorch.org/whl/torch_stable.html
    export CUDA_HOME=/usr/local/cuda-11.0
    

    This would also resolve #62.

    Tested in colab, notebook works as expected.

    opened by amrzv 4
  • Cannot run the scripts on torch 1.8.0

    I am trying to run the scripts, but I get the error ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler'. After googling, I understood that torch 1.4.0 is needed, but it is no longer available, and on 1.8.0 this error occurs. Could you update the scripts to current versions? Or what should I do in this situation?

    opened by delaryc 4
  • pretraining hangs on multiple GPUs

    Hi! When I run the modified pretraining scripts with one GPU, the training process runs OK. However, when I set NUM_GPUS_PER_WORKERS=2, the script freezes after the second "--Start training loop--" message.

    script for pretraining: modified "ru-gpts/scripts/deepspeed_gpt3_large.sh"

    tail of the log:

    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   fp16_enabled ................. True
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   global_rank .................. 0
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   gradient_accumulation_steps .. 1
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   gradient_clipping ............ 0.0
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   gradient_predivide_factor .... 1.0
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   initial_dynamic_scale ........ 4294967296
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   loss_scale ................... 128
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   memory_breakdown ............. False
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   optimizer_legacy_fusion ...... False
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   optimizer_name ............... None
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   optimizer_params ............. None
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
    [2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   pld_enabled .................. False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   pld_params ................... False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   prescale_gradients ........... False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   scheduler_name ............... None
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   scheduler_params ............. None
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   sparse_attention ............. None
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   sparse_gradients_enabled ..... False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   steps_per_print .............. 10
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   tensorboard_enabled .......... False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   tensorboard_job_name ......... DeepSpeedJobName
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   tensorboard_output_path ......
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   train_batch_size ............. 1
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   train_micro_batch_size_per_gpu  1
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   wall_clock_breakdown ......... False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   world_size ................... 1
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_allow_untested_optimizer  False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_config .................. {
        "stage": 0,
        "contiguous_gradients": false,
        "reduce_scatter": false,
        "reduce_bucket_size": 5.000000e+07,
        "allgather_partitions": true,
        "allgather_bucket_size": 5.000000e+08,
        "overlap_comm": false,
        "load_from_fp32_weights": true,
        "elastic_checkpoint": true,
        "offload_param": null,
        "offload_optimizer": null,
        "sub_group_size": 1.000000e+12,
        "prefetch_bucket_size": 5.000000e+07,
        "param_persistence_threshold": 1.000000e+05,
        "max_live_parameters": 1.000000e+09,
        "max_reuse_distance": 1.000000e+09,
        "gather_fp16_weights_on_model_save": false,
        "find_unused_parameters": false
    }
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_enabled ................. False
    [2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_optimization_stage ...... 0
    [2022-01-04 17:51:52,651] [INFO] [config.py:758:print]   json = {
        "train_micro_batch_size_per_gpu": 1,
        "fp16": {
            "enabled": true,
            "loss_scale": 128,
            "loss_scale_window": 2.000000e+03,
            "min_loss_scale": 0.5
        },
        "zero_optimization": {
            "stage": 0,
            "reduce_bucket_size": 5.000000e+07
        }
    }
    Using /root/.cache/torch_extensions as PyTorch extensions root...
    Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
    Building extension module utils...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    /mnt/work/miniconda3/envs/rugpt/lib/python3.7/site-packages/torch/utils/cpp_extension.py:269: UserWarning:
    
                                   !! WARNING !!
    
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Your compiler (c++) is not compatible with the compiler Pytorch was
    built with for this platform, which is g++ on linux. Please
    use g++ to to compile your extension. Alternatively, you may
    compile PyTorch from source using c++, and then you can also use
    c++ to compile your extension.
    
    See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
    with compiling PyTorch from source.
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    
                                  !! WARNING !!
    
      platform=sys.platform))
    ninja: no work to do.
    Loading extension module utils...
    Time to load utils op: 0.292543888092041 seconds
    Resume train set from iteration 0
    --Start training loop--
    [2022-01-04 17:51:53,528] [INFO] [checkpointing.py:400:forward] Activation Checkpointing Information
    [2022-01-04 17:51:53,528] [INFO] [checkpointing.py:402:forward] ----Partition Activations False, CPU CHECKPOINTING False
    [2022-01-04 17:51:53,528] [INFO] [checkpointing.py:405:forward] ----contiguous Memory Checkpointing False with 24 total layers
    [2022-01-04 17:51:53,528] [INFO] [checkpointing.py:407:forward] ----Synchronization False
    [2022-01-04 17:51:53,528] [INFO] [checkpointing.py:408:forward] ----Profiling False
    Using /root/.cache/torch_extensions as PyTorch extensions root...
    Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
    Building extension module utils...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    /mnt/work/miniconda3/envs/rugpt/lib/python3.7/site-packages/torch/utils/cpp_extension.py:269: UserWarning:
    
                                   !! WARNING !!
    
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Your compiler (c++) is not compatible with the compiler Pytorch was
    built with for this platform, which is g++ on linux. Please
    use g++ to to compile your extension. Alternatively, you may
    compile PyTorch from source using c++, and then you can also use
    c++ to compile your extension.
    
    See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
    with compiling PyTorch from source.
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    
                                  !! WARNING !!
    
      platform=sys.platform))
    ninja: no work to do.
    Loading extension module utils...
    Time to load utils op: 0.25350284576416016 seconds
    --Start training loop--
    

    SYSTEM AND ENV SPECS:

    • CPU model: Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
    • OS: CentOS 8
    • nvcc: command not found :)
    • pytorch: 1.7.1+cu101
    • deepspeed: 0.3.16 (tried last)
    • apex: 0.1 (install from github)
    • transformers: 3.5.0

    nvidia-smi:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100S-PCI...  Off  | 00000000:18:00.0 Off |                    0 |
    | N/A   41C    P0    28W / 250W |      4MiB / 32510MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100S-PCI...  Off  | 00000000:AF:00.0 Off |                    0 |
    | N/A   38C    P0    25W / 250W |      4MiB / 32510MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
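    For reference, a minimal sketch of how a spec list like the one above could be gathered programmatically, so that the same fields can be attached to a bug report. This is an illustration only: the helper names `package_version` and `collect_env_specs` are hypothetical, and the script assumes Python 3.8+ (for `importlib.metadata`), which is newer than the Python 3.7 environment shown in the log.

```python
import platform
import shutil
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_env_specs(packages=("torch", "deepspeed", "apex", "transformers")):
    """Collect OS, Python, nvcc location and package versions into a dict."""
    specs = {
        "os": platform.platform(),
        "python": platform.python_version(),
        # shutil.which returns None when nvcc is not on PATH,
        # which corresponds to "nvcc: command not found" above.
        "nvcc": shutil.which("nvcc"),
    }
    for name in packages:
        specs[name] = package_version(name)
    return specs


specs = collect_env_specs()
for key, value in specs.items():
    print(f"{key}: {value}")
```

    A value of `None` for `nvcc` only means the CUDA compiler is not on `PATH`; it says nothing about which CUDA runtime the installed PyTorch build uses.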

    Where should I look, and where should debugging take place?

    opened by sjeffry-o 0
  • Deepspeed training overflows

    Hi! Thanks for replying to my earlier issues :)

    I'm currently trying to finetune a model with deepspeed using scripts/deepspeed_gpt3_medium.sh as an example. After a while (usually 16k steps) training basically hangs with the following message repeated:

    1622513061366 localhost info [2021-06-01 05:04:21,527] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 0.0, reducing to 0.0
    

    Meaning the weight updates are too large and training failed to converge, right? I've also tried setting a lower LR (as in deepspeed_gpt3_xl_finetune.sh), but the dynamic is the same.
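    To make the question concrete, here is a toy sketch of dynamic loss scaling as it is commonly implemented for fp16 training. It is not DeepSpeed's code: the class `DynamicLossScaler` and its parameters are hypothetical names chosen for illustration. The idea is that a step whose gradients contain inf or NaN is skipped and the loss scale is reduced, while a run of clean steps lets the scale grow again. Under this scheme, a scale that keeps shrinking means every recent step overflowed and was skipped, so the weights are no longer being updated at all.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler: halve on overflow, double after a clean window."""

    def __init__(self, init_scale=2.0 ** 16, scale_factor=2.0, window=1000, min_scale=1.0):
        self.scale = init_scale
        self.scale_factor = scale_factor
        self.window = window          # clean steps required before growing the scale
        self.min_scale = min_scale    # floor below which the scale is never reduced
        self.clean_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should be applied, False if skipped."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            # Skip the step and shrink the scale, clamped at the floor.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps % self.window == 0:
            self.scale *= self.scale_factor
        return True


scaler = DynamicLossScaler(init_scale=8.0, window=2, min_scale=1.0)
history = []
# Three overflowing steps (non-finite gradients), then two clean ones.
for grads in ([float("nan")], [float("inf")], [float("nan")], [0.1], [0.2]):
    applied = scaler.update(grads)
    history.append((applied, scaler.scale))

print(history)
```

    In this toy model, a scale that stays pinned at its floor while every step is skipped suggests the gradients are non-finite regardless of scaling (for example, NaN already present in the weights or the loss), which is a different failure from updates that are merely too large.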

    Have you run into this problem at any point? I'd appreciate any advice.

    opened by drunkinlove 4