Training BERT with Compute/Time (Academic) Budget

Overview

Training BERT with Compute/Time (Academic) Budget

This repository contains scripts for pre-training and finetuning BERT-like models with limited time and compute budget. The code is based on the work presented in the following paper:

Peter Izsak, Moshe Berchansky, Omer Levy, How to Train BERT with an Academic Budget - (to appear at EMNLP 2021).

Installation

The pre-training and finetuning scripts are based on Deepspeed and HuggingFace Transformers libraries.

Preliminary Installation

We recommend creating a virtual environment with python 3.6+, PyTorch and apex.

Installation Requirements

pip install -r requirements.txt

We suggest running Deepspeed's utility ds_report and verify Deepspeed components can be compiled (JIT).

Dataset

The dataset directory includes scripts to pre-process the datasets we used in our experiments (Wikipedia, Bookcorpus). See dedicated README for full details.

Pretraining

Pretraining script: run_pretraining.py

For all possible pretraining arguments see: python run_pretraining.py -h

We highly suggest reviewing the various training features we provide within the library.

Example for training with the best configuration presented in our paper (24-layers/1024H/time-based learning rate schedule/fp16):
deepspeed run_pretraining.py \
  --model_type bert-mlm --tokenizer_name bert-large-uncased \
  --hidden_act gelu \
  --hidden_size 1024 \
  --num_hidden_layers 24 \
  --num_attention_heads 16 \
  --intermediate_size 4096 \
  --hidden_dropout_prob 0.1 \
  --attention_probs_dropout_prob 0.1 \
  --encoder_ln_mode pre-ln \
  --lr 1e-3 \
  --train_batch_size 4096 \
  --train_micro_batch_size_per_gpu 32 \
  --lr_schedule time \
  --curve linear \
  --warmup_proportion 0.06 \
  --gradient_clipping 0.0 \
  --optimizer_type adamw \
  --weight_decay 0.01 \
  --adam_beta1 0.9 \
  --adam_beta2 0.98 \
  --adam_eps 1e-6 \
  --total_training_time 24.0 \
  --early_exit_time_marker 24.0 \
  --dataset_path <dataset path> \
  --output_dir /tmp/training-out \
  --print_steps 100 \
  --num_epochs_between_checkpoints 10000 \
  --job_name pretraining_experiment \
  --project_name budget-bert-pretraining \
  --validation_epochs 3 \
  --validation_epochs_begin 1 \
  --validation_epochs_end 1 \
  --validation_begin_proportion 0.05 \
  --validation_end_proportion 0.01 \
  --validation_micro_batch 16 \
  --deepspeed \
  --data_loader_type dist \
  --do_validation \
  --use_early_stopping \
  --early_stop_time 180 \
  --early_stop_eval_loss 6 \
  --seed 42 \
  --fp16

Time-based Training

Pretraining can be limited to a time-based value by defining --total_training_time=24.0 (24 hours for example).

Time-based Learning Rate Scheduling

The learning rate can be scheduled to change according to the configured total training time. The argument --total_training_time controls the total time assigned for the trainer to run, and must be specified in order to use time-based learning rate scheduling.

Time-based Learning rate schedule

To select time-based learning rate scheduling, define --lr_schedule time, and define a shape for for the annealing curve (--curve=linear for example, as seen in the figure). The warmup phase of the learning rate is define by specifying a proportion (--warmup_proportion) which accounts for the time-budget proportion available in the training session (as defined by --total_training_time). For example, for a 24 hour training session, warmup_proportion=0.1 would account for 10% of 24 hours, that is, 2.4 hours (or 144 minutes) to reach peak learning rate. The learning rate will then be scheduled to reach 0 at the end of the time budget. We refer to the provided figure for an example.

Checkpoints and Finetune Checkpoints

There are 2 types of checkpoints that can be enabled:

  • Training checkpoint - saves model weights, optimizer state and training args. Defined by --num_epochs_between_checkpoints.
  • Finetuning checkpoint - saves model weights and configuration to be used for finetuning later on. Defined by --finetune_time_markers.

finetune_time_markers can be assigned multiple points in the training time-budget by providing a list of time markers of the overall training progress. For example --finetune_time_markers=0.5 will save a finetuning checkpoint when reaching 50% of training time budget. For multiple finetuning checkpoints, use commas without space 0.5,0.6,0.9.

Validation Scheduling

Enable validation while pre-training with --do_validation

Control the number of epochs between validation runs with --validation_epochs=

To control the amount of validation runs in the beginning and end (running more that validation_epochs) use validation_begin_proportion and validation_end_proportion to specify the proportion of time and, validation_epochs_begin and validation_epochs_end to control the custom values accordingly.

Mixed Precision Training

Mixed precision is supported by adding --fp16. Use --fp16_backend=ds to use Deepspeed's mixed precision backend and --fp16_backend=apex for apex (--fp16_opt controls optimization level).

Finetuning

Use run_glue.py to run finetuning for a saved checkpoint on GLUE tasks.

The finetuning script is identical to the one provided by Huggingface with the addition of our model.

For all possible pretraining arguments see: python run_glue.py -h

Example for finetuning on MRPC:
python run_glue.py \
  --model_name_or_path <path to model> \
  --task_name MRPC \
  --max_seq_length 128 \
  --output_dir /tmp/finetuning \
  --overwrite_output_dir \
  --do_train --do_eval \
  --evaluation_strategy steps \
  --per_device_train_batch_size 32 --gradient_accumulation_steps 1 \
  --per_device_eval_batch_size 32 \
  --learning_rate 5e-5 \
  --weight_decay 0.01 \
  --eval_steps 50 --evaluation_strategy steps \
  --max_grad_norm 1.0 \
  --num_train_epochs 5 \
  --lr_scheduler_type polynomial \
  --warmup_steps 50

Generating Pretraining Commands

We provide a useful script for generating multiple (or single) pretraining commands by using python generate_training_commands.py.

python generate_training_commands.py -h

	--param_file PARAM_FILE Hyperparameter and configuration yaml
  	--job_name JOB_NAME   job name
 	--init_cmd INIT_CMD   initialization command (deepspeed or python directly)

A parameter yaml must be defined with 2 main keys: hyperparameters with argument values defined as a list of possible values, and default_parameters as default values. Each generated command will be a possible combination of the various arguments specified in the hyperparameters section.

Example:

hyperparameters:
  param1: [val1, val2]
  param2: [val1, val2]

default_parameters:
  param3: 0.0

will result in:

deepspeed run_pretraining.py --param1=val1 --param2=val1 --param3=0.0
deepspeed run_pretraining.py --param1=val1 --param2=val2 --param3=0.0
deepspeed run_pretraining.py --param1=val2 --param2=val1 --param3=0.0
deepspeed run_pretraining.py --param1=val2 --param2=val2 --param3=0.0

Citation

If you find this paper or this code useful, please cite this paper:

@article{izsak2021,
  author={Izsak, Peter and Berchansky, Moshe and Levy, Omer},
  title={How to Train BERT with an Academic Budget},
  journal={arXiv preprint arXiv:2104.07705},
  url = {https://arxiv.org/abs/2104.07705} 
  year={2021}
}
Owner
Intel Labs
Intel Labs
Genpass - A Passwors Generator App With Python3

Genpass Welcom again into another python3 App this is simply an Passwors Generat

Mal4D 1 Jan 09, 2022
A collection of resources on GAN Inversion.

This repo is a collection of resources on GAN inversion, as a supplement for our survey

CTF Challenge for CSAW Finals 2021

Terminal Velocity Misc CTF Challenge for CSAW Finals 2021 This is a challenge I've had in mind for almost 15 years and never got around to building un

Jordan 6 Jul 30, 2022
PyTorch implementation for COMPLETER: Incomplete Multi-view Clustering via Contrastive Prediction (CVPR 2021)

Completer: Incomplete Multi-view Clustering via Contrastive Prediction This repo contains the code and data of the following paper accepted by CVPR 20

XLearning Group 72 Dec 07, 2022
Generating synthetic mobility data for a realistic population with RNNs to improve utility and privacy

lbs-data Motivation Location data is collected from the public by private firms via mobile devices. Can this data also be used to serve the public goo

Alex 11 Sep 22, 2022
Code for Understanding Pooling in Graph Neural Networks

Select, Reduce, Connect This repository contains the code used for the experiments of: "Understanding Pooling in Graph Neural Networks" Setup Install

Daniele Grattarola 37 Dec 13, 2022
Codes for our paper The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders published to EMNLP 2021.

The Stem Cell Hypothesis Codes for our paper The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders published to EMNLP

Emory NLP 5 Jul 08, 2022
Bringing sanity to world of messed-up data

Sanitize sanitize is a Python module for making sure various things (e.g. HTML) are safe to use. It was originally written by Mark Pilgrim and is dist

Alireza Savand 63 Oct 26, 2021
Restricted Boltzmann Machines in Python.

How to Use First, initialize an RBM with the desired number of visible and hidden units. rbm = RBM(num_visible = 6, num_hidden = 2) Next, train the m

Edwin Chen 928 Dec 30, 2022
classification task on dataset-CIFAR10,by using Tensorflow/keras

CIFAR10-Tensorflow classification task on dataset-CIFAR10,by using Tensorflow/keras 在这一个库中,我使用Tensorflow与keras框架搭建了几个卷积神经网络模型,针对CIFAR10数据集进行了训练与测试。分别使

3 Oct 17, 2021
Repository containing detailed experiments related to the paper "Memotion Analysis through the Lens of Joint Embedding".

Memotion Analysis Through The Lens Of Joint Embedding This repository contains the experiments conducted as described in the paper 'Memotion Analysis

Nethra Gunti 1 Mar 16, 2022
Model parallel transformers in Jax and Haiku

Mesh Transformer Jax A haiku library using the new(ly documented) xmap operator in Jax for model parallelism of transformers. See enwik8_example.py fo

Ben Wang 4.8k Jan 01, 2023
BackgroundRemover lets you Remove Background from images and video with a simple command line interface

BackgroundRemover BackgroundRemover is a command line tool to remove background from video and image, made by nadermx to power https://BackgroundRemov

Johnathan Nader 1.7k Dec 30, 2022
Nightmare-Writeup - Writeup for the Nightmare CTF Challenge from 2022 DiceCTF

Nightmare: One Byte to ROP // Alternate Solution TLDR: One byte write, no leak.

1 Feb 17, 2022
Nb workflows - A workflow platform which allows you to run parameterized notebooks programmatically

NB Workflows Description If SQL is a lingua franca for querying data, Jupyter sh

Xavier Petit 6 Aug 18, 2022
CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator

CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator This is the official code repository for NeurIPS 2021 paper: CARMS: Categorica

Alek Dimitriev 1 Jul 09, 2022
Credo AI Lens is a comprehensive assessment framework for AI systems. Lens standardizes model and data assessment, and acts as a central gateway to assessments created in the open source community.

Lens by Credo AI - Responsible AI Assessment Framework Lens is a comprehensive assessment framework for AI systems. Lens standardizes model and data a

Credo AI 27 Dec 14, 2022
Boostcamp CV Serving For Python

Boostcamp-CV-Serving Prerequisites MySQL GCP Cloud Storage GCP key file Sentry Streamlit Cloud Secrets: .streamlit/secrets.toml #DO NOT SHARE THIS I

Jungwon Seo 19 Feb 22, 2022
A general framework for inferring CNNs efficiently. Reduce the inference latency of MobileNet-V3 by 1.3x on an iPhone XS Max without sacrificing accuracy.

GFNet-Pytorch (NeurIPS 2020) This repo contains the official code and pre-trained models for the glance and focus network (GFNet). Glance and Focus: a

Rainforest Wang 169 Oct 28, 2022
Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO)

V-MPO Simple code to demonstrate Deep Reinforcement Learning by using an on-policy adaptation of Maximum a Posteriori Policy Optimization (MPO) in Pyt

Nugroho Dewantoro 9 Jun 06, 2022