BERT model training implementation using 1024 A100 GPUs for MLPerf Training v1.1

Last update: Apr 27, 2022

Overview

Pre-trained checkpoint and bert config json file

Location of checkpoint and bert config json file

The MLCommons members' Google Drive location contains the following files:
- TensorFlow checkpoint (tf1_ckpt) containing the pre-trained weights.
- Config file (bert_config.json) which specifies the hyperparameters of the model.
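
Before converting the checkpoint, it can help to confirm that the config file parses and contains the hyperparameters you expect. The sketch below is illustrative only: the key names (hidden_size, num_hidden_layers, num_attention_heads) are the ones commonly found in BERT config files and are assumed here rather than confirmed by this README, and summarize_bert_config is a hypothetical helper, not part of this repository.

```python
import json

# Keys commonly present in a BERT config file (assumed, not confirmed by this README).
EXPECTED_KEYS = ("hidden_size", "num_hidden_layers", "num_attention_heads")


def summarize_bert_config(config_text):
    """Parse config JSON text and return the expected keys that are present."""
    config = json.loads(config_text)
    return {key: config[key] for key in EXPECTED_KEYS if key in config}


if __name__ == "__main__":
    # Made-up values for illustration; real values come from bert_config.json.
    sample = '{"hidden_size": 1024, "num_hidden_layers": 24, "num_attention_heads": 16}'
    print(summarize_bert_config(sample))
```

To use it on the real file, read bert_config.json into a string and pass that string to the helper.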

Checkpoint conversion

python convert_tf_checkpoint.py --tf_checkpoint <path/to/checkpointdir_phase1/model.ckpt-28252.index> --bert_config_path <path/to/checkpointdir_phase1/bert_config.json> --output_checkpoint model.ckpt-28252.pt

Download and preprocess datasets

Download dataset and generate the TFRecords for training data and eval data

Follow the steps in BERT Wikipedia dataset preparation.

Convert training data and eval data from TFRecords to HDF5

TF_INPUT_DIR=<path/to/tfrecord_input_dir> HDF5_OUTPUT_DIR=<path/to/hdf5_output_dir> ./run_trans_tfrecord_to_hdf5.sh

4-bin training data

We split the dataset to enable data-load balancing, which can reduce communication overhead.

Based on the sequence length distribution, the HDF5 training data is split into 4 parts:

part 1: 0 < sequence length <= 128

part 2: 128 < sequence length <= 256

part 3: 256 < sequence length <= 384

part 4: 384 < sequence length <= 512

The output_dir then contains 4 sub-directories, one per part: 128, 256, 384 and 512.

cd cleanup_scripts
python run_split_and_chop_hdf5_files.py --input_dir=<path/to/hdf5_datadir> --output_dir=<path/to/4bins_training_datadir>
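
The binning rule above can be sketched in a few lines of Python. This is a minimal illustration of the four length ranges only, not the logic of run_split_and_chop_hdf5_files.py; the names BIN_UPPER_BOUNDS, bin_for_length and count_per_bin are hypothetical.

```python
import bisect

# Upper bounds of the four bins; these also match the sub-directory names.
BIN_UPPER_BOUNDS = (128, 256, 384, 512)


def bin_for_length(seq_len):
    """Return the bin (sub-directory name) for a sequence length in (0, 512]."""
    if not 0 < seq_len <= BIN_UPPER_BOUNDS[-1]:
        raise ValueError(f"sequence length out of range: {seq_len}")
    # bisect_left finds the first bound >= seq_len, so length 128 stays in bin 128.
    return BIN_UPPER_BOUNDS[bisect.bisect_left(BIN_UPPER_BOUNDS, seq_len)]


def count_per_bin(seq_lens):
    """Count how many sequences fall into each of the four bins."""
    counts = {bound: 0 for bound in BIN_UPPER_BOUNDS}
    for seq_len in seq_lens:
        counts[bin_for_length(seq_len)] += 1
    return counts


if __name__ == "__main__":
    print(count_per_bin([1, 128, 129, 256, 300, 512]))
```

Counting sequences per bin in this way gives a quick view of how evenly the training data is spread across the four sub-directories.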

Prepare the environment

Create a virtualenv and install the required packages:

virtualenv venv -p python3.8.7
source venv/bin/activate
pip install -r requirements.txt

# Install mlperf-logging Python package
git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
git reset --hard d06404fecab73f152c6cbb89ac2c2e9b7fc24124
git submodule update --init --recursive
git apply ../patch_for_mlperf_trining_v1.1_by_samsung.patch
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--distributed_lamb" --global-option="--bnp" --global-option="--xentropy" --global-option="--fast_layer_norm" --global-option="--deprecated_fused_adam"  --global-option="--fmha"  --global-option="--fast_multihead_attn" ./

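# Compile mhalib (mhalib is part of this repository; return to the repository root first if you are still inside apex)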
cd mhalib
python setup.py build
cp build/lib*/mhalib* ../

Other software requirements

Software	Version
Python	3.8.7
PyTorch	1.9.1
NCCL	2.9.9
CUDA	11.3.0
cuDNN	8.2.1.32
cuBLAS	11.4.2
NVIDIA driver	470.57.02
MOFED	5.4-1.0.3
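
A version table like this one can be checked against a machine with a small comparison helper. The sketch below is a hedged illustration: parse_version and mismatches are hypothetical helpers, only purely numeric dotted versions are handled (so an entry such as 5.4-1.0.3 would need separate treatment), and only the Python version is read automatically.

```python
import sys


def parse_version(text):
    """Turn a dotted version string such as '11.3.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def mismatches(required, installed):
    """Return {name: (required, installed)} for every missing or differing version."""
    return {
        name: (want, installed.get(name))
        for name, want in required.items()
        if installed.get(name) is None
        or parse_version(installed[name]) != parse_version(want)
    }


if __name__ == "__main__":
    required = {"python": "3.8.7", "pytorch": "1.9.1", "CUDA": "11.3.0"}
    # Only the Python version is read from the running interpreter here;
    # the other entries would have to be collected by the user.
    installed = {"python": ".".join(str(n) for n in sys.version_info[:3])}
    print(mismatches(required, installed))
```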

Run the model

Set the host addresses in run_multinode.sh

export hosts=('192.168.16.1' '192.168.16.2')
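
A typo in the hosts list is easier to catch before launch than during a multi-node run. The following sketch validates each address and assigns node ranks in list order. Both points are assumptions: this README does not say how run_multinode.sh orders nodes, and the check applies only if hosts are given as IP addresses, as in the example above. node_ranks is a hypothetical helper.

```python
import ipaddress


def node_ranks(hosts):
    """Validate each address and map it to a node rank in list order."""
    ranks = {}
    for rank, host in enumerate(hosts):
        ipaddress.ip_address(host)  # raises ValueError on a malformed address
        if host in ranks:
            raise ValueError(f"duplicate host: {host}")
        ranks[host] = rank
    return ranks


if __name__ == "__main__":
    print(node_ranks(["192.168.16.1", "192.168.16.2"]))
```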

Launch the training

Use the following command to run training with the config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh configuration inside the Python virtual environment.

PYTHON=<path/to/python> DGXSYSTEM=Samsung_Supercomputer21_DGXA100_128x8x16x1 INPUT_DIR=<path/to/4bins_training_datadir> EVAL_DIR=<path/to/eval_datadir> CHECKPOINTDIR_PHASE1=<path/to/checkpointdir_phase1> NEXP=10 ./run_multinode.sh
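
The numeric suffix of the DGXSYSTEM name can be unpacked to sanity-check the job size. The sketch below assumes that the first two fields are the node count and the number of GPUs per node, which is consistent with the 1024 GPUs in the title (128 x 8); the meaning of the last two fields is not stated in this README, so they are left uninterpreted. parse_system_name is a hypothetical helper.

```python
def parse_system_name(dgxsystem):
    """Split the trailing 'AxBxCxD' suffix of a DGXSYSTEM name into integers."""
    suffix = dgxsystem.rsplit("_", 1)[-1]
    return tuple(int(field) for field in suffix.split("x"))


if __name__ == "__main__":
    name = "Samsung_Supercomputer21_DGXA100_128x8x16x1"
    # Assumption: first field = nodes, second field = GPUs per node.
    nodes, gpus_per_node, third, fourth = parse_system_name(name)
    print(nodes * gpus_per_node)  # total GPUs under that assumption
```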

Appendix

Our source code is based on MLPerf BERT v0.7. The newly added and modified files are listed below.

File Name	Status	Description
config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh	Newly added	The file contains configurations used for 1024 GPUs experiment.
run_split_and_chop_hdf5_files.py	Newly added	The file is used for generating 4-bin training data.
mhalib/setup.py	Modified	The file is modified to support the upgraded CUDA version.
optim/__init__.py	Newly added	The file is the entry point of the "optim" module.
optim/acclip.py	Newly added	The file implements ACClip optimizer for trial.
optim/madgrad.py	Newly added	The file implements MADGRAD optimizer for trial.
bind_launch.py	Newly added	The file is added for BERT training in a Python environment.
bind_pyt.py	Modified	The file is modified for the following items. (1) Log compliance; (2) Add new NUMA binding.
fmha.py	Newly added	The file is used for adding FMHA operator (refer to MLPerf v1.0).
mlperf_logger.py	Modified	The file is modified for log compliance.
modeling.py	Modified	The file is modified for adding FMHA (refer to MLPerf v1.0).
padding.py	Modified	The file is modified for adding FMHA (refer to MLPerf v1.0).
README.md	Modified	The file is modified to describe how to run the Samsung-optimized implementation.
requirements.txt	Modified	The file shows required software version.
run_multinode.sh	Newly added	The file is the startup script for running BERT training in a Python environment.
run_pretraining.py	Modified	The file is modified for the following items. (1) Load the split training data; (2) Add exchange padding feature (refer to MLPerf v1.0); (3) Add NCCL warmup (refer to MLPerf v1.0); (4) Add SAIT local/group exchange padding; (5) Add NCCL warmup for group exchange padding; (6) Add per-device local gradient clipping before all-reduce; (7) Add PyTorch DDP.
schedulers.py	Modified	The file is modified to optimize the learning rate scheduler.
utils.py	Modified	The file is modified for the following items. (1) Add get_optimzer() interface; (2) Add a batch sampler (SplitRandomSampler) for the 4-bin split training data.
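
The idea behind item (6) for run_pretraining.py, clipping each device's gradient locally before the all-reduce, can be illustrated without any GPU code. The sketch below is not the implementation in this repository: it uses plain Python lists in place of tensors, assumes clipping by L2 norm, and uses a simple average as a stand-in for the all-reduce. clip_by_norm and clipped_average are hypothetical names.

```python
import math


def clip_by_norm(grad, max_norm):
    """Scale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def clipped_average(per_device_grads, max_norm):
    """Clip each device's gradient locally, then average (a stand-in for all-reduce)."""
    clipped = [clip_by_norm(grad, max_norm) for grad in per_device_grads]
    num_devices = len(clipped)
    return [sum(values) / num_devices for values in zip(*clipped)]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms of about 5.0 and 0.5
    print(clipped_average(grads, max_norm=1.0))
```

Because the clipping happens per device, a single device with an unusually large gradient is scaled down on its own and cannot dominate the averaged result.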

Owner

SAIT (Samsung Advanced Institute of Technology)