BERT model training implementation using 1024 A100 GPUs for MLPerf Training v1.1

Overview

Pre-trained checkpoint and bert config json file

  1. Location of checkpoint and bert config json file

    The MLCommons members' Google Drive location contains these files.

    • TensorFlow checkpoint (tf1_ckpt) containing the pre-trained weights.
    • Config file (bert_config.json) which specifies the hyperparameters of the model.
  2. Checkpoint conversion

python convert_tf_checkpoint.py --tf_checkpoint <path/to/checkpointdir_phase1/model.ckpt-28252.index> --bert_config_path <path/to/checkpointdir_phase1/bert_config.json> --output_checkpoint model.ckpt-28252.pt
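Converters of this kind usually walk the TF variables, rename each one to the matching PyTorch parameter, skip optimizer state, and transpose dense kernels. The sketch below shows one such renaming rule, modelled on common TF-to-PyTorch BERT converters; `tf_name_to_torch` and its rules are assumptions for illustration, not the logic of `convert_tf_checkpoint.py`.

```python
import re

# Optimizer slots and counters that a weights-only conversion can skip (assumed).
_SKIP_SCOPES = {"adam_m", "adam_v", "global_step"}

# Leaf renames modelled on common TF-to-PyTorch BERT converters (assumed).
_LEAF_RENAMES = {
    "kernel": "weight",          # dense kernel -> nn.Linear.weight (transposed)
    "gamma": "weight",           # LayerNorm scale
    "beta": "bias",              # LayerNorm shift
    "output_weights": "weight",  # classifier head weight
    "output_bias": "bias",       # classifier head bias
}


def tf_name_to_torch(tf_name):
    """Return (torch_name, needs_transpose), or None for optimizer state."""
    scopes = tf_name.split("/")
    if any(scope in _SKIP_SCOPES for scope in scopes):
        return None
    parts = []
    for scope in scopes[:-1]:
        match = re.fullmatch(r"layer_(\d+)", scope)
        if match:
            # "layer_3" -> "layer.3", i.e. an index into an nn.ModuleList
            parts.extend(["layer", match.group(1)])
        else:
            parts.append(scope)
    leaf = scopes[-1]
    if leaf.endswith("_embeddings"):
        # Embedding tables are stored directly under their own name.
        parts.extend([leaf, "weight"])
    else:
        parts.append(_LEAF_RENAMES.get(leaf, leaf))
    # TF dense kernels are [in, out]; nn.Linear.weight is [out, in].
    return ".".join(parts), leaf == "kernel"
```

For example, `bert/encoder/layer_0/attention/self/query/kernel` maps to `bert.encoder.layer.0.attention.self.query.weight` with `needs_transpose` set, so its array would be loaded transposed (for instance with `torch.from_numpy(array).t()`).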

Download and preprocess datasets

  1. Download the dataset and generate the TFRecords for the training and eval data

    Follow the BERT Wikipedia dataset preparation instructions.

  2. Convert training data and eval data from TFRecords to HDF5

    TF_INPUT_DIR=<path/to/tfrecord_input_dir> HDF5_OUTPUT_DIR=<path/to/hdf5_output_dir> ./run_trans_tfrecord_to_hdf5.sh
  3. Split the training data into 4 bins

    We split the dataset by sequence length to enable data-load balancing, which can reduce communication overhead.

    Based on the sequence-length distribution, split the HDF5 training data into 4 parts:

    part 1: 0 < sequence length <= 128

    part 2: 128 < sequence length <= 256

    part 3: 256 < sequence length <= 384

    part 4: 384 < sequence length <= 512

    The output_dir contains 4 sub-directories 128, 256, 384 and 512.

cd cleanup_scripts
python run_split_and_chop_hdf5_files.py --input_dir=<path/to/hdf5_datadir> --output_dir=<path/to/4bins_training_datadir>
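The bin boundaries above amount to a lookup on the inclusive upper edges 128, 256, 384 and 512. The sketch below is a minimal illustration, assuming each sample is padded to its bin's upper bound (a simplification); `bin_for_length`, `bin_histogram` and `padded_tokens` are hypothetical helpers, not functions from `run_split_and_chop_hdf5_files.py`. It also shows one reason binning helps: with sequence lengths 100, 200, 300 and 500, padding everything to 512 costs 2048 tokens, while padding to bin bounds costs 1280.

```python
import bisect
from collections import Counter

# Inclusive upper bounds of the 4 bins (also the sub-directory names).
BIN_BOUNDS = (128, 256, 384, 512)


def bin_for_length(seq_len):
    """Return the bin for a sequence length in (0, 512]."""
    if not 0 < seq_len <= BIN_BOUNDS[-1]:
        raise ValueError(f"sequence length out of range: {seq_len}")
    # bisect_left finds the first bound >= seq_len, matching the "<=" edges.
    return BIN_BOUNDS[bisect.bisect_left(BIN_BOUNDS, seq_len)]


def bin_histogram(lengths):
    """Count how many samples fall into each bin."""
    return Counter(bin_for_length(n) for n in lengths)


def padded_tokens(lengths, binned):
    """Total tokens after padding: to each sample's bin bound, or to 512 for all."""
    if binned:
        return sum(bin_for_length(n) for n in lengths)
    return BIN_BOUNDS[-1] * len(lengths)
```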

Prepare the environment

  • Create a virtualenv and install the required packages:
virtualenv venv -p python3.8.7
source venv/bin/activate
pip install -r requirements.txt

# Install mlperf-logging Python package
git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
git reset --hard d06404fecab73f152c6cbb89ac2c2e9b7fc24124
git submodule update --init --recursive
git apply ../patch_for_mlperf_trining_v1.1_by_samsung.patch
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--distributed_lamb" --global-option="--bnp" --global-option="--xentropy" --global-option="--fast_layer_norm" --global-option="--deprecated_fused_adam"  --global-option="--fmha"  --global-option="--fast_multihead_attn" ./

# Compile mhalib
cd ../mhalib
python setup.py build
cp build/lib*/mhalib* ../
  • Other software requirements
Software       Version
Python         3.8.7
PyTorch        1.9.1
NCCL           2.9.9
CUDA           11.3.0
cuDNN          8.2.1.32
cuBLAS         11.4.2
NVIDIA driver  470.57.02
MOFED          5.4-1.0.3
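Part of the table above can be checked from inside the virtual environment. The sketch below compares the running Python and the installed PyTorch with the table; `installed_versions` and `version_mismatches` are hypothetical helpers, ignoring a local-version suffix such as `+cu111` is an assumption, and NCCL, CUDA, cuDNN, cuBLAS, the driver and MOFED are not covered.

```python
import sys
from importlib import metadata

# Entries from the table that can be read from Python.
EXPECTED = {"python": "3.8.7", "torch": "1.9.1"}


def installed_versions(packages=("torch",)):
    """Collect the running Python version and installed pip package versions."""
    found = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def _base(version):
    # Assumption: "1.9.1+cu111" should count as 1.9.1.
    return version.split("+", 1)[0] if version else version


def version_mismatches(expected, found):
    """Return {name: (expected, found)} for every entry that differs."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if _base(found.get(name)) != want
    }
```

Running `print(version_mismatches(EXPECTED, installed_versions()))` prints `{}` when both entries match.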

Run the model

  1. Set the host addresses in run_multinode.sh
export hosts=('192.168.16.1' '192.168.16.2')
  2. Launch the training

    Use the following command to run config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh in the Python virtual environment.

PYTHON=<path/to/python> DGXSYSTEM=Samsung_Supercomputer21_DGXA100_128x8x16x1 INPUT_DIR=<path/to/4bins_training_datadir> EVAL_DIR=<path/to/eval_datadir> CHECKPOINTDIR_PHASE1=<path/to/checkpointdir_phase1> NEXP=10 ./run_multinode.sh
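The `DGXSYSTEM` name appears to encode the run shape. Assuming the convention used by NVIDIA's MLPerf BERT configs (nodes x GPUs per node x per-GPU batch size x gradient-accumulation steps), `128x8x16x1` gives 128 x 8 = 1024 GPUs, matching the title, and a global batch of 16384 sequences. The sketch below encodes that assumption plus a hypothetical per-host layout (node rank from list position, first host as master); `RunShape`, `parse_dgxsystem` and `node_settings` are illustrative names, and neither rule is taken from `run_multinode.sh`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RunShape:
    nodes: int
    gpus_per_node: int
    batch_per_gpu: int
    grad_accum_steps: int

    @property
    def world_size(self):
        return self.nodes * self.gpus_per_node

    @property
    def global_batch(self):
        return self.world_size * self.batch_per_gpu * self.grad_accum_steps


def parse_dgxsystem(name):
    """Parse the trailing NxGxBxA suffix of a DGXSYSTEM name (assumed convention)."""
    match = re.search(r"_(\d+)x(\d+)x(\d+)x(\d+)$", name)
    if match is None:
        raise ValueError(f"no NxGxBxA suffix in {name!r}")
    return RunShape(*map(int, match.groups()))


def node_settings(hosts):
    """Hypothetical per-host settings: rank by list position, first host is master."""
    return [
        {"host": host, "node_rank": rank, "nnodes": len(hosts), "master_addr": hosts[0]}
        for rank, host in enumerate(hosts)
    ]
```

Under this assumption a full 128-node run would list 128 addresses in `hosts`; the two addresses in step 1 appear to be only an example.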

Appendix

Our source code is based on MLPerf BERT v0.7. The files we added or modified are listed below.

File Name | Status | Description
config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh | Newly added | Configuration used for the 1024-GPU experiment.
run_split_and_chop_hdf5_files.py | Newly added | Generates the 4-bin training data.
mhalib/setup.py | Modified | Updated for the CUDA upgrade.
optim/__init__.py | Newly added | Entry point of the "optim" module.
optim/acclip.py | Newly added | Implements the ACClip optimizer, added for trials.
optim/madgrad.py | Newly added | Implements the MADGRAD optimizer, added for trials.
bind_launch.py | Newly added | Added for BERT training in a Python environment.
bind_pyt.py | Modified | Modified for the following items:
    (1) Log compliance;
    (2) New NUMA binding.
fmha.py | Newly added | Adds the FMHA operator (refer to MLPerf v1.0).
mlperf_logger.py | Modified | Modified for log compliance.
modeling.py | Modified | Modified to add FMHA (refer to MLPerf v1.0).
padding.py | Modified | Modified to add FMHA (refer to MLPerf v1.0).
README.md | Modified | Modified to run the Samsung-optimized implementation.
requirements.txt | Modified | Lists the required software versions.
run_multinode.sh | Newly added | Startup script for running BERT training in a Python environment.
run_pretraining.py | Modified | Modified for the following items:
    (1) Load the split training data;
    (2) Add the exchange padding feature (refer to MLPerf v1.0);
    (3) Add NCCL warmup (refer to MLPerf v1.0);
    (4) Add SAIT local/group exchange padding;
    (5) Add NCCL warmup for group exchange padding;
    (6) Add per-device local gradient clipping before all-reduce;
    (7) Add PyTorch DDP.
schedulers.py | Modified | Modified to optimize the learning rate scheduler.
utils.py | Modified | Modified for the following items:
    (1) Add the get_optimzer() interface;
    (2) Add a batch sampler (SplitRandomSampler) for the 4-bin training data.
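`SplitRandomSampler` is described only as a batch sampler for the 4-bin data. One plausible design, sketched below under that assumption, keeps every batch within a single bin so that the samples in a batch have similar lengths, then shuffles the batch order so bins interleave across steps. `bin_batches` is a hypothetical function and is not the repository's sampler.

```python
import random


def bin_batches(bin_sizes, batch_size, seed=0):
    """Return a shuffled list of (bin_name, indices); each batch uses one bin.

    bin_sizes maps a bin name (e.g. 128) to its number of samples.
    Hypothetical sketch, not the repository's SplitRandomSampler.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    rng = random.Random(seed)
    batches = []
    for name, size in bin_sizes.items():
        indices = list(range(size))
        rng.shuffle(indices)
        # Drop the incomplete tail so every batch has exactly batch_size samples.
        for start in range(0, size - batch_size + 1, batch_size):
            batches.append((name, indices[start:start + batch_size]))
    # Interleave bins across training steps.
    rng.shuffle(batches)
    return batches
```

With `bin_sizes={128: 10, 512: 5}` and `batch_size=4`, this yields two batches from bin 128 and one from bin 512; the leftover samples are dropped until the next epoch reshuffles.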
Owner
SAIT (Samsung Advanced Institute of Technology)