BERT model training implementation using 1024 A100 GPUs for MLPerf Training v1.1

Overview

Pre-trained checkpoint and bert config json file

  1. Location of checkpoint and bert config json file

    The MLCommons members' Google Drive location contains the following files:

    • TensorFlow checkpoint (tf1_ckpt) containing the pre-trained weights.
    • Config file (bert_config.json) which specifies the hyperparameters of the model.
  2. Checkpoint conversion

python convert_tf_checkpoint.py --tf_checkpoint <path/to/checkpointdir_phase1/model.ckpt-28252.index> --bert_config_path <path/to/checkpointdir_phase1/bert_config.json> --output_checkpoint model.ckpt-28252.pt

Download and preprocess datasets

  1. Download dataset and generate the TFRecords for training data and eval data

    BERT Wikipedia dataset preparation

  2. Convert training data and eval data from TFRecords to HDF5

    TF_INPUT_DIR=<path/to/tfrecord_input_dir> HDF5_OUTPUT_DIR=<path/to/hdf5_output_dir> ./run_trans_tfrecord_to_hdf5.sh
  3. Split the training data into 4 bins

    We split the dataset to enable data-load balancing, which can also reduce communication overhead.

    Based on the sequence length distribution, split the HDF5 training data into 4 parts:

    part 1: 0 < sequence length <= 128

    part 2: 128 < sequence length <= 256

    part 3: 256 < sequence length <= 384

    part 4: 384 < sequence length <= 512

    After the split, output_dir contains 4 sub-directories: 128, 256, 384 and 512.

cd cleanup_scripts
python run_split_and_chop_hdf5_files.py --input_dir=<path/to/hdf5_datadir> --output_dir=<path/to/4bins_training_datadir>
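
The four length ranges above amount to a simple bucketing rule. The following is a minimal sketch of that rule only, not the code in run_split_and_chop_hdf5_files.py; the names `BIN_UPPER_BOUNDS`, `bin_for_length` and `split_into_bins` are hypothetical and chosen for illustration.

```python
from bisect import bisect_left

# Upper bounds of the four bins, matching the sub-directory names 128, 256, 384 and 512.
BIN_UPPER_BOUNDS = [128, 256, 384, 512]


def bin_for_length(seq_len):
    """Return the upper bound of the bin that a sequence of `seq_len` tokens falls into."""
    if not 0 < seq_len <= BIN_UPPER_BOUNDS[-1]:
        raise ValueError(f"sequence length {seq_len} is outside (0, 512]")
    # bisect_left finds the first bound >= seq_len, so 128 maps to bin 128 and 129 to bin 256.
    return BIN_UPPER_BOUNDS[bisect_left(BIN_UPPER_BOUNDS, seq_len)]


def split_into_bins(seq_lengths):
    """Group sample indices by bin; `seq_lengths[i]` is the length of sample i."""
    bins = {bound: [] for bound in BIN_UPPER_BOUNDS}
    for index, seq_len in enumerate(seq_lengths):
        bins[bin_for_length(seq_len)].append(index)
    return bins
```

For example, `split_into_bins([1, 128, 129, 512])` places samples 0 and 1 in bin 128, sample 2 in bin 256, and sample 3 in bin 512.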

Prepare the environment

  • Create a virtualenv and install the required packages:
virtualenv venv -p python3.8.7
source venv/bin/activate
pip install -r requirements.txt

# Install mlperf-logging Python package
git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
git reset --hard d06404fecab73f152c6cbb89ac2c2e9b7fc24124
git submodule update --init --recursive
git apply ../patch_for_mlperf_trining_v1.1_by_samsung.patch
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--distributed_lamb" --global-option="--bnp" --global-option="--xentropy" --global-option="--fast_layer_norm" --global-option="--deprecated_fused_adam"  --global-option="--fmha"  --global-option="--fast_multihead_attn" ./

# Compile mhalib
cd mhalib
python setup.py build
cp build/lib*/mhalib* ../
  • Other software requirements

| Software      | Version   |
| ------------- | --------- |
| Python        | 3.8.7     |
| PyTorch       | 1.9.1     |
| NCCL          | 2.9.9     |
| CUDA          | 11.3.0    |
| cuDNN         | 8.2.1.32  |
| cuBLAS        | 11.4.2    |
| NVIDIA driver | 470.57.02 |
| MOFED         | 5.4-1.0.3 |
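
Before launching a long run it can help to confirm that the active environment matches the versions listed above. The script below is a rough sketch that checks only the Python and PyTorch entries. The helper names `parse_version` and `find_mismatches` are hypothetical and are not part of this repository.

```python
import sys

# Versions taken from the software requirements listed above (Python and PyTorch only).
REQUIRED = {"python": "3.8.7", "pytorch": "1.9.1"}


def parse_version(text):
    """Turn '1.9.1' or '1.9.1+cu111' into a comparable tuple of ints such as (1, 9, 1)."""
    core = str(text).split("+")[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def find_mismatches(found, required=REQUIRED):
    """Return {name: (found, required)} for every component that is missing or differs."""
    mismatches = {}
    for name, wanted in required.items():
        have = found.get(name)
        if have is None or parse_version(have) != parse_version(wanted):
            mismatches[name] = (have, wanted)
    return mismatches


if __name__ == "__main__":
    found = {"python": ".".join(str(n) for n in sys.version_info[:3])}
    try:
        import torch  # optional: only reported when PyTorch is installed

        found["pytorch"] = str(torch.__version__)
    except ImportError:
        pass
    for name, (have, wanted) in find_mismatches(found).items():
        print(f"{name}: found {have}, expected {wanted}")
```

The remaining entries (NCCL, CUDA, cuDNN, cuBLAS, driver, MOFED) are system-level components and are left out of this sketch.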

Run the model

  1. Set the host addresses in run_multinode.sh
export hosts=('192.168.16.1' '192.168.16.2')
  2. Launch the training

    Use the following command to run config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh in the Python virtual environment.

PYTHON=<path/to/python> DGXSYSTEM=Samsung_Supercomputer21_DGXA100_128x8x16x1 INPUT_DIR=<path/to/4bins_training_datadir> EVAL_DIR=<path/to/eval_datadir> CHECKPOINTDIR_PHASE1=<path/to/checkpointdir_phase1> NEXP=10 ./run_multinode.sh
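
The suffix 128x8x16x1 in the DGXSYSTEM name encodes the shape of the run. The sketch below assumes the four fields mean nodes, GPUs per node, per-GPU batch size and gradient accumulation steps. The first two are consistent with the 1024 GPUs in the title (128 × 8), while the meaning of the last two is an assumption and should be checked against the config file. `RunShape` and `parse_dgxsystem` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class RunShape(NamedTuple):
    nodes: int
    gpus_per_node: int
    batch_per_gpu: int     # assumed meaning of the third field
    grad_accum_steps: int  # assumed meaning of the fourth field

    @property
    def world_size(self):
        """Total number of GPUs taking part in the run."""
        return self.nodes * self.gpus_per_node

    @property
    def global_batch_size(self):
        """Samples per optimizer step, under the assumed field meanings."""
        return self.world_size * self.batch_per_gpu * self.grad_accum_steps


def parse_dgxsystem(name):
    """Parse the trailing 'AxBxCxD' field of a DGXSYSTEM name into a RunShape."""
    shape = name.rsplit("_", 1)[-1]
    nodes, gpus, batch, accum = (int(field) for field in shape.split("x"))
    return RunShape(nodes, gpus, batch, accum)


shape = parse_dgxsystem("Samsung_Supercomputer21_DGXA100_128x8x16x1")
print(shape.world_size)         # 1024
print(shape.global_batch_size)  # 16384 under the assumed field meanings
```

If each entry of the hosts array in run_multinode.sh corresponds to one node, a full-scale run would presumably list 128 addresses; that, too, is an inference from the name rather than something the script documents here.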

Appendix

Our source code is based on MLPerf BERT v0.7. The newly added and modified files are listed below.

| File Name | Status | Description |
| --- | --- | --- |
| config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh | Newly added | Contains the configuration used for the 1024-GPU experiment. |
| run_split_and_chop_hdf5_files.py | Newly added | Generates the 4-bin training data. |
| mhalib/setup.py | Modified | Modified because CUDA was upgraded. |
| optim/__init__.py | Newly added | Entry point of the "optim" module. |
| optim/acclip.py | Newly added | Implements the ACClip optimizer on a trial basis. |
| optim/madgrad.py | Newly added | Implements the MADGRAD optimizer on a trial basis. |
| bind_launch.py | Newly added | Added for BERT training in a Python environment. |
| bind_pyt.py | Modified | (1) Log compliance; (2) add new NUMA binding. |
| fmha.py | Newly added | Adds the FMHA operator (refer to MLPerf v1.0). |
| mlperf_logger.py | Modified | Modified for log compliance. |
| modeling.py | Modified | Modified to add FMHA (refer to MLPerf v1.0). |
| padding.py | Modified | Modified to add FMHA (refer to MLPerf v1.0). |
| README.md | Modified | Modified to run the Samsung optimized implementation. |
| requirements.txt | Modified | Lists the required software versions. |
| run_multinode.sh | Newly added | Startup script for running BERT training in a Python environment. |
| run_pretraining.py | Modified | (1) Load the split training data; (2) add the exchange padding feature (refer to MLPerf v1.0); (3) add NCCL warmup (refer to MLPerf v1.0); (4) add SAIT local/group exchange padding; (5) add NCCL warmup for group exchange padding; (6) add per-device local gradient clipping before all-reduce; (7) add PyTorch DDP. |
| schedulers.py | Modified | Modified to optimize the learning rate scheduler. |
| utils.py | Modified | (1) Add the get_optimzer() interface; (2) add a batch sampler (SplitRandomSampler) for the 4-bin split training data. |
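
To illustrate what a batch sampler over 4-bin data might look like, here is a minimal sketch in which every batch is drawn from a single bin, so the sequences in one batch have similar lengths. This is not the SplitRandomSampler from utils.py. The class name `BinnedBatchSampler`, its arguments, and the choice to drop incomplete batches are all assumptions made for the example.

```python
import random


class BinnedBatchSampler:
    """Yield batches of sample indices where every batch comes from a single bin."""

    def __init__(self, bins, batch_size, seed=0):
        # bins: {bin_name: [sample indices]}, e.g. one entry per 128/256/384/512 bin.
        self.bins = bins
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self):
        batches = []
        for indices in self.bins.values():
            shuffled = list(indices)
            self.rng.shuffle(shuffled)
            # Cut each bin into fixed-size batches, dropping the incomplete tail.
            for start in range(0, len(shuffled) - self.batch_size + 1, self.batch_size):
                batches.append(shuffled[start:start + self.batch_size])
        # Shuffle the batch order so that bins are interleaved within an epoch.
        self.rng.shuffle(batches)
        return iter(batches)

    def __len__(self):
        return sum(len(indices) // self.batch_size for indices in self.bins.values())
```

An object with this shape (an iterable of index lists with a length) is what PyTorch's `DataLoader` accepts through its `batch_sampler` argument, although the sketch itself has no PyTorch dependency.
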
Owner
SAIT (Samsung Advanced Institute of Technology)