BERT model training implementation using 1024 A100 GPUs for MLPerf Training v1.1

Overview

Pre-trained checkpoint and bert config json file

  1. Location of checkpoint and bert config json file

    The MLCommons members Google Drive location contains the following files:

    • TensorFlow checkpoint (tf1_ckpt) containing the pre-trained weights.
    • Config file (bert_config.json) which specifies the hyperparameters of the model.
  2. Checkpoint conversion

python convert_tf_checkpoint.py --tf_checkpoint <path/to/checkpointdir_phase1/model.ckpt-28252.index> --bert_config_path <path/to/checkpointdir_phase1/bert_config.json> --output_checkpoint model.ckpt-28252.pt

Download and preprocess datasets

  1. Download dataset and generate the TFRecords for training data and eval data

    Follow the instructions in "BERT Wikipedia dataset preparation".

  2. Convert training data and eval data from TFRecords to HDF5

    TF_INPUT_DIR=<path/to/tfrecord_input_dir> HDF5_OUTPUT_DIR=<path/to/hdf5_output_dir> ./run_trans_tfrecord_to_hdf5.sh
  3. Split the training data into 4 bins

    We split the dataset by sequence length to balance the data load across devices, which can also reduce communication overhead.

    Based on the sequence length distribution, the HDF5 training data is split into 4 parts:

    part 1: 0 < sequence length <= 128

    part 2: 128 < sequence length <= 256

    part 3: 256 < sequence length <= 384

    part 4: 384 < sequence length <= 512

    The output_dir contains 4 sub-directories 128, 256, 384 and 512.

cd cleanup_scripts
python run_split_and_chop_hdf5_files.py --input_dir=<path/to/hdf5_datadir> --output_dir=<path/to/4bins_training_datadir>
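
The binning rule listed above can be sketched in a few lines of Python. This is only an illustration of the length-to-bin mapping, not the code in run_split_and_chop_hdf5_files.py; the names `BIN_UPPER_BOUNDS`, `bin_for_length` and `count_per_bin` are hypothetical.

```python
import bisect

# Upper bounds of the four bins, taken from the list above.
# They are also the names of the four output sub-directories.
BIN_UPPER_BOUNDS = (128, 256, 384, 512)


def bin_for_length(seq_len):
    """Return the upper bound of the bin that a sequence length falls into."""
    if not 0 < seq_len <= BIN_UPPER_BOUNDS[-1]:
        raise ValueError(f"sequence length {seq_len} is outside (0, 512]")
    # bisect_left returns the index of the first bound >= seq_len,
    # which matches the "lower < length <= upper" rule.
    return BIN_UPPER_BOUNDS[bisect.bisect_left(BIN_UPPER_BOUNDS, seq_len)]


def count_per_bin(lengths):
    """Count how many sequences land in each bin (hypothetical helper)."""
    counts = {bound: 0 for bound in BIN_UPPER_BOUNDS}
    for seq_len in lengths:
        counts[bin_for_length(seq_len)] += 1
    return counts
```

For example, `bin_for_length(128)` maps to the 128 bin while `bin_for_length(129)` maps to the 256 bin, because each bin includes its upper bound and excludes its lower bound.
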

Prepare the environment

  • Create a virtualenv and install the required packages:
virtualenv venv -p python3.8.7
source venv/bin/activate
pip install -r requirements.txt

# Install mlperf-logging Python package
git clone https://github.com/mlperf/logging.git mlperf-logging
pip install -e mlperf-logging

# Install apex
git clone https://github.com/NVIDIA/apex.git
cd apex
git reset --hard d06404fecab73f152c6cbb89ac2c2e9b7fc24124
git submodule update --init --recursive
git apply ../patch_for_mlperf_trining_v1.1_by_samsung.patch
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--distributed_lamb" --global-option="--bnp" --global-option="--xentropy" --global-option="--fast_layer_norm" --global-option="--deprecated_fused_adam"  --global-option="--fmha"  --global-option="--fast_multihead_attn" ./

# Compile mhalib (it sits next to apex in the repository root, so leave the apex directory first)
cd ../mhalib
python setup.py build
cp build/lib*/mhalib* ../
  • Other software requirements
| Software | Version |
|---|---|
| python | 3.8.7 |
| pytorch | 1.9.1 |
| NCCL | 2.9.9 |
| CUDA | 11.3.0 |
| cudnn | 8.2.1.32 |
| cublas | 11.4.2 |
| nvidia driver | 470.57.02 |
| mofed version | 5.4-1.0.3 |

Run the model

  1. Set the host addresses in run_multinode.sh
export hosts=('192.168.16.1' '192.168.16.2')
  2. Launch the training

    Use the following command to run config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh in the Python virtual environment.

PYTHON=<path/to/python> DGXSYSTEM=Samsung_Supercomputer21_DGXA100_128x8x16x1 INPUT_DIR=<path/to/4bins_training_datadir> EVAL_DIR=<path/to/eval_datadir> CHECKPOINTDIR_PHASE1=<path/to/checkpointdir_phase1> NEXP=10 ./run_multinode.sh
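
As a quick sanity check, the numeric suffix of the DGXSYSTEM name can be decoded to confirm the GPU count. The sketch below assumes the suffix follows the pattern nodes x GPUs-per-node x batch-size x gradient-accumulation-steps; only the first two fields are confirmed by the 1024-GPU figure in this README, the meaning of the last two is an assumption. The function name `parse_dgx_config` is hypothetical.

```python
def parse_dgx_config(name):
    """Decode the trailing 'AxBxCxD' part of a DGXSYSTEM name.

    Assumed field order (not confirmed by this README for the last two):
    nodes x GPUs per node x batch size x gradient accumulation steps.
    """
    suffix = name.rsplit("_", 1)[-1]  # e.g. "128x8x16x1"
    nodes, gpus_per_node, batch_size, accum_steps = (int(x) for x in suffix.split("x"))
    return {
        "nodes": nodes,
        "gpus_per_node": gpus_per_node,
        "batch_size": batch_size,    # assumed meaning
        "accum_steps": accum_steps,  # assumed meaning
        "total_gpus": nodes * gpus_per_node,
    }


config = parse_dgx_config("Samsung_Supercomputer21_DGXA100_128x8x16x1")
print(config["total_gpus"])  # 128 nodes * 8 GPUs per node = 1024
```

With 128 nodes, the hosts array in run_multinode.sh would need one entry per node rather than the two example addresses shown above.
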

Appendix

Our source code is based on MLPerf BERT v0.7. The newly added and modified files are listed below.

| File Name | Status | Description |
|---|---|---|
| config_Samsung_Supercomputer21_DGXA100_128x8x16x1.sh | Newly added | Configuration used for the 1024-GPU experiment. |
| run_split_and_chop_hdf5_files.py | Newly added | Generates the 4-bin training data. |
| mhalib/setup.py | Modified | Updated for the CUDA upgrade. |
| optim/__init__.py | Newly added | Entry point of the "optim" module. |
| optim/acclip.py | Newly added | Implements the ACClip optimizer (experimental). |
| optim/madgrad.py | Newly added | Implements the MADGRAD optimizer (experimental). |
| bind_launch.py | Newly added | Added for running BERT training in a Python environment. |
| bind_pyt.py | Modified | (1) Log compliance; (2) Add new NUMA binding. |
| fmha.py | Newly added | Adds the FMHA operator (refer to MLPerf v1.0). |
| mlperf_logger.py | Modified | Modified for log compliance. |
| modeling.py | Modified | Modified to add FMHA (refer to MLPerf v1.0). |
| padding.py | Modified | Modified to add FMHA (refer to MLPerf v1.0). |
| README.md | Modified | Describes how to run the Samsung optimized implementation. |
| requirements.txt | Modified | Lists the required software versions. |
| run_multinode.sh | Newly added | Startup script for running BERT training in a Python environment. |
| run_pretraining.py | Modified | (1) Load the split training data; (2) Add exchange padding feature (refer to MLPerf v1.0); (3) Add NCCL warmup (refer to MLPerf v1.0); (4) Add SAIT local/group exchange padding; (5) Add NCCL warmup for group exchange padding; (6) Add per-device local gradient clipping before all-reduce; (7) Add PyTorch DDP. |
| schedulers.py | Modified | Modified to optimize the learning rate scheduler. |
| utils.py | Modified | (1) Add get_optimzer() interface; (2) Add a batch sampler (SplitRandomSampler) for the 4-bin split training data. |
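
To illustrate item (6) of run_pretraining.py, the sketch below simulates per-device gradient clipping followed by an averaging all-reduce in plain Python. It is not the code from run_pretraining.py: it assumes clipping by L2 norm and a mean reduction, and the names `clip_by_norm`, `all_reduce_mean` and `clipped_all_reduce` are hypothetical.

```python
import math


def clip_by_norm(grad, max_norm):
    """Scale one device's gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def all_reduce_mean(per_device_grads):
    """Stand-in for an averaging all-reduce: element-wise mean across devices."""
    num_devices = len(per_device_grads)
    return [sum(values) / num_devices for values in zip(*per_device_grads)]


def clipped_all_reduce(per_device_grads, max_norm):
    """Clip each device's local gradient first, then average across devices."""
    clipped = [clip_by_norm(grad, max_norm) for grad in per_device_grads]
    return all_reduce_mean(clipped)
```

Clipping locally means each device needs only its own gradient to compute the norm, so no extra communication is required before the all-reduce. A device with an unusually large gradient is also bounded before it can dominate the average.
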
Owner
SAIT (Samsung Advanced Institute of Technology)