The pure and clear PyTorch Distributed Training Framework.

Last update: Dec 20, 2022

Overview

The pure and clear PyTorch Distributed Training Framework.

Introduction
Requirements and Usage
Baselines
Zombie processes problem
Acknowledgments
Citation

Introduction

Distribuuuu is a Distributed Classification Training Framework powered by native PyTorch.

Please check tutorial for detailed Distributed Training tutorials:

Single Node Single GPU Card Training [snsc.py]
Single Node Multi-GPU Cards Training (with DataParallel) [snmc_dp.py]
Multiple Nodes Multi-GPU Cards Training (with DistributedDataParallel)
- torch.distributed.launch [mnmc_ddp_launch.py]
- torch.multiprocessing [mnmc_ddp_mp.py]
- Slurm Workload Manager [mnmc_ddp_slurm.py]
ImageNet training example [imagenet.py]

For the complete training framework, please see distribuuuu.

Requirements and Usage

Dependency

Install PyTorch>= 1.6 (has been tested on 1.6, 1.7.1, 1.8 and 1.8.1)
Install other dependencies: pip install -r requirements.txt

Dataset

Download the ImageNet dataset and move validation images to labeled subfolders, using the script valprep.sh.

Expected datasets structure for ILSVRC

ILSVRC
|_ train
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ val
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ ...

Create a directory containing symlinks:

mkdir -p /path/to/distribuuuu/data

Symlink ILSVRC:

ln -s /path/to/ILSVRC /path/to/distribuuuu/data/ILSVRC

Basic Usage

Single Node with one task

# 1 node, 8 GPUs
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Distribuuuu use yacs, a elegant and lightweight package to define and manage system configurations. You can setup config via a yaml file, and overwrite by other opts. If the yaml is not provided, the default configuration file will be used, please check distribuuuu/config.py.

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml \
    OUT_DIR /tmp \
    MODEL.SYNCBN True \
    TRAIN.BATCH_SIZE 256

# --cfg config/resnet18.yaml parse config from file
# OUT_DIR /tmp            overwrite OUT_DIR
# MODEL.SYNCBN True       overwrite MODEL.SYNCBN
# TRAIN.BATCH_SIZE 256    overwrite TRAIN.BATCH_SIZE

Single Node with two tasks

# 1 node, 2 task, 4 GPUs per task (8GPUs)
# task 1:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# task 2:
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Multiple Nodes Training

# 2 node, 8 GPUs per node (16GPUs)
# node 1:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# node 2:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Slurm Cluster Usage

# see srun --help 
# and https://slurm.schedmd.com/ for details

# example: 64 GPUs
# batch size = 64 * 128 = 8192
# itertaion = 128k / 8192 = 156 
# lr = 64 * 0.1 = 6.4

srun --partition=openai-a100 \
     -n 64 \
     --gres=gpu:8 \
     --ntasks-per-node=8 \
     --job-name=Distribuuuu \
     python -u train_net.py --cfg config/resnet18.yaml \
     TRAIN.BATCH_SIZE 128 \
     OUT_DIR ./resnet18_8192bs \
     OPTIM.BASE_LR 6.4

Baselines

Baseline models trained by Distribuuuu:

We use SGD with momentum of 0.9, a half-period cosine schedule, and train for 100 epochs.
We use a reference learning rate of 0.1 and a weight decay of 5e-5 (1e-5 For EfficientNet).
The actual learning rate(Base LR) for each model is computed as (batch-size / 128) * reference-lr.
Only standard data augmentation techniques(RandomResizedCrop and RandomHorizontalFlip) are used.

PS: use other robust tricks(more epochs, efficient data augmentation, etc.) to get better performance.

Arch	Params(M)	Total batch	Base LR	[email protected]	[email protected]	model / config
resnet18	11.690	256 (32*8GPUs)	0.2	70.902	89.894	Drive / cfg
resnet18	11.690	1024 (128*8GPUs)	0.8	70.994	89.892
resnet18	11.690	8192 (128*64GPUs)	6.4	70.165	89.374
resnet18	11.690	16384 (256*64GPUs)	12.8	68.766	88.381
efficientnet_b0	5.289	512 (64*8GPUs)	0.4	74.540	91.744	Drive / cfg
resnet50	25.557	256 (32*8GPUs)	0.2	77.252	93.430	Drive / cfg
botnet50	20.859	256 (32*8GPUs)	0.2	77.604	93.682	Drive / cfg
regnetx_160	54.279	512 (64*8GPUs)	0.4	79.992	95.118	Drive / cfg
regnety_160	83.590	512 (64*8GPUs)	0.4	80.598	95.090	Drive / cfg
regnety_320	145.047	512 (64*8GPUs)	0.4	80.824	95.276	Drive / cfg

Zombie processes problem

Before PyTorch1.8, torch.distributed.launch will leave some zombie processes after using Ctrl + C, try to use the following cmd to kill the zombie processes. (fairseq/issues/487):

kill $(ps aux | grep YOUR_SCRIPT.py | grep -v grep | awk '{print $2}')

PyTorch >= 1.8 is suggested, which fixed the issue about zombie process. (pytorch/pull/49305)

Acknowledgments

Provided codes were adapted from:

I strongly recommend you to choose pycls, a brilliant image classification codebase and adopted by a number of projects at Facebook AI Research.

Citation

@misc{bigballon2021distribuuuu,
  author = {Wei Li},
  title = {Distribuuuu: The pure and clear PyTorch Distributed Training Framework},
  howpublished = {\url{https://github.com/BIGBALLON/distribuuuu}},
  year = {2021}
}

Feel free to contact me if you have any suggestions or questions, issues are welcome, create a PR if you find any bugs or you want to contribute. 🍰

The pure and clear PyTorch Distributed Training Framework.

Related tags

Overview

Introduction

Requirements and Usage

Dependency

Dataset

Basic Usage

Slurm Cluster Usage

Baselines

Zombie processes problem

Acknowledgments

Citation

Owner

WILL LEE

Mouse Brain in the Model Zoo

A Temporal Extension Library for PyTorch Geometric

A Repository of Community-Driven Natural Instructions

A stable algorithm for GAN training

Yolov3 pytorch implementation

VLG-Net: Video-Language Graph Matching Networks for Video Grounding

EZ graph is an easy to use AI solution that allows you to make and train your neural networks without a single line of code.

Self-Supervised Image Denoising via Iterative Data Refinement

Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on PaddlePaddle

Real-Time and Accurate Full-Body Multi-Person Pose Estimation&Tracking System

Reimplementation of Dynamic Multi-scale filters for Semantic Segmentation.

AI-Fitness-Tracker - AI Fitness Tracker With Python

Serverless proxy for Spark cluster

Image super-resolution through deep learning

Official Code for "Constrained Mean Shift Using Distant Yet Related Neighbors for Representation Learning"

A cross-document event and entity coreference resolution system, trained and evaluated on the ECB+ corpus.

Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks

This Jupyter notebook shows one way to implement a simple first-order low-pass filter on sampled data in discrete time.

Unofficial Implementation of RobustSTL: A Robust Seasonal-Trend Decomposition Algorithm for Long Time Series (AAAI 2019)

Does MAML Only Work via Feature Re-use? A Data Set Centric Perspective