WAGMA-SGD is a decentralized asynchronous SGD for distributed deep learning training based on model averaging.

Last update: Jun 18, 2022

WAGMA-SGD

WAGMA-SGD is a decentralized asynchronous SGD algorithm for distributed deep learning training, based on model averaging. Its key idea is a novel wait-avoiding group allreduce that averages the models among processes. Synchronization is relaxed by making the collectives externally triggerable: a collective can be initiated without requiring all processes to enter it. This lets WAGMA-SGD better handle training workloads with load imbalance. Because WAGMA-SGD only reduces data within non-overlapping groups of processes, it significantly improves parallel scalability. WAGMA-SGD may introduce staleness in the weights, but the staleness is bounded. Because WAGMA-SGD averages models rather than gradients, each periodic synchronization guarantees a consistent model view among processes.
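To make the group averaging idea concrete, here is a minimal single-process sketch in plain Python. It is not the fflib3 implementation: it only simulates averaging within consecutive, non-overlapping groups, followed by a global average that stands in for the periodic synchronization. The function names (group_average, global_average) and the toy weights are assumptions made for illustration, and the wait-avoiding (externally triggered) behavior is not modeled.

```python
from typing import List

Model = List[float]  # one process's model, flattened to a list of weights


def group_average(models: List[Model], group_size: int) -> List[Model]:
    """Average weights within consecutive, non-overlapping groups of processes.

    Hypothetical helper for illustration; the real operation is a
    communication collective, not a local loop.
    """
    if group_size <= 0 or len(models) % group_size != 0:
        raise ValueError("number of processes must be a multiple of group_size")
    averaged: List[Model] = []
    for start in range(0, len(models), group_size):
        group = models[start:start + group_size]
        # Element-wise mean over the models in this group.
        mean = [sum(ws) / group_size for ws in zip(*group)]
        # Every member of the group ends up with the same averaged model.
        averaged.extend(list(mean) for _ in group)
    return averaged


def global_average(models: List[Model]) -> List[Model]:
    """Stand-in for the periodic synchronization: one group spanning everyone."""
    return group_average(models, len(models))


# Four simulated processes, each holding a two-weight model.
models = [[0.0, 4.0], [2.0, 0.0], [4.0, 8.0], [6.0, 4.0]]
after_group = group_average(models, group_size=2)
after_global = global_average(after_group)
```

After the group step, processes 0 and 1 share one model and processes 2 and 3 share another; only the global step gives all four processes the same model, which mirrors the consistent model view described above.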

Demo

The wait-avoiding group allreduce operation is implemented in ./WAGMA-SGD-modules/fflib3/. To use it, configure and compile fflib3 into an .so library by running cmake .. and make in the directory ./WAGMA-SGD-modules/fflib3/lib/. A script to run WAGMA-SGD on ResNet-50/ImageNet with the SLURM job scheduler can be found here. More generally, to evaluate other neural network models with the customized optimizers (e.g., wait-avoiding group allreduce), wrap the default optimizer in the customized optimizer. See the example for ResNet-50 here.
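The wrapping pattern can be sketched as follows. This is a framework-free illustration, not the repository's optimizer API: PlainSGD, AveragingOptimizer, and the average_fn callback are hypothetical names, and the callback stands in for whatever communication operation (such as the wait-avoiding group allreduce) the real customized optimizer would call.

```python
from typing import Callable, List

Weights = List[float]


class PlainSGD:
    """Hypothetical stand-in for a framework's default optimizer."""

    def __init__(self, lr: float) -> None:
        self.lr = lr

    def step(self, weights: Weights, grads: Weights) -> Weights:
        # Standard SGD update: w <- w - lr * g
        return [w - self.lr * g for w, g in zip(weights, grads)]


class AveragingOptimizer:
    """Hypothetical wrapper: local update first, then model averaging."""

    def __init__(self, base: PlainSGD,
                 average_fn: Callable[[Weights], Weights]) -> None:
        self.base = base
        # average_fn is an assumption: in a real setup it would invoke a
        # communication operation that averages models across processes.
        self.average_fn = average_fn

    def step(self, weights: Weights, grads: Weights) -> Weights:
        updated = self.base.step(weights, grads)
        return self.average_fn(updated)


# Simulate averaging with a single fixed peer instead of real communication.
peer_weights = [1.5, 3.0]


def average_with_peer(local: Weights) -> Weights:
    return [(a + b) / 2 for a, b in zip(local, peer_weights)]


optimizer = AveragingOptimizer(PlainSGD(lr=0.5), average_with_peer)
new_weights = optimizer.step([1.0, 2.0], [1.0, 2.0])
```

Keeping the averaging behind a callback means the training loop only ever calls step, so swapping one communication scheme for another does not touch the model code.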

For deep learning tasks implemented in TensorFlow, we implemented custom C++ operators that call the wait-avoiding group allreduce operation or other communication operations (depending on the specific parallel SGD algorithm) to average the models. We then register the C++ operators with TensorFlow, where they can be used to build the computational graph that implements the SGD algorithms. Similarly, for deep learning tasks implemented in PyTorch, one can use pybind11 to call the C++ operators from Python.

Publication

The WAGMA-SGD work is published in TPDS'21. See the paper for details. To cite our work:

@ARTICLE{9271898,
  author={Li, Shigang and Ben-Nun, Tal and Nadiradze, Giorgi and Girolamo, Salvatore Di and Dryden, Nikoli and Alistarh, Dan and Hoefler, Torsten},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  title={Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging},
  year={2021},
  volume={32},
  number={7},
  pages={1725-1739},
  doi={10.1109/TPDS.2020.3040606}}

License

See LICENSE.

Owner

Shigang Li
