Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Last update: Dec 27, 2022

Related tags

Overview

Sky Computing

Introduction

Sky Computing is a load-balanced framework for federated learning model parallelism. It adaptively allocate model layers to devices based on the their hardware sepcification. Sky Computing outperforms the baseline method by 55% in training time when training 160-layer BERT in a 64-node cluster. Our paper can be found at https://arxiv.org/abs/2202.11836

The concept sky computing was first introduced by Dr. Katarzyna Keahey et al. They used this word to describe a cross-cloud compute pattern. And later Prof. Stoica and Prof. Shenker generalized this word to geo-distributed computing. Our project is based on their definition. [1] [2]

Installation

git clone [email protected]:hpcaitech/SkyComputing.git
python -m pip install -r requirements.txt
cd ./scaelum
python -m pip install -v -e .

Experiment (using BERT)

To benchmark the Sky Computing, we prepared a single demo which you can run on your cluster to train BERT.

Prepare BERT model

Bidirectional Encoder Representations from Transformers (aka BERT) is one of the state-of-the-art deep learning models for Natural Language Processing. In the experiment part, we use BERT to run a simple benchmark.

cd $PROJECT
mkdir -p BERT/model && cd BERT/model 
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

Prepare GLUE MNLI dataset

The General Language Understanding Evaluation (aka GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. And the Multi-Genre Natural Language Inference (aka MNLI) is one of the tasks in GLUE, it is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

cd $PROJECT
mkdir -p BERT/data && cd BERT/data
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/1502038877f6a88c225a34450793fbc3ea87eaba/download_glue_data.py
python download_glue_data.py --data_dir ./glue_data --tasks MNLI

Configuration

To run dllb in your cluster, you need to write a config file which contains the necessary information about training, e.g. model layers, useful environment variables. We have provided a well-commentted example, and here are some most important option:

# your project path
PROJECT = os.getenv("PROJECT")

# allocation type, valid values are even, optimal and dynamic
ALLOCATE_TYPE = "even"

# num of node (including the central server)
CORE_NUM = 4

Run scripts

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We used slurm script to run our experiment.

#!/bin/sh

#SBATCH --job-name=gpu16   # Job name
#SBATCH -o gpu16.o%j       # Name of stdout output file
#SBATCH -e gpu16.e%j       # Name of stderr error file
#SBATCH -N 16              # Node numbers
#SBATCH -n 16              # GPU numbers
#SBATCH --time=02:00:00    # Run time (hh:mm:ss)

# run
python ./ip_addr.py > "./HOST"
srun python ./launch.py -c "./experiment/config.py"

Citation

@misc{zhu2022sky,
      title={Sky Computing: Accelerating Geo-distributed Computing in Federated Learning}, 
      author={Jie Zhu and Shenggui Li and Yang You},
      year={2022},
      eprint={2202.11836},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Reference

@article{keahey2009sky,
  title={Sky computing},
  author={Keahey, Katarzyna and Tsugawa, Mauricio and Matsunaga, Andrea and Fortes, Jose},
  journal={IEEE Internet Computing},
  volume={13},
  number={5},
  pages={43--51},
  year={2009},
  publisher={IEEE}
}
@inproceedings{stoica2021cloud,
  title={From cloud computing to sky computing},
  author={Stoica, Ion and Shenker, Scott},
  booktitle={Proceedings of the Workshop on Hot Topics in Operating Systems},
  pages={26--32},
  year={2021}
}

Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Related tags

Overview

Sky Computing

Introduction

Installation

Experiment (using BERT)

Prepare BERT model

Prepare GLUE MNLI dataset

Configuration

Run scripts

Citation

Reference

Owner

HPC-AI Tech

Official implementation of ETH-XGaze dataset baseline

Codebase for Time-series Generative Adversarial Networks (TimeGAN)

Spatial Temporal Graph Convolutional Networks (ST-GCN) for Skeleton-Based Action Recognition in PyTorch

Qimera: Data-free Quantization with Synthetic Boundary Supporting Samples

Deep Learning Theory

Official PyTorch code for Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021)

Source code for CAST - Crisis Domain Adaptation Using Sequence-to-sequence Transformers (Accepted to ISCRAM 2021, CorePaper).

A set of tools for Namebase and HNS

This repo contains the implementation of the algorithm proposed in Off-Belief Learning, ICML 2021.

Unsupervised Representation Learning by Invariance Propagation

A high-level Python library for Quantum Natural Language Processing

Translation-equivariant Image Quantizer for Bi-directional Image-Text Generation

Checkout some cool self-projects you can try your hands on to curb your boredom this December!

Using image super resolution models with vapoursynth and speeding them up with TensorRT

PyTorch Implementation of ECCV 2020 Spotlight TuiGAN: Learning Versatile Image-to-Image Translation with Two Unpaired Images

Classifying audio using Wavelet transform and deep learning

Next-gen Rowhammer fuzzer that uses non-uniform, frequency-based patterns.

Codebase for INVASE: Instance-wise Variable Selection - 2019 ICLR

Algorithms for outlier, adversarial and drift detection

A collection of resources on GAN Inversion.