Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Last update: Dec 27, 2022

Related tags

Overview

Sky Computing

Introduction

Sky Computing is a load-balanced framework for federated learning model parallelism. It adaptively allocate model layers to devices based on the their hardware sepcification. Sky Computing outperforms the baseline method by 55% in training time when training 160-layer BERT in a 64-node cluster. Our paper can be found at https://arxiv.org/abs/2202.11836

The concept sky computing was first introduced by Dr. Katarzyna Keahey et al. They used this word to describe a cross-cloud compute pattern. And later Prof. Stoica and Prof. Shenker generalized this word to geo-distributed computing. Our project is based on their definition. [1] [2]

Installation

git clone [email protected]:hpcaitech/SkyComputing.git
python -m pip install -r requirements.txt
cd ./scaelum
python -m pip install -v -e .

Experiment (using BERT)

To benchmark the Sky Computing, we prepared a single demo which you can run on your cluster to train BERT.

Prepare BERT model

Bidirectional Encoder Representations from Transformers (aka BERT) is one of the state-of-the-art deep learning models for Natural Language Processing. In the experiment part, we use BERT to run a simple benchmark.

cd $PROJECT
mkdir -p BERT/model && cd BERT/model 
wget https://storage.googleapis.com/bert_models/2019_05_30/wwm_uncased_L-24_H-1024_A-16.zip
unzip wwm_uncased_L-24_H-1024_A-16.zip

Prepare GLUE MNLI dataset

The General Language Understanding Evaluation (aka GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. And the Multi-Genre Natural Language Inference (aka MNLI) is one of the tasks in GLUE, it is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.

cd $PROJECT
mkdir -p BERT/data && cd BERT/data
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/1502038877f6a88c225a34450793fbc3ea87eaba/download_glue_data.py
python download_glue_data.py --data_dir ./glue_data --tasks MNLI

Configuration

To run dllb in your cluster, you need to write a config file which contains the necessary information about training, e.g. model layers, useful environment variables. We have provided a well-commentted example, and here are some most important option:

# your project path
PROJECT = os.getenv("PROJECT")

# allocation type, valid values are even, optimal and dynamic
ALLOCATE_TYPE = "even"

# num of node (including the central server)
CORE_NUM = 4

Run scripts

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. We used slurm script to run our experiment.

#!/bin/sh

#SBATCH --job-name=gpu16   # Job name
#SBATCH -o gpu16.o%j       # Name of stdout output file
#SBATCH -e gpu16.e%j       # Name of stderr error file
#SBATCH -N 16              # Node numbers
#SBATCH -n 16              # GPU numbers
#SBATCH --time=02:00:00    # Run time (hh:mm:ss)

# run
python ./ip_addr.py > "./HOST"
srun python ./launch.py -c "./experiment/config.py"

Citation

@misc{zhu2022sky,
      title={Sky Computing: Accelerating Geo-distributed Computing in Federated Learning}, 
      author={Jie Zhu and Shenggui Li and Yang You},
      year={2022},
      eprint={2202.11836},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Reference

@article{keahey2009sky,
  title={Sky computing},
  author={Keahey, Katarzyna and Tsugawa, Mauricio and Matsunaga, Andrea and Fortes, Jose},
  journal={IEEE Internet Computing},
  volume={13},
  number={5},
  pages={43--51},
  year={2009},
  publisher={IEEE}
}
@inproceedings{stoica2021cloud,
  title={From cloud computing to sky computing},
  author={Stoica, Ion and Shenker, Scott},
  booktitle={Proceedings of the Workshop on Hot Topics in Operating Systems},
  pages={26--32},
  year={2021}
}

Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Related tags

Overview

Sky Computing

Introduction

Installation

Experiment (using BERT)

Prepare BERT model

Prepare GLUE MNLI dataset

Configuration

Run scripts

Citation

Reference

Owner

HPC-AI Tech

Improving Generalization Bounds for VC Classes Using the Hypergeometric Tail Inversion

The project covers common metrics for super-resolution performance evaluation.

Winning Solution in NTIRE19 Challenges on Video Restoration and Enhancement (CVPR19 Workshops) - Video Restoration with Enhanced Deformable Convolutional Networks. EDVR has been merged into BasicSR and this repo is a mirror of BasicSR.

code associated with ACL 2021 DExperts paper

Generating Fractals on Starknet with Cairo

Tensorflow implementation of "Learning Deconvolution Network for Semantic Segmentation"

Creating a custom CNN hypertunned architeture for the Fashion MNIST dataset with Python, Keras and Tensorflow.

All-in-one Docker container that allows a user to explore Nautobot in a lab environment.

A concise but complete implementation of CLIP with various experimental improvements from recent papers

Offical implementation for "Trash or Treasure? An Interactive Dual-Stream Strategy for Single Image Reflection Separation".

Official PyTorch Implementation for InfoSwap: Information Bottleneck Disentanglement for Identity Swapping

The pytorch implementation of SOKD (BMVC2021).

This repository is the code of the paper "Sparse Spatial Transformers for Few-Shot Learning".

curl-impersonate: A special compilation of curl that makes it impersonate Chrome & Firefox

Perform Linear Classification with Multi-way Data

Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme (NeurIPS2021)

A Python toolbox to create adversarial examples that fool neural networks in PyTorch, TensorFlow, and JAX

Official and maintained implementation of the paper "OSS-Net: Memory Efficient High Resolution Semantic Segmentation of 3D Medical Data" [BMVC 2021].

An implementation of the Contrast Predictive Coding (CPC) method to train audio features in an unsupervised fashion.

This repository provides a PyTorch implementation and model weights for HCSC (Hierarchical Contrastive Selective Coding)