Secure Distributed Training at Scale

Related tags

Deep Learningbtard
Overview

Secure Distributed Training at Scale

This repository contains the implementation of experiments from the paper

"Secure Distributed Training at Scale"

Eduard Gorbunov*, Alexander Borzunov*, Michael Diskin, Max Ryabinin

[PDF] arxiv.org

Overview

The code is organized as follows:

  • ./resnet is a setup for training ResNet18 on CIFAR-10 with simulated byzantine attackers
  • ./albert runs distributed training of ALBERT-large with byzantine attacks using cloud instances

ResNet18

This setup uses torch.distributed for parallelism.

Requirements
  • Python >= 3.7 (we recommend Anaconda python 3.8)
  • Dependencies: pip install jupyter torch>=1.6.0 torchvision>=0.7.0 tensorboard
  • A machine with at least 16GB RAM and either a GPU with >24GB memory or 3 GPUs with at least 10GB memory each.
  • We tested the code on Ubuntu Server 18.04, it should work with all major linux distros. For Windows, we recommend using Docker (e.g. via Kitematic).

Running experiments: please open ./resnet/RunExperiments.ipynb and follow the instructions in that notebook. The learning curves will be available in Tensorboard logs: tensorboard --logdir btard/resnet.

ALBERT

This setup spawns distributed nodes that collectively train ALBERT-large on wikitext103. It uses a version of the hivemind library modified so that some peers may be programmed to become Byzantine and perform various types of attacks on the training process.

Requirements
  • The experiments are optimized for 16 instances each with a single T4 GPU.

    • For your convenience, we provide a cost-optimized AWS starter notebook that can run experiments (see below)
    • While it can be simulated with a single node, doing so will require additional tuning depending on the number and type of GPUs available.
  • If running manually, please install the core library on each machine:

    • The code requires python >= 3.7 (we recommend Anaconda python 3.8)
    • Install the library: cd ./albert/hivemind/ && pip install -e .
    • If successful, it should become available as import hivemind

Running experiments: For your convenience, we provide a unified script that runs a distributed ALBERT experiment in the AWS cloud ./albert/experiments/RunExperiments.ipynb using preemptible T4 instances. The learning curves will be posted to the Wandb project specified during the notebook setup.

Expected cloud costs: a training experiment with 16 hosts takes up approximately $60 per day for g4dn.xlarge and $90 per day for g4dn.2xlarge instances. One can expect a full training experiment to converge in ≈3 days. Once the model is trained, one can restart training from intermediate checkpoints and simulate attacks. One attack episode takes up 4-5 hours depending on cloud availability.

Owner
Yandex Research
Yandex Research
Unsupervised Representation Learning via Neural Activation Coding

Neural Activation Coding This repository contains the code for the paper "Unsupervised Representation Learning via Neural Activation Coding" published

yookoon park 5 May 26, 2022
Poisson Surface Reconstruction for LiDAR Odometry and Mapping

Poisson Surface Reconstruction for LiDAR Odometry and Mapping Surfels TSDF Our Approach Table: Qualitative comparison between the different mapping te

Photogrammetry & Robotics Bonn 305 Dec 21, 2022
This repo contains source code and materials for the TEmporally COherent GAN SIGGRAPH project.

TecoGAN This repository contains source code and materials for the TecoGAN project, i.e. code for a TEmporally COherent GAN for video super-resolution

Nils Thuerey 5.2k Jan 02, 2023
A code implementation of AC-GC: Activation Compression with Guaranteed Convergence, in NeurIPS 2021.

Code For AC-GC: Lossy Activation Compression with Guaranteed Convergence This code is intended to be used as a supplemental material for submission to

Dave Evans 2 Nov 01, 2022
Cascading Feature Extraction for Fast Point Cloud Registration (BMVC 2021)

Cascading Feature Extraction for Fast Point Cloud Registration This repository contains the source code for the paper [Arxive link comming soon]. Meth

7 May 26, 2022
Source code for "Pack Together: Entity and Relation Extraction with Levitated Marker"

PL-Marker Source code for Pack Together: Entity and Relation Extraction with Levitated Marker. Quick links Overview Setup Install Dependencies Data Pr

THUNLP 173 Dec 30, 2022
Official implementation of Long-Short Transformer in PyTorch.

Long-Short Transformer (Transformer-LS) This repository hosts the code and models for the paper: Long-Short Transformer: Efficient Transformers for La

NVIDIA Corporation 198 Dec 29, 2022
A high-performance anchor-free YOLO. Exceeding yolov3~v5 with ONNX, TensorRT, NCNN, and Openvino supported.

YOLOX is an anchor-free version of YOLO, with a simpler design but better performance! It aims to bridge the gap between research and industrial communities. For more details, please refer to our rep

7.7k Jan 06, 2023
A stock generator that assess a list of stocks and returns the best stocks for investing and money allocations based on users choices of volatility, duration and number of stocks

Stock-Generator Please visit "Stock Generator.ipynb" for a clearer view and "Stock Generator.py" for scripts. The stock generator is designed to allow

jmengnyay 1 Aug 02, 2022
[NeurIPS 2021] Shape from Blur: Recovering Textured 3D Shape and Motion of Fast Moving Objects

[NeurIPS 2021] Shape from Blur: Recovering Textured 3D Shape and Motion of Fast Moving Objects YouTube | arXiv Prerequisites Kaolin is available here:

Denys Rozumnyi 107 Dec 26, 2022
A privacy-focused, intelligent security camera system.

Self-Hosted Home Security Camera System A privacy-focused, intelligent security camera system. Features: Multi-camera support w/ minimal configuration

Scott Barnes 175 Jan 01, 2023
PyTorch implementation for ACL 2021 paper "Maria: A Visual Experience Powered Conversational Agent".

Maria: A Visual Experience Powered Conversational Agent This repository is the Pytorch implementation of our paper "Maria: A Visual Experience Powered

Jokie 22 Dec 12, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
PyTorch implementation for Graph Contrastive Learning with Augmentations

Graph Contrastive Learning with Augmentations PyTorch implementation for Graph Contrastive Learning with Augmentations [poster] [appendix] Yuning You*

Shen Lab at Texas A&M University 382 Dec 15, 2022
In this work, we will implement some basic but important algorithm of machine learning step by step.

WoRkS continued English 中文 Français Probability Density Estimation-Non-Parametric Methods(概率密度估计-非参数方法) 1. Kernel / k-Nearest Neighborhood Density Est

liziyu0104 1 Dec 30, 2021
Code for a real-time distributed cooperative slam(RDC-SLAM) system for ROS compatible platforms.

RDC-SLAM This repository contains code for a real-time distributed cooperative slam(RDC-SLAM) system for ROS compatible platforms. The system takes in

40 Nov 19, 2022
Designing a Practical Degradation Model for Deep Blind Image Super-Resolution (ICCV, 2021) (PyTorch) - We released the training code!

Designing a Practical Degradation Model for Deep Blind Image Super-Resolution Kai Zhang, Jingyun Liang, Luc Van Gool, Radu Timofte Computer Vision Lab

Kai Zhang 804 Jan 08, 2023
Extension to fastai for volumetric medical data

FAIMED 3D use fastai to quickly train fully three-dimensional models on radiological data Classification from faimed3d.all import * Load data in vari

Keno 26 Aug 22, 2022
Pynomial - a lightweight python library for implementing the many confidence intervals for the risk parameter of a binomial model

Pynomial - a lightweight python library for implementing the many confidence intervals for the risk parameter of a binomial model

Demetri Pananos 9 Oct 04, 2022
The project was to detect traffic signs, based on the Megengine framework.

trafficsign 赛题 旷视AI智慧交通开源赛道,初赛1/177,复赛1/12。 本赛题为复杂场景的交通标志检测,对五种交通标志进行识别。 框架 megengine 算法方案 网络框架 atss + resnext101_32x8d 训练阶段 图片尺寸 最终提交版本输入图片尺寸为(1500,2

20 Dec 02, 2022