FedTorch is an open-source Python package for distributed and federated training of machine learning models using the PyTorch distributed API.

Overview

FedTorch Logo

FedTorch is an open-source Python package for distributed and federated training of machine learning models using the PyTorch distributed API. Various algorithms for federated learning and local SGD are implemented for benchmarking and research, including our own proposed methods:

  • FedGATE and FedCOMGATE
  • APFL
  • DRFA
  • Local SGD with adaptive synchronization

And other common algorithms such as:

  • FedAvg
  • FedProx

We are actively trying to expand the library to include more training algorithms as well.

NEWS

Recent updates to the package:

Installation

First, clone the repository to your machine:

git clone https://github.com/MLOPTPSU/FedTorch.git

The PyPI package will be added soon.

This package is built on the PyTorch distributed API, so it can run with any of the supported distributed backends: Gloo, MPI, or NCCL. Among these three, we use MPI as our main backend for development, since it can be used for both CPU and CUDA runs. Unfortunately, the prebuilt PyTorch binaries do not support the MPI backend for distributed training; PyTorch has to be built from source against an MPI installation that supports CUDA. However, we have you covered: we provide a Dockerfile that creates an image with all the dependencies FedTorch needs. The Dockerfile can be found here, and you can edit it to create your own customized image. In addition, since building this Docker image can take a long time, we provide prebuilt versions of it alongside this repository in the packages section.
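
To verify that a given PyTorch build supports the MPI backend (for example, inside the container), a quick check is:

import torch.distributed as dist

# True only if PyTorch was compiled with MPI support
print(dist.is_mpi_available())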

For instance, you can pull an image built with CUDA 10.2, OpenMPI 4.0.1 with CUDA support, and PyTorch 1.6.0, using the following command:

docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi

The docker images can be used with cloud services such as the Azure Machine Learning API for running FedTorch. Instructions for running on cloud services will be added in the near future.

Get Started

Running different trainings with different settings is easy in FedTorch. The only thing you need to take care of is setting the correct parameters for the experiment. The list of all parameters used in this package is in the parameters.py file. We provide examples for the different algorithms so that the relevant parameters can be set correctly. YAML support will be added in the future for each distinct training. The parameters can be parsed from the command line using the following method:

from fedtorch.parameters import get_args

args = get_args()

When the parameters are set, we need to set up the nodes using those parameters and start the training. First, we set up the distributed backend. For instance, we can use the MPI backend as:

import torch.distributed as dist

dist.init_process_group('mpi')

This will initialize the backend for distributed training. Next, we need to set up each node based on the parameters, create the graph of nodes, initialize the nodes, and load their data. For this, we can use the node objects in FedTorch:

from fedtorch.nodes import Client

client = Client(args, dist.get_rank())
# Initialize the node
client.initialize()
# Initialize the dataset if not downloaded
client.initialize_dataset()
# Load the dataset
client.load_local_dataset()
# Generate auxiliary models and params for training
client.gen_aux_models()

Then, we call the appropriate training method and run the training. For instance, if the parameters are set for FedAvg training, we can run:

from fedtorch.comms.trainings.federated import train_and_validate_federated

train_and_validate_federated(client)
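
Putting these steps together, a minimal driver script would look like the following sketch, assembled from the snippets above (the provided main.py covers more configurations):

import torch.distributed as dist

from fedtorch.parameters import get_args
from fedtorch.nodes import Client
from fedtorch.comms.trainings.federated import train_and_validate_federated

# Parse parameters, initialize the MPI backend, and set up this client
args = get_args()
dist.init_process_group('mpi')
client = Client(args, dist.get_rank())
client.initialize()
client.initialize_dataset()
client.load_local_dataset()
client.gen_aux_models()

# Run a federated training such as FedAvg
train_and_validate_federated(client)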

Different distributed and federated algorithms in this package can be run using this procedure, and for simplicity, we provide the main.py file, which can be used for running those algorithms following the same procedure. To run this file, we use mpirun and define the number of clients (processes) that will run the same file for training:

mpirun -np {NUM_CLIENTS} python main.py {args:values}

where {args:values} should be filled with the appropriate parameters for the training. Next, we provide a file that automatically builds this command for running with MPI in various situations.
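
Conceptually, such a launcher only needs to assemble and spawn the mpirun command line; the following is a minimal sketch (the actual run_mpi.py described below has many more options and may be structured differently):

import subprocess

def build_mpi_command(num_clients, script="main.py", **params):
    # Build: mpirun -np NUM_CLIENTS python main.py --key value ...
    cmd = ["mpirun", "-np", str(num_clients), "python", script]
    for key, value in params.items():
        cmd += ["--" + key, str(value)]
    return cmd

# Hypothetical example: 10 clients training on MNIST
subprocess.run(build_mpi_command(10, data="mnist", batch_size=50), check=True)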

Running Examples

To make the process easier, we have provided the run_mpi.py file, which covers most of the parameters needed for running different training algorithms. We first go over the details of the different parameters and then provide some examples.

Dataset Parameters

For setting up the dataset there are some parameters involved. The main parameters are:

  • --data : Defines the dataset name for training.
  • --batch_size : Defines the batch size for training.
  • --partition_data : Defines whether the data should be partitioned or each client has access to the whole dataset.
  • --reshuffle_per_epoch : This can be set to True for distributed training to have iid data across clients and faster convergence. This is not in line with federated learning settings.
  • --iid_data : If set to True, the data is randomly distributed across clients. If set to False, either the dataset itself is non-iid by default (like the EMNIST dataset) or it can be manually distributed to be non-iid (like the MNIST dataset, using the num_class_per_client parameter). The default is True in the package, but in the run_mpi.py file the default is False.
  • --num_class_per_client : If iid_data is set to False, we can distribute the data heterogeneously by attributing a certain number of classes to each client. For instance, with --num_class_per_client 2, each client only has access to data from two randomly selected classes during the entire training process. An example combining these flags follows this list.
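
For instance, a hypothetical main.py invocation that distributes MNIST non-iid with two classes per client could combine these flags as follows (parameter values are illustrative, and other required parameters are omitted):

python main.py --data mnist --batch_size 50 --partition_data True --iid_data False --num_class_per_client 2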

Federated Learning Parameters

To run the training using federated learning setups, some main parameters are:

  • --federated : If set to True, the training will be in one of the federated learning setups. If not, it will be in a distributed mode using local SGD with periodic averaging (which can be set using --local_step) and possibly reshuffling after each epoch. The default is False.
  • --federated_type : Defines the type of federated learning algorithm we want to use. The default is fedavg.
  • --federated_sync_type : It can be either epoch or local_step, and it determines when to synchronize the models. If set to epoch, the parameter --num_epochs_per_comm should be set as well. If set to local_step, the parameter --local_steps should be set. The default is epoch.
  • --num_comms : Defines the number of communication rounds for training. This is only for federated learning; in normal distributed mode, the total number of iterations should be set by either --num_epochs or --num_iterations, and hence --stop_criteria should be either epoch or iteration.
  • --online_client_rate : Defines the ratio of clients that are online and active during each round of communication. This is only for federated learning. The default value is 1.0, which means all clients will be active. A sample configuration using these flags follows this list.
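
As an illustration, a hypothetical FedAvg setup that synchronizes every 10 local steps for 20 rounds, with half of the clients active per round, could be configured as (other parameters omitted):

python main.py --federated True --federated_type fedavg --federated_sync_type local_step --local_steps 10 --num_comms 20 --online_client_rate 0.5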

Learning Rate Schedule

Different learning rate schedules can be set using their corresponding parameters. The main parameter is --lr_schedule_scheme, which defines the scheme for learning rate scheduling. For more information about the different learning rate schedulers, please see the learning.py file.
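
As a generic illustration only (the actual schedulers implemented in learning.py may differ), a simple multi-step decay schedule can be written as:

def multistep_lr(initial_lr, epoch, milestones=(30, 60), gamma=0.1):
    # Decay the learning rate by a factor of gamma at each milestone epoch.
    lr = initial_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= gamma
    return lr

# e.g., initial_lr=0.1 gives 0.1 before epoch 30, 0.01 until epoch 60, 0.001 after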

Examples

Now we provide some simple examples of running some of the training algorithms on a single node with multiple processes using MPI. To do so, we first need to run the docker container with the installed dependencies.

docker run --rm -it --mount type=bind,source="{path/to/FedTorch}",target=/FedTorch docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi
cd /FedTorch

This will run the container and mount the FedTorch repo into it. The {path/to/FedTorch} should be replaced with your local path to the FedTorch repository directory. Now we can run the training in it.

FedAvg and FedGATE

Now, we can run the FedAvg algorithm for training an MLP model on MNIST data with the following command.

python run_mpi.py -f -ft fedavg -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2

This will run the training on 10 nodes with an initial learning rate of 0.1 and a batch size of 50, for 20 communication rounds, each with 10 local steps of SGD. The dataset is distributed heterogeneously, with each client having access to only 2 classes of data from the MNIST dataset.

Changing -ft fedavg to -ft fedgate will run the same training using the FedGATE algorithm. To run the FedCOMGATE algorithm, we need to add the -q flag to enable quantization as well. Hence the command will be:

python run_mpi.py -f -ft fedgate -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -q

APFL

To run the APFL algorithm, a simple command is:

python run_mpi.py -f -ft apfl -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -pa 0.5 -fp

where -pa 0.5 sets the alpha parameter of the APFL algorithm. The last parameter, -fp, turns on the fed_personal parameter, which evaluates the personalized or localized model using a local validation dataset. This is mostly used for personalization algorithms such as APFL.
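
For intuition, APFL mixes the personalized model with the global model using this alpha parameter; the following is a schematic sketch of the mixed prediction, not FedTorch's exact implementation:

def apfl_mixed_output(model_personal, model_global, alpha, x):
    # Personalized prediction as a convex combination of the two models' outputs:
    # alpha = 1.0 relies fully on the personal model, alpha = 0.0 on the global one.
    return alpha * model_personal(x) + (1 - alpha) * model_global(x)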

DRFA

To run a DRFA training we can use the following command:

python run_mpi.py -f -fd -ft fedavg -n 10 -d mnist -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -dg 0.1 

where -dg 0.1 sets the gamma parameter of the DRFA algorithm. Note that DRFA is a framework that can be run with any federated learning aggregator, such as FedAvg or FedGATE. Hence, the -fd parameter enables DRFA training and -ft defines the federated type to be used for aggregation.

References

Several different references were used for the training procedures in this repository. If you use this repository in your research, please cite the following paper:

@article{haddadpour2020federated,
  title={Federated learning with compression: Unified analysis and sharp guarantees},
  author={Haddadpour, Farzin and Kamani, Mohammad Mahdi and Mokhtari, Aryan and Mahdavi, Mehrdad},
  journal={arXiv preprint arXiv:2007.01154},
  year={2020}
}

Our other papers developed using this repository should be cited using the following bibitems:

@inproceedings{haddadpour2019local,
  title={Local SGD with periodic averaging: Tighter analysis and adaptive synchronization},
  author={Haddadpour, Farzin and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad and Cadambe, Viveck},
  booktitle={Advances in Neural Information Processing Systems},
  pages={11082--11094},
  year={2019}
}
@article{deng2020distributionally,
  title={Distributionally Robust Federated Averaging},
  author={Deng, Yuyang and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad},
  journal={Advances in Neural Information Processing Systems},
  volume={33},
  year={2020}
}
@article{deng2020adaptive,
  title={Adaptive Personalized Federated Learning},
  author={Deng, Yuyang and Kamani, Mohammad Mahdi and Mahdavi, Mehrdad},
  journal={arXiv preprint arXiv:2003.13461},
  year={2020}
}

Acknowledgement and Disclaimer

This repository was developed, mainly by M. M. Kamani, based on our group's research on distributed and federated learning algorithms. We have also implemented other groups' proposed methods in FedTorch for better comparison; however, this repo is not the official code for any of those methods other than our own group's. Some parts of the initial stages of this repository were based on a fork of the Local SGD code from Tao Lin, which is no longer public.

Comments
  • Errors when running the code using Docker

    Hi, I followed your README instructions to run your algorithm but I encountered multiple errors along the way:

    • I tried to pull the image you provided in this issue #3 and ran the following command: python run_mpi.py -f -ft apfl -n 10 -d cifar10 -lg 0.1 -b 50 -c 20 -k 1.0 -fs local_step -l 10 -r 2 -pa 0.5 -fp -oc. I keep getting warnings about my RTX 2080 Ti card, which I think is not being recognized:
    warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
    Traceback (most recent call last):
      File "main.py", line 49, in <module>
        main(args)
      File "main.py", line 21, in main
        client.initialize()
      File "/workspace/fedtorch/fedtorch/nodes/nodes.py", line 44, in initialize
        init_config(self.args)
      File "/workspace/fedtorch/fedtorch/utils/init_config.py", line 36, in init_config
        torch.cuda.set_device(args.graph.device)
      File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 281, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: cuda runtime error (101) : invalid device ordinal at /workspace/pytorch/torch/csrc/cuda/Module.cpp:59
    THCudaCheck FAIL file=/workspace/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal
    /usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:125: UserWarning:
    GeForce RTX 2080 Ti with CUDA capability sm_75 is not compatible with the current PyTorch installation.
    The current PyTorch install supports CUDA capabilities sm_35 sm_37 sm_52 sm_60 sm_61 sm_70 compute_70.
    If you want to use the GeForce RTX 2080 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
    
      warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
    Traceback (most recent call last):
      File "main.py", line 49, in <module>
        main(args)
      File "main.py", line 21, in main
        client.initialize()
      File "/workspace/fedtorch/fedtorch/nodes/nodes.py", line 44, in initialize
        init_config(self.args)
      File "/workspace/fedtorch/fedtorch/utils/init_config.py", line 36, in init_config
        torch.cuda.set_device(args.graph.device)
      File "/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py", line 281, in set_device
        torch._C._cuda_setDevice(device)
    RuntimeError: cuda runtime error (101) : invalid device ordinal at /workspace/pytorch/torch/csrc/cuda/Module.cpp:59
    

    Regarding these warnings, the code does not crash, but not even one log regarding the training procedure is printed.

    In addition, when I run nvidia-smi I see the following GPU utilization:

    The utilization percentage across all 3 GPUs remains zero.

    • As a second option, I tried to build the image from the Dockerfile located in the docker folder. Yet again I encountered the following runtime error:
    Traceback (most recent call last):
      File "main.py", line 4, in <module>
        import torch.distributed as dist
      File "/usr/local/lib/python3.8/dist-packages/torch/__init__.py", line 189, in <module>
        from torch._C import *
    RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd
    
    --------------------------------------------------------------------------
    Primary job  terminated normally, but 1 process returned
    a non-zero exit code. Per user-direction, the job has been aborted.
    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    mpirun detected that one or more processes exited with non-zero status, thus causing
    the job to be terminated. The first process to do so was:
    
      Process name: [[51835,1],1]
      Exit code:    1
    --------------------------------------------------------------------------
    

    It looks like there is a problem with the PyTorch installation, so I tried something simple:

    Upgrading Numpy's version didn't help me either. Can you please help and explain how to run your code?

    Thank you!

    opened by AvivSham 5
  • A question about APFL algorithm

    Hi, I read your paper and am interested in the APFL algorithm. I have a question about the code. It seems that the code for APFL is a little different from the algorithm described in the paper. In the paper, each client maintains 3 models: it first updates the local version of the global model, then updates the local model, and finally mixes them to get the new personalized model. However, in this code, it maintains only 2 models: it first updates the local version of the global model, then gets the gradient of the personalized model by mixing the outputs of the local version of the global model and the personalized model.

    The code:

    # inference and get current performance.
    client.optimizer.zero_grad()
    loss, performance = inference(client.model, client.criterion, client.metrics, _input, _target)

    # compute gradient and do local SGD step.
    loss.backward()
    client.optimizer.step(
        apply_lr=True,
        apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
    )

    client.optimizer.zero_grad()
    client.optimizer_personal.zero_grad()
    loss_personal, performance_personal = inference_personal(client.model_personal, client.model,
                                                             client.args.fed_personal_alpha, client.criterion,
                                                             client.metrics, _input, _target)

    # compute gradient and do local SGD step.
    loss_personal.backward()
    client.optimizer_personal.step(
        apply_lr=True,
        apply_in_momentum=client.args.in_momentum, apply_out_momentum=False
    )

    Are they the same? Looking forward to your reply. Thank you!

    opened by BrightHaozi 3
  • a question about fedprox

    On this page, https://github.com/MLOPTPSU/FedTorch/blob/main/fedtorch/comms/trainings/federated/main.py, lines 123 to 129, the code is below.

    elif client.args.federated_type == 'fedprox':
        # Adding proximal gradients and loss for fedprox
        for client_param, server_param in zip(client.model.parameters(), client.model_server.parameters()):
            if client.args.graph.rank == 0:
                print("distance norm for prox is:{}".format(torch.norm(client_param.data - server_param.data )))
            loss += client.args.fedprox_mu /2 * torch.norm(client_param.data - server_param.data)
            client_param.grad.data += client.args.fedprox_mu * (client_param.data - server_param.data)
    

    client.args.fedprox_mu /2 * torch.norm(client_param.data - server_param.data) may mean $\frac{\mu}{2} \left\|w-w_t\right\|$. But, $\frac{\mu}{2} \left\|w-w_t\right\|^2$ is used in fedprox.

    I'm not sure what I said above is true. Thank you very much for your kind consideration.

    opened by bird-two 2
  • Could you release the code in "Federated Learning with Compression: Unified Analysis and Sharp Guarantees"?

    Hi,

    I just read the paper "Federated Learning with Compression: Unified Analysis and Sharp Guarantees", and it looks very interesting. While searching for the code I was directed to this repo; however, I didn't find the implementation of FedCOM, so I would like to ask if you would release it?

    Thanks!

    opened by Dzhange 1
  • Does BrokenPipeError Matters?

    Hi, I am very interested in your proposed method. It is really nice of you to share such a helpful and clear repo. I followed your repo and used your docker images to run the code, but I keep getting errors like the following after some random starting round of the training.

    Traceback (most recent call last):
      File "/usr/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
        send_bytes(obj)
      File "/usr/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
        self._send_bytes(m[offset:offset + size])
      File "/usr/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
        self._send(header + buf)
      File "/usr/lib/python3.8/multiprocessing/connection.py", line 368, in _send
        n = write(self._handle, buf)
    BrokenPipeError: [Errno 32] Broken pipe

    After all, I can still get a result, but I wonder whether the BrokenPipeError matters; am I still getting the correct results? Hope to hear your response soon.

    opened by RongDai430 1
  • No basic auth credentials

    I read your paper and am interested in your proposed method. It is really nice of you to share such a helpful and clear repo. I pulled one of the images built with CUDA 10.2 and OpenMPI 4.0.1 with CUDA support and PyTorch 1.6.0, using the following command: $ docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi

    However it reported an error "Error response from daemon: Head https://docker.pkg.github.com/v2/mloptpsu/fedtorch/fedtorch/manifests/cuda10.2-mpi: no basic auth credentials".
    How can I solve this error?

    opened by Sybil-Huang 1
  • Adding EMNIST full dataset

    Adding the EMNIST full dataset in addition to its digits-only version, which was implemented before. Now we have emnist and emnist_full datasets. The emnist dataset has 3383 clients with 10 classes, but emnist_full has 3400 clients with 62 classes. This also resolves the BatchNorm issue for training with 1 sample in the batch.

    opened by mmkamani7 1
  • setup problem in docker

    Hello, when I ran docker pull docker.pkg.github.com/mloptpsu/fedtorch/fedtorch:cuda10.2-mpi in my Docker, I ran into an error. Do you have any ideas to solve it? Thanks.

    opened by b856741 0
  • A question about drawing the plots

    Hi, recently I've read your paper "Distributionally Robust Federated Averaging" carefully, and I'm trying to reproduce the experiments in it. However, I don't know how to produce the plots from the code. I think modifying main.py and run_mpi.py would help, but it seems really complicated. Could you please give me some instructions? Thanks!

    opened by Zhg9300 2
  • Models not saved for Federated Centered

    Hi, while running main_centered.py the model checkpoints are not saved. Is this not implemented?

    Is there anything else missing for main_centered.py?

    Regards, Ahmed

    enhancement 
    opened by aldahdooh 1