A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision

Overview

Run your *raw* PyTorch training script on any kind of device

Easy to integrate

🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.

🤗 Accelerate abstracts exactly and only the boilerplate code related to multi-GPUs/TPU/fp16 and leaves the rest of your code unchanged.

Here is an example:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
          source = source.to(device)
          targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

As you can see in this example, by adding five lines to any standard PyTorch training script you can now run it on any kind of single or distributed node setting (single CPU, single GPU, multi-GPU, or TPU) as well as with or without mixed precision (fp16).

In particular, the same code can then be run without modification on your local machine for debugging or your training environment.

🤗 Accelerate even handles the device placement for you (which requires a few more changes to your code, but is safer in general), so you can even simplify your training loop further:

  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

- device = 'cpu'
+ accelerator = Accelerator()

- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
  optimizer = torch.optim.Adam(model.parameters())

  dataset = load_dataset('my_dataset')
  data = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, data = accelerator.prepare(model, optimizer, data)

  model.train()
  for epoch in range(10):
      for source, targets in data:
-         source = source.to(device)
-         targets = targets.to(device)

          optimizer.zero_grad()

          output = model(source)
          loss = F.cross_entropy(output, targets)

-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()

Launching script

🤗 Accelerate also provides an optional CLI tool that allows you to quickly configure and test your training environment before launching the scripts. No need to remember how to use torch.distributed.launch or to write a specific launcher for TPU training! On your machine(s) just run:

accelerate config

and answer the questions asked. This will generate a config file that will be used automatically to properly set the default options when doing

accelerate launch my_script.py --args_to_my_script

For instance, here is how you would run the GLUE example on the MRPC task (from the root of the repo):

accelerate launch examples/nlp_example.py
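
If you prefer not to create a config file, you can also pass the main options directly on the command line, for example (the flags shown are illustrative, not an exhaustive list):

accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 examples/nlp_example.py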

Why should I use 🤗 Accelerate?

You should use 🤗 Accelerate when you want to easily run your training scripts in a distributed environment without having to renounce full control over your training loop. This is not a high-level framework above PyTorch, just a thin wrapper, so you don't have to learn a new library. In fact, the whole API of 🤗 Accelerate is in one class, the Accelerator object.

Why shouldn't I use 🤗 Accelerate?

You shouldn't use 🤗 Accelerate if you don't want to write a training loop yourself. There are plenty of high-level libraries above PyTorch that will offer you that; 🤗 Accelerate is not one of them.

Installation

This repository is tested on Python 3.6+ and PyTorch 1.4.0+.

You should install 🤗 Accelerate in a virtual environment. If you're unfamiliar with Python virtual environments, check out the user guide.

First, create a virtual environment with the version of Python you're going to use and activate it.
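
For example, with Python's built-in venv module (the environment name .env is just a placeholder):

python -m venv .env
source .env/bin/activate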

Then, you will need to install PyTorch: refer to the official installation page for the specific install command for your platform. Then 🤗 Accelerate can be installed using pip as follows:

pip install accelerate

Supported integrations

  • CPU only
  • single GPU
  • multi-GPU on one node (machine)
  • multi-GPU on several nodes (machines)
  • TPU
  • FP16 with native AMP (apex on the roadmap)
Comments
  • Could there be a bug in mixed precision?

    When I use torch 1.6.0 & accelerate 0.3.0 and set mixed precision to yes in accelerate config, nothing happens (training is still full precision). If I instead set Accelerator(fp16=True) in the code, then AMP is triggered, but the loss becomes inf right away.

    But if I use the PyTorch way (i.e. use autocast in the code myself), training is normal and AMP is enabled.
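
    For reference, the native PyTorch AMP pattern referred to here looks roughly like this (a minimal sketch using torch.cuda.amp; model, optimizer and data are assumed to already exist):

    import torch.nn.functional as F
    from torch.cuda.amp import GradScaler, autocast

    scaler = GradScaler()
    for source, targets in data:
        optimizer.zero_grad()
        # run the forward pass in mixed precision
        with autocast():
            output = model(source)
            loss = F.cross_entropy(output, targets)
        # scale the loss to avoid fp16 gradient underflow, then step and update the scale
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()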

    So I wonder if there is a possible bug in accelerate.

    My environment is a single 2080 Ti on a local machine. The code with this problem is here.

    opened by voldemortX 24
  • [Include example code] mixed_precision="fp16" will break torch.save function.

    System Info

    accelerate-0.14.0
    Python 3.7.15
    Pytorch 1.12.1+cu113
    

    Information

    • [ ] The official example scripts
    • [X] My own modified scripts

    Tasks

    • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
    • [X] My own task or dataset (give details below)

    Reproduction

    from accelerate import Accelerator
    import torch
    import torch.nn as nn
    
    class ExampleModule(torch.nn.Module):
      def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, kernel_size=1)
    
    model = ExampleModule()
    
    #mixed_precision="fp16" will give error on torch.save
    #mixed_precision="no" will work with torch.save
    accelerator = Accelerator(
            gradient_accumulation_steps=1,
            mixed_precision="fp16",
            log_with="tensorboard",
            logging_dir=".",
        )
    
    #Always work
    torch.save(model,  "/model_original.model")
    
    #Will break torch.save on the prepared model if mixed_precision="fp16"
    model = accelerator.prepare(model)
    
    #Error with mixed_precision="fp16" 
    torch.save(model,  "/model_acc.model")
    #Error as well with mixed_precision="fp16" 
    torch.save(accelerator.unwrap_model(model),  "/model_unwrap.sd")
    

    It will return this error if mixed_precision="fp16"

    ---------------------------------------------------------------------------
    PicklingError                             Traceback (most recent call last)
    [<ipython-input-1-5ce45839c137>](https://localhost:8080/#) in <module>
         27 
         28 #Error
    ---> 29 torch.save(model,  "/model_acc.model")
         30 #Error as well
         31 torch.save(accelerator.unwrap_model(model),  "/model_unwrap.sd")
    
    1 frames
    [/usr/local/lib/python3.7/dist-packages/torch/serialization.py](https://localhost:8080/#) in _save(obj, zip_file, pickle_module, pickle_protocol)
        587     pickler = pickle_module.Pickler(data_buf, protocol=pickle_protocol)
        588     pickler.persistent_id = persistent_id
    --> 589     pickler.dump(obj)
        590     data_value = data_buf.getvalue()
        591     zip_file.write_record('data.pkl', data_value, len(data_value))
    
    PicklingError: Can't pickle <function _forward_unimplemented at 0x7fb39d0b0320>: it's not the same object as torch.nn.modules.module._forward_unimplemented
    

    Expected behavior

    torch.save should work even if Accelerator is set to fp16
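
    A commonly suggested workaround in the meantime is to save only the state dict instead of pickling the whole module, since then only tensors are serialized (a sketch; file names are illustrative):

    # saving tensors only avoids pickling the patched forward method
    unwrapped = accelerator.unwrap_model(model)
    torch.save(unwrapped.state_dict(), "/model_acc_state.pt")

    # to reload, rebuild the module and restore the weights
    model = ExampleModule()
    model.load_state_dict(torch.load("/model_acc_state.pt"))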
    
    bug 
    opened by BurguerJohn 23
  • About Timeout when use Multi-gpu training

    When I used single-node multi-GPU mode to train, a timeout error was reported. The strange thing is that for the first few epochs the code works fine; the error was reported after an evaluation step partway through training.

    The reported error message is:

    [E ProcessGroupNCCL.cpp:587] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808499 milliseconds before timing out. [E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808493 milliseconds before timing out.

    [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

    terminate called after throwing an instance of 'std::runtime_error' what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1808493 milliseconds before timing out. [E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down. terminate called after throwing an instance of 'std::runtime_error'

    opened by macheng6 23
  • "IndexError: tuple index out of range" for the zero_stage=3

    I am trying to integrate DeepSpeed into this script and have successfully run it for ZeRO stage 2, but when I tried it for ZeRO stage 3 this error appears just after the first epoch completes. I have made changes in the finetune_using_clm.py file as suggested in this huggingface/accelerate repo, and have created a new file tuned.py.

    The error for ZeRO stage 3 points to: Traceback (most recent call last): File "tuned.py", line 398, in main accelerator.backward(loss). The whole error is:

    Traceback (most recent call last):
      File "tuned.py", line 398, in main
        accelerator.backward(loss)
      File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1310, in backward
        self.deepspeed_engine_wrapped.backward(loss, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 156, in backward
        self.engine.backward(loss)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1860, in backward
        self.optimizer.backward(loss, retain_graph=retain_graph)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage3.py", line 2070, in backward
        self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 51, in backward
        scaled_loss.backward(retain_graph=retain_graph)
      File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
        torch.autograd.backward(
      File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 197, in backward
        Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 267, in apply
        return user_fn(self, *args)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 144, in backward
        ctx.pre_backward_function(ctx.module)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 11, in wrapped_fn
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _run_before_backward_function
        self.pre_sub_module_backward_function(sub_module)
      File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 487, in pre_sub_module_backward_function
        param_coordinator.trace_prologue(sub_module)
      File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 147, in trace_prologue
        if sub_module != self.__submodule_order[self.__step_id]:
    IndexError: tuple index out of range
    
    

    I don't know why it gives this error, as it runs fine with ZeRO stage 2.

    Any help in this regard would be highly appreciated.

    I am using Google Colab for the task.

    Packages version: mpi4py-3.1.4 deepspeed-0.7.6 accelerate-0.15.0 transformers-4.25.1

    opened by asifehmad 22
  • About the use of gather to compute metrics

    First of all, thank you for developing Accelerate! I'm new to it but I already love it, it's a great framework.

    I used the code below to train a naive model on MNIST data using 3 GPUs (on a single node/machine). This code uses an Accuracy class to compute the epoch-wise accuracy from predictions and labels. It stores predictions/labels at each step to eventually compute accuracy (or another metric, such as ROC AUC) once at the end of an epoch. At each step, accelerator.gather() is used to gather all predictions/labels from the GPU devices. I added a print statement in the Accuracy class to check the number of samples used to compute the accuracy. The test set of MNIST is composed of 10 000 samples, but the print statement in this class shows that 10 176 samples are actually used.

    Full code:

    from __future__ import print_function
    
    import argparse
    import os
    import os.path
    import threading
    from functools import partial
    
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    
    from accelerate import Accelerator
    from sklearn.metrics import accuracy_score
    from torch.optim.lr_scheduler import StepLR
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms
    from tqdm.auto import tqdm as original_tqdm
    
    
    class Accuracy:
        """Accuracy score."""
        def __init__(self):
            super().__init__()
            self.__build()
    
        def __build(self):
            self._lock = threading.Lock()
            self._predictions = []
            self._targets = []
    
        def reset(self):
            self._predictions.clear()
            self._targets.clear()
    
        def update(self, output):
            y_pred, y_true = output
            with self._lock:
                self._predictions.append(y_pred)
                self._targets.append(y_true)
    
        def compute(self):
            with self._lock:
                predictions = torch.cat(self._predictions, dim=0).numpy()
                targets = torch.cat(self._targets, dim=0).numpy()
                print(f'Shapes: predictions {predictions.shape}, targets {targets.shape}')
                return accuracy_score(y_true=targets, y_pred=predictions)
    
    
    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout2d(0.25)
            self.dropout2 = nn.Dropout2d(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)
    
        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)
            output = F.log_softmax(x, dim=1)
            return output
    
    
    def main():
    
        # Training settings
        parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
        parser.add_argument('--per_device_eval_batch_size',
                            type=int,
                            default=64,
                            metavar='N',
                            help='The per-device batch size to use for evaluation.')
        parser.add_argument('--per_device_train_batch_size',
                            type=int,
                            default=64,
                            metavar='N',
                            help='The per-device batch size to use for training.')
        parser.add_argument('--epochs',
                            type=int,
                            default=5,
                            metavar='N',
                            help='number of epochs to train (default: 14)')
        parser.add_argument('--lr',
                            type=float,
                            default=1.0,
                            metavar='LR',
                            help='learning rate (default: 1.0)')
        parser.add_argument('--gamma',
                            type=float,
                            default=0.7,
                            metavar='M',
                            help='Learning rate step gamma (default: 0.7)')
        parser.add_argument('--no-cuda',
                            action='store_true',
                            default=False,
                            help='disables CUDA training')
        parser.add_argument('--seed',
                            type=int,
                            default=1,
                            metavar='S',
                            help='random seed (default: 1)')
        parser.add_argument('--log-interval',
                            type=int,
                            default=10,
                            metavar='N',
                            help='how many batches to wait before logging training status')
        parser.add_argument('--out_dir',
                            type=str,
                            help='Path where the trained model will be saved (if not None).')
        args = parser.parse_args()
    
        torch.manual_seed(args.seed)
        accelerator = Accelerator()
        _is_local_main_process = accelerator.is_local_main_process
        tqdm = partial(original_tqdm, disable=not _is_local_main_process, position=0)
    
        use_cuda = not args.no_cuda and torch.cuda.is_available()
        kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    
        # TRAIN AND TEST DATASETS/DATALOADERS
        train_transforms = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,)),
        ])
    
        test_transforms = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,)),
        ])
    
        with accelerator.main_process_first():
            # We only want to download MNIST data on rank-0
    
            train_dataset = datasets.MNIST(os.environ['DSDIR'],
                                           train=True,
                                           download=True,
                                           transform=train_transforms)
            print(f'Length of training dataset: {len(train_dataset)}')
    
            test_dataset = datasets.MNIST(os.environ['DSDIR'],
                                          download=True,
                                          train=False,
                                          transform=test_transforms)
            print(f'Length of test dataset: {len(test_dataset)}')
    
        train_dataloader = DataLoader(dataset=train_dataset,
                                      batch_size=args.per_device_train_batch_size,
                                      shuffle=True,
                                      **kwargs)
    
        test_dataloader = DataLoader(dataset=test_dataset,
                                     batch_size=args.per_device_eval_batch_size,
                                     shuffle=True,
                                     **kwargs)
    
        model = Net().to(accelerator.device)
        optimizer = optim.Adadelta(model.parameters(), lr=args.lr)
        scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
        model, optimizer, train_dataloader, test_dataloader = accelerator.prepare(model, optimizer, train_dataloader, test_dataloader)
    
        def evaluate(_model, _device, _test_loader, _epoch):
            _model.eval()
            test_losses = []
            test_accuracy = Accuracy()
            example_images = []
            with torch.no_grad():
                for data, target in tqdm(_test_loader, desc=f'eval (epoch {_epoch:03d})'):
                    data, target = data.to(_device), target.to(_device)
                    output = _model(data)
                    loss = F.nll_loss(output, target, reduction='sum')
                    test_losses.append(accelerator.gather(loss))
                    preds = output.argmax(dim=1, keepdim=True)
                    test_accuracy.update((accelerator.gather(preds).detach().cpu(),
                                          accelerator.gather(target).detach().cpu()))
    
            test_loss = torch.sum(torch.cat(test_losses)) / len(_test_loader.dataset)
            test_acc = test_accuracy.compute()
            test_accuracy.reset()
            return test_acc
    
        def train_one_epoch(_args, _model, _device, _train_loader, _optimizer, _epoch):
            _model.train()
            for step, batch in enumerate(tqdm(_train_loader, desc=f'train (epoch {_epoch:03d})')):
                data, target = batch
                data, target = data.to(_device), target.to(_device)
                _optimizer.zero_grad()
                output = _model(data)
                loss = F.nll_loss(output, target)
                accelerator.backward(loss)
                _optimizer.step()
    
        # TRAINING
        for epoch in range(1, args.epochs + 1):
            train_one_epoch(args, model, accelerator.device, train_dataloader, optimizer, epoch)
            eval_accuracy = evaluate(model, accelerator.device, test_dataloader, epoch)
            if _is_local_main_process:
                print(f'Epoch {epoch:02d} / Eval accuracy = {eval_accuracy}')
            scheduler.step()
    
        # SAVE TRAINED MODEL (OPTIONAL)
        if _is_local_main_process and args.out_dir is not None:
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save(args.out_dir, save_function=accelerator.save)
    
    
    if __name__ == '__main__':
    
        main()
    

    Print statement after the 1st epoch:

    Shapes: predictions (10176, 1), targets (10176,)
    

    Why is there a mismatch between the expected number of samples (10 000) and the actual number of samples (10 176)?
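
    For context, the duplicated samples come from the distributed dataloader, which loops back to the beginning of the dataset so that every process gets the same number of full batches, inflating the gathered count past the true dataset size. A hedged sketch of one possible fix, reusing names from the script above (requires a version of Accelerate that provides gather_for_metrics):

    # replaces the plain gather() calls so the padded duplicates are dropped automatically
    test_accuracy.update((accelerator.gather_for_metrics(preds).detach().cpu(),
                          accelerator.gather_for_metrics(target).detach().cpu()))

    Alternatively, the gathered tensors can simply be truncated to len(test_dataset) before computing the metric.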

    opened by jbschiratti 21
  • Add `join_uneven_inputs` context manager to Accelerator

    This PR adds a context manager which acts as a simple wrapper around Join to enable training with uneven inputs when using DDP.

    This continues the work described in: https://github.com/huggingface/accelerate/issues/684

    opened by Chris-hughes10 20
  • Multi-node setup, host can't connect to its own provided IP address

    Hi 🤗 I have 2 nodes, each with 8x A100 for a total of 16 GPUs. I'm using SLURM to launch the jobs. SLURM scripts for the curious: https://rentry.co/9geu8n

    Here, the main script uses the allotted 2 nodes and runs srun over them, i.e. each node is given the PY file to execute once.

    Env

    • Accelerate version: 0.13.0.dev0
    • Platform: Linux-5.10.126-117.518.amzn2.x86_64-x86_64-with-glibc2.10
    • Python version: 3.8.13
    • Numpy version: 1.22.4
    • PyTorch version (GPU?): 1.13.0a0+08820cb (True)
    • Accelerate default config: Not found

    Now, I noticed a peculiar behavior. When on a single node (no SLURM, no multi-node, only multi-GPU) and run this:

    accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
    --mixed_precision fp16 --machine_rank 0 --main_process_ip 172.... --main_process_port 69420 \
    \
    scripts/...
    

    The script won't run - the command simply executes, and I'm back at the command prompt again - no stdout or stderr.

    But with

    accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
    --mixed_precision fp16  \
    \
    scripts/torch_...
    

    It works fine. The script runs fine on the 8 GPUs, and I can monitor the WandB logs.

    This is a little quirk which puzzled me, and I can make neither head nor tail of it. I suspect it might mean something to someone here.


    Multi-node training

    For multi-node training, this is the PY script being executed: https://rentry.co/tz465

    • This script works correctly for multi-GPU cases, but NOT for multi-node

    Most of it is standard snippets, but it may have some glaring flaw.

    Output:

    This is the output of the main sbatch script, which tells SLURM to deploy

    Number of Nodes: 2
    Name of all Hosts: gpu-st-p4d-24xlarge-60 gpu-st-p4d-24xlarge-61 # two nodes here, each 8xA100s
    Master IP: 172.3.... # IP address of the main node
    MASTER_PORT= 16543
    ID: 0 # Each node reporting its RANK
    ID: 1
    NODE_COUNT=2 #number of nodes deployed
    
    [18:14:34] WARNING  The following values were not passed to        launch.py:838
                        `accelerate launch` and had defaults used                   
                        instead:                                                    
                                `--num_cpu_threads_per_process` was                 
                        set to `48` to improve out-of-box performance               
                        To avoid this warning pass in values for each               
                        of the problematic parameters or run                        
                        `accelerate config`.                                        
    [18:14:35] WARNING  The following values were not passed to        launch.py:838
                        `accelerate launch` and had defaults used                   
                        instead:                                                    
                                `--num_cpu_threads_per_process` was                 
                        set to `48` to improve out-of-box performance               
                        To avoid this warning pass in values for each               
                        of the problematic parameters or run                        
                        `accelerate config`.  
    {Waiting about 15 mins}
    
    [E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
    [E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
    

    Trying random ports yields no results.

    I think it might be connected with the problem specified above. Does anyone have any idea?

    opened by neel04 20
  • Possible memory leak when inferencing BLOOM 176B

    System Info

    - `Accelerate` version: 0.11.0
    - Platform: Linux-4.18.0-305.25.1.el8_4.x86_64-x86_64-with-glibc2.17
    - Python version: 3.8.13
    - Numpy version: 1.22.3
    - PyTorch version (GPU?): 1.11.0a0+gitbc2c6ed (True)
    - `Accelerate` default config:
    	Not found
    

    Information

    • [ ] The official example scripts
    • [X] My own modified scripts

    Tasks

    • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
    • [X] My own task or dataset (give details below)

    Reproduction

    Script: https://github.com/mayank31398/Megatron-DeepSpeed/blob/add-generation-server/scripts/inference/bloom-accelerate-server.py

    Usage: python scripts/inference/bloom-accelerate-server.py --model_name bigscience/bloom --dtype bf16 --log_file data.log --host $ADDRESS --port $PORT

    Memory blowup over time discussed here: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308#issuecomment-1205757494

    Expected behavior

    This memory leak should not occur I think.
    
    bug 
    opened by mayank31398 20
  • Incorrect `num_warmup_steps` for `lr_scheduler` for multi-gpu training

    System Info

    - `Accelerate` version: 0.10.0
    - Platform: Linux-3.10.0_3-0-0-12-x86_64-with-centos-6.3-Final
    - Python version: 3.7.12
    - Numpy version: 1.21.6
    - PyTorch version (GPU?): 1.7.1 (True)
    - `Accelerate` default config:
            - compute_environment: LOCAL_MACHINE
            - distributed_type: MULTI_GPU
            - mixed_precision: no
            - use_cpu: False
            - num_processes: 8
            - machine_rank: 0
            - num_machines: 1
            - main_process_ip: None
            - main_process_port: None
            - main_training_function: main
            - deepspeed_config: {}
            - fsdp_config: {}
    

    Information

    • [X] The official example scripts
    • [ ] My own modified scripts

    Tasks

    • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
    • [ ] My own task or dataset (give details below)

    Reproduction

    https://github.com/huggingface/transformers/blob/f2fbe4475386bfcfb3b83d0a3223ba216a3c3a91/examples/pytorch/translation/run_translation_no_trainer.py#L533

    # define lr scheduler
    lr_scheduler = get_scheduler(
            name="linear",
            optimizer=optimizer,
            num_warmup_steps=args.warmup_steps,
            num_training_steps=args.max_train_steps,
        )
    
    ...
    
    if step % args.gradient_accumulation_steps == 0:                    
          optimizer.step()
          lr_scheduler.step() # update lr scheduler every `gradient_accumulation_steps`
          optimizer.zero_grad()
    

    Expected behavior

    Does Accelerate consider the number of processes for num_warmup_steps? Suppose we set args.warmup_steps=80 and train on a single 8-GPU machine: the linear learning rate will peak at step 10 (i.e., 80/8) rather than at the expected 80.
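
    One hedged workaround, assuming the prepared scheduler ends up being stepped once per process for every optimizer step, is to scale the scheduler horizon by the number of processes (a sketch only, not necessarily the intended fix):

    lr_scheduler = get_scheduler(
            name="linear",
            optimizer=optimizer,
            num_warmup_steps=args.warmup_steps * accelerator.num_processes,
            num_training_steps=args.max_train_steps * accelerator.num_processes,
        )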

    bug 
    opened by cyk1337 19
  • raise error for duplicate accelerate config values when using `deepspeed_config_file`

    What does this PR do?

    1. Fixes: #936

    Example:

    1. accelerate config manually tweaked to have both deepspeed_config_file and other ds config entries that are available in the json config file:
    command_file: null
    commands: null
    compute_environment: LOCAL_MACHINE
    deepspeed_config:
      gradient_accumulation_steps: 1
      gradient_clipping: 1.0
      offload_optimizer_device: 'cpu'
      offload_param_device: 'cpu'
      zero3_init_flag: true
      zero3_save_16bit_model: true
      zero_stage: 3
      deepspeed_config_file: 'ds_config.json'
    distributed_type: DEEPSPEED
    downcast_bf16: 'no'
    dynamo_backend: 'NO'
    fsdp_config: {}
    gpu_ids: null
    machine_rank: 0
    main_process_ip: null
    main_process_port: null
    main_training_function: main
    megatron_lm_config: {}
    mixed_precision: 'bf16'
    num_machines: 1
    num_processes: 2
    rdzv_backend: static
    same_network: true
    tpu_name: null
    tpu_zone: null
    use_cpu: false
    
    2. ds_config.json:
    {
        "fp16": {
            "enabled": true
        },
        "zero_optimization": {
            "stage": 3,
            "stage3_gather_16bit_weights_on_model_save": false,
            "offload_optimizer": {
                "device": "none"
            },
            "offload_param": {
                "device": "none"
            }
        },
        "gradient_clipping": 1.0,
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": 10,
        "steps_per_print": 2000000
    }
    
    3. Code:
    from accelerate import Accelerator
    
    def main():
        accelerator = Accelerator()
    
    if __name__ == "__main__":
        main()
    
    4. output:
    ValueError: When using `deepspeed_config_file`, the following accelerate config variables will be 
    ignored: ['gradient_accumulation_steps', 'gradient_clipping', 'zero_stage', 
    'offload_optimizer_device', 'offload_param_device', 'zero3_save_16bit_model', 'mixed_precision'].
    Please specify them appropriately in the DeepSpeed config file.
    If you are using accelerate config file, set `mixed_precision=no` and remove others config variables
    mentioned in the above specified list; else don't specify these config variables in `accelerate 
    launch` command. 
    The easiest method is to create new config following the questionnaire via  `accelerate config`.
    It will only ask for the necessary config variables when using `deepspeed_config_file`.
    
    opened by pacman100 18
  • Issues with saving model/optimizer and loading them back

    Hello @sgugger ,

    Came across multiple related issues regarding this - https://github.com/huggingface/accelerate/issues/242, https://github.com/huggingface/accelerate/issues/154 . They were all closed with this PR - https://github.com/huggingface/accelerate/pull/255, but unfortunately the PR doesn't seem to have much documentation.

    I was looking specifically for: saving a model, its optimizer state, LR scheduler state, its random seeds/states, epoch/step count, and other related states for reproducible training runs and resuming them correctly.

    I know there's this very brief doc here: here and here, but it looks like there are still a few grey areas not documented currently regarding its usage. a) My question is specifically: like the official example here (link), which saves using save_pretrained only in the main process, should I be using these only in the main process too (both save/load)? And in the case of load_state, will I have to call prepare() after load_state is done to prepare everything for multi-gpu training/inference, or does load_state do all of that internally itself? b) Does the save_state method call the model's save_pretrained internally, or do I have to do both? FWIW, I'm using HF's BERT and other pretrained models from the transformers lib, so if there are any other specialized methods specifically for those then please advise on the same. If there's any simple toy example that already uses these new checkpointing methods and you can help share it, that'd be pretty helpful!
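
    For readers who land here, a minimal sketch of the checkpointing calls being discussed (assuming save_state/load_state take a directory and that prepare() has already been called; the path is illustrative):

    # after accelerator.prepare(model, optimizer, lr_scheduler, train_dataloader)
    accelerator.save_state("checkpoints/step_1000")  # saves model, optimizer, scheduler and RNG states

    # to resume: rebuild the objects, call prepare() again, then restore everything
    accelerator.load_state("checkpoints/step_1000")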

    The last release seems to be way back in Sept 2021 - https://github.com/huggingface/accelerate/releases/tag/v0.5.1 - and the PR is just about a month old. Any plans for a soonish version-bump release of accelerate?

    Request: If some more detailed examples can be added to the docs that'd be really awesome and help clarify about some of these specifics to users more easily!

    Thanks so much in advance! :)

    opened by ashutoshsaboo 18
  • adapter-transformers: `IndexError` in `infer_auto_device_map`

    adapter-transformers: `IndexError` in `infer_auto_device_map`

    System Info

    - `Accelerate` version: 0.15.0.dev0
    - Platform: Linux-3.10.0-1160.80.1.el7.x86_64-x86_64-with-glibc2.17
    - Python version: 3.9.16+
    - Numpy version: 1.24.0
    - PyTorch version (GPU?): 1.13.1+cu117 (True)
    - `Accelerate` default config:
            Not found
    

    Information

    • [ ] The official example scripts
    • [X] My own modified scripts

    Tasks

    • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
    • [X] My own task or dataset (give details below)

    Reproduction

    1. pip3 uninstall transformers
    2. pip3 install adapter-transformers
    3. test.py:
    import transformers
    model = transformers.AutoAdapterModel.from_pretrained('google/flan-t5-base', device_map='auto')
    

    Result:

    ╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
    │ /home/user/scratch/test-2023-01-07.py:2 in <module>                                              │
    │                                                                                                  │
    │   1 import transformers                                                                          │
    │ ❱ 2 model = transformers.AutoAdapterModel.from_pretrained('google/flan-t5-base', device_map=     │
    │   3                                                                                              │
    │                                                                                                  │
    │ /home/user/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py:446 in    │
    │ from_pretrained                                                                                  │
    │                                                                                                  │
    │   443 │   │   │   return model_class.from_pretrained(pretrained_model_name_or_path, *model_arg   │
    │   444 │   │   elif type(config) in cls._model_mapping.keys():                                    │
    │   445 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
    │ ❱ 446 │   │   │   return model_class.from_pretrained(pretrained_model_name_or_path, *model_arg   │
    │   447 │   │   raise ValueError(                                                                  │
    │   448 │   │   │   f"Unrecognized configuration class {config.__class__} for this kind of AutoM   │
    │   449 │   │   │   f"Model type should be one of {', '.join(c.__name__ for c in cls._model_mapp   │
    │                                                                                                  │
    │ /home/user/.local/lib/python3.9/site-packages/transformers/modeling_utils.py:2121 in             │
    │ from_pretrained                                                                                  │
    │                                                                                                  │
    │   2118 │   │   │   no_split_modules = model._no_split_modules                                    │
    │   2119 │   │   │   # Make sure tied weights are tied before creating the device map.             │
    │   2120 │   │   │   model.tie_weights()                                                           │
    │ ❱ 2121 │   │   │   device_map = infer_auto_device_map(                                           │
    │   2122 │   │   │   │   model, no_split_module_classes=no_split_modules, dtype=torch_dtype, max_  │
    │   2123 │   │   │   )                                                                             │
    │   2124                                                                                           │
    │                                                                                                  │
    │ /shared/src/accelerate/src/accelerate/utils/modeling.py:545 in infer_auto_device_map             │
    │                                                                                                  │
    │   542 │   │   elif tied_param is not None:                                                       │
    │   543 │   │   │   # Determine the sized occupied by this module + the module containing the ti   │
    │   544 │   │   │   tied_module_size = module_size                                                 │
    │ ❱ 545 │   │   │   tied_module_index = [i for i, (n, _) in enumerate(modules_to_treat) if n in    │
    │   546 │   │   │   tied_module_name, tied_module = modules_to_treat[tied_module_index]            │
    │   547 │   │   │   tied_module_size += module_sizes[tied_module_name] - module_sizes[tied_param   │
    │   548 │   │   │   if current_max_size is not None and current_memory_used + tied_module_size >   │
    ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
    IndexError: list index out of range
    

    Expected behavior

    An error or warning, or silent success.
    
    opened by xloem 0
  • How to accelerate with device_map = "balanced_low_0"

    System Info

    - `Accelerate` version: 0.14.0
    - Platform: Linux-4.15.0-200-generic-x86_64-with-debian-buster-sid
    - Python version: 3.7.3
    - Numpy version: 1.21.6
    - PyTorch version (GPU?): 1.12.1+cu113 (True)
    - `Accelerate` default config:
            - compute_environment: LOCAL_MACHINE
            - distributed_type: DEEPSPEED
            - mixed_precision: no
            - use_cpu: False
            - num_processes: 4
            - machine_rank: 0
            - num_machines: 1
            - gpu_ids: None
            - main_process_ip: None
            - main_process_port: None
            - rdzv_backend: static
            - same_network: True
            - main_training_function: main
            - deepspeed_config: {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': False, 'zero3_save_16bit_model': False, 'zero_stage': 3}
            - fsdp_config: {}
            - megatron_lm_config: {}
            - downcast_bf16: no
            - tpu_name: None
            - tpu_zone: None
            - command_file: None
            - commands: None
    

    Information

    • [ ] The official example scripts
    • [X] My own modified scripts

    Tasks

    • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
    • [X] My own task or dataset (give details below)

    Reproduction

    None

    Expected behavior

    I have tried model.from_pretrained(name, device_map="balanced_low_0") and see that little memory is used on the index-0 GPUs, but how do I accelerate with data parallelism?

    For example, I have 4 GPUs and only the last two are mainly used for model parallelism; how can I leverage gpu0 and gpu1 to do data parallelism and speed up the whole process? For now, I can only put the data on gpu0 instead of both gpu0 and gpu1 together.
    
    opened by ZeyiLiao 2
  • Un-needed (wrong?) code in example script

    I am referring to this line of code in deepspeed_with_config_support.py:

        ...
        losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))

    losses = torch.cat(losses)
    try:
        eval_loss = torch.mean(losses)
        ...
    

    In this case the loss is repeated by the batch size, gathered across processes, and then averaged. While it is unclear to me why one would repeat the loss (perhaps to compute the average loss per sample?), the computation does not contribute to the final result (e.g., we repeat [2, 4] into [2, 2, 4, 4] just to get the final result of 3 anyway).

    opened by vittorio-perera 1
  • Simple NLP Example fails in colab

    System Info

    - `Accelerate` version: 0.15.0.dev0
    - Platform: Linux-5.10.147+-x86_64-with-glibc2.27
    - Python version: 3.8.16
    - Numpy version: 1.21.6
    - PyTorch version (GPU?): 1.13.0+cu116 (False)
    - `Accelerate` default config:
    	Not found
    

    Information

    • [X] The official example scripts
    • [ ] My own modified scripts

    Tasks

    • [X] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
    • [ ] My own task or dataset (give details below)

    Reproduction

    run this notebook : https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/accelerate_examples/simple_nlp_example.ipynb#scrollTo=AuM9kPcpY7a_

    Error:

    ModuleNotFoundError                       Traceback (most recent call last)
    in <module>
          1 from accelerate import notebook_launcher
          2
    ----> 3 notebook_launcher(training_function, (model,))

    /usr/local/lib/python3.8/dist-packages/accelerate/launchers.py in notebook_launcher(function, args, num_processes, mixed_precision, use_port)
         59     if (in_colab or in_kaggle) and (os.environ.get("TPU_NAME", None) is not None):
         60         # TPU launch
    ---> 61         import torch_xla.distributed.xla_multiprocessing as xmp
         62
         63     if len(AcceleratorState._shared_state) > 0:

    ModuleNotFoundError: No module named 'torch_xla'

    Expected behavior

    Hope you can fix it!
    
    opened by caroheymes 1
  • Accelerate with Deepspeed pdsh launcher, main_process_ip, port are not set correctly ?

    Hello, I'm referring to these lines of code: https://github.com/huggingface/accelerate/blob/v0.15.0/src/accelerate/commands/launch.py#L683-L710

    Correct me if I'm wrong, but it seems to me that the main_process_ip/port is not passed to the cmd, so the default values are used.

    When I run the pdsh launcher with my main_process_ip/port passed as launcher arguments, an "address already in use" error is raised (with some log output printing the default ip/port in the cmd). But it runs OK with the standard launcher (as the ip/port are set properly there).

    Also, I want to ask if there are any pros/cons to using the pdsh launcher versus the standard or other launchers?

    Thank you

    opened by lqtrung1998 2
Releases (v0.15.0)
  • v0.15.0(Dec 2, 2022)

    PyTorch 2.0 stack support

    We are very excited by the newly announced PyTorch 2.0 stack, and you can try it with Accelerate on any model by using the dynamo_backend argument of the Accelerator, or when filling in your config with accelerate config.
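
    For example (a minimal sketch; "inductor" is shown only as an illustrative backend choice):

    from accelerate import Accelerator

    # models passed to prepare() will be optimized with the chosen TorchDynamo backend
    accelerator = Accelerator(dynamo_backend="inductor")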

    Note that to get the best performance, we recommend:

    • using an Ampere GPU (or more recent)
    • sticking to fixed shapes for now
    • Add support for torch dynamo by @sgugger in #829

    New CLI commands

    • Added two new commands, accelerate config update and accelerate config default. The first will update a config file to have the latest keys added from later releases of Accelerate, and the second will create a default configuration file automatically mimicking write_default_config() introduced in #851 and #853 by @muellerzr
    • Also introduced a filterable help for accelerate launch which will show options relevant to the choices made; for example, accelerate launch --multi_gpu will show the launch parameters relevant to multi-GPU training.

    What's new?

    • fix 🐛 by @pacman100 in #836
    • Deepspeed example should use gather_for_metrics by @HammadB in #821
    • Highlight selection with pretty colors by @muellerzr in #839
    • Add join_uneven_inputs context manager to Accelerator by @Chris-hughes10 in #820
    • Introduce default-config command by @muellerzr in #840
    • Fix log error and add log level to get_logger by @muellerzr in #842
    • Fix if/else by @muellerzr in #849
    • Fix complete_cv example by @muellerzr in #848
    • Refactor Accelerate config and introduce a multi-argument CLI interface by @muellerzr in #851
    • Clean up, add update command by @muellerzr in #853
    • Revert "Update pr docs actions by @mishig25 in #827)"
    • Switch default log to warn by @muellerzr in #859
    • Remove mixed precision hook as part of the unwrap_model by @muellerzr in #860
    • update deepspeed error message wrt batch_size by @pacman100 in #861
    • fix failing deepspeed test by @pacman100 in #868
    • Even more log level refined, leave alone if not explicitly set by @muellerzr in #871
    • Solve pickling issues by @muellerzr in #872
    • Spring cleaning by @muellerzr in #865
    • fixing lr_scheduler prepare issue when using pytorch nightly by @pacman100 in #878
    • fix fsdp state_dict_config because of PyTorch changes by @pacman100 in #877
    • Update deprecated logging warn by @SHi-ON in #881
    • fix a bug by @xiaohu2015 in #887
    • Allow safetensors offload by @sgugger in #873
    • fixing lr scheduler for pytorch nightly by @pacman100 in #884
    • Prefix all accelerate env vars with ACCELERATE by @muellerzr in #890
    • fix prefix issues in tests by @pacman100 in #891
    • Fix windows cli selector by @muellerzr in #893
    • Better description for improper kwargs by @muellerzr in #894
    • Support bfloat16 in load_offloaded_weight by @sgugger in #892

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    • @Chris-hughes10
      • Add join_uneven_inputs context manager to Accelerator (#820)
    Source code(tar.gz)
    Source code(zip)
  • v0.14.0(Nov 8, 2022)

    Megatron LM integration

    Accelerate now supports Megatron-LM for the three model classes (BERT, GPT-2 and T5). You can learn more in the documentation.

    • Megatron-LM integration by @pacman100 in #667
    • ensure megatron is 2.2.0+ by @jeffra in #755
    • updating docs to use fork of megatron-lm and minor example/docs fix by @pacman100 in #766
    • adding support to return logits and generate for Megatron-LM GPT models by @pacman100 in #819

    PyTorch 1.13 support

    Fixes a bug that returned SIGKILL errors on Windows.

    • Isolate distrib_run by @muellerzr in #828

    Kaggle support with the notebook_launcher

    With Kaggle now giving instances with two T4 GPUs, Accelerate can leverage this to do multi-GPU training from the notebook.

    • Work in kaggle! by @muellerzr in #783
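
    For example, from a notebook cell (a sketch; training_function and model are assumed to be defined earlier in the notebook):

    from accelerate import notebook_launcher

    # spawn one training process per GPU (two T4s on a Kaggle instance)
    notebook_launcher(training_function, (model,), num_processes=2)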

    What's new?

    • Add non_blocking kwarg to send_to_device() by @NouamaneTazi in #607
    • [ds launcher] un-hijack PYTHONPATH by @stas00 in #741
    • Fix num_processes is not defined by @muellerzr in #746
    • [Device map] nn.Parameter don't have children by @patrickvonplaten in #747
    • Use HTML relative paths for tiles by @lewtun in #749
    • Add gpu_ids to SageMakerConfig though it should never be set by @muellerzr in #751
    • Change num_cpu_threads_per_process default by @muellerzr in #753
    • Return unclipped gradient from grad_clip_norm_ by @samuelstevens in #756
    • refactor by @pacman100 in #758
    • update docs by @pacman100 in #759
    • Only wrap modules in DDP if they require grad by @samuelstevens in #761
    • Move io_same_device hook to before attach_align_device hook on cpu_offload and disk_offload. by @piEsposito in #768
    • Regression cli tests by @muellerzr in #772
    • Fix number of devices in get_balanced_memory by @sgugger in #774
    • Fix all github actions issues + depreciations by @muellerzr in #773
    • Fix flakey wandb test by @muellerzr in #775
    • Add defaults for launchers by @muellerzr in #778
    • Allow BatchSamplerShard to not even out batches by @sgugger in #776
    • Make rich toggleable and seperate out a new environment utility file by @muellerzr in #779
    • Add same_network + docs by @muellerzr in #780
    • fix transformers tests by @ArthurZucker in #777
    • Add Dev Container configuration by @Chris-hughes10 in #782
    • separate dataloader generator from sampler generator by @pacman100 in #789
    • Consider top-level buffers when computing infer_auto_device_map by @younesbelkada in #792
    • Add even_batches keyword to Accelerator by @Chris-hughes10 in #781
    • Fix device_map="auto" on CPU-only envs by @sgugger in #797
    • Fix extraction of state dict in offload by @sgugger in #795
    • fix: add pdsh as default launcher by @zanussbaum in #800
    • Deal with optimizer.differentiable in PyTorch 1.13.0 by @comaniac in #803
    • Introduce a pod-config command by @muellerzr in #802
    • Refactor CLI to improve readability by @muellerzr in #810
    • adding support to pickle and unpickle AcceleratedOptimizer by @pacman100 in #811
    • add recurse argument in remove_hook_from_module by @younesbelkada in #812
    • Act on deprecations by @muellerzr in #813
    • Mlflow-tracker-v2 🔥 by @nbroad1881 in #794
    • Update CLI docs and use mps rather than mps_device by @muellerzr in #814
    • Rename pod-config to tpu-config + docs by @muellerzr in #818
    • Update docs by @muellerzr in #823
    • rename sklearn to proper dep by @muellerzr in #825
    • Rename by @muellerzr in #824
    • Update pr docs actions by @mishig25 in #827

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    • @Chris-hughes10
      • Add Dev Container configuration (#782)
      • Add even_batches keyword to Accelerator (#781)
    Source code(tar.gz)
    Source code(zip)
  • v0.13.2(Oct 17, 2022)

  • v0.13.1(Oct 7, 2022)

  • v0.13.0(Oct 5, 2022)

    Better multinode support in the launcher

    The accelerate launch command did not work well for distributed training using several machines. This is fixed in this version.

    • Use torchrun for multinode by @muellerzr in #631
    • Fix multi-node issues from launch by @muellerzr in #672

    Launch training on specific GPUs only

    Instead of prefixing your launch command with CUDA_VISIBLE_DEVICES=xxx you can now specify the GPUs you want to use in your Accelerate config.

    • Allow for GPU-ID specification on CLI by @muellerzr in #732
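
    In an Accelerate config file this shows up as the gpu_ids entry; for example (values are illustrative):

    distributed_type: MULTI_GPU
    gpu_ids: 0,1
    num_processes: 2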

    Better tracebacks and rich support

    The tracebacks are now cleaned up to avoid printing the same error several times, and rich is integrated as an optional dependency.

    • Integrate Rich into Accelerate by @muellerzr in #613
    • Make rich an optional dep by @muellerzr in #673

    What's new?

    • Fix typo in docs/index.mdx by @mishig25 in #610
    • Fix DeepSpeed CI by @muellerzr in #612
    • Added GANs example to examples by @EyalMichaeli in #619
    • Fix example by @muellerzr in #620
    • Update README.md by @ezhang7423 in #622
    • Fully remove subprocess from the multi-gpu launcher by @muellerzr in #623
    • M1 mps fixes by @pacman100 in #625
    • Fix multi-node issues and simplify param logic by @muellerzr in #627
    • update MPS support docs by @pacman100 in #629
    • minor tracker fixes for complete* examples by @pacman100 in #630
    • Put back in place the guard by @muellerzr in #634
    • make init_trackers to launch on main process by @Gladiator07 in #642
    • remove check for main process for trackers initialization by @Gladiator07 in #643
    • fix link by @philschmid in #645
    • Add static_graph arg to DistributedDataParallelKwargs. by @rom1504 in #637
    • Small nits to grad accum docs by @muellerzr in #656
    • Saving hyperparams in yaml file for Tensorboard for #521 by @Shreyz-max in #657
    • Use debug for loggers by @muellerzr in #655
    • Improve docstrings more by @muellerzr in #666
    • accelerate bibtex by @pacman100 in #660
    • Cache torch_tpu check by @muellerzr in #670
    • Manim animation of big model inference by @muellerzr in #671
    • Add aim tracker for accelerate by @muellerzr in #649
    • Specify local network on multinode by @muellerzr in #674
    • Test for min torch version + fix all issues by @muellerzr in #638
    • deepspeed enhancements and fixes by @pacman100 in #676
    • DeepSpeed launcher related changes by @pacman100 in #626
    • adding torchrun elastic params by @pacman100 in #680
    • :bug: fix by @pacman100 in #683
    • Fix skip in dispatch dataloaders by @sgugger in #682
    • Clean up DispatchDataloader a bit more by @sgugger in #686
    • rng state sync for FSDP by @pacman100 in #688
    • Fix DataLoader with samplers that are batch samplers by @sgugger in #687
    • fixing support for Apple Silicon GPU in notebook_launcher by @pacman100 in #695
    • fixing rng sync when using custom sampler and batch_sampler by @pacman100 in #696
    • Improve init_empty_weights to override tensor constructor by @thomasw21 in #699
    • override DeepSpeed grad_acc_steps from accelerator obj by @pacman100 in #698
    • [doc] Fix 404'd link in memory usage guides by @tomaarsen in #702
    • Add in report generation for test failures and make fail-fast false by @muellerzr in #703
    • Update runners with report structure, adjust env variable by @muellerzr in #704
    • docs: examples readability improvements by @ryanrussell in #709
    • docs: utils readability fixups by @ryanrussell in #711
    • refactor(test_tracking): key_occurrence readability fixup by @ryanrussell in #710
    • docs: hooks readability improvements by @ryanrussell in #712
    • sagemaker fixes and improvements by @pacman100 in #708
    • refactor(accelerate): readability improvements by @ryanrussell in #713
    • More docstring nits by @muellerzr in #715
    • Allow custom device placements for different objects by @sgugger in #716
    • Specify gradients in model preparation by @muellerzr in #722
    • Fix regression issue by @muellerzr in #724
    • Fix default for num processes by @sgugger in #726
    • Build and Release docker images on a release by @muellerzr in #725
    • Make running tests more efficient by @muellerzr in #611
    • Fix old naming by @muellerzr in #727
    • Fix issue with one-cycle logic by @muellerzr in #728
    • Remove auto-bug label in issue template by @sgugger in #735
    • Add a tutorial on proper benchmarking by @muellerzr in #734
    • Add an example zoo to the documentation by @muellerzr in #737
    • trlx by @muellerzr in #738
    • Fix memory leak by @muellerzr in #739
    • Include examples for CI by @muellerzr in #740
    • Auto grad accum example by @muellerzr in #742
  • v0.12.0(Aug 4, 2022)

    New documentation

    The whole documentation has been revamped; just go look at it here!

    • Complete revamp of the docs by @muellerzr in #495

    New gather_for_metrics method

    When doing distributed evaluation, the dataloader loops back to the beginning of the dataset so that the number of samples is a round multiple of the number of processes. As a result, the gathered predictions end up slightly longer than the dataset and used to require manual truncation. This is now all done behind the scenes if you replace the gather you did in evaluation with gather_for_metrics (see the sketch after the list below).

    • Reenable Gather for Metrics by @muellerzr in #590
    • Fix gather_for_metrics by @muellerzr in #578
    • Add a gather_for_metrics capability by @muellerzr in #540
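
    As a rough illustration, here is a minimal sketch of a distributed evaluation loop using gather_for_metrics; the tiny model and random dataset are placeholders, not part of this release.

      import torch
      from torch.utils.data import DataLoader, TensorDataset
      from accelerate import Accelerator

      accelerator = Accelerator()

      # Placeholder model and dataset, just to keep the sketch self-contained.
      model = torch.nn.Linear(4, 2)
      dataset = TensorDataset(torch.randn(10, 4), torch.randint(0, 2, (10,)))
      eval_dataloader = DataLoader(dataset, batch_size=3)

      model, eval_dataloader = accelerator.prepare(model, eval_dataloader)

      model.eval()
      all_preds, all_labels = [], []
      for inputs, labels in eval_dataloader:
          with torch.no_grad():
              logits = model(inputs)
          preds = logits.argmax(dim=-1)
          # gather_for_metrics drops the samples duplicated to round out the
          # last batch, so no manual truncation is needed afterwards.
          preds, labels = accelerator.gather_for_metrics((preds, labels))
          all_preds.append(preds)
          all_labels.append(labels)

      accuracy = (torch.cat(all_preds) == torch.cat(all_labels)).float().mean()
      accelerator.print(f"accuracy: {accuracy:.3f}")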

    Balanced device maps

    When loading big models for inference, device_map="auto" used to fill the GPUs sequentially, making it hard to use a batch size greater than 1. It now balances the weights evenly across the GPUs, so if you have more GPU memory than the model needs, you can run predictions with a bigger batch size!
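
    As an illustration (a sketch assuming the transformers library is installed; the checkpoint name is only an example), loading with device_map="auto" now spreads the weights evenly across the available GPUs:

      from transformers import AutoModelForCausalLM

      # With several GPUs available, the weights are balanced across them,
      # leaving headroom on each device for the activations of a larger batch.
      model = AutoModelForCausalLM.from_pretrained("gpt2-large", device_map="auto")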

    M1 GPU support

    Accelerate now supports M1 GPUs. To learn more about how to set up your environment, see the documentation.

    • M1 GPU mps device integration by @pacman100 in #596

    What's new?

    • Small fixed for balanced device maps by @sgugger in #583
    • Add balanced option for auto device map creation by @sgugger in #534
    • fixing deepspeed slow tests issue by @pacman100 in #604
    • add more conditions on casting by @younesbelkada in #606
    • Remove redundant .run in WandBTracker. by @zh-plus in #605
    • Fix some typos + wordings by @muellerzr in #603
    • reorg of test scripts and minor changes to tests by @pacman100 in #602
    • Move warning by @muellerzr in #598
    • Shorthand way to grab a tracker by @muellerzr in #594
    • Pin deepspeed by @muellerzr in #595
    • Improve docstring by @muellerzr in #591
    • TESTS! by @muellerzr in #589
    • Fix DispatchDataloader by @sgugger in #588
    • Use main_process_first in the examples by @muellerzr in #581
    • Skip and raise NotImplementedError for gather_for_metrics for now by @muellerzr in #580
    • minor FSDP launcher fix by @pacman100 in #579
    • Refine test in set_module_tensor_to_device by @sgugger in #577
    • Fix set_module_tensor_to_device by @sgugger in #576
    • Add 8 bit support - chapter II by @younesbelkada in #539
    • Fix tests, add wandb to gitignore by @muellerzr in #573
    • Fix step by @muellerzr in #572
    • Speed up main CI by @muellerzr in #571
    • ccl version check and import different module according to version by @sywangyi in #567
    • set default num_cpu_threads_per_process to improve oob performance by @sywangyi in #562
    • Add a tqdm helper by @muellerzr in #564
    • Rename actions to be a bit more accurate by @muellerzr in #568
    • Fix clean by @muellerzr in #569
    • enhancements and fixes for FSDP and DeepSpeed by @pacman100 in #532
    • fix: saving model weights by @csarron in #556
    • add on_main_process decorators by @ZhiyuanChen in #488
    • Update imports.py by @KimBioInfoStudio in #554
    • unpin datasets by @lhoestq in #563
    • Create good defaults in accelerate launch by @muellerzr in #553
    • Fix a few minor issues with example code in docs by @BenjaminBossan in #551
    • deepspeed version 0.6.7 fix by @pacman100 in #544
    • Rename test extras to testing by @muellerzr in #545
    • Add production testing + fix failing CI by @muellerzr in #547
    • Add a gather_for_metrics capability by @muellerzr in #540
    • Allow for kwargs to be passed to trackers by @muellerzr in #542
    • Add support for downcasting bf16 on TPUs by @muellerzr in #523
    • Add more documentation for device maps computations by @sgugger in #530
    • Restyle prepare one by @muellerzr in #531
    • Pick a better default for offload_state_dict by @sgugger in #529
    • fix some parameter setting does not work for CPU DDP and bf16 fail in… by @sywangyi in #527
    • Fix accelerate tests command by @sgugger in #528

    Significant community contributions

    The following contributors have made significant changes to the library over the last release:

    • @sywangyi
      • ccl version check and import different module according to version (#567)
      • set default num_cpu_threads_per_process to improve oob performance (#562)
      • fix some parameter setting does not work for CPU DDP and bf16 fail in… (#527)
    • @ZhiyuanChen
      • add on_main_process decorators (#488)
  • v0.11.0(Jul 18, 2022)

    Gradient Accumulation

    Accelerate now handles gradient accumulation if you want: just pass along gradient_accumulation_steps=xxx when instantiating the Accelerator and put your training loop step under a with accelerator.accumulate(model): block. Accelerate will then handle the loss re-scaling and gradient accumulation for you, avoiding slowdowns in distributed training since gradients only need to be synced when you actually want to step. More details in the documentation, and a sketch after the list below.

    • Add gradient accumulation doc by @muellerzr in #511
    • Make gradient accumulation work with dispatched dataloaders by @muellerzr in #510
    • Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
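
    A minimal sketch of the new API, with a toy model and dataset standing in for your own objects:

      import torch
      import torch.nn.functional as F
      from torch.utils.data import DataLoader, TensorDataset
      from accelerate import Accelerator

      accelerator = Accelerator(gradient_accumulation_steps=4)

      # Placeholder model, optimizer and data.
      model = torch.nn.Linear(8, 2)
      optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
      dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
      dataloader = DataLoader(dataset, batch_size=4)

      model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

      model.train()
      for inputs, targets in dataloader:
          # Gradients are only synced and the optimizer only really steps every
          # `gradient_accumulation_steps` batches; the loss is re-scaled for you.
          with accelerator.accumulate(model):
              outputs = model(inputs)
              loss = F.cross_entropy(outputs, targets)
              accelerator.backward(loss)
              optimizer.step()
              optimizer.zero_grad()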

    Support for SageMaker Data parallelism

    Accelerate now supports SageMaker's specific brand of data parallelism.

    • SageMaker enhancements to allow custom docker image, input channels referring to s3/remote data locations and metrics logging by @pacman100 in #504
    • SageMaker DP Support by @pacman100 in #494

    What's new?

    • Fix accelerate tests command by @sgugger in #528
    • FSDP integration enhancements and fixes by @pacman100 in #522
    • Warn user if no trackers are installed by @muellerzr in #524
    • Fixup all example CI tests and properly fail by @muellerzr in #517
    • fixing deepspeed multi-node launcher by @pacman100 in #514
    • Add special Parameters modules support by @younesbelkada in #519
    • Don't unwrap in save_state() by @cccntu in #489
    • Fix a bug when reduce a tensor. by @wwhio in #513
    • Add benchmarks by @sgugger in #506
    • Fix DispatchDataLoader length when split_batches=True by @sgugger in #509
    • Fix scheduler in gradient accumulation example by @muellerzr in #500
    • update dataloader wrappers to have total_batch_size attribute by @pacman100 in #493
    • Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
    • add use_distributed property by @ZhiyuanChen in #487
    • fixing fsdp autowrap functionality by @pacman100 in #475
    • Use datasets 2.2.0 for now by @muellerzr in #481
    • Rm gradient accumulation on TPU by @muellerzr in #479
    • Revert "Pin datasets for now (#477)" by @muellerzr
    • Pin datasets for now by @muellerzr in #477
    • Some typos and cosmetic fixes by @douwekiela in #472
    • Fix when TPU device check is ran by @muellerzr in #469
    • Refactor Utility Documentation by @muellerzr in #467
    • Add docbuilder to quality by @muellerzr in #468
    • Expose some is_*_available utils in docs by @muellerzr in #466
    • Cleanup CI Warnings by @muellerzr in #465
    • Link CI slow runners to the commit by @muellerzr in #464
    • Fix subtle bug in BF16 by @muellerzr in #463
    • Include bf16 support for TPUs and CPUs, and a better check for if a CUDA device supports BF16 by @muellerzr in #462
    • Handle bfloat16 weights in disk offload without adding memory overhead by @noamwies in #460
    • Handle bfloat16 weights in disk offload by @sgugger in #460
    • Raise a clear warning if a user tries to modify the AcceleratorState by @muellerzr in #458
    • Right step point by @muellerzr in #459
    • Better checks for if a TPU device exists by @muellerzr in #456
    • Offload and modules with unused submodules by @sgugger in #442
  • v0.10.0(Jun 15, 2022)

    This release adds two major new features: the DeepSpeed integration has been revamped to match the one in Transformers Trainer, with multiple new options unlocked, and the TPU integration has been sped up.

    This version also officially stops supporting Python 3.6 and requires Python 3.7+.

    DeepSpeed integration revamp

    Users can now specify a DeepSpeed config file when they want to use DeepSpeed, which unlocks many new options. More details in the new documentation.

    • Migrate HFDeepSpeedConfig from trfrs to accelerate by @pacman100 in #432
    • DeepSpeed Revamp by @pacman100 in #405

    TPU speedup

    If you're using TPUs, we have sped up the dataloaders and models quite a bit, on top of a few bug fixes.

    • Revamp TPU internals to be more efficient + enable mixed precision types by @muellerzr in #441

    What's new?

    • Fix docstring by @muellerzr in #447
    • Add psutil as depenedency by @sgugger in #445
    • fix fsdp torch version dependency by @pacman100 in #437
    • Create Gradient Accumulation Example by @muellerzr in #431
    • init by @muellerzr in #429
    • Introduce no_sync context wrapper + clean up some more warnings for DDP by @muellerzr in #428
    • updating tests to resolve runner failures wrt deepspeed revamp by @pacman100 in #427
    • Fix secrets in Docker workflow by @muellerzr in #426
    • Introduce a Dependency Checker to trigger new Docker Builds on main by @muellerzr in #424
    • Enable slow tests nightly by @muellerzr in #421
    • Push out python 3.6 + fix all tests related to the upgrade by @muellerzr in #420
    • Speedup main CI by @muellerzr in #419
    • Switch to evaluate for metrics by @sgugger in #417
    • Create an issue template for Accelerate by @muellerzr in #415
    • Introduce post-merge runners by @muellerzr in #416
    • Fix debug_launcher issues by @muellerzr in #413
    • Use main egg by @muellerzr in #414
    • Introduce nightly runners by @muellerzr in #410
    • Update requirements to pin tensorboard and include psutil by @muellerzr in #408
    • Fix CUDA examples tests by @muellerzr in #407
    • Move datasets and transformers to under func by @muellerzr in #411
    • Fix CUDA Dockerfile by @muellerzr in #409
    • Hotfix all failing GPU tests by @muellerzr in #401
    • improve metrics logged in examples by @pacman100 in #399
    • Refactor offload_state_dict and fix in offload_weight by @sgugger in #398
    • Refactor version checking into a utility by @muellerzr in #395
    • Include fastai in frameworks by @muellerzr in #396
    • Add packaging to requirements by @muellerzr in #394
    • Better dispatch for submodules by @sgugger in #392
    • Build Docker Images nightly by @muellerzr in #391
    • Small bugfix for the stalebot workflow by @muellerzr in #390
    • Introduce stalebot by @muellerzr in #387
    • Create Dockerfiles for Accelerate by @muellerzr in #377
    • Mix precision -> Mixed precision by @muellerzr in #388
    • Fix OneCycle step length when in multiprocess by @muellerzr in #385
  • v0.9.0(May 20, 2022)

    v0.9.0: Refactor utils to use in Transformers

    This release offers no significant new API; it is mainly needed to give Transformers access to some of Accelerate's utilities.

    • Handle deprication errors in launch by @muellerzr in #360
    • Update launchers.py by @tmabraham in #363
    • fix tracking by @pacman100 in #361
    • Remove tensor call by @muellerzr in #365
    • Add a utility for writing a barebones config file by @muellerzr in #371
    • fix deepspeed model saving by @pacman100 in #370
    • deepspeed save model temp fix by @pacman100 in #374
    • Refactor tests to use accelerate launch by @muellerzr in #373
    • fix zero stage-1 by @pacman100 in #378
    • fix shuffling for ShufflerIterDataPipe instances by @loubnabnl in #376
    • Better check for deepspeed availability by @sgugger in #379
    • Refactor some parts in utils by @sgugger in #380
  • v0.8.0(May 12, 2022)

    v0.8.0: Big model inference

    Big model inference

    To handle very large models, new functionality has been added in Accelerate:

    • a context manager to initialize empty models
    • a function to load a sharded checkpoint directly on the right devices
    • a set of custom hooks that allow execution of a model split on different devices, as well as CPU or disk offload
    • a magic method that auto-determines a device map for a given model, maximizing the available GPU space and RAM before using disk offload as a last resort
    • a function that wraps the last three blocks in one simple call (load_checkpoint_and_dispatch)

    See more in the documentation

    • Big model inference by @sgugger in #345
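
    A sketch of how these pieces fit together; the model class and checkpoint path below are placeholders:

      import torch
      from accelerate import init_empty_weights, load_checkpoint_and_dispatch

      # Placeholder module standing in for any large torch.nn.Module.
      class MyLargeModel(torch.nn.Module):
          def __init__(self):
              super().__init__()
              self.layers = torch.nn.Sequential(
                  *[torch.nn.Linear(1024, 1024) for _ in range(8)]
              )

          def forward(self, x):
              return self.layers(x)

      # 1. Instantiate the model without allocating any weight memory.
      with init_empty_weights():
          model = MyLargeModel()

      # 2. Load a (possibly sharded) checkpoint directly on the right devices,
      #    spilling to CPU RAM and then disk if the GPUs are too small.
      model = load_checkpoint_and_dispatch(
          model, "path/to/checkpoint", device_map="auto"
      )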

    What's new

    • Create peak_memory_uasge_tracker.py by @pacman100 in #336
    • Fixed a typo to enable running accelerate correctly by @Idodox in #339
    • Introduce multiprocess logger by @muellerzr in #337
    • Refactor utils into its own module by @muellerzr in #340
    • Improve num_processes question in CLI by @muellerzr in #343
    • Handle Manual Wrapping in FSDP. Minor fix of fsdp example. by @pacman100 in #342
    • Better prompt for number of training devices by @muellerzr in #344
    • Fix prompt for num_processes by @pacman100 in #347
    • Fix sample calculation in examples by @muellerzr in #352
    • Fixing metric eval in distributed setup by @pacman100 in #355
    • DeepSpeed and FSDP plugin support through script by @pacman100 in #356
  • v0.7.1(Apr 29, 2022)

  • v0.7.0(Apr 28, 2022)

    v0.7.0: Logging API, FSDP, batch size finder and examples revamp

    Logging API

    Use any of your favorite logging libraries (TensorBoard, Wandb, CometML...) with just a few lines of code inside your training scripts with Accelerate. All details are in the documentation.

    • Add logging capabilities by @muellerzr in https://github.com/huggingface/accelerate/pull/293
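
    A minimal sketch, assuming TensorBoard is installed; the project name and logged values are placeholders:

      from accelerate import Accelerator

      accelerator = Accelerator(log_with="tensorboard", logging_dir="runs")
      accelerator.init_trackers("my_project", config={"learning_rate": 1e-3})

      for step in range(10):
          fake_loss = 0.1 * (10 - step)  # stand-in for a real training loss
          accelerator.log({"train_loss": fake_loss}, step=step)

      accelerator.end_training()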

    Support for FSDP (Fully Sharded Data Parallel)

    PyTorch recently released a new model wrapper for sharded DDP training called FSDP. This release adds support for it (note that it doesn't work with mixed precision yet). See all caveats in the documentation.

    • PyTorch FSDP Feature Incorporation by @pacman100 in https://github.com/huggingface/accelerate/pull/321

    Batch size finder

    Say goodbye to the CUDA OOM errors with the new find_executable_batch_size decorator. Just decorate your training function and pick a starting batch size, then let Accelerate do the rest.

    • Add a memory-aware decorator for CUDA OOM avoidance by @muellerzr in https://github.com/huggingface/accelerate/pull/324
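
    A minimal sketch; the body of the decorated function is a placeholder for your own training code:

      from accelerate import Accelerator
      from accelerate.utils import find_executable_batch_size

      accelerator = Accelerator()

      @find_executable_batch_size(starting_batch_size=128)
      def train(batch_size):
          # On a CUDA out-of-memory error, the decorator retries this function
          # with the batch size halved until training fits.
          accelerator.print(f"Trying batch size {batch_size}")
          # ... build the dataloaders with `batch_size` and run the training loop ...

      train()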

    Examples revamp

    The Accelerate examples are now split in two: the base folder contains very simple NLP and computer vision examples as well as complete versions incorporating all features, while the by_feature subfolder shows you exactly what code to add for each given feature (checkpointing, tracking, cross-validation, etc.).

    • Refactor Examples by Feature by @muellerzr in https://github.com/huggingface/accelerate/pull/312

    What's Changed

    • Document save/load state by @muellerzr in https://github.com/huggingface/accelerate/pull/290
    • Refactor precisions to its own enum by @muellerzr in https://github.com/huggingface/accelerate/pull/292
    • Load model and optimizet states on CPU to void OOMs by @sgugger in https://github.com/huggingface/accelerate/pull/299
    • Fix example for datasets v2 by @sgugger in https://github.com/huggingface/accelerate/pull/298
    • Leave default as None in mixed_precision for launch command by @sgugger in https://github.com/huggingface/accelerate/pull/300
    • Pass lr_scheduler to Accelerator.prepare by @sgugger in https://github.com/huggingface/accelerate/pull/301
    • Create new TestCase classes and clean up W&B tests by @muellerzr in https://github.com/huggingface/accelerate/pull/304
    • Have custom trackers work with the API by @muellerzr in https://github.com/huggingface/accelerate/pull/305
    • Write tests for comet_ml by @muellerzr in https://github.com/huggingface/accelerate/pull/306
    • Fix training in DeepSpeed by @sgugger in https://github.com/huggingface/accelerate/pull/308
    • Update example scripts by @muellerzr in https://github.com/huggingface/accelerate/pull/307
    • Use --no_local_rank for DeepSpeed launch by @sgugger in https://github.com/huggingface/accelerate/pull/309
    • Fix Accelerate CLI CPU option + small fix for W&B tests by @muellerzr in https://github.com/huggingface/accelerate/pull/311
    • Fix DataLoader sharding for deepspeed in accelerate by @m3rlin45 in https://github.com/huggingface/accelerate/pull/315
    • Create a testing framework for example scripts and fix current ones by @muellerzr in https://github.com/huggingface/accelerate/pull/313
    • Refactor Tracker logic and write guards for logging_dir by @muellerzr in https://github.com/huggingface/accelerate/pull/316
    • Create Cross-Validation example by @muellerzr in https://github.com/huggingface/accelerate/pull/317
    • Create alias for Accelerator.free_memory by @muellerzr in https://github.com/huggingface/accelerate/pull/318
    • fix typo in docs of accelerate tracking by @loubnabnl in https://github.com/huggingface/accelerate/pull/320
    • Update examples to show how to deal with extra validation copies by @muellerzr in https://github.com/huggingface/accelerate/pull/319
    • Fixup all checkpointing examples by @muellerzr in https://github.com/huggingface/accelerate/pull/323
    • Introduce reduce operator by @muellerzr in https://github.com/huggingface/accelerate/pull/326

    New Contributors

    • @m3rlin45 made their first contribution in https://github.com/huggingface/accelerate/pull/315
    • @loubnabnl made their first contribution in https://github.com/huggingface/accelerate/pull/320
    • @pacman100 made their first contribution in https://github.com/huggingface/accelerate/pull/321

    Full Changelog: https://github.com/huggingface/accelerate/compare/v0.6.0...v0.7.0

  • v0.6.2(Mar 31, 2022)

  • v0.6.1(Mar 18, 2022)

  • v0.6.0(Mar 18, 2022)

    This release adds support for bfloat16 mixed precision training (requires PyTorch >= 1.10) and a brand-new checkpoint utility to help with resuming interrupted training runs. We also get a completely revamped documentation frontend.

    Checkpoints

    Save the current state of all your objects (models, optimizers, RNG states) with accelerator.save_state(path_to_checkpoint) and reload everything by calling accelerator.load_state(path_to_checkpoint).

    • Add in checkpointing capability by @muellerzr in https://github.com/huggingface/accelerate/pull/255
    • Implementation of saving and loading custom states by @muellerzr in https://github.com/huggingface/accelerate/pull/270
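
    A minimal sketch with placeholder objects and paths:

      import torch
      from accelerate import Accelerator

      accelerator = Accelerator()
      model = torch.nn.Linear(4, 4)
      optimizer = torch.optim.Adam(model.parameters())
      model, optimizer = accelerator.prepare(model, optimizer)

      # Save model, optimizer and RNG states in one call...
      accelerator.save_state("checkpoints/step_100")

      # ...and restore everything later to resume an interrupted run.
      accelerator.load_state("checkpoints/step_100")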

    BFloat16 support

    Accelerate now supports bfloat16 mixed precision training. As a result, the old --fp16 argument has been deprecated and replaced by the more generic --mixed_precision.

    • Add bfloat16 support #243 by @ikergarcia1996 in https://github.com/huggingface/accelerate/pull/247
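
    The same choice can be made in code when instantiating the Accelerator (a sketch, assuming PyTorch >= 1.10 and hardware with bfloat16 support):

      from accelerate import Accelerator

      # "no", "fp16" and "bf16" are the accepted values for mixed_precision.
      accelerator = Accelerator(mixed_precision="bf16")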

    New env subcommand

    You can now type accelerate env to have a copy-pastable summary of your environment and default configuration. Very convenient when opening a new issue!

    • add env command by @johnnv1 in https://github.com/huggingface/accelerate/pull/280

    New doc frontend

    The documentation has been switched to the new Hugging Face frontend, like Transformers and Datasets.

    • Convert documentation to the new front by @sgugger in https://github.com/huggingface/accelerate/pull/271

    What's Changed

    • Fix send_to_device with non-tensor data by @sgugger in https://github.com/huggingface/accelerate/pull/177
    • Handle UserDict in all utils by @sgugger in https://github.com/huggingface/accelerate/pull/179
    • Use collections.abc.Mapping to handle both the dict and the UserDict types by @mariosasko in https://github.com/huggingface/accelerate/pull/180
    • fix: use store_true on argparse in nlp example by @monologg in https://github.com/huggingface/accelerate/pull/183
    • Update README.md by @TevenLeScao in https://github.com/huggingface/accelerate/pull/187
    • Add signature check for set_to_none in Optimizer.zero_grad by @sgugger in https://github.com/huggingface/accelerate/pull/189
    • fix typo in code snippet by @MrZilinXiao in https://github.com/huggingface/accelerate/pull/199
    • Add high-level API reference to README by @Chris-hughes10 in https://github.com/huggingface/accelerate/pull/204
    • fix rng_types in accelerator by @s-kumano in https://github.com/huggingface/accelerate/pull/206
    • Pass along drop_last in DispatchDataLoader by @sgugger in https://github.com/huggingface/accelerate/pull/212
    • Rename state to avoid name conflicts with pytorch's Optimizer class. by @yuxinyuan in https://github.com/huggingface/accelerate/pull/224
    • Fix lr scheduler num samples by @sgugger in https://github.com/huggingface/accelerate/pull/227
    • Add customization point for init_process_group kwargs by @sgugger in https://github.com/huggingface/accelerate/pull/228
    • Fix typo in installation docs by @jaketae in https://github.com/huggingface/accelerate/pull/234
    • make deepspeed optimizer match parameters of passed optimizer by @jmhessel in https://github.com/huggingface/accelerate/pull/246
    • Upgrade black to version ~=22.0 by @LysandreJik in https://github.com/huggingface/accelerate/pull/250
    • add support of gather_object by @ZhiyuanChen in https://github.com/huggingface/accelerate/pull/238
    • Add launch flags --module and --no_python (#256) by @parameter-concern in https://github.com/huggingface/accelerate/pull/258
    • Accelerate + Animus/Catalyst = 🚀 by @Scitator in https://github.com/huggingface/accelerate/pull/249
    • Add debug_launcher by @sgugger in https://github.com/huggingface/accelerate/pull/259
    • enhance compatibility of honor type by @ZhiyuanChen in https://github.com/huggingface/accelerate/pull/241
    • Add a flag to use CPU only in the config by @sgugger in https://github.com/huggingface/accelerate/pull/263
    • Basic fixes for DeepSpeed by @sgugger in https://github.com/huggingface/accelerate/pull/264
    • Ability to set the seed with randomness from inside Accelerate by @muellerzr in https://github.com/huggingface/accelerate/pull/266
    • Don't use dispatch_batches when torch is < 1.8.0 by @sgugger in https://github.com/huggingface/accelerate/pull/269
    • Make accelerated model with AMP possible to pickle by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/274
    • Contributing guide by @LysandreJik in https://github.com/huggingface/accelerate/pull/254
    • replace texts and link (master -> main) by @johnnv1 in https://github.com/huggingface/accelerate/pull/282
    • Use workflow from doc-builder by @sgugger in https://github.com/huggingface/accelerate/pull/275
    • Pass along execution info to the exit of autocast by @sgugger in https://github.com/huggingface/accelerate/pull/284

    New Contributors

    • @mariosasko made their first contribution in https://github.com/huggingface/accelerate/pull/180
    • @monologg made their first contribution in https://github.com/huggingface/accelerate/pull/183
    • @TevenLeScao made their first contribution in https://github.com/huggingface/accelerate/pull/187
    • @MrZilinXiao made their first contribution in https://github.com/huggingface/accelerate/pull/199
    • @Chris-hughes10 made their first contribution in https://github.com/huggingface/accelerate/pull/204
    • @s-kumano made their first contribution in https://github.com/huggingface/accelerate/pull/206
    • @yuxinyuan made their first contribution in https://github.com/huggingface/accelerate/pull/224
    • @jaketae made their first contribution in https://github.com/huggingface/accelerate/pull/234
    • @jmhessel made their first contribution in https://github.com/huggingface/accelerate/pull/246
    • @ikergarcia1996 made their first contribution in https://github.com/huggingface/accelerate/pull/247
    • @ZhiyuanChen made their first contribution in https://github.com/huggingface/accelerate/pull/238
    • @parameter-concern made their first contribution in https://github.com/huggingface/accelerate/pull/258
    • @Scitator made their first contribution in https://github.com/huggingface/accelerate/pull/249
    • @muellerzr made their first contribution in https://github.com/huggingface/accelerate/pull/255
    • @BenjaminBossan made their first contribution in https://github.com/huggingface/accelerate/pull/274
    • @johnnv1 made their first contribution in https://github.com/huggingface/accelerate/pull/280

    Full Changelog: https://github.com/huggingface/accelerate/compare/v0.5.1...v0.6.0

  • v0.5.1(Sep 27, 2021)

    v0.5.1: Patch release

    This patch fixes the following two bugs:

    • convert_to_fp32 returned booleans instead of tensors #173
    • wrong dataloader length when dispatch_batches=True #175
  • v0.5.0(Sep 23, 2021)

    v0.5.0 Dispatch batches from main DataLoader

    This release introduces support for iterating through a DataLoader only on the main process, that then dispatches the batches to all processes.

    Dispatch batches from main DataLoader

    The motivation behind this comes from dataset streaming, which introduces two difficulties:

    • some elements of the dataset may time out and thus end up different in each launched process, so it's impossible to make sure the data is iterated through the same way on each process
    • when using an IterableDataset, each process goes through the whole dataset and thus applies the preprocessing to all elements, which can slow training down

    This new feature is activated by default for all IterableDataset instances (see the sketch after the list below).

    • Central dataloader #164 (@sgugger)
    • Dynamic default for dispatch_batches #168 (@sgugger)
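
    A sketch of opting into this behavior explicitly for a map-style dataset (it is the default for IterableDataset instances); the dataset is a placeholder:

      import torch
      from torch.utils.data import DataLoader, TensorDataset
      from accelerate import Accelerator

      accelerator = Accelerator(dispatch_batches=True)

      dataset = TensorDataset(torch.randn(32, 4))
      dataloader = DataLoader(dataset, batch_size=8)

      # Only the main process iterates the underlying DataLoader; the batches
      # are then sliced and broadcast to the other processes.
      dataloader = accelerator.prepare(dataloader)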

    Various fixes

    • fix fp16 covert back to fp32 for issue: unsupported operand type(s) for /: 'dict' and 'int' #149 (@Doragd)
    • [Docs] Machine config is yaml not json #151 (@patrickvonplaten)
    • Fix gather for 0d tensor #152 (@sgugger)
    • [DeepSpeed] allow untested optimizers deepspeed #150 (@patrickvonplaten)
    • Raise errors instead of warnings with better tests #170 (@sgugger)
  • v0.4.0(Aug 10, 2021)

    v0.4.0 Experimental DeepSpeed support

    This release adds support for DeepSpeed. While the basics are there to support ZeRO-2, ZeRO-3, as well as CPU and NVMe offload, the API might evolve a little bit as we polish it in the near future.

    It also adds support for multi-node CPU. In both cases, just filling in the questionnaire output by accelerate config and then launching your script with accelerate launch is enough; there are no changes in the main API.

    DeepSpeed support

    • Add DeepSpeed support #82 (@vasudevgupta7)
    • DeepSpeed documentation #140 (@sgugger)

    Multinode CPU support

    • Add distributed multi-node cpu only support (MULTI_CPU) #63 (@ddkalamk)

    Various fixes

    • Fix batch_sampler error for IterableDataset #62 (@ddkalamk)
    • Honor namedtuples in inputs/outputs #67 (@sgugger)
    • Fix examples README #70 (@cccntu)
    • TPU not available in kaggle #73 (@yuangan)
    • Pass args in notebook_launcher for multi-GPU #78 (@sgugger)
    • Fix accelerate test with no config file #79 (@cccntu)
    • Use optimizer for consistency #81 (@kumapo)
    • Update README.md #87 (@Separius)
    • Add unscale_gradients method. #88 (@sgugger)
    • Add Accelerator.free_memory #89 (@sgugger)
    • [Feature] Add context manager to allow main process first. #98 (@Guillem96)
    • Pass along kwargs to backward #104 (@sgugger)
    • Add course banner #107 (@sgugger)
    • added closure argument to optimizer.step() #105 (@pmelchior)
    • Fix import error for torch 1.4.0 #108 (@sgugger)
    • Unwrap optimizer before unscaling #115 (@sgugger)
    • Fix DataLoader length when split_batches=True #121 (@sgugger)
    • Fix OptimWrapper init #127 (@sgugger)
    • Fix fp16 by converting outputs back to FP32 #134 (@sgugger)
    • Add caveat on weight-tying on TPUs #138 (@sgugger)
    • Add optimizer not stepped property #139 (@sgugger)
  • v0.3.0(Apr 29, 2021)

    v0.3.0 Notebook launcher and multi-node training

    Notebook launcher

    After doing all the data preprocessing in your notebook, you can launch your training loop using the new notebook_launcher functionality. This is especially useful for Colab or Kaggle with TPUs! Here is an example on Colab (don't forget to select a TPU runtime).

    This launcher also works if you have multiple GPUs on your machine. You just have to pass along num_processes=your_number_of_gpus in the call to notebook_launcher.

    • Notebook launcher #44 (@sgugger)
    • Add notebook/colab example #52 (@sgugger)
    • Support for multi-GPU in notebook_launcher #56 (@sgugger)
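
    A minimal sketch; the training function below is a placeholder and num_processes should match your hardware:

      from accelerate import Accelerator, notebook_launcher

      def training_function():
          accelerator = Accelerator()
          accelerator.print(f"Process {accelerator.process_index} is ready.")
          # ... your actual training loop goes here ...

      # On a machine with several GPUs, pass num_processes=<number of GPUs>;
      # on a Colab/Kaggle TPU runtime, the launcher picks up the TPU itself.
      notebook_launcher(training_function, num_processes=2)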

    Multi-node training

    Our multi-node training test setup was flawed and the previous releases of 🤗 Accelerate were not working for multi-node distributed training. This is all fixed now, and we have made sure to add more robust tests!

    • fix cluster.py indent error #35 (@JTT94)
    • Set all defaults from config in launcher #38 (@sgugger)
    • Fix port in config creation #50 (@sgugger)

    Various bug fixes

    • Fix typos in examples README #28 (@arjunchandra)
    • Fix load from config #31 (@sgugger)
    • docs: minor spelling tweaks #33 (@brettkoonce)
    • Add set_to_none to AcceleratedOptimizer.zero_grad #43 (@sgugger)
    • fix #53 #54 (@Guitaricet)
    • update launch.py #58 (@Jesse1eung)
  • v0.2.1(Apr 19, 2021)

  • v0.2.0(Apr 15, 2021)

    v0.2.0 SageMaker launcher

    SageMaker launcher

    It's now possible to launch your training script on AWS instances using SageMaker via accelerate launch.

    • Launch script on SageMaker #26 (@philschmid )
    • Add defaults for compute_environmnent #23 (@sgugger )
    • Add Configuration setup for SageMaker #17 (@philschmid )

    Kwargs handlers

    To customize how the different objects used for mixed precision or distributed training are instantiated, a new API called KwargsHandler is added. It lets you pass along the kwargs that will be forwarded to those objects when they are used (and they are ignored if those objects are not used in the current setup, so the script can still run on any kind of setup). See the sketch after the list below.

    • Add KwargsHandlers #15 (@sgugger )
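
    A minimal sketch; the find_unused_parameters flag is only an example value:

      from accelerate import Accelerator, DistributedDataParallelKwargs

      # These kwargs are forwarded to torch's DistributedDataParallel wrapper
      # when DDP is used, and silently ignored otherwise.
      ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
      accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])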

    Pad across processes

    Trying to gather tensors that are not of the same size across processes resulted in a process hang. A new method, Accelerator.pad_across_processes, has been added to help with that (see the sketch after the list below).

    • Add utility to pad tensor across processes to max length #19 (@sgugger )
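
    A minimal sketch with a placeholder tensor of a different length on each process:

      import torch
      from accelerate import Accelerator

      accelerator = Accelerator()

      # Each process may end up with a different number of predictions.
      tensor = torch.randint(0, 10, (accelerator.process_index + 3,))

      # Pad every process's tensor to the same length, then gather safely.
      padded = accelerator.pad_across_processes(tensor, dim=0, pad_index=0)
      gathered = accelerator.gather(padded)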

    Various bug fixes

    • added thumbnail #25 (@philschmid )
    • Cleaner diffs in README and index #22 (@sgugger )
    • Use proper size #21 (@sgugger )
    • Alternate diff #20 (@sgugger )
    • Add YAML config support #16 (@sgugger )
    • Don't error on non-Tensors objects in move to device #13 (@sgugger )
    • Add CV example #10 (@sgugger )
    • Readme clean-up #9 (@thomwolf )
    • More flexible RNG synchronization #8 (@sgugger )
    • Fix typos and tighten grammar in README #7 (@lewtun )
    • Update README.md #6 (@voidful )
    • Fix TPU training in example #4 (@thomwolf )
    • Fix example name in README #3 (@LysandreJik )
  • v0.1.0(Mar 5, 2021)
