PyTorch Lightning Distributed Accelerators using Ray

Overview

Distributed PyTorch Lightning Training on Ray

This library adds new PyTorch Lightning plugins for distributed training using the Ray distributed computing framework.

These PyTorch Lightning Plugins on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either PyTorch Distributed Data Parallel or Horovod.

Once you add your plugin to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.

This library also comes with an integration with Ray Tune for distributed hyperparameter tuning experiments.

Installation

You can install the master branch of ray_lightning like so:

pip install git+https://github.com/ray-project/ray_lightning#ray_lightning

PyTorch Distributed Data Parallel Plugin on Ray

The RayPlugin provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.

Here is a simplified example:

import pytorch_lightning as pl
from ray_lightning import RayPlugin

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)
plugin = RayPlugin(num_workers=4, cpus_per_worker=1, use_gpu=True)

# If using GPUs, set the ``gpus`` arg to a value > 0.
# The actual number of GPUs is determined by ``num_workers``.
trainer = pl.Trainer(..., gpus=1, plugins=[plugin])
trainer.fit(ptl_model)

Because Ray launches the worker processes itself (rather than the same script being invoked multiple times), you CAN use this plugin even in cases where the standard DDPPlugin cannot be used, such as:

  • Jupyter Notebooks, Google Colab, Kaggle
  • Calling fit or test multiple times in the same script (see the sketch below)
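
For example, here is a minimal, self-contained sketch of calling fit and test repeatedly in one script. The tiny model and random data are made up for illustration; only the RayPlugin usage follows the examples in this README.

import ray
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from ray_lightning import RayPlugin


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self(x), y)

    def test_step(self, batch, batch_idx):
        x, y = batch
        self.log("test_loss", torch.nn.functional.mse_loss(self(x), y))

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def make_loader():
    # Random data, just to keep the sketch runnable.
    x, y = torch.randn(64, 8), torch.randn(64, 1)
    return DataLoader(TensorDataset(x, y), batch_size=16)


if not ray.is_initialized():
    ray.init()  # start Ray locally if it is not already running

# Calling fit/test more than once in the same script works because Ray
# launches the worker processes instead of re-invoking this script.
for run in range(2):
    model = TinyModel()
    trainer = pl.Trainer(
        max_epochs=1,
        plugins=[RayPlugin(num_workers=2, use_gpu=False)])
    trainer.fit(model, make_loader())
    trainer.test(model, make_loader())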

Horovod Plugin on Ray

Or if you prefer to use Horovod as the distributed training protocol, use the HorovodRayPlugin instead.

import pytorch_lightning as pl
from ray_lightning import HorovodRayPlugin

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)

# 2 nodes, 4 workers per node, each using 1 CPU and 1 GPU.
plugin = HorovodRayPlugin(num_hosts=2, num_slots=4, use_gpu=True)

# If using GPUs, set the ``gpus`` arg to a value > 0.
# The actual number of GPUs is determined by ``num_slots``.
trainer = pl.Trainer(..., gpus=1, plugins=[plugin])
trainer.fit(ptl_model)

Multi-node Distributed Training

Using the same examples above, you can run distributed training on a multi-node cluster with just 2 simple steps.

  1. Use Ray's cluster launcher to start a Ray cluster: ray up my_cluster_config.yaml.
  2. Execute your Python script on the Ray cluster: ray submit my_cluster_config.yaml train.py. This will rsync your training script to the head node and execute it on the Ray cluster.

You no longer have to set environment variables or configurations and run your training script on every single node.
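
For reference, here is a minimal sketch of what train.py might look like; the ray.init(address="auto") call and the worker count are assumptions, and MNISTClassifier is the hypothetical model from the examples above.

# train.py
import ray
import pytorch_lightning as pl
from ray_lightning import RayPlugin

# Attach to the running Ray cluster started by ``ray up``.
ray.init(address="auto")

ptl_model = MNISTClassifier(...)
plugin = RayPlugin(num_workers=8, cpus_per_worker=1, use_gpu=True)

trainer = pl.Trainer(..., gpus=1, plugins=[plugin])
trainer.fit(ptl_model)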

Hyperparameter Tuning with Ray Tune

ray_lightning also integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed model training. You can run multiple PyTorch Lightning training runs in parallel, each with a different hyperparameter configuration, and each training run parallelized by itself. All you have to do is move your training code to a function, pass the function to tune.run, and make sure to add the appropriate callback (Either TuneReportCallback or TuneReportCheckpointCallback) to your PyTorch Lightning Trainer.

Example using ray_lightning with Tune:

import pytorch_lightning as pl
from ray import tune
from ray_lightning import RayPlugin
from ray_lightning.tune import TuneReportCallback


def train_mnist(config):
    
    # Create your PTL model.
    model = MNISTClassifier(config)

    # Create the Tune Reporting Callback
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    callbacks = [TuneReportCallback(metrics, on="validation_end")]
    
    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        plugins=[RayPlugin(num_workers=4, use_gpu=False)])
    trainer.fit(model)
    
num_samples = 10  # how many hyperparameter configurations to try (an assumed value)

config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

# Make sure to specify how many actors each training run will create via the "extra_cpu" field.
analysis = tune.run(
        train_mnist,
        metric="loss",
        mode="min",
        config=config,
        num_samples=num_samples,
        resources_per_trial={
            "cpu": 1,
            "extra_cpu": 4
        },
        name="tune_mnist")
        
print("Best hyperparameters found were: ", analysis.best_config)

FAQ

RaySGD already has a PyTorch Lightning integration. What's the difference between this integration and that one?

The key difference is which Trainer you'll be interacting with. In this library, you will still be using PyTorch Lightning's Trainer. You'll be able to leverage all the features of PyTorch Lightning, and Ray is used just as a backend to handle distributed training.

With RaySGD's integration, you'll be converting your LightningModule to be RaySGD-compatible, and will be interacting with RaySGD's TorchTrainer. RaySGD's TorchTrainer is not as feature-rich nor as easy to use as PyTorch Lightning's Trainer (no built-in support for logging, early stopping, etc.). However, it does have built-in support for fault-tolerant and elastic training. If these are hard requirements for you, then RaySGD's integration with PTL might be a better option.

I see that RayPlugin is based on PyTorch Lightning's DDPSpawnPlugin. However, doesn't the PTL team discourage the use of spawn?

As discussed here, using a spawn approach instead of launch is not all that detrimental. The original factors for discouraging spawn were:

  1. not being able to use 'spawn' in a Jupyter or Colab notebook, and
  2. not being able to use multiple workers for data loading.

Neither of these should be an issue with the RayPlugin due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this plugin, your model does have to be serializable/pickleable.
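
If you are unsure whether your model pickles cleanly, a quick sanity check looks like the sketch below (Ray serializes objects with cloudpickle; MNISTClassifier is the hypothetical model from the examples above).

from ray import cloudpickle

model = MNISTClassifier(...)

try:
    cloudpickle.dumps(model)
except Exception as err:
    print(f"Model cannot be serialized by Ray: {err}")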

Comments
  • Unable to use both GPUs

    Hello, thanks for this amazing library for Lightning!

    I am trying to run Lightning and Ray Tune on a system with 2 GPUs. As a start, I want to use both GPUs to train 1 trial at a time.

    However, when I use

    def train_model(config):
        ...
        trainer = pl.Trainer(
            gpus=2,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=1, use_gpu=True)],
            precision=16,
        )
        trainer.fit(model, dm)
    
    if __name__ == "__main__":
    
        ray.init()
    
        config = {"batch_size": 256}
    
        analysis = tune.run(
            train_model,
            metric="loss",
            mode="min",
            config=config,
            num_samples=1,
            resources_per_trial={"gpu": 2},
            name="test",
        )
    

    I get an error about the actor or task not being able to be scheduled:

    == Status ==
    Memory usage on this node: 17.5/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 2/2 GPUs, 0.0/70.46 GiB heap, 0.0/23.58 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=109446) GPU available: True, used: True
    (pid=109446) TPU available: None, using: 0 TPU cores
    (pid=109446) Using native 16bit precision.
    (pid=109446) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
    2021-03-30 01:12:21,546 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff609630d00bec4e0790a0da3f01000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 70.556641 GiB/70.556641 GiB memory, 0.000000/2.000000 GPU, 23.583984 GiB/23.583984 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.159, 1.000000/1.000000 accelerator_type:RTX}
    . In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
    

    First, I tried changing the Trainer to use gpus=1 and num_workers=2, but I still get the same error:

        trainer = pl.Trainer(
            gpus=1,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=2, use_gpu=True)],
            precision=16,
        )
    
    == Status ==
    Memory usage on this node: 17.1/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 2/2 GPUs, 0.0/70.07 GiB heap, 0.0/23.44 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=111559) GPU available: True, used: True
    (pid=111559) TPU available: None, using: 0 TPU cores
    (pid=111559) Using native 16bit precision.
    (pid=111559) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
    2021-03-30 01:14:17,641 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff98dc17f283c4962e8f312ee401000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 70.068359 GiB/70.068359 GiB memory, 0.000000/2.000000 GPU, 1.000000/1.000000 accelerator_type:RTX, 1.000000/1.000000 node:192.168.1.159, 23.437500 GiB/23.437500 GiB object_store_memory}
    . In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
    

    Next, I reduced num_workers=2 to num_workers=1, and the error remains

        trainer = pl.Trainer(
            val_check_interval=0.1,
            gpus=1,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=1, use_gpu=True)],
            precision=16,
            progress_bar_refresh_rate=1000,  # refresh every 1000 iterations
        )
    
    == Status ==
    Memory usage on this node: 19.4/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 2/2 GPUs, 0.0/68.7 GiB heap, 0.0/23.05 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=113817) GPU available: True, used: True
    (pid=113817) TPU available: None, using: 0 TPU cores
    (pid=113817) Using native 16bit precision.
    (pid=113817) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
    2021-03-30 01:16:23,959 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffffa8b7c70cbc343023efcc60f501000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {31.000000/32.000000 CPU, 68.701172 GiB/68.701172 GiB memory, 0.000000/2.000000 GPU, 23.046875 GiB/23.046875 GiB object_store_memory, 1.000000/1.000000 accelerator_type:RTX, 1.000000/1.000000 node:192.168.1.159}
    . In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
    

    Finally, I reduced resources_per_trial to resources_per_trial={"gpu": 1}, and it runs, but it appears to be using only 1 GPU, not 2.

        trainer = pl.Trainer(
            gpus=1,
            accelerator="ddp",
            callbacks=[checkpoint_callback, tune_report_callback],
            plugins=[RayPlugin(num_workers=1, use_gpu=True)],
            precision=16,
        )
    
        analysis = tune.run(
            train_model,
            metric="loss",
            mode="min",
            config=config,
            num_samples=1,
            resources_per_trial={"gpu": 1},
            name="test",
        )
    
    == Status ==
    Memory usage on this node: 18.8/125.8 GiB
    Using FIFO scheduling algorithm.
    Resources requested: 1/32 CPUs, 1/2 GPUs, 0.0/69.24 GiB heap, 0.0/23.19 GiB objects (0/1.0 accelerator_type:RTX)
    Result logdir: /home/x/ray_results/test
    Number of trials: 1/1 (1 RUNNING)
    
    ...
    
    (pid=119147) GPU available: True, used: True
    (pid=119147) TPU available: None, using: 0 TPU cores
    (pid=119147) Using native 16bit precision.
    (pid=119147) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    (pid=119167) initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
    (pid=119167) 
    (pid=119167)   | Name  | Type          | Params
    (pid=119167) ----------------------------------------
    (pid=119167) 0 | model | TestPLModel00 | 3.7 K 
    (pid=119167) ----------------------------------------
    (pid=119167) 3.7 K     Trainable params
    (pid=119167) 0         Non-trainable params
    (pid=119167) 3.7 K     Total params
    (pid=119167) 0.015     Total estimated model params size (MB)
    

    I am using

    • pytorch 1.7.0
    • pytorch-lightning 1.2.5
    • ray 1.2.0
    • ray-lightning 0.0.1

    What should be the correct way to train a single trial using both GPU devices?
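
    A sketch of one possible way to request the resources for both GPU workers, assuming the get_tune_ddp_resources helper from ray_lightning.tune (which also appears in a later issue below) reserves the per-worker CPUs/GPUs for the trial:

    from ray_lightning.tune import get_tune_ddp_resources

    # Keep RayPlugin(num_workers=2, use_gpu=True) inside train_model, and let
    # Tune reserve the matching worker resources for the trial.
    analysis = tune.run(
        train_model,
        metric="loss",
        mode="min",
        config=config,
        num_samples=1,
        resources_per_trial=get_tune_ddp_resources(num_workers=2, use_gpu=True),
        name="test",
    )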

    opened by nyxynyx 36
  • NCCL peer access is not supported error

    Just as the training was about to start (nvidia-smi shows the GPU memory filling up), there's a new error:

    RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729062494/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled cuda error, NCCL version 2.7.8
    

    Starting the Python training script using NCCL_IB_DISABLE=1 python tune.py does not help.

    Enabled debug messages using NCCL_DEBUG="INFO" NCCL_IB_DISABLE=1 python tune.py, and the following new messages appeared:

    (pid=909597) RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1603729062494/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled cuda error, NCCL version 2.7.8
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Channel 00/02 :    0   1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Channel 01/02 :    0   1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO Channel 00 : 0[2d000] -> 1[2e000] via P2P/IPC
    (pid=909608) 
    (pid=909608) z-pc:909608:910337 [0] transport/p2p.cc:238 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO transport.cc:68 -> 1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO init.cc:766 -> 1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO init.cc:840 -> 1
    (pid=909608) z-pc:909608:910337 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO Channel 00 : 1[2e000] -> 0[2d000] via P2P/IPC
    (pid=909611) 
    (pid=909611) z-pc:909611:910338 [0] transport/p2p.cc:238 NCCL WARN failed to open CUDA IPC handle : 217 peer access is not supported between these two devices
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO transport.cc:68 -> 1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO init.cc:766 -> 1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO init.cc:840 -> 1
    (pid=909611) z-pc:909611:910338 [0] NCCL INFO group.cc:73 -> 1 [Async thread]
    2021-03-30 18:20:32,548 ERROR trial_runner.py:616 -- Trial train_model_190f1_00000: Error processing event.
    $ nvidia-smi topo -m
    	GPU0	GPU1	CPU Affinity	NUMA Affinity
    GPU0	 X 	PHB	0-31		N/A
    GPU1	PHB	 X 	0-31		N/A
    

    Originally posted by @nyxynyx

    opened by amogkam 22
  • question: Do I need to use ray.init() before using the Ray Accelerator?

    Hey, Thank you for creating a needed library.

    I am very new to using Ray, and I already had a project built around PL. I looked around for how to add a Ray distributed training backend to my project, and I found this library, which lets me keep using the PL Trainer.

    Now I am trying to use the accelerator on my local machine, but I have failed to do so. I think it's a really simple issue due to my lack of knowledge.

    This is the bit where I add the accelerator:

    if accelerator_use:
        ray.init()
        accelerator = RayAccelerator(num_workers=4, cpus_per_worker=1, use_gpu=True)
    else:
        accelerator = None
    

    I tried without calling ray.init() and got an error, and when I add ray.init() I get this:

    2021-03-06 14:54:45,217 WARNING worker.py:1107 -- The actor or task with ID ffffffffffffffff63964fa4841d4a2ecb45751801000000 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000} for placement, but this node only has remaining {7.000000/8.000000 CPU, 7.177734 GiB/7.177734 GiB memory, 0.000000/1.000000 GPU, 1.000000/1.000000 node:172.20.10.2, 2.441406 GiB/2.441406 GiB object_store_memory}
    . In total there are 0 pending tasks and 6 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
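
    The warning above shows pending actors that each require 1 GPU on a node with a single GPU, so not every requested worker can be scheduled. A sketch of a configuration that fits a one-GPU machine, assuming the RayAccelerator arguments from the snippet above:

    import ray
    from ray_lightning import RayAccelerator

    ray.init()  # start Ray locally, as in the snippet above

    # One GPU worker (or use_gpu=False with several CPU workers), so that
    # every actor can actually be scheduled on a single-GPU machine.
    accelerator = RayAccelerator(num_workers=1, cpus_per_worker=1, use_gpu=True)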
    
    opened by MohammedAljahdali 12
  • Compatibility with PTL 1.6

    Todos:

    • [x] Check if we need to_state_stream / load_state_stream P(0)
    • [x] Check multi node (P0)
    • [x] Check multi GPU/multi node (P0)
    • [x] Fix / change tests (P0)
    • [x] Check that recent PRs are included, e.g. https://github.com/ray-project/ray_lightning/pull/156 P(0.5-1)
    • [x] Check Ray client (P1)
    • [x] Check fractional GPUs (P2)
    • [x] DDP sharded (P2)
    opened by krfricke 11
  • ray ddp fails with 2 gpu workers

      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
        self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
        dirpath = self.strategy.broadcast(dirpath)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 215, in broadcast
        torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1817, in broadcast_object_list
        broadcast(object_sizes_tensor, src=src, group=group)
      File "/home/ubuntu/anaconda3/envs/automm-dev-pl-latest/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1159, in broadcast
        work = default_pg.broadcast([tensor], opts)
    RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:891, internal error, NCCL version 21.0.3
    ncclInternalError: Internal check failed. This is either a bug in NCCL or due to memory corruptio
    

    Use this branch: https://github.com/sxjscience/autogluon/tree/kaggle_california_house and install autogluon via bash full_install.sh. Afterwards, try this script: https://gist.github.com/sxjscience/53bc799e37cc0680ca9e53c2fea75cd7. Internally, the Ray strategy is constructed here: https://github.com/sxjscience/autogluon/blob/59f01b95381fba5651db17fd98fa84164ad168c2/multimodal/src/autogluon/multimodal/predictor.py#L1036-L1052.

    opened by JiahaoYao 10
  • [Tune] Ray Tune + Ray Lightning too many tasks warning

    I noticed this warning constantly being logged while using Ray Tune + Ray Lightning. For example: Warning: More than 20000 tasks are pending submission to actor 386ebf690ec87ad0d825174701000000. To reduce memory usage, wait for these tasks to finish before sending more.

    Do I need to worry about it?

    opened by yinweisu 10
  • CUDA devices are not exposed when running in DDP mode with multiple GPUs

    Hi all!

    First of all thanks for this great project!

    To my issue: when I tried your example for hyperparameter tuning, I discovered that it only worked when using the CPU. After some digging, I found out that the problem is related to get_tune_ddp_resources. Since the head_bundle only requests a CPU, it does not expose the required CUDA devices for the child bundles. Therefore, Lightning fails to run on GPU(s) since CUDA_VISIBLE_DEVICES is not set.

    As a workaround, I have added

    import os

    if use_gpu:
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(
            [str(i) for i in range(num_workers)])
    

    in __init__ of RayPlugin. It gets the job done but seems incredibly hacky. Is there some other way around this?

    Thanks in advance!

    opened by MarkusSpanring 10
  • Does not appear to be compatible with the current version of Lightning

    I was excited to try this out but the code appears to not be working due to a missing import:

    Traceback (most recent call last):
      File "...", line 9, in <module>
        from ray_lightning import RayAccelerator
      File ".../lib/python3.8/site-packages/ray_lightning/__init__.py", line 1, in <module>
        from ray_lightning.ray_ddp import RayAccelerator
      File ".../lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 8, in <module>
        from pytorch_lightning.accelerators import DDPSpawnAccelerator
    ImportError: cannot import name 'DDPSpawnAccelerator' from 'pytorch_lightning.accelerators' (.../lib/python3.8/site-packages/pytorch_lightning/accelerators/__init__.py)
    
    opened by import-antigravity 10
  • [Windows] RuntimeError: Distributed package doesn't have NCCL built in

    Hey, I am having an issue: when I run trainer.fit with the accelerator, I get the following error:

    2021-03-08 13:45:49,085 INFO services.py:1172 -- View the Ray dashboard at http://127.0.0.1:8265
    GPU available: True, used: True
    TPU available: None, using: 0 TPU cores
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    Using native 16bit precision.
    Global seed set to 1234
    initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
    Traceback (most recent call last):
      File "train.py", line 26, in cli_main
        train(None, cfg)
      File "train.py", line 102, in train
        trainer.fit(model, datamodule=dm)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 510, in fit
        results = self.accelerator_backend.train()
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 184, in train
        results = process_results(futures, queue)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\util.py", line 103, in process_results
        ray.get(ready)
      File "C:\Users\Mohammed\AppData\Roaming\Python\Python38\site-packages\ray\_private\client_mode_hook.py", line 47, in wrapper
        return func(*args, **kwargs)
      File "C:\Users\Mohammed\AppData\Roaming\Python\Python38\site-packages\ray\worker.py", line 1456, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(RuntimeError): ray::RayExecutor.execute() (pid=27620, ip=192.168.8.100)
      File "python\ray\_raylet.pyx", line 480, in ray._raylet.execute_task
      File "python\ray\_raylet.pyx", line 432, in ray._raylet.execute_task.function_executor
      File "C:\Users\Mohammed\AppData\Roaming\Python\Python38\site-packages\ray\function_manager.py", line 556, in actor_method_executor
        return method(__ray_actor, *args, **kwargs)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 31, in execute
        return fn(*args, **kwargs)
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 218, in train_remote
        super(RayAccelerator, self).ddp_train(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\pytorch_lightning\accelerators\ddp_spawn_accelerator.py", line 127, in ddp_train
        self.init_ddp_connection(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\ray_lightning\ray_ddp.py", line 232, in init_ddp_connection
        torch.distributed.init_process_group(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\torch\distributed\distributed_c10d.py", line 503, in init_process_group
        _update_default_pg(_new_process_group_helper(
      File "C:\Users\Mohammed\AppData\Local\conda\conda\envs\htts\lib\site-packages\torch\distributed\distributed_c10d.py", line 597, in _new_process_group_helper
        raise RuntimeError("Distributed package doesn't have NCCL "
    RuntimeError: Distributed package doesn't have NCCL built in
    
    opened by MohammedAljahdali 9
  • Optimising with respect to the epoch that scored highest for a trial (instead of the last epoch)

    Hi,

    I am using metric="acc" for ray.tune, with mode="max". However, I think that the score for the last epoch is being used as the "best" score for the trial.

    E.g., for trial train_mnist_c1a02_00000 from the following console output, the reported acc is 0.942736:

    +-------------------------+------------+-------+--------------+-----------+-----------+-------------+--------+------------------+----------+----------+
    | Trial name              | status     | loc   |   batch_size |   layer_1 |   layer_2 |          lr |   iter |   total time (s) |      loss |      acc |
    |-------------------------+------------+-------+--------------+-----------+-----------+-------------+--------+------------------+-----------+----------|
    | train_mnist_c1a02_00000 | TERMINATED |       |           64 |       128 |       256 | 0.000120742 |     16 |    166.36  | -0.938587 | 0.942736 |
    | train_mnist_c1a02_00001 | TERMINATED |       |          128 |       128 |        64 | 0.000120068 |     16 |    138.23  | -0.923084 | 0.929161 |
    | train_mnist_c1a02_00002 | TERMINATED |       |           64 |        32 |       256 | 0.000308457 |     16 |    168.73  | -0.942267 | 0.945811 |
    | train_mnist_c1a02_00003 | TERMINATED |       |           64 |        32 |       256 | 0.0927983   |     16 |    162.749 | -0.103807 | 0.103807 |
    +-------------------------+------------+-------+--------------+-----------+-----------+-------------+--------+------------------+-----------+----------+
    

    However, when looking at progress.csv in tune_mnist/train_mnist_c1a02_00000_0_batch_size=64,layer_1=128,layer_2=256,lr=0.00012074_2021-08-25_20-18-14, the highest acc is 0.9441488981246948 (where 0.942736 is the score for the last epoch):

    loss,acc,time_this_iter_s,done,timesteps_total,episodes_total,training_iteration,experiment_id,date,timestamp,time_total_s,pid,hostname,node_ip,time_since_restore,timesteps_since_restore,iterations_since_restore,trial_id
    -0.7927070260047913,0.8189826607704163,28.36010980606079,False,,,1,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-18-49,1629886729,28.36010980606079,6438,b078,10.141.1.144,28.36010980606079,0,1,c1a02_00000
    -0.8159223198890686,0.829454779624939,9.054965734481812,False,,,2,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-18-58,1629886738,37.4150755405426,6438,b078,10.141.1.144,37.4150755405426,0,2,c1a02_00000
    -0.8259921669960022,0.835106372833252,9.04287576675415,False,,,3,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-07,1629886747,46.45795130729675,6438,b078,10.141.1.144,46.45795130729675,0,3,c1a02_00000
    -0.831539511680603,0.8390957117080688,8.813407182693481,False,,,4,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-16,1629886756,55.271358489990234,6438,b078,10.141.1.144,55.271358489990234,0,4,c1a02_00000
    -0.8353389501571655,0.8413397073745728,9.657632112503052,False,,,5,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-26,1629886766,64.92899060249329,6438,b078,10.141.1.144,64.92899060249329,0,5,c1a02_00000
    -0.896843671798706,0.9082446694374084,9.233526706695557,False,,,6,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-35,1629886775,74.16251730918884,6438,b078,10.141.1.144,74.16251730918884,0,6,c1a02_00000
    -0.9138997197151184,0.9224567413330078,9.06982707977295,False,,,7,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-44,1629886784,83.23234438896179,6438,b078,10.141.1.144,83.23234438896179,0,7,c1a02_00000
    -0.919984757900238,0.9273603558540344,9.12305760383606,False,,,8,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-19-53,1629886793,92.35540199279785,6438,b078,10.141.1.144,92.35540199279785,0,8,c1a02_00000
    -0.9245654940605164,0.9311003684997559,9.59031629562378,False,,,9,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-03,1629886803,101.94571828842163,6438,b078,10.141.1.144,101.94571828842163,0,9,c1a02_00000
    -0.9273290038108826,0.9330950379371643,9.259892702102661,False,,,10,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-12,1629886812,111.20561099052429,6438,b078,10.141.1.144,111.20561099052429,0,10,c1a02_00000
    -0.9301624298095703,0.935339093208313,9.178678750991821,False,,,11,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-21,1629886821,120.38428974151611,6438,b078,10.141.1.144,120.38428974151611,0,11,c1a02_00000
    -0.9327710270881653,0.9389959573745728,9.190800189971924,False,,,12,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-30,1629886830,129.57508993148804,6438,b078,10.141.1.144,129.57508993148804,0,12,c1a02_00000
    -0.9332801103591919,0.9382479786872864,9.453409433364868,False,,,13,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-40,1629886840,139.0284993648529,6438,b078,10.141.1.144,139.0284993648529,0,13,c1a02_00000
    -0.9350691437721252,0.9409075379371643,8.920868873596191,False,,,14,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-49,1629886849,147.9493682384491,6438,b078,10.141.1.144,147.9493682384491,0,14,c1a02_00000
    -0.9383015632629395,0.9441488981246948,9.268950700759888,False,,,15,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-20-58,1629886858,157.21831893920898,6438,b078,10.141.1.144,157.21831893920898,0,15,c1a02_00000
    -0.9385870099067688,0.942736029624939,9.141263961791992,False,,,16,daa92b03869b42409a0f9a69b6d7918d,2021-08-25_20-21-07,1629886867,166.35958290100098,6438,b078,10.141.1.144,166.35958290100098,0,16,c1a02_00000
    
    

    I see that this was an issue here: https://github.com/ray-project/ray/issues/5174, but it isn't clear what the fix was.

    Thanks.
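
    For reference, a sketch of one way to look at the best score across all reported epochs rather than just the last one, using the analysis object returned by tune.run in the script below (this assumes Ray Tune's scope argument on get_best_trial and the Trial.metric_analysis attribute):

    best_trial = analysis.get_best_trial(metric="acc", mode="max", scope="all")
    print("Config of the trial with the best epoch:", best_trial.config)
    print("Best acc over all epochs:", best_trial.metric_analysis["acc"]["max"])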

    Source:

    """Simple example using RayAccelerator and Ray Tune"""
    
    from pl_bolts.datamodules.mnist_datamodule import MNISTDataModule
    from ray import tune
    from ray_lightning.tests.utils import LightningMNISTClassifier
    from ray_lightning.tune import TuneReportCallback, get_tune_ddp_resources
    from ray_lightning import RayPlugin
    import os
    import pytorch_lightning as pl
    import ray
    
    DATA_DIR = "/datasets/work/hb-mlaifsp-mm/source/Datasets/mnist"
    NUM_WORKERS = 1
    NUM_SAMPLES = 4
    MAX_EPOCHS = 16
    USE_GPU = True
    
    def train_mnist(config):
    
    
        model = LightningMNISTClassifier(config, DATA_DIR)
    
        metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
        callbacks = [TuneReportCallback(metrics, on="validation_end")]
    
        trainer = pl.Trainer(
            max_epochs=MAX_EPOCHS,
            callbacks=callbacks,
            progress_bar_refresh_rate=0,
            plugins=[RayPlugin(num_workers=NUM_WORKERS, use_gpu=USE_GPU)],
        )
    
        dm = MNISTDataModule(data_dir=DATA_DIR, num_workers=NUM_WORKERS, batch_size=config["batch_size"])
        trainer.fit(model, dm)
    
    def tune_mnist():
        config = {
            "layer_1": tune.choice([32, 64, 128]),
            "layer_2": tune.choice([64, 128, 256]),
            "lr": tune.loguniform(1e-4, 1e-1),
            "batch_size": tune.choice([32, 64, 128]),
        }
    
        ray.init()
        analysis = tune.run(
            train_mnist,
            metric="acc",
            mode="max",
            local_dir=os.getcwd(),
            config=config,
            num_samples=NUM_SAMPLES,
            resources_per_trial=get_tune_ddp_resources(num_workers=NUM_WORKERS, use_gpu=USE_GPU),
            name="tune_mnist",
        )
        print("Best hyperparameters found were: ", analysis.best_config)
    
    if __name__ == "__main__":
        tune_mnist()
    
    opened by anicolson 8
  • Cloudpickle Dataset deserialization error

    Hi,

    When I try to run the code with RayPlugin in my tests, I get the following error:

    (pid=2127144) 2021-07-14 06:16:52,345   ERROR serialization.py:250 -- No module named 'test_runner'
    (pid=2127144) Traceback (most recent call last):
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 248, in deserialize_objects
    (pid=2127144)     obj = self._deserialize_object(data, metadata, object_ref)
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 190, in _deserialize_object
    (pid=2127144)     return self._deserialize_msgpack_data(data, metadata_fields)
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 168, in _deserialize_msgpack_data
    (pid=2127144)     python_objects = self._deserialize_pickle5_data(pickle5_data)
    (pid=2127144)   File "/home/rizhiy/miniconda3/envs/ntf/lib/python3.8/site-packages/ray/serialization.py", line 158, in _deserialize_pickle5_data
    (pid=2127144)     obj = pickle.loads(in_band)
    (pid=2127144) ModuleNotFoundError: No module named 'test_runner'
    

    test_runner is the name of my testing script, but I'm not sure why you would need to serialize anything inside it. It just loads data and calls Runner, which is my wrapper around pl.Trainer.

    How should I properly launch the training?
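
    A sketch of one possible workaround, assuming a recent Ray version with runtime_env support: make the module that defines your objects importable inside the Ray worker processes that deserialize them.

    import ray

    # Ship the current directory (including test_runner.py) to the Ray workers
    # so that pickled objects referencing that module can be deserialized there.
    ray.init(runtime_env={"working_dir": "."})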

    opened by Rizhiy 8
  • Bump pytorch-lightning from 1.6.4 to 1.8.6

    Bumps pytorch-lightning from 1.6.4 to 1.8.6.

    Release notes

    Sourced from pytorch-lightning's releases.

    Weekly patch release

    App

    Added

    • Added partial support for fastapi Request annotation in configure_api handlers (#16047)
    • Added a nicer UI with URL and examples for the autoscaler component (#16063)
    • Enabled users to have more control over scaling out/in intervals (#16093)
    • Added more datatypes to the serving component (#16018)
    • Added work.delete method to delete the work (#16103)
    • Added display_name property to LightningWork for the cloud (#16095)
    • Added ColdStartProxy to the AutoScaler (#16094)
    • Added status endpoint, enable ready (#16075)
    • Implemented ready for components (#16129)

    Changed

    • The default start_method for creating Work processes locally on macOS is now 'spawn' (previously 'fork') (#16089)
    • The utility lightning.app.utilities.cloud.is_running_in_cloud now returns True during the loading of the app locally when running with --cloud (#16045)
    • Updated Multinode Warning (#16091)
    • Updated app testing (#16000)
    • Changed overwrite to True (#16009)
    • Simplified messaging in cloud dispatch (#16160)
    • Added annotations endpoint (#16159)

    Fixed

    • Fixed PythonServer messaging "Your app has started" (#15989)
    • Fixed auto-batching to enable batching for requests coming even after the batch interval but is in the queue (#16110)
    • Fixed a bug where AutoScaler would fail with min_replica=0 (#16092)
    • Fixed a non-thread safe deepcopy in the scheduler (#16114)
    • Fixed HTTP Queue sleeping for 1 sec by default if no delta was found (#16114)
    • Fixed the endpoint info tab not showing up in the AutoScaler UI (#16128)
    • Fixed an issue where an exception would be raised in the logs when using a recent version of streamlit (#16139)
    • Fixed e2e tests (#16146)

    Full Changelog: https://github.com/Lightning-AI/lightning/compare/1.8.5.post0...1.8.6

    Minor patch release

    App

    • Fixed install/upgrade - removing single quote (#16079)
    • Fixed bug where components that are re-instantiated several times failed to initialize if they were modifying self.lightningignore (#16080)
    • Fixed a bug where apps that had previously been deleted could not be run again from the CLI (#16082)

    Pytorch

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 0
  • Make `_GPUAccelerator.get_parallel_devices` fit `GPUAccelerator` API

    Closes https://github.com/ray-project/ray_lightning/issues/235

    The PTL Accelerator API is expected to return a List, and not None. This PR updates our _GPUAccelerator abstraction to fit this API.

    opened by amogkam 0
  • Support string based GPU ids

    GPU device ids can be specified with an integer index, but may also be specified as strings.

    This PR ensures that both cases are supported by root_device. The code is taken from what is being done in Ray Train: https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/train/torch/train_loop_utils.py?L470-498

    Closes https://github.com/ray-project/ray_lightning/issues/236

    opened by amogkam 0
  • Update protobuf requirement from <=3.20.1 to <4.21.13

    Updates the requirements on protobuf to permit the latest version.

    Commits

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 0
  • Multi-GPU training fails with `ValueError` on systems with UUID GPU IDs

    I'm currently trying to use ray_lightning to distribute model training over the resources in my ray cluster, like so:

    ngpu = int(ray.cluster_resources().get("GPU", 0))
    use_gpu = ngpu > 0
    num_workers = ngpu
    ncpu = 8
    strategy = RayStrategy(num_workers, ncpu, use_gpu, find_unused_parameters=False)
    # define dataloaders
    # define callbacks
    trainer = PlTrainer(
        logger=False,
        max_epochs=50,
        callbacks=callbacks,
        gpus=1,
        enable_model_summary=False,
        enable_checkpointing=False,
        strategy=strategy,
    )
    trainer.fit(lit_model, train_dataloader, val_dataloader)
    

    However, this code results in a ValueError:

      File "/home/gridsan/dgraff/molpal/molpal/models/mpnmodels.py", line 207, in train
        trainer.fit(lit_model, train_dataloader, val_dataloader)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
        self._call_and_handle_interrupt(
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
        return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 58, in launch
        ray_output = self.run_function_on_workers(
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 249, in run_function_on_workers
        results = process_results(self._futures, self.tune_queue)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/util.py", line 64, in process_results
        ray.get(ready)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
        return func(*args, **kwargs)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray/_private/worker.py", line 2289, in get
        raise value.as_instanceof_cause()
    ray.exceptions.RayTaskError(ValueError): ray::RayExecutor.execute() (pid=49053, ip=172.31.130.105, repr=<ray_lightning.launchers.utils.RayExecutor object at 0x7f392469a6d0>)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/utils.py", line 52, in execute
        return fn(*args, **kwargs)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/launchers/ray_launcher.py", line 295, in _wrapping_function
        self._strategy._worker_setup(process_idx=global_rank)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 170, in _worker_setup
        self._process_group_backend = self._get_process_group_backend()
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp_spawn.py", line 166, in _get_process_group_backend
        or get_default_process_group_backend_for_device(self.root_device)
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 295, in root_device
        cuda_visible_list = [
      File "/home/gridsan/dgraff/.conda/envs/molpal/lib/python3.8/site-packages/ray_lightning/ray_ddp.py", line 296, in <listcomp>
        int(dev) for dev in cuda_visible_str.split(",")
    ValueError: invalid literal for int() with base 10: 'GPU-dade4b6e-8461-eee0-e8bb-4f7e570856f4'
    

    It seems like the internal code relies on an ordinal GPU device naming scheme. I.e.,

    $ echo $CUDA_VISIBLE_DEVICES
    0,1
    

    which seems reasonable, given that's what I typically encounter on most systems. But on my system, the GPU device naming looks something like this:

    $ echo $CUDA_VISIBLE_DEVICES
    GPU-23c5e712-9b16-e21a-df00-7dab564ade42,GPU-cdaae969-b14c-6b80-2fa2-de8e9efe87a1
    

    So it seems like there are two options:

    1. I could ask my sys-admins to rename the GPUs on the cluster to the more "standard" ordinal scheme. They'll probably tell me "No." and reference the CUDA_VISIBLE_DEVICES specification, where it states that device names of the form GPU-<UUID> are a valid option in addition to integer indices.
    2. This block of code in ray_lightning/ray_ddp.py#L292 is altered:
    gpu_id = ray.get_gpu_ids()[0]  # NOTE: this value is cast to `int(...)` in the main branch. The code would break _here_ in the current code but breaks later in v0.3
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        cuda_visible_list = [
            int(dev) for dev in cuda_visible_str.split(",")
        ]
        device_id = cuda_visible_list.index(gpu_id)
        return torch.device("cuda", device_id)
    

    I think the block should be changed to:

    gpu_id = ray.get_gpu_ids()[0]
    cuda_visible_str = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if cuda_visible_str and cuda_visible_str != "NoDevFiles":
        cuda_visible_list = list(cuda_visible_str.split(","))
        device_id = cuda_visible_list.index(gpu_id)
        return torch.device("cuda", device_id)
    

    Thanks for the great work so far!

    opened by davidegraff 1
  • TypeError in a SLURM environment due to internal API break

    Using the master branch of ray-lightning with pytorch-lightning v1.6 in a SLURM environment leads to the following exception:

    ray.exceptions.RayTaskError(TypeError): ray::ImplicitFunc.train() (pid=117539, ip=10.181.76.37, repr=train)
      File ".../lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 367, in train
        raise skipped from exception_cause(skipped)
      File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 335, in entrypoint
        return self._trainable_func(
      File ".../lib/python3.9/site-packages/ray/tune/trainable/function_trainable.py", line 652, in _trainable_func
        output = fn()
      File ".../random_search.py", line 122, in train
        trainer = Trainer(
      File ".../lib/python3.9/site-packages/pytorch_lightning/utilities/argparse.py", line 339, in insert_env_defaults
        return fn(self, **kwargs)
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in __init__
        self._accelerator_connector = AcceleratorConnector(
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 204, in __init__
        self.cluster_environment: ClusterEnvironment = self._choose_and_init_cluster_environment()
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 549, in _choose_and_init_cluster_environment
        if self._is_slurm_managing_tasks():
      File ".../lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 562, in _is_slurm_managing_tasks
        total_requested_devices = len(self._parallel_devices) * self._num_nodes_flag
    TypeError: object of type 'NoneType' has no len()
    

    The _GPUAccelerator.get_parallel_devices method breaks the internal PyTorch Lightning API by returning None in some cases; is this intentional? Returning an empty List instead of None fixes my issue, but I don't know if None is required in other ray-lightning use cases.

    I would be more than happy to provide a PR if you think the fix is fine.

    Thank you for this very convenient package and keep up the fantastic work!

    opened by dcfidalgo 2
Releases(v0.3.0)
  • v0.3.0(Aug 23, 2022)

    What's Changed

    • Bump version for development by @amogkam in https://github.com/ray-project/ray_lightning/pull/122
    • Update README to render on Ray docs by @amogkam in https://github.com/ray-project/ray_lightning/pull/135
    • Fix bash code block in Readme by @Yard1 in https://github.com/ray-project/ray_lightning/pull/136
    • Fix for fractional GPU by @amogkam in https://github.com/ray-project/ray_lightning/pull/125
    • Update broken PTL link by @amogkam in https://github.com/ray-project/ray_lightning/pull/137
    • Fix hanging trainer.test() by @amogkam in https://github.com/ray-project/ray_lightning/pull/142
    • Fix ray_ddp_sharded_example by @chongxiaoc in https://github.com/ray-project/ray_lightning/pull/153
    • Pop kwargs to support LightningCLI by @amogkam in https://github.com/ray-project/ray_lightning/pull/154
    • ray_ddp: support logged_metrics as part of remote worker return value by @chongxiaoc in https://github.com/ray-project/ray_lightning/pull/156
    • Support PyTorch Lightning 1.6 by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/163
    • Fix docs formatting by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/188
    • fix issue #189 by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/190
    • [Ray lightning 1.6] update the change according to the comment in #163 by @JiahaoYao in https://github.com/ray-project/ray_lightning/pull/195

    New Contributors

    • @Yard1 made their first contribution in https://github.com/ray-project/ray_lightning/pull/136
    • @chongxiaoc made their first contribution in https://github.com/ray-project/ray_lightning/pull/153
    • @JiahaoYao made their first contribution in https://github.com/ray-project/ray_lightning/pull/163

    Full Changelog: https://github.com/ray-project/ray_lightning/compare/0.2.0...v0.3.0

    Source code(tar.gz)
    Source code(zip)
    ray_lightning-0.3.0-py3-none-any.whl(49.62 KB)
  • 0.2.0(Feb 2, 2022)

    • Support for PyTorch Lightning v1.5 (#115, #121)!
    • Update HorovodRayPlugin API to match the new Horovod on Ray API. num_hosts and num_slots args have been deprecated in favor of a generic num_workers arg (#71).
    • get_tune_ddp_resources has been renamed to get_tune_resources and can now be used for both RayPlugin and HorovodRayPlugin (#71).
    • Rename the cpus_per_worker arg in get_tune_resources utility to num_cpus_per_worker to match the arg name in RayPlugin (#96).
    • Annotate the APIs as beta (#88).
    Source code(tar.gz)
    Source code(zip)
  • 0.1.1(Aug 20, 2021)

  • 0.1.0(Aug 12, 2021)
