TorchX

TorchX is a library containing standard DSLs for authoring and running PyTorch related components for an E2E production ML pipeline.

For the latest documentation, please refer to our website.

Requirements

TorchX SDK (torchx):

  • python3 (3.8+)
  • torch

TorchX Kubeflow Pipelines Support (torchx-kfp):

  • torchx
  • kfp

Installation

# install torchx sdk and CLI
pip install torchx

# install torchx kubeflow pipelines (kfp) support
pip install "torchx[kfp]"

Quickstart

See the quickstart guide.
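
For a rough feel of the programmatic API, the sketch below builds a trivial AppDef and launches it on the local_cwd scheduler. This is a hedged sketch based on my reading of the torchx.specs and torchx.runner APIs (argument names and defaults may differ slightly between releases); the quickstart guide remains the authoritative walkthrough.

from torchx import specs
from torchx.runner import get_runner

# A single-role app that just runs `echo hello world` once.
app = specs.AppDef(
    name="echo",
    roles=[
        specs.Role(
            name="echo",
            image="/tmp",  # for local_cwd the image is just a placeholder; it runs from the current working directory
            entrypoint="echo",
            args=["hello world"],
            num_replicas=1,
        )
    ],
)

runner = get_runner()
app_handle = runner.run(app, scheduler="local_cwd")
print(app_handle)               # e.g. local_cwd://torchx/echo-...
print(runner.wait(app_handle))  # blocks until the app reaches a terminal state

The CLI equivalent of this sketch is roughly torchx run --scheduler local_cwd utils.echo --msg "hello world".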

Contributing

We welcome PRs! See the CONTRIBUTING file.

License

TorchX is BSD licensed, as found in the LICENSE file.

Comments
  • cli: defer loading schedulers until used

    cli: defer loading schedulers until used

    This updates the CLI, schedulers, and runners so that schedulers are only imported when they need to be used. This means only the relevant scheduler is loaded, which vastly improves responsiveness.

    • torchx --help went from 900ms to 300ms
    • torchx status local_cwd:// went from 1.38s to 840ms

    Breaking changes:

    • schedulers.get_schedulers() has been removed in favor of schedulers.get_scheduler_factories(), since keeping it is too dangerous: it eagerly imports every scheduler.
    • Runner.run_opts() now takes a specific scheduler name instead of returning runopts for all schedulers, since returning them all requires loading every scheduler.

    get_schedulers is an internal interface; downstream users should be using the runner interface, so changing this shouldn't be an issue.

    Runner.run_opts() is part of the user-facing Runner interface, but it's only really practical for the CLI, so I doubt there's any OSS usage of it.
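
    To illustrate the new per-scheduler call, here is a hedged sketch (the exact signature is paraphrased from this PR description rather than taken from a released API):

    from torchx.runner import get_runner

    runner = get_runner()

    # Previously, run_opts() returned runopts for every scheduler, which forced
    # all scheduler modules to be imported up front. After this change, you pass
    # the scheduler name and only that scheduler is loaded:
    local_opts = runner.run_opts("local_cwd")
    print(local_opts)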

    Test plan:

    (torchx) [email protected] ~/D/torchx (deferload)> time torchx --help
    usage: torchx [-h] [--log_level LOG_LEVEL] [--version] {builtins,cancel,configure,describe,log,run,runopts,status} ...
    
    torchx CLI
    
    optional arguments:
      -h, --help            show this help message and exit
      --log_level LOG_LEVEL
                            Python logging log level
      --version             show program's version number and exit
    
    sub-commands:
      Use the following commands to run operations, e.g.: torchx run ${JOB_NAME}
    
      {builtins,cancel,configure,describe,log,run,runopts,status}
    
    ________________________________________________________
    Executed in  300.70 millis    fish           external
       usr time  280.99 millis  916.00 micros  280.07 millis
       sys time   16.46 millis    0.00 micros   16.46 millis
    
    (torchx) [email protected] ~/D/torchx (deferload)> time torchx status local_docker://torchx/sh-lsgcv92jm13ps
    torchx 2022-06-24 15:55:58 INFO     AppDef:
      State: SUCCEEDED
      Num Restarts: -1
    Roles:
     *sh[0]:SUCCEEDED
      sh[1]:SUCCEEDED
      sh[2]:SUCCEEDED
    
    ________________________________________________________
    Executed in  507.45 millis    fish           external
       usr time  472.78 millis  926.00 micros  471.86 millis
       sys time   19.77 millis    0.00 micros   19.77 millis
    
    CLA Signed 
    opened by d4l3k 17
  • Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Summary: Extend k8s integ tests, add general class that allows executing different components on local and k8s schedulers

    Differential Revision: D30980471

    CLA Signed fb-exported 
    opened by aivanou 17
  • API to list jobs by scheduler

    API to list jobs by scheduler

    Given a scheduler, torchx list -s <scheduler_name> lists the jobs scheduled on it. This is an MVP that supports listing jobs for the Kubernetes scheduler, addressing https://github.com/pytorch/torchx/issues/503. More features, such as listing only jobs started by torchx, listing active jobs, and filtering options, will be implemented in follow-up PRs.
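
    Assuming the runner exposes a matching list() method (the PR text only spells out the CLI, so the method name and return shape here are assumptions), programmatic usage would look roughly like:

    from torchx.runner import get_runner

    # Hypothetical programmatic counterpart of `torchx list -s kubernetes`.
    runner = get_runner()
    for job in runner.list("kubernetes"):
        print(job)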

    Test plan:

    (venv38) [email protected]:~/wp/torchx$ torchx list -s kubernetes
    kubernetes://default/default:cifar-trainer-a5qvfhe1hyb1b
    kubernetes://default/default:cifar-trainer-d796ei2tdf4bc
    kubernetes://default/default:cifar-trainer-em0iao2m9agog
    kubernetes://default/default:cifar-trainer-ew33oxmdg7t0c
    kubernetes://default/default:cifar-trainer-grjsnfxeinqbd
    kubernetes://default/default:cifar-trainer-p4bwlewvwmt1f
    kubernetes://default/default:cifar-trainer-pwy4c4omfufff
    kubernetes://default/default:cifar-trainer-qyan5cp6vz5od
    kubernetes://default/default:cifar-trainer-rbrd5krzkyz3c
    kubernetes://default/default:cifar-trainer-rw9kn5rtau6se
    kubernetes://default/default:cifar-trainer-tz45941r7u1b
    kubernetes://default/default:cifar-trainer-uycdlshurnf6c
    kubernetes://default/default:cifar-trainer-vstsr7qtecaif
    kubernetes://default/default:cifar-trainer-xnqu2mxdc2a5b
    .
    .
    .
    .
    kubernetes://default/default:train-zhqz9fhr7h0tsc
    kubernetes://default/default:trainer-bwhcp3vw2khftc
    kubernetes://default/default:trainer-m3mxh6dcxgwtv
    
    CLA Signed 
    opened by priyaramani 15
  • ray_scheduler: workspace + fixed no role logging

    ray_scheduler: workspace + fixed no role logging

    This updates Ray to have proper workspace support.

    • -c working_dir=... is deprecated in favor of torchx run --workspace=...
    • -c requirements=... is optional and requirements.txt will be automatically read from the workspace if present
    • torchx log ray://foo/bar works without requiring /ray/0

    Test plan:

    (torchx) [email protected] ~/D/t/e/ray (ray)> torchx run -s ray --wait --log dist.ddp --env LOGLEVEL=INFO -j 2x1 -m scripts.compute_world_size
    torchx 2022-05-18 16:55:31 INFO     Checking for changes in workspace `file:///home/tristanr/Developer/torchrec/examples/ray`...
    torchx 2022-05-18 16:55:31 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
    torchx 2022-05-18 16:55:31 INFO     Built new image `/tmp/torchx_workspacebe6331jv` based on original image `ghcr.io/pytorch/torchx:0.2.0dev0` and changes in workspace `file:///home/tristanr/Developer/torch
    rec/examples/ray` for role[0]=compute_world_size.
    torchx 2022-05-18 16:55:31 WARNING  The Ray scheduler does not support port mapping.
    torchx 2022-05-18 16:55:31 INFO     Uploading package gcs://_ray_pkg_63a39f7096dfa0bd.zip.
    torchx 2022-05-18 16:55:31 INFO     Creating a file package for local directory '/tmp/torchx_workspacebe6331jv'.
    ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    torchx 2022-05-18 16:55:31 INFO     Launched app: ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    torchx 2022-05-18 16:55:31 INFO     AppStatus:
      msg: PENDING
      num_restarts: -1
      roles:
      - replicas:
        - hostname: <NONE>
          id: 0
          role: ray
          state: !!python/object/apply:torchx.specs.api.AppState
          - 2
          structured_error_msg: <NONE>
        role: ray
      state: PENDING (2)
      structured_error_msg: <NONE>
      ui_url: null
    
    torchx 2022-05-18 16:55:31 INFO     Job URL: None
    torchx 2022-05-18 16:55:31 INFO     Waiting for the app to finish...
    torchx 2022-05-18 16:55:31 INFO     Waiting for app to start before logging...
    torchx 2022-05-18 16:55:43 INFO     Job finished: SUCCEEDED
    (torchx) [email protected] ~/D/t/e/ray (ray)> torchx log ray://torchx/127.0.0.1:8265-compute_world_size-mpr03nzqvvg3td
    ray/0 Waiting for placement group to start.
    ray/0 running ray.wait on [ObjectRef(8f2664c081ffc268e1c4275021ead9801a8d33861a00000001000000), ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
    ray/0 (CommandActor pid=494377)   entrypoint       : scripts.compute_world_size
    ray/0 (CommandActor pid=494377)   min_nodes        : 2
    ray/0 (CommandActor pid=494377)   max_nodes        : 2
    ray/0 (CommandActor pid=494377)   nproc_per_node   : 1
    ray/0 (CommandActor pid=494377)   run_id           : compute_world_size-mpr03nzqvvg3td
    ray/0 (CommandActor pid=494377)   rdzv_backend     : c10d
    ray/0 (CommandActor pid=494377)   rdzv_endpoint    : localhost:29500
    ray/0 (CommandActor pid=494377)   rdzv_configs     : {'timeout': 900}
    ray/0 (CommandActor pid=494377)   max_restarts     : 0
    ray/0 (CommandActor pid=494377)   monitor_interval : 5
    ray/0 (CommandActor pid=494377)   log_dir          : None
    ray/0 (CommandActor pid=494377)   metrics_cfg      : {}
    ray/0 (CommandActor pid=494377)
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
    ray/0 (CommandActor pid=494406)   entrypoint       : scripts.compute_world_size
    ray/0 (CommandActor pid=494406)   min_nodes        : 2
    ray/0 (CommandActor pid=494406)   max_nodes        : 2
    ray/0 (CommandActor pid=494406)   nproc_per_node   : 1
    ray/0 (CommandActor pid=494406)   run_id           : compute_world_size-mpr03nzqvvg3td
    ray/0 (CommandActor pid=494406)   rdzv_backend     : c10d
    ray/0 (CommandActor pid=494406)   rdzv_endpoint    : 172.26.20.254:29500
    ray/0 (CommandActor pid=494406)   rdzv_configs     : {'timeout': 900}
    ray/0 (CommandActor pid=494406)   max_restarts     : 0
    ray/0 (CommandActor pid=494406)   monitor_interval : 5
    ray/0 (CommandActor pid=494406)   log_dir          : None
    ray/0 (CommandActor pid=494406)   metrics_cfg      : {}
    ray/0 (CommandActor pid=494406)
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] starting workers for entrypoint: python
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous'ing worker group
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
    ray/0 (CommandActor pid=494377)   restart_count=0
    ray/0 (CommandActor pid=494377)   master_addr=tristanr-arch2
    ray/0 (CommandActor pid=494377)   master_port=48089
    ray/0 (CommandActor pid=494377)   group_rank=1
    ray/0 (CommandActor pid=494377)   group_world_size=2
    ray/0 (CommandActor pid=494377)   local_ranks=[0]
    ray/0 (CommandActor pid=494377)   role_ranks=[1]
    ray/0 (CommandActor pid=494377)   global_ranks=[1]
    ray/0 (CommandActor pid=494377)   role_world_sizes=[2]
    ray/0 (CommandActor pid=494377)   global_world_sizes=[2]
    ray/0 (CommandActor pid=494377)
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_vyq136c_/compute_world_size-mpr03nzqvvg3td_nu4r0f6t/attempt_0/0/error.json
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Rendezvous complete for workers. Result:
    ray/0 (CommandActor pid=494406)   restart_count=0
    ray/0 (CommandActor pid=494406)   master_addr=tristanr-arch2
    ray/0 (CommandActor pid=494406)   master_port=48089
    ray/0 (CommandActor pid=494406)   group_rank=0
    ray/0 (CommandActor pid=494406)   group_world_size=2
    ray/0 (CommandActor pid=494406)   local_ranks=[0]
    ray/0 (CommandActor pid=494406)   role_ranks=[0]
    ray/0 (CommandActor pid=494406)   global_ranks=[0]
    ray/0 (CommandActor pid=494406)   role_world_sizes=[2]
    ray/0 (CommandActor pid=494406)   global_world_sizes=[2]
    ray/0 (CommandActor pid=494406)
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] Starting worker group
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t38mo11i/compute_world_size-mpr03nzqvvg3td_ehvp80_p/attempt_0/0/error.json
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
    ray/0 (CommandActor pid=494377) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.000942230224609375 seconds
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:[] worker group successfully finished. Waiting 300 seconds for other agents to finish.
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (SUCCEEDED). Waiting 300 seconds for other agents to finish
    ray/0 (CommandActor pid=494406) INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0013003349304199219 seconds
    ray/0 (CommandActor pid=494377) [0]:initializing `gloo` process group
    ray/0 (CommandActor pid=494377) [0]:successfully initialized process group
    ray/0 (CommandActor pid=494377) [0]:rank: 1, actual world_size: 2, computed world_size: 2
    ray/0 (CommandActor pid=494406) [0]:initializing `gloo` process group
    ray/0 (CommandActor pid=494406) [0]:successfully initialized process group
    ray/0 (CommandActor pid=494406) [0]:rank: 0, actual world_size: 2, computed world_size: 2
    ray/0 running ray.wait on [ObjectRef(afe9f14f5a927c04b8e247b9daca5a9348ef61061a00000001000000)]
    
    CLA Signed 
    opened by d4l3k 14
  • Investigate Perf Drop for fairseq gpu training on aws batch with EFA (p3dn)

    Investigate Perf Drop for fairseq gpu training on aws batch with EFA (p3dn)

    🐛 Bug

    Not really a TorchX bug per se, but opening this issue to keep track of this investigation.

    There seems to be a perf drop when running the same fairseq trainer on AWS Batch on p3dn (w/ EFA) versus running bare metal on the same host (no Docker).

    Module (check all that apply):

    • [ ] torchx.spec
    • [ ] torchx.component
    • [ ] torchx.apps
    • [ ] torchx.runtime
    • [ ] torchx.cli
    • [ ] torchx.schedulers
    • [ ] torchx.pipelines
    • [ ] torchx.aws
    • [ ] torchx.examples
    • [x] other

    Below is a table summarizing the different runs:

    Common Params:

    1. Instance Type: p3dn.24xlarge
    2. Num Nodes: 2
    3. Nodes in same placement group: Yes
    4. Environment Variables: LOGLEVEL=INFO, NCCL_SOCKET_IFNAME=eth0, NCCL_ALGO=RING, NCCL_PROTO=simple, FI_PROVIDER=efa, FI_EFA_USE_DEVICE_RDMA=1
    5. In all cases RDMA is not enabled (it seems to not be available on p3dn; see: https://github.com/aws/aws-ofi-nccl/issues/104)

    | Scheduler | Docker | EFA picked up by NCCL | NCCL_ALGO | Performance          | Log       |
    |-----------|--------|-----------------------|-----------|----------------------|-----------|
    | BareMetal | No     | No                    | Tree      | baseline (~110k wpm) | gist-link |
    | BareMetal | Yes    | Yes                   | Tree      | WIP                  |           |
    | AWS Batch | Yes    | Yes                   | Ring      | ~44k wpm             |           |
    | AWS Batch | Yes    | Yes                   | Tree      | ~70k wpm             |           |

    • Scheduler BareMetal: ssh onto the nodes and kick off the trainer manually
    • Docker Yes/No: whether the trainer runs in Docker (e.g. docker run) versus directly on the host (python -m torch.distributed.launch)

    Logs

    To Reproduce

    Steps to reproduce the behavior:

    Expected behavior

    Environment

    • torchx version (e.g. 0.1.0rc1): torchx-nightly
    • Python version: 3.8+
    • OS (e.g., Linux): Amazon Linux2
    • How you installed torchx (conda, pip, source, docker): pip
    • Docker image and tag (if using docker):
    • Git commit (if installed from source): N/A
    • Execution environment (on-prem, AWS, GCP, Azure etc): AWS Batch
    • Any other relevant information:

    Additional context

    bug aws_batch 
    opened by kiukchung 14
  • schedulers/aws_batch: add a scheduler to launch jobs directly on aws_batch

    schedulers/aws_batch: add a scheduler to launch jobs directly on aws_batch

    This adds a scheduler that allows launching TorchX jobs directly on AWS Batch. It requires almost no infrastructure setup (just a bit of UI configuration) and provides a fairly sane Docker-based job engine.

    Test plan:

    scripts/awsbatchint.sh
    pytest torchx/schedulers/test/aws_batch_scheduler_test.py
    pyre
    
    CLA Signed 
    opened by d4l3k 14
  • Ray scheduler driver and job api

    Ray scheduler driver and job api

    To run a distributed PyTorch test:

    cd torch/torchx/schedulers/test
    python -m unittest ray_scheduler_test.py
    
    2021-11-24 00:53:22,810 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
    2021-11-24 00:53:23,127 INFO sdk.py:144 -- Uploading package gcs://_ray_pkg_e212ba1ff8e15e24.zip.
    2021-11-24 00:53:23,128 INFO packaging.py:352 -- Creating a file package for local directory '/tmp/tmp9copqxd2'.
    status: PENDING
    status: PENDING
    status: RUNNING
    status: RUNNING
    status: RUNNING
    status: RUNNING
    status: SUCCEEDED
    2021-11-24 00:53:25,521 INFO worker.py:840 -- Connecting to existing Ray cluster at address: 172.31.2.209:6379
    (CommandActor pid=3575) initializing `gloo` process group
    (CommandActor pid=3574) initializing `gloo` process group
    (CommandActor pid=3575) successfully initialized process group
    (CommandActor pid=3575) rank: 1, actual world_size: 2, computed world_size: 2
    (CommandActor pid=3574) successfully initialized process group
    (CommandActor pid=3574) rank: 0, actual world_size: 2, computed world_size: 2
    

    To run a distributed PyTorch test using the TorchX CLI:

    # Setup cluster and get a HEAD NODE IP
    ray up -y ray_cluster.yaml
    pip install torchx[dev]
    
    # Get a job ID from deployed job
    torchx run -s ray -cfg dashboard_address=34.209.89.185:20002,working_dir=aivanou_test utils.binary --entrypoint ray_simple.py
    
    # Use job ID to get logs or job status
    torchx describe ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
    torchx log ray://torchx/34.209.89.185:20002-raysubmit_aKvezN3NyA2mqZeW
    
    CLA Signed 
    opened by msaroufim 14
  • Move `examples` under `torchx` module

    Move `examples` under `torchx` module

    Summary: This diff moves examples under the torchx namespace, removes the examples Dockerfile, and makes the torchx image use dev-requirements.

    Differential Revision: D31464358

    fb-exported 
    opened by aivanou 14
  • github/kfp: run integration tests even for external users

    github/kfp: run integration tests even for external users

    This splits the kfp integration tests into two steps.

    1. without secrets on PR branch: builds the pipeline + containers and saves them as a GitHub artifact
    2. with secrets from master branch: loads the artifact and launches it on the KFP cluster

    Test plan:

    scripts/kfpint.py --path /tmp/foo --save
    scripts/kfpint.py --path /tmp/foo --load
    

    CI

    CLA Signed Merged 
    opened by d4l3k 14
  • add LSF scheduler

    add LSF scheduler

    I prototyped the LSF scheduler for torchx. At the moment it supports native, Docker, and Singularity runtimes with a shared filesystem. I confirmed it works with Gloo and NCCL on small VPC V100 clusters.

    Note: torchx log command is available only when the torchx host shares the filesystem with cluster nodes (e.g., NFS).

    In a nutshell, the LSF scheduler translates a torchx request into LSF job submissions (i.e., bsub). For distributed apps, it creates multiple bsub submissions. I also added lsf to scripts/component_integration_tests.py. Here is the log output from my three-node LSF cluster; you can find dryrun results there.

    component_integration_tests.lsf.txt

    Regarding Singularity image compatibility, Singularity already automates converting Docker images into the Singularity image format, so all we have to do is generate singularity exec arguments from torchx requests. Note that users still need to prefix image names with docker:// if they want to use Docker images.

    The following are example commands.

    Example: native hello_world and CLI utils

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=native utils.echo --msg hello_world --num_replicas 3
    lsf://torchx/echo-pxc3gn5ct061k
    $ torchx list -s lsf
    $ torchx status lsf://torchx/echo-pxc3gn5ct061k
    $ torchx cancel lsf://torchx/echo-pxc3gn5ct061k
    $ torchx log --stream stdout lsf://torchx/echo-pxc3gn5ct061k/echo/0
    

    Example: Docker hello_world

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=docker utils.echo --image alpine:latest --msg hello_world --num_replicas 3
    

    Example: Singularity hello_world

    $ torchx run -s lsf -cfg jobdir=/mnt/data/torchx,runtime=singularity utils.echo --image docker://alpine:latest --msg hello_world --num_replicas 3
    

    Example: Docker Distributed

    $ cp scripts/dist_app.py /mnt/data/dist/
    $ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=docker,host_network=True" dist.ddp -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
    

    Example: Singularity Distributed

    $ cp scripts/dist_app.py /mnt/data/dist/
    $ torchx run -s lsf -cfg "jobdir=/mnt/data/torchx,runtime=singularity,host_network=True" dist.ddp --image docker://ghcr.io/pytorch/torchx:0.3.0dev0 -j 2x2 --gpu 2 --script /data/dist_app.py --mount "type=bind,src=/mnt/data/dist,dst=/data"
    
    CLA Signed 
    opened by takeshi-yoshimura 13
  • c10d communication failure on k8s cluster

    c10d communication failure on k8s cluster

    🐛 Bug

    cross-referencing the issue posted here:

    https://github.com/pytorch/pytorch/issues/80772

    which is about a race condition between the rank 0 pod registering with DNS and the c10d service trying to look it up.

    To Reproduce

    Steps to reproduce the behavior:

    1. run the code with torchx run --scheduler kubernetes dist.ddp -j 8x1 --script cifar_dist.py
    2. see the pods scheduled and then erroring

    Code:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    import torch.optim as optim
    from torch.utils.data.distributed import DistributedSampler
    from torch.utils.data import DataLoader
    
    from torchvision.models import resnet18
    from torchvision.datasets import CIFAR10
    import torchvision.transforms as transforms
    
    from torch.nn.parallel import DistributedDataParallel as DDP
    import random
    import numpy as np
    from time import sleep
    
    model_dir = '/model'
    model_filename = 'mymodel'
    batch_size = 128
    
    
    def set_random_seeds(seed=0):
    
        torch.manual_seed(seed)
        # torch.backends.cudnn.deterministic = True
        # torch.backends.cudnn.benchmark = False
        np.random.seed(seed)
        random.seed(seed)
    
    
    def evaluate(model, device, test_loader):
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data in test_loader:
                images, labels = data[0].to(device), data[1].to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        accuracy = correct / total
        return accuracy
    
    
    dist.init_process_group("gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    
    if torch.cuda.is_available():
        device = torch.cuda.device(f'cuda:{rank}')
    else:
        device = torch.device('cpu')
    
    
    print(f"Running basic DDP example on rank {rank} of {world_size}")
    
    if rank == 0:
        model_filepath = os.path.join(model_dir, model_filename)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
    
    set_random_seeds()
    
    print('create model...')
    # create model
    model = resnet18(pretrained=False)
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        device = torch.cuda.device(f'cuda:{rank}')
        ddp_model = DDP(model, device_ids=[rank], output_device=rank)
    else:
        device = torch.device('cpu')
        ddp_model = DDP(model)
    
    print('prepare dataset...')
    # Prepare dataset and dataloader
    transform = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])
    
    train_set = CIFAR10(root="data", train=True, download=True, transform=transform) 
    test_set = CIFAR10(root="data", train=False, download=True, transform=transform)
    
    train_sampler = DistributedSampler(dataset=train_set)
    
    train_loader = DataLoader(dataset=train_set, batch_size=batch_size, sampler=train_sampler, num_workers=0)
    # Test loader does not have to follow distributed sampling strategy
    test_loader = DataLoader(dataset=test_set, batch_size=128, shuffle=False, num_workers=0)
    
    print('setup model...')
    
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
    
    for epoch in range(1000):
        print(f'Epoch {epoch}')
        if epoch % 10 == 0:
            if rank == 0:
                accuracy = evaluate(model=ddp_model, device=device, test_loader=test_loader)
                torch.save(ddp_model.state_dict(), model_filepath)
                print("-" * 75)
                print(f"Epoch: {epoch}, Accuracy: {accuracy}")
                print("-" * 75)
            else:
                print("-" * 75)
                print(f"Epoch: {epoch}")
                print("-" * 75)
    
        ddp_model.train()
    
        i = 0
        for data in train_loader:
            print(i)
            i += 1
            inputs, labels = data[0].to(device), data[1].to(device)
            optimizer.zero_grad()
            outputs = ddp_model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        
    

    Expected behavior

    I expect pods to connect to c10d properly and run the code.

    Environment

    Collecting environment information...
    PyTorch version: 1.11.0
    Is debug build: False
    CUDA used to build PyTorch: 11.3
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 18.04.6 LTS (x86_64)
    GCC version: Could not collect
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.27
    
    Python version: 3.8.12 (default, Oct 12 2021, 13:49:34)  [GCC 7.5.0] (64-bit runtime)
    Python platform: Linux-5.4.17-2136.306.1.3.el8uek.x86_64-x86_64-with-glibc2.17
    Is CUDA available: False
    CUDA runtime version: No CUDA
    GPU models and configuration: No CUDA
    Nvidia driver version: No CUDA
    cuDNN version: No CUDA
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] botorch==0.6.0
    [pip3] gpytorch==1.6.0
    [pip3] mypy-extensions==0.4.3
    [pip3] numpy==1.21.2
    [pip3] pytorch-lightning==1.5.10
    [pip3] torch==1.11.0
    [pip3] torch-model-archiver==0.6.0
    [pip3] torchelastic==0.2.2
    [pip3] torchmetrics==0.9.1
    [pip3] torchserve==0.6.0
    [pip3] torchtext==0.12.0
    [pip3] torchvision==0.12.0
    [pip3] torchx==0.2.0
    [conda] blas                      1.0                         mkl  
    [conda] botorch                   0.6.0                    pypi_0    pypi
    [conda] cudatoolkit               11.3.1               ha36c431_9    nvidia
    [conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
    [conda] gpytorch                  1.6.0                    pypi_0    pypi
    [conda] mkl                       2021.4.0           h06a4308_640  
    [conda] mkl-service               2.4.0            py38h7f8727e_0  
    [conda] mkl_fft                   1.3.1            py38hd3c417c_0  
    [conda] mkl_random                1.2.2            py38h51133e4_0  
    [conda] numpy                     1.21.2           py38h20f2e39_0  
    [conda] numpy-base                1.21.2           py38h79a1101_0  
    [conda] pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
    [conda] pytorch-lightning         1.5.10                   pypi_0    pypi
    [conda] pytorch-mutex             1.0                        cuda    pytorch
    [conda] torch-model-archiver      0.6.0                    pypi_0    pypi
    [conda] torchelastic              0.2.2                    pypi_0    pypi
    [conda] torchmetrics              0.9.1                    pypi_0    pypi
    [conda] torchserve                0.6.0                    pypi_0    pypi
    [conda] torchtext                 0.12.0                     py38    pytorch
    [conda] torchvision               0.12.0               py38_cu113    pytorch
    [conda] torchx                    0.2.0                    pypi_0    pypi
    
    • Execution environment: Oracle Cloud OKE cluster k8s version 1.23.4
    • dns: CoreDNS-1.8.6 linux/amd64, go1.16.7 BoringCrypto, 69a006c9f1a
    opened by streamnsight 13
  • Find default namespace from kube_config.

    Find default namespace from kube_config.

    For the Kubernetes scheduler, read the default namespace from the kube config instead of using the hard-coded "default" value.

    Right now the default namespace for the list method is hard-coded as "default", which does not hold in multi-user scenarios on a Kubernetes cluster. The default namespace can be read from the current context of the kube configuration.

    Test plan:

    Tested manually with a generated kube config.

    CLA Signed 
    opened by liuwenchao 3
  • (torchx/aws_batch) Enable Region Selection in TorchConfig

    (torchx/aws_batch) Enable Region Selection in TorchConfig

    QOL improvements for AWS region support.

    Current Behavior: Use the default region specified in the .aws config.

    Proposed Behavior: Have a manual override in .torchxconfig. If there is no override and no default, throw an error.

    Future Improvements: Add profile support. Add fargate support.

    CLA Signed 
    opened by ashvinnihalani 3
  • Kubernetes: support default scheduler instead of volcano for autoscaling

    Kubernetes: support default scheduler instead of volcano for autoscaling

    Description

    Right now Volcano doesn't handle autoscaling correctly, since it won't schedule jobs if there aren't enough resources. For single-node jobs it would be nice to support the default (non-Volcano) K8s scheduler so the Pods will be created and can trigger cluster autoscaling.

    Detailed Proposal

    Add a scheduler setting to the Kubernetes scheduler that allows setting schedulerName to "default-scheduler" in order to use the standard scheduling logic.

    https://volcano.sh/en/docs/vcjob/#schedulername

    There is some prototype work for TPU nodes in https://github.com/pytorch/torchx/commit/66e934c59b376a8c0b9aa6523fde34e81a673eb2 which required the same default-scheduler logic.

    Alternatives

    Additional context/links

    opened by d4l3k 0
  • [torchx/schedulers - aws batch] Only display jobs with torchx tag when listing

    [torchx/schedulers - aws batch] Only display jobs with torchx tag when listing

    Description

    Add a tag filter on the existence of the tag key torchx.pytorch.org/version when listing jobs.

    Motivation/Background

    Currently, when the Batch compute environment has jobs that were not launched via torchx and we list jobs via:

    torchx list -s aws_batch
    

    It lists ALL the jobs, not just the ones submitted via torchx. Since we already tag the Batch jobs launched with torchx with the tag key torchx.pytorch.org/version (see screenshot below), we should only list the jobs that have this tag.

    [screenshot: AWS Batch job tags]

    Furthermore, we can also:

    1. Add a unix-username tag so that we only display the jobs relevant to the user
    2. Add a --filter option to either show all (*) jobs or filter by user or job queue or other useful criteria

    Detailed Proposal

    See motivation above.

    Alternatives

    N/A

    Additional context/links

    N/A

    opened by kiukchung 0
  • [RFC][torchx/schedulers - aws batch] support wildcard job queue selection

    [RFC][torchx/schedulers - aws batch] support wildcard job queue selection

    Description

    For the AWS Batch scheduler integration in torchx, support wildcards in the job_queue runopt with a simple (or configurable or extendable) queue selection logic. For instance, assume that my organization uses a job queue naming convention of the form

    ${TEAM_NAME}-${CLUSTER_TYPE}-${REGION}
    
    Example:
    
    pt_r2p-gpu_cluster-us-east-1a
    pt_r2p-gpu_cluster-us-east-1b
    pt_r2p-gpu_cluster-us-east-1c
    pt_r2p-cpu_cluster-us-east-1a
    pt_r2p-cpu_cluster-us-east-1b
    pt_r2p-trainium_cluster-us-east-1a
    pt_r2p-trainium_cluster-us-east-1c
    
    pt_core-gpu_cluster-us-east-1a
    pt_core-cpu_cluster-us-east-1a
    

    If I'm in the pt_r2p team and want to submit a job to any gpu compute environment that has free capacity regardless of region, then I can use a wildcard on the ${REGION} portion of the job queue name:

    [aws_batch]
    job_queue=pt_r2p-gpu_cluster-*
    

    Motivation/Background

    Ideally, with AWS Batch we create a single job queue (JQ) connected to multiple compute environments (CEs) and always submit to that JQ, letting Batch figure out which CE the job needs to be submitted to. With Fair Share (FS) scheduling, announced a year ago (see announcement), this is theoretically possible. However, many users of AWS Batch are still using the FIFO (originally the only supported) scheduling policy, in which case having a single JQ is impractical in a multi-team scenario, since users from other teams may affect my team's scheduling overhead. This escalates pretty quickly in cases where teams BYO (bring your own) capacity.

    Detailed Proposal

    Support wild-cards for job_queue names for the aws_batch scheduler with the following MVP queue selection algorithm (greedy):

    1. Find all job queues that match the wild-card expression
    2. For each job queue pull the details of the CE that it is hooked up to
    3. Filter the CEs down to the ones that actually support the host type + quantity that the job needs
    4. For each filtered CE look at the free resources and rank them by most-free -> least-free
    5. Pick the job queue that has the most CEs with the highest rank

    This algorithm effectively chooses the JQ to submit the job to that will yield the least wait time in the queue.

    To actually implement the greedy algorithm above, I suggest that we add chain-able selection algorithms. For instance the algorithm above can be expressed as a chain of primitives:

    jqs = get_matching("pt_r2p-gpu_cluster-*")
    jqs = filter_resource(jqs, role.resource)
    jqs = order_by(jqs, Ordering.FREE_CAPACITY, desc=True)
    
    jqs[0] # <-- select this one to run
    

    Similarly a "first-match" algorithm can be implemented as:

    get_matching("pt_r2p-gpu_cluster-*")[0]
    

    We can follow torchdata's datapipe interface such that each function in the chain has the signature:

    def fn(jqs: List[JobQueue], *args, **kwargs) -> List[JobQueue]:
       """
       Returns a sorted/filtered/manipulated list of job queues to pass to the next chained fn.
       """
       pass
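
    To make the chaining idea concrete, below is a self-contained sketch of the three primitives used above. The JobQueue shape, the free_capacity field, and the way candidate queues are passed in are all made up for illustration; a real implementation would populate them from the AWS Batch describe APIs.

    from dataclasses import dataclass, field
    from enum import Enum
    from fnmatch import fnmatch
    from typing import List

    @dataclass
    class JobQueue:
        name: str
        supported_hosts: List[str] = field(default_factory=list)  # host types the attached CEs offer
        free_capacity: int = 0  # illustrative stand-in for free CE resources

    class Ordering(Enum):
        FREE_CAPACITY = "free_capacity"

    def get_matching(pattern: str, all_jqs: List[JobQueue]) -> List[JobQueue]:
        # wildcard match on the queue name, e.g. "pt_r2p-gpu_cluster-*"
        return [jq for jq in all_jqs if fnmatch(jq.name, pattern)]

    def filter_resource(jqs: List[JobQueue], host_type: str) -> List[JobQueue]:
        # keep only queues whose CEs can actually host the requested instance type
        return [jq for jq in jqs if host_type in jq.supported_hosts]

    def order_by(jqs: List[JobQueue], key: Ordering, desc: bool = True) -> List[JobQueue]:
        return sorted(jqs, key=lambda jq: getattr(jq, key.value), reverse=desc)

    # greedy selection: pick the matching queue with the most free capacity
    all_jqs = [
        JobQueue("pt_r2p-gpu_cluster-us-east-1a", ["p3dn.24xlarge"], free_capacity=4),
        JobQueue("pt_r2p-gpu_cluster-us-east-1b", ["p3dn.24xlarge"], free_capacity=16),
        JobQueue("pt_r2p-cpu_cluster-us-east-1a", ["c5.18xlarge"], free_capacity=32),
    ]
    jqs = get_matching("pt_r2p-gpu_cluster-*", all_jqs)
    jqs = filter_resource(jqs, "p3dn.24xlarge")
    jqs = order_by(jqs, Ordering.FREE_CAPACITY, desc=True)
    selected = jqs[0] if jqs else None  # -> pt_r2p-gpu_cluster-us-east-1b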
    

    Alternatives

    1. Resources/guidelines/script to migrate from FIFO queues to Fair-Share.

    Additional context/links

    N/A

    opened by kiukchung 1
Releases(v0.4.0)
  • v0.4.0(Dec 30, 2022)

  • v0.3.0(Oct 27, 2022)

  • v0.2.0(Jun 15, 2022)

  • v0.1.2(Mar 29, 2022)

    Full Changelog: https://github.com/pytorch/torchx/compare/v0.1.1...v0.1.2

    Milestone: https://github.com/pytorch/torchx/milestones/3

    • PyTorch 1.11 Support
    • Python 3.10 Support
    • torchx.workspace
      • TorchX now supports a concept of workspaces. This enables seamless launching of jobs using changes present in your local workspace. For Docker-based schedulers, we automatically build a new Docker container on job launch, making it easier than ever to run experiments. #333
    • torchx.schedulers
      • Ray #329
        • Newly added Ray scheduler makes it easy to launch jobs on Ray.
        • https://pytorch.medium.com/large-scale-distributed-training-with-torchx-and-ray-1d09a329aacb
      • AWS Batch #381
        • Newly added AWS Batch scheduler makes it easy to launch jobs in AWS with minimal infrastructure setup.
      • Slurm
        • Slurm jobs will by default launch in the current working directory to match local_cwd and workspace behavior. #372
        • Replicas now have their own log files and can be accessed programmatically. #373
        • Support for comment, mail-user and constraint fields. #391
        • Workspace support (prototype) - Slurm jobs can now be launched in isolated experiment directories. #416
      • Kubernetes
        • Support for running jobs under service accounts. #408
        • Support for specifying instance types. #433
      • All Docker-based Schedulers (Kubernetes, Batch, Docker)
        • Added bind mount and volume supports #420, #426
        • Bug fix: Better shm support for large dataloader #429
        • Support for .dockerignore and custom Dockerfiles #401
      • Local Scheduler
        • Automatically set CUDA_VISIBLE_DEVICES #383
        • Improved log ordering #366
    • torchx.components
      • dist.ddp
        • Rendezvous works out of the box on all schedulers #400
        • Logs are now prefixed with local ranks #412
        • Can specify resources via the CLI #395
        • Can specify environment variables via the CLI #399
      • HPO
        • Ax runner now lives in the Ax repo https://github.com/facebook/Ax/commit/8e2e68f21155e918996bda0b7d97b5b9ef4e0cba
    • torchx.cli
      • .torchxconfig
        • You can now specify component argument defaults in .torchxconfig https://github.com/pytorch/torchx/commit/c37cfd7846d5a0cb527dd19c8c95e881858f8f0a
        • ~/.torchxconfig can now be used to set user level defaults. #378
        • --workspace can be configured #397
      • Color change and bug fixes #419
    • torchx.runner
      • Now supports workspace interfaces. #360
      • Returned lines now preserve whitespace to provide support for progress bars #425
      • Events are now logged to torch.monitor when available. #379
    • torchx.notebook (prototype)
      • Added new workspace interface for developing models and launching jobs via a Jupyter Notebook. #356
    • Docs
      • Improvements to clarify TorchX usage w/ workspaces and general cleanups.
      • #374, #402, #404, #407, #434
  • v0.1.1(Nov 18, 2021)

  • v0.1.0(Oct 21, 2021)

  • v0.1.0rc1(Oct 18, 2021)

  • v0.1.0rc0(Oct 5, 2021)

  • v0.1.0b0(Jun 29, 2021)

  • v0.1.0.dev2(Jun 24, 2021)

  • v0.1.0.dev1(Jun 17, 2021)

  • v0.1.0.dev0(Jun 16, 2021)
