OpenDILab RL Kubernetes Custom Resource and Operator Lib

Last update: Dec 29, 2022

Overview

DI Orchestrator

DI Orchestrator manages DI (Decision Intelligence) jobs on Kubernetes using a custom resource and an operator.

Prerequisites

A well-prepared Kubernetes cluster. Follow the Kubernetes instructions to create a cluster, or create a local single-node cluster with kind or minikube.
cert-manager. To install it on Kubernetes, refer to the cert-manager docs, or install it with the following command.

kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install di-operator and di-server with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in the di-system namespace:

$ kubectl get pod -n di-system
NAME                           READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# list pods; the coordinator is created by di-operator
# a few seconds later, the collectors and learners are created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refer to the user-guide. For the Chinese version, refer to 中文手册 (Chinese manual).

Contributing

Refer to the developer-guide.

Comments

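Add cluster information inside the Pod
When a job is submitted as dijob replicas, each pod should be able to see the host information of the whole replica group and its own start order. Add the following environment variables:

FQDNs of all pods in the replica group, sorted by start order

FQDN of the current pod

Ordinal of the current pod

DI-engine will set up the corresponding network connections based on these variables, so the attach-to generation logic can be removed from di-orchestrator.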
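As a rough illustration of how a process inside a pod might consume such variables, here is a minimal Go sketch. The variable names (EXAMPLE_PEER_FQDNS, EXAMPLE_POD_FQDN, EXAMPLE_POD_RANK), the comma-separated format and the zero-based rank are all assumptions made for this example; the text above does not give the real names or encoding.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Placeholder variable names: the real names are not given in the text above.
const (
	envPeerFQDNs = "EXAMPLE_PEER_FQDNS" // assumed: comma-separated FQDNs, sorted by start order
	envPodFQDN   = "EXAMPLE_POD_FQDN"   // assumed: FQDN of the current pod
	envPodRank   = "EXAMPLE_POD_RANK"   // assumed: zero-based ordinal of the current pod
)

// ClusterInfo is what one pod knows about its replica group.
type ClusterInfo struct {
	Peers []string
	Self  string
	Rank  int
}

// loadClusterInfo reads the three variables through getenv
// (pass os.Getenv inside a real pod) and cross-checks them.
func loadClusterInfo(getenv func(string) string) (ClusterInfo, error) {
	raw := getenv(envPeerFQDNs)
	if raw == "" {
		return ClusterInfo{}, fmt.Errorf("%s is not set", envPeerFQDNs)
	}
	peers := strings.Split(raw, ",")

	rank, err := strconv.Atoi(getenv(envPodRank))
	if err != nil {
		return ClusterInfo{}, fmt.Errorf("invalid %s: %w", envPodRank, err)
	}
	if rank < 0 || rank >= len(peers) {
		return ClusterInfo{}, fmt.Errorf("rank %d out of range for %d peers", rank, len(peers))
	}

	self := getenv(envPodFQDN)
	if peers[rank] != self {
		return ClusterInfo{}, fmt.Errorf("peer %d is %q, but own FQDN is %q", rank, peers[rank], self)
	}
	return ClusterInfo{Peers: peers, Self: self, Rank: rank}, nil
}

func main() {
	// Fake environment standing in for the values an operator would inject.
	env := map[string]string{
		envPeerFQDNs: "job-learner-0.job.default.svc,job-collector-0.job.default.svc",
		envPodFQDN:   "job-collector-0.job.default.svc",
		envPodRank:   "1",
	}
	info, err := loadClusterInfo(func(key string) string { return env[key] })
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("rank %d of %d, self=%s\n", info.Rank, len(info.Peers), info.Self)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the parsing logic easy to exercise outside a cluster.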
enhancement
opened by sailxjx 3

add tasks to dijob spec

1. goal

A dijob defines only one pod template, so we cannot define different commands or resources for different components of di-engine such as the collector, learner and evaluator. We therefore need a more general way to define the dijob custom resource.

2. design

Inspired by VolcanoJob, we define spec.tasks to describe the different components of di-engine. spec.tasks is a list, which allows us to define multiple tasks. Each task can set task.type to label itself as one of collector, learner, evaluator and none. none, the default value, means the task is a general task.

After the change, a dijob can be defined as follows:

apiVersion: diengine.opendilab.org/v2alpha1
kind: DIJob
metadata:
  name: job-with-tasks
spec:
  priority: "normal"  # job priority, which is a reserved field for allocator
  backoffLimit: 0  # restart count
  cleanPodPolicy: "Running"  # the policy to clean pods after job completion
  preemptible: false  # whether the job is preemptible
  minReplicas: 2
  maxReplicas: 5
  tasks:
  - replicas: 1
    name: "learner"
    type: learner
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c"]
          args:
          - |
            ditask --label learner xxx
          resources:
            requests:
              cpu: "1"
              nvidia.com/gpu: 1
        restartPolicy: Never
  - replicas: 1
    name: "evaluator"
    type: evaluator
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c"]
          args:
          - |
            ditask --label evaluator xxx
        restartPolicy: Never
  - replicas: 2
    name: "collector"
    type: collector
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c"]
          args:
          - |
            ditask --label collector xxx
        restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job created.
    reason: JobPending
    status: "False"
    type: Pending
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job is starting since all pods are created.
    reason: JobStarting
    status: "False"
    type: Starting
  phase: Starting
  profilings: {}
  readyReplicas: 0
  replicas: 4
  taskStatus:
    learner:
      Pending: 1
    evaluator:
      Pending: 1
    collector:
      Pending: 2
  reschedules: 0
  restarts: 0

task definition:

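// Task describes one group of pods in a dijob
type Task struct {
	// Name identifies the task within the job
	Name string `json:"name,omitempty"`

	// Type labels the task as learner, collector, evaluator or none
	Type TaskType `json:"type,omitempty"`

	// Replicas is the number of pods to create for this task
	Replicas int32 `json:"replicas,omitempty"`

	// Template is the pod template used for every replica
	Template corev1.PodTemplateSpec `json:"template,omitempty"`
}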

type TaskType string

const (
	TaskTypeLearner TaskType = "learner"

	TaskTypeCollector TaskType = "collector"

	TaskTypeEvaluator TaskType = "evaluator"

	TaskTypeNone TaskType = "none"
)
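To make the defaulting of task.type concrete, here is a minimal Go sketch of how a task list might be normalized and checked. The helper name normalizeTasks and its rules (non-empty unique names, at least one replica per task) are assumptions for illustration, not the project's actual validator; the Task struct is a trimmed stand-in for the one above with the pod template omitted.

```go
package main

import "fmt"

type TaskType string

const (
	TaskTypeLearner   TaskType = "learner"
	TaskTypeCollector TaskType = "collector"
	TaskTypeEvaluator TaskType = "evaluator"
	TaskTypeNone      TaskType = "none"
)

// Task is a trimmed stand-in for the Task struct; the pod template is omitted.
type Task struct {
	Name     string
	Type     TaskType
	Replicas int32
}

// normalizeTasks is a hypothetical helper. It defaults an empty type to
// "none", rejects unknown types, empty or duplicate names and replica
// counts below 1, and returns the total number of replicas.
func normalizeTasks(tasks []Task) (int32, error) {
	seen := make(map[string]bool, len(tasks))
	var total int32
	for i := range tasks {
		t := &tasks[i] // pointer so the defaulted type is written back
		if t.Type == "" {
			t.Type = TaskTypeNone
		}
		switch t.Type {
		case TaskTypeLearner, TaskTypeCollector, TaskTypeEvaluator, TaskTypeNone:
		default:
			return 0, fmt.Errorf("task %q: unknown type %q", t.Name, t.Type)
		}
		if t.Name == "" {
			return 0, fmt.Errorf("task at index %d has no name", i)
		}
		if seen[t.Name] {
			return 0, fmt.Errorf("duplicate task name %q", t.Name)
		}
		seen[t.Name] = true
		if t.Replicas < 1 {
			return 0, fmt.Errorf("task %q: replicas must be at least 1", t.Name)
		}
		total += t.Replicas
	}
	return total, nil
}

func main() {
	// Same task layout as the job-with-tasks example.
	tasks := []Task{
		{Name: "learner", Type: TaskTypeLearner, Replicas: 1},
		{Name: "evaluator", Type: TaskTypeEvaluator, Replicas: 1},
		{Name: "collector", Type: TaskTypeCollector, Replicas: 2},
	}
	total, err := normalizeTasks(tasks)
	fmt.Println(total, err) // prints "4 <nil>"
}
```

The total of 4 matches status.replicas in the example job. Requiring unique names is a choice made here because status.taskStatus is keyed by task.name.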

status.taskStatus definition:

type DIJobStatus struct {
	// Phase defines the observed phase of the job
	// +kubebuilder:default=Pending
	Phase Phase `json:"phase,omitempty"`

	// ...

	// TaskStatus maps each task.name to the status of that task
	TaskStatus map[string]TaskStatus `json:"taskStatus,omitempty"`

	// ...
}

// TaskStatus counts the pods of a task in each pod phase
type TaskStatus map[corev1.PodPhase]int32
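The following Go sketch shows one way the per-task counts could be built from a list of pods. It is an illustration only: PodPhase is a local stand-in for corev1.PodPhase so the example needs no Kubernetes imports, and the pod struct and countTaskStatus function are hypothetical names, not part of the project.

```go
package main

import "fmt"

// PodPhase stands in for corev1.PodPhase in this self-contained sketch.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// TaskStatus counts the pods of one task in each phase.
type TaskStatus map[PodPhase]int32

// pod is a hypothetical minimal view of a pod: the name of the task it
// belongs to and its current phase.
type pod struct {
	TaskName string
	Phase    PodPhase
}

// countTaskStatus groups pods by task name and counts them per phase.
func countTaskStatus(pods []pod) map[string]TaskStatus {
	out := make(map[string]TaskStatus)
	for _, p := range pods {
		if out[p.TaskName] == nil {
			out[p.TaskName] = make(TaskStatus)
		}
		out[p.TaskName][p.Phase]++
	}
	return out
}

func main() {
	// Four pending pods, matching the status block of the example job.
	pods := []pod{
		{TaskName: "learner", Phase: PodPending},
		{TaskName: "evaluator", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
	}
	status := countTaskStatus(pods)
	fmt.Println(status["collector"][PodPending]) // prints 2
}
```

Grouping by task name rather than task type keeps two tasks of the same type in separate entries.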

enhancement

opened by konnase 1

new version for di-engine new architecture
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired by adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 1
v0.2.0
[x] split webhook and operator

[x] add dockerfile.dev

[x] update CleanPolicyALL to CleanPolicyAll

[x] remove k8s service related operations from server, and operator is responsible for managing services

[x] add e2e test

enhancement
opened by konnase 1
refactor job spec
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

enhancement
opened by konnase 0
Release/v1.0
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired by adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 0
fix: job failed submit when collector/learner missed

Job submission failed when collector/learner was missing, because the webhook created an empty dijob and the golang builder added default values to some fields of collector/learner, which resulted in an invalid type error. Solved by making coordinator/collector/learner pointers.
bug

opened by konnase 0
Feat/job create event
add event handler for dijob, and mark job as Created when job submitted

mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94); this is hard to test because the client-go reflector decodes DIJob strictly, so we have no chance to handle the DIJob add event when an incorrect job is submitted

version -> v0.2.1

enhancement
opened by konnase 0
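Some questions about allocate

1. With the current allocator logic, the initial allocation for a non-preemptible job only uses minReplicas to set the replicas field. Is it then entirely up to K8s which nodes the job's pods are deployed on? Also, in allocator.go of the Release1.13 code, the initial allocation for non-preemptible jobs does not seem to have been written yet.
2. What exactly does it mean for a job to be preemptible? Is it equivalent to whether the job can be scheduled?
3. The Allocate and Optimize methods of the FitPolicy scheduling policy are not implemented either. When will this part be added?
4. Many places in the documentation do not match the latest code. For example, the DIJob.Spec.Group field has already been removed from the code, and the job.spec.minreplicas field mentioned in the documentation is not in the code either; it is in JobInfo instead. Could the documentation be updated? Thanks!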

opened by RZ-Q 3

Releases(v1.1.3)

v1.1.3(Aug 22, 2022)
bug fixes

judge which task a pod belongs to according to task name instead of task type (https://github.com/opendilab/DI-orchestrator/pull/27)

v1.1.2(Jul 21, 2022)
bugs fix

global cmd flag error(https://github.com/opendilab/DI-orchestrator/pull/23)

wrong pod subdomain(https://github.com/opendilab/DI-orchestrator/pull/24)

incorrect to get global rank(https://github.com/opendilab/DI-orchestrator/pull/25)

v1.1.1(Jul 4, 2022)
update status replicas and task status

add volumes to job spec

update status CompletionTimestamp when job completed

see details in https://github.com/opendilab/DI-orchestrator/pull/22
v1.1.0(Jun 30, 2022)
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

see details in https://github.com/opendilab/DI-orchestrator/pull/21
v1.0.0(Mar 23, 2022)
features

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired by adaptdl

use gin to rewrite di-server

update di-server http interface see https://github.com/opendilab/DI-orchestrator/pull/18

v0.2.2(Dec 15, 2021)
bug fix

resolve bug that job failed to submit when collector/learner missed (https://github.com/opendilab/DI-orchestrator/pull/14)

v0.2.1(Oct 12, 2021)
feature

add event handler for dijob, and mark job as Created when job submitted(https://github.com/opendilab/DI-orchestrator/pull/13)

mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect (https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94); this is hard to test because the client-go reflector decodes DIJob strictly, so we have no chance to handle the DIJob add event when an incorrect job is submitted

v0.2.0(Sep 28, 2021)
change orchestrator image repository

version -> v0.2.0

v0.2.0-rc.0(Sep 6, 2021)
split webhook and operator

add dockerfile.dev

update CleanPolicyALL to CleanPolicyAll

remove k8s service related operations from server, and operator is responsible for managing services

add e2e test

v0.1.0(Jul 8, 2021)
Features

Define DIJob CRD to support DI jobs' submission

Define AggregatorConfig CRD to support aggregator definition

Add webhook to validate DIJob submission

Provide http service for DI jobs to request for DI modules

Docs to introduce DI-orchestrator architecture


Owner

OpenDILab

Open sourced Decision Intelligence (DI)

