OpenDILab RL Kubernetes Custom Resource and Operator Lib

Last update: Dec 29, 2022

Overview

DI Orchestrator

DI Orchestrator is designed to manage DI (Decision Intelligence) jobs using Kubernetes Custom Resource and Operator.

Prerequisites

A running Kubernetes cluster. Follow the official instructions to create one, or set up a local single-node cluster with kind or minikube.
cert-manager. To install it on Kubernetes, refer to the cert-manager docs, or use the following command.

kubectl create -f ./config/certmanager/cert-manager.yaml

Install DI Orchestrator

DI Orchestrator consists of two components: di-operator and di-server. Install both with the following command.

kubectl create -f ./config/di-manager.yaml

di-operator and di-server will be installed in the di-system namespace.

$ kubectl get pod -n di-system
NAME                               READY   STATUS    RESTARTS   AGE
di-operator-57cc65d5c9-5vnvn   1/1     Running   0          59s
di-server-7b86ff8df4-jfgmp     1/1     Running   0          59s

Install global components of DIJob defined in AggregatorConfig:

kubectl create -f config/samples/agconfig.yaml -n di-system

Submit DIJob

# submit DIJob
$ kubectl create -f config/samples/dijob-cartpole.yaml

# get pod and you will see coordinator is created by di-operator
# a few seconds later, you will see collectors and learners created by di-server
$ kubectl get pod

# get logs of coordinator
$ kubectl logs cartpole-dqn-coordinator

User Guide

Refer to the user-guide. For the Chinese version, refer to 中文手册 (Chinese manual).

Contributing

Refer to the developer-guide.

Comments

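Add cluster information inside pods
When a job is submitted as dijob replicas, every pod should be able to see the host information of the whole replica group and its own startup order. Add the following environment variables:

FQDNs of all pods in the replica group, sorted by startup order

FQDN of the current pod

Ordinal (startup order) of the current pod

DI-engine will set up the corresponding network connections based on these variables, so the attach-to generation logic can be removed from di-orchestrator.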
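
A minimal Go sketch of how a process inside a pod might consume the three variables described above. The variable names EXAMPLE_NODES, EXAMPLE_NODE_NAME and EXAMPLE_RANK are hypothetical placeholders (the names proposed in the issue are not shown here), and the comma-separated list format is an assumption. The sketch also assumes the ordinal indexes the current pod's own entry in the ordered list.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Hypothetical variable names, used only for illustration.
const (
	envNodes = "EXAMPLE_NODES"     // assumed: comma-separated FQDNs in startup order
	envSelf  = "EXAMPLE_NODE_NAME" // assumed: FQDN of the current pod
	envRank  = "EXAMPLE_RANK"      // assumed: startup ordinal of the current pod
)

// ClusterInfo holds what a pod learns about its replica group.
type ClusterInfo struct {
	Nodes []string
	Self  string
	Rank  int
}

// parseClusterInfo reads the three variables through getenv and checks that
// the rank points at the current pod's own entry in the node list.
func parseClusterInfo(getenv func(string) string) (ClusterInfo, error) {
	raw := getenv(envNodes)
	if raw == "" {
		return ClusterInfo{}, fmt.Errorf("%s is not set", envNodes)
	}
	nodes := strings.Split(raw, ",")
	rank, err := strconv.Atoi(getenv(envRank))
	if err != nil {
		return ClusterInfo{}, fmt.Errorf("invalid %s: %w", envRank, err)
	}
	if rank < 0 || rank >= len(nodes) {
		return ClusterInfo{}, fmt.Errorf("rank %d out of range for %d nodes", rank, len(nodes))
	}
	self := getenv(envSelf)
	if nodes[rank] != self {
		return ClusterInfo{}, fmt.Errorf("node %d is %q, expected %q", rank, nodes[rank], self)
	}
	return ClusterInfo{Nodes: nodes, Self: self, Rank: rank}, nil
}

func main() {
	info, err := parseClusterInfo(os.Getenv)
	if err != nil {
		fmt.Println("cluster info unavailable:", err)
		return
	}
	fmt.Printf("pod %d of %d: %s\n", info.Rank, len(info.Nodes), info.Self)
}
```

Passing the lookup function as a parameter keeps the parsing logic testable without touching the real process environment.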
enhancement
opened by sailxjx 3

add tasks to dijob spec

1. goal

A dijob defines only one pod template, so we cannot specify different commands or resources for the different components of di-engine, such as collector, learner and evaluator. We therefore need a more general way to define the dijob custom resource.

2. design

Inspired by VolcanoJob, we define spec.tasks to describe the different components of di-engine. spec.tasks is a list, which allows us to define multiple tasks. Each task can set task.type to label itself as one of collector, learner, evaluator or none. none, the default value, means the task is a general task.

After the change, a dijob can be defined as follows:

apiVersion: diengine.opendilab.org/v2alpha1
kind: DIJob
metadata:
  name: job-with-tasks
spec:
  priority: "normal"  # job priority, which is a reserved field for allocator
  backoffLimit: 0  # restart count
  cleanPodPolicy: "Running"  # the policy to clean pods after job completion
  preemptible: false  # whether the job is preemptible
  minReplicas: 2  
  maxReplicas: 5
  tasks:
  - replicas: 1
    name: "learner"
    type: learner
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label learner xxx
          resources:
            requests:
              cpu: "1"
              nvidia.com/gpu: 1
        restartPolicy: Never
  - replicas: 1
    name: "evaluator"
    type: evaluator
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label evaluator xxx
        restartPolicy: Never
  - replicas: 2
    name: "collector"
    type: collector
    template:
      metadata:
        name: di
      spec:
        containers:
        - image: registry.sensetime.com/xlab/ding:nightly
          imagePullPolicy: IfNotPresent
          name: pydi
          env:
          - name: NCCL_DEBUG
            value: "INFO"
          command: ["/bin/bash", "-c",]
          args: 
          - |
            ditask --label collector xxx
        restartPolicy: Never
status:
  conditions:
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job created.
    reason: JobPending
    status: "False"
    type: Pending
  - lastTransitionTime: "2022-05-26T07:25:11Z"
    lastUpdateTime: "2022-05-26T07:25:11Z"
    message: job is starting since all pods are created.
    reason: JobStarting
    status: "False"
    type: Starting
  phase: Starting
  profilings: {}
  readyReplicas: 0
  replicas: 4
  taskStatus:
    learner:
      Pending: 1
    evaluator:
      Pending: 1
    collector:
      Pending: 2
  reschedules: 0
  restarts: 0

task definition:

type Task struct {
	Name string `json:"name,omitempty"`

	Type TaskType `json:"type,omitempty"`

	Replicas int32 `json:"replicas,omitempty"`

	Template corev1.PodTemplateSpec `json:"template,omitempty"`
}

type TaskType string

const (
	TaskTypeLearner TaskType = "learner"

	TaskTypeCollector TaskType = "collector"

	TaskTypeEvaluator TaskType = "evaluator"

	TaskTypeNone TaskType = "none"
)
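
As a rough illustration of how the task definition above could be used, the following Go sketch applies the default type to tasks that leave it empty and sums the replicas across tasks. The reduced Task struct (without the pod template) and the helper names normalize and totalReplicas are assumptions made for this example, not part of di-orchestrator.

```go
package main

import "fmt"

// TaskType mirrors the string type defined above.
type TaskType string

const (
	TaskTypeLearner   TaskType = "learner"
	TaskTypeCollector TaskType = "collector"
	TaskTypeEvaluator TaskType = "evaluator"
	TaskTypeNone      TaskType = "none"
)

// Task is a reduced stand-in for the struct above; the pod template is
// omitted so the sketch needs no Kubernetes dependency.
type Task struct {
	Name     string
	Type     TaskType
	Replicas int32
}

// normalize returns a copy of tasks in which an empty type is replaced by
// the default, TaskTypeNone.
func normalize(tasks []Task) []Task {
	out := make([]Task, len(tasks))
	for i, t := range tasks {
		if t.Type == "" {
			t.Type = TaskTypeNone
		}
		out[i] = t
	}
	return out
}

// totalReplicas sums the replicas requested by all tasks.
func totalReplicas(tasks []Task) int32 {
	var total int32
	for _, t := range tasks {
		total += t.Replicas
	}
	return total
}

func main() {
	// The three tasks from the sample job: 1 learner, 1 evaluator, 2 collectors.
	tasks := normalize([]Task{
		{Name: "learner", Type: TaskTypeLearner, Replicas: 1},
		{Name: "evaluator", Type: TaskTypeEvaluator, Replicas: 1},
		{Name: "collector", Type: TaskTypeCollector, Replicas: 2},
	})
	for _, t := range tasks {
		fmt.Printf("%s type=%s replicas=%d\n", t.Name, t.Type, t.Replicas)
	}
	fmt.Println("total replicas:", totalReplicas(tasks))
}
```

For the sample job the sum is 4, which agrees with the replicas value in the sample status. How the operator actually computes that field is not shown here.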

status.taskStatus definition:

type DIJobStatus struct {
	// Phase defines the observed phase of the job
	// +kubebuilder:default=Pending
	Phase Phase `json:"phase,omitempty"`

	// ...

	// TaskStatus maps each task name (task.name) to the status of that task
	TaskStatus map[string]TaskStatus `json:"taskStatus,omitempty"`

	// ...
}

// TaskStatus counts the pods of a task by pod phase
type TaskStatus map[corev1.PodPhase]int32
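
To make the shape of status.taskStatus concrete, here is a minimal Go sketch that groups pods by task name and counts them per phase. PodPhase is a local stand-in for corev1.PodPhase, and podInfo and buildTaskStatus are hypothetical names introduced for this example; the operator's real bookkeeping may differ.

```go
package main

import "fmt"

// PodPhase stands in for corev1.PodPhase so the sketch has no Kubernetes dependency.
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

// TaskStatus counts pods per phase, as in the definition above.
type TaskStatus map[PodPhase]int32

// podInfo is a hypothetical, reduced view of a pod: the task it belongs to
// and its current phase.
type podInfo struct {
	TaskName string
	Phase    PodPhase
}

// buildTaskStatus groups pods by task name and counts them per phase.
func buildTaskStatus(pods []podInfo) map[string]TaskStatus {
	status := make(map[string]TaskStatus)
	for _, p := range pods {
		if status[p.TaskName] == nil {
			status[p.TaskName] = make(TaskStatus)
		}
		status[p.TaskName][p.Phase]++
	}
	return status
}

func main() {
	pods := []podInfo{
		{TaskName: "learner", Phase: PodPending},
		{TaskName: "evaluator", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
		{TaskName: "collector", Phase: PodPending},
	}
	for name, ts := range buildTaskStatus(pods) {
		fmt.Println(name, ts)
	}
}
```

Keying the outer map by task name rather than task type keeps two tasks of the same type distinguishable in the status.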

enhancement

opened by konnase 1

new version for di-engine new architecture
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 1
v0.2.0
[x] split webhook and operator

[x] add dockerfile.dev

[x] update CleanPolicyALL to CleanPolicyAll

[x] remove k8s service related operations from server, and operator is responsible for managing services

[x] add e2e test

enhancement
opened by konnase 1
refactor job spec
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

enhancement
opened by konnase 0
Release/v1.0
release notes

features

v1.0.0 for DI-engine new architecture

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface

enhancement
opened by konnase 0
fix: job fails to submit when collector/learner is missing

Job submission failed when collector/learner was missing: the webhook created an empty dijob, and the golang builder added default values to some fields of collector/learner, which resulted in an invalid type error. Solved by making coordinator/collector/learner pointers.
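
The general Go behaviour behind this kind of fix can be sketched as follows. With encoding/json, a struct-valued field is always emitted even when tagged omitempty, whereas a nil pointer is omitted. RoleSpec, byValue and byPointer are hypothetical types for illustration only; the actual DIJob types and the defaulting described above are not reproduced here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RoleSpec is a hypothetical stand-in for a collector/learner spec.
type RoleSpec struct {
	Replicas int32 `json:"replicas,omitempty"`
}

// byValue holds the role as a value: encoding/json never treats a struct
// value as empty, so "collector" is always emitted.
type byValue struct {
	Collector RoleSpec `json:"collector,omitempty"`
}

// byPointer holds the role as a pointer: a nil pointer is omitted entirely.
type byPointer struct {
	Collector *RoleSpec `json:"collector,omitempty"`
}

// encode marshals v to a JSON string and panics on failure.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(encode(byValue{}))   // {"collector":{}}
	fmt.Println(encode(byPointer{})) // {}
}
```

With pointers, an absent component stays absent after a round trip instead of turning into an empty object that later processing has to interpret.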
bug

opened by konnase 0
Feat/job create event
add event handler for dijob, and mark job as Created when job submitted

mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted

version -> v0.2.1

enhancement
opened by konnase 0
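Some questions about allocation

1. In the current allocator logic, the initial allocation of a non-preemptible job only uses minreplicas to modify the replicas attribute. Is the node on which the job's pods are placed then decided entirely by K8S? Also, the initial-allocation part for non-preemptible jobs in allocator.go of the Release1.13 code does not seem to be written yet.
2. What exactly does it mean for a job to be preemptible? Is it equivalent to whether the job can be scheduled?
3. The Allocate and Optimize methods of the FitPolicy scheduling policy are not implemented either. When will this part be added?
4. The documentation differs from the latest code in many places. For example, the DIJob.Spec.Group attribute has been removed from the code, and the job.spec.minreplicas attribute mentioned in the documentation does not exist in the code either; it is in JobInfo instead. Could the documentation be updated? Thanks!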

opened by RZ-Q 3

Releases(v1.1.3)

v1.1.3(Aug 22, 2022)
bug fixes

judge which task a pod belongs to according to task name instead of task type (https://github.com/opendilab/DI-orchestrator/pull/27)

v1.1.2(Jul 21, 2022)
bugs fix

global cmd flag error(https://github.com/opendilab/DI-orchestrator/pull/23)

wrong pod subdomain(https://github.com/opendilab/DI-orchestrator/pull/24)

incorrect to get global rank(https://github.com/opendilab/DI-orchestrator/pull/25)

v1.1.1(Jul 4, 2022)
update status replicas and task status

add volumes to job spec

update status CompletionTimestamp when job completed

see details in https://github.com/opendilab/DI-orchestrator/pull/22
v1.1.0(Jun 30, 2022)
refactor job spec definition and add spec.tasks to support multi tasks #20

add DI_RANK to pod env and remove engineFields in job.spec #16

add e2e test

add validator to validate the correctness of dijob spec

change job.phase to Pending when job replicas scaled to 0

implement a processor to process di-server requests

refactor project structure

see details in https://github.com/opendilab/DI-orchestrator/pull/21
v1.0.0(Mar 23, 2022)
features

remove webhook

manage commands with cobra

refactor orchestrator architecture inspired from adaptdl

use gin to rewrite di-server

update di-server http interface see https://github.com/opendilab/DI-orchestrator/pull/18

v0.2.2(Dec 15, 2021)
bug fix

resolve the bug where a job failed to submit when collector/learner was missing (https://github.com/opendilab/DI-orchestrator/pull/14)

v0.2.1(Oct 12, 2021)
feature

add event handler for dijob, and mark job as Created when job submitted(https://github.com/opendilab/DI-orchestrator/pull/13)

mark collector and learner as optional, only coordinator is required(https://github.com/opendilab/DI-orchestrator/pull/13/commits/653e64af01ec7752b08d4bf8381738d566fca224)

mark job Failed when the submitted job is incorrect(https://github.com/opendilab/DI-orchestrator/pull/13/commits/bea840a5eee3508be18b53b325168a5647daff94), but it's hard to test since client-go reflector decodes DIJob strictly, we have no chance to handle DIJob add event when incorrect job submitted

v0.2.0(Sep 28, 2021)
change orchestrator image repository

version -> v0.2.0

v0.2.0-rc.0(Sep 6, 2021)
split webhook and operator

add dockerfile.dev

update CleanPolicyALL to CleanPolicyAll

remove k8s service related operations from server, and operator is responsible for managing services

add e2e test

v0.1.0(Jul 8, 2021)
Features

Define DIJob CRD to support DI jobs' submission

Define AggregatorConfig CRD to support aggregator definition

Add webhook to validate DIJob submission

Provide http service for DI jobs to request for DI modules

Docs to introduce DI-orchestrator architecture


Owner

OpenDILab

Open sourced Decision Intelligence (DI)
