Machine Learning Platform for Kubernetes

Last update: Dec 23, 2022

Overview

Reproduce, Automate, Scale your data science.

Welcome to Polyaxon, a platform for building, training, and monitoring large-scale deep learning applications. We are building a system to solve reproducibility, automation, and scalability for machine learning applications.

Polyaxon deploys into any data center or cloud provider, or can be hosted and managed by Polyaxon. It supports all the major deep learning frameworks, such as TensorFlow, MXNet, Caffe, and Torch.

Polyaxon makes it faster, easier, and more efficient to develop deep learning applications by managing workloads with smart container and node management. It also turns GPU servers into shared, self-service resources for your team or organization.

Install

TL;DR;

Install CLI

# Install Polyaxon CLI
$ pip install -U polyaxon

Create a deployment

# Create a namespace
$ kubectl create namespace polyaxon

# Add Polyaxon charts repo
$ helm repo add polyaxon https://charts.polyaxon.com

# Deploy Polyaxon
$ polyaxon admin deploy -f config.yaml

# Access API
$ polyaxon port-forward

Please check the Polyaxon installation guide.

Quick start

TL;DR;

Start a project

# Create a project
$ polyaxon project create --name=quick-start --description='Polyaxon quick start.'

Train and track logs & resources

# Upload code and start experiments
$ polyaxon run -f experiment.yaml -l

Dashboard

# Start Polyaxon dashboard
$ polyaxon dashboard

Dashboard page will now open in your browser. Continue? [Y/n]: y

Notebook

# Start Jupyter notebook for your project
$ polyaxon run --hub notebook

Tensorboard

# Start TensorBoard for a run's output
$ polyaxon run --hub tensorboard --run-uuid=UUID

Please check our quick start guide to start training your first experiment.

Distributed job

Polyaxon supports and simplifies distributed jobs. Depending on the framework you are using, you need to deploy the corresponding operator, adapt your code to enable the distributed training, and update your polyaxonfile.

Here are some examples of using distributed training:

Hyperparameters tuning

Polyaxon has a concept called experiment groups, very similar to Google Vizier, for suggesting hyperparameters and managing their results. An experiment group in Polyaxon defines a search algorithm, a search space, and a model to train.
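
To make the three ingredients concrete, here is a minimal Python sketch, assuming a plain grid search. The names search_space, train, and grid_search are hypothetical and are not part of the Polyaxon API; in Polyaxon the same information would be declared in a polyaxonfile rather than written as code.

```python
import itertools

# Hypothetical search space: each key is a hyperparameter, each value its candidates.
search_space = {
    "learning_rate": [0.01, 0.1],
    "dropout": [0.2, 0.5],
}


def train(params):
    """Stand-in for a model to train; returns a made-up loss to minimize."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["dropout"] - 0.2) ** 2


def grid_search(space, objective):
    """Search algorithm: evaluate every combination and keep the lowest score."""
    names = sorted(space)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        params = dict(zip(names, values))
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(search_space, train)
print(best_params, best_score)
```

Swapping grid_search for a random or Bayesian strategy changes only the search algorithm; the search space and the model stay the same.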

Parallel executions

You can run your processing or model training jobs in parallel; Polyaxon provides a mapping abstraction to manage concurrent jobs.
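
As a rough local analogy, and not Polyaxon's own implementation, mapping one job function over a list of parameter sets with a concurrency limit can be sketched with the Python standard library. The names run_job and param_sets are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(params):
    """Stand-in for one processing or training job."""
    return params["x"] * params["x"]


# Hypothetical parameter sets, one job per entry.
param_sets = [{"x": 1}, {"x": 2}, {"x": 3}, {"x": 4}]

# max_workers plays the role of a concurrency limit: at most two jobs run at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in the order of the inputs.
    results = list(pool.map(run_job, param_sets))

print(results)
```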

DAGs and workflows

Polyaxon DAGs is a tool that provides a container-native engine for running machine learning pipelines. A DAG manages multiple operations with dependencies. Each operation is defined by a component runtime. This means that operations in a DAG can be jobs, services, distributed jobs, parallel executions, or nested DAGs.
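
The idea of operations with dependencies can be illustrated with a topological sort. This is only a sketch using Python's standard graphlib module (Python 3.9 or later); the operation names are hypothetical and the code says nothing about how Polyaxon schedules a DAG internally.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Hypothetical pipeline: each operation maps to the set of operations it depends on.
dag = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
    "report": {"train", "evaluate"},
}

# static_order() yields each operation only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```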

Architecture

Documentation

Check out our documentation to learn more about Polyaxon.

Dashboard

Polyaxon comes with a dashboard that shows the projects and experiments created by you and your team members.

To start the dashboard, run the following command in your terminal:

$ polyaxon dashboard -y

Project status

Polyaxon is stable and it's running in production mode at many startups and Fortune 500 companies.

Contributions

Please follow the contribution guidelines: Contribute to Polyaxon.

Research

If you use Polyaxon in your academic research, we would be grateful if you could cite it.

Feel free to contact us; we would love to learn about your project and see how we can support your custom needs.

Comments

Tensorboard error for the quick-start example

Describe the bug

I'm running the examples from the quick-start guide and when I tried to start Tensorboard I got the error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 316, in create_or_update_deployment
    return self.create_deployment(name=name, body=body), True
  File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 302, in create_deployment
    namespace=self.namespace, body=body
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 175, in create_namespaced_deployment
    (data) = self.create_namespaced_deployment_with_http_info(namespace, body, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 266, in create_namespaced_deployment_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 266, in POST
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '374'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot create resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"group":"extensions","kind":"deployments"},"code":403}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 319, in create_or_update_deployment
    return self.update_deployment(name=name, body=body), False
  File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 309, in update_deployment
    name=name, namespace=self.namespace, body=body
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 4089, in patch_namespaced_deployment
    (data) = self.patch_namespaced_deployment_with_http_info(name, namespace, body, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/apis/extensions_v1beta1_api.py", line 4189, in patch_namespaced_deployment_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 334, in call_api
    _return_http_data_only, collection_formats, _preload_content, _request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 168, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 393, in request
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 286, in PATCH
    body=body)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 222, in request
    raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '484'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions \"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4\" is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot patch resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"name":"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4","group":"extensions","kind":"deployments"},"code":403}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/polyaxon/polyaxon/scheduler/tensorboard_scheduler.py", line 53, in start_tensorboard
    reconcile_url=get_tensorboard_reconcile_url(tensorboard.unique_name))
  File "/polyaxon/polyaxon/polypod/tensorboard.py", line 234, in start_tensorboard
    reraise=True)
  File "/usr/local/lib/python3.7/site-packages/polyaxon_k8s/manager.py", line 322, in create_or_update_deployment
    raise PolyaxonK8SError(e)
polyaxon_k8s.exceptions.PolyaxonK8SError: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Tue, 21 Jan 2020 17:03:28 GMT', 'Content-Length': '484'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"deployments.extensions \"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4\" is forbidden: User \"system:serviceaccount:polyaxon:polyaxon-polyaxon-serviceaccount\" cannot patch resource \"deployments\" in API group \"extensions\" in the namespace \"polyaxon\"","reason":"Forbidden","details":{"name":"plx-tensorboard-5aa275f671f64a75924c66323cb0e6a4","group":"extensions","kind":"deployments"},"code":403}
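
When triaging a failure like this one, it can help to pull the denied API group and resource out of the Status body instead of reading the whole traceback. The following is a minimal sketch; the function name summarize_forbidden is hypothetical, and the sample body is a copy of the first response above with the message shortened.

```python
import json

# Copy of the first Status body from the traceback, with the message shortened.
status_body = (
    '{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",'
    '"message":"deployments.extensions is forbidden",'
    '"reason":"Forbidden",'
    '"details":{"group":"extensions","kind":"deployments"},"code":403}'
)


def summarize_forbidden(body):
    """Extract the fields that say what was denied from a Kubernetes Status body."""
    status = json.loads(body)
    details = status.get("details", {})
    return {
        "code": status.get("code"),
        "reason": status.get("reason"),
        "group": details.get("group"),
        "resource": details.get("kind"),
    }


summary = summarize_forbidden(status_body)
print(summary)
```

Here the summary points at the deployments resource in the extensions API group, which is exactly what the service account is reported as not allowed to create or patch.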

To Reproduce

$ git clone https://github.com/polyaxon/polyaxon-quick-start.git
$ # run create, init, etc.
$ polyaxon run -f polyaxonfile_hyperparams.yml
$ # wait..
$ polyaxon tensorboard -g 1 start

Expected behavior

No error.

Environment

Kubernetes 1.17 using Kubeadm on a local cluster.

Let me know if you need more info.

bug area/helm-charts

opened by vakker 24

Expose configmaps/secrets to build environment

Hey, I was wondering if I could expose configmaps or secrets to build jobs as well. What I'm trying to do is add some custom apt sources along with a client cert in order to install some internal packages as dependencies. Currently we work around this by installing some packages at runtime.

opened by Mofef 22

No nodes in cluster and experiments fail to build

I deployed Polyaxon on Minikube (Mac) and am trying to run experiments using the polyaxon quickstart repo (https://github.com/polyaxon/polyaxon-quick-start.git). However, the experiment build keeps failing, and running 'polyaxon cluster' shows no nodes:

Cluster info:

major           1
minor           10
compiler        gc
platform        linux/amd64
build_date      2018-03-26T16:44:10Z
git_commit      fc32d2f3698e36b93322a3465f63a14e9f0eaead
go_version      go1.9.3
git_version     v1.10.0
git_tree_state  clean

When I run 'kubectl get pods --all-namespaces', this is the output:

NAMESPACE     NAME                                            READY   STATUS    RESTARTS   AGE
kube-system   coredns-c4cffd6dc-42gcs                         1/1     Running   0          23h
kube-system   etcd-minikube                                   1/1     Running   0          23h
kube-system   kube-addon-manager-minikube                     1/1     Running   0          23h
kube-system   kube-apiserver-minikube                         1/1     Running   0          23h
kube-system   kube-controller-manager-minikube                1/1     Running   0          23h
kube-system   kube-dns-86f4d74b45-652fq                       3/3     Running   0          23h
kube-system   kube-proxy-npxr5                                1/1     Running   0          23h
kube-system   kube-scheduler-minikube                         1/1     Running   0          23h
kube-system   kubernetes-dashboard-6f4cfc5d87-p2z4j           1/1     Running   0          23h
kube-system   storage-provisioner                             1/1     Running   0          23h
kube-system   tiller-deploy-778f674bf5-xhmsv                  1/1     Running   0          23h
polyaxon      polyaxon-docker-registry-78d5499fc9-4wm69       1/1     Running   0          5h
polyaxon      polyaxon-polyaxon-api-7b97bb447d-jl6h6          2/2     Running   0          5h
polyaxon      polyaxon-polyaxon-beat-77fb6cccc7-lmdhw         2/2     Running   0          5h
polyaxon      polyaxon-polyaxon-events-79c8ff59d9-2rqcq       1/1     Running   0          5h
polyaxon      polyaxon-polyaxon-hpsearch-9b5589f5-874n5       1/1     Running   0          5h
polyaxon      polyaxon-polyaxon-k8s-events-697cf8bb65-mnjz8   1/1     Running   0          5h
polyaxon      polyaxon-polyaxon-logs-7bf467999-b8755          1/1     Running   0          5h
polyaxon      polyaxon-polyaxon-monitors-57db4f7cd7-7x2j5     2/2     Running   0          5h
polyaxon      polyaxon-polyaxon-resources-glgwq               1/1     Running   0          5h
polyaxon      polyaxon-polyaxon-scheduler-76ccf9d665-xb9bg    1/1     Running   0          5h
polyaxon      polyaxon-postgresql-78d4cff55c-jhcvz            1/1     Running   0          5h
polyaxon      polyaxon-rabbitmq-6448d76c84-vp5ll              1/1     Running   0          5h
polyaxon      polyaxon-redis-688468649b-tg6qp                 1/1     Running   0          5h

I have also tried running 'helm update' and upgraded polyaxon to the latest release (0.3.2). How can I troubleshoot this?

opened by jonathanlimsc 21

deleted flagged missed in initialization

Describe the bug

Getting this error with version 1.1.9

To reproduce

polyaxon upgrade && polyaxon run -f poylaxonfile

Expected behavior

Run completed

Environment

polyaxon 1.1.9

question not-reproducible

opened by zeyaddeeb 20

Scheduling many jobs at the same time leads to zombie state jobs (possible race condition?)

Describe the bug

It's hard to consistently reproduce, but when scheduling many jobs such that their builds happen at the same time, it seems like we can get the following scenario: K8s correctly schedules the pods according to their requests/limits and the available resources. Polyaxon, however, believes that some jobs are running although they are unschedulable by K8s.

When resources are freed up quickly enough, K8s actually schedules those jobs and nothing else happens. However, if resources are blocked long enough, Polyaxon's heartbeat service will automatically stop these jobs (that it believes are running although they are unschedulable by K8s) and fail them.

To me, this could be a critical bug in the scheduler and really seems like some kind of race condition. I haven't tested it with multiple users, but I assume this would occur if many users submit different jobs at the same time (a likely scenario).

To Reproduce

Create a job with a fairly large build and long running time (>2000 seconds).

Make sure that only two of these jobs can run on the cluster at a time (by requesting resources accordingly).

Run this job many times with polyaxon run -f polyaxonfile.yml (submit this command again as soon as it terminates and repeat 5 times)

Expected behavior

The jobs should just be recognized as unschedulable and scheduled when the resources become available again.

Environment

Polyaxon 0.5.6, Kubernetes 1.15.4

opened by MatthiasKohl 20

Can't use TPU

Describe the bug

I tried to use Cloud TPU, but got the error below in Stackdriver Logging, and the experiment failed. It seems that we need to specify the TensorFlow version with an annotation.

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: admission webhook \"pod-init.cloud-tpus.google.com\" denied the request: TensorFlow version must be specified in annotation \"tf-version.cloud-tpus.google.com\" for pod requesting Cloud TPUs","reason":"InternalError","details":{"causes":[{"message":"admission webhook \"pod-init.cloud-tpus.google.com\" denied the request: TensorFlow version must be specified in annotation \"tf-version.cloud-tpus.google.com\" for pod requesting Cloud TPUs"}]},"code":500}
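
To illustrate what the webhook is asking for, here is a minimal sketch that adds the annotation named in the error message to a pod manifest represented as a Python dict. The helper with_tf_version, the pod name, and the container name are hypothetical, and the value format "1.12" is an assumption; this is not how Polyaxon builds its pods.

```python
import copy

# Annotation key copied from the error message above.
TPU_TF_ANNOTATION = "tf-version.cloud-tpus.google.com"


def with_tf_version(pod_manifest, tf_version):
    """Return a copy of the pod manifest carrying the TensorFlow version annotation."""
    pod = copy.deepcopy(pod_manifest)
    annotations = pod.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[TPU_TF_ANNOTATION] = tf_version
    return pod


# Hypothetical minimal pod manifest; name and container are placeholders.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "tpu-experiment"},
    "spec": {"containers": [{"name": "job", "image": "tensorflow/tensorflow:1.12.0"}]},
}

annotated = with_tf_version(pod, "1.12")  # value format is an assumption
print(annotated["metadata"]["annotations"])
```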

To Reproduce

YAML

---
version: 1

kind: experiment

environment:
  resources:
    cpu:
      requests: 4
      limits: 4
    memory:
      requests: 15000
      limits: 15000
    tpu:
      requests: 8
      limits: 8

build:
  image: tensorflow/tensorflow:1.12.0
  build_steps:
    - pip install --no-cache-dir -r requirements.txt

run:
  # this is just a dummy python file.
  cmd: python test.py

requirements.txt

polyaxon-client==0.3.8
polyaxon-cli==0.3.8
jupyter
google-cloud-storage

Expected behavior

We can create a TPU.

Environment

Polyaxon: 0.3.8

Links

https://cloud.google.com/tpu/docs/kubernetes-engine-setup
https://github.com/tensorflow/tpu/blob/master/models/official/resnet/resnet_k8s.yaml#L28

bug

opened by yu-iskw 20

Deploying on Kubernetes cluster created w/ Kubespray

Hi -

I'm trying to spin up a Kubernetes cluster without the benefit of a managed service like EKS or GKE, then deploy Polyaxon on that cluster. Currently I'm experiencing some issues on the Polyaxon side of this process.

To deploy the Kubernetes cluster I'm using kubespray. I'm able to deploy the cluster to the point that kubectl get nodes shows the expected nodes in a ready state, and I'm able to deploy a simple Node.js app as a test. I am not, however, able to successfully install Polyaxon on the cluster.

I've tried on both AWS and on my local machine using Vagrant/Virtualbox. The issues I'm experiencing are different between the two cases, which I find interesting, so I'll document both.

AWS

I deployed Kubernetes by loosely following this tutorial. Things went smoothly for the most part, except that I needed to deal with this issue using this fix. I used 3 t2.large instances running Ubuntu 16.04 and the standard kubespray configuration.

As I mentioned above, I get the expected output from kubectl get nodes, and I'm able to deploy the Node.js app at the end of the tutorial.

At first, the Polyaxon installation/deployment also seems to succeed:

[email protected]:~$ helm install polyaxon/polyaxon \
> --name=polyaxon \
> --namespace=polyaxon \
> -f polyaxon_config.yml
NAME:   polyaxon
LAST DEPLOYED: Sat Feb  9 00:03:29 2019
NAMESPACE: polyaxon
STATUS: DEPLOYED

RESOURCES:
==> v1/Secret
NAME                             TYPE    DATA  AGE
polyaxon-docker-registry-secret  Opaque  1     3m4s
polyaxon-postgresql              Opaque  1     3m4s
polyaxon-rabbitmq                Opaque  2     3m4s
polyaxon-polyaxon-secret         Opaque  4     3m4s

==> v1/ConfigMap
NAME                      DATA  AGE
redis-config              1     3m4s
polyaxon-polyaxon-config  141   3m4s

==> v1beta1/ClusterRole
NAME                           AGE
polyaxon-polyaxon-clusterrole  3m4s

==> v1beta1/DaemonSet
NAME                         DESIRED  CURRENT  READY  UP-TO-DATE  AVAILABLE  NODE SELECTOR  AGE
polyaxon-polyaxon-resources  2        2        2      2           2          <none>         3m4s

==> v1beta1/Deployment
NAME                          DESIRED  CURRENT  UP-TO-DATE  AVAILABLE  AGE
polyaxon-docker-registry      1        1        1           1          3m4s
polyaxon-postgresql           1        1        1           1          3m4s
polyaxon-rabbitmq             1        1        1           1          3m4s
polyaxon-redis                1        1        1           1          3m4s
polyaxon-polyaxon-api         1        1        1           0          3m4s
polyaxon-polyaxon-beat        1        1        1           1          3m4s
polyaxon-polyaxon-events      1        1        1           1          3m4s
polyaxon-polyaxon-hpsearch    1        1        1           1          3m4s
polyaxon-polyaxon-k8s-events  1        1        1           1          3m4s
polyaxon-polyaxon-monitors    1        1        1           1          3m4s
polyaxon-polyaxon-scheduler   1        1        1           1          3m3s

==> v1/Pod(related)
NAME                                           READY  STATUS   RESTARTS  AGE
polyaxon-polyaxon-resources-hpbcv              1/1    Running  0         3m4s
polyaxon-polyaxon-resources-m7bjv              1/1    Running  0         3m4s
polyaxon-docker-registry-58bff6f777-vkl6h      1/1    Running  0         3m4s
polyaxon-postgresql-f4fc68c67-25t4p            1/1    Running  0         3m4s
polyaxon-rabbitmq-74c5d87cf6-qlk2b             1/1    Running  0         3m4s
polyaxon-redis-6f7db88668-99qvw                1/1    Running  0         3m4s
polyaxon-polyaxon-api-75c5989cb4-ppv7t         1/2    Running  0         3m4s
polyaxon-polyaxon-beat-759d6f9f96-qdhmd        2/2    Running  0         3m3s
polyaxon-polyaxon-events-86f49f8b78-vvscx      1/1    Running  0         3m4s
polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms    1/1    Running  0         3m3s
polyaxon-polyaxon-k8s-events-555f6c8754-c242k  1/1    Running  0         3m3s
polyaxon-polyaxon-monitors-864dd8fb67-h7s47    2/2    Running  0         3m2s
polyaxon-polyaxon-scheduler-7f4978774d-pm9xz   1/1    Running  0         3m2s

==> v1/ServiceAccount
NAME                                      SECRETS  AGE
polyaxon-polyaxon-serviceaccount          1        3m4s
polyaxon-polyaxon-workers-serviceaccount  1        3m4s

==> v1beta1/ClusterRoleBinding
NAME                                   AGE
polyaxon-polyaxon-clusterrole-binding  3m4s

==> v1beta1/Role
NAME                            AGE
polyaxon-polyaxon-role          3m4s
polyaxon-polyaxon-workers-role  3m4s

==> v1beta1/RoleBinding
NAME                                    AGE
polyaxon-polyaxon-role-binding          3m4s
polyaxon-polyaxon-workers-role-binding  3m4s

==> v1/Service
NAME                      TYPE          CLUSTER-IP     EXTERNAL-IP  PORT(S)                                AGE
polyaxon-docker-registry  NodePort      10.233.42.186  <none>       5000:31813/TCP                         3m4s
polyaxon-postgresql       ClusterIP     10.233.17.56   <none>       5432/TCP                               3m4s
polyaxon-rabbitmq         ClusterIP     10.233.33.173  <none>       4369/TCP,5672/TCP,25672/TCP,15672/TCP  3m4s
polyaxon-redis            ClusterIP     10.233.31.108  <none>       6379/TCP                               3m4s
polyaxon-polyaxon-api     LoadBalancer  10.233.36.234  <pending>    80:32050/TCP,1337:31832/TCP            3m4s

After a few minutes all the expected pods are running:

[email protected]:~$ kubectl get pods --namespace polyaxon
NAME                                            READY   STATUS    RESTARTS   AGE
polyaxon-docker-registry-58bff6f777-vkl6h       1/1     Running   0          3m49s
polyaxon-polyaxon-api-75c5989cb4-ppv7t          1/2     Running   0          3m49s
polyaxon-polyaxon-beat-759d6f9f96-qdhmd         2/2     Running   0          3m48s
polyaxon-polyaxon-events-86f49f8b78-vvscx       1/1     Running   0          3m49s
polyaxon-polyaxon-hpsearch-5f77c8d6cd-gkdms     1/1     Running   0          3m48s
polyaxon-polyaxon-k8s-events-555f6c8754-c242k   1/1     Running   0          3m48s
polyaxon-polyaxon-monitors-864dd8fb67-h7s47     2/2     Running   0          3m47s
polyaxon-polyaxon-resources-hpbcv               1/1     Running   0          3m49s
polyaxon-polyaxon-resources-m7bjv               1/1     Running   0          3m49s
polyaxon-polyaxon-scheduler-7f4978774d-pm9xz    1/1     Running   0          3m47s
polyaxon-postgresql-f4fc68c67-25t4p             1/1     Running   0          3m49s
polyaxon-rabbitmq-74c5d87cf6-qlk2b              1/1     Running   0          3m49s
polyaxon-redis-6f7db88668-99qvw                 1/1     Running   0          3m49s

The issue in this case arises with the LoadBalancer IP, which remains suspended in a pending state:

[email protected]:~$ kubectl get --namespace polyaxon svc -w polyaxon-polyaxon-api
NAME                    TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                       AGE
polyaxon-polyaxon-api   LoadBalancer   10.233.52.219   <pending>     80:30684/TCP,1337:31886/TCP   13h

[email protected]:~$ kubectl get svc --namespace polyaxon polyaxon-polyaxon-api -o json
{
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "creationTimestamp": "2019-02-09T01:03:11Z",
        "labels": {
            "app": "polyaxon-polyaxon-api",
            "chart": "polyaxon-0.3.8",
            "heritage": "Tiller",
            "release": "polyaxon",
            "role": "polyaxon-api",
            "type": "polyaxon-core"
        },
        "name": "polyaxon-polyaxon-api",
        "namespace": "polyaxon",
        "resourceVersion": "17172",
        "selfLink": "/api/v1/namespaces/polyaxon/services/polyaxon-polyaxon-api",
        "uid": "78640925-2c06-11e9-8f3f-121248b9afae"
    },
    "spec": {
        "clusterIP": "10.233.52.219",
        "externalTrafficPolicy": "Cluster",
        "ports": [
            {
                "name": "api",
                "nodePort": 30684,
                "port": 80,
                "protocol": "TCP",
                "targetPort": 80
            },
            {
                "name": "streams",
                "nodePort": 31886,
                "port": 1337,
                "protocol": "TCP",
                "targetPort": 1337
            }
        ],
        "selector": {
            "app": "polyaxon-polyaxon-api"
        },
        "sessionAffinity": "None",
        "type": "LoadBalancer"
    },
    "status": {
        "loadBalancer": {}
    }
}

Looking through the Polyaxon issues, I see that this can happen on minikube, but I wasn't able to find anything that helps me debug my particular case. What are the conditions that need to be met in the Kubernetes deployment, in order for the LoadBalancer IP step to succeed?
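
For context, the pending EXTERNAL-IP shown by kubectl corresponds to the empty status.loadBalancer object in the JSON above; an address only appears there once something in the cluster (a cloud provider integration or another load balancer controller) assigns one. A minimal sketch for checking this from the JSON output follows. The function name external_addresses is hypothetical, and the "assigned" sample uses a documentation-only IP address.

```python
import json


def external_addresses(service):
    """Return the external IPs or hostnames assigned to a Service, if any."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    return [entry.get("ip") or entry.get("hostname") for entry in ingress]


# Trimmed from the kubectl JSON output above: nothing has been assigned yet.
pending = json.loads('{"spec": {"type": "LoadBalancer"}, "status": {"loadBalancer": {}}}')

# Hypothetical shape after an address has been assigned.
assigned = json.loads(
    '{"spec": {"type": "LoadBalancer"},'
    ' "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}}}'
)

print(external_addresses(pending))
print(external_addresses(assigned))
```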

Vagrant/Virtualbox

I was suspicious that my issues might be specific to the AWS environment, rather than a general issue with kubespray/polyaxon, so as a second test I tried deploying the Kubernetes cluster locally using Vagrant and Virtualbox. To do this I used the Vagrantfile in the kubespray repo as described here.

After debugging a couple kubespray issues, I was able to get the cluster up and running and deploy the Node.js app again.

Deploying Polyaxon, I again saw the issue w/ the LoadBalancer IP getting stuck in a pending state. What was interesting to me though, was that a number of pods actually failed to run as well, despite the fact that the deployment ostensibly succeeded:

[email protected]:~$ helm ls
NAME            REVISION        UPDATED                         STATUS          CHART           APP VERSION     NAMESPACE
polyaxon        1               Sat Feb  9 06:01:21 2019        DEPLOYED        polyaxon-0.3.8                  polyaxon

[email protected]:~$ kubectl get pods --namespace polyaxon
NAME                                           READY   STATUS    RESTARTS   AGE
polyaxon-docker-registry-58bff6f777-wlb9p      0/1     Pending   0          36m
polyaxon-polyaxon-api-6bc75ff4ff-v694k         0/2     Pending   0          36m
polyaxon-polyaxon-beat-744c96b9f8-mbz5j        0/2     Pending   0          36m
polyaxon-polyaxon-events-58d9c9cbd6-72skt      0/1     Pending   0          36m
polyaxon-polyaxon-hpsearch-dc9cf6556-8rh78     0/1     Pending   0          36m
polyaxon-polyaxon-k8s-events-9f8cdf5-fvqnx     0/1     Pending   0          36m
polyaxon-polyaxon-monitors-58766747c9-gcf2r    0/2     Pending   0          36m
polyaxon-polyaxon-resources-rnntm              1/1     Running   0          36m
polyaxon-polyaxon-resources-t4pv6              0/1     Pending   0          36m
polyaxon-polyaxon-resources-x9f42              0/1     Pending   0          36m
polyaxon-polyaxon-scheduler-76bfdcfcc7-d9tq4   0/1     Pending   0          36m
polyaxon-postgresql-f4fc68c67-lwgds            1/1     Running   0          36m
polyaxon-rabbitmq-74c5d87cf6-lhvj8             1/1     Running   0          36m
polyaxon-redis-6f7db88668-6wlgs                1/1     Running   0          36m

I'm not quite sure what's going on here. My best guess would be that the virtual machines don't have the necessary resources to run these pods? ... Would be interesting to hear the experts weigh in 😄.

Please help!

opened by jayleverett 20

polyaxon/polyaxon-api is start but no service on

docker log

Running...
Use default user
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
Restarting nginx: nginx.
nginx is running.
[uWSGI] getting INI configuration from web/uwsgi.nginx.ini
*** Starting uWSGI 2.0.18 (64bit) on [Tue Aug 18 08:34:22 2020] ***
compiled with version: 6.3.0 20170516 on 13 August 2020 13:15:05
os: Linux-4.18.0-193.el8.x86_64 #1 SMP Fri May 8 10:59:10 UTC 2020
nodename: polyaxon-polyaxon-api-5c8f885949-wjq9p
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 4
current working directory: /polyaxon
detected binary path: /usr/local/bin/uwsgi
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
chdir() to /polyaxon/web/..
your memory page size is 4096 bytes
detected max file descriptor number: 1048576
lock engine: pthread robust mutexes
thunder lock: enabled
uwsgi socket 0 bound to UNIX address /polyaxon/web/../web/polyaxon.sock fd 3
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
Python version: 3.7.6 (default, Jan  3 2020, 23:53:24)  [GCC 6.3.0 20170516]
Python main interpreter initialized at 0x5626c4254800
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 425960 bytes (415 KB) for 4 cores
*** Operational MODE: preforking ***
added /polyaxon/web/../polyaxon/ to pythonpath.
WSGI app 0 (mountpoint='') ready in 2 seconds on interpreter 0x5626c4254800 pid: 66 (default app)
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 66)
spawned uWSGI worker 1 (pid: 72, cores: 1)
spawned uWSGI worker 2 (pid: 73, cores: 1)
spawned uWSGI worker 3 (pid: 74, cores: 1)
spawned uWSGI worker 4 (pid: 75, cores: 1)

docker image

polyaxon/polyaxon-gateway                                        1.1.7                 a52bd2a3a36d        4 days ago          473MB
polyaxon/polyaxon-api                                            1.1.7                 dc1d59a6bff9        4 days ago          590MB
polyaxon/polyaxon-cli                                            1.1.7                 5ea8e132a2a0        4 days ago          419MB

kubectl --namespace=polyaxon get pod

NAME                                          READY   STATUS    RESTARTS   AGE
polyaxon-polyaxon-api-5c8f885949-wjq9p        0/1     Running   4          30m
polyaxon-polyaxon-gateway-77c4d46d4d-t85ww    1/1     Running   0          30m
polyaxon-polyaxon-operator-7f48b54676-mh48l   1/1     Running   0          30m
polyaxon-polyaxon-streams-7c4876dc54-jh2p6    1/1     Running   0          30m
polyaxon-postgresql-0                         1/1     Running   0          30m

helm version

Client: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.16.10", GitCommit:"bceca24a91639f045f22ab0f41e47589a932cf5e", GitTreeState:"clean"}

kubectl version

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2", GitCommit:"52c56ce7a8272c798dbc29846288d7cd9fbae032", GitTreeState:"clean", BuildDate:"2020-04-16T11:48:36Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}

question

opened by zhangchunsheng 19

Logs are not displayed correctly in terminal

Describe the bug

Unable to see the logs correctly. Unfortunately, the only things visible in the terminal are callback errors:

$ polyaxon experiment -xp X logs
building -- 
scheduled -- 
starting -- 
running -- 
error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
error from callback <function SocketTransportMixin.socket.<locals>.<lambda> at 0x7fd723146400>: the JSON object must be str, not 'bytes'
...
error from callback <bound method SocketTransportMixin._on_close of <polyaxon_client.transport.Transport object at 0x7fd723190978>>: _on_close() missing 1 required positional argument: 'ws'
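
The message "the JSON object must be str, not 'bytes'" is what json.loads raises on Python versions before 3.6 when it is handed bytes. A minimal sketch of a defensive parse is shown below; parse_message and the sample frame are hypothetical and are not taken from the polyaxon_client code.

```python
import json


def parse_message(message):
    """Decode bytes to str before parsing, so the call works on any Python 3."""
    if isinstance(message, (bytes, bytearray)):
        message = message.decode("utf-8")
    return json.loads(message)


# Hypothetical log frame, delivered as bytes.
frame = b'{"status": "running", "message": "step 1"}'
parsed = parse_message(frame)
print(parsed["status"])
```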

To Reproduce

Started an experiment with polyaxon run -u, then opened the log view with polyaxon experiment -xp X logs.

Experiment:

https://github.com/polyaxon/polyaxon-examples/tree/master/tensorflow/cifare10/polyaxonfile.yml

Expected behavior

Building -- creating image -
  master.1 -- INFO:tensorflow:Using config: {'_model_dir': '/outputs/root/cifar10/experiments/1', '_save_checkpoints_secs': 600, '_num_ps_replicas': 0, '_keep_checkpoint_max': 5, '_session_config': gpu_options {
  master.1 --   force_gpu_compatible: true
  master.1 -- }

Environment

Local

polyaxon is running within a virtualenv using python3.

Cluster

OS: Ubuntu 18.04
Kubernetes: 1.12.1

bug

opened by naetherm 19

"cluster-admin not found" error while installing polyaxon with helm

I am using minikube to set up a local single-node Kubernetes cluster. I have set up Helm as described in the docs, but when I try to deploy Polyaxon by following the docs, I get an error.

temp-training:~ shivam.m$ helm install --wait polyaxon/polyaxon
Error: release rousing-peahen failed: clusterroles.rbac.authorization.k8s.io "rousing-peahen-polyaxon-ingress-clusterrole" is forbidden: attempt to grant extra privileges: [PolicyRule{Resources:["configmaps"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["configmaps"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["endpoints"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["pods"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["secrets"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["secrets"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["nodes"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["get"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["list"]} PolicyRule{Resources:["services"], APIGroups:[""], Verbs:["watch"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["get"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["list"]} PolicyRule{Resources:["ingresses"], APIGroups:["extensions"], Verbs:["watch"]} PolicyRule{Resources:["events"], APIGroups:[""], Verbs:["create"]} PolicyRule{Resources:["events"], APIGroups:[""], Verbs:["patch"]} PolicyRule{Resources:["ingresses/status"], APIGroups:["extensions"], Verbs:["update"]}] user=&{system:serviceaccount:kube-system:tiller 8e197f15-1373-11e8-9b02-080027bbca2c [system:serviceaccounts system:serviceaccounts:kube-system system:authenticated] map[]} ownerrules=[] ruleResolutionErrors=[clusterroles.rbac.authorization.k8s.io "cluster-admin" not found]

I tried disabling RBAC and running it again, but then I get an error related to port allocation:

temp-training:~ shivam.m$ helm install --set=rbac.enabled=false polyaxon/polyaxon
Error: release mortal-gorilla failed: Service "mortal-gorilla-docker-registry" is invalid: spec.ports[0].nodePort: Invalid value: 31813: provided port is already allocated

bug

opened by codophobia 19

Unable to run experiments with v1.1.8

Describe the bug

Unable to run experiments with new version 1.1.8. "Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f168f918700>: Failed to establish a new connection: [Errno 111] Connection refused')" Seems to be from tracking.init()

Also when running polyaxon project ls (only the first time):

Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dbe0>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dc88>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f030fb6dd68>: Failed to establish a new connection: [Errno 101] Network is unreachable')': /api/v1/compatibility/cb08b595c6be5fe48fcbaf4860dd900c/1-1-8/cli
Could not connect to remote server to fetch compatibility versions.
Checking CLI compatibility version ...
Could get the min/latest versions from compatibility API.

However if I run it again it works as expected.

To Reproduce

version: 1.1
kind: component
name: simple-experiment
description: Minimum information to run this TF.Keras example
tags: [examples]
run:
  kind: job
  init:
  - git: {url: "https://github.com/polyaxon/polyaxon-quick-start"}
    container:
      env:
        - name: http_proxy
          value: "***"
        - name: https_proxy
          value: "***"
  container:
    image: polyaxon/polyaxon-quick-start
    workingDir: "{{ globals.artifacts_path }}/polyaxon-quick-start"
    command: [python3, model.py]
    env:
      - name: http_proxy
        value: "***"
      - name: https_proxy
        value: "***"

Expected behavior

A running experiment.

Environment

deploymentChart: platform
deploymentVersion: 1.1.8

artifactsStore:
  name: minio
  kind: s3
  schema: {"bucket": "***"}
  secret:
    name: "***"

connections:
  - name: data
    kind: volume_claim
    schema:
      mountPath: ***
      volumeClaim: ***
      readOnly: true

scheduler:
  enabled: true

streams:
  enabled: true

postgresql:
  persistence:
    enabled: true
    storageClass: nfs

redis:
  enabled: true
  master:
    persistence:
      enabled: true
      storageClass: nfs
  slave:
    persistence:
      enabled: true
      storageClass: nfs
broker: redis

rabbitmq-ha:
  enabled: false

ui:
  enabled: true
  adminEnabled: true

bug regression

opened by ONordander 17

Polyaxon Python API - RunClient `watch_logs()` alternate or parameter to stop its execution and return string

Hi,

Context: I have been running some experiments on EKS. It's working great, but my logs disappear after the run finishes. Also, while a run is executing, the pod disconnects after an arbitrary amount of time and the previous logs are lost. EKS/Polyaxon/MPI recovers the job's execution, and the launcher pod resumes training from where the disconnect happened.

Issue: I want to retain the logs of my runs. I am not yet able to use persistent volumes, which could be a solution. Instead I am trying to use the Polyaxon Python API, specifically RunClient and its get_logs() and watch_logs() methods. get_logs() does not return anything, and I think it is not intended for this. watch_logs() does show the logs, but it does not technically "return" anything: it appears to be a streaming function that writes to stdout on the console (Jupyter, shell). In my code I cannot capture the logs with it, because it keeps printing without stopping.

Question/Enhancement: Is there another way to get the logs through the Python API? Or could there be an alternative to watch_logs() that just returns the logs and then finishes? I intend to keep saving snapshots of the logs so that, even if a disconnection happens, I can join the log files later. I am open to any suggestions. FYI, I have tried the CLI too: polyaxon ops logs -f gives me encoding issues.
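
The "join the log files later" step described above could look roughly like the sketch below. It is independent of the Polyaxon API: it assumes each snapshot has already been saved as a list of log lines, and that a later snapshot may repeat the tail of the earlier one. The function name merge_snapshots is hypothetical.

```python
from typing import List


def merge_snapshots(previous: List[str], current: List[str]) -> List[str]:
    """Append `current` to `previous`, dropping lines that overlap.

    The overlap is the longest suffix of `previous` that is also a
    prefix of `current`. If no overlap exists, the snapshots are simply
    concatenated.
    """
    max_overlap = min(len(previous), len(current))
    for size in range(max_overlap, 0, -1):
        if previous[-size:] == current[:size]:
            return previous + current[size:]
    return previous + current
```

This only deduplicates exact line-for-line overlaps; snapshots taken after a gap in the stream are concatenated as-is, so the missing lines stay missing.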
question

opened by QaisarRajput 1
Errors related to uploading artifacts while tracking runs are silent
Current behavior

From slack:

Another question about logging metrics to a run through a local jupyter notebook. After our conversation on Nov. 30th, things were working fine. However, recently the dashboard has stopped displaying metrics again, and I'm seeing weird behaviour in polyaxon. I don't know what has changed. Looking for advice since I'm out of troubleshooting ideas. Details in the thread... Here's code I have that recreates the problem:

from polyaxon import tracking

tracking.init(
    owner="owner",
    project="project-name",
    name="test_run",
    run_uuid=None,
    is_new=True,
)
tracking.set_run_event_logger()
tracking.log_text(name="some_text_metric", text="some text")
for step in range(1, 100):
    tracking.log_metric(name="some_step_metric", value=step/2, step=step)
tracking.log_succeeded()
tracking.end()

After enabling debug logging with:

from polyaxon.logger import configure_logger
configure_logger(verbose=True)
...

It turns out that:

Thank you for that command there. After looking at the logging from that, I realized that my polyaxon cli host had switched from the url of the gateway deployed on our cluster to https://cloud.polyaxon.com/. After some tests, it looks like logging metrics through cloud.polyaxon.com causes the issues I was seeing with artifacts. When I switched the polyaxon host to the url of our gateway, then the dashboard started correctly displaying metrics.

Enhancement

As suggested by the user: since the upload happens in a background thread, API errors (404/401/403) should be surfaced to help the user debug issues:

Any chance that you'd update the code to provide a useful error message when someone tries this?
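
One generic way to keep errors from a background upload from being silent is to attach a completion callback that logs any exception. The sketch below uses only the Python standard library. ApiError, report_failure, and upload_in_background are hypothetical names for illustration; this is not how the Polyaxon client is implemented.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

logger = logging.getLogger("tracking-sketch")


class ApiError(Exception):
    """Hypothetical error carrying an HTTP status such as 401, 403 or 404."""

    def __init__(self, status: int) -> None:
        super().__init__(f"API request failed with status {status}")
        self.status = status


def report_failure(future: Future) -> None:
    """Log the exception raised inside the worker thread, if any."""
    exc = future.exception()
    if exc is not None:
        logger.error("Artifact upload failed: %s", exc)


def upload_in_background(
    executor: ThreadPoolExecutor, upload_fn: Callable[..., Any], *args: Any
) -> Future:
    """Submit an upload and make sure its failure is reported."""
    future = executor.submit(upload_fn, *args)
    future.add_done_callback(report_failure)
    return future
```

With this pattern a 403 from a misconfigured host would show up in the logs as soon as the upload finishes, instead of being dropped with the thread.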

enhancement area/tracking area/client
opened by polyaxon-team 0
Add config to support proxy env var with GCS
Current behavior

It seems that GCS-FS (gcsfs) does not automatically pick up the proxy env vars; see https://github.com/fsspec/gcsfs/pull/491

Enhancement

Add trust_env if proxy env vars are used:

from gcsfs import GCSFileSystem

fs = GCSFileSystem(project='my-google-project', session_kwargs={'trust_env': True})
area/cli area/streams area/sidecar area/client
opened by polyaxon-team 0
Stopping an operation with a pending pod removes the operation but does not delete the pod

Describe the bug

Stopping an operation whose pod is pending with an image pull error removes the operation from Polyaxon's table but does not correctly delete the pod.
bug core

opened by polyaxon-team 0
CVE-2007-4559 Patch

Patching CVE-2007-4559

Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

If you have further questions you may contact us through this project's lead researcher Kasimir Schulz.
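
The kind of check described above, verifying that every member stays inside the destination before extracting, can be sketched as follows. This is an illustrative version written for this page, not the patch from the pull request; the function names are assumptions, and it only guards against member names that escape the target directory (it does not inspect symlink targets).

```python
import os
import tarfile


def is_within_directory(directory: str, target: str) -> bool:
    """Return True if `target` resolves to a path inside `directory`."""
    abs_directory = os.path.abspath(directory)
    abs_target = os.path.abspath(target)
    return os.path.commonpath([abs_directory, abs_target]) == abs_directory


def safe_extract(tar: tarfile.TarFile, path: str = ".") -> None:
    """Extract all members, refusing archives that would escape `path`."""
    for member in tar.getmembers():
        member_path = os.path.join(path, member.name)
        if not is_within_directory(path, member_path):
            raise ValueError(
                f"Blocked path traversal in tar member: {member.name}"
            )
    # Only reached when every member name stays under `path`.
    tar.extractall(path)
```

On Python 3.12 and later, tarfile also offers extraction filters (for example extractall(path, filter="data")), which reject unsafe members without a hand-written check.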

opened by TrellixVulnTeam 0
Polyaxon CLI should raise an error for invalid inputs/outputs names with dots `.`
Current behavior

The CLI currently allows users to pass inputs/outputs with dots ., and the platform allows the run to be scheduled. However, the interpolation engine does not allow reusing the param's variable name, especially when using DAGs or Joins, since the parser uses the dot . to extract the required variables.

Enhancement

There are two options:

Add a validation on the parsing level to show an error to the user before they submit the operation to the platform, to prevent any confusion.

Allow using [] as an alternative solution to getting params/inputs/outputs values instead of . and update the documentation to show how it can be used.
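
The first option could be as small as the check sketched below. This is a hypothetical illustration of parse-time validation, not Polyaxon's actual parser code; validate_io_names and its error message are assumptions.

```python
from typing import Iterable


def validate_io_names(names: Iterable[str]) -> None:
    """Reject input/output names containing a dot.

    A dot is used by the interpolation engine to separate path segments,
    so a name like "model.lr" would be ambiguous when referenced later.
    """
    invalid = sorted(name for name in names if "." in name)
    if invalid:
        raise ValueError(
            "Input/output names must not contain dots: " + ", ".join(invalid)
        )
```

Running such a check before submission would surface the problem in the CLI rather than later, when a DAG or Join tries to reference the value.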

enhancement area/specification area/cli
opened by polyaxon-team 0


Releases(v1.12.2)

Owner

polyaxon

Kernel Point Convolutions

The AugNet Python module contains functions for the fast computation of image similarity.

Geometric Vector Perceptron --- a rotation-equivariant GNN for learning from biomolecular structure

Dataset VSD4K includes 6 popular categories: game, sport, dance, vlog, interview and city.

Tensor-based approaches for fMRI classification

Code for "Solving Graph-based Public Good Games with Tree Search and Imitation Learning"

Image processing in Python

Instance-wise Occlusion and Depth Orders in Natural Scenes (CVPR 2022)

A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Leibniz is a python package which provide facilities to express learnable partial differential equations with PyTorch

Code to run experiments in SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression.

PyTorch-centric library for evaluating and enhancing the robustness of AI technologies

A heterogeneous entity-augmented academic language model based on Open Academic Graph (OAG)

Multi-Anchor Active Domain Adaptation for Semantic Segmentation (ICCV 2021 Oral)

An implementation of an abstract algebra for music tones (pitches).

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

Hierarchical Clustering: O(1)-Approximation for Well-Clustered Graphs

This repository contains PyTorch models for SpecTr (Spectral Transformer).

Torchlight2 lan game server tool - A message forwarding tool for Torchlight 2 lan game