CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

Overview

CALVIN

CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks Oier Mees, Lukas Hermann, Erick Rosete, Wolfram Burgard

We present CALVIN (Composing Actions from Language and Vision), an open-source simulated benchmark to learn long-horizon language-conditioned tasks. Our aim is to make it possible to develop agents that can solve many robotic manipulation tasks over a long horizon, from onboard sensors, and specified only via human language. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets and supports flexible specification of sensor suites.

๐Ÿ’ป Quick Start

To begin, clone this repository locally

git clone --recurse-submodules https://github.com/mees/calvin.git
$ export CALVIN_ROOT=$(pwd)/calvin

Install requirements:

$ cd $CALVIN_ROOT
$ virtualenv -p $(which python3) --system-site-packages calvin_env # or use conda
$ source calvin_env/bin/activate
$ sh install.sh

Download dataset (choose which split you want to download with the argument D, ABC or ABCD):

$ cd $CALVIN_ROOT/dataset
$ sh download_data.sh D | ABC | ABCD

๐Ÿ‹๏ธโ€โ™‚๏ธ Train Baseline Agent

Train baseline models:

$ cd $CALVIN_ROOT/calvin_models/calvin_agent
$ python training.py

You want to scale your training to a multi-gpu setup? Just specify the number of GPUs and DDP will automatically be used for training thanks to Pytorch Lightning. To train on all available GPUs:

$ python training.py trainer.gpus=-1

If you have access to a Slurm cluster, we also provide trainings scripts here.

You can use Hydra's flexible overriding system for changing hyperparameters. For example, to train a model with rgb images from both static camera and the gripper camera:

$ python training.py datamodule/observation_space=lang_rgb_static_gripper model/perceptual_encoder=gripper_cam

To train a model with RGB-D from both cameras:

$ python training.py datamodule/observation_space=lang_rgbd_both model/perceptual_encoder=RGBD_both

To train a model with rgb images from the static camera and visual tactile observations:

$ python training.py datamodule/observation_space=lang_rgb_static_tactile model/perceptual_encoder=static_RGB_tactile

To see all available hyperparameters:

$ python training.py --help

To resume a training, just override the hydra working directory :

$ python training.py hydra.run.dir=runs/my_dir

๐Ÿ–ผ๏ธ Sensory Observations

CALVIN supports a range of sensors commonly utilized for visuomotor control:

  1. Static camera RGB images - with shape 200x200x3.
  2. Static camera Depth maps - with shape 200x200x1.
  3. Gripper camera RGB images - with shape 200x200x3.
  4. Gripper camera Depth maps - with shape 200x200x1.
  5. Tactile image - with shape 120x160x2x3.
  6. Proprioceptive state - EE position (3), EE orientation in euler angles (3), gripper width (1), joint positions (7), gripper action (1).

๐Ÿ•น๏ธ Action Space

In CALVIN, the agent must perform closed-loop continuous control to follow unconstrained language instructions characterizing complex robot manipulation tasks, sending continuous actions to the robot at 30hz. In order to give researchers and practitioners the freedom to experiment with different action spaces, CALVIN supports the following actions spaces:

  1. Absolute cartesian pose - EE position (3), EE orientation in euler angles (3), gripper action (1).
  2. Relative cartesian displacement - EE position (3), EE orientation in euler angles (3), gripper action (1).
  3. Joint action - Joint positions (7), gripper action (1).

๐Ÿ’ช Evaluation: The Calvin Challenge

Long-horizon Multi-task Language Control (LH-MTLC)

The aim of the CALVIN benchmark is to evaluate the learning of long-horizon language-conditioned continuous control policies. In this setting, a single agent must solve complex manipulation tasks by understanding a series of unconstrained language expressions in a row, e.g., โ€œopen the drawer. . . pick up the blue block. . . now push the block into the drawer. . . now open the sliding doorโ€. We provide an evaluation protocol with evaluation modes of varying difficulty by choosing different combinations of sensor suites and amounts of training environments. To avoid a biased initial position, the robot is reset to a neutral position before every multi-step sequence.

To evaluate a trained calvin baseline agent, run the following command:

$ cd $CALVIN_ROOT/calvin_models/calvin_agent
$ python evaluation/evaluate_policy.py --dataset_path <PATH/TO/DATASET> --train_folder <PATH/TO/TRAINING/FOLDER>

Optional arguments:

  • --checkpoint <PATH/TO/CHECKPOINT>: by default, the evaluation loads the last checkpoint in the training log directory. You can instead specify the path to another checkpoint by adding this to the evaluation command.
  • --debug: print debug information and visualize environment.

If you want to evaluate your own model architecture on the CALVIN challenge, you can implement the CustomModel class in evaluate_policy.py as an interface to your agent. You need to implement the following methods:

  • __init__(): gets called once at the beginning of the evaluation.
  • reset(): gets called at the beginning of each evaluation sequence.
  • step(obs, goal): gets called every step and returns the predicted action.

Then evaluate the model by running:

$ python evaluation/evaluate_policy.py --dataset_path <PATH/TO/DATASET> --custom_model

You are also free to use your own language model instead of using the precomputed language embeddings provided by CALVIN. For this, implement CustomLangEmbeddings in evaluate_policy.py and add --custom_lang_embeddings to the evaluation command.

Multi-task Language Control (MTLC)

Alternatively, you can evaluate the policy on single tasks and without resetting the robot to a neutral position. Note that this evaluation is currently only available for our baseline agent.

$ python evaluation/evaluate_policy_singlestep.py --dataset_path <PATH/TO/DATASET> --train_folder <PATH/TO/TRAINING/FOLDER> [--checkpoint <PATH/TO/CHECKPOINT>] [--debug]

Pre-trained Model

Download the MCIL model checkpoint trained on the static camera rgb images on environment D.

$ wget http://calvin.cs.uni-freiburg.de/model_weights/D_D_static_rgb_baseline.zip
$ unzip D_D_static_rgb_baseline.zip

๐Ÿ’ฌ Relabeling Raw Language Annotations

You want to try learning language conditioned policies in CALVIN with a new awesome language model?

We provide an example script to relabel the annotations with different language model provided in SBert, such as the larger MPNet (paraphrase-mpnet-base-v2) or its corresponding multilingual model (paraphrase-multilingual-mpnet-base-v2). The supported options are "mini", "mpnet" and "multi". If you want to try different SBert models, just change the model name here.

cd $CALVIN_ROOT/calvin_models/calvin_agent
python utils/relabel_with_new_lang_model.py +path=$CALVIN_ROOT/dataset/task_D_D/ +name_folder=new_lang_model_folder model.nlp_model=mpnet

If you additionally want to sample different language annotations for each sequence (from the same task annotations) in the training split run the same command with the parameter reannotate=true.

๐Ÿ“ˆ SOTA Models

Open-source models that outperform the MCIL baselines from CALVIN:

Contact Oier to add your model here.

Reinforcement Learning with CALVIN

Are you interested in trying reinforcement learning agents for the different manipulation tasks in the CALVIN environment? We provide a google colab to showcase how to leverage the CALVIN task indicators to learn RL agents with a sparse reward.

Citation

If you find the dataset or code useful, please cite:

@article{calvin21,
author = {Oier Mees and Lukas Hermann and Erick Rosete-Beas and Wolfram Burgard},
title = {CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks},
journal={arXiv preprint arXiv:2112.03227},
year = 2021,
}

License

MIT License

Owner
Oier Mees
PhD Student at the University of Freiburg, Germany. Researcher in Machine Learning and Robotics.
Oier Mees
[ICCV-2021] An Empirical Study of the Collapsing Problem in Semi-Supervised 2D Human Pose Estimation

An Empirical Study of the Collapsing Problem in Semi-Supervised 2D Human Pose Estimation (ICCV 2021) Introduction This is an official pytorch implemen

rongchangxie 42 Jan 04, 2023
This project deals with the detection of skin lesions within the ISICs dataset using YOLOv3 Object Detection with Darknet.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Skin Lesion detection using YOLO This project deal

Lalith Veerabhadrappa Badiger 1 Nov 22, 2021
ICRA 2021 - Robust Place Recognition using an Imaging Lidar

Robust Place Recognition using an Imaging Lidar A place recognition package using high-resolution imaging lidar. For best performance, a lidar equippe

Tixiao Shan 293 Dec 27, 2022
Open source code for the paper of Neural Sparse Voxel Fields.

Neural Sparse Voxel Fields (NSVF) Project Page | Video | Paper | Data Photo-realistic free-viewpoint rendering of real-world scenes using classical co

Meta Research 647 Dec 27, 2022
RID-Noise: Towards Robust Inverse Design under Noisy Environments

This is code of RID-Noise. Reproduce RID-Noise Results Toy tasks Please refer to the notebook ridnoise.ipynb to view experiments on three toy tasks. B

Thyrix 2 Nov 23, 2022
Fastshap: A fast, approximate shap kernel

fastshap: A fast, approximate shap kernel fastshap was designed to be: Fast Calculating shap values can take an extremely long time. fastshap utilizes

Samuel Wilson 22 Sep 24, 2022
Atomistic Line Graph Neural Network

Table of Contents Introduction Installation Examples Pre-trained models Quick start using colab JARVIS-ALIGNN webapp Peformances on a few datasets Use

National Institute of Standards and Technology 91 Dec 30, 2022
Rest API Written In Python To Classify NSFW Images.

Rest API Written In Python To Classify NSFW Images.

Wahyusaputra 2 Dec 23, 2021
CZU-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and 10 wearable inertial sensors

CZU-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and 10 wearable inertial sensors โ€ƒโ€ƒIn order to facilitate the res

yujmo 11 Dec 12, 2022
Multi-layer convolutional LSTM with Pytorch

Convolution_LSTM_pytorch Thanks for your attention. I haven't got time to maintain this repo for a long time. I recommend this repo which provides an

Zijie Zhuang 733 Dec 30, 2022
A demonstration of using a live Tensorflow session to create an interactive face-GAN explorer.

Streamlit Demo: The Controllable GAN Face Generator This project highlights Streamlit's new hash_func feature with an app that calls on TensorFlow to

Streamlit 257 Dec 31, 2022
AirCode: A Robust Object Encoding Method

AirCode This repo contains source codes for the arXiv preprint "AirCode: A Robust Object Encoding Method" Demo Object matching comparison when the obj

Chen Wang 30 Dec 09, 2022
Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions

Image-Adaptive YOLO for Object Detection in Adverse Weather Conditions Accepted by AAAI 2022 [arxiv] Wenyu Liu, Gaofeng Ren, Runsheng Yu, Shi Guo, Jia

liuwenyu 245 Dec 16, 2022
Automatic labeling, conversion of different data set formats, sample size statistics, model cascade

Simple Gadget Collection for Object Detection Tasks Automatic image annotation Conversion between different annotation formats Obtain statistical info

llt 4 Aug 24, 2022
Sky Computing: Accelerating Geo-distributed Computing in Federated Learning

Sky Computing Introduction Sky Computing is a load-balanced framework for federated learning model parallelism. It adaptively allocate model layers to

HPC-AI Tech 72 Dec 27, 2022
JupyterLite demo deployed to GitHub Pages ๐Ÿš€

JupyterLite Demo JupyterLite deployed as a static site to GitHub Pages, for demo purposes. โœจ Try it in your browser โœจ โžก๏ธ https://jupyterlite.github.io

JupyterLite 223 Jan 04, 2023
Convert Mission Planner (ArduCopter) Waypoint Missions to Litchi CSV Format to execute on DJI Drones

Mission Planner to Litchi Convert Mission Planner (ArduCopter) Waypoint Surveys to Litchi CSV Format to execute on DJI Drones Litchi doesn't support S

Yaros 24 Dec 09, 2022
CNN visualization tool in TensorFlow

tf_cnnvis A blog post describing the library: https://medium.com/@falaktheoptimist/want-to-look-inside-your-cnn-we-have-just-the-right-tool-for-you-ad

InFoCusp 778 Jan 02, 2023
An self sufficient AI that crawls the web to learn how to generate art from keywords

Roxx-IO - The Smart Artist AI! TO DO / IDEAS Implement Web-Scraping Functionality Figure out a less annoying (and an off button for it) text to speech

Tatz 5 Mar 21, 2022
RLDS stands for Reinforcement Learning Datasets

RLDS RLDS stands for Reinforcement Learning Datasets and it is an ecosystem of tools to store, retrieve and manipulate episodic data in the context of

Google Research 135 Jan 01, 2023