Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Last update: Nov 24, 2022

Official PyTorch implementation for the paper

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation, by Rishabh Jangir*, Nicklas Hansen*, Sambaran Ghosal, Mohit Jain, and Xiaolong Wang

[arXiv], [Webpage]

Installation

GPU access with CUDA >=11.1 support is required. Install MuJoCo if you do not have it installed already:

Obtain a license on the MuJoCo website.
Download MuJoCo binaries here.
Unzip the downloaded archive into ~/.mujoco/mujoco200 and place your license key file mjkey.txt at ~/.mujoco.
Set the environment variables MUJOCO_PY_MJKEY_PATH and MUJOCO_PY_MUJOCO_PATH to the MuJoCo license key path and the MuJoCo directory path, respectively.
Append the path of the MuJoCo bin subdirectory to the environment variable LD_LIBRARY_PATH.
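
The environment-variable steps above are easy to get subtly wrong, so a quick sanity check before creating the conda environment can help. The following is a minimal sketch, not part of this repository: the helper name check_mujoco_setup is hypothetical, and it only verifies that the three variables named above are set and point at existing paths.

```python
import os
from pathlib import Path


def check_mujoco_setup(env=None):
    """Return a list of problems found in the MuJoCo environment variables.

    Hypothetical helper (not shipped with this repository). An empty list
    means the variables from the installation steps look consistent.
    """
    env = os.environ if env is None else env
    problems = []

    key_path = env.get("MUJOCO_PY_MJKEY_PATH")
    if not key_path:
        problems.append("MUJOCO_PY_MJKEY_PATH is not set")
    elif not Path(key_path).expanduser().is_file():
        problems.append(f"license key not found at {key_path}")

    mujoco_path = env.get("MUJOCO_PY_MUJOCO_PATH")
    if not mujoco_path:
        problems.append("MUJOCO_PY_MUJOCO_PATH is not set")
    else:
        bin_dir = Path(mujoco_path).expanduser() / "bin"
        if not bin_dir.is_dir():
            problems.append(f"MuJoCo bin directory not found at {bin_dir}")
        else:
            # LD_LIBRARY_PATH is a list of directories joined by os.pathsep.
            raw = env.get("LD_LIBRARY_PATH", "")
            entries = [Path(p).expanduser() for p in raw.split(os.pathsep) if p]
            if bin_dir not in entries:
                problems.append(f"{bin_dir} is not in LD_LIBRARY_PATH")
    return problems


if __name__ == "__main__":
    found = check_mujoco_setup()
    for line in found or ["MuJoCo environment variables look consistent"]:
        print(line)
```

The check compares paths literally, so a symlinked or differently spelled bin directory in LD_LIBRARY_PATH would be reported as missing even though it may work in practice.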

Then, the remainder of the dependencies can be installed with the following commands:

conda env create -f setup/conda.yml
conda activate lookcloser

Training

We provide training scripts for solving each of the four tasks using our method. The training scripts can be found in the scripts directory. Training takes approximately 16 hours on a single GPU for 500k timesteps.
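
For planning shorter or longer runs, the figures quoted above (about 16 hours for 500k timesteps on a single GPU) can be turned into a rough estimate. This is only a back-of-the-envelope sketch that assumes throughput stays constant over the run; actual speed depends on the GPU and the task, and the function names are illustrative.

```python
# Figures quoted in this README: roughly 16 hours for 500k timesteps on one GPU.
REFERENCE_STEPS = 500_000
REFERENCE_HOURS = 16.0


def steps_per_second():
    """Average throughput implied by the quoted figures."""
    return REFERENCE_STEPS / (REFERENCE_HOURS * 3600)


def estimated_hours(num_steps):
    """Linear extrapolation of wall-clock time, assuming constant throughput."""
    return num_steps / REFERENCE_STEPS * REFERENCE_HOURS


if __name__ == "__main__":
    print(f"{steps_per_second():.1f} steps/s")                 # about 8.7 steps/s
    print(f"{estimated_hours(100_000):.1f} h for 100k steps")  # about 3.2 h
```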

Running bash scripts/multiview.sh with the default arguments trains our cross-view method on the Reach environment from image observations.

Please see src/arguments.py for a detailed description of the arguments and their usage. The baselines considered in the paper can be run with minor changes to the input arguments.

Results

We find that while using multiple views alone improves the sim-to-real performance of SAC, our Transformer-based view fusion is far more robust across all tasks.

See our paper for more results.

Method

Our method improves vision-based robotic manipulation by fusing information from multiple cameras using transformers. The learned RL policy transfers from simulation to a real robot, and solves precision-based manipulation tasks directly from uncalibrated cameras, without access to state information, and with a high degree of variability in task configurations.
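
To make the idea of fusing views with attention concrete, the sketch below implements plain scaled dot-product cross-attention, where tokens from one view query tokens from the other. It is an illustration under simplifying assumptions, not the architecture in this repository: there are no learned projections, no multiple heads, and no convolutional encoder, and the toy feature vectors are made up. It uses only the standard library so that it runs without PyTorch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(queries, keys, values):
    """Scaled dot-product attention from one view's tokens to another's.

    queries: tokens from view A, each a list of d floats
    keys, values: tokens from view B (one key and one value per token)
    Returns (fused, weights): one fused vector and one weight row per query.
    """
    d = len(queries[0])
    fused, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of view-B values: what this view-A token reads from view B.
        fused.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return fused, weights


if __name__ == "__main__":
    # Hypothetical features: one egocentric token for an object, and two
    # third-person tokens (the same object first, then background).
    ego_tokens = [[4.0, 0.0]]
    third_tokens = [[4.0, 0.0], [0.0, 4.0]]
    fused, weights = cross_view_attention(ego_tokens, third_tokens, third_tokens)
    print(weights[0])  # the matching third-person token receives most of the weight
```

Swapping the roles of the two views gives attention in the other direction, and the returned weight rows are the kind of quantity an attention-map visualization displays.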

Attention Maps

We visualize attention maps learned by our method and find that it learns to relate concepts shared between the two views. For example, when querying a point on an object shown in the egocentric view, our method attends strongly to the same object in the third-person view, and vice versa.

Tasks

Together with our method, we also release a set of four image-based robotic manipulation tasks used in our research. Each task is goal-conditioned with the goal specified directly in the image observations, the agent has no access to state information, and task configurations are randomly initialized at the start of each episode. The provided tasks are:

Reach: Reach a randomly positioned mark on the table with the robot's end-effector.
Push: Push a box to a goal position indicated by a mark on the table.
Pegbox: Place a peg, attached to the robot's end-effector by a string, into a box.
Hammerall: Hammer in an out-of-position peg; in each episode, one of four pegs, chosen at random, is initialized out of position.

Citation

If you find our work useful in your research, please consider citing the paper as follows:

@article{Jangir2022Look,
  title={Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation},
  author={Rishabh Jangir and Nicklas Hansen and Sambaran Ghosal and Mohit Jain and Xiaolong Wang},
  booktitle={arXiv},
  primaryclass={cs.LG},
  year={2022}
}

License

This repository is licensed under the MIT license; see LICENSE for more information.
