History Aware Multimodal Transformer for Vision-and-Language Navigation

Overview

History Aware Multimodal Transformer for Vision-and-Language Navigation

This repository is the official implementation of History Aware Multimodal Transformer for Vision-and-Language Navigation. Project webpage: https://cshizhe.github.io/projects/vln_hamt.html

Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow instructions and navigate in real scenes. In this work, we introduce a History Aware Multimodal Transformer (HAMT) to incorporate a long-horizon history into multimodal decision making. HAMT efficiently encodes all the past panoramic observations via a hierarchical vision transformer. It, then, jointly combines text, history and current observation to predict the next action. We first train HAMT end-to-end using several proxy tasks including single-step action prediction and spatial relation prediction, and then use reinforcement learning to further improve the navigation policy. HAMT achieves new state of the art on a broad range of VLN tasks, including VLN with fine-grained instructions (R2R, RxR) high-level instructions (R2R-Last, REVERIE), dialogs (CVDN) as well as long-horizon VLN (R4R, R2R-Back).

framework

Installation

  1. Install Matterport3D simulators: follow instructions here. We use the latest version (all inputs and outputs are batched).
export PYTHONPATH=Matterport3DSimulator/build:$PYTHONPATH
  1. Install requirements:
conda create --name vlnhamt python=3.8.5
conda activate vlnhamt
pip install torch==1.7.1+cu101 torchvision==0.8.2+cu101 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
  1. Download data from Dropbox, including processed annotations, features and pretrained models. Put the data in `datasets' directory.

  2. (Optional) If you want to train HAMT end-to-end, you should download original Matterport3D data.

Extracting features (optional)

Scripts to extract visual features are in preprocess directory:

CUDA_VISIBLE_DEVICES=0 python preprocess/precompute_img_features_vit.py \
    --model_name vit_base_patch16_224 --out_image_logits \
    --connectivity_dir datasets/R2R/connectivity \
    --scan_dir datasets/Matterport3D/v1_unzip_scans \
    --num_workers 4 \
    --output_file datasets/R2R/features/pth_vit_base_patch16_224_imagenet.hdf5

Training with proxy tasks

Stage 1: Pretrain with fixed ViT features

NODE_RANK=0
NUM_GPUS=4
CUDA_VISIBLE_DEVICES='0,1,2,3' python -m torch.distributed.launch \
    --nproc_per_node=${NUM_GPUS} --node_rank $NODE_RANK \
    pretrain_src/main_r2r.py --world_size ${NUM_GPUS} \
    --model_config pretrain_src/config/r2r_model_config.json \
    --config pretrain_src/config/pretrain_r2r.json \
    --output_dir datasets/R2R/exprs/pretrain/cmt-vitbase-6tasks

Stage 2: Train ViT in an end-to-end manner

Change the config file as `pretrain_r2r_e2e.json'.

Fine-tuning for sequential action prediction

cd finetune_src
bash scripts/run_r2r.bash
bash scripts/run_r2r_back.bash
bash scripts/run_r2r_last.bash
bash scripts/run_r4r.bash
bash scripts/run_reverie.bash
bash scripts/run_cvdn.bash

Citation

If you find this work useful, please consider citing:

@InProceedings{chen2021hamt,
author       = {Chen, Shizhe and Guhur, Pierre-Louis and Schmid, Cordelia and Laptev, Ivan},
title        = {History Aware multimodal Transformer for Vision-and-Language Navigation},
booktitle    = {NeurIPS},
year         = {2021},
}

Acknowledgement

Some of the codes are built upon pytorch-image-models, UNITER and Recurrent-VLN-BERT. Thanks them for their great works!

Owner
Shizhe Chen
Shizhe Chen
PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

PatrickStar enables Larger, Faster, Greener Pretrained Models for NLP. Democratize AI for everyone.

Tencent 633 Dec 28, 2022
🏆 • 5050 most frequent words in 109 languages

🏆 Most Common Words Multilingual 5000 most frequent words in 109 languages. Uses wordfrequency.info as a source. 🔗 License source code license data

14 Nov 24, 2022
Simple, hackable offline speech to text - using the VOSK-API.

Simple, hackable offline speech to text - using the VOSK-API.

Campbell Barton 844 Jan 07, 2023
An automated program that helps customers of Pizza Palour place their pizza orders

PIzza_Order_Assistant Introduction An automated program that helps customers of Pizza Palour place their pizza orders. The program uses voice commands

Tindi Sommers 1 Dec 26, 2021
Code for the paper "Language Models are Unsupervised Multitask Learners"

Status: Archive (code is provided as-is, no updates expected) gpt-2 Code and models from the paper "Language Models are Unsupervised Multitask Learner

OpenAI 16.1k Jan 08, 2023
fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.

fastNLP fastNLP是一款轻量级的自然语言处理(NLP)工具包,目标是快速实现NLP任务以及构建复杂模型。 fastNLP具有如下的特性: 统一的Tabular式数据容器,简化数据预处理过程; 内置多种数据集的Loader和Pipe,省去预处理代码; 各种方便的NLP工具,例如Embedd

fastNLP 2.8k Jan 01, 2023
Honor's thesis project analyzing whether the GPT-2 model can more effectively generate free-verse or structured poetry.

gpt2-poetry The following code is for my senior honor's thesis project, under the guidance of Dr. Keith Holyoak at the University of California, Los A

Ashley Kim 2 Jan 09, 2022
Stack based programming language that compiles to x86_64 assembly or can alternatively be interpreted in Python

lang lang is a simple stack based programming language written in Python. It can

Christoffer Aakre 1 May 30, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
SimBERT升级版(SimBERTv2)!

RoFormer-Sim RoFormer-Sim,又称SimBERTv2,是我们之前发布的SimBERT模型的升级版。 介绍 https://kexue.fm/archives/8454 训练 tensorflow 1.14 + keras 2.3.1 + bert4keras 0.10.6 下载

317 Dec 23, 2022
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

UNITER: UNiversal Image-TExt Representation Learning This is the official repository of UNITER (ECCV 2020). This repository currently supports finetun

Yen-Chun Chen 680 Dec 24, 2022
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Hiring We are hiring at all levels (including FTE researchers and interns)! If you are interested in working with us on NLP and large-scale pre-traine

Microsoft 7.8k Jan 09, 2023
Training RNNs as Fast as CNNs

News SRU++, a new SRU variant, is released. [tech report] [blog] The experimental code and SRU++ implementation are available on the dev branch which

Tao Lei 14 Dec 12, 2022
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

Abel 211 Dec 28, 2022
A minimal code for fairseq vq-wav2vec model inference.

vq-wav2vec inference A minimal code for fairseq vq-wav2vec model inference. Runs without installing the fairseq toolkit and its dependencies. Usage ex

Vladimir Larin 7 Nov 15, 2022
Words-per-minute - A terminal app written in python utilizing the curses module that tests the user's ability to type

words-per-minute A terminal app written in python utilizing the curses module th

Tanim Islam 1 Jan 14, 2022
This is the offline-training-pipeline for our project.

offline-training-pipeline This is the offline-training-pipeline for our project. We adopt the offline training and online prediction Machine Learning

0 Apr 22, 2022
GVT is a generic translation tool for parts of text on the PC screen with Text to Speak functionality.

GVT is a generic translation tool for parts of text on the PC screen with Text to Speech functionality. I wanted to create it because the existing tools that I experimented with did not satisfy me in

Nuked 1 Aug 21, 2022
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023
leaking paid token generator that was a shit lmao for 100$ haha

Discord-Token-Generator-Leaked leaking paid token generator that was a shit lmao for 100$ he selling it for 100$ wth here the code enjoy don't forget

Keevo 5 Apr 15, 2022