Code for the paper "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web" (ECCV 2020)

Last update: Dec 14, 2022

Overview

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web

Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra

Paper: https://arxiv.org/abs/2004.14973

Model Zoo

A variety of pre-trained VLN-BERT weights can be accessed through the following links:

	Pre-training Stages	Job ID	Val Unseen SR	URL
0	no pre-training	174631	30.52%	TBD
1	1	175134	45.17%	TBD
3	1 and 2	221943	49.64%	download
2	1 and 3	220929	50.02%	download
4	1, 2, and 3 (Full Model)	220825	59.26%	download
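To compare the rows at a glance, the gain of each configuration over the no-pre-training baseline can be computed directly from the reported Val Unseen SR values. The sketch below only restates the numbers in the table; the variable names are illustrative and not part of the codebase.

```python
# Val Unseen success rates (%) copied from the Model Zoo table above.
val_unseen_sr = {
    "no pre-training": 30.52,
    "1": 45.17,
    "1 and 2": 49.64,
    "1 and 3": 50.02,
    "1, 2, and 3 (Full Model)": 59.26,
}

baseline = val_unseen_sr["no pre-training"]

# Absolute gain, in percentage points, of each configuration over the baseline.
gains = {
    stages: round(sr - baseline, 2)
    for stages, sr in val_unseen_sr.items()
}

for stages, gain in gains.items():
    print(f"{stages:<26} +{gain:.2f} points")
```

With these values, the full model is 28.74 points above the baseline, and adding stage 2 or stage 3 alone on top of stage 1 gives gains of 19.12 and 19.50 points respectively.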

Usage Instructions

Follow the instructions in INSTALL.md to set up this codebase. The instructions walk you through several steps, including preprocessing the Matterport3D panoramas by extracting regions with a pretrained object detector.

Training

To perform stage 3 of pre-training, first download ViLBERT weights from here. Then, run:

python \
-m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
train.py \
--from_pretrained <path/to/vilbert_pytorch_model_9.bin> \
--save_name [pre_train_run_id] \
--num_epochs 50 \
--warmup_proportion 0.08 \
--cooldown_factor 8 \
--masked_language \
--masked_vision \
--no_ranking

To fine-tune VLN-BERT for the path selection task, run:

python \
-m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=1 \
--node_rank=0 \
train.py \
--from_pretrained <path/to/pytorch_model_50.bin> \
--save_name [fine_tune_run_id]
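The two training commands above share the same launcher arguments and differ only in the checkpoint, the run name, and the extra stage 3 flags. A small helper can assemble either one, which may be convenient when scripting several runs. This is a minimal sketch: the function name, the checkpoint paths, and the run names are hypothetical, and only flags shown in the commands above are used.

```python
import shlex
from typing import List, Optional, Sequence

# Extra flags used only for stage 3 pre-training, copied from the command above.
STAGE3_FLAGS = [
    "--num_epochs", "50",
    "--warmup_proportion", "0.08",
    "--cooldown_factor", "8",
    "--masked_language",
    "--masked_vision",
    "--no_ranking",
]


def build_train_command(
    from_pretrained: str,
    save_name: str,
    extra_flags: Optional[Sequence[str]] = None,
    nproc_per_node: int = 8,
) -> List[str]:
    """Assemble the train.py launch command as an argument list."""
    command = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc_per_node}",
        "--nnodes=1",
        "--node_rank=0",
        "train.py",
        "--from_pretrained", from_pretrained,
        "--save_name", save_name,
    ]
    command.extend(extra_flags or [])
    return command


# Hypothetical paths and run names, for illustration only.
pretrain_cmd = build_train_command(
    "weights/vilbert_pytorch_model_9.bin", "pre_train_run", STAGE3_FLAGS
)
finetune_cmd = build_train_command("weights/pytorch_model_50.bin", "fine_tune_run")

# Print the commands instead of running them, so they can be inspected first.
print(shlex.join(pretrain_cmd))
print(shlex.join(finetune_cmd))
```

Either list can be passed to subprocess.run(command, check=True) from the repository root once the paths point at real checkpoints.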

Evaluation

To evaluate a pre-trained model, run:

python test.py \
--split [val_seen|val_unseen] \
--from_pretrained <path/to/run_[run_id]_pytorch_model.bin> \
--save_name [run_id]

followed by:

python scripts/calculate-metrics.py <path/to/results_[val_seen|val_unseen].json>
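For context, the SR column in the Model Zoo table is a success rate. Assuming the usual Room-to-Room convention, an episode counts as a success when the agent's final position is within 3 meters of the goal. The sketch below illustrates only that definition; it is not the code in scripts/calculate-metrics.py, and the function name and its input (a list of final distances to the goal, in meters) are assumptions made for illustration.

```python
from typing import Sequence


def success_rate(final_distances_m: Sequence[float], threshold_m: float = 3.0) -> float:
    """Fraction of episodes that end strictly within threshold_m of the goal."""
    if not final_distances_m:
        raise ValueError("need at least one episode")
    successes = sum(1 for distance in final_distances_m if distance < threshold_m)
    return successes / len(final_distances_m)


# Made-up distances: two of the four episodes end within 3 m of the goal.
example_distances = [0.5, 2.9, 3.0, 7.2]
print(f"SR = {success_rate(example_distances):.2%}")
```

The strict inequality means an episode ending exactly 3 meters from the goal is not counted as a success in this sketch.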

Citation

If you find this code useful, please consider citing:

@inproceedings{majumdar2020improving,
  title={Improving Vision-and-Language Navigation with Image-Text Pairs from the Web},
  author={Arjun Majumdar and Ayush Shrivastava and Stefan Lee and Peter Anderson and Devi Parikh and Dhruv Batra},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2020}
}
