[CVPR 2021] Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Overview

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Introduction

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e. relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% [email protected] improvement).

Please also check out the project website here.

For additional detail, please see the Scan2Cap paper:
"Scan2Cap: Context-aware Dense Captioning in RGB-D Scans"
by Dave Zhenyu Chen, Ali Gholami, Matthias Nießner and Angel X. Chang
from Technical University of Munich and Simon Fraser University.

Data

ScanRefer

If you would like to access to the ScanRefer dataset, please fill out this form. Once your request is accepted, you will receive an email with the download link.

Note: In addition to language annotations in ScanRefer dataset, you also need to access the original ScanNet dataset. Please refer to the ScanNet Instructions for more details.

Download the dataset by simply executing the wget command:

wget <download_link>

Scan2CAD

As learning the relative object orientations in the relational graph requires CAD model alignment annotations in Scan2CAD, please refer to the Scan2CAD official release (you need ~8MB on your disk). Once the data is downloaded, extract the zip file under data/ and change the path to Scan2CAD annotations (CONF.PATH.SCAN2CAD) in lib/config.py . As Scan2CAD doesn't cover all instances in ScanRefer, please download the mapping file and place it under CONF.PATH.SCAN2CAD. Parsing the raw Scan2CAD annotations by the following command:

python scripts/Scan2CAD_to_ScanNet.py

Setup

Please execute the following command to install PyTorch 1.8:

conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=10.2 -c pytorch

Install the necessary packages listed out in requirements.txt:

pip install -r requirements.txt

And don't forget to refer to Pytorch Geometric to install the graph support.

After all packages are properly installed, please run the following commands to compile the CUDA modules for the PointNet++ backbone:

cd lib/pointnet2
python setup.py install

Before moving on to the next step, please don't forget to set the project root path to the CONF.PATH.BASE in lib/config.py.

Data preparation

  1. Download the ScanRefer dataset and unzip it under data/ - You might want to run python scripts/organize_scanrefer.py to organize the data a bit.
  2. Download the preprocessed GLoVE embeddings (~990MB) and put them under data/.
  3. Download the ScanNetV2 dataset and put (or link) scans/ under (or to) data/scannet/scans/ (Please follow the ScanNet Instructions for downloading the ScanNet dataset).

After this step, there should be folders containing the ScanNet scene data under the data/scannet/scans/ with names like scene0000_00

  1. Pre-process ScanNet data. A folder named scannet_data/ will be generated under data/scannet/ after running the following command. Roughly 3.8GB free space is needed for this step:
cd data/scannet/
python batch_load_scannet_data.py

After this step, you can check if the processed scene data is valid by running:

python visualize.py --scene_id scene0000_00
  1. (Optional) Pre-process the multiview features from ENet.

    a. Download the ENet pretrained weights (1.4MB) and put it under data/

    b. Download and decompress the extracted ScanNet frames (~13GB).

    c. Change the data paths in config.py marked with TODO accordingly.

    d. Extract the ENet features:

    python scripts/compute_multiview_features.py

    e. Project ENet features from ScanNet frames to point clouds; you need ~36GB to store the generated HDF5 database:

    python scripts/project_multiview_features.py --maxpool

    You can check if the projections make sense by projecting the semantic labels from image to the target point cloud by:

    python scripts/project_multiview_labels.py --scene_id scene0000_00 --maxpool

Usage

End-to-End training for 3D dense captioning

Run the following script to start the end-to-end training of Scan2Cap model using the multiview features and normals. For more training options, please run scripts/train.py -h:

python scripts/train.py --use_multiview --use_normal --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 --batch_size 12 --epoch 50

The trained model as well as the intermediate results will be dumped into outputs/ . For evaluating the model (@0.5IoU), please run the following script and change the accordingly, and note that arguments must match the ones for training:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_caption --min_iou 0.5

Evaluating the detection performance:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection

You can even evaluate the pretraiend object detection backbone:

python scripts/eval.py --folder <output_folder> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10 --eval_detection --eval_pretrained

If you want to visualize the results, please run this script to generate bounding boxes and descriptions for scene to outputs/ :

python scripts/visualize.py --folder <output_folder> --scene_id <scene_id> --use_multiview --use_normal --use_topdown --use_relation --num_graph_steps 2 --num_locals 10

Note that you need to run python scripts/export_scannet_axis_aligned_mesh.py first to generate axis-aligned ScanNet mesh files.

3D dense captioning with ground truth bounding boxes

For experimenting the captioning performance with ground truth bounding boxes, you need to extract the box features with a pre-trained extractor. The pretrained ones are already in pretrained, but if you want to train a new one from scratch, run the following script:

python scripts/train_maskvotenet.py --batch_size 8 --epoch 200 --lr 1e-3 --wd 0 --use_multiview --use_normal

The pretrained model will be stored under outputs/ . Before we proceed, you need to move the to pretrained/ and change the name of the folder to XYZ_MULTIVIEW_NORMAL_MASKS_VOTENET, which must reflect the features while training, e.g. MULTIVIEW -> --use_multiview.

After that, let's run the following script to extract the features for the ground truth bounding boxes. Note that the feature options must match the ones in the previous steps:

python scripts/extract_gt_features.py --batch_size 16 --epoch 100 --use_multiview --use_normal --train --val

The extracted features will be stored as a HDF5 database under /gt_ _features . You need ~610MB space on your disk.

Now the box features are ready - we're good to go! Next step: run the following command to start training the dense captioning pipeline with the extraced ground truth box features:

python scripts/train_pretrained.py --mode gt --batch_size 32 --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

For evaluating the model, run the following command:

python scripts/eval_pretrained.py --folder <ouptut_folder> --mode gt --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 

3D dense captioning with pre-trained VoteNet bounding boxes

If you would like to play around with the pre-trained VoteNet bounding boxes, you can directly use the pre-trained VoteNet in pretrained. After picking the model you like, run the following command to extract the bounding boxes and associated box features:

python scripts/extract_votenet_features.py --batch_size 16 --epoch 100 --use_multiview --use_normal --train --val

Now the box features are ready. Next step: run the following command to start training the dense captioning pipeline with the extraced VoteNet boxes:

python scripts/train_pretrained.py --mode votenet --batch_size 32 --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10

For evaluating the model, run the following command:

python scripts/eval_pretrained.py --folder <ouptut_folder> --mode votenet --use_topdown --use_relation --use_orientation --num_graph_steps 2 --num_locals 10 

Experiments on ReferIt3D

Yes, of course you can use the ReferIt3D dataset for training and evaluation. Simply download ReferIt3D dataset and unzip it under data, then run the following command to convert it to ScanRefer format:

python scripts/organize_referit3d.py

Then you can simply specify the dataset you would like to use by --dataset ReferIt3D in the aforementioned steps. Have fun!

2D Experiments

Please refer to Scan2Cad-2D for more information.

Citation

If you found our work helpful, please kindly cite our paper via:

@inproceedings{chen2021scan2cap,
  title={Scan2Cap: Context-aware Dense Captioning in RGB-D Scans},
  author={Chen, Zhenyu and Gholami, Ali and Nie{\ss}ner, Matthias and Chang, Angel X},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={3193--3203},
  year={2021}
}

License

Scan2Cap is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Copyright (c) 2021 Dave Zhenyu Chen, Ali Gholami, Matthias Nießner, Angel X. Chang

Owner
Dave Z. Chen
PhD candidate at TUM
Dave Z. Chen
Neural Radiance Fields Using PyTorch

This project is a PyTorch implementation of Neural Radiance Fields (NeRF) for reproduction of results whilst running at a faster speed.

Vedant Ghodke 1 Feb 11, 2022
【Arxiv】Exploring Separable Attention for Multi-Contrast MR Image Super-Resolution

SANet Exploring Separable Attention for Multi-Contrast MR Image Super-Resolution Dependencies numpy==1.18.5 scikit_image==0.16.2 torchvision==0.8.1 to

36 Jan 05, 2023
一套完整的微博舆情分析流程代码,包括微博爬虫、LDA主题分析和情感分析。

已经将项目的关键文件上传,包含微博爬虫、LDA主题分析和情感分析三个部分。 1.微博爬虫 实现微博评论爬取和微博用户信息爬取,一天大概十万条。 2.LDA主题分析 实现文档主题抽取,包括数据清洗及分词、主题数的确定(主题一致性和困惑度)和最优主题模型的选择(暴力搜索)。 3.情感分析 实现评论文本的

182 Jan 02, 2023
Neural models of common sense. 🤖

Unicorn on Rainbow Neural models of common sense. This repository is for the paper: Unicorn on Rainbow: A Universal Commonsense Reasoning Model on a N

AI2 60 Jan 05, 2023
Jittor 64*64 implementation of StyleGAN

StyleGanJittor (Tsinghua university computer graphics course) Overview Jittor 64

Song Shengyu 3 Jan 20, 2022
Bot developed in Python that automates races in pegaxy.

español | português About it: This is a fork from pega-racing-bot. This bot, developed in Python, is to automate races in pegaxy. The game developers

4 Apr 08, 2022
Pytorch implementation of One-Shot Affordance Detection

One-shot Affordance Detection PyTorch implementation of our one-shot affordance detection models. This repository contains PyTorch evaluation code, tr

46 Dec 12, 2022
PyTorch implementation of our paper How robust are discriminatively trained zero-shot learning models?

How robust are discriminatively trained zero-shot learning models? This repository contains the PyTorch implementation of our paper How robust are dis

Mehmet Kerim Yucel 5 Feb 04, 2022
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 896 Jan 01, 2023
Official pytorch implementation of DeformSyncNet: Deformation Transfer via Synchronized Shape Deformation Spaces

DeformSyncNet: Deformation Transfer via Synchronized Shape Deformation Spaces Minhyuk Sung*, Zhenyu Jiang*, Panos Achlioptas, Niloy J. Mitra, Leonidas

Zhenyu Jiang 21 Aug 30, 2022
Contour-guided image completion with perceptual grouping (BMVC 2021 publication)

Contour-guided Image Completion with Perceptual Grouping Authors Morteza Rezanejad*, Sidharth Gupta*, Chandra Gummaluru, Ryan Marten, John Wilder, Mic

Sid Gupta 6 Dec 27, 2022
Pytorch implementation of MLP-Mixer with loading pre-trained models.

MLP-Mixer-Pytorch PyTorch implementation of MLP-Mixer: An all-MLP Architecture for Vision with the function of loading official ImageNet pre-trained p

Qiushi Yang 2 Sep 29, 2022
Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields.

This repository contains the code release for Mip-NeRF: A Multiscale Representation for Anti-Aliasing Neural Radiance Fields. This implementation is written in JAX, and is a fork of Google's JaxNeRF

Google 625 Dec 30, 2022
The offcial repository for 'CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos', SIGIR2022

CharacterBERT-DR The offcial repository for CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos, Sh

ielab 11 Nov 15, 2022
Code base for reproducing results of I.Schubert, D.Driess, O.Oguz, and M.Toussaint: Learning to Execute: Efficient Learning of Universal Plan-Conditioned Policies in Robotics. NeurIPS (2021)

Learning to Execute (L2E) Official code base for completely reproducing all results reported in I.Schubert, D.Driess, O.Oguz, and M.Toussaint: Learnin

3 May 18, 2022
Randstad Artificial Intelligence Challenge (powered by VGEN). Soluzione proposta da Stefano Fiorucci (anakin87) - primo classificato

Randstad Artificial Intelligence Challenge (powered by VGEN) Soluzione proposta da Stefano Fiorucci (anakin87) - primo classificato Struttura director

Stefano Fiorucci 1 Nov 13, 2021
This is a template for the Non-autoregressive Deep Learning-Based TTS model (in PyTorch).

Non-autoregressive Deep Learning-Based TTS Template This is a template for the Non-autoregressive TTS model. It contains Data Preprocessing Pipeline D

Keon Lee 13 Dec 05, 2022
Code to reproduce the experiments in the paper "Transformer Based Multi-Source Domain Adaptation" (EMNLP 2020)

Transformer Based Multi-Source Domain Adaptation Dustin Wright and Isabelle Augenstein To appear in EMNLP 2020. Read the preprint: https://arxiv.org/a

CopeNLU 36 Dec 05, 2022
The Pytorch implementation for "Video-Text Pre-training with Learned Regions"

Region_Learner The Pytorch implementation for "Video-Text Pre-training with Learned Regions" (arxiv) We are still cleaning up the code further and pre

Rui Yan 0 Mar 20, 2022
Towards Part-Based Understanding of RGB-D Scans

Towards Part-Based Understanding of RGB-D Scans (CVPR 2021) We propose the task of part-based scene understanding of real-world 3D environments: from

26 Nov 23, 2022