[CVPR 2022 Oral] Versatile Multi-Modal Pre-Training for Human-Centric Perception


Fangzhou Hong1  Liang Pan1  Zhongang Cai1,2,3  Ziwei Liu1*
1S-Lab, Nanyang Technological University  2SenseTime Research  3Shanghai AI Laboratory

Accepted to CVPR 2022 (Oral)

This repository contains the official implementation of Versatile Multi-Modal Pre-Training for Human-Centric Perception. For brevity, we name our method HCMoCo.


arXiv | Project Page | Dataset

Citation

If you find our work useful for your research, please consider citing the paper:

@article{hong2022hcmoco,
  title={Versatile Multi-Modal Pre-Training for Human-Centric Perception},
  author={Hong, Fangzhou and Pan, Liang and Cai, Zhongang and Liu, Ziwei},
  journal={arXiv preprint arXiv:2203.13815},
  year={2022}
}

Updates

[03/2022] Code release!

[03/2022] HCMoCo is accepted to CVPR 2022 for Oral presentation 🥳 !

Installation

We recommend using conda to manage the Python environment. The commands below are provided for reference.

git clone git@github.com:hongfz16/HCMoCo.git
cd HCMoCo
conda create -n HCMoCo python=3.6
conda activate HCMoCo
conda install -c pytorch pytorch=1.6.0 torchvision=0.7.0 cudatoolkit=10.1
pip install -r requirements.txt

In addition to the steps above, if you want to run the PointNet++ experiments, remember to compile the PointNet++ operators.

cd pycontrast/networks/pointnet2
python setup.py install
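
After installation, a quick sanity check can confirm that the interpreter sees the expected packages. The snippet below is a minimal sketch and is not part of the repository; the helper name `report_environment` is hypothetical. It only reports what can be imported and does not enforce the versions pinned above.

```python
import importlib
import sys


def report_environment(packages=("torch", "torchvision")):
    """Return a dict mapping 'python' and each package name to a version string.

    Packages that cannot be imported are reported as None instead of raising.
    """
    report = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None
        else:
            # Fall back to "unknown" for modules without a __version__ attribute.
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in report_environment().items():
        shown = version if version is not None else "NOT INSTALLED"
        print("{:<12} {}".format(name, shown))
```

Run it inside the activated HCMoCo environment; a `NOT INSTALLED` entry suggests that the corresponding install step did not complete.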

Dataset Preparation

1. NTU RGB-D Dataset

This dataset is used for pre-training. Download the 'NTU RGB+D 60' dataset here. Extract the data to pycontrast/data/NTURGBD/NTURGBD. The folder structure should look like:

./
├── ...
└── pycontrast/data/NTURGBD/
    ├── NTURGBD/
    │   ├── nturgb+d_rgb/
    │   ├── nturgb+d_depth_masked/
    │   ├── nturgb+d_skeletons/
    │   └── ...

Preprocess the raw data with the following two Python scripts, which produce calibrated RGB frames in nturgb+d_rgb_warped_correction and extracted skeleton information in nturgb+d_parsed_skeleton.

cd pycontrast/data/NTURGBD
python generate_skeleton_data.py
python preprocess_nturgbd.py

2. NTURGBD-Parsing-4K Dataset

This dataset is used for both pre-training and the depth human parsing task. Follow the instructions here to prepare the NTURGBD-Parsing-4K dataset.

3. MPII Human Pose Dataset

This dataset is used for pre-training. Download the 'MPII Human Pose Dataset' here. Extract it to pycontrast/data/mpii. The folder structure should look like:

./
├── ...
└── pycontrast/data/mpii
    ├── annot/
    └── images/

4. COCO Keypoint Detection Dataset

This dataset is used for both pre-training and DensePose estimation. Download the COCO 2014 train/val images and annotations here. Extract them to pycontrast/data/coco. The folder structure should look like:

./
├── ...
└── pycontrast/data/coco
    ├── annotations/
    │   └── *.json
    └── images/
        ├── train2014/
        │   └── *.jpg
        └── val2014/
            └── *.jpg

5. Human3.6M Dataset

This dataset is used for the RGB human parsing task. Download the Human3.6M dataset here and extract it under HRNet-Semantic-Segmentation/data/human3.6m. Use the provided script mp_parsedata.py to pre-process the raw data. The folder structure should look like:

./
├── ...
└── HRNet-Semantic-Segmentation/data/human3.6m
    ├── protocol_1/
    │   ├── rgb/
    │   └── seg/
    ├── flist_2hz_train.txt
    ├── flist_2hz_eval.txt
    └── ...

6. ITOP Dataset

This dataset is used for depth 3D pose estimation. Download the ITOP dataset here and extract it under A2J/data. Use the provided script data_preprocess.py to pre-process the raw data. The folder structure should look like:

./
├── ...
└── A2J/data
    ├── side_train/
    ├── side_test/
    ├── itop_size_mean.npy
    ├── itop_size_std.npy
    ├── bounding_box_depth_train.pkl
    ├── itop_side_bndbox_test.mat
    └── ...
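
Before launching pre-training, it may help to confirm that the folders described in the dataset sections above are in place. The following is a minimal sketch, not a script shipped with the repository: `EXPECTED_DIRS` and `missing_dirs` are hypothetical names, and the list covers only the directories named in this section.

```python
from pathlib import Path

# Directories named in the dataset sections above, relative to the repository root.
EXPECTED_DIRS = [
    "pycontrast/data/NTURGBD/NTURGBD/nturgb+d_rgb",
    "pycontrast/data/NTURGBD/NTURGBD/nturgb+d_depth_masked",
    "pycontrast/data/NTURGBD/NTURGBD/nturgb+d_skeletons",
    "pycontrast/data/mpii/annot",
    "pycontrast/data/mpii/images",
    "pycontrast/data/coco/annotations",
    "pycontrast/data/coco/images/train2014",
    "pycontrast/data/coco/images/val2014",
    "HRNet-Semantic-Segmentation/data/human3.6m/protocol_1/rgb",
    "HRNet-Semantic-Segmentation/data/human3.6m/protocol_1/seg",
    "A2J/data/side_train",
    "A2J/data/side_test",
]


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected relative paths that are not directories under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    for rel in missing_dirs("."):
        print("missing:", rel)
```

Run it from the repository root; an empty output means every listed directory was found. Only the datasets needed for the experiments you plan to run have to be present.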

Model Zoo

TBA

HCMoCo Pre-train

Finally, let's start the pre-training process. We use Slurm to manage distributed training, so you may need to modify the scripts mentioned below to suit your own distributed training setup. HCMoCo is developed on top of the CMC repository. The code for this part is provided under pycontrast.

1. First Stage

For the first stage, we only perform 'Sample-level modality-invariant representation learning' for 100 epochs. We provide training scripts for this stage under pycontrast/scripts/FirstStage. Specifically, we provide scripts for training with 'NTURGBD+MPII' (train_ntumpiirgbd2s_hrnet_w18.sh) and with 'NTURGBD+COCO' (train_ntucocorgbd2s_hrnet_w18.sh).

cd pycontrast
sh scripts/FirstStage/train_ntumpiirgbd2s_hrnet_w18.sh
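
For readers unfamiliar with contrastive pre-training, the idea of pulling together the RGB and depth embeddings of the same sample can be illustrated with a generic InfoNCE-style loss. This is a sketch only: it is not the objective implemented in pycontrast, and the function name, the temperature of 0.07, and the toy embeddings are assumptions chosen for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_modal_info_nce(rgb_embeds, depth_embeds, temperature=0.07):
    """Mean InfoNCE loss with RGB embeddings as queries and depth embeddings as keys.

    The i-th RGB and i-th depth embedding form the positive pair; all other
    depth embeddings in the batch act as negatives.
    """
    if not rgb_embeds or len(rgb_embeds) != len(depth_embeds):
        raise ValueError("expected two non-empty batches of equal size")
    rgb = [_normalize(v) for v in rgb_embeds]
    depth = [_normalize(v) for v in depth_embeds]
    total = 0.0
    for i, query in enumerate(rgb):
        # Cosine similarity between this query and every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(query, key)) / temperature for key in depth]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - logits[i]
    return total / len(rgb)


# Toy batch: matching pairs should score a lower loss than mismatched pairs.
toy = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = cross_modal_info_nce(toy, toy)
loss_mismatched = cross_modal_info_nce(toy, toy[::-1])
```

In this toy batch the matched pairing yields a loss close to zero, while the mismatched pairing is penalized heavily, which is the behavior a sample-level contrastive objective relies on.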

2. Second Stage

For the second stage, all three proposed learning targets in HCMoCo are used to continue training for another 100 epochs. We provide training scripts for this stage under pycontrast/scripts/SecondStage. The script names correspond to those of the first stage.

3. Extract pre-trained weights

After the two-stage pre-training, the pre-trained weights of the RGB and depth encoders need to be extracted before transferring them to downstream tasks. Please refer to pycontrast/transfer_ckpt.py for extracting the pre-trained weights of the RGB encoder and to pycontrast/transfer_ckpt_depth.py for those of the depth encoder.
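
Conceptually, extracting one encoder from a multi-encoder checkpoint amounts to filtering the state dict by key prefix and stripping that prefix. The sketch below illustrates the idea on a plain dictionary. It does not reproduce the logic of transfer_ckpt.py, and the prefix `module.rgb_encoder.` and the function name are hypothetical; consult the two scripts for the actual key names.

```python
def extract_encoder_weights(state_dict, prefix):
    """Return the entries stored under `prefix`, with the prefix removed from each key."""
    extracted = {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }
    if not extracted:
        raise KeyError("no parameters found under prefix {!r}".format(prefix))
    return extracted


# Toy stand-in for a checkpoint; in practice the values would be tensors.
checkpoint = {
    "module.rgb_encoder.conv1.weight": "w_rgb",
    "module.depth_encoder.conv1.weight": "w_depth",
}
rgb_weights = extract_encoder_weights(checkpoint, "module.rgb_encoder.")
```

With PyTorch, the dictionary would typically come from `torch.load(path, map_location="cpu")` and the result would be written back with `torch.save`.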

Evaluation on Downstream Tasks

1. DensePose Estimation

DensePose estimation is performed on the COCO dataset. Please refer to detectron2 for the training and evaluation of DensePose estimation. We provide our config files under DensePose-Config for reference. Set the config option MODEL.WEIGHTS to the path of the pre-trained weights.

2. RGB Human Parsing

RGB human parsing is performed on the Human3.6M dataset. We develop the RGB human parsing task based on the HRNet-Semantic-Segmentation repository and include our version in this repository. We provide a config template at HRNet-Semantic-Segmentation/experiments/human36m/config-template.yaml. Remember to set the config option MODEL.PRETRAINED to the path of the pre-trained weights. The training and evaluation commands are provided below.

cd HRNet-Semantic-Segmentation
# Training
python -m torch.distributed.launch \
  --nproc_per_node=2 \
  --master_port=${port} \
  tools/train.py \
      --cfg ${config_file}
# Evaluation
python tools/test.py \
    --cfg ${config_file} \
    TEST.MODEL_FILE ${path_to_trained_model}/best.pth \
    TEST.FLIP_TEST True \
    TEST.NUM_SAMPLES 0
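
The `${port}` and `${config_file}` placeholders in the commands above must be filled in by the user. As one possible convenience, the sketch below picks a free TCP port and assembles the same training command as an argument list. The helper names are hypothetical, and the default of two processes simply mirrors the command above.

```python
import socket


def find_free_port():
    """Ask the OS for a currently unused TCP port on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))  # port 0 lets the OS choose
        return sock.getsockname()[1]


def build_train_command(config_file, nproc_per_node=2, port=None):
    """Assemble the distributed training command as a list of arguments."""
    port = find_free_port() if port is None else port
    return [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node={}".format(nproc_per_node),
        "--master_port={}".format(port),
        "tools/train.py",
        "--cfg", config_file,
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` from inside the HRNet-Semantic-Segmentation directory. Note that the port is released before the launcher binds it, so another process could claim it in between; pass an explicit `port` if that matters.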

3. Depth Human Parsing

Depth human parsing is performed on our proposed NTURGBD-Parsing-4K dataset. As with RGB human parsing, the code is developed based on the HRNet-Semantic-Segmentation repository. We provide a config template at HRNet-Semantic-Segmentation/experiments/nturgbd_d/config-template.yaml. Please refer to the 'RGB Human Parsing' section above for detailed usage.

4. Depth 3D Pose Estimation

Depth 3D pose estimation is evaluated on the ITOP dataset. We develop the code based on the A2J repository. Since the original repository does not provide training code, we implemented it ourselves. The training and evaluation commands are provided below.

cd A2J
python main.py \
    --pretrained_pth ${path_to_pretrained_weights} \
    --output ${path_to_the_output_folder}

Experiments on the Versatility of HCMoCo

1. Cross-Modality Supervision

The experiments on the versatility of HCMoCo are evaluated on the NTURGBD-Parsing-4K dataset. For the 'RGB->Depth' cross-modality supervision, please refer to pycontrast/scripts/Versatility/train_ntusegrgbd2s_hrnet_w18_sup_rgb_cmc1_other1.sh. For the 'Depth->RGB' cross-modality supervision, please refer to pycontrast/scripts/Versatility/train_ntusegrgbd2s_hrnet_w18_sup_d_cmc1_other1.sh.

cd pycontrast
sh scripts/Versatility/train_ntusegrgbd2s_hrnet_w18_sup_rgb_cmc1_other1.sh
sh scripts/Versatility/train_ntusegrgbd2s_hrnet_w18_sup_d_cmc1_other1.sh

2. Missing-Modality Inference

Please refer to the provided script pycontrast/scripts/Versatility/train_ntusegrgbd2s_hrnet_w18_sup_rgbd_cmc1_other1.sh.

cd pycontrast
sh scripts/Versatility/train_ntusegrgbd2s_hrnet_w18_sup_rgbd_cmc1_other1.sh

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgements

This work is supported by NTU NAP, MOE AcRF Tier 2 (T2EP20221-0033), and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

We thank the following repositories for their contributions in our implementation: CMC, HRNet-Semantic-Segmentation, SemGCN, PointNet2.PyTorch, and A2J.
