Official PyTorch implementation for paper "Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer"

Last update: Dec 20, 2022

Overview

UPT: Unary–Pairwise Transformers

This repository contains the official PyTorch implementation for the paper

Frederic Z. Zhang, Dylan Campbell and Stephen Gould. Efficient Two-Stage Detection of Human–Object Interactions with a Novel Unary–Pairwise Transformer. arXiv preprint arXiv:2112.01838.

[project page] [preprint]

Abstract

...
However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary–Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialise, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.

Demonstration on data in the wild

Model Zoo

We provide weights for UPT models pre-trained on HICO-DET and V-COCO for potential downstream applications. In addition, we also provide weights for fine-tuned DETR models to facilitate reproducibility. To attempt fine-tuning the DETR model yourself, refer to this repository.

| Model | Dataset | Default Setting mAP (full, rare, non-rare) | Inference Time | UPT Weights | DETR Weights |
| --- | --- | --- | --- | --- | --- |
| UPT-R50 | HICO-DET | (`31.66`, `25.94`, `33.36`) | `0.042s` | weights | weights |
| UPT-R101 | HICO-DET | (`32.31`, `28.55`, `33.44`) | `0.061s` | weights | weights |
| UPT-R101-DC5 | HICO-DET | (`32.62`, `28.62`, `33.81`) | `0.124s` | weights | weights |

| Model | Dataset | Scenario 1 | Scenario 2 | Inference Time | UPT Weights | DETR Weights |
| --- | --- | --- | --- | --- | --- | --- |
| UPT-R50 | V-COCO | `59.0` | `64.5` | `0.043s` | weights | weights |
| UPT-R101 | V-COCO | `60.7` | `66.2` | `0.064s` | weights | weights |
| UPT-R101-DC5 | V-COCO | `61.3` | `67.1` | `0.131s` | weights | weights |

The inference speed was benchmarked on a GeForce RTX 3090. Note that the weights of the UPT model include those of the detector (DETR). You do not need to download the DETR weights unless you want to train the UPT model from scratch. Training UPT-R50 with 8 GeForce GTX TITAN X GPUs takes around 5 hours on HICO-DET and 40 minutes on V-COCO, almost a tenth of the time required by one-stage models such as QPIC.

Contact

For general inquiries regarding the paper and code, please post them in Discussions. For bug reports and feature requests, please post them in Issues. You can also contact me at [email protected].

Prerequisites

Install the lightweight deep learning library Pocket. The recommended PyTorch version is 1.9.0.
Download the repository and the submodules.

git clone https://github.com/fredzzhang/upt.git
cd upt
git submodule init
git submodule update

Prepare the HICO-DET dataset.
1. If you have not downloaded the dataset before, run the following script.
```
cd /path/to/upt/hicodet
bash download.sh
```
1. If you have previously downloaded the dataset, simply create a soft link.
```
cd /path/to/upt/hicodet
ln -s /path/to/hicodet_20160224_det ./hico_20160224_det
```
Prepare the V-COCO dataset (contained in MS COCO).
1. If you have not downloaded the dataset before, run the following script.
```
cd /path/to/upt/vcoco
bash download.sh
```
1. If you have previously downloaded the dataset, simply create a soft link.
```
cd /path/to/upt/vcoco
ln -s /path/to/coco ./mscoco2014
```

License

UPT is released under the BSD-3-Clause License.

Inference

We have implemented inference utilities with different visualisation options. Provided you have downloaded the model weights to checkpoints/, run the following command to visualise detected instances together with the attention maps from the cooperative and competitive layers. Use the flag --index to select images, and --box-score-thresh to modify the filtering threshold on object boxes.

python inference.py --resume checkpoints/upt-r50-hicodet.pt --index 8789

Here is the sample output. Note that we manually selected some informative attention maps to display. The predicted scores for each action will be printed by the script as well.

To select the V-COCO dataset and V-COCO models, use the flag --dataset vcoco, and then load the corresponding weights. To visualise interactive human-object pairs for a particular action class, use the flag --action to specify the action index. Here is a lookup table for the action indices.

Additionally, to cater for different needs, we implemented an option to run inference on custom images, using the flag --image-path. The following is an example for the interaction "holding an umbrella".

python inference.py --resume checkpoints/upt-r50-hicodet.pt --image-path ./assets/umbrella.jpeg --action 36
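
For scripted or batch runs, the flags described in this section can be assembled programmatically. The following is a minimal sketch and not part of the repository: the helper name `build_inference_command` is hypothetical, and it uses only the flags mentioned above (`--resume`, `--index`, `--image-path`, `--action`, `--dataset`, `--box-score-thresh`). It only builds the argument list and does not check which flag combinations the script accepts.

```python
from typing import List, Optional


def build_inference_command(
    weights: str,
    index: Optional[int] = None,
    image_path: Optional[str] = None,
    action: Optional[int] = None,
    dataset: Optional[str] = None,
    box_score_thresh: Optional[float] = None,
) -> List[str]:
    """Assemble an argument list for inference.py (hypothetical helper)."""
    cmd = ["python", "inference.py", "--resume", weights]
    # Each optional flag is appended only when a value is supplied,
    # so the script's own defaults apply otherwise.
    if dataset is not None:
        cmd += ["--dataset", dataset]
    if index is not None:
        cmd += ["--index", str(index)]
    if image_path is not None:
        cmd += ["--image-path", image_path]
    if action is not None:
        cmd += ["--action", str(action)]
    if box_score_thresh is not None:
        cmd += ["--box-score-thresh", str(box_score_thresh)]
    return cmd


# Reproduces the custom-image example above as an argument list.
umbrella_cmd = build_inference_command(
    "checkpoints/upt-r50-hicodet.pt",
    image_path="./assets/umbrella.jpeg",
    action=36,
)
```

The resulting list can be passed to `subprocess.run(umbrella_cmd, check=True)` from the repository root to launch the script.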

Training and Testing

Refer to launch_template.sh for training and testing commands with different options. To train the UPT model from scratch, you need to download the weights for the corresponding DETR model, and place them under /path/to/upt/checkpoints/. Adjust --world-size based on the number of GPUs available.

To test the UPT model on HICO-DET, you can use either the Python utilities we implemented or the MATLAB utilities provided by Chao et al. For V-COCO, we did not implement evaluation utilities, and instead use the utilities provided by Gupta et al. Refer to these instructions for more details.

Citation

If you find our work useful for your research, please consider citing us:

@article{zhang2021upt,
  author    = {Frederic Z. Zhang and Dylan Campbell and Stephen Gould},
  title     = {Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer},
  journal   = {arXiv preprint arXiv:2112.01838},
  year      = {2021}
}

@inproceedings{zhang2021scg,
  author    = {Frederic Z. Zhang and Dylan Campbell and Stephen Gould},
  title     = {Spatially Conditioned Graphs for Detecting Human–Object Interactions},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2021},
  pages     = {13319-13327}
}

Comments

error when test vcoco

I used `python main.py --cache --dataset vcoco --data-root vcoco/ --partitions trainval test --output-dir vcoco-r50 --resume checkpoints/upt-r50-vcoco.pt` to generate cache.pkl, but an error is reported when I evaluate it.

The eval code is:

from vsrl_eval import VCOCOeval

vsrl_annot_file = 'data/vcoco/vcoco_val.json'
coco_file = 'data/instances_vcoco_all_2014.json'
split_file = 'data/splits/vcoco_val.ids'

vcocoeval = VCOCOeval(vsrl_annot_file, coco_file, split_file)

det_file = '/media/ming-t/Deng/relation_mppe/HOI-UPT/vcoco-r50/cache.pkl'
vcocoeval._do_eval(det_file, ovr_thresh=0.5)

The error is:

loading annotations into memory...
Done (t=0.74s)
creating index...
index created!
loading vcoco annotations...
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    vcocoeval._do_eval(det_file, ovr_thresh=0.5)
  File "/media/ming-t/Deng/relation_mppe/HOI-UPT/lib/vcoco/vsrl_eval.py", line 194, in _do_eval
    self._do_agent_eval(vcocodb, detections_file, ovr_thresh=ovr_thresh)
  File "/media/ming-t/Deng/relation_mppe/HOI-UPT/lib/vcoco/vsrl_eval.py", line 417, in _do_agent_eval
    assert(np.amax(rec) <= 1)
  File "<__array_function__ internals>", line 180, in amax
  File "/home/ming-t/anaconda3/envs/pocket/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 2793, in amax
    return _wrapreduction(a, np.maximum, 'max', axis, None, out,
  File "/home/ming-t/anaconda3/envs/pocket/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity

How to solve it?
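
The traceback ends with `np.amax(rec)` being called on a zero-size array, which means `rec` is empty at that point. One possible first step, offered as a sketch rather than a diagnosis, is to confirm that the detection file loads and is not empty before running the evaluation. The helper name `summarise_pickle` is hypothetical, and the snippet assumes only that `cache.pkl` is a standard pickle file that loads without custom classes.

```python
import pickle


def summarise_pickle(path):
    """Return the type name and length (None if unsized) of a pickled object."""
    with open(path, "rb") as f:
        obj = pickle.load(f)
    # Containers such as lists and dicts report their size; other objects do not.
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

Calling `summarise_pickle('/path/to/cache.pkl')` and seeing a length of zero would suggest that no detections were written. A non-zero length would point the investigation towards the annotation and split files instead.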

opened by leijue222 14

The HOI loss is NaN for rank 0

Dear sir, I followed the README to set up the UPT network, but when I run the command `python main.py --world-size 1 --dataset vcoco --data-root ./v-coco --partitions trainval test --pretrained ../detr-r50-vcoco.pth --output-dir ./upt-r50-vcoco.pt`

I get an error:

Traceback (most recent call last):
  File "main.py", line 208, in <module>
    mp.spawn(main, nprocs=args.world_size, args=(args,))
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/root/autodl-tmp/upload/main.py", line 125, in main
    engine(args.epochs)
  File "/root/pocket/pocket/pocket/core/distributed.py", line 139, in __call__
    self._on_each_iteration()
  File "/root/autodl-tmp/upload/utils.py", line 138, in _on_each_iteration
    raise ValueError(f"The HOI loss is NaN for rank {self._rank}")
ValueError: The HOI loss is NaN for rank 0

I also tried training without the pretrained model, and the same error occurs. When I print the loss, it shows an empty tensor. As a beginner, I have no idea what is happening. I would appreciate any help you could give. I look forward to your reply. Thank you very much.
Inactive

opened by OBVIOUSDAWN 11
Generate the results on the friends.gif

Hello! Thank you for this amazing work! I am curious to know how you got the inference results showing the names of the objects and the activities on the demo_friends.gif. Can you please tell how you achieved that? Thanks in advance.

opened by Andre1998Shuvam 8
Predicted object class instead of just number?

Hi,

Thanks for the amazing work! I would like to ask where we can see the predicted object class of each bounding box, or the prediction results in triplet form. Thank you!
enhancement

opened by xiaoxiaoczw 7
code bug?

Thanks for your amazing work.

I'm confused about the code here: https://github.com/fredzzhang/upt/blob/192743045e4c5f6827299fdc15809f5b97ac1b8e/interaction_head.py#L333

The line `x_keep, y_keep = torch.nonzero(torch.logical_and(x != y, x < n_h)).unbind(1)` is meant to select valid human-object pairs, but I think it also admits human-human pairs, because the constraint `y >= n_h` is missing.
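
To make the question concrete, the condition can be reproduced on a toy example. The following is a pure-Python stand-in for the tensor expression, not the repository's code: the function name `valid_pairs` and the flag `exclude_human_objects` are hypothetical, and it assumes, as the question does, that the first `n_h` of the `n` boxes are humans.

```python
from itertools import product


def valid_pairs(n, n_h, exclude_human_objects=False):
    """Enumerate (x, y) index pairs among n boxes, the first n_h being humans.

    Mirrors the condition `x != y and x < n_h`; the optional flag
    additionally requires `y >= n_h`.
    """
    pairs = []
    for x, y in product(range(n), repeat=2):
        keep = x != y and x < n_h
        if exclude_human_objects:
            keep = keep and y >= n_h
        if keep:
            pairs.append((x, y))
    return pairs


# Two humans (indices 0 and 1) and one object (index 2).
with_humans = valid_pairs(n=3, n_h=2)
# [(0, 1), (0, 2), (1, 0), (1, 2)]
without_humans = valid_pairs(n=3, n_h=2, exclude_human_objects=True)
# [(0, 2), (1, 2)]
```

The pairs `(0, 1)` and `(1, 0)` confirm that the original condition keeps human-human pairs. Such pairs would be needed if a person can itself be the object of an interaction; whether that is the intent here is for the maintainers to confirm.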

opened by ltttpku 6
There is still a problem.

Traceback (most recent call last):
  File "inference.py", line 225, in <module>
    main(args)
  File "C:\Users\User\anaconda3\envs\colab\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "inference.py", line 150, in main
    upt = build_detector(args, conversion)
  File "C:\Users\User\PycharmProjects\hoi\UPT\upt.py", line 276, in build_detector
    detr.backbone[0].num_channels,
  File "C:\Users\User\anaconda3\envs\colab\lib\site-packages\torch\nn\modules\module.py", line 1207, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'DETRsegm' object has no attribute 'backbone'

I changed the torch version and also tried the Colab environment, but the problem still occurs in the same place.

If possible, could you share the full list of installed libraries, e.g. the output of `pip freeze > requirements.txt`?

If it cannot be disclosed publicly, I would appreciate it if you could send it to [email protected].

opened by ghzmwhdk777 4
Here is problem

Traceback (most recent call last):
  File "inference.py", line 225, in <module>
    main(args)
  File "C:\Users\User\anaconda3\envs\colab\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "inference.py", line 150, in main
    upt = build_detector(args, conversion)
  File "C:\Users\User\PycharmProjects\hoi\UPT\upt.py", line 268, in build_detector
    detr, _, postprocessors = build_model(args)
  File "C:\Users\User\PycharmProjects\hoi\UPT\detr\models\__init__.py", line 6, in build_model
    return build(args)
  File "C:\Users\User\PycharmProjects\hoi\UPT\detr\models\detr.py", line 313, in build
    num_classes = 20 if args.dataset_file != 'coco' else 91
AttributeError: 'Namespace' object has no attribute 'dataset_file'

opened by ghzmwhdk777 4
How about training a DETR model on the VCOCO dataset

Thank you for your excellent work. You have provided a tutorial on training DETR models on the HICO-DET dataset. Could you tell us how you trained DETR on the V-COCO dataset?
moved to discussion

opened by ddwhzh 4
confused about the vcoco dataset
There are some cool properties of the V-COCO dataset class you implemented:

"object_to_action" gives the list of actions for each object, e.g. {1: [0, 3, 11, 15], 2: [0, 1, 2, 3, 11], ......}
"objects" returns the list of objects, e.g. ['background', 'person', 'bicycle', .......]
"actions" returns the list of actions, e.g. ['hold obj', 'sit instr', 'ride instr', .......]

However, I'm confused about the relationships among them:

Which object does the key 1 in "1: [0, 3, 11, 15]", the first item of object_to_action, represent?

Which actions do the values [0, 3, 11, 15] in "1: [0, 3, 11, 15]" represent?

According to the lists of actions and objects, actions 0, 3, 11 and 15 represent hold obj, look obj, carry obj and cut obj respectively, while object 1 represents person, which seems odd.
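
The reading described in the question can be written out as a small lookup. This is only a sketch under the question's own assumption that keys of object_to_action index into the objects list and values index into the actions list. It reproduces only the entries quoted above (the full lists are longer), and the names `action_names` and `decode` are hypothetical.

```python
# Only the entries quoted in the question; the real lists contain more items.
objects = ["background", "person", "bicycle"]
action_names = {
    0: "hold obj", 1: "sit instr", 2: "ride instr",
    3: "look obj", 11: "carry obj", 15: "cut obj",
}
object_to_action = {1: [0, 3, 11, 15], 2: [0, 1, 2, 3, 11]}


def decode(object_to_action, objects, action_names):
    """Map object indices and action indices to their names."""
    return {
        objects[obj_idx]: [action_names[a] for a in action_idxs]
        for obj_idx, action_idxs in object_to_action.items()
    }


decoded = decode(object_to_action, objects, action_names)
# {'person': ['hold obj', 'look obj', 'carry obj', 'cut obj'],
#  'bicycle': ['hold obj', 'sit instr', 'ride instr', 'look obj', 'carry obj']}
```

Under this reading, the first entry says that a person can be the object of hold, look, carry and cut, which is the pairing the question finds surprising.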
question moved to discussion
opened by ltttpku 3
train

python main.py --world-size 1 --pretrained checkpoints/detr-r50-hicodet.pth --output-dir checkpoints/upt-r50-hicodet

raise ValueError(f"The HOI loss is NaN for rank {self._rank}") ValueError: The HOI loss is NaN for rank 0

opened by wangjunbianqiang 2

Multiple loss training code

Hi, @fredzzhang :

I want to try training with multiple losses. I found the relevant code and added a second loss; it runs without errors.

However, how do I train with multiple losses properly and set a weight (hyperparameter) for each loss?

if self.training:
    interaction_loss = self.compute_interaction_loss(boxes, bh, bo, logits, prior, targets, pairwise_tokens_x_collated)
    interaction_x_loss = self.compute_interaction_x_loss(boxes, bh, bo, logits, prior, targets, pairwise_tokens_x_collated)
    loss_dict = dict(
        interaction_loss=interaction_loss,
        interaction_x_loss=interaction_x_loss
    )
    return loss_dict

def _on_each_iteration(self):

    loss_dict = self._state.net(
        *self._state.inputs, targets=self._state.targets)
    if loss_dict['interaction_loss'].isnan():
        raise ValueError(f"The HOI loss is NaN for rank {self._rank}")

    self._state.loss = sum(loss for loss in loss_dict.values())
    self._state.optimizer.zero_grad(set_to_none=True)
    self._state.loss.backward()
    if self.max_norm > 0:
        torch.nn.utils.clip_grad_norm_(self._state.net.parameters(), self.max_norm)
    self._state.optimizer.step()
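
One common way to weight several losses is to replace the plain sum with a weighted sum. The following is a minimal sketch, not the repository's implementation: `combine_losses` and `loss_weights` are hypothetical names, and the weight values are placeholders that would need tuning. It also checks every loss for NaN rather than only `interaction_loss`. The arithmetic is written so that it works for plain floats and, by the same operations, for scalar tensors.

```python
import math


def combine_losses(loss_dict, loss_weights):
    """Return the weighted sum of named losses.

    Losses without an entry in loss_weights default to a weight of 1.0.
    """
    for name, value in loss_dict.items():
        if math.isnan(float(value)):
            raise ValueError(f"The loss '{name}' is NaN")
    return sum(
        loss_weights.get(name, 1.0) * value
        for name, value in loss_dict.items()
    )


# Placeholder values for illustration only.
example_losses = {"interaction_loss": 2.0, "interaction_x_loss": 4.0}
example_weights = {"interaction_x_loss": 0.5}
total = combine_losses(example_losses, example_weights)  # 1.0 * 2.0 + 0.5 * 4.0
```

In the training step above, the line that sums `loss_dict.values()` would be the place to call such a helper, with the weights exposed as command-line arguments or constants.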

yaoyaosanqi.

opened by yaoyaosanqi 2

Releases(v1.0)

v1.0(Dec 6, 2021)

This is the initial release of the UPT model.

Owner

Frederic Zhang

PhD researcher, photographer, substandard musician but a linguistic genius

GitHub Repository https://fredzzhang.com/unary-pairwise-transformers
