In this project we investigate the performance of the SetCon model on realistic video footage. Therefore, we implemented the model in PyTorch and tested the model on two example videos.

Overview

Contrastive Learning of Object Representations

Supervisor:

Institutions:

Project Description

Contrastive Learning is an unsupervised method for learning similarities or differences in a dataset, without the need of labels. The main idea is to provide the machine with similar (so called positive samples) and with very different data (negative or corrupted samples). The task of the machine then is to leverage this information and to pull the positive examples in the embedded space together, while pushing the negative examples further apart. Next to being unsupervised, another major advantage is that the loss is applied on the latent space rather than being pixel-base. This saves computation and memory, because there is no need for a decoder and also delivers more accurate results.

eval_3_obj

In this work, we will investigate the SetCon model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. [1] (Paper) The SetCon model has been published in November 2020 by the Google Brain Team and introduces an attention-based object extraction in combination with contrastive learning. It incorporates a novel slot-attention module [2](Paper), which is an iterative attention mechanism to map the feature maps from the CNN-Encoder to a predefined number of object slots and has been inspired by the transformer models from the NLP world.

We investigate the utility of this architecture when used together with realistic video footage. Therefore, we implemented the SetCon with PyTorch according to its description and build upon it to meet our requirements. We then created two different datasets, in which we film given objects from different angles and distances, similar to Pirk [3] (Github, Paper). However, they relied on a faster-RCNN for the object detection, whereas the goal of the SetCon is to extract the objects solely by leveraging the contrastive loss and the slot attention module. By training a decoder on top of the learned representations, we found that in many cases the model can successfully extract objects from a scene.

This repository contains our PyTorch-implementation of the SetCon-Model from 'Learning Object-Centric Video Models by Contrasting Sets' by Löwe et al. Implementation is based on the description in the article. Note, this is not the official implementation. If you have questions, feel free to reach out to me.

Results

For our work, we have taken two videos, a Three-Object video and a Seven-Object video. In these videos we interacted with the given objects and moved them to different places and constantly changed the view perspective. Both are 30mins long, such that each contains about 54.000 frames.

eval_3_obj
Figure 1: An example of the object extraction on the test set of the Three-Object dataset.

We trained the contrastive pretext model (SetCon) on the first 80% and then evaluated the learned representations on the remaining 20%. Therefore, we trained a decoder, similar to the evaluation within the SetCon paper and looked into the specialisation of each slot. Figures 1 and 2 display two evaluation examples, from the test-set of the Three-Object Dataset and the Seven-Object Dataset. Bot figures start with the ground truth for three timestamps. During evaluation only the ground truth at t will be used to obtain the reconstructed object slots as well as their alpha masks. The Seven-Object video is itended to be more complex and one can perceive in figure 2 that the model struggles more than on the Three-Obejct dataset to route the objects to slots. On the Three-Object dataset, we achieved 0.0043 ± 0.0029 MSE and on the Seven-Object dataset 0.0154 ± 0.0043 MSE.

eval_7_obj
Figure 2: An example of the object extraction on the test set of the Seven-Object dataset.

How to use

For our work, we have taken two videos, a Three-Object video and Seven-Object video. Both datasets are saved as frames and are then encoded in a h5-files. To use a different dataset, we further provide a python routine process frames.py, which converts frames to h5 files.

For the contrastive pretext-task, the training can be started by:

python3 train_pretext.py --end 300000 --num-slots 7
        --name pretext_model_1 --batch-size 512
        --hidden-dim=1024 --learning-rate 1e-5
        --feature-dim 512 --data-path ’path/to/h5file’

Further arguments, like the size of the encoder or for an augmentation pipeline, use the flag -h for help. Afterwards, we froze the weights from the encoder and the slot-attention-module and trained a downstream decoder on top of it. The following command will train the decoder upon the checkpoint file from the pretext task:

python3 train_decoder.py --end 250000 --num-slots 7
        --name downstream_model_1 --batch-size 64
        --hidden-dim=1024 --feature-dim 512
        --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --learning-rate 1e-5

For MSE evaluation on the test-set, use both checkpoints, from the pretext- model for the encoder- and slot-attention-weights and from the downstream- model for the decoder-weights and run:

python3 eval.py --num-slots 7 --name evaluation_1
        --batch-size 64 --hidden-dim=1024
        --feature-dim 512 --data-path ’path/to/h5file’
        --pretext-path "path/to/pretext.pth.tar"
        --decoder-path "path/to/decoder.pth.tar"

Implementation Adjustments

Instead of many small sequences of artificially created frames, we need to deal with a long video-sequence. Therefore, each element in our batch mirrors a single frame at a given time t, not a sequence. For this single frame at time t, we load its two predecessors, which are then used to predict the frame at t, and thereby create a positive example. Further, we found, that the infoNCE-loss to be numerically unstable in our case, hence we opted for the almost identical but more stable NT-Xent in our implementation.

References

[1] Löwe, Sindy et al. (2020). Learning object-centric video models by contrasting sets. Google Brain team.

[2] Locatello, Francesco et al. Object-centric learning with slot attention.

[3] Pirk, Sören et al. (2019). Online object representations with contrastive learning. Google Brain team.

Owner
Dirk Neuhäuser
Dirk Neuhäuser
The code of paper "Block Modeling-Guided Graph Convolutional Neural Networks".

Block Modeling-Guided Graph Convolutional Neural Networks This repository contains the demo code of the paper: Block Modeling-Guided Graph Convolution

22 Dec 08, 2022
Multi-task Multi-agent Soft Actor Critic for SMAC

Multi-task Multi-agent Soft Actor Critic for SMAC Overview The CARE formulti-task: Multi-Task Reinforcement Learning with Context-based Representation

RuanJingqing 8 Sep 30, 2022
Black box hyperparameter optimization made easy.

BBopt BBopt aims to provide the easiest hyperparameter optimization you'll ever do. Think of BBopt like Keras (back when Theano was still a thing) for

Evan Hubinger 70 Nov 03, 2022
This is the code for our paper "Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text"

Iconary This is the code for our paper "Iconary: A Pictionary-Based Game for Testing Multimodal Communication with Drawings and Text". It includes the

AI2 6 May 24, 2022
Pytorch re-implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition (CVPR 2022)

SwinTextSpotter This is the pytorch implementation of Paper: SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text R

mxin262 183 Jan 03, 2023
A TensorFlow 2.x implementation of Masked Autoencoders Are Scalable Vision Learners

Masked Autoencoders Are Scalable Vision Learners A TensorFlow implementation of Masked Autoencoders Are Scalable Vision Learners [1]. Our implementati

Aritra Roy Gosthipaty 59 Dec 10, 2022
Code repo for "RBSRICNN: Raw Burst Super-Resolution through Iterative Convolutional Neural Network" (Machine Learning and the Physical Sciences workshop in NeurIPS 2021).

RBSRICNN: Raw Burst Super-Resolution through Iterative Convolutional Neural Network An official PyTorch implementation of the RBSRICNN network as desc

Rao Muhammad Umer 6 Nov 14, 2022
(AAAI2020)Grapy-ML: Graph Pyramid Mutual Learning for Cross-dataset Human Parsing

Grapy-ML: Graph Pyramid Mutual Learning for Cross-dataset Human Parsing This repository contains pytorch source code for AAAI2020 oral paper: Grapy-ML

54 Aug 04, 2022
PocketNet: Extreme Lightweight Face Recognition Network using Neural Architecture Search and Multi-Step Knowledge Distillation

PocketNet This is the official repository of the paper: PocketNet: Extreme Lightweight Face Recognition Network using Neural Architecture Search and M

Fadi Boutros 40 Dec 22, 2022
ScaleNet: A Shallow Architecture for Scale Estimation

ScaleNet: A Shallow Architecture for Scale Estimation Repository for the code of ScaleNet paper: "ScaleNet: A Shallow Architecture for Scale Estimatio

Axel Barroso 34 Nov 09, 2022
Ensembling Off-the-shelf Models for GAN Training

Data-Efficient GANs with DiffAugment project | paper | datasets | video | slides Generated using only 100 images of Obama, grumpy cats, pandas, the Br

MIT HAN Lab 1.2k Dec 26, 2022
Face Alignment using python

Face Alignment Face Alignment using python Input Image Aligned Face Aligned Face Aligned Face Input Image Aligned Face Input Image Aligned Face Instal

Sajjad Aemmi 28 Nov 23, 2022
[ICRA2021] Reconstructing Interactive 3D Scene by Panoptic Mapping and CAD Model Alignment

Interactive Scene Reconstruction Project Page | Paper This repository contains the implementation of our ICRA2021 paper Reconstructing Interactive 3D

97 Dec 28, 2022
Meta Learning Backpropagation And Improving It (VSML)

Meta Learning Backpropagation And Improving It (VSML) This is research code for the NeurIPS 2021 publication Kirsch & Schmidhuber 2021. Many concepts

Louis Kirsch 22 Dec 21, 2022
Opinionated code formatter, just like Python's black code formatter but for Beancount

beancount-black Opinionated code formatter, just like Python's black code formatter but for Beancount Try it out online here Features MIT licensed - b

Launch Platform 16 Oct 11, 2022
Example for AUAV 2022 with obstacle avoidance.

AUAV 2022 Sample This is a sample PX4 based quadrotor path planning framework based on Ubuntu 20.04 and ROS noetic for the IEEE Autonomous UAS 2022 co

James Goppert 11 Sep 16, 2022
PyTorch Implementation of the paper Learning to Reweight Examples for Robust Deep Learning

Learning to Reweight Examples for Robust Deep Learning Unofficial PyTorch implementation of Learning to Reweight Examples for Robust Deep Learning. Th

Daniel Stanley Tan 325 Dec 28, 2022
Adaptive FNO transformer - official Pytorch implementation

Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers This repository contains PyTorch implementation of the Adaptive Fourier Neu

NVIDIA Research Projects 77 Dec 29, 2022
FlingBot: The Unreasonable Effectiveness of Dynamic Manipulations for Cloth Unfolding

This repository contains code for training and evaluating FlingBot in both simulation and real-world settings on a dual-UR5 robot arm setup for Ubuntu 18.04

Columbia Artificial Intelligence and Robotics Lab 70 Dec 06, 2022
Mmrotate - OpenMMLab Rotated Object Detection Benchmark

OpenMMLab website HOT OpenMMLab platform TRY IT OUT 📘 Documentation | 🛠️ Insta

OpenMMLab 1.2k Jan 04, 2023