[BMVC'21] Official PyTorch Implementation of Grounded Situation Recognition with Transformers

Grounded Situation Recognition with Transformers

Paper | Model Checkpoint

  • This is the official PyTorch implementation of Grounded Situation Recognition with Transformers (BMVC 2021).
  • GSRTR (Grounded Situation Recognition TRansformer) achieves state-of-the-art results on every evaluation metric of the SWiG benchmark.
  • This repository contains instructions, code, and a model checkpoint.

Overview

Grounded Situation Recognition (GSR) is the task of not only classifying the salient action (verb) in an image, but also predicting the entities (nouns) associated with the action's semantic roles and their locations in the image. Inspired by the remarkable success of Transformers in vision tasks, we propose a GSR model based on a Transformer encoder-decoder architecture. The attention mechanism of our model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly handle the complicated, image-dependent relations between entities for improved noun classification and localization. Our model is the first Transformer architecture for GSR, and achieves state-of-the-art results on every evaluation metric of the SWiG benchmark.

GSRTR mainly consists of two components: a Transformer encoder for verb prediction and a Transformer decoder for grounded noun prediction. For details, please see Grounded Situation Recognition with Transformers by Junhyeong Cho, Youngseok Yoon, Hyeonjun Lee and Suha Kwak.

Environment Setup

We provide instructions for environment setup.

# Clone this repository and navigate into the repository
git clone https://github.com/jhcho99/gsrtr.git    
cd gsrtr                                          

# Create a conda environment, activate the environment and install PyTorch via conda
conda create --name gsrtr python=3.9              
conda activate gsrtr                             
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge 

# Install requirements via pip
pip install -r requirements.txt                   

SWiG Dataset

Annotations are given in JSON format, and the annotation files are under the "SWiG/SWiG_jsons/" directory. Images can be downloaded here. Please download the images and store them in the "SWiG/images_512/" directory.

In the SWiG dataset, each image is associated with Verb, Frame and Groundings.

A) Verb: each image is paired with a verb. In the annotation file, "verb" denotes the salient action for an image.

B) Frame: a frame denotes the set of semantic roles for a verb. For example, the frame for verb "Catching" denotes the set of semantic roles "Agent", "Caught Item", "Tool" and "Place". In the annotation file, "frames" show the set of semantic roles for a verb, and noun annotations for each role. There are three noun annotations for each role, which are given by three different annotators.

C) Groundings: each grounding is described in [x1, y1, x2, y2] format. In the annotation file, "bb" denotes groundings for roles. Note that nouns can be labeled without groundings, e.g., in the case of occluded objects. When there is no grounding for a role, [-1, -1, -1, -1] is given.

# an example of annotation for an image

"catching_175.jpg": {
    "verb": "catching",
    "height": 512, 
    "width": 910,
    "bb": {"tool": [-1, -1, -1, -1], 
           "caughtitem": [444, 169, 671, 317], 
           "place": [-1, -1, -1, -1], 
           "agent": [270, 112, 909, 389]},
    "frames": [{"tool": "n05282433", "caughtitem": "n02190166", "place": "n03991062", "agent": "n00017222"}, 
               {"tool": "n05302499", "caughtitem": "n02190166", "place": "n03990474", "agent": "n00017222"}, 
               {"tool": "n07655505", "caughtitem": "n13152742", "place": "n00017222", "agent": "n02190166"}]
    }
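
The annotation layout above can be read with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper name `summarize` and the constant `NO_GROUNDING` are hypothetical, and the sample entry is copied from the example above rather than loaded from an annotation file.

```python
import json

# Sample entry copied from the annotation example above.
SAMPLE = json.loads("""
{
  "catching_175.jpg": {
    "verb": "catching",
    "height": 512,
    "width": 910,
    "bb": {"tool": [-1, -1, -1, -1],
           "caughtitem": [444, 169, 671, 317],
           "place": [-1, -1, -1, -1],
           "agent": [270, 112, 909, 389]},
    "frames": [
      {"tool": "n05282433", "caughtitem": "n02190166", "place": "n03991062", "agent": "n00017222"},
      {"tool": "n05302499", "caughtitem": "n02190166", "place": "n03990474", "agent": "n00017222"},
      {"tool": "n07655505", "caughtitem": "n13152742", "place": "n00017222", "agent": "n02190166"}
    ]
  }
}
""")

# Placeholder that the annotations use when a role has no grounding.
NO_GROUNDING = [-1, -1, -1, -1]


def summarize(entry):
    """Map each role to its box (None if ungrounded) and its three noun labels."""
    summary = {}
    for role, box in entry["bb"].items():
        summary[role] = {
            "box": None if box == NO_GROUNDING else box,
            # One noun label per annotator, in the order given in "frames".
            "nouns": [frame[role] for frame in entry["frames"]],
        }
    return summary


summary = summarize(SAMPLE["catching_175.jpg"])
for role, info in summary.items():
    print(role, info["box"], info["nouns"])
```

Mapping ungrounded roles to `None` keeps the `[-1, -1, -1, -1]` placeholder from being mistaken for real coordinates in later processing.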

The imsitu_space.json file contains additional information about verbs and nouns.

# an example of additional verb information

"catching": {
    "framenet": "Getting", 
    "abstract": "an AGENT catches a CAUGHTITEM with a TOOL at a PLACE", 
    "def": "capture a sought out item", 
    "order": ["agent", "caughtitem", "tool", "place"], 
    "roles": {"tool": {"framenet": "manner", "def": "The object used to do the catch action"}, 
              "caughtitem": {"framenet": "theme", "def": "The entity being caught"}, 
              "place": {"framenet": "place", "def": "The location where the catch event is happening"}, 
              "agent": {"framenet": "recipient", "def": "The entity doing the catch action"}}
    }

# an example of additional noun information

"n00017222": {
    "gloss": ["plant", "flora", "plant life"], 
    "def": "(botany) a living organism lacking the power of locomotion"
    }
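
As an illustration of how the verb and noun entries can be combined, the sketch below lists the roles of one frame in the verb's "order" and replaces noun IDs with their first gloss where one is known. The helper `describe` is hypothetical, and the dictionaries hold only the fields shown in the examples above. How these entries are nested inside imsitu_space.json is not shown here, so the helper takes already-extracted dictionaries.

```python
# Fields copied from the verb and noun examples above (only the ones used here).
VERB_INFO = {
    "abstract": "an AGENT catches a CAUGHTITEM with a TOOL at a PLACE",
    "order": ["agent", "caughtitem", "tool", "place"],
}
NOUN_INFO = {
    "n00017222": {"gloss": ["plant", "flora", "plant life"]},
}


def describe(frame, verb_info, noun_info):
    """Return (role, label) pairs following the verb's role order.

    The label is the first gloss of the noun when it is known,
    otherwise the raw noun ID is kept.
    """
    pairs = []
    for role in verb_info["order"]:
        noun_id = frame.get(role, "")
        gloss = noun_info.get(noun_id, {}).get("gloss")
        pairs.append((role, gloss[0] if gloss else noun_id))
    return pairs


# First annotator's frame from the annotation example.
frame = {"tool": "n05282433", "caughtitem": "n02190166",
         "place": "n03991062", "agent": "n00017222"}
pairs = describe(frame, VERB_INFO, NOUN_INFO)
for role, label in pairs:
    print(f"{role}: {label}")
```

Only "n00017222" has a gloss in this small sample, so the other roles fall back to their noun IDs.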

Additional Details

  • All images should be under the "SWiG/images_512/" directory.
  • The train.json file is for the training set.
  • The dev.json file is for the development set.
  • The test.json file is for the test set.

Training

To train GSRTR on a single node with 4 GPUs for 40 epochs, run:

python -m torch.distributed.launch --nproc_per_node=4 --use_env main.py \
           --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
           --num_workers 4 --enc_layers 6 --dec_layers 6 --dropout 0.15 --hidden_dim 512 \
           --output_dir gsrtr

To train GSRTR on a Slurm cluster with submitit using 4 TITAN Xp GPUs for 40 epochs, run:

python run_with_submitit.py --ngpus 4 --nodes 1 --job_dir gsrtr \
        --backbone resnet50 --batch_size 16 --dataset_file swig --epochs 40 \
        --num_workers 4 --enc_layers 6 --dec_layers 6 --dropout 0.15 --hidden_dim 512 \
        --partition titanxp

  • A single epoch takes about 30 minutes. Training for 40 epochs takes around 20 hours on a single machine with 4 TITAN Xp GPUs.
  • We use the AdamW optimizer with learning rate 10^-4 (10^-5 for the backbone), weight decay 10^-4 and β = (0.9, 0.999).
  • Random Color Jittering, Random Gray Scaling, Random Scaling and Random Horizontal Flipping are used for augmentation.
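
The two learning rates listed above are typically realized with optimizer parameter groups. The sketch below is an assumption about how such groups could be built, not the repository's code: the helper `build_param_groups` is hypothetical, and it assumes that backbone parameters can be recognized by the substring "backbone" in their names. It uses stand-in values so that it runs without PyTorch.

```python
def build_param_groups(named_params, lr=1e-4, lr_backbone=1e-5):
    """Split (name, param) pairs into two groups with separate learning rates.

    Assumption: parameters whose name contains "backbone" belong to the backbone.
    """
    backbone, rest = [], []
    for name, param in named_params:
        (backbone if "backbone" in name else rest).append(param)
    return [
        {"params": rest, "lr": lr},
        {"params": backbone, "lr": lr_backbone},
    ]


# Stand-in names and values; with PyTorch these would come from model.named_parameters().
fake_params = [
    ("backbone.conv1.weight", "w0"),
    ("transformer.encoder.weight", "w1"),
    ("verb_classifier.bias", "w2"),
]
groups = build_param_groups(fake_params)
for group in groups:
    print(group["lr"], group["params"])
```

With a real model, the resulting list could be passed as `torch.optim.AdamW(groups, weight_decay=1e-4, betas=(0.9, 0.999))`, which matches the hyperparameters listed above.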

Inference

To run inference on a custom image, run:

python inference.py --image_path inference/filename.jpg \
                    --saved_model gsrtr_checkpoint.pth \
                    --output_dir inference

  • The model checkpoint can be downloaded here.

[Figure: example of an inference result]

Acknowledgements

Our code is modified and adapted from these amazing repositories:

Contact

Junhyeong Cho ([email protected])

Citation

If you find our work useful for your research, please cite our paper:

@InProceedings{cho2021gsrtr,
    title={Grounded Situation Recognition with Transformers},
    author={Junhyeong Cho and Youngseok Yoon and Hyeonjun Lee and Suha Kwak},
    booktitle={British Machine Vision Conference (BMVC)},
    year={2021}
}

License

GSRTR is released under the Apache 2.0 license. Please see the LICENSE file for more information.
