LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Overview

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Where we are ?

12.27 目前和原论文仍有1%左右得差距,但已经力压很多SOTA了

ckpt__448_epoch_25.pth mIoU Overall IoU [email protected]
Refcoco val 70.743 71.671 82.26
Refcoco testA 73.679 74.772 -
Refcoco testB 67.582 67.339 -

12.29 45epoch的结果又上升了大约1%

ckpt__448_epoch_45.pth mIoU Overall IoU
Refcoco val 71.949 72.246
Refcoco testA 74.533 75.467
Refcoco testB 67.849 68.123

the pretrain model will be released soon

对原论文的复现

论文链接: https://arxiv.org/abs/2112.02244

官方实现: https://github.com/yz93/LAVT-RIS

Architecture

Features

  • 将不同模态feature的fusion提前到Image Encoder阶段

  • 思路上对这两篇论文有很多借鉴

    • Vision-Language Transformer and Query Generation for Referring Segmentation

    • Locate then Segment: A Strong Pipeline for Referring Image Segmentation

  • 采用了比较新的主干网络 Swin-Transformer

Usage

详细参数设置可以见args.py

for training

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448

for evaluation

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node 4 --master_port 23458 main.py --size 448 --batch_size 1 --resume --eval --type val --eval_mode cat --pretrain ckpt_448_epoch_20.pth --cfg_file configs/swin_base_patch4_window7_224.yaml

*.pth 都放在./checkpoint

for resume from checkpoint

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12346 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448 --resume --pretrain ckpt_448_epoch_10.pth

for dataset preparation

please get details from ./data/readme.md

Need to be finished

由于我在复现的时候,官方的code还没有出来,所以一些细节上的设置可能和官方code不同

  • Swin Transformer 我选择的是 swin_base_patch4_window12_384_22k.pth,具体代码可以参考官方代码 https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md 原论文中的图像resize的尺寸是480*480,可是我目前基于官方的代码若想调到这个尺寸,总是会报错,查了一下觉得可能用object detection 的swin transformer的code比较好

    12.27 这个问题目前也已经得到了较好的解决,目前训练用的是 swin_base_patch4_window7_224_22k.pth, 输入图片的尺寸调整到448*448

    解决方案可以参考:

    https://github.com/microsoft/Swin-Transformer/issues/155

  • 原论文中使用的lr_scheduler是polynomial learning rate decay, 没有给出具体的参数手动设置了一下

    12.21 目前来看感觉自己设置的不是很好

    12.27 调整了一下设置,初始学习率的设置真的很重要,特别是根据batch_size 去scale你的 inital learning rate

  • 原论文中的batch_size=32,基于自己的实验我猜想应该是用了8块GPU,每一块的batch_size=4, 由于我第一次写DDP code,训练时发现,程序总是会在RANK0上给其余RANK开辟类似共享显存的东西,导致我无法做到原论文相同的配置,需要改进

  • 仔细观察Refcoco的数据集,会发现一个target会对应好几个sentence,training时我设计的是随机选一个句子,evaluate时感觉应该要把所有句子用上会更好,关于这一点我想了两种evaluate的方法

    目前eval 只能支持 batch_size=1

    • 将所有句子concatenate成为一个句子,送入BERT,Input 形式上就是(Image,cat(sent_1,sent_2,sent_3)) => model => pred

    实验发现这种eval_mode 下的mean IOU 会好不少, overall_IOU 也会好一点

    • 对同一张图片处理多次处理,然后将结果进行平均,Input 形式上就是 ((Image,sent_1),(Image,sent_2),(Image,sent_3)) => model => average(pred_1,pred_2,pred_3)

Visualization

详细见inference.ipynb

input sentences

  1. right girl
  2. closest girl on right

results

Failure cases study

AnalysisFailure.ipynb 提供了一个研究model不work的途径,主要是筛选了IoU < 0.5的case,并在这些case中着重查看了一下IoU < 0.10.4 < IoU < 0.5 的例子

目前我只看了一些有限的failure cases,做了如下总结

  • 模型对于similar,dense object在language guide下定位不精确
  • 模型对于language的理解不分主次
  • refcoco本身标记的一些问题
Owner
zichengsaber
CVer
zichengsaber
[WWW 2022] Zero-Shot Stance Detection via Contrastive Learning

PT-HCL for Zero-Shot Stance Detection The code of this repository is constantly being updated... Please look forward to it! Introduction This reposito

Akuchi 12 Dec 21, 2022
NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.

880 Jan 07, 2023
The easiest tool for extracting radiomics features and training ML models on them.

Simple pipeline for experimenting with radiomics features Installation git clone https://github.com/piotrekwoznicki/ClassyRadiomics.git cd classrad pi

Piotr Woźnicki 17 Aug 04, 2022
unofficial pytorch implementation of RefineGAN

RefineGAN unofficial pytorch implementation of RefineGAN (https://arxiv.org/abs/1709.00753) for CSMRI reconstruction, the official code using tensorpa

xinby17 5 Jul 21, 2022
DenseNet Implementation in Keras with ImageNet Pretrained Models

DenseNet-Keras with ImageNet Pretrained Models This is an Keras implementation of DenseNet with ImageNet pretrained weights. The weights are converted

Felix Yu 568 Oct 31, 2022
Labelbox is the fastest way to annotate data to build and ship artificial intelligence applications

Labelbox Labelbox is the fastest way to annotate data to build and ship artificial intelligence applications. Use this github repository to help you s

labelbox 1.7k Dec 29, 2022
Combinatorially Hard Games where the levels are procedurally generated

puzzlegen Implementation of two procedurally simulated environments with gym interfaces. IceSlider: the agent needs to reach and stop on the pink squa

Autonomous Learning Group 3 Jun 26, 2022
Code, Models and Datasets for OpenViDial Dataset

OpenViDial This repo contains downloading instructions for the OpenViDial dataset in 《OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Vis

119 Dec 08, 2022
[ICCV 2021] A Simple Baseline for Semi-supervised Semantic Segmentation with Strong Data Augmentation

[ICCV 2021] A Simple Baseline for Semi-supervised Semantic Segmentation with Strong Data Augmentation

CodingMan 45 Dec 12, 2022
Think Big, Teach Small: Do Language Models Distil Occam’s Razor?

Think Big, Teach Small: Do Language Models Distil Occam’s Razor? Software related to the paper "Think Big, Teach Small: Do Language Models Distil Occa

0 Dec 07, 2021
Novel Instances Mining with Pseudo-Margin Evaluation for Few-Shot Object Detection

Novel Instances Mining with Pseudo-Margin Evaluation for Few-Shot Object Detection (NimPme) The official implementation of Novel Instances Mining with

12 Sep 08, 2022
The BCNet related data and inference model.

BCNet This repository includes the some source code and related dataset of paper BCNet: Learning Body and Cloth Shape from A Single Image, ECCV 2020,

81 Dec 12, 2022
Where2Act: From Pixels to Actions for Articulated 3D Objects

Where2Act: From Pixels to Actions for Articulated 3D Objects The Proposed Where2Act Task. Given as input an articulated 3D object, we learn to propose

Kaichun Mo 69 Nov 28, 2022
A modular application for performing anomaly detection in networks

Deep-Learning-Models-for-Network-Annomaly-Detection The modular app consists for mainly three annomaly detection algorithms. The system supports model

Shivam Patel 1 Dec 09, 2021
Rax is a Learning-to-Rank library written in JAX

🦖 Rax: Composable Learning to Rank using JAX Rax is a Learning-to-Rank library written in JAX. Rax provides off-the-shelf implementations of ranking

Google 247 Dec 27, 2022
DIT is a DTLS MitM proxy implemented in Python 3. It can intercept, manipulate and suppress datagrams between two DTLS endpoints and supports psk-based and certificate-based authentication schemes (RSA + ECC).

DIT - DTLS Interception Tool DIT is a MitM proxy tool to intercept DTLS traffic. It can intercept, manipulate and/or suppress DTLS datagrams between t

52 Nov 30, 2022
Unsupervised Image to Image Translation with Generative Adversarial Networks

Unsupervised Image to Image Translation with Generative Adversarial Networks Paper: Unsupervised Image to Image Translation with Generative Adversaria

Hao 71 Oct 30, 2022
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technic[1]* as handy library in Python. Feature weighted linear stacking is also availab

114 Jul 19, 2022
Implementation of the paper titled "Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees"

Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees Implementation of the paper titled "Using Sampling to

MIDAS, IIIT Delhi 2 Aug 29, 2022
An ML & Correlation platform for transforming disparate data points of interest into usable intelligence.

SSIDprobeCollector An ML & Correlation platform for transforming disparate data points of interest into usable intelligence. At a High level the platf

Bill Reyor 1 Jan 30, 2022