LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Where are we?

12.27: There is still a gap of about 1% from the original paper, but the results already beat many earlier SOTA methods.

| ckpt__448_epoch_25.pth | mIoU | Overall IoU | [email protected] |
|---|---|---|---|
| Refcoco val | 70.743 | 71.671 | 82.26 |
| Refcoco testA | 73.679 | 74.772 | - |
| Refcoco testB | 67.582 | 67.339 | - |

12.29: The results at epoch 45 improved by roughly another 1%.

| ckpt__448_epoch_45.pth | mIoU | Overall IoU |
|---|---|---|
| Refcoco val | 71.949 | 72.246 |
| Refcoco testA | 74.533 | 75.467 |
| Refcoco testB | 67.849 | 68.123 |
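
The two IoU columns aggregate differently, which is why they can move independently. Under the usual definitions in referring segmentation, mIoU averages the per-sample IoU, while overall IoU divides the total intersection by the total union over the whole split. The sketch below illustrates this under those assumed definitions. It is not the repository's evaluation code; the function names and the flat 0/1 mask representation are hypothetical.

```python
def intersection_and_union(pred, target):
    """Count intersecting and union pixels of two flat binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def mean_and_overall_iou(pairs):
    """pairs: list of (pred_mask, target_mask), each a flat 0/1 sequence."""
    per_sample = []
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumed convention: an empty union counts as IoU 0.
        per_sample.append(inter / union if union else 0.0)
        total_inter += inter
        total_union += union
    mean_iou = sum(per_sample) / len(per_sample)
    overall_iou = total_inter / total_union if total_union else 0.0
    return mean_iou, overall_iou


pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # small object, IoU = 1/2
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # large object, IoU = 4/4
]
mean_iou, overall_iou = mean_and_overall_iou(pairs)
```

Because overall IoU pools pixels across samples, large objects weigh more in it than in mIoU.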

The pretrained model will be released soon.

A reproduction of the original paper

Paper: https://arxiv.org/abs/2112.02244

Official implementation: https://github.com/yz93/LAVT-RIS

Architecture

Features

  • Moves the fusion of features from different modalities forward into the Image Encoder stage

  • Borrows many ideas from these two papers

    • Vision-Language Transformer and Query Generation for Referring Segmentation

    • Locate then Segment: A Strong Pipeline for Referring Image Segmentation

  • Uses a relatively new backbone, Swin-Transformer

Usage

See args.py for the detailed argument settings.

for training

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12345 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448

for evaluation

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node 4 --master_port 23458 main.py --size 448 --batch_size 1 --resume --eval --type val --eval_mode cat --pretrain ckpt_448_epoch_20.pth --cfg_file configs/swin_base_patch4_window7_224.yaml

Place all *.pth files in ./checkpoint

for resuming from a checkpoint

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node 4 --master_port 12346 main.py --batch_size 2 --cfg_file configs/swin_base_patch4_window7_224.yaml --size 448 --resume --pretrain ckpt_448_epoch_10.pth

for dataset preparation

See ./data/readme.md for details.

To be finished

The official code had not been released when I did this reproduction, so some details may differ from the official code.

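  • For Swin Transformer I chose swin_base_patch4_window12_384_22k.pth; for details, see the official code at https://github.com/microsoft/Swin-Transformer/blob/main/get_started.md. The original paper resizes images to 480*480, but trying to reach that size with the official code always raised errors. After some searching, the Swin Transformer code for object detection looked like a better fit.

    12.27: This problem has now been largely resolved. Training currently uses swin_base_patch4_window7_224_22k.pth, with input images resized to 448*448.

    For the solution, see: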

    https://github.com/microsoft/Swin-Transformer/issues/155
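
One likely cause of the size errors is that the classification Swin code needs every stage's feature map to split evenly into windows. The check below is a sketch under stated assumptions: patch size 4, four stages that each halve the resolution, no padding, and a stage no larger than the window being treated as a single window. The helper names are hypothetical.

```python
def stage_resolutions(img_size, patch_size=4, num_stages=4):
    """Feature-map side length at each stage (halved by patch merging)."""
    side = img_size // patch_size
    return [side // (2 ** i) for i in range(num_stages)]


def incompatible_stages(img_size, window_size, patch_size=4, num_stages=4):
    """Return the stage resolutions that the window cannot tile evenly.

    Assumes no padding. A stage no larger than the window is treated as
    one window and is therefore always accepted.
    """
    bad = []
    for side in stage_resolutions(img_size, patch_size, num_stages):
        if side > window_size and side % window_size != 0:
            bad.append(side)
    return bad


# (image size, window size) pairs mentioned in this section.
report = {
    (img_size, window): incompatible_stages(img_size, window)
    for img_size, window in [(384, 12), (480, 12), (480, 7), (448, 7)]
}
```

Under these assumptions, 448 with window 7 gives stage sizes 112, 56, 28 and 14, all multiples of 7, while 480 fails for both window 7 and window 12.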

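  • The original paper uses polynomial learning rate decay as its lr_scheduler but does not give the specific parameters, so I set them by hand.

    12.21: So far my own settings do not seem to work very well.

    12.27: Adjusted the settings. The initial learning rate really matters, especially scaling your initial learning rate according to batch_size.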
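
As an illustration of the two ideas in this item, here is a minimal sketch of polynomial decay combined with linear scaling of the initial rate by total batch size. Neither the paper's parameters nor this repository's values are given above, so every number below (the reference rate 5e-5, the reference batch size 32, the power 0.9) is a hypothetical placeholder, and the function names are made up for the example.

```python
def scaled_base_lr(reference_lr, reference_batch_size, per_gpu_batch_size, num_gpus):
    """Linear scaling rule: the rate grows in proportion to total batch size."""
    total_batch_size = per_gpu_batch_size * num_gpus
    return reference_lr * total_batch_size / reference_batch_size


def poly_lr(base_lr, step, total_steps, power=0.9, end_lr=0.0):
    """Polynomial decay from base_lr at step 0 to end_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - progress) ** power + end_lr


# Hypothetical reference: a rate tuned for batch size 32, rescaled for
# 4 GPUs with batch size 2 each, as in the training command above.
base_lr = scaled_base_lr(5e-5, 32, per_gpu_batch_size=2, num_gpus=4)
schedule = [poly_lr(base_lr, step, total_steps=100) for step in (0, 50, 100)]
```

In PyTorch, the factor `(1 - step / total_steps) ** power` could be passed to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's initial rate by the returned value.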

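  • The original paper uses batch_size=32. Based on my own experiments, my guess is that it used 8 GPUs with batch_size=4 on each. This is my first time writing DDP code, and during training I found that the program always allocates something like shared GPU memory on RANK0 for the other ranks, which keeps me from matching the paper's configuration. This needs improvement.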

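  • A close look at the Refcoco dataset shows that one target corresponds to several sentences. During training I pick one sentence at random; at evaluation it seems better to use all of them. I came up with two evaluation methods for this.

    Currently eval only supports batch_size=1.

    • Concatenate all sentences into a single sentence and feed it to BERT. The input takes the form (Image, cat(sent_1, sent_2, sent_3)) => model => pred

    In experiments, mean IoU is considerably better under this eval_mode, and overall IoU is also slightly better.

    • Process the same image once per sentence, then average the results. The input takes the form ((Image, sent_1), (Image, sent_2), (Image, sent_3)) => model => average(pred_1, pred_2, pred_3)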
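
The two evaluation modes can be sketched as follows. This is an illustration only: `toy_model` is a hypothetical stand-in for the network, predictions are flat lists of per-pixel scores, and joining sentences with a single space is an assumption about how the concatenation is done.

```python
def eval_cat(model, image, sentences):
    """'cat' mode: join all sentences into one query and predict once."""
    return model(image, " ".join(sentences))


def eval_average(model, image, sentences):
    """Averaging mode: predict once per sentence, then average per pixel."""
    preds = [model(image, sentence) for sentence in sentences]
    return [sum(pixel_values) / len(preds) for pixel_values in zip(*preds)]


def toy_model(image, sentence):
    """Hypothetical stand-in: scores scale with sentence length, capped at 4 words."""
    weight = min(len(sentence.split()), 4) / 4
    return [pixel * weight for pixel in image]


image = [1.0, 0.0, 1.0]
sentences = ["right girl", "closest girl on right"]
pred_cat = eval_cat(toy_model, image, sentences)
pred_avg = eval_average(toy_model, image, sentences)
```

The cat mode costs one forward pass per target, while averaging costs one pass per sentence.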

Visualization

See inference.ipynb for details.

input sentences

  1. right girl
  2. closest girl on right

results

Failure cases study

AnalysisFailure.ipynb provides a way to study where the model does not work. It mainly filters the cases with IoU < 0.5, and among them focuses on examples with IoU < 0.1 and with 0.4 < IoU < 0.5.
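
The filtering described above can be sketched as a simple bucketing of per-sample IoU values. This is not the notebook's code; the function name and bucket labels are hypothetical, and only the thresholds come from the text.

```python
def bucket_failures(ious, fail_threshold=0.5):
    """Group indices of samples with IoU < fail_threshold for inspection."""
    buckets = {"severe": [], "near_miss": [], "other": []}
    for index, iou in enumerate(ious):
        if iou >= fail_threshold:
            continue  # not a failure case
        if iou < 0.1:
            buckets["severe"].append(index)      # IoU < 0.1
        elif iou > 0.4:
            buckets["near_miss"].append(index)   # 0.4 < IoU < 0.5
        else:
            buckets["other"].append(index)       # 0.1 <= IoU <= 0.4
    return buckets


ious = [0.05, 0.45, 0.30, 0.80, 0.0, 0.49]
buckets = bucket_failures(ious)
```

Looking at the two extremes separately helps tell complete misses apart from predictions that only just fall below the threshold.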

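I have only looked at a limited number of failure cases so far, and summarize them as follows:

  • Under language guidance, the model localizes similar, densely packed objects imprecisely
  • The model does not distinguish the primary from the secondary parts of the language expression
  • Some annotation issues in Refcoco itself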

Owner

zichengsaber