DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Last update: Dec 27, 2022

Created by Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu.

This repository contains the PyTorch implementation of DenseCLIP.

DenseCLIP is a framework for dense prediction that leverages the pre-trained knowledge in CLIP both implicitly and explicitly. Specifically, we convert the original image-text matching problem in CLIP into a pixel-text matching problem and use the resulting pixel-text score maps to guide the learning of dense prediction models. By further using contextual information from the image to prompt the language model, we help the model better exploit the pre-trained knowledge. Our method is model-agnostic: it can be applied to arbitrary dense prediction systems and to various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models.
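The pixel-text matching idea can be illustrated with a small sketch. This is not the repository's code: it is a pure-Python toy that assumes each pixel and each class name already has an embedding of the same dimension, and the function names (l2_normalize, pixel_text_score_map) are hypothetical. It scores every pixel against every class by cosine similarity, which yields one score map per class.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities.

    Assumes the vector is non-zero.
    """
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def pixel_text_score_map(pixel_embeddings, text_embeddings):
    """Return scores[k][y][x], the similarity of class k's text to pixel (y, x).

    pixel_embeddings: H x W grid of C-dimensional vectors (nested lists).
    text_embeddings:  K class vectors, each of dimension C.
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    scores = []
    for text in texts:
        class_map = []
        for row in pixel_embeddings:
            class_map.append(
                [sum(p * t for p, t in zip(l2_normalize(pixel), text)) for pixel in row]
            )
        scores.append(class_map)
    return scores


# Toy input: a 2 x 2 feature map with 2-dimensional embeddings and 2 classes.
pixels = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [1.0, 2.0]],
]
texts = [[1.0, 0.0], [0.0, 1.0]]

scores = pixel_text_score_map(pixels, texts)

# Taking the best-scoring class per pixel gives a coarse label map.
labels = [
    [max(range(len(texts)), key=lambda k: scores[k][y][x]) for x in range(2)]
    for y in range(2)
]
print(labels)
```

In a tensor library the nested loops would typically collapse into a single batched product over the channel dimension; the loops are kept here only to make the per-pixel comparison explicit.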

Our code is based on mmsegmentation, mmdetection, and timm.

[Project Page] [arXiv]

Usage

Requirements

torch>=1.8.0
torchvision
timm
mmcv-full==1.3.17
mmseg==0.19.0
mmdet==2.17.0
fvcore

To use our code, first install mmcv-full and mmseg/mmdet following the official guidelines (mmseg, mmdet), then prepare the datasets accordingly.
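Because three of the requirements are pinned to exact versions, it can help to check the environment before training. The sketch below is one possible check using only the standard library. The helper name find_mismatches is hypothetical, and the dictionary keys are assumed to be the distribution names pip uses (mmseg, for instance, is assumed to be installed as "mmsegmentation"); adjust them if your environment differs.

```python
from importlib import metadata

# Pins copied from the requirements list above. The keys are ASSUMED to be the
# distribution names known to pip, which may differ from the import names.
PINNED = {
    "mmcv-full": "1.3.17",
    "mmsegmentation": "0.19.0",
    "mmdet": "2.17.0",
}


def find_mismatches(pinned, lookup=metadata.version):
    """Return {name: (expected, found)} for packages that are missing or differ.

    `lookup` maps a distribution name to its installed version string and
    raises metadata.PackageNotFoundError when the package is absent.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every pinned package is present at the expected version.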

Pre-trained CLIP Models

Download the pre-trained CLIP models (RN50.pt, RN101.pt, VIT-B-16.pt) and save them to the pretrained folder.
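A quick way to confirm the download step is to look for the three files in the pretrained folder. The following is a minimal sketch; the helper name missing_checkpoints is hypothetical, and the folder is assumed to be relative to the directory you run the script from.

```python
from pathlib import Path

# File names as listed above.
CLIP_CHECKPOINTS = ("RN50.pt", "RN101.pt", "VIT-B-16.pt")


def missing_checkpoints(folder="pretrained", names=CLIP_CHECKPOINTS):
    """Return the checkpoint file names that are not present in `folder`."""
    root = Path(folder)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing CLIP checkpoints:", ", ".join(missing))
    else:
        print("All CLIP checkpoints found.")
```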

Segmentation

Model Zoo

We provide DenseCLIP models for the Semantic FPN framework. In the table, mIoU(SS) and mIoU(MS) denote single-scale and multi-scale testing, respectively.

Model	FLOPs (G)	Params (M)	mIoU(SS)	mIoU(MS)	config	url
RN50-CLIP	248.8	31.0	36.9	43.5	config	-
RN50-DenseCLIP	269.2	50.3	43.5	44.7	config	Tsinghua Cloud
RN101-CLIP	326.6	50.0	42.7	44.3	config	-
RN101-DenseCLIP	346.3	67.8	45.1	46.5	config	Tsinghua Cloud
ViT-B-CLIP	1037.4	100.8	49.4	50.3	config	-
ViT-B-DenseCLIP	1043.1	105.3	50.6	51.3	config	Tsinghua Cloud

Training & Evaluation on ADE20K

To train the DenseCLIP model based on CLIP ResNet-50, run:

bash dist_train.sh configs/denseclip_fpn_res50_512x512_80k.py 8

To evaluate the performance with multi-scale testing, run:

bash dist_test.sh configs/denseclip_fpn_res50_512x512_80k.py /path/to/checkpoint 8 --eval mIoU --aug-test

To better measure the complexity of the models, we provide a tool based on fvcore to accurately compute the FLOPs of torch.einsum and other operations:

python get_flops.py /path/to/config --fvcore

You can also remove the --fvcore flag to obtain the FLOPs measured by mmcv for comparison.

Detection

Model Zoo

We provide models for both the RetinaNet and Mask R-CNN frameworks.

RetinaNet

Model	FLOPs (G)	Params (M)	box AP	config	url
RN50-CLIP	265	38	36.9	config	-
RN50-DenseCLIP	285	60	37.8	config	Tsinghua Cloud
RN101-CLIP	341	57	40.5	config	-
RN101-DenseCLIP	360	78	41.1	config	Tsinghua Cloud

Mask R-CNN

Model	FLOPs (G)	Params (M)	box AP	mask AP	config	url
RN50-CLIP	301	44	39.3	36.8	config	-
RN50-DenseCLIP	327	67	40.2	37.6	config	Tsinghua Cloud
RN101-CLIP	377	63	42.2	38.9	config	-
RN101-DenseCLIP	399	84	42.6	39.6	config	Tsinghua Cloud
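To read the trade-off off the Mask R-CNN table, the short sketch below computes the relative FLOPs overhead and the AP gain of each DenseCLIP model over its CLIP baseline. The numbers are copied from the table above; the dictionary layout and the compare helper are illustrative only.

```python
# (FLOPs in G, params in M, box AP, mask AP), copied from the Mask R-CNN table.
MASK_RCNN = {
    "RN50": {"CLIP": (301, 44, 39.3, 36.8), "DenseCLIP": (327, 67, 40.2, 37.6)},
    "RN101": {"CLIP": (377, 63, 42.2, 38.9), "DenseCLIP": (399, 84, 42.6, 39.6)},
}


def compare(rows):
    """Summarise DenseCLIP against its CLIP baseline for one backbone."""
    base_flops, _, base_box, base_mask = rows["CLIP"]
    flops, _, box, mask = rows["DenseCLIP"]
    return {
        "flops_overhead_pct": round(100 * (flops - base_flops) / base_flops, 1),
        "box_ap_gain": round(box - base_box, 1),
        "mask_ap_gain": round(mask - base_mask, 1),
    }


for backbone, rows in MASK_RCNN.items():
    print(backbone, compare(rows))
```

On these figures, the RN50 model spends about 8.6% more FLOPs for +0.9 box AP and +0.8 mask AP, and the RN101 model about 5.8% more FLOPs for +0.4 box AP and +0.7 mask AP.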

Training & Evaluation on COCO

To train our DenseCLIP-RN50 using the RetinaNet framework, run:

bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 8

To evaluate the box AP of RN50-DenseCLIP (RetinaNet), run:

bash dist_test.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py /path/to/checkpoint 8 --eval bbox

To evaluate both the box AP and the mask AP of RN50-DenseCLIP (Mask R-CNN), run:

bash dist_test.sh configs/mask_rcnn_denseclip_r50_fpn_1x_coco.py /path/to/checkpoint 8 --eval bbox segm
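When evaluating several checkpoints, it may be convenient to assemble these commands programmatically. The helper below is a hypothetical sketch, not part of the repository: it only reproduces the argument order and the flags (--eval, --aug-test) that appear in the evaluation commands in this README, and it builds the command without running it.

```python
import shlex


def build_test_command(config, checkpoint, gpus, metrics, multi_scale=False):
    """Build the argument list for dist_test.sh as used in this README.

    metrics:     names passed after --eval, e.g. ["bbox", "segm"] or ["mIoU"].
    multi_scale: append --aug-test, as in the ADE20K multi-scale example.
    """
    cmd = ["bash", "dist_test.sh", config, checkpoint, str(gpus), "--eval", *metrics]
    if multi_scale:
        cmd.append("--aug-test")
    return cmd


cmd = build_test_command(
    "configs/mask_rcnn_denseclip_r50_fpn_1x_coco.py",
    "/path/to/checkpoint",
    8,
    ["bbox", "segm"],
)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run from the directory that contains dist_test.sh.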

License

MIT License

Citation

If you find our work useful in your research, please consider citing:

@article{rao2021denseclip,
  title={DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting},
  author={Rao, Yongming and Zhao, Wenliang and Chen, Guangyi and Tang, Yansong and Zhu, Zheng and Huang, Guan and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2112.01518},
  year={2021}
}
