Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Last update: Dec 30, 2022


Overview

DeCLIP

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.

Our paper is available on arXiv.

Updates

**Our code, dataset, and models will be released soon.**

Introduction

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, which restricts its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, DeCLIP can learn generic visual features more efficiently. Instead of using only the single image-text contrastive supervision, we fully exploit the potential of the data through (1) self-supervision within each modality; (2) multi-view supervision across modalities; and (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision signals, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework.
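To make the baseline "single image-text contrastive supervision" concrete, here is a minimal sketch of a symmetric InfoNCE loss in plain Python, plus one possible reading of multi-view supervision as the same loss averaged over every image-view/text-view combination. This is not the released DeCLIP code: the function names (`info_nce`, `multi_view_loss`), the temperature of 0.07, and the averaging scheme are illustrative assumptions, and the self-supervision and nearest-neighbor terms are omitted.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss over one batch.

    image_embs[i] and text_embs[i] are assumed to form a matched pair.
    The temperature value is an illustrative assumption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = similarity of image i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits_t))


def multi_view_loss(image_views, text_views, temperature=0.07):
    """Hypothetical multi-view variant: average InfoNCE over all view pairs.

    image_views and text_views are lists of batches, one batch per view.
    """
    losses = [info_nce(iv, tv, temperature)
              for iv in image_views for tv in text_views]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[0.9, 0.1], [0.2, 0.8]]
    print("aligned loss:", info_nce(images, texts))
    print("shuffled loss:", info_nce(images, texts[::-1]))
```

With matched pairs on the diagonal, the loss is small; shuffling the texts breaks the pairing and raises it. Under this sketch, each additional view of an image or a text adds extra contrastive terms from the same underlying pair, which is the sense in which more supervision is extracted from the same data.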

Model

Our pretrained visual backbone models (without the text encoder):

DeCLIP_r50: Google Drive
DeCLIP_vitb32: Google Drive

Citing DeCLIP

@misc{li2021supervision,
      title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm}, 
      author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
      year={2021},
      eprint={2110.05208},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}


Owner

Sense-GVT
