Simple image captioning model - CLIP prefix captioning.

Last update: Jan 04, 2023


Overview

CLIP prefix captioning.

Inference Notebook:

🥳 New: 🥳 Integrated into Hugging Face Spaces with Gradio. See demo:

🥳 New: 🥳 Run it in the browser using replicate.ai UI

Description

Image captioning is a complicated task that usually relies on a pretrained detection network, which requires additional supervision in the form of object annotations. The features of the detected objects are then fed to an additional network that is trained to output the correct caption. We present a new approach that does not require additional information (i.e. it requires only images and captions) and can therefore be applied to any data. In addition, our model trains much faster than similar methods while achieving close to state-of-the-art results, even on the Conceptual Captions dataset, which contains over 3M images.

In our work, we use the CLIP model, which was already trained on an extremely large number of images and is therefore capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences we fine-tune a pretrained language model, an approach that has proven successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions by employing a simple Multi-Layer Perceptron (MLP) over the raw encoding, and then fine-tune our language model to generate a valid caption.
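
The prefix idea described above can be illustrated with a small shape-level sketch. This is not the repository's implementation: it uses NumPy with random weights instead of a trained model, and every size (CLIP_DIM, EMBED_DIM, PREFIX_LENGTH, HIDDEN_DIM) as well as the function names are assumptions chosen for illustration. It only shows how an MLP can turn one image encoding into several prefix vectors that are placed in front of the caption token embeddings.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the text above).
CLIP_DIM = 512       # size of one CLIP image encoding
EMBED_DIM = 768      # size of one language-model token embedding
PREFIX_LENGTH = 10   # number of prefix positions fed to the language model
HIDDEN_DIM = 256     # hidden width of the mapping MLP

rng = np.random.default_rng(0)


def init_mlp(in_dim, hidden_dim, out_dim):
    """Create randomly initialised weights for a two-layer MLP."""
    return {
        "w1": rng.normal(0.0, 0.02, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def clip_to_prefix(clip_encoding, params):
    """Map (batch, CLIP_DIM) encodings to (batch, PREFIX_LENGTH, EMBED_DIM)."""
    hidden = np.tanh(clip_encoding @ params["w1"] + params["b1"])
    flat = hidden @ params["w2"] + params["b2"]
    return flat.reshape(-1, PREFIX_LENGTH, EMBED_DIM)


def build_lm_input(prefix, caption_embeddings):
    """Place the prefix in front of the caption token embeddings."""
    return np.concatenate([prefix, caption_embeddings], axis=1)


params = init_mlp(CLIP_DIM, HIDDEN_DIM, PREFIX_LENGTH * EMBED_DIM)

# Random stand-ins for a batch of 2 CLIP encodings and 7 caption tokens each.
clip_encoding = rng.normal(size=(2, CLIP_DIM))
caption_embeddings = rng.normal(size=(2, 7, EMBED_DIM))

prefix = clip_to_prefix(clip_encoding, params)
lm_input = build_lm_input(prefix, caption_embeddings)
print(prefix.shape, lm_input.shape)
```

With these sizes the language model would see 17 positions per example: 10 prefix vectors followed by 7 caption token embeddings. In a real setup the MLP weights are learned during training rather than sampled at random.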

COCO Examples


A couple of people standing next to an elephant.	A wooden table sitting in front of a window.	A bunch of bananas sitting on top of a table.


A woman holding a plate with a piece of cake in front of her face.	A wooden table topped with lots of wooden utensils.	A red motorcycle parked on top of a dirt field.

Conceptual Captions Examples


3D render of a man holding a globe.	Students enjoing the cherry blossoms	Green leaf of lettuce on a white plate.


The hotel and casino on the waterfront.	The triangle is a symbol of the soul.	Cartoon boy in the bath.

Inference Notebooks

To help visualize the results we provide a Colab notebook found in notebooks/clip_prefix_captioning_inference.ipynb.
The notebook will download the pretrained models and run inference on sample images or on images of your choosing. It is recommended to run this in Google Colab. Both COCO and Conceptual Captions pretrained models are available.

Inference GUI

Run it in the browser using replicate.ai UI.

COCO training

Clone, create environment and install dependencies:

git clone https://github.com/rmokady/CLIP_prefix_caption && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption

Download train_captions to data/coco/annotations.

Download training images and validation images and unzip (we use the Karpathy et al. split).

Extract CLIP features (the output is data/coco/oscar_split_train.pkl):

python parse_coco.py

Train:

python train.py --data ./data/coco/oscar_split_train.pkl --out_dir ./coco_train/
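
Before starting a long training run, it can help to check what the feature-extraction step actually wrote. The layout of oscar_split_train.pkl is not described here, so the sketch below assumes nothing about it: it loads the file and prints a short structural summary. The helper names summarize and describe_pickle are hypothetical and not part of the repository. If the file stores library-specific objects, that library must be importable when loading, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path


def summarize(obj, depth=0, max_depth=2):
    """Return a short, indented description of a loaded object."""
    indent = "  " * depth
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Tensors and arrays expose .shape; report it instead of the values.
        return f"{indent}{type(obj).__name__} shape={tuple(shape)}"
    if isinstance(obj, dict) and depth < max_depth:
        lines = [f"{indent}dict with {len(obj)} keys"]
        for key, value in obj.items():
            lines.append(f"{indent}  {key!r}:")
            lines.append(summarize(value, depth + 2, max_depth))
        return "\n".join(lines)
    if isinstance(obj, (list, tuple)):
        return f"{indent}{type(obj).__name__} of length {len(obj)}"
    return f"{indent}{type(obj).__name__}"


def describe_pickle(path):
    """Load a pickle file and describe its top-level structure."""
    with Path(path).open("rb") as handle:
        return summarize(pickle.load(handle))


if __name__ == "__main__":
    features = Path("./data/coco/oscar_split_train.pkl")
    if features.exists():
        print(describe_pickle(features))
    else:
        print(f"{features} not found; run parse_coco.py first")
```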

Quantitative results

COCO dataset

Method	B@1	B@2	B@3	B@4	METEOR	ROUGE-L	CIDEr	SPICE
Oscar*	75.59	60.09	46.89	36.58	30.40	58.56	124.12	23.17
Ours	74.12	57.40	43.11	32.15	27.10	55.02	108.35	20.12

* uses additional object annotations for training.

Conceptual Captions dataset

Method	ROUGE-L	CIDEr	SPICE
VLP	24.35	77.57	16.59
Ours	26.71	87.26	18.5

Acknowledgments

This project was created by Ron Mokady and Amir Hertz for the Advanced-NLP course by Omer Levy @ TAU. This repository is heavily based on the CLIP and Hugging Face repositories. For training we used data from the COCO and Conceptual Captions datasets. The project was also inspired by this paper.

Contact

For any inquiry please contact us at our email addresses: [email protected] or [email protected].
