Simple image captioning model - CLIP prefix captioning.

Overview

CLIP prefix captioning.


Inference Notebook:

🥳 New: 🥳 Integrated to Huggingface Spaces with Gradio. See demo: Hugging Face Spaces

🥳 New: 🥳 Run it in the browser using replicate.ai UI

Description

Image captioning is a complicated task, where usually a pretrained detection network is used, requires additional supervision in the form of object annotation. The features of the detected objects are then fed to an additional network that is trained to output the correct caption. We present a new approach that does not requires additional information (i.e. requires only images and captions), thus can be applied to any data. In addition, our model's training time is much faster than similar methods while achieving close to state-of-the-art results, even for the Conceptual Captions dataset contains over 3M images.

In our work, we use the CLIP model, which was already trained over an extremely large number of images, thus is capable of generating semantic encodings for arbitrary images without additional supervision. To produce meaningful sentences we fine-tune a pretrained language model, which has been proven to be successful for other natural language tasks. The key idea is to use the CLIP encoding as a prefix to the textual captions by employing a simple Multi-Layer Perceptron (MLP) over the raw encoding, and then fine-tune our language model to generate a valid caption.

COCO Examples

A couple of people standing next to an elephant. A wooden table sitting in front of a window. A bunch of bananas sitting on top of a table.
A woman holding a plate with a piece of cake in front of her face. A wooden table topped with lots of wooden utensils. A red motorcycle parked on top of a dirt field.

Conceptual Captions Examples

3D render of a man holding a globe. Students enjoing the cherry blossoms Green leaf of lettuce on a white plate.
The hotel and casino on the waterfront. The triangle is a symbol of the soul. Cartoon boy in the bath.

Inference Notebooks

To help visualize the results we provide a Colab notebook found in notebooks/clip_prefix_captioning_inference.ipynb.
The notebook will download the pretrained models and run inference on a sample images or on images of your choosing. It is recommended to run this in Google Colab. Both COCO and Conceptual Captions pretrained models are available.

Inference GUI

Run it in the browser using replicate.ai UI.

COCO training

Clone, create environment and install dependencies:

git clone https://github.com/rmokady/CLIP_prefix_caption && cd CLIP_prefix_caption
conda env create -f environment.yml
conda activate clip_prefix_caption

Download train_captions to data/coco/annotations.

Download training images and validation images and unzip (We use Karpathy et el. split).

Extract CLIP features using (output is data/coco/oscar_split_train.pkl):

python parse_coco.py

Train:

python train.py --data ./data/coco/oscar_split_train.pkl --out_dir ./coco_train/

Qualitative results

COCO dataset

Method [email protected] [email protected] [email protected] [email protected] METEOR ROUGE-L CIDEr SPICE
Oscar* 75.59 60.09 46.89 36.58 30.40 58.56 124.12 23.17
Ours 74.12 57.40 43.11 32.15 27.10 55.02 108.35 20.12

* uses additional object annotations for training.

Conceptual Captions dataset

Method ROUGE-L CIDEr SPICE
VLP 24.35 77.57 16.59
Ours 26.71 87.26 18.5

Acknowledgments

This project was created by Ron Mokady and Amir Hertz for the Advanced-NLP course by Omer Levy @ TAU. This repository is heavily based on CLIP and Hugging-faces repositories. For training we used the data of COCO dataset and Conceptual Captions. The project was also inspired from this paper.

Contact

For any inquiry please contact us at our email addresses: [email protected] or [email protected].

Wordplay, an artificial Intelligence based crossword puzzle solver.

Wordplay, AI based crossword puzzle solver A crossword is a word puzzle that usually takes the form of a square or a rectangular grid of white- and bl

Vaibhaw 4 Nov 16, 2022
ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab

AliceMind AliceMind: ALIbaba's Collection of Encoder-decoders from MinD (Machine IntelligeNce of Damo) Lab This repository provides pre-trained encode

Alibaba 1.4k Jan 01, 2023
(CVPR 2022 Oral) Official implementation for "Surface Representation for Point Clouds"

RepSurf - Surface Representation for Point Clouds [CVPR 2022 Oral] By Haoxi Ran* , Jun Liu, Chengjie Wang ( * : corresponding contact) The pytorch off

Haoxi Ran 264 Dec 23, 2022
code for Grapadora research paper experimentation

Road feature embedding selection method Code for research paper experimentation Abstract Traffic forecasting models rely on data that needs to be sens

Eric López Manibardo 0 May 26, 2022
Data and codes for ACL 2021 paper: Towards Emotional Support Dialog Systems

Emotional-Support-Conversation Copyright © 2021 CoAI Group, Tsinghua University. All rights reserved. Data and codes are for academic research use onl

126 Dec 21, 2022
Benchmark VAE - Library for Variational Autoencoder benchmarking

Documentation pythae This library implements some of the most common (Variational) Autoencoder models. In particular it provides the possibility to pe

1.1k Jan 02, 2023
A Topic Modeling toolbox

Topik A Topic Modeling toolbox. Introduction The aim of topik is to provide a full suite and high-level interface for anyone interested in applying to

Anaconda, Inc. (formerly Continuum Analytics, Inc.) 93 Dec 01, 2022
Denoising Normalizing Flow

Denoising Normalizing Flow Christian Horvat and Jean-Pascal Pfister 2021 We combine Normalizing Flows (NFs) and Denoising Auto Encoder (DAE) by introd

CHrvt 17 Oct 15, 2022
[CVPR21] LightTrack: Finding Lightweight Neural Network for Object Tracking via One-Shot Architecture Search

LightTrack: Finding Lightweight Neural Networks for Object Tracking via One-Shot Architecture Search The official implementation of the paper LightTra

Multimedia Research 290 Dec 24, 2022
A data-driven maritime port simulator

PySeidon - A Data-Driven Maritime Port Simulator 🌊 Extendable and modular software for maritime port simulation. This software uses entity-component

6 Apr 10, 2022
Cross-view Transformers for real-time Map-view Semantic Segmentation (CVPR 2022 Oral)

Cross View Transformers This repository contains the source code and data for our paper: Cross-view Transformers for real-time Map-view Semantic Segme

Brady Zhou 363 Dec 25, 2022
Code for NeurIPS 2020 article "Contrastive learning of global and local features for medical image segmentation with limited annotations"

Contrastive learning of global and local features for medical image segmentation with limited annotations The code is for the article "Contrastive lea

Krishna Chaitanya 152 Dec 22, 2022
《Image2Reverb: Cross-Modal Reverb Impulse Response Synthesis》(2021)

Image2Reverb Image2Reverb is an end-to-end neural network that generates plausible audio impulse responses from single images of acoustic environments

Nikhil Singh 48 Nov 27, 2022
Official repository for Fourier model that can generate periodic signals

Conditional Generation of Periodic Signals with Fourier-Based Decoder Jiyoung Lee, Wonjae Kim, Daehoon Gwak, Edward Choi This repository provides offi

8 May 25, 2022
Deep Sea Treasure Environment for Multi-Objective Optimization Research

DeepSeaTreasure Environment Installation In order to get started with this environment, you can install it using the following command: python3 -m pip

imec IDLab 6 Nov 14, 2022
To propose and implement a multi-class classification approach to disaster assessment from the given data set of post-earthquake satellite imagery.

To propose and implement a multi-class classification approach to disaster assessment from the given data set of post-earthquake satellite imagery.

Kunal Wadhwa 2 Jan 05, 2022
Learning nonlinear operators via DeepONet

DeepONet: Learning nonlinear operators The source code for the paper Learning nonlinear operators via DeepONet based on the universal approximation th

Lu Lu 239 Jan 02, 2023
DGCNN - Dynamic Graph CNN for Learning on Point Clouds

DGCNN is the author's re-implementation of Dynamic Graph CNN, which achieves state-of-the-art performance on point-cloud-related high-level tasks including category classification, semantic segmentat

Wang, Yue 1.3k Dec 26, 2022
RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching

RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching This repository contains the source code for our paper: RAFT-Stereo: Multilevel

Princeton Vision & Learning Lab 328 Jan 09, 2023
Churn prediction

Churn-prediction Churn-prediction Data preprocessing:: Label encoder is used to normalize the categorical variable Data Transformation:: For each data

1 Sep 28, 2022