Deep-Learning-Image-Captioning - Implementing convolutional and recurrent neural networks in Keras to generate sentence descriptions of images

Last update: Apr 06, 2022

Overview

Deep Learning - Image Captioning with Convolutional and Recurrent Neural Nets

========================================================================

Author: Jonathan Kuo
Python: 3.6.1
TensorFlow: 1.0.1
Keras: 2.0.4

Implementing convolutional and recurrent neural networks in Keras to generate sentence descriptions of images

Introduction

The Keras deep learning architecture of this project was inspired by "Deep Visual-Semantic Alignments for Generating Image Descriptions" by Andrej Karpathy and Fei-Fei Li.

Given a dataset of images and their sentence descriptions, the goal is to define a Keras (TensorFlow backend) deep learning model that aligns detected regions of an image with segments of its description. What the model learns from these alignments allows it to generate novel descriptions for test images.

Dataset

Microsoft Common Objects in Context (MSCOCO) is an image recognition, segmentation, and captioning dataset. The training data includes 123,000 image-caption pairs. The validation and test sets each contain 5,000 image-caption pairs.
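To train a model that predicts the next word of a caption (as described under Architecture), each image-caption pair is commonly expanded into several (partial caption, next word) examples. The sketch below illustrates one way to do this in plain Python. The `<pad>`, `<start>`, and `<end>` tokens, the whitespace tokenizer, the left-padding, and both function names are assumptions made for illustration; they are not taken from this project's notebook.

```python
def build_vocab(captions):
    """Map every word seen in the captions to an integer id (hypothetical helper)."""
    vocab = {"<pad>": 0, "<start>": 1, "<end>": 2}
    for caption in captions:
        for word in caption.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def caption_to_training_pairs(caption, word_to_id, max_len):
    """Expand one caption into (padded partial caption, next word id) pairs."""
    tokens = ["<start>"] + caption.lower().split() + ["<end>"]
    ids = [word_to_id[token] for token in tokens]
    pairs = []
    for i in range(1, len(ids)):
        prefix = ids[:i][-max_len:]                      # keep at most max_len recent words
        padded = [0] * (max_len - len(prefix)) + prefix  # left-pad with the <pad> id
        pairs.append((padded, ids[i]))                   # target is the word that follows
    return pairs


vocab = build_vocab(["a dog runs"])
for partial, target in caption_to_training_pairs("a dog runs", vocab, max_len=4):
    print(partial, "->", target)
```

Each pair would be combined with the features of the corresponding image, so a caption of n words yields n + 1 training examples for that image.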

Architecture

The VGG16 CNN architecture (loaded in Keras) with weights pre-trained on ImageNet is used as the CNN to detect objects in the image. The last dense layer, the 1000-class softmax classifier, is removed so that the 4096-D activations can be passed into the RNN (LSTM). CNN weights are frozen, and RNN weights are updated by backpropagation through time (BPTT). The CNN output and the LSTM output are merged before being passed into a second LSTM that predicts the next word in the sequence. RMSprop is used as the optimizer; its per-parameter scaling of the learning rate is intended to help when gradients become very small (the vanishing gradient problem).
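The description above can be sketched in Keras roughly as follows. This is one reading of the architecture, not the project's actual code. The layer sizes (`embed_dim`, `units`), the use of `RepeatVector` and `Concatenate` for the merge, and both function names are assumptions. It also uses the `tensorflow.keras` namespace rather than the standalone Keras 2.0.4 listed above.

```python
def build_feature_extractor(weights="imagenet"):
    """VGG16 without its final softmax layer; outputs the 4096-D 'fc2' activations."""
    # Imported lazily so this file can be imported without TensorFlow installed.
    from tensorflow.keras import Model
    from tensorflow.keras.applications import VGG16

    base = VGG16(weights=weights, include_top=True)
    extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)
    extractor.trainable = False  # CNN weights stay frozen during training
    return extractor


def build_caption_model(vocab_size, max_len, feature_dim=4096, embed_dim=128, units=256):
    """Merge image features with an LSTM over the partial caption, then predict the next word."""
    from tensorflow.keras import Model, layers

    # Image branch: project the CNN features and repeat them for every time step.
    image_features = layers.Input(shape=(feature_dim,), name="image_features")
    image_branch = layers.Dense(embed_dim, activation="relu")(image_features)
    image_branch = layers.RepeatVector(max_len)(image_branch)

    # Language branch: first LSTM over the words generated so far.
    partial_caption = layers.Input(shape=(max_len,), dtype="int32", name="partial_caption")
    text_branch = layers.Embedding(vocab_size, embed_dim)(partial_caption)
    text_branch = layers.LSTM(units, return_sequences=True)(text_branch)
    text_branch = layers.TimeDistributed(layers.Dense(embed_dim))(text_branch)

    # Merge both branches and pass them to the second LSTM.
    merged = layers.Concatenate()([image_branch, text_branch])
    merged = layers.LSTM(units)(merged)
    next_word = layers.Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[image_features, partial_caption], outputs=next_word)
    model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
    return model
```

In this sketch the image features are computed once per image with `build_feature_extractor()` and fed to the caption model as a plain input, which keeps the frozen CNN out of the training loop. Calling `build_caption_model(vocab_size=..., max_len=...)` returns a compiled model that expects one-hot next-word targets.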

Demo

View the demo IPython notebook for model training and prediction on the MSCOCO dataset.
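For orientation, prediction with a next-word model is typically a word-by-word loop. The sketch below shows greedy decoding in plain Python. It is not taken from the notebook: `greedy_caption` and `toy_predict` are hypothetical, and `predict_next` stands in for a call to the trained model that already has the image features bound to it.

```python
def greedy_caption(predict_next, id_to_word, start_id, end_id, max_len=20):
    """Generate a caption one word at a time, always taking the most probable word."""
    sequence = [start_id]
    for _ in range(max_len):
        probs = predict_next(sequence)  # one probability per vocabulary id
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == end_id:
            break                       # the model signalled the end of the caption
        sequence.append(next_id)
    return " ".join(id_to_word[i] for i in sequence[1:])


# Toy stand-in for a trained model: each word deterministically picks its successor.
id_to_word = {0: "<pad>", 1: "<start>", 2: "<end>", 3: "a", 4: "dog", 5: "runs"}
successor = {1: 3, 3: 4, 4: 5, 5: 2}


def toy_predict(sequence):
    probs = [0.0] * len(id_to_word)
    probs[successor[sequence[-1]]] = 1.0
    return probs


print(greedy_caption(toy_predict, id_to_word, start_id=1, end_id=2))  # prints "a dog runs"
```

Greedy decoding is the simplest choice. Beam search, which keeps several candidate captions at each step, is a common alternative.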
