SEJE PyTorch implementation

Last update: Oct 21, 2021

SEJE is a prototype for the paper Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering.

Introduction
Installation
Recipe1M Dataset
Vision models
Out-of-the-box training
Training
Testing
Contact

Introduction

Overview: SEJE is a two-phase deep feature engineering framework for efficient learning of semantics-enhanced joint embedding. It clearly separates the deep feature engineering done in data preprocessing from the training of the text-image joint embedding model. We use the Recipe1M dataset for the technical description and empirical validation.

In preprocessing, we perform deep feature engineering by combining deep features with semantic context features derived from the raw text-image input data. We use an LSTM to identify key terms; deep NLP models from the BERT family, TextRank, or TF-IDF to produce ranking scores for those key terms; and word2vec to generate the vector representation of each key term. We use wideResNet50 and word2vec to extract and encode the image category semantics of food images, which helps the semantic alignment of the learned recipe and image embeddings in the joint latent space.

In joint embedding learning, we perform deep feature engineering by optimizing the batch-hard triplet loss function with soft margin and double negative sampling, while also taking into account the category-based and discriminator-based alignment losses. Extensive experiments demonstrate that our SEJE approach with deep feature engineering significantly outperforms the state-of-the-art approaches.
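
To make the loss named in the overview more concrete, here is a minimal sketch of a batch-hard triplet loss with soft margin, written in plain Python for readability. It is an illustration under stated assumptions, not the code in train.py: the function names are hypothetical, each anchor is assumed to have exactly one positive (its paired item, on the diagonal of the distance matrix), and the double negative sampling and the alignment losses are left out.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); this is the "soft margin".
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def batch_hard_soft_margin_loss(dist):
    """Average soft-margin triplet loss over a batch.

    dist[i][j] is the distance between anchor i (e.g. an image embedding)
    and candidate j (e.g. a recipe embedding). Assumption: the pair (i, i)
    is the only positive for anchor i.
    """
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two pairs to pick a negative")
    total = 0.0
    for i in range(n):
        d_pos = dist[i][i]
        # Hardest negative: the closest non-matching candidate in the batch.
        d_neg = min(dist[i][j] for j in range(n) if j != i)
        total += softplus(d_pos - d_neg)
    return total / n
```

The loss shrinks as the matching pair moves closer than the hardest negative, and it never reaches exactly zero, which is the usual motivation for a soft margin over a fixed one.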

SEJE Architecture

SEJE Phase I Architecture and Examples

SEJE Phase II Architecture

SEJE Joint Embedding Optimization with instance-class double hard sampling strategy

SEJE Joint Embedding Optimization with discriminator based alignment loss regularization

SEJE Experimental Evaluation Highlights

Installation

We use an environment with Python 3.7.6 and PyTorch 1.4.0. Install the dependencies with:

pip install --upgrade cython
pip install -r requirements.txt

Our work is an extension of im2recipe.

Recipe1M Dataset

The Recipe1M dataset is available for download here, where you can also find some of the code used to construct the dataset and obtain the structured recipe text, food images, pre-trained instruction features and so on.

Vision models

The current version of the code uses a pre-trained ResNet-50.

Out-of-the-box training

To train the model, you will need to create the following files:

data/train_lmdb: LMDB (training) containing skip-instructions vectors, ingredient ids and categories.
data/train_keys: pickle (training) file containing skip-instructions vectors, ingredient ids and categories.
data/val_lmdb: LMDB (validation) containing skip-instructions vectors, ingredient ids and categories.
data/val_keys: pickle (validation) file containing skip-instructions vectors, ingredient ids and categories.
data/test_lmdb: LMDB (testing) containing skip-instructions vectors, ingredient ids and categories.
data/test_keys: pickle (testing) file containing skip-instructions vectors, ingredient ids and categories.
data/text/vocab.txt: file containing all the vocabulary found within the recipes.

The Recipe1M LMDBs and pickle files can be found in train.tar, val.tar and test.tar, available here.
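
Before starting a training run, it can help to confirm that the files listed above are in place. The following is a small sketch, not part of the repository: the function name missing_data_files and the default data directory are assumptions, while the relative paths are the ones listed above.

```python
import os

# Paths relative to the data directory, as listed in this README.
REQUIRED_FILES = [
    "train_lmdb",
    "train_keys",
    "val_lmdb",
    "val_keys",
    "test_lmdb",
    "test_keys",
    "text/vocab.txt",
]


def missing_data_files(data_dir="data"):
    """Return the required entries that do not exist under data_dir.

    os.path.exists is used rather than os.path.isfile because an LMDB
    may be stored as a directory.
    """
    return [
        name
        for name in REQUIRED_FILES
        if not os.path.exists(os.path.join(data_dir, *name.split("/")))
    ]
```

Calling missing_data_files() from the repository root returns an empty list when everything is present, and otherwise the names that still need to be created or extracted.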

Note that the code expects images to be stored in a four-level folder structure: for example, the image named 0fa8309c13.jpg is found at ./data/images/0/f/a/8/0fa8309c13.jpg. Each of the tar files contains the first folder level, 16 in total.
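
The folder layout described above can be expressed as a short helper. This is a sketch based only on the example path in this README: the function name image_path and its parameters are hypothetical, and the data loader in the repository may build the path differently.

```python
import posixpath


def image_path(image_name, root="./data/images", levels=4):
    """Build the nested path for an image.

    The first `levels` characters of the file name each become one
    folder level, e.g. 0fa8309c13.jpg -> 0/f/a/8/0fa8309c13.jpg.
    """
    stem = posixpath.splitext(image_name)[0]
    if len(stem) < levels:
        raise ValueError("image name is shorter than the folder depth")
    return posixpath.join(root, *stem[:levels], image_name)


# Example from this README:
# image_path("0fa8309c13.jpg") -> "./data/images/0/f/a/8/0fa8309c13.jpg"
```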

The pre-trained TF-IDF vectors for each recipe, the image category feature for each image and the optimized category label for each image-recipe pair can be found in id2tfidf_vec.pkl, id2img_101_cls_vec.pkl and id2class_1005.pkl respectively.

Word2Vec

Training word2vec with recipe data:

Download and compile word2vec
Train with:

./word2vec -hs 1 -negative 0 -window 10 -cbow 0 -iter 10 -size 300 -binary 1 -min-count 10 -threads 20 -train tokenized_text.txt -output vocab.bin

The pre-trained word2vec model can be found in vocab.bin.
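
To inspect vocab.bin without extra dependencies, the binary format written by the word2vec tool with -binary 1 can be read directly. The reader below is a sketch under assumptions: the function name read_word2vec_bin is hypothetical, the floats are assumed to be little-endian 32-bit values (as produced on typical x86 machines), and words are assumed to be UTF-8 encoded.

```python
import struct


def read_word2vec_bin(path, limit=None):
    """Return {word: [float, ...]} from a word2vec binary file.

    Layout: a text header "<vocab_size> <dim>\n", then for each word
    the word bytes, a space, dim float32 values, and a newline.
    """
    vectors = {}
    with open(path, "rb") as f:
        header = f.readline().split()
        vocab_size, dim = int(header[0]), int(header[1])
        count = vocab_size if limit is None else min(limit, vocab_size)
        for _ in range(count):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b"":
                    raise EOFError("file ended while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline that ends the previous record
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="replace")
            raw = f.read(4 * dim)
            if len(raw) != 4 * dim:
                raise EOFError("file ended while reading a vector")
            # Assumption: little-endian float32 values.
            vectors[word] = list(struct.unpack("<%df" % dim, raw))
    return vectors
```

With the training command above, each vector should have 300 entries (-size 300). Passing limit=10 reads only the first ten words, which is enough for a quick sanity check.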

Training

Train the model with:

CUDA_VISIBLE_DEVICES=0 python train.py

We ran the experiments with a batch size of 100, which takes about 11 GB of memory.

Testing

Test the trained model with:

CUDA_VISIBLE_DEVICES=0 python test.py

The results are saved in results and include the MedR (median rank) and recall scores for recipe-to-image and image-to-recipe retrieval.
Our best model trained with Recipe1M (TSC paper) can be downloaded here.
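
For readers unfamiliar with the reported metrics, the sketch below shows how a median rank (MedR) and recall at K can be computed from a query-by-candidate similarity matrix. It is an illustration, not the evaluation code in test.py: the function name retrieval_metrics is hypothetical, the true match of query i is assumed to be candidate i, and any subset sampling or averaging done by the actual evaluation is not reproduced.

```python
from statistics import median


def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Compute MedR and recall@K for one retrieval direction.

    similarity[i][j] is the score of query i against candidate j,
    with higher meaning more similar. Assumption: candidate i is the
    ground-truth match of query i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank 1 means no candidate scores strictly higher than the true match.
        rank = 1 + sum(1 for score in row if score > row[i])
        ranks.append(rank)
    metrics = {"MedR": median(ranks)}
    for k in ks:
        metrics["R@%d" % k] = sum(1 for r in ranks if r <= k) / len(ranks)
    return metrics
```

For image-to-recipe retrieval the rows would be images and the columns recipes; transposing the matrix gives the recipe-to-image direction. A lower MedR and a higher recall indicate better retrieval.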

Contact

We are continuing the development and there is ongoing work in our lab regarding cross-modal retrieval between cooking recipes and food images. For any questions or suggestions you can use the issues section or reach us at [email protected].

Lead Developer: Zhongwei Xie, Georgia Institute of Technology

Advisor: Prof. Dr. Ling Liu, Georgia Institute of Technology

If you use our code, please cite:

[1] Zhongwei Xie, Ling Liu, Yanzhao Wu, et al. Learning Text-Image Joint Embedding for Efficient Cross-Modal Retrieval with Deep Feature Engineering[J]//ACM Transactions on Information Systems (TOIS).

[2] Zhongwei Xie, Ling Liu, Lin Li, et al. Efficient Deep Feature Calibration for Cross-Modal Joint Embedding Learning[C]//Proceedings of the 2021 International Conference on Multimodal Interaction. 2021: 43-51.
