KoCLIP: Korean port of OpenAI CLIP, in Flax

Overview

KoCLIP

Open in Streamlit Open In Colab

This repository contains code for KoCLIP, a Korean port of OpenAI's CLIP. This project was conducted as part of Hugging Face's Flax/JAX community week co-organized with Google's Flax, JAX, and Cloud teams (announcement).

Demo

Check out our Streamlit app here. The demo illustrates three potential uses cases of KoCLIP on different downstream tasks:

  • Image to Text: This is essentially a zero-shot image classification task. Given an input image, the models finds the most likely caption among the text labels provided.
  • Text to Image: This is essentially an image retrieval task. Given a text, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches given text.
  • Text to Patch: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance with the text query.

Quickstart

To follow along the code snippets below, we recommend that you refer to the Colab notebook.

  1. Import dependencies and initialize a KoCLIP model along with its processor.
import requests
import jax
from PIL import Image

from koclip import load_koclip

model, processor = load_koclip("koclip-base")
  1. Prepare image and text captions.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["소파 위에 고양이", "강아지와 강아지 주인", "쳇바퀴를 달리는 햄스터", "자동차"]
image
  1. Run inference.
inputs = processor(
    text=text,
    images=image, 
    return_tensors="jax", # could also be "pt" 
    padding=True
)

outputs = model(**inputs)
probs = jax.nn.softmax(outputs.logits_per_image, axis=1)

for idx, prob in sorted(enumerate(*probs), key=lambda x: x[1], reverse=True):
    print(text[idx], prob)

Models

We trained a total of two models, koclip-base and koclip-large. Both models use RoBERTa-large. The decision to use a somewhat large language model was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to good multimodal pipeline given limited data.

KoCLIP LM ViT
koclip-base klue/roberta-large openai/clip-vit-base-patch32
koclip-large klue/roberta-large google/vit-large-patch16-224

Training

KoCLIP was fine-tuned using 82,783 images from the MSCOCO 2014 image captioning dataset. Korean translations of image captions were obtained from AI Hub, an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.

KoCLIP was trained on a TPU3-v8 VM. Both text and image encoder backbones were loaded from their pretrained checkpoints. KoCLIP was trained to maximize the similarity score between matching pairs of images and captions.

Findings

In this section, we detail some interesting findings we made throughout the project.

Prompting

We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, casting a template such as

이것은 {{}} 이다.

noticably helped the model produce more reliable results. We hypothesize that this is due to the nature of captions in the MSCOCO datset, which are most often full sentences, albeit sometimes short in length.

Multilinguality

Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog", "car"). This could be one of two reasons, or a combination thereof:

  • ViT Pretraining: The ViT backbone for koclip-base, openai/clip-vit-base-patch32, was already pretrained on an English dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithematic can be performed with English text embeddings. One reason against this hypothesis is that koclip-large also demonstrates similar multilingual behavior.

  • LM Knowledge Bleed: klue/roberta-large was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both koclip-base and koclip-large. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."

At the end of the day, we still found it intriguing that a model that was fine-tuned exclusively on Korean managed to produce semantic embeddings from English queries that work well with ViT.

Team

Acknowledgement

The FlaxHybridCLIP model was adpated from the Hugging Face transformer repository, under jax-projects. We also express gratitude to the teams at Google for generously offering TPU VMs for this project. Last but not least, we thank the KLUE team for making pretrained Korean RoBERTa-large weights publicly available.

References

@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation}, 
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{lin2015microsoft,
      title={Microsoft COCO: Common Objects in Context}, 
      author={Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár},
      year={2015},
      eprint={1405.0312},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{srinivasan2021wit,
      title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning}, 
      author={Krishna Srinivasan and Karthik Raman and Jiecao Chen and Michael Bendersky and Marc Najork},
      year={2021},
      eprint={2103.01913},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Owner
Jake Tae
CS + Math @ Yale, SWE intern @huggingface
Jake Tae
Cleaned up code for DSTC 10: SIMMC 2.0 track: subtask 2: multimodal coreference resolution

UNITER-Based Situated Coreference Resolution with Rich Multimodal Input: arXiv MMCoref_cleaned Code for the MMCoref task of the SIMMC 2.0 dataset. Pre

Yichen (William) Huang 2 Dec 05, 2022
[TOG 2021] PyTorch implementation for the paper: SofGAN: A Portrait Image Generator with Dynamic Styling.

This repository contains the official PyTorch implementation for the paper: SofGAN: A Portrait Image Generator with Dynamic Styling. We propose a SofGAN image generator to decouple the latent space o

Anpei Chen 694 Dec 23, 2022
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking

SPLADE 🍴 + 🥄 = 🔎 This repository contains the weights for four models as well as the code for running inference for our two papers: [v1]: SPLADE: S

NAVER 170 Dec 28, 2022
Sequence modeling benchmarks and temporal convolutional networks

Sequence Modeling Benchmarks and Temporal Convolutional Networks (TCN) This repository contains the experiments done in the work An Empirical Evaluati

CMU Locus Lab 3.5k Jan 01, 2023
RoMA: Robust Model Adaptation for Offline Model-based Optimization

RoMA: Robust Model Adaptation for Offline Model-based Optimization Implementation of RoMA: Robust Model Adaptation for Offline Model-based Optimizatio

9 Oct 31, 2022
Meta Learning for Semi-Supervised Few-Shot Classification

few-shot-ssl-public Code for paper Meta-Learning for Semi-Supervised Few-Shot Classification. [arxiv] Dependencies cv2 numpy pandas python 2.7 / 3.5+

Mengye Ren 501 Jan 08, 2023
Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

Non-Metric Space Library (NMSLIB) Important Notes NMSLIB is generic but fast, see the results of ANN benchmarks. A standalone implementation of our fa

2.9k Jan 04, 2023
Spontaneous Facial Micro Expression Recognition using 3D Spatio-Temporal Convolutional Neural Networks

Spontaneous Facial Micro Expression Recognition using 3D Spatio-Temporal Convolutional Neural Networks Abstract Facial expression recognition in video

Bogireddy Sai Prasanna Teja Reddy 103 Dec 29, 2022
Code for the paper "M2m: Imbalanced Classification via Major-to-minor Translation" (CVPR 2020)

M2m: Imbalanced Classification via Major-to-minor Translation This repository contains code for the paper "M2m: Imbalanced Classification via Major-to

79 Oct 13, 2022
This repository provides an efficient PyTorch-based library for training deep models.

s3sec Test AWS S3 buckets for read/write/delete access This tool was developed to quickly test a list of s3 buckets for public read, write and delete

Bytedance Inc. 123 Jan 05, 2023
Discovering Interpretable GAN Controls [NeurIPS 2020]

GANSpace: Discovering Interpretable GAN Controls Figure 1: Sequences of image edits performed using control discovered with our method, applied to thr

Erik Härkönen 1.7k Jan 03, 2023
Fader Networks: Manipulating Images by Sliding Attributes - NIPS 2017

FaderNetworks PyTorch implementation of Fader Networks (NIPS 2017). Fader Networks can generate different realistic versions of images by modifying at

Facebook Research 753 Dec 23, 2022
COVID-Net Open Source Initiative

The COVID-Net models provided here are intended to be used as reference models that can be built upon and enhanced as new data becomes available

Linda Wang 1.1k Dec 26, 2022
Certified Patch Robustness via Smoothed Vision Transformers

Certified Patch Robustness via Smoothed Vision Transformers This repository contains the code for replicating the results of our paper: Certified Patc

Madry Lab 35 Dec 14, 2022
WeakVRD-Captioning - Implementation of paper Improving Image Captioning with Better Use of Caption

WeakVRD-Captioning - Implementation of paper Improving Image Captioning with Better Use of Caption

30 Oct 28, 2022
Art Project "Schrödinger's Game of Life"

Repo of the project "Team Creative Quantum AI: Schrödinger's Game of Life" Installation new conda env: conda create --name qcml python=3.8 conda activ

ℍ◮ℕℕ◭ℍ ℝ∈ᛔ∈ℝ 2 Sep 15, 2022
For storing the complete exploration of Visual Question Answering for our B.Tech Project

Multi-Image vqa @authors: Akhilesh, Janhavi, Harsh Paper summary, Ideas tried and their corresponding results: on wiki Other discussions: on discussio

Harsh Raj 3 Jun 16, 2022
Python package for multiple object tracking research with focus on laboratory animals tracking.

motutils is a Python package for multiple object tracking research with focus on laboratory animals tracking. Features loads: MOTChallenge CSV, sleap

Matěj Šmíd 2 Sep 05, 2022
Implementation of fast algorithms for Maximum Spanning Tree (MST) parsing that includes fast ArcMax+Reweighting+Tarjan algorithm for single-root dependency parsing.

Fast MST Algorithm Implementation of fast algorithms for (Maximum Spanning Tree) MST parsing that includes fast ArcMax+Reweighting+Tarjan algorithm fo

Miloš Stanojević 11 Oct 14, 2022
Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks

Stochastic Downsampling for Cost-Adjustable Inference and Improved Regularization in Convolutional Networks (SDPoint) This repository contains the cod

Jason Kuen 17 Jul 04, 2022