KoCLIP: Korean port of OpenAI CLIP, in Flax

Overview

KoCLIP

Open in Streamlit Open In Colab

This repository contains code for KoCLIP, a Korean port of OpenAI's CLIP. This project was conducted as part of Hugging Face's Flax/JAX community week co-organized with Google's Flax, JAX, and Cloud teams (announcement).

Demo

Check out our Streamlit app here. The demo illustrates three potential uses cases of KoCLIP on different downstream tasks:

  • Image to Text: This is essentially a zero-shot image classification task. Given an input image, the models finds the most likely caption among the text labels provided.
  • Text to Image: This is essentially an image retrieval task. Given a text, the model looks up a database of pre-computed image embeddings to retrieve the image that best matches given text.
  • Text to Patch: This is also a variant of zero-shot image classification. Given a text and an image, the image is partitioned into subsections, and the model ranks them based on their relevance with the text query.

Quickstart

To follow along the code snippets below, we recommend that you refer to the Colab notebook.

  1. Import dependencies and initialize a KoCLIP model along with its processor.
import requests
import jax
from PIL import Image

from koclip import load_koclip

model, processor = load_koclip("koclip-base")
  1. Prepare image and text captions.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = ["소파 위에 고양이", "강아지와 강아지 주인", "쳇바퀴를 달리는 햄스터", "자동차"]
image
  1. Run inference.
inputs = processor(
    text=text,
    images=image, 
    return_tensors="jax", # could also be "pt" 
    padding=True
)

outputs = model(**inputs)
probs = jax.nn.softmax(outputs.logits_per_image, axis=1)

for idx, prob in sorted(enumerate(*probs), key=lambda x: x[1], reverse=True):
    print(text[idx], prob)

Models

We trained a total of two models, koclip-base and koclip-large. Both models use RoBERTa-large. The decision to use a somewhat large language model was motivated by the intuition that annotated Korean datasets are rare; a well-trained, performant LM would be key to good multimodal pipeline given limited data.

KoCLIP LM ViT
koclip-base klue/roberta-large openai/clip-vit-base-patch32
koclip-large klue/roberta-large google/vit-large-patch16-224

Training

KoCLIP was fine-tuned using 82,783 images from the MSCOCO 2014 image captioning dataset. Korean translations of image captions were obtained from AI Hub, an open database maintained by subsidiaries of the Korean Ministry of Science and ICT. Validation metrics were monitored using approximately 40,000 images from the validation set of the aforementioned dataset.

KoCLIP was trained on a TPU3-v8 VM. Both text and image encoder backbones were loaded from their pretrained checkpoints. KoCLIP was trained to maximize the similarity score between matching pairs of images and captions.

Findings

In this section, we detail some interesting findings we made throughout the project.

Prompting

We found that KoCLIP performs better when prompting is used to induce zero-shot behavior. Namely, instead of feeding it a single word or short phrase, casting a template such as

이것은 {{}} 이다.

noticably helped the model produce more reliable results. We hypothesize that this is due to the nature of captions in the MSCOCO datset, which are most often full sentences, albeit sometimes short in length.

Multilinguality

Although KoCLIP was trained exclusively on a Korean dataset, we found that English queries also work surprisingly well for simple words (e.g. "dog", "car"). This could be one of two reasons, or a combination thereof:

  • ViT Pretraining: The ViT backbone for koclip-base, openai/clip-vit-base-patch32, was already pretrained on an English dataset. Hence, it is possible that its embeddings still lie in a latent space where vector arithematic can be performed with English text embeddings. One reason against this hypothesis is that koclip-large also demonstrates similar multilingual behavior.

  • LM Knowledge Bleed: klue/roberta-large was trained on a large corpus of Korean text in a self-supervised fashion. One might reasonably suspect that English words were included in parts of the corpus, especially given the high frequency of English word transliterations in contemporary conversational Korean. This might also explain why English queries work for both koclip-base and koclip-large. One reason against this hypothesis is that the authors of KLUE explicitly state in their paper that one criterion for text selection was that "the corpus must be written in contemporary Korean."

At the end of the day, we still found it intriguing that a model that was fine-tuned exclusively on Korean managed to produce semantic embeddings from English queries that work well with ViT.

Team

Acknowledgement

The FlaxHybridCLIP model was adpated from the Hugging Face transformer repository, under jax-projects. We also express gratitude to the teams at Google for generously offering TPU VMs for this project. Last but not least, we thank the KLUE team for making pretrained Korean RoBERTa-large weights publicly available.

References

@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation}, 
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
@misc{radford2021learning,
      title={Learning Transferable Visual Models From Natural Language Supervision}, 
      author={Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
      year={2021},
      eprint={2103.00020},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{lin2015microsoft,
      title={Microsoft COCO: Common Objects in Context}, 
      author={Tsung-Yi Lin and Michael Maire and Serge Belongie and Lubomir Bourdev and Ross Girshick and James Hays and Pietro Perona and Deva Ramanan and C. Lawrence Zitnick and Piotr Dollár},
      year={2015},
      eprint={1405.0312},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{srinivasan2021wit,
      title={WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning}, 
      author={Krishna Srinivasan and Karthik Raman and Jiecao Chen and Michael Bendersky and Marc Najork},
      year={2021},
      eprint={2103.01913},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Owner
Jake Tae
CS + Math @ Yale, SWE intern @huggingface
Jake Tae
Reading Group @mila-iqia on Computational Optimal Transport for Machine Learning Applications

Computational Optimal Transport for Machine Learning Reading Group Over the last few years, optimal transport (OT) has quickly become a central topic

Ali Harakeh 11 Aug 26, 2022
Code for KDD'20 "Generative Pre-Training of Graph Neural Networks"

GPT-GNN: Generative Pre-Training of Graph Neural Networks GPT-GNN is a pre-training framework to initialize GNNs by generative pre-training. It can be

Ziniu Hu 346 Dec 19, 2022
Worktory is a python library created with the single purpose of simplifying the inventory management of network automation scripts.

Worktory is a python library created with the single purpose of simplifying the inventory management of network automation scripts.

Renato Almeida de Oliveira 18 Aug 31, 2022
PASSL包含 SimCLR,MoCo,BYOL,CLIP等基于对比学习的图像自监督算法以及 Vision-Transformer,Swin-Transformer,BEiT,CVT,T2T,MLP_Mixer等视觉Transformer算法

PASSL Introduction PASSL is a Paddle based vision library for state-of-the-art Self-Supervised Learning research with PaddlePaddle. PASSL aims to acce

186 Dec 29, 2022
🔥🔥High-Performance Face Recognition Library on PaddlePaddle & PyTorch🔥🔥

face.evoLVe: High-Performance Face Recognition Library based on PaddlePaddle & PyTorch Evolve to be more comprehensive, effective and efficient for fa

Zhao Jian 3.1k Jan 04, 2023
2D&3D human pose estimation

Human Pose Estimation Papers [CVPR 2016] - 201511 [IJCAI 2016] - 201602 Other Action Recognition with Joints-Pooled 3D Deep Convolutional Descriptors

133 Jan 02, 2023
BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting

BOVText: A Large-Scale, Bilingual Open World Dataset for Video Text Spotting Updated on December 10, 2021 (Release all dataset(2021 videos)) Updated o

weijiawu 47 Dec 26, 2022
Vehicles Counting using YOLOv4 + DeepSORT + Flask + Ngrok

A project for counting vehicles using YOLOv4 + DeepSORT + Flask + Ngrok

Duong Tran Thanh 37 Dec 16, 2022
K-Means Clustering and Hierarchical Clustering Unsupervised Learning Solution in Python3.

Unsupervised Learning - K-Means Clustering and Hierarchical Clustering - The Heritage Foundation's Economic Freedom Index Analysis 2019 - By David Sal

David Salako 1 Jan 12, 2022
This repository contains PyTorch code for Robust Vision Transformers.

This repository contains PyTorch code for Robust Vision Transformers.

117 Dec 07, 2022
Este conversor criará a medida exata para sua receita de capuccino gelado da grandiosa Rafaella Ballerini!

ConversorDeMedidas_CapuccinoGelado Este conversor criará a medida exata para sua receita de capuccino gelado da grandiosa Rafaella Ballerini! Requirem

Arthur Ottoni Ribeiro 48 Nov 15, 2022
An open source implementation of CLIP.

OpenCLIP Welcome to an open source implementation of OpenAI's CLIP (Contrastive Language-Image Pre-training). The goal of this repository is to enable

2.7k Dec 31, 2022
PyTorch implementation of the ExORL: Exploratory Data for Offline Reinforcement Learning

ExORL: Exploratory Data for Offline Reinforcement Learning This is an original PyTorch implementation of the ExORL framework from Don't Change the Alg

Denis Yarats 52 Jan 01, 2023
Lane follower: Lane-detector (OpenCV) + Object-detector (YOLO5) + CAN-bus

Lane Follower This code is for the lane follower, including perception and control, as shown below. Environment Hardware Industrial Camera Intel-NUC(1

Siqi Fan 3 Jul 07, 2022
🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI

PyTorch implementation of OpenAI's Finetuned Transformer Language Model This is a PyTorch implementation of the TensorFlow code provided with OpenAI's

Hugging Face 1.4k Jan 05, 2023
Deep Learning Specialization by Andrew Ng, deeplearning.ai.

Deep Learning Specialization on Coursera Master Deep Learning, and Break into AI This is my personal projects for the course. The course covers deep l

Engen 1.5k Jan 07, 2023
Temporal-Relational CrossTransformers

Temporal-Relational Cross-Transformers (TRX) This repo contains code for the method introduced in the paper: Temporal-Relational CrossTransformers for

83 Dec 12, 2022
Pytorch implementation of NeurIPS 2021 paper: Geometry Processing with Neural Fields.

Geometry Processing with Neural Fields Pytorch implementation for the NeurIPS 2021 paper: Geometry Processing with Neural Fields Guandao Yang, Serge B

Guandao Yang 162 Dec 16, 2022
Hypersearch weight debugging and losses tutorial

tutorial Activate tensorboard option Running TensorBoard remotely When working on a remote server, you can use SSH tunneling to forward the port of th

1 Dec 11, 2021
TensorFlow-based implementation of "ICNet for Real-Time Semantic Segmentation on High-Resolution Images".

ICNet_tensorflow This repo provides a TensorFlow-based implementation of paper "ICNet for Real-Time Semantic Segmentation on High-Resolution Images,"

HsuanKung Yang 406 Nov 27, 2022