Python package to generate image embeddings with CLIP without PyTorch/TensorFlow

Last update: Jan 04, 2023

Overview

imgbeddings

A Python package to generate embedding vectors from images, using OpenAI's robust CLIP model via Hugging Face transformers. These image embeddings, derived from an image model trained on a large swath of the internet up to mid-2020, can be used for many things: unsupervised clustering (e.g. via umap), embeddings search (e.g. via faiss), and downstream use in other framework-agnostic ML/AI tasks such as building a classifier or calculating image similarity.

The embeddings generation models are ONNX INT8-quantized, meaning they're 20-30% faster on the CPU, much smaller on disk, and don't require PyTorch or TensorFlow as a dependency!
Works for many different image domains thanks to CLIP's zero-shot performance.
Includes utilities for using principal component analysis (PCA) to reduce the dimensionality of generated embeddings without losing much info.
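
The idea behind the PCA step can be illustrated with plain NumPy. This is a minimal sketch, not the package's own utility: pca_reduce is a hypothetical helper, and the random array only stands in for real 768D embeddings.

```python
import numpy as np


def pca_reduce(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    """Project row vectors onto their top principal components (hypothetical helper)."""
    # Center each feature so the components capture variance rather than the mean.
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-in data shaped like 100 embeddings of 768 dimensions (not real embeddings).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 768)).astype(np.float32)

reduced = pca_reduce(fake_embeddings, n_components=32)
print(reduced.shape)  # (100, 32)
```

With real embeddings, the leading components usually carry most of the variance, which is why a much smaller vector can still be useful downstream.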

Real-World Demo Notebooks

You can read how to use imgbeddings for real-world use cases in these Jupyter Notebooks:

Cats vs. Dogs: image clustering and building a cat/dog classifier
Pokémon: most-similar image search
Image Augmentation: generated embedding resilience to altered inputs

Installation

imgbeddings can be installed from PyPI:

pip3 install imgbeddings

Quick Example

Let's say you want to generate an image embedding for a cute cat photo. First you can download the photo:

import requests
from PIL import Image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

Then, you can load imgbeddings. By default, imgbeddings will load an 88MB model based on the patch32 variant of CLIP, which separates each image into 49 32x32 patches.

from imgbeddings import imgbeddings
ibed = imgbeddings()

You can also load the patch16 model by passing patch_size = 16 to imgbeddings() (more granular embeddings but takes about 3x longer to run), or the "large" patch14 model with patch_size = 14 (3.5x model size, 3x longer than patch16).
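
To make the "more granular" trade-off concrete, the patch counts can be worked out by hand. The sketch below assumes a 224x224 input for every variant, which is what 49 patches of 32x32 implies for patch32; the resolution used by the patch16 and patch14 models is an assumption here, and patch_count is a hypothetical helper.

```python
# Assumed input resolution: 224x224 pixels (implied by 49 patches of 32x32).
IMAGE_SIZE = 224


def patch_count(patch_size: int, image_size: int = IMAGE_SIZE) -> int:
    """Number of non-overlapping square patches that tile a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


for size in (32, 16, 14):
    print(f"patch{size}: {patch_count(size)} patches")
# patch32: 49 patches
# patch16: 196 patches
# patch14: 256 patches
```

Under that assumption, patch16 sees four times as many patches as patch32, which is consistent with the longer run time noted above.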

Then to generate embeddings, all you have to do is pass the image to to_embeddings()!

embedding = ibed.to_embeddings(image)
embedding[0][0:5] # array([ 0.914541, 0.45988417, 0.0350069 , -0.9054574 , 0.08941309], dtype=float32)

This returns a 768D numpy vector for each input, which can be used for pretty much anything in the ML/AI world. You can also pass a list of filenames and/or PIL Images for batch embeddings generation.
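
One common use of such vectors is comparing two images by cosine similarity. The following is a minimal sketch using only the standard library; cosine_similarity is a hypothetical helper, and the short toy vectors stand in for real 768D embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3D vectors standing in for real 768D embeddings.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]
orthogonal = [0.0, 1.0, 0.0]

print(cosine_similarity(query, same_direction))  # close to 1: same direction
print(cosine_similarity(query, orthogonal))      # 0: unrelated directions
```

Values near 1 indicate similar images, and values near 0 indicate unrelated ones.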

See the Demo Notebooks above for more advanced parameters and real-world use cases. More formal documentation will be added soon.

Ethics

The official paper for CLIP explicitly notes that there are inherent biases in the finished model, and that CLIP shouldn't be used in production applications as a result. My perspective is that having better tools free-and-open-source to detect such issues and make it more transparent is an overall good for the future of AI, especially since there are less-public ways to create image embeddings that aren't as accessible. At the least, this package doesn't do anything that wasn't already available when CLIP was open-sourced in January 2021.

If you do use imgbeddings for your own project, I recommend doing a strong QA pass along a diverse set of inputs for your application, which is something you should always be doing whenever you work with machine learning, biased models or not.

imgbeddings is not responsible for malicious misuse of image embeddings.

Design Notes

Note that CLIP was trained on square images only, and imgbeddings will pad and resize rectangular images into a square (imgbeddings deliberately does not center crop). As a result, images too wide/tall (e.g. more than a 3:1 ratio of largest dimension to smallest) will not generate robust embeddings.
This package intentionally works only with image data, rather than leveraging CLIP's ability to link image and text. For downstream tasks, using your own text in conjunction with an image will likely give better results. (e.g. if training a model on image embeddings + text embeddings, feed both and let the model determine the relative importance of each for your use case)
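
The pad-to-square note can be illustrated with a little arithmetic. This is a sketch under stated assumptions, not the package's actual preprocessing: square_padding and aspect_ratio are hypothetical helpers, and centering the image within the padding is an assumption. The 3:1 threshold comes from the note above.

```python
from typing import Tuple


def square_padding(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) padding that makes the image square.

    Centering the original image is an assumption made for this sketch.
    """
    side = max(width, height)
    pad_w, pad_h = side - width, side - height
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top


def aspect_ratio(width: int, height: int) -> float:
    """Ratio of the largest dimension to the smallest."""
    return max(width, height) / min(width, height)


print(square_padding(640, 480))     # (0, 80, 0, 80)
print(aspect_ratio(640, 480) > 3)   # False: fine to embed
print(aspect_ratio(1600, 400) > 3)  # True: likely too wide for robust embeddings
```

A check like this could be run over a dataset up front to flag images that are mostly padding after squaring.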

For more miscellaneous design notes, see DESIGN.md.

Maintainer/Creator

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon and GitHub Sponsors. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

License

MIT

Comments

multiple classes

Excuse me, I'm trying to use this work to cluster a 4-class dataset. While following the instructions in "cat_dogs.ipynb", calling umap.plot.points raises a ValueError: "Plotting is currently only implemented for 2D embeddings". I'm pretty sure I followed the data structure given in the repo. Does this mean it only supports binary classes? Thanks a lot~

opened by CinKKKyo 3

Embeddings vary slightly when done in batches vs. single

import requests
from PIL import Image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

from imgbeddings import imgbeddings
ibed = imgbeddings()

embedding = ibed.to_embeddings(image)
embedding[:, 0:5]

array([[ 0.914541  ,  0.45988417,  0.0350069 , -0.9054574 ,  0.08941309]],
      dtype=float32)

embedding = ibed.to_embeddings([image]*4)
embedding[:, 0:5]

array([[ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635],
       [ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635],
       [ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635],
       [ 0.9133097 ,  0.46032238,  0.03528907, -0.90713847,  0.09063635]],
      dtype=float32)

Probably a side effect of ONNX conversion as that's within tolerances. (or a case where intra op is breaking parallelism?)
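
The size of the discrepancy can be quantified from the values printed above. This is a small sketch covering only the five components shown; the 2e-3 tolerance is an illustrative choice, not a threshold defined by the package.

```python
# First five components copied from the single and batched outputs above.
single = [0.914541, 0.45988417, 0.0350069, -0.9054574, 0.08941309]
batched = [0.9133097, 0.46032238, 0.03528907, -0.90713847, 0.09063635]

# Largest per-component gap between the two runs.
max_abs_diff = max(abs(s - b) for s, b in zip(single, batched))
print(f"{max_abs_diff:.5f}")  # 0.00168

# Illustrative tolerance (an assumption for this sketch).
TOLERANCE = 2e-3
print(max_abs_diff < TOLERANCE)  # True
```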

bug

opened by minimaxir 0

Allow imgbeddings to optionally split an image into parts for more robust embeddings

Let's say you want to split the image into quadrants (2 row x 2 col)

Run each image as a batch of 4 inputs, with each input representing a quadrant

Hstack/concatenate the outputs to create a 768 * 4 vector (3072D)

PCA to get it down to a reasonable size to avoid curse-of-dimensionality shenanigans

This should work since CLIP was trained with center/random cropping so the model should be resilient to subsets.

Since a 2x2 split would give maximum robustness only for 448x448 images, which is still low, it may be worth it to scale it up/allow arbitrary segments (e.g. 4x4 for 896x896 images, or rectangular inputs) if the image resolution of the input data is consistent (e.g. 1024x1024 for StyleGAN shenanigans).

enhancement

opened by minimaxir 1
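
The splitting step proposed above could look roughly like this. It is a sketch of the idea only, since the feature is a proposal: grid_boxes is a hypothetical helper that returns (left, upper, right, lower) boxes in the form PIL's Image.crop accepts, and the 768D size per part is taken from the README.

```python
from typing import List, Tuple

# (left, upper, right, lower), the box format PIL's Image.crop accepts.
Box = Tuple[int, int, int, int]


def grid_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Crop boxes for a rows x cols grid, in row-major order (hypothetical helper)."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            upper = r * height // rows
            lower = (r + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


boxes = grid_boxes(448, 448, rows=2, cols=2)
print(boxes)
# [(0, 0, 224, 224), (224, 0, 448, 224), (0, 224, 224, 448), (224, 224, 448, 448)]

# Concatenating one 768D embedding per quadrant gives the 3072D vector.
EMBEDDING_DIM = 768
print(EMBEDDING_DIM * len(boxes))  # 3072
```

Each box could then be passed to image.crop(box), the resulting list of PIL Images embedded as a batch, and the rows concatenated before PCA.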

Releases(v0.1.0)

v0.1.0(Mar 28, 2022)
