A concise but complete implementation of CLIP with various experimental improvements from recent papers

Overview

x-clip (wip)

Install

$ pip install x-clip

Usage

import torch
from x_clip import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True   # whether to use fine-grained contrastive learning (FILIP)
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)
loss.backward()

Citations

@misc{radford2021learning,
    title   = {Learning Transferable Visual Models From Natural Language Supervision}, 
    author  = {Alec Radford and Jong Wook Kim and Chris Hallacy and Aditya Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
    year    = {2021},
    eprint  = {2103.00020},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{yao2021filip,
    title   = {FILIP: Fine-grained Interactive Language-Image Pre-Training}, 
    author  = {Lewei Yao and Runhui Huang and Lu Hou and Guansong Lu and Minzhe Niu and Hang Xu and Xiaodan Liang and Zhenguo Li and Xin Jiang and Chunjing Xu},
    year    = {2021},
    eprint  = {2111.07783},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
Comments
  • Model forward outputs to text/image similarity score

    Any insight on how to turn the image/text embeddings (or the nominal model forward output) into a simple similarity score, as done in the Hugging Face implementation? HF example here

    In the original paper I see that the dot products of the image/text encoder outputs were used, but here I was having trouble with the dimensions of the outputs.

    opened by paulcjh 12
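The dot-product scoring described in this question can be sketched as follows. This is a minimal illustration, not x-clip's own API: it assumes you already have one latent vector per text and per image (random tensors stand in for them here), and the helper name `similarity_matrix` is hypothetical.

```python
import torch
import torch.nn.functional as F

def similarity_matrix(text_latents, image_latents, temperature=1.0):
    # Hypothetical helper: L2-normalize both sets of latents so that the
    # dot product between a text and an image is their cosine similarity.
    text_latents = F.normalize(text_latents, dim=-1)
    image_latents = F.normalize(image_latents, dim=-1)
    # Result has shape (num_texts, num_images).
    return text_latents @ image_latents.t() / temperature

torch.manual_seed(0)
text_latents = torch.randn(4, 512)    # stand-in for projected text embeddings
image_latents = torch.randn(4, 512)   # stand-in for projected image embeddings

sim = similarity_matrix(text_latents, image_latents)
probs = sim.softmax(dim=-1)           # per-text distribution over images
```

Both inputs must share the same last dimension, which is why the encoder outputs are usually projected to a common latent size before this step.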
  • Using different encoders in CLIP

    Hi, I am wondering whether it is possible to use different encoders in CLIP, for example a ResNet instead of a ViT for images. Is it also possible to replace the text encoder with a feature encoder? If I have a feature vector for a given image and want to use x-clip, how should I do that? I wrote a code example that doesn't seem to work; here is what I did:

    import torch
    from x_clip import CLIP
    import torch.nn as nn
    from torchvision import models
    
    class Image_Encoder(torch.nn.Module):
        #output size is (bs,512)
        def __init__(self):
            super(Image_Encoder, self).__init__()
            self.model_pre = models.resnet18(pretrained=False)
            self.base=nn.Sequential(*list(self.model_pre.children()))
            self.base[0]=nn.Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
            self.resnet=self.base[:-1]
    
        def forward(self, x):
            out=self.resnet(x).squeeze()
            return out
    
    
    class features_encoder(torch.nn.Module):
        #output size is (bs,512)
        def __init__(self):
            super(features_encoder, self).__init__()
            self.model =nn.Linear(2048,512)
    
        def forward(self, x):
            out=self.model(x)
            return out
    
    images_encoder=Image_Encoder()
    features_encoder=features_encoder()
    
    clip = CLIP(
        image_encoder = images_encoder,
        text_encoder = features_encoder,
        dim_image = 512,
        dim_text = 512,
        dim_latent = 512
    )
    
    features= torch.randn(4,2048)
    images = torch.randn(4, 3, 256, 256)
    
    loss = clip(features, images, return_loss = True)
    loss.backward()
    

    but I got the following error: forward() takes 2 positional arguments but 3 were given

    Thanks

    opened by ethancohen123 8
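One possible reading of the error above, which the source does not confirm, is that the custom encoder is called with an extra argument (such as a mask) that its `forward` does not accept. Under that assumption, a thin wrapper that swallows extra arguments is one workaround. The class name `IgnoreExtraArgs` is hypothetical and not part of x-clip.

```python
import torch
from torch import nn

class IgnoreExtraArgs(nn.Module):
    """Hypothetical wrapper: forwards only the first input to the encoder."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, x, *args, **kwargs):
        # Any additional positional or keyword arguments (e.g. a mask)
        # are accepted and deliberately discarded.
        return self.encoder(x)

feature_encoder = IgnoreExtraArgs(nn.Linear(2048, 512))

features = torch.randn(4, 2048)
mask = torch.ones(4, dtype=torch.bool)

out_positional = feature_encoder(features, mask)       # extra positional arg
out_keyword = feature_encoder(features, mask=mask)     # extra keyword arg
```

Separately, note that calling `.squeeze()` with no arguments also drops the batch dimension when the batch size is 1; `flatten(1)` keeps it.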
  • Visual ssl with channels different than 3

    Hi, there seems to be a bug when using visual SSL with a number of channels other than 3. I think the error comes from the visual SSL module, around row 280 here:

    # send a mock image tensor to instantiate parameters
    self.forward(torch.randn(1, 3, image_size, image_size))

    opened by ethancohen123 4
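A sketch of how the mock forward pass could take the channel count as a parameter instead of hardcoding 3. The wrapper class `MockForwardInit` and its `channels` argument are assumptions made for illustration; they are not the library's actual code.

```python
import torch
from torch import nn

class MockForwardInit(nn.Module):
    """Hypothetical wrapper that runs one mock forward pass at construction."""

    def __init__(self, net, image_size, channels=3):
        super().__init__()
        self.net = net
        # Build the mock image with the configured channel count
        # rather than a fixed 3, so non-RGB inputs also work.
        with torch.no_grad():
            mock = torch.randn(1, channels, image_size, image_size)
            self.mock_output_shape = tuple(self.forward(mock).shape)

    def forward(self, images):
        return self.net(images)

# A single-channel (grayscale) network as a stand-in for a visual encoder.
grayscale_net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
wrapper = MockForwardInit(grayscale_net, image_size=32, channels=1)
```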
  • Allow other types of visual SSL when initiating CLIP

    In the following code as part of CLIP.__init__

            if use_visual_ssl:
                if visual_ssl_type == 'simsiam':
                    ssl_type = SimSiam
                elif visual_ssl_type == 'simclr':
                    ssl_type = partial(SimCLR, temperature = simclr_temperature)
                else:
                    raise ValueError(f'unknown visual_ssl_type')
    
                self.visual_ssl = ssl_type(
                    self.visual_transformer,
                    image_size = visual_image_size,
                    hidden_layer = visual_ssl_hidden_layer
                )
    

    the visual self-supervised learning is hardcoded. I would suggest changing this to accept the visual SSL module as an argument when instantiating CLIP to allow flexibility in the same manner as it does for the image encoder and text encoder.

    Example:

    barlow = BarlowTwins(augmentation_fns)
    clip = CLIP(..., visual_ssl=barlow)
    
    opened by Froskekongen 4
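The suggestion above amounts to accepting either a ready-made module or a name. A minimal sketch of that pattern follows, assuming a hypothetical resolver `resolve_visual_ssl` and a toy stand-in module `ToySSL`; neither exists in x-clip.

```python
import torch
from torch import nn

class ToySSL(nn.Module):
    """Stand-in for an SSL wrapper such as SimSiam or SimCLR (illustration only)."""

    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, images):
        # Placeholder objective: a scalar computed from the encoder output.
        return self.encoder(images).pow(2).mean()

def resolve_visual_ssl(visual_ssl, encoder, registry):
    # A user-supplied module is used as-is, which allows arbitrary SSL methods.
    if isinstance(visual_ssl, nn.Module):
        return visual_ssl
    # Otherwise fall back to the built-in names.
    if visual_ssl in registry:
        return registry[visual_ssl](encoder)
    raise ValueError(f'unknown visual_ssl_type: {visual_ssl!r}')

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
registry = {'toy': ToySSL}

built_in = resolve_visual_ssl('toy', encoder, registry)
custom = ToySSL(encoder)
passed_through = resolve_visual_ssl(custom, encoder, registry)
ssl_loss = built_in(torch.randn(2, 3, 8, 8))
```

Checking for a module first keeps the existing string-based options working while leaving room for custom methods.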
  • Extract Text and Image Latents

    Hi, in the current implementation we can only extract the text and image embeddings (by setting return_encodings=True), which are obtained before the latent linear layers are applied. Wouldn't it be better to add an option to extract the latent embeddings? This also matters because, with the current code, it is impossible to extract the similarity matrix between a batch of images and a batch of text.

    opened by mmsamiei 2
  • NaN with mock data

    Hi lucidrains,

    Try this and it will NaN within 100 steps (latest Github code). The loss looks fine before NaN.

    import torch
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True    
    torch.backends.cudnn.benchmark = True
    
    import random
    import numpy as np
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    
    num_text_tokens = 10000
    batch_sz = 12
    text_seq_len = 256
    visual_image_size = 256
    
    # mock data
    
    data_sz = 1000
    all_text = torch.randint(0, num_text_tokens, (data_sz, text_seq_len)).cuda()
    all_images = torch.randn(data_sz, 3, visual_image_size, visual_image_size).cuda()
    
    text = torch.zeros((batch_sz, text_seq_len), dtype=torch.long).cuda()
    images = torch.zeros((batch_sz, 3, visual_image_size, visual_image_size)).cuda()
    
    ##########################################################################################
    
    import wandb
    import datetime
    wandb.init(project="Test", name=datetime.datetime.today().strftime('%Y-%m-%d-%H-%M-%S'), save_code=False)
    
    from x_clip import CLIP
    
    clip = CLIP(
        dim_text = 512,
        dim_image = 512,
        dim_latent = 512,
        num_text_tokens = num_text_tokens,
        text_enc_depth = 6,
        text_seq_len = text_seq_len,
        text_heads = 8,
        visual_enc_depth = 6,
        visual_image_size = visual_image_size,
        visual_patch_size = 32,
        visual_heads = 8,
        use_all_token_embeds = False,           # whether to use fine-grained contrastive learning (FILIP)
        decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
        extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
        use_visual_ssl = True,                  # whether to do self supervised learning on images
        visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
        use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
        text_ssl_loss_weight = 0.05,            # weight for text MLM loss
        image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
    ).cuda()
    
    optimizer = torch.optim.Adam(clip.parameters(), lr=1e-4, betas=(0.9, 0.99))
    
    for step in range(999999):
        for i in range(batch_sz):
            data_id = random.randrange(0, data_sz - 1)
            text[i] = all_text[data_id]
            images[i] = all_images[data_id]
    
        loss = clip(
            text,
            images,
            freeze_image_encoder = False,   # whether to freeze image encoder if using a pretrained image net, proposed by LiT paper
            return_loss = True              # needs to be set to True to return contrastive loss
        )
        clip.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(clip.parameters(), 1.0)
        optimizer.step()
    
        now_loss = loss.item()
        wandb.log({"loss": now_loss}, step = step)
        print(step, now_loss)
    
        if 'nan' in str(now_loss):
            break
    
    opened by BlinkDL 1
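When chasing a NaN like the one in this report, it can help to check the loss before the optimizer step so the weights are not corrupted. The sketch below is a generic guard, not a fix for the underlying cause; `guarded_step` is a hypothetical helper, and a small linear model stands in for CLIP.

```python
import torch
from torch import nn

def guarded_step(model, optimizer, loss, max_norm=1.0):
    """Hypothetical helper: skip the update when the loss is NaN or infinite."""
    optimizer.zero_grad(set_to_none=True)
    if not torch.isfinite(loss):
        return False  # weights are left untouched
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True

torch.manual_seed(0)
model = nn.Linear(4, 1)                       # stand-in for the CLIP model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(8, 4)

weights_start = model.weight.detach().clone()
stepped = guarded_step(model, optimizer, model(inputs).pow(2).mean())
weights_after_good = model.weight.detach().clone()

bad_loss = model(inputs).pow(2).mean() * float('nan')
skipped = not guarded_step(model, optimizer, bad_loss)
weights_after_bad = model.weight.detach().clone()
```

Logging the step index whenever an update is skipped makes it easier to find the batch that first produced the non-finite value.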
  • Unable to train to convergence (small dataset)

    Hi, nice work on x-clip. I'm hoping to play around with it and eventually combine it with your DALLE2 work.

    I'm currently having some trouble training on roughly 30k image-text pairs. The loss eventually goes negative and starts producing NaNs. I've dropped the learning rate (1e-4) and I'm clipping gradients (max_norm=0.5).

    Any thoughts on what are sane training params/configs on such a small dataset using x-clip?

    opened by jacobwjs 9
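On the negative loss specifically: the standard InfoNCE (cross-entropy) loss cannot go below zero, but a decoupled variant that removes the positive pair from the denominator can. The report does not say which options were enabled, so the sketch below only illustrates that a negative value alone is not necessarily a bug. Both functions are simplified illustrations and not x-clip's implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(sim):
    # Standard contrastive loss: the matching pair sits on the diagonal.
    labels = torch.arange(sim.shape[0])
    return F.cross_entropy(sim, labels)

def decoupled_info_nce(sim):
    # Decoupled variant: the positive pair is excluded from the denominator.
    pos = sim.diagonal()
    eye = torch.eye(sim.shape[0], dtype=torch.bool)
    neg = sim.masked_fill(eye, float('-inf'))
    return (-pos + neg.logsumexp(dim=-1)).mean()

# A well-separated similarity matrix: high on the diagonal, zero elsewhere.
sim = torch.eye(4) * 10.0

standard = info_nce(sim)              # small but non-negative
decoupled = decoupled_info_nce(sim)   # negative: -10 + log(3)
```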
Releases(0.12.0)
Owner
Phil Wang
Working with Attention. It's all we need