[CVPR 2021] VirTex: Learning Visual Representations from Textual Annotations

Overview

VirTex: Learning Visual Representations from Textual Annotations

Karan Desai and Justin Johnson
University of Michigan


CVPR 2021 arxiv.org/abs/2006.06666

Model Zoo, Usage Instructions and API docs: kdexd.github.io/virtex

VirTex is a pretraining approach which uses semantically dense captions to learn visual representations. We train CNN + Transformers from scratch on COCO Captions, and transfer the CNN to downstream vision tasks including image classification, object detection, and instance segmentation. VirTex matches or outperforms models which use ImageNet for pretraining -- both supervised and unsupervised -- despite using up to 10x fewer images.

[Figure: VirTex model overview]

Get the pretrained ResNet-50 visual backbone from our best performing VirTex model in one line without any installation!

import torch

# That's it, this one line only requires PyTorch.
model = torch.hub.load("kdexd/virtex", "resnet50", pretrained=True)
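
A quick sanity check with the loaded backbone -- a minimal sketch assuming the hub model behaves like a standard torchvision ResNet-50, with a random tensor standing in for a preprocessed image:

model.eval()

# Dummy batch: one 3x224x224 tensor standing in for a normalized RGB image.
image = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    output = model(image)

# The output shape depends on whether a classification head is attached; a
# headless ResNet-50 backbone yields a 2048-d feature vector per image.
print(output.shape)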

Note (For returning users before January 2021):

The pretrained models in our model zoo have changed from v1.0 onwards. They are slightly better tuned than the older models, and reproduce the results in our CVPR 2021 accepted paper (arXiv v2). Some training and evaluation hyperparameters have changed since v0.9. Please refer to CHANGELOG.md

Usage Instructions

  1. How to setup this codebase?
  2. VirTex Model Zoo
  3. How to train your VirTex model?
  4. How to evaluate on downstream tasks?

Full documentation is available at kdexd.github.io/virtex.

Citation

If you find this code useful, please consider citing:

@inproceedings{desai2021virtex,
    title={{VirTex: Learning Visual Representations from Textual Annotations}},
    author={Karan Desai and Justin Johnson},
    booktitle={CVPR},
    year={2021}
}

Acknowledgments

We thank Harsh Agrawal, Mohamed El Banani, Richard Higgins, Nilesh Kulkarni and Chris Rockwell for helpful discussions and feedback on the paper. We thank Ishan Misra for discussions regarding PIRL evaluation protocol; Saining Xie for discussions about replicating iNaturalist evaluation as MoCo; Ross Girshick and Yuxin Wu for help with Detectron2 model zoo; Georgia Gkioxari for suggesting the Instance Segmentation pretraining task ablation; and Stefan Lee for suggestions on figure aesthetics. We thank Jia Deng for access to extra GPUs during project development; and UMich ARC-TS team for support with GPU cluster management. Finally, we thank all the Starbucks outlets in Ann Arbor for many hours of free WiFi. This work was partially supported by the Toyota Research Institute (TRI). However, note that this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

Comments
  • run on single input image

    Hi,

    I would like to evaluate your work on a single image for image captioning. Can you tell me the steps I should follow for a single input? For instance, given a folder of images, how would I use your model for inference only on the folder of images?

    Looking at the captioning task from your description, I am not sure how to use my own dataset to evaluate the model.

    Thanks

    opened by nikky4D 15
  • Training loss acts strangely after resuming

    Hi,

    I want to reproduce your pre-training result. An accident interrupted my training, and I resumed it with the "--resume-from" flag, but it acts weirdly: the training and validation loss jumped dramatically at the beginning and then decreased, which suggests a problem with restoring. Could you help me with this?

    opened by BaohaoLiao 11
  • BatchNormalization's Running Stats are Accumulated in ImageNet Linear Evaluation

    Hi,

    Thanks for the nice paper and clear code!

    I found that the models are set with .train() in clf_linear.py. Thus the running averages (i.e., the running stats) of the BatchNormalization layers are accumulated while training on the ImageNet dataset (via the forward calls), and the backbone does not seem to be fully frozen. Is this a deliberate design choice for this fine-tuning task?
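
    For readers unfamiliar with the behavior described above, a minimal standalone illustration (not the repo's code) of why .train() matters even when all parameters are frozen:

    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
    for p in backbone.parameters():
        p.requires_grad = False  # freezes the weights, but NOT the BN running stats

    backbone.train()  # forward passes still update running_mean / running_var
    _ = backbone(torch.randn(4, 3, 32, 32))

    backbone.eval()   # running stats frozen; stored statistics are used instead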

    Best, Hao

    opened by airsplay 6
  • Removed link for pretrained model

    Hi,

    I am trying to download the pretrained model for image captioning, but the download link has been removed. Could you please update the download link?

    opened by zhuang93 4
  • unable to find a valid cuDNN algorithm to run convolution

    Sorry to bother you, but I ran into this problem and cannot find a way to fix it. It happens when I train the base VirTex model. I updated cuDNN from version 7.6.5 to 8.0.3, but both versions give this error.

    opened by Charlie-zhang1406 4
  • Question about SentencePiece [SOS] and [EOS] ID.

    Hi, I saw that in SentencePieceTrainer, as below, you disable the built-in BOS/EOS tokens ("--bos_id=-1 --eos_id=-1") and instead add control symbols ("--control_symbols=[SOS],[EOS],[MASK]"). However, during captioning you define sos_index: int = 1 and eos_index: int = 2. I am wondering whether these setups have any effect?
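
    A minimal sketch (file names hypothetical) of why those indices are consistent with the trainer flags: with the built-in BOS/EOS disabled, [UNK] keeps id 0 and the user-defined control symbols are assigned the next ids in order, so [SOS] = 1 and [EOS] = 2.

    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="captions.txt",          # assumed: one caption per line
        model_prefix="caption_vocab",
        vocab_size=10000,
        bos_id=-1, eos_id=-1,          # disable the built-in BOS/EOS tokens
        control_symbols="[SOS],[EOS],[MASK]",
    )

    sp = spm.SentencePieceProcessor(model_file="caption_vocab.model")
    print(sp.piece_to_id("[SOS]"))   # 1
    print(sp.piece_to_id("[EOS]"))   # 2
    print(sp.piece_to_id("[MASK]"))  # 3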

    opened by nooralahzadeh 4
  • No loss when pretraining on token classification

    I am trying to pretrain using the token classification method. I copied this repo and was just trying to reproduce the results from the study. I am experiencing problems when pretraining using token classification. It seems as though the loss values are not in the output_dict variable.

    When I use pretrain_virtex.py and log every 20 iterations, I get the following output:

    2021-11-16T12:20:04.960052+0000: Iter 20 | Time: 0.764 sec | ETA: 54h 39m [Loss nan] [GPU 8774 MB]

    Do you have any idea what could be wrong in the code?

    opened by alexkern1997 3
  • Possible inconsistency in data preprocessing

    Hi, thank you so much for sharing this code. It is very helpful.

    However, I am confused about the data preprocessing configuration. In the config files, a Caffe-style image mean and std are specified, but it seems they are not used in the code. Instead, the code seems to hard-code the torchvision-style mean and std (here). Can you confirm that both pretraining and fine-tuning use the latter?

    Furthermore, I am not sure whether the images are in the 0-255 range or 0-1. For the Caffe-style mean and std, it should be 0-255, but with your hard-coded mean and std, it should be 0-1. However, I noticed you are using OpenCV to load images, which loads in 0-255, and I did not find anywhere in the code where they are transformed into 0-1, except in supervised pretraining (here).

    Could you please comment on the aforementioned issues? Especially it is important to make sure the config is identical for all pretraining and downstream settings. Since you fine-tune all layers and don't freeze the stem, it is hard to notice if such inconsistencies exist, because the fine-tuning process would fix them to some extent.
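
    For context, a schematic comparison of the two conventions discussed above (values are the common torchvision / Detectron2 defaults, not necessarily this repo's config):

    import numpy as np

    # HWC image in the 0-255 range, as loaded by OpenCV.
    image = np.random.randint(0, 256, (224, 224, 3)).astype(np.float64)

    # torchvision-style: scale to [0, 1] first, then normalize per channel (RGB).
    tv_mean = np.array([0.485, 0.456, 0.406])
    tv_std = np.array([0.229, 0.224, 0.225])
    tv_out = (image / 255.0 - tv_mean) / tv_std

    # Caffe-style: keep the [0, 255] range and subtract per-channel means (BGR).
    caffe_mean = np.array([103.530, 116.280, 123.675])
    caffe_out = image - caffe_mean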

    Thank you so much.

    opened by alirezazareian 3
  • Pre-training on another dataset

    Hi,

    Thank you for making this code public!

    I want to pre-train a captioning model on another dataset (ARCH dataset). I went through your codebase and realized that first I need to create a Dataset class for my dataset similar to your Dataset class in virtex/data/datasets/coco_captions.py. Next, I will need to make a modified version of virtex/data/datasets/captioning.py.

    Somehow the files in virtex/data/datasets/ are all ignored by git and I can't make any of them visible. Can you please help me with it? I would also appreciate any suggestions on how to modify the code at this stage so as to cause the least disruption to the functions and classes which rely on the Dataset classes.
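
    For what it's worth, a hypothetical skeleton of such a Dataset class (field names, annotation format, and the albumentations-style transform call are all assumptions, only loosely mirroring coco_captions.py):

    from typing import Callable, Dict, List, Optional

    import cv2
    from torch.utils.data import Dataset

    class ArchCaptionsDataset(Dataset):
        """Sketch of a captioning dataset for a new corpus."""

        def __init__(self, annotations: List[Dict], image_transform: Optional[Callable] = None):
            # Each annotation: {"image_path": str, "captions": [str, ...]}
            self.annotations = annotations
            self.image_transform = image_transform

        def __len__(self) -> int:
            return len(self.annotations)

        def __getitem__(self, idx: int) -> Dict:
            ann = self.annotations[idx]
            image = cv2.imread(ann["image_path"])  # BGR, HWC, uint8
            if self.image_transform is not None:
                # albumentations-style transform assumed.
                image = self.image_transform(image=image)["image"]
            return {"image": image, "captions": ann["captions"]}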

    Many thanks, George Batchkala

    opened by GeorgeBatch 2
  • The weight file on http://kdexd.xyz/virtex/virtex/usage/model_zoo.html was canceled

    Hello, your work is very attractive to me, but when I tried to reproduce your research results, I found that the weight files on http://kdexd.xyz/virtex/virtex/usage/model_zoo.html have been removed. I hope you can provide working links to the weight files so that your excellent work can be reproduced.

    opened by hubin111 2
  • torch.hub.load("kdexd/virtex", "resnet50", pretrained=True) not working

    I tried running this in Colab environment.

    Got the below error:

    KeyError                                  Traceback (most recent call last)
    
    <ipython-input-5-e8ec27705300> in <module>()
          1 import torch
          2 # model = torch.hub.load('pytorch/vision:v0.9.0', 'alexnet', pretrained=True)
    ----> 3 model = torch.hub.load("kdexd/virtex", "resnet50", pretrained=True)
          4 model.eval()
    
    2 frames
    
    /root/.cache/torch/hub/kdexd_virtex_master/hubconf.py in resnet50(pretrained, **kwargs)
         31                 "https://umich.box.com/shared/static/gsjqm4i4fm1wpzi947h27wweljd8gcpy.pth",
         32                 progress=False,
    ---> 33             )["model"]
         34         )
         35     return model
    
    KeyError: 'model'
    

    Can you let me know the fix ?

    opened by Sumegh-git 2
  • Cog version

    "😵 Uh oh! This model can't be run on Replicate because it was built with a version of Cog that is no longer supported." https://replicate.com/kdexd/virtex-image-captioning

    opened by Jakeukalane 0
  • Training with new Random Seed does not shuffle data

    I've been adapting the example scripts to my own training task, and I've noticed that the scripts do not handle different random seeds as expected. I've found this problem in two places, but there might be more:

    https://github.com/kdexd/virtex/blob/2baba8a4f3a4d80d617b3bc59e4be25b1052db57/scripts/clf_linear.py#L104-L109 https://github.com/kdexd/virtex/blob/2baba8a4f3a4d80d617b3bc59e4be25b1052db57/scripts/pretrain_virtex.py#L68

    The problem is that the DistributedSampler (from PyTorch 1.9.0) requires kwarg "seed" to shuffle differently, when shuffle=True. I believe that the correct use of DistributedSampler for training with different random seeds would be to add the kwarg seed=_DOWNC.RANDOM_SEED when DistributedSampler is initialized in these two places. As for reshuffling on additional epochs, DistributedSampler will add the seed to the epoch number, so nothing needs to be changed during epoch-setting for the sampler.

    https://github.com/pytorch/pytorch/blob/d69c22dd61a2f006dcfe1e3ea8468a3ecaf931aa/torch/utils/data/distributed.py#L100

    Please let me know your thoughts, or if I may have missed something.
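
    A runnable sketch of the proposed fix (RANDOM_SEED stands in for the repo's _DOWNC.RANDOM_SEED; the dataset and single-process sampler arguments are placeholders):

    import torch
    from torch.utils.data import TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    train_dataset = TensorDataset(torch.arange(100))
    RANDOM_SEED = 42

    # Without `seed`, DistributedSampler shuffles with its default seed (0), so
    # changing the experiment's random seed does not change the data order.
    sampler = DistributedSampler(
        train_dataset,
        num_replicas=1, rank=0,  # single-process values, for illustration
        shuffle=True,
        seed=RANDOM_SEED,        # the proposed fix
    )

    # Per-epoch reshuffling still works: the sampler seeds with (seed + epoch).
    for epoch in range(3):
        sampler.set_epoch(epoch)
        print(list(sampler)[:5])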

    opened by keeganq 0
  • Decoder Attention Weight Visualization

    Hi, thanks for the awesome code base!

    I'm looking to produce visualizations of decoder attention weights similar to those shown in the paper, but I don't think this feature is implemented in the published code (although I may have overlooked it!).

    As best I can tell, the way this would be done is by using a new TransformerDecoderLayer which returns the multihead attention's attn_output_weights in its forward method. The visualized attention weights when predicting a single token would then be the average of these weights across all heads. The problem that I am finding is that the visualized weights seem to mostly appear in the center of the image during captioning on the coco dataset, but the results in the paper show reasonable variation in these weights as tokens are predicted.

    Is this the method that you used to create the visualization? Any insight into how this was previously done would be appreciated!
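
    A minimal sketch of the approach described above, using PyTorch's nn.MultiheadAttention (dimensions hypothetical; recent PyTorch versions return head-averaged weights by default):

    import torch
    import torch.nn as nn

    mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

    caption_tokens = torch.randn(1, 12, 512)  # decoder states for 12 tokens
    image_features = torch.randn(1, 49, 512)  # 7x7 grid of visual features

    # attn_weights: (1, 12, 49), averaged over the 8 heads.
    _, attn_weights = mha(caption_tokens, image_features, image_features)
    heatmaps = attn_weights.reshape(1, 12, 7, 7)  # one 7x7 map per caption token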

    opened by keeganq 0
  • Add Docker environment & web demo

    Hey @kdexd! 👋

    This pull request makes it possible to run your model inside a Docker environment, which makes it easier for other people to run it. We're using an open source tool called Cog to make this process easier.

    This also means we can make a web page where other people can try out your model! We've implemented image captioning, but it should be pretty easy to add other tasks too if you'd like. View it here: https://replicate.ai/kdexd/virtex-image-captioning

    That page also has instructions on how to use the Docker image, which is on our registry at r8.im/kdexd/virtex-image-captioning.

    In case you're wondering who the heck I am, I'm from Replicate, where we're trying to make machine learning reproducible. So many cool models are being made, but I got frustrated that I couldn't run them, hence we're trying to fix that. :)

    opened by bfirsh 0
  • Fine tuning Virtex for image captioning

    Hi there, I am aware that VirTex used image captioning as a pretraining task and not as the "final goal", but I was wondering whether one could fine-tune the pretrained model (e.g. bicaptioning_R_50_L1_H2048) with additional COCO Captions-like data in order to obtain an improved captioning model. Has anyone tried that, or does anyone have suggestions on how to do it? Can any of the scripts in the repository be used/adapted for fine-tuning existing models? Thanks a lot! :)

    opened by freeIsa 1
Releases (v1.4)
  • v1.4(Jan 9, 2022)

    Major changes

    • Python 3.6 support is dropped; the minimum requirement is Python 3.8. All major library versions are bumped to the latest releases (PyTorch, OpenCV, Albumentations, etc.).
    • Model zoo URLs have moved to Dropbox. All pre-trained checkpoint weights are unchanged.
    • Fixed a spike in training loss when resuming training with pretrain_virtex.py.
    • The documentation theme is changed from alabaster to Read the Docs; looks fancier!
  • v1.2(Jul 15, 2021)

    Bug Fix: Beam Search

    The beam search implementation, adapted from AllenNLP, was better suited to LSTM/GRU (recurrent) models than to transformers (autoregressive models). This version removes the "backpointer" trick from the AllenNLP implementation and improves captioning results for all VirTex models. See below: "Old" metrics are v1.1 (arXiv v2) and "New" metrics are v1.2 (arXiv v3).

    [Image: table comparing old (v1.1) and new (v1.2) captioning metrics for all VirTex models]

    This bug does not affect pre-training or other downstream task results. Thanks to Nicolas Carion (@alcinos) and Aishwarya Kamath (@ashkamath) for spotting this issue and helping me to fix it!

    Feature: Nucleus Sampling

    This codebase now supports decoding through Nucleus Sampling, as introduced in The Curious Case of Neural Text Degeneration. Try running the captioning evaluation script with --config-override MODEL.DECODER.NAME nucleus_sampling MODEL.DECODER.NUCLEUS_SIZE 0.9! For consistent behavior with prior versions, the default decoding method remains Beam Search with 5 beams.

    Note: Nucleus sampling will give worse results on COCO Captions specifically, but produces more interesting-sounding language with larger transformers trained on much more data than COCO Captions.
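
    For reference, a minimal standalone sketch of nucleus (top-p) sampling for a single decoding step (illustrative only, not this repo's implementation):

    import torch

    def nucleus_sample(logits: torch.Tensor, nucleus_size: float = 0.9) -> torch.Tensor:
        """Sample a token id from the smallest set of tokens whose cumulative
        probability exceeds `nucleus_size`."""
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        # Drop tokens outside the nucleus; always keep the most likely token.
        keep = cumulative - sorted_probs < nucleus_size
        keep[..., 0] = True
        sorted_probs[~keep] = 0.0
        sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
        choice = torch.multinomial(sorted_probs, num_samples=1)
        return sorted_ids.gather(-1, choice)

    token_id = nucleus_sample(torch.randn(1, 10000))  # (1, 1) sampled token id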

    New config arguments to support this:

    MODEL:
      DECODER:
        # What algorithm to use for decoding. Supported values: {"beam_search",
        # "nucleus_sampling"}.
        NAME: "beam_search"
    
        # Number of beams to decode (1 = greedy decoding). Ignored when decoding
        # through nucleus sampling.
        BEAM_SIZE: 5
    
        # Size of nucleus for sampling predictions. Ignored when decoding through
        # beam search.
        NUCLEUS_SIZE: 0.9
    
        # Maximum length of decoded caption. Decoding may end earlier when [EOS]
        # token is sampled.
        MAX_DECODING_STEPS: 50  # Same as DATA.MAX_CAPTION_LENGTH
    
  • v1.1(Apr 4, 2021)

    This version is a small increment over v1.0 with only cosmetic changes and removal of obsolete code. The final results of models trained with this codebase remain unchanged.

    Removed feature extraction support:

    • Removed virtex.downstream.FeatureExtractor and its usage in scripts/clf_voc07.py. By default, the script will only evaluate on global average pooled features (2048-d), as with the CVPR 2021 paper version.

    • Removed virtex.modules.visual_backbones.BlindVisualBackbone. I introduced it a long time ago for debugging; it is not very useful anymore.

    Two config-related changes:

    1. Renamed config parameters: OPTIM.USE_LOOKAHEAD -> OPTIM.LOOKAHEAD.USE, OPTIM.LOOKAHEAD_ALPHA -> OPTIM.LOOKAHEAD.ALPHA, and OPTIM.LOOKAHEAD_STEPS -> OPTIM.LOOKAHEAD.STEPS.

    2. Renamed TransformerTextualHead to TransformerDecoderTextualHead for clarity. Model names in config also change accordingly: "transformer_postnorm" -> "transdec_postnorm" (same for prenorm).

    These changes may be breaking if you wrote your own config and explicitly added these arguments.

  • v1.0(Mar 7, 2021)

    CVPR 2021 release of VirTex. Code and pre-trained models reproduce the results reported in the paper: https://arxiv.org/abs/2006.06666v2
