Deep ViT Features as Dense Visual Descriptors

Last update: Dec 24, 2022

Overview

dino-vit-features

Official implementation of the paper "Deep ViT Features as Dense Visual Descriptors".

We demonstrate the effectiveness of deep features extracted from a self-supervised, pre-trained ViT model (DINO-ViT) as dense patch descriptors via real-world vision tasks: (a-b) co-segmentation & part co-segmentation: given a set of input images (e.g., 4 input images), we automatically co-segment semantically common foreground objects (e.g., animals), and then further partition them into common parts; (c-d) point correspondence: given a pair of input images, we automatically extract a sparse set of corresponding points. We tackle these tasks by applying only lightweight, simple methodologies such as clustering or binning, to deep ViT features.

Setup

Our code is developed in PyTorch and requires the following modules: tqdm, faiss, timm, matplotlib, pydensecrf, opencv, scikit-learn. We use python=3.9, but the code should run on any version above 3.6. We recommend running the code on a CUDA-capable GPU for faster performance. We recommend setting up the running environment via Anaconda with the following commands:

$ conda env create -f env/dino-vit-feats-env.yml
$ conda activate dino-vit-feats-env

Otherwise, run the following commands in your conda environment:

$ conda install pytorch torchvision torchaudio cudatoolkit=11 -c pytorch
$ conda install tqdm
$ conda install -c conda-forge faiss
$ conda install -c conda-forge timm 
$ conda install matplotlib
$ pip install opencv-python
$ pip install git+https://github.com/lucasb-eyer/pydensecrf.git
$ conda install -c anaconda scikit-learn
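
After installing, it can help to confirm that every dependency is importable before starting a long run. The following is a minimal sketch, not part of the repository. The import names are assumptions based on how these packages are usually imported (for example, opencv-python as cv2 and scikit-learn as sklearn), and find_missing is a hypothetical helper name.

```python
import importlib.util

# Assumed import names for the packages listed above
# (opencv-python imports as cv2, scikit-learn as sklearn).
REQUIRED_MODULES = [
    "torch", "tqdm", "faiss", "timm",
    "matplotlib", "pydensecrf", "cv2", "sklearn",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("missing modules:", ", ".join(missing) if missing else "none")
```

The check only locates modules and does not import them, so it stays fast even when heavy packages such as torch are installed.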

ViT Extractor

We provide a wrapper class for a ViT model to extract dense visual descriptors in extractor.py. You can extract descriptors to .pt files using the following command:

python extractor.py --image_path <image_path> --output_path <output_path>

You can specify the pretrained model using the --model flag with the following options:

dino_vits8, dino_vits16, dino_vitb8, dino_vitb16 from the DINO repo.
vit_small_patch8_224, vit_small_patch16_224, vit_base_patch8_224, vit_base_patch16_224 from the timm repo.

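You can set the stride of the patch-extraction layer with the --stride flag. A stride smaller than the patch size makes neighboring patches overlap, which increases the resolution of the descriptor map.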
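
To see how the stride affects resolution, the sketch below computes the size of the patch grid for a given image size, patch size, and stride. It uses the standard output-size formula for a strided convolution without padding. The function name patch_grid is hypothetical, and the assumption that the extractor applies no padding should be checked against extractor.py.

```python
def patch_grid(height, width, patch_size, stride):
    """Patches per axis for a sliding patch-embedding layer with no padding."""
    if patch_size <= 0 or stride <= 0:
        raise ValueError("patch_size and stride must be positive")
    if height < patch_size or width < patch_size:
        raise ValueError("image is smaller than a single patch")
    rows = 1 + (height - patch_size) // stride
    cols = 1 + (width - patch_size) // stride
    return rows, cols


if __name__ == "__main__":
    # A 224x224 image with 8x8 patches: halving the stride roughly doubles
    # the number of descriptors along each axis.
    print(patch_grid(224, 224, patch_size=8, stride=8))  # (28, 28)
    print(patch_grid(224, 224, patch_size=8, stride=4))  # (55, 55)
```

Note that the number of descriptors grows roughly quadratically as the stride shrinks, so memory and runtime grow with it.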

Part Co-segmentation

We provide a notebook for running on a single example in part_cosegmentation.ipynb.

To run on several image sets, arrange each set in a directory, inside a data root directory:


   
    
<sets_root_name>
|
|_ <set1_name>
|  |
|  |_ img1.png
|  |_ img2.png
|
|_ <set2_name>
   |
   |_ img1.png
   |_ img2.png
   |_ img3.png
...

The following command will produce results in the specified <save_dir>:

python part_cosegmentation.py --root_dir <sets_root_name> --save_dir <save_dir>

Note: The default configuration in part_cosegmentation.ipynb is suited for small sets (e.g., fewer than 10 images). Increase num_crop_augmentations for more stable results, at the cost of increased runtime. The default configuration in part_cosegmentation.py is suited for larger sets (e.g., well over 10 images).
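
Before launching a run over many sets, the directory layout described above can be checked with a short script. This is a minimal sketch and not part of the repository. The helper name list_image_sets and the set of accepted image extensions are assumptions.

```python
import tempfile
from pathlib import Path

# Assumed image extensions; adjust to match your data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_image_sets(root_dir):
    """Map each set directory name to the sorted image file names inside it."""
    sets = {}
    for set_dir in sorted(Path(root_dir).iterdir()):
        if not set_dir.is_dir():
            continue  # ignore stray files placed directly under the root
        sets[set_dir.name] = sorted(
            p.name
            for p in set_dir.iterdir()
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
        )
    return sets


if __name__ == "__main__":
    # Build a throwaway root with two sets, mirroring the layout above.
    with tempfile.TemporaryDirectory() as root:
        for set_name, count in (("set1", 2), ("set2", 3)):
            set_dir = Path(root) / set_name
            set_dir.mkdir()
            for i in range(1, count + 1):
                (set_dir / f"img{i}.png").touch()
        for name, images in list_image_sets(root).items():
            print(name, images)
```

A set that maps to an empty list is a quick signal that a directory was misplaced or uses an unexpected file extension. The same check applies to the layouts used in the co-segmentation and point correspondence sections.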

Co-segmentation

We provide a notebook for running on a single example in cosegmentation.ipynb.

To run on several image sets, arrange each set in a directory, inside a data root directory:


   
    
<sets_root_name>
|
|_ <set1_name>
|  |
|  |_ img1.png
|  |_ img2.png
|
|_ <set2_name>
   |
   |_ img1.png
   |_ img2.png
   |_ img3.png
...

The following command will produce results in the specified <save_dir>:

python cosegmentation.py --root_dir <sets_root_name> --save_dir <save_dir>

Point Correspondences

We provide a notebook for running on a single example in correspondences.ipynb.

To run on several image pairs, arrange each image pair in a directory, inside a data root directory:


   
    
<pairs_root_name>
|
|_ <pair1_name>
|  |
|  |_ img1.png
|  |_ img2.png
|
|_ <pair2_name>
   |
   |_ img1.png
   |_ img2.png
...

The following command will produce results in the specified <save_dir>:

python correspondences.py --root_dir <pairs_root_name> --save_dir <save_dir>
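
To give a feel for how a sparse set of correspondences can be obtained from two sets of patch descriptors, the sketch below keeps only mutual nearest neighbors under cosine similarity. It is an illustration with toy descriptors in plain Python and makes no claim to match the logic in correspondences.py. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] are each other's best match."""
    if not desc_a or not desc_b:
        return []
    sim = [[cosine_similarity(a, b) for b in desc_b] for a in desc_a]
    # Best match in B for every descriptor in A, and vice versa.
    best_b_for_a = [
        max(range(len(desc_b)), key=lambda j: sim[i][j]) for i in range(len(desc_a))
    ]
    best_a_for_b = [
        max(range(len(desc_a)), key=lambda i: sim[i][j]) for j in range(len(desc_b))
    ]
    # Keep a pair only when the choice is reciprocal.
    return [(i, j) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]


if __name__ == "__main__":
    image_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D descriptors
    image_b = [[0.0, 2.0], [3.0, 0.0]]
    print(mutual_nearest_neighbors(image_a, image_b))  # [(0, 1), (1, 0)]
```

The reciprocity test discards ambiguous descriptors, such as the third one in image_a, which is why the resulting set of matches is sparse.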

Citation

If you found this repository useful, please consider starring ⭐ and citing:

@article{amir2021deep,
    author    = {Shir Amir and Yossi Gandelsman and Shai Bagon and Tali Dekel},
    title     = {Deep ViT Features as Dense Visual Descriptors},
    journal   = {arXiv preprint arXiv:2112.05814},
    year      = {2021}
}


Owner

Shir Amir
