"Exploring Vision Transformers for Fine-grained Classification" at CVPRW FGVC8

Last update: Dec 06, 2022

Overview

FGVC8

Exploring Vision Transformers for Fine-grained Classification paper presented at the CVPR 2021, The Eight Workshop on Fine-Grained Visual Categorization on June 25th.

Abstract

Existing computer vision research in categorization struggles with fine-grained attributes recognition due to the inherently high intra-class variances and low inter-class variances. SOTA methods tackle this challenge by locating the most informative image regions and rely on them to classify the complete image. The most recent work, Vision Transformer (ViT), shows its strong performance in both traditional and fine-grained classification tasks.

In this work, we propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes using the inherent multi-head self-attention mechanism. We also introduce attention-guided augmentations for improving the model's capabilities.

We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology. We also prove our model's interpretability via qualitative results.

Instructions

Upcoming

Citation

If you find interesting our results, or you use or code/ideas please consider to cite our work:

@misc{conde2021exploring,
      title={Exploring Vision Transformers for Fine-grained Classification}, 
      author={Marcos V. Conde and Kerem Turgutlu},
      year={2021},
      eprint={2106.10587},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

"Exploring Vision Transformers for Fine-grained Classification" at CVPRW FGVC8

Related tags

Overview

FGVC8

Abstract

Instructions

Citation

References

Owner

Marcos V. Conde

A programming language written with python

Data for "Driving the Herd: Search Engines as Content Influencers" paper

This project contains an implemented version of Face Detection using OpenCV and Mediapipe. This is a code snippet and can be used in projects.

StyleGAN2-ada for practice

traiNNer is an open source image and video restoration (super-resolution, denoising, deblurring and others) and image to image translation toolbox based on PyTorch.

Official Pytorch implementation of "Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video", CVPR 2021

This repo uses a combination of logits and feature distillation method to teach the PSPNet model of ResNet18 backbone with the PSPNet model of ResNet50 backbone. All the models are trained and tested on the PASCAL-VOC2012 dataset.

The code of paper "Block Modeling-Guided Graph Convolutional Neural Networks".

Statistical-Rethinking-with-Python-and-PyMC3 - Python/PyMC3 port of the examples in " Statistical Rethinking A Bayesian Course with Examples in R and Stan" by Richard McElreath

DeepStruc is a Conditional Variational Autoencoder which can predict the mono-metallic nanoparticle from a Pair Distribution Function.

Code for LIGA-Stereo Detector, ICCV'21

A simple algorithm for extracting tree height in sparse scene from point cloud data.

Vision-Language Transformer and Query Generation for Referring Segmentation (ICCV 2021)

IPATool-py: download ipa easily

How to train a CNN to 99% accuracy on MNIST in less than a second on a laptop

Python script that allows you to automatically setup your Growtopia server.

NFT-Price-Prediction-CNN - Using visual feature extraction, prices of NFTs are predicted via CNN (Alexnet and Resnet) architectures.

Adversarial Attacks on Probabilistic Autoregressive Forecasting Models.

automatic color-grading

Stochastic Scene-Aware Motion Prediction