Vision transformers (ViTs) have found only limited practical use in processing images

Last update: Sep 10, 2022


Overview

CXV

Convolutional Xformers for Vision

Vision transformers (ViTs) have found only limited practical use in processing images, in spite of their state-of-the-art accuracy on certain benchmarks. The reasons for their limited use include their need for larger training datasets and more computational resources compared to convolutional neural networks (CNNs), owing to the quadratic complexity of their self-attention mechanism. We propose a linear attention-convolution hybrid architecture, Convolutional Xformers for Vision (CXV), to overcome these limitations. We replace the quadratic attention with linear attention mechanisms, such as Performer, Nyströmformer, and Linear Transformer, to reduce GPU usage. An inductive prior for image data is provided by convolutional sub-layers, thereby eliminating the need for the class token and positional embeddings used by ViTs. CXV outperforms other architectures, including token mixers (e.g. ConvMixer, FNet, and MLP Mixer), transformer models (e.g. ViT, CCT, CvT, and hybrid Xformers), and ResNets, for image classification in scenarios with limited data and GPU resources.
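To illustrate why swapping quadratic attention for a linear variant reduces cost, here is a minimal NumPy sketch of the reordering trick used by the Linear Transformer (one of the three mechanisms named above). It is not taken from this repository: the function names, the token count `n`, and the feature size `d` are assumptions chosen for illustration, and the `elu(x) + 1` feature map is the one from the Linear Transformer formulation. Performer and Nyströmformer use different approximations that are not shown here.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a strictly positive feature map: x + 1 for x > 0, exp(x) otherwise.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def quadratic_attention(q, k, v):
    """Kernelized attention that materializes the full (n, n) weight matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                          # (n, n): quadratic in n
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v                                # (n, d)


def linear_attention(q, k, v):
    """Same result, reordered as phi(Q) @ (phi(K)^T @ V): linear in n."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                                  # (d, d): independent of n
    z = phi_k.sum(axis=0)                             # (d,): normalizer terms
    return (phi_q @ kv) / (phi_q @ z)[:, None]        # (n, d)


# Hypothetical sizes: 64 tokens (e.g. an 8x8 feature map) with 16 features each.
rng = np.random.default_rng(0)
n, d = 64, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)
print(np.allclose(out_quadratic, out_linear))
```

Both functions compute the same output; the linear version never builds the `(n, n)` matrix, so its memory and compute grow with the number of tokens rather than with its square.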

Models:

CNV - Convolutional Nyströmformer for Vision
CPV - Convolutional Performer for Vision
CLTV - Convolutional Linear Transformer for Vision


Owner

Cloudwalker
