The official implementation for "FQ-ViT: Fully Quantized Vision Transformer without Retraining".

Overview

FQ-ViT [arXiv]

This repo contains the official implementation of "FQ-ViT: Fully Quantized Vision Transformer without Retraining".

Table of Contents

Introduction

Transformer-based architectures have achieved competitive performance in various CV tasks. Compared to the CNNs, Transformers usually have more parameters and higher computational costs, presenting a challenge when deployed to resource-constrained hardware devices.

Most existing quantization approaches are designed and tested on CNNs and lack proper handling of Transformer-specific modules. Previous work found there would be significant accuracy degradation when quantizing LayerNorm and Softmax of Transformer-based architectures. As a result, they left LayerNorm and Softmax unquantized with floating-point numbers. We revisit these two exclusive modules of the Vision Transformers and discover the reasons for degradation. In this work, we propose the FQ-ViT, the first fully quantized Vision Transformer, which contains two specific modules: Powers-of-Two Scale (PTS) and Log-Int-Softmax (LIS).

Layernorm quantized with Powers-of-Two Scale (PTS)

These two figures below show that there exists serious inter-channel variation in Vision Transformers than CNNs, which leads to unacceptable quantization errors with layer-wise quantization.

Taking the advantages of both layer-wise and channel-wise quantization, we propose PTS for LayerNorm's quantization. The core idea of PTS is to equip different channels with different Powers-of-Two Scale factors, rather than different quantization scales.

Softmax quantized with Log-Int-Softmax (LIS)

The storage and computation of attention map is known as a bottleneck for transformer structures, so we want to quantize it to extreme lower bit-width (e.g. 4-bit). However, if directly implementing 4-bit uniform quantization, there will be severe accuracy degeneration. We observe a distribution centering at a fairly small value of the output of Softmax, while only few outliers have larger values close to 1. Based on the following visualization, Log2 preserves more quantization bins than uniform for the small value interval with dense distribution.

Combining Log2 quantization with i-exp, which is a polynomial approximation of exponential function presented by I-BERT, we propose LIS, an integer-only, faster, low consuming Softmax.

The whole process is visualized as follow.

Getting Started

Install

  • Clone this repo.
git clone https://github.com/linyang-zhh/FQ-ViT.git
cd FQ-ViT
  • Create a conda virtual environment and activate it.
conda create -n fq-vit python=3.7 -y
conda activate fq-vit
  • Install PyTorch and torchvision. e.g.,
conda install pytorch=1.7.1 torchvision cudatoolkit=10.1 -c pytorch

Data preparation

You should download the standard ImageNet Dataset.

├── imagenet
│   ├── train
|
│   ├── val

Run

Example: Evaluate quantized DeiT-S with MinMax quantizer and our proposed PTS and LIS

python test_quant.py deit_small <YOUR_DATA_DIR> --quant --pts --lis --quant-method minmax
  • deit_small: model architecture, which can be replaced by deit_tiny, deit_base, vit_base, vit_large, swin_tiny, swin_small and swin_base.

  • --quant: whether to quantize the model.

  • --pts: whether to use Power-of-Two Scale Integer Layernorm.

  • --lis: whether to use Log-Integer-Softmax.

  • --quant-method: quantization methods of activations, which can be chosen from minmax, ema, percentile and omse.

Results on ImageNet

This paper employs several current post-training quantization strategies together with our methods, including MinMax, EMA , Percentile and OMSE.

  • MinMax uses the minimum and maximum values of the total data as the clipping values;

  • EMA is based on MinMax and uses an average moving mechanism to smooth the minimum and maximum values of different mini-batch;

  • Percentile assumes that the distribution of values conforms to a normal distribution and uses the percentile to clip. In this paper, we use the 1e-5 percentile because the 1e-4 commonly used in CNNs has poor performance in Vision Transformers.

  • OMSE determines the clipping values by minimizing the quantization error.

The following results are evaluated on ImageNet.

Method W/A/Attn Bits ViT-B ViT-L DeiT-T DeiT-S DeiT-B Swin-T Swin-S Swin-B
Full Precision 32/32/32 84.53 85.81 72.21 79.85 81.85 81.35 83.20 83.60
MinMax 8/8/8 23.64 3.37 70.94 75.05 78.02 64.38 74.37 25.58
MinMax w/ PTS 8/8/8 83.31 85.03 71.61 79.17 81.20 80.51 82.71 82.97
MinMax w/ PTS, LIS 8/8/4 82.68 84.89 71.07 78.40 80.85 80.04 82.47 82.38
EMA 8/8/8 30.30 3.53 71.17 75.71 78.82 70.81 75.05 28.00
EMA w/ PTS 8/8/8 83.49 85.10 71.66 79.09 81.43 80.52 82.81 83.01
EMA w/ PTS, LIS 8/8/4 82.57 85.08 70.91 78.53 80.90 80.02 82.56 82.43
Percentile 8/8/8 46.69 5.85 71.47 76.57 78.37 78.78 78.12 40.93
Percentile w/ PTS 8/8/8 80.86 85.24 71.74 78.99 80.30 80.80 82.85 83.10
Percentile w/ PTS, LIS 8/8/4 80.22 85.17 71.23 78.30 80.02 80.46 82.67 82.79
OMSE 8/8/8 73.39 11.32 71.30 75.03 79.57 79.30 78.96 48.55
OMSE w/ PTS 8/8/8 82.73 85.27 71.64 78.96 81.25 80.64 82.87 83.07
OMSE w/ PTS, LIS 8/8/4 82.37 85.16 70.87 78.42 80.90 80.41 82.57 82.45

Citation

If you find this repo useful in your research, please consider citing the following paper:

@misc{
    lin2021fqvit,
    title={FQ-ViT: Fully Quantized Vision Transformer without Retraining}, 
    author={Yang Lin and Tianyu Zhang and Peiqin Sun and Zheng Li and Shuchang Zhou},
    year={2021},
    eprint={2111.13824},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Cutoff: A Simple Data Augmentation Approach for Natural Language This repository contains source code necessary to reproduce the results presented in

Dinghan Shen 49 Dec 22, 2022
High-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently.

TL;DR Ignite is a high-level library to help with training and evaluating neural networks in PyTorch flexibly and transparently. Click on the image to

4.2k Jan 01, 2023
Pocsploit is a lightweight, flexible and novel open source poc verification framework

Pocsploit is a lightweight, flexible and novel open source poc verification framework

cckuailong 208 Dec 24, 2022
smc.covid is an R package related to the paper A sequential Monte Carlo approach to estimate a time varying reproduction number in infectious disease models: the COVID-19 case by Storvik et al

smc.covid smc.covid is an R package related to the paper A sequential Monte Carlo approach to estimate a time varying reproduction number in infectiou

0 Oct 15, 2021
Software for Multimodalty 2D+3D Facial Expression Recognition (FER) UI

EmotionUI Software for Multimodalty 2D+3D Facial Expression Recognition (FER) UI. demo screenshot (with RealSense) required packages Python = 3.6 num

Yang Jiao 2 Dec 23, 2021
Data Augmentation with Variational Autoencoders

Documentation Pyraug This library provides a way to perform Data Augmentation using Variational Autoencoders in a reliable way even in challenging con

112 Nov 30, 2022
SeMask: Semantically Masked Transformers for Semantic Segmentation.

SeMask: Semantically Masked Transformers Jitesh Jain, Anukriti Singh, Nikita Orlov, Zilong Huang, Jiachen Li, Steven Walton, Humphrey Shi This repo co

Picsart AI Research (PAIR) 186 Dec 30, 2022
Delving into Localization Errors for Monocular 3D Object Detection, CVPR'2021

Delving into Localization Errors for Monocular 3D Detection By Xinzhu Ma, Yinmin Zhang, Dan Xu, Dongzhan Zhou, Shuai Yi, Haojie Li, Wanli Ouyang. Intr

XINZHU.MA 124 Jan 04, 2023
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization

BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization Authors: Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong,

Salesforce 125 Dec 31, 2022
A deep learning model for style-specific music generation.

DeepJ: A model for style-specific music generation https://arxiv.org/abs/1801.00887 Abstract Recent advances in deep neural networks have enabled algo

Henry Mao 704 Nov 23, 2022
CAR-API: Cityscapes Attributes Recognition API

CAR-API: Cityscapes Attributes Recognition API This is the official api to download and fetch attributes annotations for Cityscapes Dataset. Content I

Kareem Metwaly 5 Dec 22, 2022
Codes for Causal Semantic Generative model (CSG), the model proposed in "Learning Causal Semantic Representation for Out-of-Distribution Prediction" (NeurIPS-21)

Learning Causal Semantic Representation for Out-of-Distribution Prediction This repository is the official implementation of "Learning Causal Semantic

Chang Liu 54 Dec 01, 2022
An NLP library with Awesome pre-trained Transformer models and easy-to-use interface, supporting wide-range of NLP tasks from research to industrial applications.

简体中文 | English News [2021-10-12] PaddleNLP 2.1版本已发布!新增开箱即用的NLP任务能力、Prompt Tuning应用示例与生成任务的高性能推理! 🎉 更多详细升级信息请查看Release Note。 [2021-08-22]《千言:面向事实一致性的生

6.9k Jan 01, 2023
A PyTorch implementation of the architecture of Mask RCNN

EDIT (AS OF 4th NOVEMBER 2019): This implementation has multiple errors and as of the date 4th, November 2019 is insufficient to be utilized as a reso

Sai Himal Allu 975 Dec 30, 2022
Acoustic mosquito detection code with Bayesian Neural Networks

HumBugDB Acoustic mosquito detection with Bayesian Neural Networks. Extract audio or features from our large-scale dataset on Zenodo. This repository

31 Nov 28, 2022
An Api for Emotion recognition.

PLAYEMO Playemo was built from the ground-up with Flask, a python tool that makes it easy for developers to build APIs. Use Cases Is Python your langu

greek geek 2 Jul 16, 2022
Out-of-distribution detection using the pNML regret. NeurIPS2021

OOD Detection Load conda environment conda env create -f environment.yml or install requirements: while read requirement; do conda install --yes $requ

Koby Bibas 23 Dec 02, 2022
Repository for publicly available deep learning models developed in Rosetta community

trRosetta2 This package contains deep learning models and related scripts used by Baker group in CASP14. Installation Linux/Mac clone the package git

81 Dec 29, 2022
COLMAP - Structure-from-Motion and Multi-View Stereo

COLMAP About COLMAP is a general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline with a graphical and command-line interface.

4.7k Jan 07, 2023