Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Overview

Transformer Quantization

This repository contains the implementation and experiments for the paper presented in

Yelysei Bondarenko1, Markus Nagel1, Tijmen Blankevoort1, "Understanding and Overcoming the Challenges of Efficient Transformer Quantization", EMNLP 2021. [ACL Anthology] [ArXiv]

1 Qualcomm AI Research (Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.)

Reference

If you find our work useful, please cite

@inproceedings{bondarenko-etal-2021-understanding,
    title = "Understanding and Overcoming the Challenges of Efficient Transformer Quantization",
    author = "Bondarenko, Yelysei  and
      Nagel, Markus  and
      Blankevoort, Tijmen",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.627",
    pages = "7947--7969",
    abstract = "Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges {--} namely, high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme {--} per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at \url{https://github.com/qualcomm-ai-research/transformer-quantization}.",
}

How to install

First, ensure locale variables are set as follows:

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

Second, make sure to have Python ≥3.6 (tested with Python 3.6.8) and ensure the latest version of pip (tested with 21.2.4):

pip install --upgrade --no-deps pip

Next, install PyTorch 1.4.0 with the appropriate CUDA version (tested with CUDA 10.0, CuDNN 7.6.3):

pip install torch==1.4.0 torchvision==0.5.0 -f https://download.pytorch.org/whl/torch_stable.html

Finally, install the remaining dependencies using pip:

pip install -r requirements.txt

To run the code, the project root directory needs to be added to your pythonpath:

export PYTHONPATH="${PYTHONPATH}:/path/to/this/dir"

Running experiments

The main run file to reproduce all experiments is main.py. It contains 4 commands to train and validate FP32 and quantized model:

Usage: main.py [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  train-baseline
  train-quantized
  validate-baseline
  validate-quantized

You can see the full list of options for each command using python main.py [COMMAND] --help.

A. FP32 fine-tuning

To start with, you need to get the fune-tuned model(s) for the GLUE task of interest. Example run command for fine-tuning:

python main.py train-baseline --cuda --save-model --model-name bert_base_uncased --task rte \
    --learning-rate 3e-05 --batch-size 8 --eval-batch-size 8 --num-epochs 3 --max-seq-length 128 \
    --seed 1000 --output-dir /path/to/output/dir/

You can also do it directly using HuggingFace library [examples]. In all experiments we used seeds 1000 - 1004 and reported the median score. The sample output directory looks as follows:

/path/to/output/dir
├── config.out
├── eval_results_rte.txt
├── final_score.txt
├── out
│   ├── config.json  # Huggingface model config
│   ├── pytorch_model.bin  # PyTorch model checkpoint
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json  # Huggingface tokenizer config
│   ├── training_args.bin
│   └── vocab.txt  # Vocabulary
└── tb_logs  # TensorBoard logs
    ├── 1632747625.1250594
    │   └── events.out.tfevents.*
    └── events.out.tfevents.*

For validation (both full-precision and quantized), it is assumed that these output directories with the fine-tuned checkpoints are aranged as follows (you can also use a subset of GLUE tasks):

/path/to/saved_models/
├── rte/rte_model_dir
│   ├── out
│   │   ├── config.json  # Huggingface model config
│   │   ├── pytorch_model.bin  # PyTorch model checkpoint
│   │   ├── tokenizer_config.json  # Huggingface tokenizer config
│   │   ├── vocab.txt  # Vocabulary
│   │   ├── (...)
├── cola/cola_model_dir
│   ├── out
│   │   ├── (...)
├── mnli/mnli_model_dir
│   ├── out
│   │   ├── (...)
├── mrpc/mrpc_model_dir
│   ├── out
│   │   ├── (...)
├── qnli/qnli_model_dir
│   ├── out
│   │   ├── (...)
├── qqp/qqp_model_dir
│   ├── out
│   │   ├── (...)
├── sst2/sst2_model_dir
│   ├── out
│   │   ├── (...)
└── stsb/stsb_model_dir
    ├── out
    │   ├── (...)

Note, that you have to create this file structure manually.

The model can then be validated as follows:

python main.py validate-baseline --eval-batch-size 32 --seed 1000 --model-name bert_base_uncased \
    --model-path /path/to/saved_models/ --task rte

You can also validate multiple or all checkpoints by specifying --task --task [...] or --task all, respectively.

B. Post-training quantization (PTQ)

1) Standard (naïve) W8A8 per-tensor PTQ / base run command for all PTQ experiments

python main.py validate-quantized --act-quant --weight-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --n-bits 8 --n-bits-act 8 --qmethod symmetric_uniform \
	--qmethod-act asymmetric_uniform --weight-quant-method MSE --weight-opt-method golden_section \
	--act-quant-method current_minmax --est-ranges-batch-size 1 --num-est-batches 1 \
	--quant-setup all

Note that the range estimation settings are slightly different for each task.

2) Mixed precision W8A{8,16} PTQ

Specify --quant-dict "{'y': 16, 'h': 16, 'x': 16}":

  • 'x': 16 will set FFN's input to 16-bit
  • 'h': 16 will set FFN's output to 16-bit
  • 'y': 16 will set FFN's residual sum to 16-bit

For STS-B regression task, you will need to specify --quant-dict "{'y': 16, 'h': 16, 'x': 16, 'P': 16, 'C': 16}" and --quant-setup MSE_logits, which will also quantize pooler and the final classifier to 16-bit and use MSE estimator for the output.

3) Per-embedding and per-embedding-group (PEG) activation quantization

  • --per-embd -- Per-embedding quantization for all activations
  • --per-groups [N_GROUPS] -- PEG quantization for all activations, no permutation
  • --per-groups [N_GROUPS] --per-groups-permute -- PEG quantization for all activations, apply range-based permutation (separate for each quantizer)
  • --quant-dict "{'y': 'ng6', 'h': 'ng6', 'x': 'ng6'}" -- PEG quantization using 6 groups for FFN's input, output and residual sum, no permutation
  • --quant-dict "{'y': 'ngp6', 'h': 'ngp6', 'x': 'ngp6'}" --per-groups-permute-shared-h -- PEG quantization using 6 groups for FFN's input, output and residual sum, apply range-based permutation (shared between tensors in the same layer)

4) W4A32 PTQ with AdaRound

python main.py validate-quantized --weight-quant --no-act-quant --no-pad-to-max-length \
	--est-ranges-no-pad --eval-batch-size 16 --seed 1000 --model-path /path/to/saved_models/ \
	--task rte --qmethod symmetric_uniform --qmethod-act asymmetric_uniform --n-bits 4 \
	--weight-quant-method MSE --weight-opt-method grid --num-candidates 100 --quant-setup all \
	--adaround all --adaround-num-samples 1024 --adaround-init range_estimator \
	--adaround-mode learned_hard_sigmoid --adaround-asym --adaround-iters 10000 \
	--adaround-act-quant no_act_quant

C. Quantization-aware training (QAT)

Base run command for QAT experiments (using W4A8 for example):

python main.py train-quantized --cuda --do-eval --logging-first-step --weight-quant --act-quant \
	--pad-to-max-length --learn-ranges --tqdm --batch-size 8 --seed 1000 \
	--model-name bert_base_uncased --learning-rate 5e-05 --num-epochs 6 --warmup-steps 186 \
	--weight-decay 0.0 --attn-dropout 0.0 --hidden-dropout 0.0 --max-seq-length 128 --n-bits 4 \
	--n-bits-act 8 --qmethod symmetric_uniform --qmethod-act asymmetric_uniform \
	--weight-quant-method MSE --weight-opt-method golden_section --act-quant-method current_minmax \
	--est-ranges-batch-size 16 --num-est-batches 1 --quant-setup all \
	--model-path /path/to/saved_models/rte/out --task rte --output-dir /path/to/qat_output/dir

Note that the settings are slightly different for each task (see Appendix).

To run mixed-precision QAT with 2-bit embeddings and 4-bit weights, add --quant-dict "{'Et': 2}".

Owner
An initiative of Qualcomm Technologies, Inc.
Official repository of OFA. Paper: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Paper | Blog OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image gene

OFA Sys 1.4k Jan 08, 2023
Implementation of Convolutional LSTM in PyTorch.

ConvLSTM_pytorch This file contains the implementation of Convolutional LSTM in PyTorch made by me and DavideA. We started from this implementation an

Andrea Palazzi 1.3k Dec 29, 2022
Code for Towards Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games

Unifying Behavioral and Response Diversity for Open-ended Learning in Zero-sum Games How to run our algorithm? Create the new environment using: conda

MARL @ SJTU 8 Dec 27, 2022
Second-Order Neural ODE Optimizer, NeurIPS 2021 spotlight

Second-order Neural ODE Optimizer (NeurIPS 2021 Spotlight) [arXiv] ✔️ faster convergence in wall-clock time | ✔️ O(1) memory cost | ✔️ better test-tim

Guan-Horng Liu 39 Oct 22, 2022
Solutions and questions for AoC2021. Merry christmas!

Advent of Code 2021 Merry christmas! 🎄 🎅 To get solutions and approximate execution times for implementations, please execute the run.py script in t

Wilhelm Ågren 5 Dec 29, 2022
An implementation of Deep Graph Infomax (DGI) in PyTorch

DGI Deep Graph Infomax (Veličković et al., ICLR 2019): https://arxiv.org/abs/1809.10341 Overview Here we provide an implementation of Deep Graph Infom

Petar Veličković 491 Jan 03, 2023
darija <-> english dictionary

darija-dictionary Having advanced IT solutions that are well adapted to the Moroccan context passes inevitably through understanding Moroccan dialect.

DODa 102 Jan 01, 2023
Experiments on Flood Segmentation on Sentinel-1 SAR Imagery with Cyclical Pseudo Labeling and Noisy Student Training

Flood Detection Challenge This repository contains code for our submission to the ETCI 2021 Competition on Flood Detection (Winning Solution #2). Acco

Siddha Ganju 108 Dec 28, 2022
Mouse Brain in the Model Zoo

Deep Neural Mouse Brain Modeling This is the repository for the ongoing deep neural mouse modeling project, an attempt to characterize the representat

Colin Conwell 15 Aug 22, 2022
RetinaFace: Deep Face Detection Library in TensorFlow for Python

RetinaFace is a deep learning based cutting-edge facial detector for Python coming with facial landmarks.

Sefik Ilkin Serengil 512 Dec 29, 2022
Export CenterPoint PonintPillars ONNX Model For TensorRT

CenterPoint-PonintPillars Pytroch model convert to ONNX and TensorRT Welcome to CenterPoint! This project is fork from tianweiy/CenterPoint. I impleme

CarkusL 149 Dec 13, 2022
Benchmark for Answering Existential First Order Queries with Single Free Variable

EFO-1-QA Benchmark for First Order Query Estimation on Knowledge Graphs This repository contains an entire pipeline for the EFO-1-QA benchmark. EFO-1

HKUST-KnowComp 14 Oct 24, 2022
Automatically creates genre collections for your Plex media

Plex Auto Genres Plex Auto Genres is a simple script that will add genre collection tags to your media making it much easier to search for genre speci

Shane Israel 63 Dec 31, 2022
Overview of architecture and implementation of TEDS-Net, as described in MICCAI 2021: "TEDS-Net: Enforcing Diffeomorphisms in Spatial Transformers to Guarantee TopologyPreservation in Segmentations"

TEDS-Net Overview of architecture and implementation of TEDS-Net, as described in MICCAI 2021: "TEDS-Net: Enforcing Diffeomorphisms in Spatial Transfo

Madeleine K Wyburd 14 Jan 04, 2023
Do you like Quick, Draw? Well what if you could train/predict doodles drawn inside Streamlit? Also draws lines, circles and boxes over background images for annotation.

Streamlit - Drawable Canvas Streamlit component which provides a sketching canvas using Fabric.js. Features Draw freely, lines, circles, boxes and pol

Fanilo Andrianasolo 325 Dec 28, 2022
PyTorch trainer and model for Sequence Classification

PyTorch-trainer-and-model-for-Sequence-Classification After cloning the repository, modify your training data so that the training data is a .csv file

NhanTieu 2 Dec 09, 2022
A Python-based development platform for automated trading systems - from backtesting to optimisation to livetrading.

AutoTrader AutoTrader is Python-based platform intended to help in the development, optimisation and deployment of automated trading systems. From sim

Kieran Mackle 485 Jan 09, 2023
Learning Energy-Based Models by Diffusion Recovery Likelihood

Learning Energy-Based Models by Diffusion Recovery Likelihood Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, Diederik P. Kingma Paper: https://arxiv.o

Ruiqi Gao 41 Nov 22, 2022
A distributed deep learning framework that supports flexible parallelization strategies.

FlexFlow FlexFlow is a deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization stra

528 Dec 25, 2022
A Data Annotation Tool for Semantic Segmentation, Object Detection and Lane Line Detection.(In Development Stage)

Data-Annotation-Tool How to Run this Tool? To run this software, follow the steps: git clone https://github.com/Autonomous-Car-Project/Data-Annotation

TiVRA AI 13 Aug 18, 2022