XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale

Overview

XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks

ACL 2020 Microsoft Research [Paper] [Video]

Releasing [XtremeDistilTransformers] with Tensorflow 2.3 and HuggingFace Transformers with an unified API with the following features:

  • Distil any supported pre-trained language models as teachers (e.g, Bert, Electra, Roberta)
  • Initialize student model with any pre-trained model (e.g, MiniLM, DistilBert, TinyBert), or initialize from scratch
  • Multilingual text classification and sequence tagging
  • Distil multiple hidden states from teacher
  • Distil deep attention networks from teacher
  • Pairwise and instance-level classification tasks (e.g, MNLI, MRPC, SST)
  • Progressive knowledge transfer with gradual unfreezing
  • Fast mixed precision training for distillation (e.g, mixed_float16, mixed_bfloat16)
  • ONNX runtime inference

Install requirements pip install -r requirements.txt

Initialize XtremeDistilTransformer with (6/384 pre-trained checkpoint)[https://huggingface.co/microsoft/xtremedistil-l6-h384-uncased] or [TinyBERT] (4/312 pre-trained checkpoint)

Sample usages for distilling different pre-trained language models (tested with Python 3.6.9 and CUDA 10.2)

Training

Sequence Labeling for Wiki NER

PYTHONHASHSEED=42 python run_xtreme_distil.py 
--task $$PT_DATA_DIR/datasets/NER 
--model_dir $$PT_OUTPUT_DIR 
--seq_len 32  
--transfer_file $$PT_DATA_DIR/datasets/NER/unlabeled.txt 
--do_NER 
--pt_teacher TFBertModel 
--pt_teacher_checkpoint bert-base-multilingual-cased 
--student_distil_batch_size 256 
--student_ft_batch_size 32
--teacher_batch_size 128  
--pt_student_checkpoint microsoft/xtremedistil-l6-h384-uncased 
--distil_chunk_size 10000 
--teacher_model_dir $$PT_OUTPUT_DIR 
--distil_multi_hidden_states 
--distil_attention 
--compress_word_embedding 
--freeze_word_embedding
--opt_policy mixed_float16

Text Classification for MNLI

PYTHONHASHSEED=42 python run_xtreme_distil.py 
--task $$PT_DATA_DIR/glue_data/MNLI 
--model_dir $$PT_OUTPUT_DIR 
--seq_len 128  
--transfer_file $$PT_DATA_DIR/glue_data/MNLI/train.tsv 
--do_pairwise 
--pt_teacher TFElectraModel 
--pt_teacher_checkpoint google/electra-base-discriminator 
--student_distil_batch_size 128  
--student_ft_batch_size 32
--pt_student_checkpoint microsoft/xtremedistil-l6-h384-uncased 
--teacher_model_dir $$PT_OUTPUT_DIR 
--teacher_batch_size 32
--distil_chunk_size 300000
--opt_policy mixed_float16

Alternatively, use TinyBert pre-trained student model checkpoint as --pt_student_checkpoint nreimers/TinyBERT_L-4_H-312_v2

Arguments


- task folder contains
	-- train/dev/test '.tsv' files with text and classification labels / token-wise tags (space-separated)
	--- Example 1: feel good about themselves <tab> 1
	--- Example 2: '' Atelocentra '' Meyrick , 1884 <tab> O B-LOC O O O O
	-- label files containing class labels for sequence labeling
	-- transfer file containing unlabeled data
	
- model_dir to store/restore model checkpoints

- task arguments
-- do_pairwise for pairwise classification tasks like MNLI and MRPC
-- do_NER for sequence labeling

- teacher arguments
-- pt_teacher for teacher model to distil (e.g., TFBertModel, TFRobertaModel, TFElectraModel)
-- pt_teacher_checkpoint for pre-trained teacher model checkpoints (e.g., bert-base-multilingual-cased, roberta-large, google/electra-base-discriminator)

- student arguments
-- pt_student_checkpoint to initialize from pre-trained small student models (e.g., MiniLM, DistilBert, TinyBert)
-- instead of pre-trained checkpoint, initialize a raw student from scratch with
--- hidden_size
--- num_hidden_layers
--- num_attention_heads

- distillation features
-- distil_multi_hidden_states to distil multiple hidden states from the teacher
-- distil_attention to distil deep attention network of the teacher
-- compress_word_embedding to initialize student word embedding with SVD-compressed teacher word embedding (useful for multilingual distillation)
-- freeze_word_embedding to keep student word embeddings frozen during distillation (useful for multilingual distillation)
-- opt_policy (e.g., mixed_float16 for GPU and mixed_bfloat16 for TPU)
-- distil_chunk_size for using transfer data in chunks during distillation (reduce for OOM issues, checkpoints are saved after every distil_chunk_size steps)

Model Outputs

The above training code generates intermediate model checkpoints to continue the training in case of abrupt termination instead of starting from scratch -- all saved in $$PT_OUTPUT_DIR. The final output of the model consists of (i) xtremedistil.h5 with distilled model weights, (ii) xtremedistil-config.json with the training configuration, and (iii) word_embedding.npy for the input word embeddings from the student model.

Prediction

PYTHONHASHSEED=42 python run_xtreme_distil_predict.py 
--do_eval 
--model_dir $$PT_OUTPUT_DIR 
--do_predict 
--pred_file ../../datasets/NER/unlabeled.txt
--opt_policy mixed_float16

*ONNX Runtime Inference

You can also use ONXX Runtime for inference speedup with the following script:

PYTHONHASHSEED=42 python run_xtreme_distil_predict_onnx.py 
--do_eval 
--model_dir $$PT_OUTPUT_DIR 
--do_predict 
--pred_file ../../datasets/NER/unlabeled.txt

For details on ONNX Runtime Inference, environment and arguments refer to this Notebook The script is for online inference with batch_size=1.

*Continued Fine-tuning

You can continue fine-tuning the distilled/compressed student model on more labeled data with the following script:

PYTHONHASHSEED=42 python run_xtreme_distil_ft.py --model_dir $$PT_OUTPUT_DIR 

If you use this code, please cite:

@inproceedings{mukherjee-hassan-awadallah-2020-xtremedistil,
    title = "{X}treme{D}istil: Multi-stage Distillation for Massive Multilingual Models",
    author = "Mukherjee, Subhabrata  and
      Hassan Awadallah, Ahmed",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.202",
    pages = "2221--2234",
    abstract = "Deep and large pre-trained language models are the state-of-the-art for various natural language processing tasks. However, the huge size of these models could be a deterrent to using them in practice. Some recent works use knowledge distillation to compress these huge models into shallow ones. In this work we study knowledge distillation with a focus on multilingual Named Entity Recognition (NER). In particular, we study several distillation strategies and propose a stage-wise optimization scheme leveraging teacher internal representations, that is agnostic of teacher architecture, and show that it outperforms strategies employed in prior works. Additionally, we investigate the role of several factors like the amount of unlabeled data, annotation resources, model architecture and inference latency to name a few. We show that our approach leads to massive compression of teacher models like mBERT by upto 35x in terms of parameters and 51x in terms of latency for batch inference while retaining 95{\%} of its F1-score for NER over 41 languages.",
}

Code is released under MIT license.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Angora is a mutation-based fuzzer. The main goal of Angora is to increase branch coverage by solving path constraints without symbolic execution.

Angora Angora is a mutation-based coverage guided fuzzer. The main goal of Angora is to increase branch coverage by solving path constraints without s

833 Jan 07, 2023
This porject is intented to build the most accurate model for predicting the porbability of loan default

Estimating-Loan-Default-Probability IBA ML2 Mid-project / Kaggle Competition This porject is intented to build the most accurate model for predicting

Adil Gahramanov 1 Jan 24, 2022
Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation

Unseen Object Clustering: Learning RGB-D Feature Embeddings for Unseen Object Instance Segmentation Introduction In this work, we propose a new method

NVIDIA Research Projects 132 Dec 13, 2022
Synthetic Scene Text from 3D Engines

Introduction UnrealText is a project that synthesizes scene text images using 3D graphics engine. This repository accompanies our paper: UnrealText: S

Shangbang Long 215 Dec 29, 2022
This is just a funny project that we want to see AutoEncoder (AE) can actually work to enhance the features we want

Funny_muscle_enhancer :) 1.Discription: This is just a funny project that we want to see AutoEncoder (AE) can actually work on the some features. We w

Jing-Yao Chen (Jacob) 8 Oct 01, 2022
constructing maps of intellectual influence from publication data

Influencemap Project @ ANU Influence in the academic communities has been an area of interest for researchers. This can be seen in the popularity of a

CS Metrics 13 Jun 18, 2022
A FAIR dataset of TCV experimental results for validating edge/divertor turbulence models.

TCV-X21 validation for divertor turbulence simulations Quick links Intro Welcome to TCV-X21. We're glad you've found us! This repository is designed t

0 Dec 18, 2021
天勤量化开发包, 期货量化, 实时行情/历史数据/实盘交易

TqSdk 天勤量化交易策略程序开发包 TqSdk 是一个由信易科技发起并贡献主要代码的开源 python 库. 依托快期多年积累成熟的交易及行情服务器体系, TqSdk 支持用户使用极少的代码量构建各种类型的量化交易策略程序, 并提供包含期货、期权、股票的 历史数据-实时数据-开发调试-策略回测-

信易科技 2.8k Dec 30, 2022
Tensorflow Repo for "DeepGCNs: Can GCNs Go as Deep as CNNs?"

DeepGCNs: Can GCNs Go as Deep as CNNs? In this work, we present new ways to successfully train very deep GCNs. We borrow concepts from CNNs, mainly re

Guohao Li 612 Nov 15, 2022
Distance Encoding for GNN Design

Distance-encoding for GNN design This repository is the official PyTorch implementation of the DEGNN and DEAGNN framework reported in the paper: Dista

172 Nov 08, 2022
Source code for the BMVC-2021 paper "SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation".

SimReg: A Simple Regression Based Framework for Self-supervised Knowledge Distillation Source code for the paper "SimReg: Regression as a Simple Yet E

9 Oct 15, 2022
Reproducible research and reusable acyclic workflows in Python. Execute code on HPC systems as if you executed them on your personal computer!

Reproducible research and reusable acyclic workflows in Python. Execute code on HPC systems as if you executed them on your machine! Motivation Would

Joeri Hermans 15 Sep 11, 2022
Fastshap: A fast, approximate shap kernel

fastshap: A fast, approximate shap kernel fastshap was designed to be: Fast Calculating shap values can take an extremely long time. fastshap utilizes

Samuel Wilson 22 Sep 24, 2022
Locally cache assets that are normally streamed in POPULATION: ONE

Population One Localizer This is no longer needed as of the build shipped on 03/03/22, thank you bigbox :) Locally cache assets that are normally stre

Ahman Woods 2 Mar 04, 2022
A tight inclusion function for continuous collision detection

Tight-Inclusion Continuous Collision Detection A conservative Continuous Collision Detection (CCD) method with support for minimum separation. You can

Continuous Collision Detection 89 Jan 01, 2023
PyTorch implementation for Convolutional Networks with Adaptive Inference Graphs

Convolutional Networks with Adaptive Inference Graphs (ConvNet-AIG) This repository contains a PyTorch implementation of the paper Convolutional Netwo

Andreas Veit 176 Dec 07, 2022
Hierarchical Memory Matching Network for Video Object Segmentation (ICCV 2021)

Hierarchical Memory Matching Network for Video Object Segmentation Hongje Seong, Seoung Wug Oh, Joon-Young Lee, Seongwon Lee, Suhyeon Lee, Euntai Kim

Hongje Seong 72 Dec 14, 2022
Official implementation of the paper ``Unifying Nonlocal Blocks for Neural Networks'' (ICCV'21)

Spectral Nonlocal Block Overview Official implementation of the paper: Unifying Nonlocal Blocks for Neural Networks (ICCV'21) Spectral View of Nonloca

91 Dec 14, 2022
Datasets and pretrained Models for StyleGAN3 ...

Datasets and pretrained Models for StyleGAN3 ... Dear arfiticial friend, this is a collection of artistic datasets and models that we have put togethe

lucid layers 34 Oct 06, 2022
Tensorflow implementation of the paper "HumanGPS: Geodesic PreServing Feature for Dense Human Correspondences", CVPR 2021.

HumanGPS: Geodesic PreServing Feature for Dense Human Correspondences Tensorflow implementation of the paper "HumanGPS: Geodesic PreServing Feature fo

Google Interns 50 Dec 21, 2022