Explore extreme compression for pre-trained language models

Last update: Nov 14, 2022


Code for the paper "Exploring Extreme Parameter Compression for Pre-trained Language Models" (ICLR 2022).

Before Training

Install the required libraries

 pip install tensorly==0.5.0

PyTorch is required; versions 1.0 to 1.4 are preferred.
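Because the preferred PyTorch range is narrow, it can help to check the installed version before training. The following is only a sketch: the helpers `version_tuple` and `version_in_range` are hypothetical names, not part of this repository, and they compare only the major and minor components of the version string.

```python
import re


def version_tuple(version):
    """Return (major, minor) from a version string such as '1.4.0+cu100'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def version_in_range(version, lowest, highest):
    """True if lowest <= version <= highest, comparing major.minor only."""
    return version_tuple(lowest) <= version_tuple(version) <= version_tuple(highest)


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        in_range = version_in_range(torch.__version__, "1.0", "1.4")
        status = "within 1.0-1.4" if in_range else "outside 1.0-1.4"
        print("torch", torch.__version__, status)
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.13" would sort before "1.4", while as tuples (1, 13) is correctly treated as newer than (1, 4).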

Install Horovod for distributed training. On GPU machines, follow Horovod's instructions for installing on GPU.

pip install horovod[pytorch]

Download the pre-trained models

wget https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin -P  models/bert-base-uncased
wget https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt -P  models/bert-base-uncased
mkdir -p models/bert-td-72-384
cp models/bert-base-uncased/pytorch_model.bin models/bert-td-72-384/pytorch_model.bin
cp models/bert-base-uncased/vocab.txt models/bert-td-72-384/vocab.txt
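Before training, it may be worth confirming that both model directories contain the weights and the vocabulary. The check below is a minimal sketch: `missing_files` is a hypothetical helper, not part of this repository, and the file list covers only the two files fetched above.

```python
from pathlib import Path

# The two files downloaded and copied in the commands above.
REQUIRED = ("pytorch_model.bin", "vocab.txt")


def missing_files(model_dir, required=REQUIRED):
    """Return the names in `required` that are absent or empty in model_dir."""
    root = Path(model_dir)
    missing = []
    for name in required:
        path = root / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for directory in ("models/bert-base-uncased", "models/bert-td-72-384"):
        gaps = missing_files(directory)
        print(directory, "OK" if not gaps else "missing: " + ", ".join(gaps))
```

An empty file is reported as missing as well, since an interrupted download can leave a zero-byte file behind.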

Generate training data from the given corpora (e.g., saved under the path "corpora")

python pregenerate_training_data.py --train_corpus ${CORPUS_RAW} \
                  --bert_model ${BERT_BASE_DIR} \
                  --reduce_memory --do_lower_case \
                  --epochs_to_generate 3 \
                  --output_dir ${CORPUS_JSON_DIR}

Task data augmentation

python data_augmentation.py --pretrained_bert_model ${BERT_BASE_DIR} \
                            --glove_embs ${GLOVE_EMB} \
                            --glue_dir ${GLUE_DIR} \
                            --task_name ${TASK_NAME}

Decomposing BERT

decomposition and general distillation

Run with Horovod

mpirun -np 8 -bind-to none -map-by slot \
       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
       -mca pml ob1 -mca btl ^openib \
       python3 general_distill.py \
                  --teacher_model models/bert-base-uncased \
                  --student_model models/bert-gd-72-384 \
                  --pregenerated_data data/pregenerated_data \
                  --num_train_epochs 2.0 \
                  --train_batch_size 32 \
                  --output_dir output/bert-gd-72-384 \
                  -use_swap --do_lower_case
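To get a feel for how much a Tucker-style decomposition can shrink the encoder, the arithmetic can be sketched as follows. This is an illustration under stated assumptions, not a description of this repository's exact layout: it assumes each Transformer layer contributes 12 blocks of size hidden x hidden (4 from self-attention, 8 from the feed-forward network), that these blocks share two factor matrices and a bank of core slices, and that "72-384" in the model name means 72 core slices of size 384 x 384. Biases, embeddings, and layer normalization are ignored, and the function names are hypothetical.

```python
def dense_params(num_layers, hidden):
    """Parameters in 12 hidden x hidden blocks per layer (4 attention + 8 FFN)."""
    return 12 * num_layers * hidden * hidden


def tucker_params(num_layers, hidden, num_cores, core_dim):
    """Parameters of two shared factors, a mixing matrix, and the core bank."""
    factors = 2 * hidden * core_dim           # two factor matrices, hidden x core_dim each
    mixing = 12 * num_layers * num_cores      # one coefficient row per original block
    cores = num_cores * core_dim * core_dim   # num_cores slices of core_dim x core_dim
    return factors + mixing + cores


original = dense_params(num_layers=12, hidden=768)
compressed = tucker_params(num_layers=12, hidden=768, num_cores=72, core_dim=384)
print(original, compressed, round(original / compressed, 2))
```

Under these assumptions the encoder weights drop from roughly 85M to roughly 11M parameters, with the core bank accounting for almost all of what remains.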

To restrict sharing to SAN or FFN only, add an "ops" key to bert-gd-72-384/config.json and set it to "san" or "ffn", for example:

"ops": "san"
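If you prefer not to edit the JSON by hand, the setting can be written with a few lines of Python. This is a minimal sketch: `set_ops` is a hypothetical helper, not part of this repository, and the only facts it relies on are the key name "ops" and its two documented values.

```python
import json
from pathlib import Path

ALLOWED_OPS = ("san", "ffn")  # the two values named in this README


def set_ops(config_path, ops):
    """Set the "ops" key in a JSON config file, keeping all other keys."""
    if ops not in ALLOWED_OPS:
        raise ValueError(f"ops must be one of {ALLOWED_OPS}, got {ops!r}")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["ops"] = ops
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_ops("models/bert-gd-72-384/config.json", "san")` would restrict sharing to the self-attention weights, assuming the config file lives under `models/` like the student model used in the command above.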

Evaluation

Task distillation with data augmentation in fine-tuning phase

Rename the pretrained checkpoint (for instance, change step_0_pytorch_model.bin to pytorch_model.bin), and change load_compressed_model from false to true in output/config.json.
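The two manual steps above can also be scripted. The sketch below assumes the checkpoint and config.json sit in the same output directory; `prepare_checkpoint` is a hypothetical helper, not part of this repository, and it copies the checkpoint instead of renaming it so that the original step file is kept.

```python
import json
import shutil
from pathlib import Path


def prepare_checkpoint(output_dir, checkpoint_name="step_0_pytorch_model.bin"):
    """Copy a step checkpoint to pytorch_model.bin and enable load_compressed_model."""
    root = Path(output_dir)
    target = root / "pytorch_model.bin"
    # Copy rather than rename, so the original step checkpoint is preserved.
    shutil.copyfile(root / checkpoint_name, target)

    config_path = root / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["load_compressed_model"] = True  # written to the file as JSON `true`
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return target
```

Calling `prepare_checkpoint("output")` would then leave the directory ready to be passed as the student model for task distillation.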

Task distillation for distributed training

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 task_distill.py \
                  --teacher_model models/bert-base-uncased/STS-B \
                  --student_model models/bert-gd-72-384 \
                  --task_name STS-B \
                  --data_dir data/glue_data/STS-B \
                  --max_seq_length 128 \
                  --train_batch_size 32 \
                  --aug_train \
                  --learning_rate 2e-5 \
                  --num_train_epochs 3.0 \
                  --output_dir ./output/36-256-STS-B

Task distillation on a single GPU

python3 task_distill.py --teacher_model models/bert-base-uncased --student_model models/bert-td-72-384 --output output_demo --data_dir data/glue_data/SST-2 --task_name SST-2 --do_lower_case --aug_train

To train on the augmented data, add --aug_train.

Get test results for a model

python run_glue.py --model_name_or_path models/bert-td-72-384/SST-2 --task_name SST-2 --do_eval --do_predict --data_dir data/glue_data/SST-2 --max_seq_length 128 --save_steps 500 --save_total_limit 2 --output_dir ./output/SST-2

Owner

twinkle
