Explore extreme compression for pre-trained language models

Last update: Nov 14, 2022

Code for the paper "Exploring extreme parameter compression for pre-trained language models" (ICLR 2022).

Before Training

Install the required libraries

 pip install tensorly==0.5.0

PyTorch is required; versions 1.0 to 1.4 are preferred.
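A quick way to check an installed version against the preferred range is sketched below. The helper name `is_preferred_torch` is hypothetical and not part of this repository; it only parses a version string and does not import PyTorch itself.

```python
def is_preferred_torch(version: str) -> bool:
    """Return True if `version` lies in the 1.0 to 1.4 range preferred above.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Drop a local build tag such as "+cu100", then keep major and minor only.
    major, minor = (int(part) for part in version.split("+")[0].split(".")[:2])
    return (1, 0) <= (major, minor) <= (1, 4)
```

With PyTorch installed, the check would be called as `is_preferred_torch(torch.__version__)` after `import torch`.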

Install Horovod for distributed training

See the Horovod documentation for configuration and for installing Horovod on GPU.

pip install horovod[pytorch]

Download the pre-trained models

wget https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin -P  models/bert-base-uncased
wget https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt -P  models/bert-base-uncased
cp models/bert-base-uncased/pytorch_model.bin models/bert-td-72-384/pytorch_model.bin 
cp models/bert-base-uncased/vocab.txt models/bert-td-72-384/vocab.txt
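The two cp commands above assume that the destination directory already exists. As a minimal sketch, the same copy step can be done in Python while creating the directory if needed. The function name `stage_student_files` is an assumption made for illustration; the file names and directories are the ones used in the commands above.

```python
import shutil
from pathlib import Path


def stage_student_files(src_dir, dst_dir, names=("pytorch_model.bin", "vocab.txt")):
    """Copy the downloaded teacher files into the student model directory.

    Hypothetical helper for illustration; not part of this repository.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    # Create the destination first; a plain `cp` fails if it is missing.
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        shutil.copy2(src / name, dst / name)
        copied.append(dst / name)
    return copied
```

For the layout used here, the call would be `stage_student_files("models/bert-base-uncased", "models/bert-td-72-384")`.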

Generate training data from the given corpora (e.g., saved under the path "corpora")

python pregenerate_training_data.py --train_corpus ${CORPUS_RAW} \
                  --bert_model ${BERT_BASE_DIR} \
                  --reduce_memory --do_lower_case \
                  --epochs_to_generate 3 \
                  --output_dir ${CORPUS_JSON_DIR}

Task data augmentation

python data_augmentation.py --pretrained_bert_model ${BERT_BASE_DIR} \
                            --glove_embs ${GLOVE_EMB} \
                            --glue_dir ${GLUE_DIR} \
                            --task_name ${TASK_NAME}

Decomposing BERT

Decomposition and general distillation

Run with Horovod

mpirun -np 8 -bind-to none -map-by slot \
       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
       -mca pml ob1 -mca btl ^openib \
       python3 general_distill.py --teacher_model models/bert-base-uncased \
                                  --student_model models/bert-gd-72-384 \
                                  --pregenerated_data data/pregenerated_data \
                                  --num_train_epochs 2.0 \
                                  --train_batch_size 32 \
                                  --output_dir output/bert-gd-72-384 \
                                  -use_swap --do_lower_case

To restrict sharing to either the self-attention (SAN) or the feed-forward (FFN) weights, add an "ops" key to bert-gd-72-384/config.json and set it to "san" or "ffn":

"ops": "san"
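The config change can also be scripted. The sketch below assumes config.json is a flat JSON object; the function name `set_ops` is hypothetical, and only the "ops" key and its two values come from the text above.

```python
import json
from pathlib import Path


def set_ops(config_path, ops):
    """Write the "ops" key into a model config.json and return the new config.

    Hypothetical helper for illustration; not part of this repository.
    """
    if ops not in ("san", "ffn"):
        raise ValueError('ops must be "san" or "ffn"')
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["ops"] = ops  # all other keys are left untouched
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For the student model used here, the call would be `set_ops("models/bert-gd-72-384/config.json", "san")`, assuming the config lives under models/ like the other model directories.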

Evaluation

Task distillation with data augmentation in fine-tuning phase

Rename the pretrained checkpoint (for instance, change step_0_pytorch_model.bin to pytorch_model.bin), and change load_compressed_model from false to true in output/config.json
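Both preparation steps can be combined in a few lines of Python. This is a sketch under two assumptions: that the checkpoint and config.json sit in the same output directory, and that config.json is a flat JSON object. The function name `prepare_checkpoint` is hypothetical; the file names and the load_compressed_model key are the ones named above.

```python
import json
from pathlib import Path


def prepare_checkpoint(output_dir, checkpoint_name="step_0_pytorch_model.bin"):
    """Rename a saved checkpoint and enable load_compressed_model.

    Hypothetical helper for illustration; not part of this repository.
    """
    out = Path(output_dir)
    # Step 1: give the checkpoint the file name the loader expects.
    (out / checkpoint_name).replace(out / "pytorch_model.bin")
    # Step 2: flip load_compressed_model from false to true in config.json.
    config_path = out / "config.json"
    config = json.loads(config_path.read_text())
    config["load_compressed_model"] = True
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

A call such as `prepare_checkpoint("output")` would cover the example given above.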

Task distillation for distributed training

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 task_distill.py \
                  --teacher_model models/bert-base-uncased/STS-B \
                  --student_model models/bert-gd-72-384 \
                  --task_name STS-B \
                  --data_dir data/glue_data/STS-B \
                  --max_seq_length 128 \
                  --train_batch_size 32 \
                  --aug_train \
                  --learning_rate 2e-5 \
                  --num_train_epochs 3.0 \
                  --output_dir ./output/36-256-STS-B

Task distillation on a single GPU

python3 task_distill.py --teacher_model models/bert-base-uncased \
                  --student_model models/bert-td-72-384 \
                  --output output_demo \
                  --data_dir data/glue_data/SST-2 \
                  --task_name SST-2 \
                  --do_lower_case --aug_train

To train on the augmented data, add --aug_train.

Get test results for a model

python run_glue.py --model_name_or_path models/bert-td-72-384/SST-2 \
                  --task_name SST-2 \
                  --do_eval --do_predict \
                  --data_dir data/glue_data/SST-2 \
                  --max_seq_length 128 \
                  --save_steps 500 \
                  --save_total_limit 2 \
                  --output_dir ./output/SST-2

Owner

twinkle
