Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

Last update: Dec 05, 2022

This is the PyTorch implementation of sparse progressive distillation (SPD). For details on the motivation, techniques, and experimental results, refer to our paper here.

Running

Environment Preparation (using python3)
```
pip install -r requirements.txt
```
Dataset Preparation

The original GLUE dataset can be downloaded here.

BERT_base fine-tuning on GLUE

We use a finetuned BERT_base as the teacher. For each task in the GLUE benchmark, we obtain the finetuned model by running the original Hugging Face Transformers code with the following script.

```
python run_glue.py \
  --model_name_or_path $INT_DIR \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 3e-5 \
  --num_train_epochs 4.0 \
  --output_dir $OUT_DIR \
  --evaluate_during_training \
  --overwrite_output_dir \
  --logging_steps 400 \
  --logging_dir $OUT_DIR \
  --save_steps 10000
```
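
Since a teacher is needed for every GLUE task, it can help to generate the command above once per task. The following is a minimal sketch, not part of this repository: the helper name `build_finetune_command`, the task directory names in `GLUE_TASKS`, the `bert-base-uncased` starting checkpoint, and the `glue_data` and `teachers` directories are all assumptions for illustration. It only builds and prints the commands; it does not launch training.

```python
import shlex

# Assumed GLUE task directory names; adjust to match your local download.
GLUE_TASKS = ["CoLA", "SST-2", "MRPC", "STS-B", "QQP", "MNLI", "QNLI", "RTE", "WNLI"]


def build_finetune_command(task_name, model_path, glue_dir, out_root):
    """Return the teacher finetuning command for one task as an argv list.

    The flags and values mirror the script shown above. Using one output
    directory per task (out_root/task_name) is an assumption of this sketch.
    """
    out_dir = f"{out_root}/{task_name}"
    return [
        "python", "run_glue.py",
        "--model_name_or_path", model_path,
        "--task_name", task_name,
        "--do_train",
        "--do_eval",
        "--data_dir", f"{glue_dir}/{task_name}/",
        "--max_seq_length", "128",
        "--per_gpu_train_batch_size", "32",
        "--per_gpu_eval_batch_size", "32",
        "--learning_rate", "3e-5",
        "--num_train_epochs", "4.0",
        "--output_dir", out_dir,
        "--evaluate_during_training",
        "--overwrite_output_dir",
        "--logging_steps", "400",
        "--logging_dir", out_dir,
        "--save_steps", "10000",
    ]


if __name__ == "__main__":
    for task in GLUE_TASKS:
        # Hypothetical paths; replace with your own checkpoint and directories.
        cmd = build_finetune_command(task, "bert-base-uncased", "glue_data", "teachers")
        print(shlex.join(cmd))
```

Returning an argv list rather than a single string keeps the command safe to pass to `subprocess.run` later without shell quoting issues.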

Sparse Progressive Distillation

We use run_glue.py to run sparse progressive distillation. --num_prune_epochs sets the number of pruning epochs, and --num_train_epochs sets the total number of epochs across all three phases (pruning, progressive distillation, and finetuning).

```
python run_glue.py \
  --model_name_or_path PATH_TO_FINETUNED_MODEL \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $GLUE_DIR/$TASK_NAME/ \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 32 \
  --per_gpu_eval_batch_size 32 \
  --learning_rate 6.4e-4 \
  --save_steps 50 \
  --num_prune_epochs 30 \
  --num_train_epochs 60 \
  --sparsity 0.9 \
  --output_dir $OUT_DIR \
  --evaluate_during_training \
  --replacing_rate 0.8 \
  --overwrite_output_dir \
  --steps_for_replacing 0 \
  --scheduler_type linear
```
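
Because --num_train_epochs counts all phases, the epochs left after pruning are the difference between the two flags. The sketch below works through that arithmetic for the values in the command above. The helper name `split_epoch_budget` is hypothetical, and how the remaining epochs are divided between progressive distillation and finetuning is not specified here, so the sketch reports them as a single number.

```python
def split_epoch_budget(num_train_epochs, num_prune_epochs):
    """Split the total epoch budget into pruning epochs and the remainder.

    The remainder covers progressive distillation and finetuning together;
    this sketch does not assume how the two share it.
    """
    if num_prune_epochs < 0 or num_prune_epochs > num_train_epochs:
        raise ValueError("num_prune_epochs must be between 0 and num_train_epochs")
    return num_prune_epochs, num_train_epochs - num_prune_epochs


# Values taken from the command above.
prune_epochs, remaining_epochs = split_epoch_budget(num_train_epochs=60, num_prune_epochs=30)
print(f"pruning: {prune_epochs} epochs, remaining: {remaining_epochs} epochs")
# prints "pruning: 30 epochs, remaining: 30 epochs"
```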

To Dos

Provide our teacher model for each task.
Provide the best-performing model checkpoint for each task.
