TF2 implementation of knowledge distillation using the "function matching" hypothesis from the paper Knowledge distillation: A good teacher is patient and consistent by Beyer et al.

Overview

FunMatch-Distillation

The techniques have been demonstrated using three datasets: Flowers102, Pet37, and Food101.

This repository provides Kaggle Kernel notebooks so that we can leverage the free TPU v3-8 to run the long training schedules. Please refer to the "About the notebooks" section.

Importance

The importance of knowledge distillation lies in its practical usefulness. With the recipes from "function matching", we can now perform knowledge distillation using a principled approach yielding student models that can actually match the performance of their teacher models. This essentially allows us to compress bigger models into (much) smaller ones thereby reducing storage costs and improving inference speed.

Key ingredients

  • No use of ground-truth labels during distillation.
  • Teacher and student should see the same images during distillation, as opposed to differently augmented views of the same images.
  • Aggressive form of MixUp as the key augmentation recipe. MixUp is paired with "Inception-style" cropping (implemented in this script).
  • A LONG training schedule for distillation. At least 1000 epochs to get good results without overfitting. The importance of a long training schedule is paramount as studied in the paper.
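
The first three ingredients can be illustrated with a small framework-free sketch: two inputs are mixed with MixUp, the teacher and the student both score that same mixed input, and the loss is the KL divergence between their predicted distributions, so ground-truth labels never appear. Everything here is a hypothetical stand-in rather than the repository's code: `linear_model` replaces a real network, the two-element lists replace images, and sampling the mixing coefficient uniformly from [0, 1] is an assumption about what "aggressive" MixUp means. Temperature scaling is omitted.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mixup(x1, x2, lam):
    """Convex combination of two inputs with mixing coefficient lam."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]


def linear_model(weights):
    """Hypothetical stand-in for a network: one weight row per class."""
    def predict(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return predict


def distillation_loss(teacher, student, x):
    """Both models see the same input x; no ground-truth label is used."""
    teacher_probs = softmax(teacher(x))
    student_probs = softmax(student(x))
    return kl_divergence(teacher_probs, student_probs)


teacher = linear_model([[1.0, -0.5], [-1.0, 0.5], [0.2, 0.2]])
student = linear_model([[0.5, 0.0], [-0.5, 0.0], [0.0, 0.1]])

rng = random.Random(0)
lam = rng.random()  # assumed: mixing coefficient drawn uniformly from [0, 1]
mixed = mixup([1.0, 2.0], [3.0, -1.0], lam)
loss = distillation_loss(teacher, student, mixed)
print(loss)
```

The loss is zero only when the student reproduces the teacher's distribution on the mixed input, which is the sense in which the student is asked to match the teacher's function rather than fit labels.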

Results

The table below summarizes the results of my experiments. In all cases, the teacher is a BiT-ResNet101x3 model and the student is a BiT-ResNet50x1. For fun, you can also try to distill into other model families. BiT stands for "Big Transfer"; it was proposed in the paper Big Transfer (BiT): General Visual Representation Learning by Kolesnikov et al.

| Dataset | Teacher/Student | Top-1 Acc on Test | Location |
|---|---|---|---|
| Flowers102 | Teacher | 98.18% | Link |
| Flowers102 | Student (1000 epochs) | 81.02% | Link |
| Pet37 | Teacher | 90.92% | Link |
| Pet37 | Student (300 epochs) | 81.3% | Link |
| Pet37 | Student (1000 epochs) | 86% | Link |
| Food101 | Teacher | 85.52% | Link |
| Food101 | Student (100 epochs) | 76.06% | Link |

(Location denotes the trained model location.)

These results are consistent with Table 4 of the original paper.

It should be noted that none of the above student training regimes showed signs of overfitting. Further improvements can be made by training for longer. The authors also showed that the Shampoo optimizer can reach similar performance much faster than Adam during distillation, so it may well be possible to match these results in fewer epochs with Shampoo.

A few differences from the original implementation:

  • The authors use BiT-ResNet152x2 as a teacher.
  • The mixup() variant I used will produce a pair of duplicate images if the number of images is even. Now, for 8 workers it will become 8 pairs. This may have reduced performance. We can overcome this by using tf.roll(images, 1, axis=0) instead of tf.reverse in the mixup() function. Thanks to Lucas Beyer for pointing this out.
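
To see why the pairing matters, the sketch below compares which images end up mixed together under the two schemes, using plain index arithmetic instead of tensors. It assumes that reversing a batch of n images pairs position i with position n - 1 - i, and that tf.roll(images, 1, axis=0) pairs position i with position i - 1 (wrapping around). The helper names are hypothetical and do not come from the repository.

```python
def reverse_partners(n):
    """Partner index per position when the batch is mixed with its reverse."""
    return [n - 1 - i for i in range(n)]


def roll_partners(n):
    """Partner index per position when the batch is mixed with itself
    shifted by one: position i receives the image from position i - 1."""
    return [(i - 1) % n for i in range(n)]


def distinct_pairs(partners):
    """Set of unordered (image, partner) index pairs that get mixed."""
    return {frozenset((i, j)) for i, j in enumerate(partners)}


n = 8  # an even per-worker batch size, chosen for illustration
reverse_pairs = distinct_pairs(reverse_partners(n))
roll_pairs = distinct_pairs(roll_partners(n))

print(len(reverse_pairs))  # 4: every pair of source images is used twice
print(len(roll_pairs))     # 8: every position mixes a different pair
```

With reversal, positions i and n - 1 - i mix the same two source images, so an even batch of 8 yields only 4 distinct pairs. With a shift of one, all 8 positions mix a different pair.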

About the notebooks

All the notebooks are fully runnable on Kaggle Kernels. The only requirement is a billing-enabled GCP account so that GCS buckets can be used to store data.

| Notebook | Description | Kaggle Kernel |
|---|---|---|
| train_bit.ipynb | Shows how to train the teacher model. | Link |
| train_bit_keras_tuner.ipynb | Shows how to run hyperparameter tuning using Keras Tuner for the teacher model. | Link |
| funmatch_distillation.ipynb | Shows an implementation of the recipes from "function matching". | Link |

These are only demonstrated on the Pet37 dataset but will work out-of-the-box for the other datasets too.

TFRecords

For convenience, TFRecords of different datasets are provided:

| Dataset | TFRecords |
|---|---|
| Flowers102 | Link |
| Pet37 | Link |
| Food101 | Link |

Paper citation

@misc{beyer2021knowledge,
      title={Knowledge distillation: A good teacher is patient and consistent}, 
      author={Lucas Beyer and Xiaohua Zhai and Amélie Royer and Larisa Markeeva and Rohan Anil and Alexander Kolesnikov},
      year={2021},
      eprint={2106.05237},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgements

Huge thanks to Lucas Beyer (first author of the paper) for providing suggestions on the initial version of the implementation.

Thanks to the ML-GDE program for providing GCP credits.

Thanks to TRC for providing Cloud TPU access.

Releases: v4.0.0

Owner: Sayak Paul