Implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

Overview

Selection via Proxy: Efficient Data Selection for Deep Learning

This repository contains a refactored implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

If you use this code in your research, please use the following BibTeX entry.

@inproceedings{
    coleman2020selection,
    title={Selection via Proxy: Efficient Data Selection for Deep Learning},
    author={Cody Coleman and Christopher Yeh and Stephen Mussmann and Baharan Mirzasoleiman and Peter Bailis and Percy Liang and Jure Leskovec and Matei Zaharia},
    booktitle={International Conference on Learning Representations},
    year={2020},
    url={https://openreview.net/forum?id=HJg2b0VYDr}
}

The original code is also available as a zip file, but lacks documentation, uses outdated packages, and won't be maintained. Please use this repository instead and report issues here.

Setup

Prerequisites

Installation

git clone https://github.com/stanford-futuredata/selection-via-proxy.git
cd selection-via-proxy
pip install -e .

or simply

pip install git+https://github.com/stanford-futuredata/selection-via-proxy.git

Quickstart

Perform active learning on CIFAR10 from the command line:

python -m svp.cifar active

Or from the python interpreter:

from svp.cifar.active import active
active()

"Selection via proxy" happens when --proxy-arch doesn't match --arch:

# ResNet20 selecting data for a ResNet164
python -m svp.cifar active --proxy-arch preact20 --arch preact164

For help, see python -m svp.cifar active --help or active()'s docstrinng.

Example Usage

Below are more examples of the command line interface that cover different datasets (e.g., CIFAR100, ImageNet, Amazon Review Polarity) and commands (e.g., train, coreset).

Basic Training

CIFAR10 and CIFAR100

Preliminaries

None. The CIFAR10 and CIFAR100 datasets will download if they don't exist in ./data/cifar10 and ./data/cifar100 respectively.

Examples
# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR10.
python -m svp.cifar train --dataset cifar10 --arch preact164

Replace --dataset CIFAR10 with --dataset CIFAR100 to run on CIFAR100 rather than CIFAR10.

# Train ResNet164 with pre-activation (https://arxiv.org/abs/1603.05027) on CIFAR100.
python -m svp.cifar train --dataset cifar100 --arch preact164

The same is true for all the python -m svp.cifar commands below

ImageNet

Preliminaries
  • Download the ImageNet dataset into a directory called imagenet.
  • Extract the images.
# Extract train data.
mkdir train && mv ILSVRC2012_img_train.tar train/ && cd train
tar -xvf ILSVRC2012_img_train.tar && rm -f ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
# Extract validation data.
cd ../ && mkdir val && mv ILSVRC2012_img_val.tar val/ && cd val && tar -xvf ILSVRC2012_img_val.tar
wget -qO- https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh | bash
  • Replace /path/to/data in all the python -m svp.imagenet commands below with the path to the imagenet directory you created. Note, do not include imagenet in the path; the script will automatically do that.
Examples
# Train ResNet50 (https://arxiv.org/abs/1512.03385).
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20

For convenience, you can use larger batch sizes and scale learning rates according to "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" with --scale-learning-rates:

# Train ResNet50 with a batch size of 1048 and scaled learning rates accordingly.
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates

Mixed precision training is also supported using apex. Apex isn't installed during the pip install instructions above, so please follow the installation instructions in the apex repository before running the command below.

# Use mixed precision training to train ResNet50 with a batch size of 1048 and scale learning rates accordingly.
python -m svp.imagenet train --dataset-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --batch-size 1048 --scale-learning-rates --fp16

Amazon Review Polarity and Full

Preliminaries
tar -xvzf amazon_review_full_csv.tar.gz
tar -xvzf amazon_review_polarity_csv.tar.gz
  • Replace /path/to/data in all the python -m svp.amazon commands below with the path to the root directory you created. Note, do not include amazon_review_full_csv or amazon_review_polarity_csv in the path; the script will automatically do that.
Examples
# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Polarity.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_polarity --arch vdcnn29-conv \
    --num-workers 4 --eval-num-workers 8

Replace --dataset amazon_review_polarity with --dataset amazon_review_full to run on Amazon Review Full rather than Amazon Review Polarity.

# Train VDCNN29 (https://arxiv.org/abs/1606.01781) on Amazon Review Full.
python -m svp.amazon train --datasets-dir '/path/to/data' --dataset amazon_review_full --arch vdcnn29-maxpool \
    --num-workers 4 --eval-num-workers 8

The same is true for all the python -m svp.amazon commands below

Active learning

Active learning selects points to label from a large pool of unlabeled data by repeatedly training a model on a small pool of labeled data and selecting additional examples to label based on the model’s uncertainty (e.g., the entropy of predicted class probabilities) or other heuristics. The commands below demonstrate how to perform active learning on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.

CIFAR10 and CIFAR100

Baseline Approach
# Perform active learning with ResNet164 for both selection and the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy is used for selecting which examples to label, while the target is only used for evaluating the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior set eval_target_at to evaluate at a specific labeling budget(s) or set train_target to False to skip evaluating the target model.

# Perform active learning with ResNet20 for selection and ResNet164 for the final predictions.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet20 after only 50 epochs for selection.
python -m svp.cifar active --dataset cifar10 --arch preact164 --num-workers 4 \
	--selection-method least_confidence --proxy-arch preact20 \
	--proxy-learning-rate 0.01 --proxy-epochs 1 \
	--proxy-learning-rate 0.1 --proxy-epochs 45 \
	--proxy-learning-rate 0.01 --proxy-epochs 4 \
	--initial-subset 1000 \
	--round 4000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--round 5000 \
	--eval-target-at 25000

ImageNet

Baseline Approach
# Perform active learning with ResNet50 for both selection and the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy is used for selecting which examples to label, while the target is only used for evaluating the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior set eval_target_at to evaluate at a specific labeling budget(s) or set train_target to False to skip evaluating the target model.

# Perform active learning with ResNet18 for selection and ResNet50 for the final predictions.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform active learning with ResNet18 after only 45 epochs for selection.
python -m svp.imagenet active --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates \
    --eval-target-at 512467 \
    --proxy-learning-rate 0.0167 --proxy-epochs 1 \
    --proxy-learning-rate 0.0333 --proxy-epochs 1 \
    --proxy-learning-rate 0.05 --proxy-epochs 1 \
    --proxy-learning-rate 0.0667 --proxy-epochs 1 \
    --proxy-learning-rate 0.0833 --proxy-epochs 1 \
    --proxy-learning-rate 0.1 --proxy-epochs 25 \
    --proxy-learning-rate 0.01 --proxy-epochs 15

Amazon Review Polarity and Full

Baseline Approach
# Perform active learning with VDCNN29 for both selection and the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity  --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence
Selection via Proxy

If the model architectures (arch vs proxy_arch) or the learning rate schedules don't match, "selection via proxy" (SVP) is performed and two separate models are trained. The proxy is used for selecting which examples to label, while the target is only used for evaluating the quality of the selection. By default, the target model (arch) is trained and evaluated after each selection round. To change this behavior set eval_target_at to evaluate at a specific labeling budget(s) or set train_target to False to skip evaluating the target model. You can evaluate a series of selections later using the precomputed_selection option.

# Perform active learning with VDCNN9 for selection and VDCNN29 for the final predictions.
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --proxy-arch vdcnn9-maxpool --eval-target-at 1440000

To use fastText as a proxy, Install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you created.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform active learning with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity --selection-method least_confidence \
    --size 72000 --size 360000 --size 720000 --size 1080000 --size 1440000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon active --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --selection-method least_confidence \
    --precomputed-selection $fasttext_path --eval-target-at 1440000

Core-set Selection

Core-set selection techniques start with a large labeled or unlabeled dataset and aim to find a small subset that accurately approximates the full dataset by selecting representative examples. The commands below demonstrate how to perform core-set selection on CIFAR10, CIFAR100, ImageNet, Amazon Review Polarity and Amazon Review Full with a variety of models and selection methods.

CIFAR10 and CIFAR100

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet164 for both selection and the final predictions.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet20 selecting for ResNet164.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20

To train the proxy for fewer epochs, use the --proxy-* options as shown below:

# Perform core-set selection with ResNet20 after only 50 epochs.
python -m svp.cifar coreset --dataset cifar10 --arch preact164 --num-workers 4 \
    --subset 25000 --selection-method forgetting_events \
    --proxy-arch preact20 \
	--proxy-learning-rate 0.01 --proxy-epochs 1 \
	--proxy-learning-rate 0.1 --proxy-epochs 45 \
	--proxy-learning-rate 0.01 --proxy-epochs 4

ImageNet

Baseline Approach
# Perform core-set selection with an oracle that uses ResNet50 for both selection and the final predictions.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events
Selection via Proxy
# Perform core-set selection with ResNet18 selecting for ResNet50.
python -m svp.imagenet coreset --datasets-dir '/path/to/data' --arch resnet50 --num-workers 20 \
    --subset 768700 --selection-method forgetting_events \
    --proxy-arch resnet18 --proxy-batch-size 1028 --proxy-scale-learning-rates

Amazon Review Polarity and Full

Baseline Approach
# Perform core-set selection with an oracle that uses VDCNN29 for both selection and the final predictions.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000  --selection-method entropy
Selection via Proxy
# Perform core-set selection with VDCNN9 selecting for VDCNN29.
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --subset 2160000 --selection-method entropy \
    --proxy-arch vdcnn9-maxpool

To use fastText as a proxy, Install fastText 0.1.0 and replace /path/to/fastText/fasttext in the python -m svp.amazon fasttext commands below with the path to the fastText binary you created.

# For convenience, save fastText results in a separate directory
mkdir fasttext
# Perform core-set selection with fastText.
python -m svp.amazon fasttext '/path/to/fastText/fasttext' --run-dir fasttext \
    --datasets-dir '/path/to/data' --dataset amazon_review_polarity \
    --selection-method entropy --size 3600000 --size 2160000
# Get the most recent timestamp from the fasttext directory.
fasttext_path="fasttext/$(ls fasttext | sort -nr | head -n 1)"
# Use selected labeled data from fastText to train VDCNN29
python -m svp.amazon coreset --datasets-dir '/path/to/data' --dataset amazon_review_polarity --num-workers 8 \
    --arch vdcnn29-conv --precomputed-selection $fasttext_path
Owner
Stanford Future Data Systems
We are a CS research group at Stanford building data-intensive systems
Stanford Future Data Systems
This repository contains all source code, pre-trained models related to the paper "An Empirical Study on GANs with Margin Cosine Loss and Relativistic Discriminator"

An Empirical Study on GANs with Margin Cosine Loss and Relativistic Discriminator This is a Pytorch implementation for the paper "An Empirical Study o

Cuong Nguyen 3 Nov 15, 2021
Implementation of the Point Transformer layer, in Pytorch

Point Transformer - Pytorch Implementation of the Point Transformer self-attention layer, in Pytorch. The simple circuit above seemed to have allowed

Phil Wang 501 Jan 03, 2023
Management Dashboard for Torchserve

Torchserve Dashboard Torchserve Dashboard using Streamlit Related blog post Usage Additional Requirement: torchserve (recommended:v0.5.2) Simply run:

Ceyda Cinarel 103 Dec 10, 2022
Mouse Brain in the Model Zoo

Deep Neural Mouse Brain Modeling This is the repository for the ongoing deep neural mouse modeling project, an attempt to characterize the representat

Colin Conwell 15 Aug 22, 2022
Audio2Face - Audio To Face With Python

Audio2Face Discription We create a project that transforms audio to blendshape w

FACEGOOD 724 Dec 26, 2022
Source code of the paper "Deep Learning of Latent Variable Models for Industrial Process Monitoring".

Source code of the paper "Deep Learning of Latent Variable Models for Industrial Process Monitoring".

Xiangyin Kong 7 Nov 08, 2022
Solution of Kaggle competition: Sartorius - Cell Instance Segmentation

Sartorius - Cell Instance Segmentation https://www.kaggle.com/c/sartorius-cell-instance-segmentation Environment setup Build docker image bash .dev_sc

68 Dec 09, 2022
[ICCV 2021] Official PyTorch implementation for Deep Relational Metric Learning.

Ranking Models in Unlabeled New Environments Prerequisites This code uses the following libraries Python 3.7 NumPy PyTorch 1.7.0 + torchivision 0.8.1

Borui Zhang 39 Dec 10, 2022
Official Implementation for HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing

HyperStyle: StyleGAN Inversion with HyperNetworks for Real Image Editing Yuval Alaluf*, Omer Tov*, Ron Mokady, Rinon Gal, Amit H. Bermano *Denotes equ

885 Jan 06, 2023
Based on Yolo's low-power, ultra-lightweight universal target detection algorithm, the parameter is only 250k, and the speed of the smart phone mobile terminal can reach ~300fps+

Based on Yolo's low-power, ultra-lightweight universal target detection algorithm, the parameter is only 250k, and the speed of the smart phone mobile terminal can reach ~300fps+

567 Dec 26, 2022
Very deep VAEs in JAX/Flax

Very Deep VAEs in JAX/Flax Implementation of the experiments in the paper Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on I

Jamie Townsend 42 Dec 12, 2022
Predictive Maintenance LSTM

Predictive-Maintenance-LSTM - Predictive maintenance study for Complex case study, we've obtained failure causes by operational error and more deeply by design mistakes.

Amir M. Sadafi 1 Dec 31, 2021
Read and write layered TIFF ImageSourceData and ImageResources tags

Read and write layered TIFF ImageSourceData and ImageResources tags Psdtags is a Python library to read and write the Adobe Photoshop(r) specific Imag

Christoph Gohlke 4 Feb 05, 2022
A library for low-memory inferencing in PyTorch.

Pylomin Pylomin (PYtorch LOw-Memory INference) is a library for low-memory inferencing in PyTorch. Installation ... Usage For example, the following c

3 Oct 26, 2022
GazeScroller - Using Facial Movements to perform Hands-free Gesture on the system

GazeScroller Using Facial Movements to perform Hands-free Gesture on the system

2 Jan 05, 2022
Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

HaloNet - Pytorch Implementation of the Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones. This re

Phil Wang 189 Nov 22, 2022
PyTorch implementation of PNASNet-5 on ImageNet

PNASNet.pytorch PyTorch implementation of PNASNet-5. Specifically, PyTorch code from this repository is adapted to completely match both my implemetat

Chenxi Liu 314 Nov 25, 2022
Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

flownet2-pytorch Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. Multiple GPU training is supported, a

NVIDIA Corporation 2.8k Dec 27, 2022
A tensorflow implementation of an HMM layer

tensorflow_hmm Tensorflow and numpy implementations of the HMM viterbi and forward/backward algorithms. See Keras example for an example of how to use

Zach Dwiel 283 Oct 19, 2022
PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models Code accompanying CVPR'20 paper of the same title. Paper lin

Alex Damian 7k Dec 30, 2022