A practical ML pipeline for data labeling with experiment tracking using DVC.

Last update: Mar 08, 2022

Overview

Auto Label Pipeline

Goals:

Demonstrate reproducible ML
Use DVC to build a pipeline and track experiments
Automatically clean noisy data labels using Cleanlab cross validation
Determine which FastText subword embedding performs better for semi-supervised cluster classification
Determine optimal hyperparameters through experiment tracking
Prepare casually labeled data for human evaluation
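
The label-cleaning goal can be illustrated with a minimal sketch, assuming out-of-fold predicted probabilities from cross validation are already available. The function name `rank_label_issues` and the fixed `threshold` are hypothetical; Cleanlab's own method is more involved (it uses per-class thresholds), so this is a simplified stand-in rather than its API.

```python
from typing import List, Sequence, Tuple


def rank_label_issues(
    labels: Sequence[int],
    pred_probs: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cutoff, not a pipeline setting
) -> List[Tuple[int, float]]:
    """Return (row index, self-confidence) pairs for suspicious labels.

    Self-confidence is the out-of-fold probability the model assigns to the
    label a row currently carries. Rows below `threshold` are returned,
    least confident first.
    """
    suspects = []
    for i, (label, probs) in enumerate(zip(labels, pred_probs)):
        self_confidence = probs[label]
        if self_confidence < threshold:
            suspects.append((i, self_confidence))
    return sorted(suspects, key=lambda pair: pair[1])


# Toy data: three rows, two classes (indices 0 and 1).
labels = [0, 1, 0]
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
suspects = rank_label_issues(labels, pred_probs)
print(suspects)  # [(2, 0.3)]
```

Only the third row is flagged, because the model gives its current label a probability of 0.3.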

Demo: View Experiments recorded in git branches:

The Data

For our working demo, we will purify some of the slightly noisy/dirty labels found in Wikidata people entries for the Employer and Occupation attributes. Our initial data labels have been harvested from a JSON dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

Data Input Format

Tab-separated CSV files, with the fields:

text_data - the item that is to be labeled (single word or short group of words)
class_type - the class label
context - any text that surrounds the text_data field in situ, or defines the text_data item in other words.
count - the number of occurrences of this label; how commonly it appears in the existing data.
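
As a rough sketch of the input format, the following reads tab-separated rows and checks for the four fields listed above. It assumes the files carry a header row with those field names; the helper `read_labeled_rows` and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, Iterable, List

REQUIRED_FIELDS = ("text_data", "class_type", "context", "count")


def read_labeled_rows(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse tab-separated rows and verify the expected header fields."""
    reader = csv.DictReader(lines, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_FIELDS if name not in header]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return list(reader)


# Illustrative in-memory sample; real files would live in data/raw.
sample = io.StringIO(
    "text_data\tclass_type\tcontext\tcount\n"
    "software engineer\toccupation\tperson who writes software\t42\n"
    "Acme Corp\temployer\tfictional company\t7\n"
)
rows = read_labeled_rows(sample)
print(rows[0]["text_data"], rows[0]["class_type"])
```

To read from disk instead, pass a file opened with `open(path, newline="", encoding="utf-8")`.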

Data Output format

The same fields as the data input, plus:
date_updated - when the label was updated
previous_class_type - the previous class_type label
mislabeled_rank - records how low the confidence was prior to a re-label
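
One way to picture how an output row relates to an input row is sketched below. The helper `relabel_row`, the ISO date format, and the numeric form of `mislabeled_rank` are assumptions made for illustration; the actual relabel stage may encode these differently.

```python
from datetime import date
from typing import Dict, Optional


def relabel_row(
    row: Dict[str, object],
    new_class_type: str,
    mislabeled_rank: float,
    today: Optional[date] = None,
) -> Dict[str, object]:
    """Return a copy of `row` with the new label and the three extra fields."""
    updated = dict(row)
    updated["previous_class_type"] = row["class_type"]  # keep the old label
    updated["class_type"] = new_class_type
    updated["mislabeled_rank"] = mislabeled_rank
    updated["date_updated"] = (today or date.today()).isoformat()
    return updated


# Illustrative row; the values are made up.
original = {
    "text_data": "Acme Corp",
    "class_type": "occupation",
    "context": "fictional company",
    "count": "7",
}
fixed = relabel_row(original, "employer", 0.12, today=date(2022, 3, 8))
print(fixed["previous_class_type"], "->", fixed["class_type"])
```

The input row is left unmodified, so the original label remains available for comparison.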

The Pipeline

Fetch
Prepare
Train
Relabel

For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file, and parameterized via params.yaml.

Using/Extending the pipeline

Drop your own CSV files into the data/raw directory
Run the pipeline
Tune settings, embeddings, etc., until no longer amused
Verify your results manually and by submitting data/final/data.csv for human evaluation, using random sampling and drawing heavily from the mislabeled_rank entries.
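
The sampling described in the last step could look roughly like this. It is a sketch under the assumption that rows which were never relabeled have an empty `mislabeled_rank`; the helper `sample_for_review` and its parameters are hypothetical.

```python
import random
from typing import Dict, List


def sample_for_review(
    rows: List[Dict[str, str]],
    n_flagged: int,
    n_random: int,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Pick rows for human evaluation, favoring relabeled entries."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    flagged = [r for r in rows if r.get("mislabeled_rank") not in (None, "")]
    unflagged = [r for r in rows if r.get("mislabeled_rank") in (None, "")]
    picked_flagged = rng.sample(flagged, min(n_flagged, len(flagged)))
    picked_random = rng.sample(unflagged, min(n_random, len(unflagged)))
    return picked_flagged + picked_random


# Illustrative rows; only the fields used here are shown.
rows = [
    {"text_data": "a", "mislabeled_rank": "0.12"},
    {"text_data": "b", "mislabeled_rank": ""},
    {"text_data": "c", "mislabeled_rank": ""},
    {"text_data": "d", "mislabeled_rank": "0.31"},
    {"text_data": "e", "mislabeled_rank": ""},
]
review = sample_for_review(rows, n_flagged=2, n_random=1)
print(len(review))  # 3
```

Raising `n_flagged` relative to `n_random` is what draws more heavily from the relabeled entries.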

Project Structure

├── LICENSE
├── README.md
├── data                    # <-- Directory with all types of data
│ ├── final                 # <-- Directory with final data
│ │ ├── class.metrics.csv   # <-- Class metrics after relabeling (used for the confusion matrix plot)
│ │ └── data.csv            # <-- Pipeline output (not stored in git)
│ ├── interim               # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared              # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw                   # <-- Directory with raw data; populated by pipeline's fetch stage
│     ├── README.md
│     ├── cc.en.300.bin               # <-- FastText binary model file, Common Crawl + Wikipedia
│     ├── crawl-300d-2M-subword.bin   # <-- Fasttext binary model file, common crawl
│     ├── crawl-300d-2M-subword.vec
│     ├── employers.wikidata.csv      # <-- Our initial data, 1 set of class labels 
│     ├── lid.176.ftz
│     └── occupations.wikidata.csv    # <-- Our initial data, 1 set of class labels
├── dvc.lock                # <-- DVC internal state tracking file
├── dvc.yaml                # <-- DVC project configuration file
├── dvc_plots               # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json    # <-- Metrics from the pipeline's train stage  
├── mypy.ini
├── params.yaml             # <-- Parameter configuration file for the pipeline
├── reports                 # <-- Directory with metrics output
│ ├── prepare.metrics.json  
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                     # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py

Setup

Create environment

conda create --name auto-label-pipeline python=3.9

conda activate auto-label-pipeline

Install requirements

pip install -r requirements.txt

If you're going to modify the source, also install the requirements-dev.txt file

Reproduce the pipeline results locally

dvc repro

View Metrics

dvc metrics show

Working with Experiments

To see your local experiments:

dvc exp show

Experiments that have been turned into branches can be referenced directly in commands. For example, to compare two experiments:

dvc exp diff [experiment branch name] [experiment branch 2 name]

e.g.:

dvc exp diff svc_linear_ex svc_rbf_ex

dvc exp diff svc_poly_ex svc_rbf_ex

To create an experiment by changing a parameter:

dvc exp run --set-param train.split=0.9 --name my_split_ex
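
The dotted key `train.split` addresses a nested entry in params.yaml. The sketch below mimics that effect on a plain dictionary. It is not DVC's implementation; the starting value `0.8` and the helper `set_param` are made up for illustration.

```python
import copy
from typing import Any, Dict


def set_param(params: Dict[str, Any], dotted_key: str, value: Any) -> Dict[str, Any]:
    """Return a copy of `params` with the nested `dotted_key` set to `value`."""
    result = copy.deepcopy(params)
    node = result
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        node = node.setdefault(key, {})  # descend, creating levels if needed
    node[leaf] = value
    return result


# Hypothetical parameter layout; the real params.yaml may differ.
params = {"train": {"split": 0.8}}
updated = set_param(params, "train.split", 0.9)
print(updated)  # {'train': {'split': 0.9}}
```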

To save and share your experiment in a branch:

dvc exp branch my_split_ex my_split_ex_branch

(When promoting an experiment to a branch, DVC does not switch into the branch.)

View plots

Initial Confusion matrix:

dvc plots show model/class.metrics.csv -x actual -y predicted --template confusion

Confusion matrix after relabeling:

dvc plots show data/final/class.metrics.csv -x actual -y predicted --template confusion
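
To inspect the same data without rendering a plot, the pairs can be tallied directly. This is a sketch that assumes the metrics file is comma-separated with one row per sample and a header containing the `actual` and `predicted` columns used in the commands above; the sample rows are made up.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Tuple


def confusion_counts(lines: Iterable[str]) -> "Counter[Tuple[str, str]]":
    """Count how often each (actual, predicted) pair occurs."""
    reader = csv.DictReader(lines)
    return Counter((row["actual"], row["predicted"]) for row in reader)


# Illustrative in-memory sample standing in for class.metrics.csv.
sample = io.StringIO(
    "actual,predicted\n"
    "employer,employer\n"
    "employer,occupation\n"
    "occupation,occupation\n"
    "occupation,occupation\n"
)
counts = confusion_counts(sample)
for (actual, predicted), n in sorted(counts.items()):
    print(f"{actual:>10} -> {predicted:<10} {n}")
```

Pairs where `actual` and `predicted` differ are the off-diagonal cells of the confusion matrix.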

Conclusions

For relabeling and cleaning, it's important to have more than two labels, and to specify an UNK label for: unknown items; labels spanning multiple groups; or low confidence support.
Standardizing the input data format allows users to flexibly use many different data sources.
Language detection is an important part of data cleaning, but it is problematic because:
- Modern languages sometimes "borrow" words from other languages (but not just any words!)
- Language detection models perform inference poorly with limited data, especially just a single word.
- Normalization utilities, such as unidecode, aren't helpful (the wrong word in more readable letters is still the wrong word).
Experimentation parameters often have co-dependencies that would make a simple combinatorial grid search inefficient.
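
The role of the UNK label for low confidence support can be sketched as a simple fallback rule. This is one possible reading of the first conclusion, not the pipeline's code; the helper `choose_label` and the `min_confidence` value are hypothetical.

```python
from typing import Dict

UNK = "UNK"


def choose_label(class_probs: Dict[str, float], min_confidence: float = 0.6) -> str:
    """Return the most probable label, or UNK when support is too weak."""
    best_label = max(class_probs, key=lambda label: class_probs[label])
    if class_probs[best_label] < min_confidence:
        return UNK
    return best_label


# Made-up probabilities for two items.
print(choose_label({"employer": 0.9, "occupation": 0.1}))    # employer
print(choose_label({"employer": 0.55, "occupation": 0.45}))  # UNK
```

Items that straddle two groups tend to produce near-even probabilities, so they fall back to UNK rather than receiving a forced label.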

Owner

Todd Cook