A practical ML pipeline for data labeling with experiment tracking using DVC.

Overview

Auto Label Pipeline

Goals:

  • Demonstrate reproducible ML
  • Use DVC to build a pipeline and track experiments
  • Automatically clean noisy data labels using Cleanlab cross validation
  • Determine which FastText subword embedding performs better for semi-supervised cluster classification
  • Determine optimal hyperparameters through experiment tracking
  • Prepare casually labeled data for human evaluation
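
As a rough illustration of the label-cleaning goal above, the sketch below flags likely label errors from out-of-sample class probabilities, in the spirit of confident learning. It is a simplified stand-in and does not use Cleanlab's API; the function names and the toy data are hypothetical, and the probabilities are assumed to come from cross-validated predictions.

```python
from typing import Dict, List, Sequence


def class_thresholds(
    labels: Sequence[int], probs: Sequence[Sequence[float]]
) -> Dict[int, float]:
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for label, row in zip(labels, probs):
        sums[label] = sums.get(label, 0.0) + row[label]
        counts[label] = counts.get(label, 0) + 1
    return {k: sums[k] / counts[k] for k in sums}


def find_label_issues(
    labels: Sequence[int], probs: Sequence[Sequence[float]]
) -> List[int]:
    """Return indices whose most confident class differs from the given label."""
    thresholds = class_thresholds(labels, probs)
    issues: List[int] = []
    for idx, (label, row) in enumerate(zip(labels, probs)):
        # Classes whose probability clears that class's own threshold.
        confident = [k for k, t in thresholds.items() if row[k] >= t]
        if not confident:
            continue  # no class is confident enough; leave the label alone
        best = max(confident, key=lambda k: row[k])
        if best != label:
            issues.append(idx)
    return issues


# Toy data (hypothetical): two classes, six examples.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.20, 0.80],  # labeled 0, but the model is confident in class 1
    [0.10, 0.90],
    [0.30, 0.70],
    [0.85, 0.15],  # labeled 1, but the model is confident in class 0
]
print(find_label_issues(toy_labels, toy_probs))
```

Using a per-class threshold rather than one global cutoff keeps a class that the model is systematically less sure about from being flagged wholesale.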

Demo: view experiments recorded in git branches (asciicast recording).

The Data

For our working demo, we will purify some of the slightly noisy/dirty labels found in Wikidata people entries for the Employers and Occupations attributes. Our initial data labels have been harvested from a JSON dump of Wikidata, the Kensho Wikidata dataset, and this notebook script for extracting the data.

Data Input Format

Tab separated CSV files, with the fields:

  • text_data - the item that is to be labeled (single word or short group of words)
  • class_type - the class label
  • context - any text that surrounds the text_data field in situ, or defines the text_data item in other words.
  • count - the number of occurrences of this label; how common it appears in the existing data.
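
A minimal sketch of reading this format with Python's standard `csv` module. It assumes the files have no header row and that the columns appear in the order listed above; the function name `read_labels` and the sample rows are made up for illustration.

```python
import csv
import io
from typing import Dict, List, TextIO

# Field order as listed above; a header-less file is an assumption.
FIELDS = ["text_data", "class_type", "context", "count"]


def read_labels(handle: TextIO) -> List[Dict[str, object]]:
    """Parse tab-separated label rows into dictionaries."""
    reader = csv.DictReader(handle, fieldnames=FIELDS, delimiter="\t")
    rows: List[Dict[str, object]] = []
    for row in reader:
        record: Dict[str, object] = dict(row)
        record["count"] = int(row["count"])  # number of occurrences
        rows.append(record)
    return rows


# Made-up sample rows, for illustration only.
SAMPLE = (
    "software engineer\toccupation\tperson who writes software\t1200\n"
    "Acme Corp\temployer\tfictional company\t3\n"
)

rows = read_labels(io.StringIO(SAMPLE))
print(rows[0]["text_data"], rows[0]["count"])
```

If your files do carry a header row, dropping the `fieldnames` argument lets `csv.DictReader` take the field names from the first line instead.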

Data Output Format

  • (same fields as the data input, plus)
  • date_updated - when the label was updated
  • previous_class_type - the previous class_type label
  • mislabeled_rank - records how low the confidence was prior to a re-label
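
To make the extra output fields concrete, here is a hedged sketch of how a relabeled row could be assembled. The helper `relabel_row` is hypothetical and is not taken from the pipeline's source; storing `date_updated` as an ISO date and `mislabeled_rank` as an integer are assumptions.

```python
from datetime import date
from typing import Dict, Optional


def relabel_row(
    row: Dict[str, object],
    new_class_type: str,
    mislabeled_rank: int,
    today: Optional[date] = None,
) -> Dict[str, object]:
    """Return a copy of row with the new label and the three audit fields."""
    updated = dict(row)  # leave the input row untouched
    updated["previous_class_type"] = row["class_type"]
    updated["class_type"] = new_class_type
    updated["date_updated"] = (today or date.today()).isoformat()
    updated["mislabeled_rank"] = mislabeled_rank
    return updated


# Made-up example row.
original = {
    "text_data": "Acme Corp",
    "class_type": "occupation",
    "context": "fictional company",
    "count": 3,
}
fixed = relabel_row(original, "employer", mislabeled_rank=1, today=date(2022, 1, 15))
print(fixed["previous_class_type"], "->", fixed["class_type"])
```

Keeping the previous label alongside the new one means a human reviewer can see, and undo, every automatic change.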

The Pipeline

  • Fetch
  • Prepare
  • Train
  • Relabel

For details, see the README in the src folder. The pipeline is orchestrated via the dvc.yaml file, and parameterized via params.yaml.

Using/Extending the pipeline

  1. Drop your own CSV files into the data/raw directory
  2. Run the pipeline
  3. Tune settings, embeddings, etc, until no longer amused
  4. Verify your results manually and by submitting data/final/data.csv for human evaluation, using random sampling and drawing heavily from the mislabeled_rank entries.
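
One way to build the human-evaluation sample described in step 4 is sketched below. It assumes that rows which were never relabeled have an empty `mislabeled_rank`; the function name `sample_for_review` and the 70% share drawn from relabeled rows are illustrative choices, not values from the pipeline.

```python
import random
from typing import Dict, List


def sample_for_review(
    rows: List[Dict[str, object]],
    n_total: int,
    flagged_share: float = 0.7,
    seed: int = 0,
) -> List[Dict[str, object]]:
    """Mix relabeled rows (drawn heavily) with a random sample of the rest."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    flagged = [r for r in rows if r.get("mislabeled_rank") not in (None, "")]
    others = [r for r in rows if r.get("mislabeled_rank") in (None, "")]
    n_flagged = min(len(flagged), round(n_total * flagged_share))
    n_other = min(len(others), n_total - n_flagged)
    picked = rng.sample(flagged, n_flagged) + rng.sample(others, n_other)
    rng.shuffle(picked)  # avoid showing reviewers all relabeled rows first
    return picked


# Made-up data: 10 relabeled rows and 90 untouched rows.
demo_rows = [{"text_data": f"item{i}", "mislabeled_rank": i} for i in range(10)]
demo_rows += [{"text_data": f"item{i}", "mislabeled_rank": ""} for i in range(10, 100)]
review = sample_for_review(demo_rows, n_total=10)
print(len(review), "rows selected for review")
```

The untouched rows in the sample give an estimate of errors the relabel stage missed, while the relabeled rows check the changes it made.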

Project Structure

├── LICENSE
├── README.md
├── data                    # <-- Directory with all types of data
│ ├── final                 # <-- Directory with final data
│ │ ├── class.metrics.csv   # <-- Class metrics after relabeling (confusion matrix data)
│ │ └── data.csv            # <-- Pipeline output (not stored in git)
│ ├── interim               # <-- Directory with temporary data
│ │ ├── datafile.0.csv
│ │ └── datafile.1.csv
│ ├── prepared              # <-- Directory with prepared data
│ │ └── data.all.csv
│ └── raw                   # <-- Directory with raw data; populated by pipeline's fetch stage
│     ├── README.md
│     ├── cc.en.300.bin               # <-- FastText binary model file, Common Crawl and Wikipedia
│     ├── crawl-300d-2M-subword.bin   # <-- FastText binary model file, Common Crawl
│     ├── crawl-300d-2M-subword.vec
│     ├── employers.wikidata.csv      # <-- Our initial data, 1 set of class labels 
│     ├── lid.176.ftz
│     └── occupations.wikidata.csv    # <-- Our initial data, 1 set of class labels
├── dvc.lock                # <-- DVC internal state tracking file
├── dvc.yaml                # <-- DVC project configuration file
├── dvc_plots               # <-- Temp directory for DVC plots; not tracked by git
│ └── README.md
├── model
│ ├── class.metrics.csv
│ ├── svm.model.pkl
│ └── train.metrics.json    # <-- Metrics from the pipeline's train stage  
├── mypy.ini
├── params.yaml             # <-- Parameter configuration file for the pipeline
├── reports                 # <-- Directory with metrics output
│ ├── prepare.metrics.json  
│ └── relabel.metrics.json
├── requirements-dev.txt
├── requirements.txt
├── runUnitTests.sh
└── src                     # <-- Directory containing the pipeline's code
    ├── README.md
    ├── fetch.py
    ├── prepare.py
    ├── relabel.py
    ├── train.py
    └── utils.py

Setup

Create environment

conda create --name auto-label-pipeline python=3.9

conda activate auto-label-pipeline

Install requirements

pip install -r requirements.txt

If you're going to modify the source, also install the development requirements listed in requirements-dev.txt.


Reproduce the pipeline results locally

dvc repro

View Metrics

dvc metrics show

See also: DVC metrics

Working with Experiments

To see your local experiments:

dvc exp show

Experiments that have been turned into branches can be referenced directly in commands. To compare two experiments:

dvc exp diff [experiment branch name] [experiment branch 2 name]

e.g.:

dvc exp diff svc_linear_ex svc_rbf_ex

dvc exp diff svc_poly_ex svc_rbf_ex

To create an experiment by changing a parameter:

dvc exp run --set-param train.split=0.9 --name my_split_ex

To save and share your experiment in a branch:

dvc exp branch my_split_ex my_split_ex_branch

(When promoting an experiment to a branch, DVC does not switch into the branch.)
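
When trying several values of one parameter, the `dvc exp run` invocations can be generated from a short script. The sketch below only builds the command lines, using the `--set-param` and `--name` flags shown above; the helper name, the naming scheme for experiments, and the list of split values are hypothetical.

```python
import shlex
from typing import List, Sequence


def split_experiment_commands(splits: Sequence[float]) -> List[List[str]]:
    """Build one `dvc exp run` command per train.split value."""
    commands: List[List[str]] = []
    for split in splits:
        # e.g. 0.9 -> "split_0_9_ex" (naming scheme is an arbitrary choice)
        name = "split_" + str(split).replace(".", "_") + "_ex"
        commands.append(
            ["dvc", "exp", "run", "--set-param", f"train.split={split}", "--name", name]
        )
    return commands


commands = split_experiment_commands([0.9, 0.8, 0.7])
for cmd in commands:
    print(shlex.join(cmd))
```

To actually launch the experiments, each command list can be passed to `subprocess.run(cmd, check=True)` from the repository root.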

See also: DVC Experiments

View plots

Initial confusion matrix:

dvc plots show model/class.metrics.csv -x actual -y predicted --template confusion

Confusion matrix after relabeling:

dvc plots show data/final/class.metrics.csv -x actual -y predicted --template confusion

See also: DVC plots


Conclusions

  • For relabeling and cleaning, it's important to have more than two labels, and to specify an UNK label for: unknown items; labels spanning multiple groups; or low-confidence support.
  • Standardizing the input data formats allows users to flexibly use many different data sources.
  • Language detection is an important part of data cleaning, but it is problematic because:
    • Modern languages sometimes "borrow" words from other languages (but not just any words!)
    • Language detection models perform inference poorly with limited data, especially just a single word.
    • Normalization utilities, such as unidecode, aren't helpful (the wrong word in more readable letters is still the wrong word).
  • Experimentation parameters often have co-dependencies that would make a simple combinatorial grid search inefficient.
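
The UNK recommendation above can be sketched as a small decision rule: fall back to UNK when the top class is not confident enough, or when two classes are too close to call. The function name, the class names, and both thresholds are illustrative assumptions, not values used by the pipeline.

```python
from typing import Dict


def assign_label(
    probs: Dict[str, float],
    min_confidence: float = 0.6,
    min_margin: float = 0.1,
) -> str:
    """Pick the most probable class, or "UNK" when support is weak or ambiguous."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_prob = ranked[0]
    runner_up_prob = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_prob < min_confidence:
        return "UNK"  # low-confidence support
    if best_prob - runner_up_prob < min_margin:
        return "UNK"  # plausibly spans multiple groups
    return best_label


# Made-up probabilities for illustration.
print(assign_label({"employer": 0.90, "occupation": 0.10}))
print(assign_label({"employer": 0.50, "occupation": 0.50}))
print(assign_label({"employer": 0.45, "occupation": 0.40, "other": 0.15}))
```

Both thresholds interact with the classifier's calibration, so they are natural candidates for the experiment tracking described earlier.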

Recommended readings:

  • Confident Learning: Estimating Uncertainty in Dataset Labels, by Curtis G. Northcutt, Lu Jiang, Isaac L. Chuang, arXiv, 31 Oct 2019
  • A Simple but Tough-to-Beat Baseline for Sentence Embeddings, by Sanjeev Arora, Yingyu Liang, Tengyu Ma, ICLR 2017
  • Support Vector Clustering, by Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik, Journal of Machine Learning Research 2 (12):125-137, November 2001, DOI:10.1162/15324430260185565
  • SVM clustering, by Winters-Hilt, S., Merat, S., BMC Bioinformatics 8, S18 (2007)

Note: this repo layout borrows heavily from the Cookie Cutter Data Science Layout. If you're not familiar with it, please check it out.

Owner
Todd Cook
Software craftsman