A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

Overview

About

This repository provides data and code for the paper:

Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (submitted to NeurIPS 2021 Track on Datasets and Benchmarks Round2)

Authors: Mingkuan Liu, Chi Zhang, Hua Xing, Chao Feng, Monchu Chen, Judith Bishop, Grace Ngapo

Keywords: speech processing, speech dataset, human in the loop, annotation pipeline, quality assurance, speech annotation

Abstract

This paper introduces a human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets. The pipeline combines human and machine advantages to more quickly, accurately, and cost-effectively annotate datasets with machine pre-labeling and fully manual auditing. Quality control mechanisms such as blind testing, behavior monitoring, and data validation have been adopted in the annotation pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results demonstrated the HITL pipeline can improve annotation speed and capacity by at least 80% and quality is comparable to or higher than manual double pass annotation. We are leveraging this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand to 10,000+ hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. We are also releasing the code of the data pre-processing and pre-tagging pipeline under the Apache 2.0 license to allow reproduction of the results reported in the paper. Code and data are available in https://github.com/Appen/UHV-OTS-Speech

HITL speech corpora development system pipeline for UHV-OTS corpora

Reproduce the automated machine pre-labeling results reported in the paper

0. Experiment environment setup

We use Docker to run all experiments and data processing for corpus construction. To illustrate the algorithms used in the automatic modules of our pipeline, we built a Docker environment containing the test or demo scripts for each module. After cloning this repo, run the docker build command below.

cd UHV-OTS-Speech
docker build -t uhv-ots-speech-demo:cpu ./

After the image has been built, run it in a container.

docker run -it uhv-ots-speech-demo:cpu /bin/bash

Inside the container, /opt/scripts contains several subfolders, each holding the test/demo scripts for one module.

1. Data pre-filtering: synthetic speech detection

We used the algorithm proposed in Towards End-to-End Synthetic Speech Detection and adopted the library and pre-trained models from the authors' GitHub repo. The original work achieved a synthetic speech detection EER as low as 2.16% on in-domain test data and 1.95% on cross-domain data. We developed a simple demo script that runs on part of ASVspoof 2019 and outputs the detection results and likelihoods.

If full testing is needed, please run the code in the original authors' repo. To download the ASVspoof 2019 and 2015 data, run the following commands inside the container:

cd /opt/scripts/synthetic_detection
./download.sh

If you only want to see how the module works, run the following commands inside the container:

cd /opt/scripts/synthetic_detection
./run_demo.sh 
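
The EER quoted above is the operating point where the false acceptance rate and the false rejection rate are equal. As a rough illustration only, the sketch below approximates it from two lists of detection scores by sweeping thresholds. The function name and the convention that higher scores mean "bona fide" are assumptions made for this example; this is not the evaluation code from the original authors' repo.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumes a higher score means "more likely bona fide" and that a trial
    is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: bona fide speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, made up for illustration.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
```

With a finite set of scores the two error curves may not cross exactly, so the sketch reports the midpoint at the closest threshold.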

2. Data pre-processing: music/vocal source separation

We used the well-performing Spleeter library for source separation. Spleeter is Deezer's source separation library, introduced in "Spleeter: a fast and efficient music source separation tool with pre-trained models". We provide a script that runs this tool on web-scraped audio files. To run the tool on the sample files, run the following commands inside the container.

cd /opt/scripts/source_separation
./run_demo.sh

The script tries to separate each audio file in the ./sample_aduio folder into two files, *_bgm.wav and *_speech.wav, both in mono 16 kHz 16-bit linear PCM WAV format. The rest of the automatic processing is performed on the *_speech.wav file, which is treated as the speech channel of the original audio.
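
To confirm that a separated file has the format described above, the WAV header can be inspected with Python's standard wave module. The sketch below is illustrative: the helper names and the demo file name are hypothetical, and the silent file is generated only so that the check has something to read.

```python
import os
import tempfile
import wave


def matches_expected_format(path, channels=1, rate=16000, sample_width=2):
    """Return True if the WAV header is mono, 16 kHz, 16-bit (2 bytes) by default."""
    with wave.open(path, "rb") as wav:
        actual = (wav.getnchannels(), wav.getframerate(), wav.getsampwidth())
    return actual == (channels, rate, sample_width)


def write_silence(path, seconds=0.1, channels=1, rate=16000, sample_width=2):
    """Write a short silent PCM WAV file, used here only to exercise the check."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setframerate(rate)
        wav.setsampwidth(sample_width)
        wav.writeframes(b"\x00" * int(seconds * rate) * channels * sample_width)


if __name__ == "__main__":
    # Hypothetical file name following the *_speech.wav pattern.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo_speech.wav")
    write_silence(demo_path)
    print(matches_expected_format(demo_path))
```

In practice you would call matches_expected_format on the *_speech.wav and *_bgm.wav files produced by the demo script.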

3. Data pre-filtering: language/accent identification

We apply language identification to pre-filter the raw audio data and ensure that the data is correctly routed to the corresponding language data processing pipeline. We trained a language ID system based on the x-vector, which was introduced in "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION". The x-vector model was trained with the VoxLingua107 dataset, and the language ID algorithm achieved 93% accuracy on the VoxLingua107 dev set.

The language ID module was developed based on the Kaldi recipe. The model and x-vectors have been prepared and stored in this folder. To run the test and get the EER, run the commands below inside the container:

cd /opt/scripts/language_id
./run_test.sh

Accent identification is more challenging than language identification. We’ve adopted the x-vector plus LDA/PLDA framework to detect twenty-two different English accents using proprietary data. Our current accent detection accuracy is 75%. The x-vector model, the x-vectors of the training and testing data, and the LDA/PLDA classifier model have been prepared and stored in this folder. To check the performance, run the commands below inside the container:

cd /opt/scripts/accent_id
./run_test.sh

4. Data pre-tagging: speech detection

This folder contains the demo scripts for speech segmentation. The speech segmentation is adopted from inaSpeechSegmenter, which was introduced in AN OPEN-SOURCE SPEAKER GENDER DETECTION FRAMEWORK FOR MONITORING GENDER EQUALITY. We used only its speech detection module and its pretrained model, which can be found in the original authors' repo.

The inaSpeechSegmenter system won first place in the Music and/or Speech Detection task of the Music Information Retrieval Evaluation eXchange 2018 (MIREX 2018). This module also achieved 97.5% detection accuracy with an average boundary mismatch of 97 ms on Appen's proprietary test set. To run the demo of this module, run the following commands inside the container:

cd /opt/scripts/speech_detection
./run_demo.sh

You can check the output CSV file in the ./output folder.
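
The average boundary mismatch quoted above can be read as the mean absolute time difference between detected and reference segment boundaries. The source does not give the exact definition used for the 97 ms figure, so the sketch below is one plausible reading under a simplifying assumption: segments are already paired one to one, and each is a (start, end) tuple in seconds.

```python
def mean_boundary_mismatch(reference, hypothesis):
    """Mean |reference - hypothesis| in seconds over all start and end boundaries.

    Assumes both lists hold (start, end) tuples in seconds and that the
    i-th hypothesis segment corresponds to the i-th reference segment.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segments must be paired one to one")
    if not reference:
        raise ValueError("at least one segment is required")
    diffs = []
    for (ref_start, ref_end), (hyp_start, hyp_end) in zip(reference, hypothesis):
        diffs.append(abs(ref_start - hyp_start))
        diffs.append(abs(ref_end - hyp_end))
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy segments, made up for illustration.
    ref = [(0.0, 1.0), (2.0, 3.0)]
    hyp = [(0.1, 1.0), (2.0, 2.7)]
    print(mean_boundary_mismatch(ref, hyp))
```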

5. Data pre-tagging: speaker diarization

This speaker diarization system was developed based on BUT's diarization system, introduced in Analysis of the BUT Diarization System for VoxConverse Challenge.

The speaker diarization framework generally involves an embedding stage followed by a clustering stage.

We tested the pipeline with the VoxConverse corpus, an audio-visual diarization dataset consisting of over 50 hours of multi-speaker clips of human speech, extracted from videos collected on the internet. The DER achieved on VoxConverse using the BUT system is 4.41%, which is consistent with the result in BUT's report.

To download the dataset, run the following commands inside the container:

cd /opt/scripts/speaker_diarization
./download.sh

After the data has been downloaded, run the test on the VoxConverse data with the commands below inside the container:

cd /opt/scripts/speaker_diarization
./run_test.sh
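
DER (diarization error rate) is the fraction of reference speech time that is missed, falsely detected, or attributed to the wrong speaker, under the best mapping between reference and hypothesis speaker labels. The sketch below is a simplified frame-level version, written only to make the metric concrete. It assumes one label per frame (None for non-speech, so no overlapping speech) and searches speaker mappings by brute force, which is only practical for a handful of speakers. Standard scoring tools work on time segments and differ in details such as forgiveness collars, so this sketch is not expected to reproduce the number above.

```python
from itertools import permutations


def frame_level_der(reference, hypothesis):
    """Simplified DER over per-frame speaker labels (None = non-speech)."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")
    speech_frames = sum(r is not None for r in reference)
    if speech_frames == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    # Frames where both sides contain speech; only these can be confused.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_speakers = list({r for r, _ in both})
    hyp_speakers = list({h for _, h in both})
    # Padding with None lets a reference speaker stay unmapped.
    candidates = hyp_speakers + [None] * len(ref_speakers)
    confusion = len(both)
    for perm in permutations(candidates, len(ref_speakers)):
        mapping = dict(zip(ref_speakers, perm))
        errors = sum(mapping[r] != h for r, h in both)
        confusion = min(confusion, errors)
    return (missed + false_alarm + confusion) / speech_frames


if __name__ == "__main__":
    # Toy labels, made up for illustration.
    ref = ["A", "A", "B", "B", None]
    hyp = ["x", "x", "y", "x", "y"]
    print(frame_level_der(ref, hyp))
```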

6. Data pre-tagging: speaker clustering & identification

We used the ECAPA-TDNN embedding algorithm, introduced in "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification", to generate speaker embeddings, which are used for speaker identification. Our pipeline adopts a pre-trained embedding model from the SpeechBrain toolkit, which produces an EER of 0.7% on the VoxCeleb1 dataset.

To check the system's performance, download the VoxCeleb1 data and then run the test inside the container:

cd /opt/scripts/SpeakerSec/
./download.sh
./run_test.sh
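
Once speaker embeddings are available, identification reduces to comparing a new embedding with enrolled ones. The sketch below uses cosine similarity, a common scoring choice for such embeddings; the source does not state which back end the pipeline uses, so treat the scoring function, the threshold value, and the toy two-dimensional vectors as assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def identify_speaker(embedding, enrolled, threshold=0.5):
    """Return (speaker_id or None, best_score) for the closest enrolled speaker.

    `enrolled` maps speaker IDs to reference embeddings. The threshold is a
    placeholder; a real system would tune it on held-out data.
    """
    best_id, best_score = None, float("-inf")
    for speaker_id, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        # No enrolled speaker is similar enough: treat as unknown.
        return None, best_score
    return best_id, best_score


if __name__ == "__main__":
    enrolled = {"spk_a": [1.0, 0.0], "spk_b": [0.0, 1.0]}
    print(identify_speaker([0.9, 0.1], enrolled))
```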

7. Data pre-tagging: gender detection

A framework combining an x-vector embedding model with a Multi-layer Perceptron (MLP) classifier is implemented in the gender_detection folder. We used the x-vector model introduced in "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION". The pretrained x-vector model was used to extract the x-vectors of the training and test data for the MLP. Our gender detection model achieved 99.85% accuracy on the VoxCeleb1 test set, introduced in "VoxCeleb: a large-scale speaker identification dataset". To run the gender detection test and check the results, run the following commands inside the container:

cd /opt/scripts/gender_detection
./run_test.sh

8. Data pre-tagging: speech recognition/transcription

To run the experiments on the Librispeech test-clean and test-other data with our own chain model, first download the Librispeech data by running the following commands inside the container.

cd /opt/scripts/asr_kaldichain
./download_prepare_extract.sh

The test-clean and test-other data will be downloaded inside the container.

In this module, we trained our own ASR model using the Kaldi toolkit, introduced in "The kaldi speech recognition toolkit". Specifically, we used the chain model recipe introduced in "Purely sequence-trained neural networks for ASR based on lattice-free MMI", which can be found in Kaldi's repo. We trained our model on the 11 corpora we had at hand, including free public corpora, purchased corpora, and self-owned corpora.

To run the test on the Librispeech test-other and test-clean data with our trained model, run the following commands inside the container.

cd /opt/scripts/asr_kaldichain
./run_test.sh
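
ASR results on Librispeech are usually reported as word error rate (WER): the word-level edit distance between the reference and the recognized text, divided by the number of reference words. The sketch below is a minimal, self-contained version for single utterances, written for illustration. It is not Kaldi's scoring script, and it applies no text normalization beyond splitting on whitespace.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Corpus-level WER is normally computed by summing the edit counts over all utterances before dividing, rather than averaging per-utterance rates.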

9. Data pre-tagging: domain/topic detection

So far, we have adopted a topic detection pipeline based on multi-label text classification using BERT, described on the original author's webpage. The original author developed it on top of BERT, applying BERT to the problem of multi-label text classification. We assembled the original scripts from the repo to replicate Kaggle’s Toxic Comment Classification Challenge and benchmark BERT’s performance on multi-label text classification.

To run the benchmark test, run the following commands inside the container:

cd /opt/scripts/topic_detection
./run_test.sh
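
In multi-label classification, each label gets its own independent probability, so a text can receive zero, one, or several labels. The sketch below shows only that decision step, assuming the classifier emits one raw score (logit) per label. The six label names are those of the Toxic Comment Classification Challenge; the logit values and the 0.5 threshold are made up for illustration and do not come from the benchmark scripts.

```python
import math

# Label set of Kaggle's Toxic Comment Classification Challenge.
TOXIC_LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose independent probability reaches the threshold."""
    if len(logits) != len(label_names):
        raise ValueError("one logit per label is required")
    return [name for name, z in zip(label_names, logits) if sigmoid(z) >= threshold]


if __name__ == "__main__":
    # Hypothetical logits for a single comment.
    example_logits = [2.0, -3.0, 0.1, -1.0, -0.2, -0.5]
    print(predict_labels(example_logits, TOXIC_LABELS))
```

This differs from single-label classification, where a softmax forces the label probabilities to compete and exactly one label is chosen.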

UHV-OTS dataset format

A detailed explanation of the UHV-OTS dataset format is attached here.

Sample codes to parse UHV-OTS dataset to Kaldi style format

The script generate_kaldi_file.py generates the Kaldi-format files needed to run a Kaldi experiment. After you have acquired a batch of UHV-OTS-Speech data, you can run this script as follows:

./generate_kaldi_file.py path-to-batch-data

This repo includes a sample batch of data in ./sample_dataset. You can run the conversion script on that folder to inspect the generated Kaldi files.
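
For readers unfamiliar with Kaldi data directories, the sketch below writes the three core files: wav.scp (recording ID and audio path), text (utterance ID and transcript), and utt2spk (utterance ID and speaker ID), each sorted by ID. It is not the repo's generate_kaldi_file.py. The function name and the input record fields (utt_id, spk_id, wav_path, text) are hypothetical, and the sketch assumes one utterance per audio file so that no segments file is needed.

```python
import os
import tempfile


def write_kaldi_dir(utterances, out_dir):
    """Write wav.scp, text, and utt2spk for a list of utterance records.

    Each record is a dict with the hypothetical keys utt_id, spk_id,
    wav_path, and text. Utterance IDs that start with the speaker ID keep
    the utterance and speaker sort orders consistent.
    """
    os.makedirs(out_dir, exist_ok=True)
    rows = sorted(utterances, key=lambda u: u["utt_id"])
    files = {
        "wav.scp": lambda u: f'{u["utt_id"]} {u["wav_path"]}',
        "text": lambda u: f'{u["utt_id"]} {u["text"]}',
        "utt2spk": lambda u: f'{u["utt_id"]} {u["spk_id"]}',
    }
    for name, make_line in files.items():
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(make_line(row) + "\n")


if __name__ == "__main__":
    # Made-up records for illustration.
    demo = [
        {"utt_id": "spk2-u1", "spk_id": "spk2", "wav_path": "/data/2.wav", "text": "hello"},
        {"utt_id": "spk1-u1", "spk_id": "spk1", "wav_path": "/data/1.wav", "text": "hi there"},
    ]
    out = tempfile.mkdtemp()
    write_kaldi_dir(demo, out)
    print(sorted(os.listdir(out)))
```

Kaldi can derive the spk2utt file from utt2spk, and it provides a validation script for checking a data directory, so it is worth running that on any generated output.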

Speech Annotation Instruction

The detailed annotation guidelines are attached here.

License

Software license

The code and pre-trained models of our speech data pre-processing and pre-tagging pipeline are under the Apache 2.0 license to allow reproduction of the results reported in the paper.

Dataset license

The UHV-OTS speech corpora development is an ongoing, long-term Appen project to support commercial and academic research data needs for tasks related to speech processing.

Dataset consumers can visit https://appen.com/off-the-shelf-datasets/ to order existing datasets or contact us to discuss their specific dataset needs. Appen will consolidate those needs and adjust our UHV-OTS delivery pipeline accordingly, to deliver the datasets in highest demand.

Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. These free datasets will be downloadable from Appen's https://appen.com/open-source-datasets/ website. The first batch of freely available datasets will be released in late 2021.

References
