A data annotation pipeline to generate high-quality, large-scale speech datasets with machine pre-labeling and fully manual auditing.

Overview

About

This repository provides data and code for the paper:

Scalable Data Annotation Pipeline for High-Quality Large Speech Datasets Development (submitted to NeurIPS 2021 Track on Datasets and Benchmarks, Round 2)

Authors: Mingkuan Liu, Chi Zhang, Hua Xing, Chao Feng, Monchu Chen, Judith Bishop, Grace Ngapo

Keywords: speech processing, speech dataset, human in the loop, annotation pipeline, quality assurance, speech annotation

Abstract

This paper introduces a human-in-the-loop (HITL) data annotation pipeline to generate high-quality, large-scale speech datasets. The pipeline combines human and machine advantages to more quickly, accurately, and cost-effectively annotate datasets with machine pre-labeling and fully manual auditing. Quality control mechanisms such as blind testing, behavior monitoring, and data validation have been adopted in the annotation pipeline to mitigate potential bias introduced by machine-generated labels. Our A/B testing and pilot results demonstrated the HITL pipeline can improve annotation speed and capacity by at least 80% and quality is comparable to or higher than manual double-pass annotation. We are leveraging this scalable pipeline to create and continuously grow ultra-high volume off-the-shelf (UHV-OTS) speech corpora for multiple languages, with the capability to expand to 10,000+ hours per language annually. Customized datasets can be produced from the UHV-OTS corpora using dynamic packaging. UHV-OTS is a long-term Appen project to support commercial and academic research data needs in speech processing. Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. We are also releasing the code of the data pre-processing and pre-tagging pipeline under the Apache 2.0 license to allow reproduction of the results reported in the paper. Code and data are available at https://github.com/Appen/UHV-OTS-Speech.

HITL speech corpora development system pipeline for UHV-OTS corpora

Reproduce the automated machine pre-labeling results reported in the paper

0. Experiment environment setup

We use Docker to run all experiments and data processing for corpus construction. To illustrate the algorithms used in the automatic modules of our pipeline, we built a Docker environment containing the test or demo scripts for each module. After cloning this repo, run the docker build command below.

cd UHV-OTS-Speech
docker build -t uhv-ots-speech-demo:cpu ./

After the image has been built, run it in a container:

docker run -it uhv-ots-speech-demo:cpu /bin/bash

Inside the container, /opt/scripts contains several subfolders, each holding the test or demo scripts for one module.

1. Data pre-filtering: synthetic speech detection

We use the algorithm proposed in Towards End-to-End Synthetic Speech Detection and adopt the library and pre-trained models from the authors' GitHub repo. The original work achieved a synthetic speech detection EER as low as 2.16% on in-domain test data and 1.95% on cross-domain data. We developed a simple demo script that runs on a subset of ASVspoof 2019 and outputs the detection results and likelihoods.

If full testing is needed, please run the code in the original authors' repo. To download the ASVspoof 2019 and 2015 data, run the following commands inside the container:

cd /opt/scripts/synthetic_detection
./download.sh

If you only want to see how the module works, run the following commands inside the container:

cd /opt/scripts/synthetic_detection
./run_demo.sh 

2. Data pre-processing: music/vocal source separation

We use the well-performing Spleeter library for source separation. Spleeter is Deezer's source separation library and was introduced in "Spleeter: a fast and efficient music source separation tool with pre-trained models". We provide a script to run this tool on web-scraped audio files. To run the tool on the sample files, run the following commands inside the container:

cd /opt/scripts/source_separation
./run_demo.sh

The script tries to separate each audio file in the ./sample_aduio folder into two files, one *_bgm.wav and one *_speech.wav, both in mono 16 kHz 16-bit linear PCM WAV format. The rest of the automatic processing is performed on the *_speech.wav file, which is treated as the speech channel of the original audio.
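
The output format stated above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names and the demo file name are hypothetical and not part of this repo.

```python
import os
import tempfile
import wave


def is_expected_format(path, rate=16000, channels=1, sample_width=2):
    """Return True if the WAV file is uncompressed PCM, mono, 16 kHz, 16-bit."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == channels
            and wav.getframerate() == rate
            and wav.getsampwidth() == sample_width  # 2 bytes = 16 bits
            and wav.getcomptype() == "NONE"         # "NONE" means linear PCM
        )


def write_silence(path, seconds=0.1, rate=16000, channels=1, sample_width=2):
    """Write a short silent PCM WAV file (only used to exercise the check)."""
    n_bytes = int(seconds * rate) * channels * sample_width
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * n_bytes)


# Demo on a generated file; replace the path with a real *_speech.wav file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_speech.wav")
    write_silence(demo_path)
    print(is_expected_format(demo_path))
```

Running the same check over the separated files is a cheap way to catch a file that was written with a different sample rate or channel count before it enters the later modules.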

3. Data pre-filtering: language/accent identification

We apply language identification to pre-filter the raw audio data and ensure that the data is correctly routed to the corresponding language data processing pipeline. We trained a language ID system based on x-vectors, which were introduced in "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION". The x-vector model was trained with the VoxLingua107 dataset, and the language ID algorithm achieved 93% accuracy on the VoxLingua107 dev set.

The language ID module was developed based on the Kaldi recipe. The model and x-vectors have been prepared and stored in this folder. To run the test and get the EER, run the following commands inside the container:

cd /opt/scripts/language_id
./run_test.sh
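
For readers unfamiliar with the metric, the equal error rate (EER) is the operating point where the false acceptance rate equals the false rejection rate. The sketch below is a simple threshold sweep in plain Python under that definition; it is not the scoring code used by run_test.sh, and the function name and the scores are illustrative assumptions.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for threshold in thresholds:
        # False rejection: a target trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a non-target trial scored at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Illustrative scores only, not results from this repo.
print(equal_error_rate([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.3, 0.7]))
```

With few trials the two error rates may never cross exactly, so the sketch reports the average of the two at the threshold where they are closest.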

Accent identification is more challenging than language identification. We've adopted the x-vector plus LDA/PLDA framework to detect twenty-two different English accents using proprietary data. Our current accent detection accuracy is 75%. The x-vector model, the x-vectors of the training and test data, and the LDA/PLDA classifier model have been prepared and stored in this folder. To check the performance, run the following commands inside the container:

cd /opt/scripts/accent_id
./run_test.sh

4. Data pre-tagging: speech detection

This folder contains the demo scripts for speech segmentation. The speech segmentation is adopted from inaSpeechSegmenter, which was introduced in AN OPEN-SOURCE SPEAKER GENDER DETECTION FRAMEWORK FOR MONITORING GENDER EQUALITY. We use only its speech detection module and its pre-trained model, which can be found in the original authors' repo.

The inaSpeechSegmenter system won first place in the Music and/or Speech Detection task of the Music Information Retrieval Evaluation eXchange 2018 (MIREX 2018). This module also achieved 97.5% detection accuracy with an average boundary mismatch of 97 ms on Appen's proprietary test set. To run the demo of this module, run the following commands inside the container:

cd /opt/scripts/speech_detection
./run_demo.sh

The output CSV file is written to the ./output folder.
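
A segmentation CSV can be summarized with a short script. The sketch below assumes a tab-separated file with a header row and columns named labels, start, and stop (times in seconds); that layout and the label values are assumptions to verify against the actual file in ./output, and the function name is hypothetical.

```python
import csv
import io


def total_speech_seconds(csv_text, speech_label="speech"):
    """Sum the durations of all segments labelled as speech.

    Assumes tab-separated columns: labels, start, stop (seconds).
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter="\t")
    return sum(
        float(row["stop"]) - float(row["start"])
        for row in reader
        if row["labels"] == speech_label
    )


# Illustrative content only; read the real file with open(path).read().
sample = (
    "labels\tstart\tstop\n"
    "noEnergy\t0.0\t0.5\n"
    "speech\t0.5\t3.0\n"
    "music\t3.0\t4.0\n"
    "speech\t4.0\t5.5\n"
)
print(total_speech_seconds(sample))
```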

5. Data pre-tagging: speaker diarization

This speaker diarization system is based on BUT's diarization system, introduced in Analysis of the BUT Diarization System for VoxConverse Challenge.

The speaker diarization framework generally involves an embedding stage followed by a clustering stage.

We tested the pipeline with the VoxConverse corpus, an audio-visual diarization dataset consisting of over 50 hours of multi-speaker clips of human speech, extracted from videos collected on the internet. The DER achieved on VoxConverse using the BUT system is 4.41%, which is consistent with the result in BUT's report.

To download the dataset, run the following commands inside the container:
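
As a reminder of how the metric is defined, the diarization error rate (DER) is the sum of false alarm, missed speech, and speaker confusion time divided by the total reference speech time. The sketch below only restates that standard definition; the function name and the durations are illustrative and are not taken from the VoxConverse evaluation.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_reference):
    """DER = (false alarm + missed speech + speaker confusion) / reference time.

    All arguments are durations in seconds.
    """
    if total_reference <= 0:
        raise ValueError("total_reference must be positive")
    return (false_alarm + missed + confusion) / total_reference


# Illustrative durations only.
print(diarization_error_rate(3.0, 5.0, 2.0, 200.0))
```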

cd /opt/scripts/speaker_diarization
./download.sh

After the data has been downloaded, run the test on the VoxConverse data with the following commands inside the container:

cd /opt/scripts/speaker_diarization
./run_test.sh

6. Data pre-tagging: speaker clustering & identification

We use the ECAPA-TDNN embedding algorithm, introduced in ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification, to generate speaker embeddings, which are used for speaker identification. Our pipeline adopts a pre-trained embedding model from the SpeechBrain toolkit, which achieves an EER of 0.7% on the VoxCeleb1 dataset.

Please download the VoxCeleb1 data and then run the test to check the system's performance inside the container:

cd /opt/scripts/SpeakerSec/
./download.sh
./run_test.sh
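
Once embeddings are available, identification can be reduced to comparing vectors. The sketch below scores a test embedding against enrolled speakers with cosine similarity, which is a common choice but an assumption here; the scoring back end used by run_test.sh is not described in this README. The function names, the threshold, and the toy 3-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def identify_speaker(embedding, enrolled, threshold=0.5):
    """Return (speaker_id, score) for the closest enrolled speaker.

    speaker_id is None when the best score is below the threshold.
    """
    best_id, best_score = None, -1.0
    for speaker_id, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


# Toy embeddings; real speaker embeddings have far more dimensions.
enrolled = {"spk_a": [1.0, 0.0, 0.0], "spk_b": [0.0, 1.0, 0.0]}
print(identify_speaker([0.9, 0.1, 0.0], enrolled))
```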

7. Data pre-tagging: gender detection

An x-vector embedding model plus Multi-layer Perceptron (MLP) classifier framework is implemented in the gender_detection folder. We used the x-vector model introduced in "X-VECTORS: ROBUST DNN EMBEDDINGS FOR SPEAKER RECOGNITION". The pre-trained x-vector model was used to extract the x-vectors of the training and test data for the MLP. Our gender detection model achieved 99.85% accuracy on the VoxCeleb1 test set, introduced in VoxCeleb: a large-scale speaker identification dataset. To run the gender detection test and check the results, run the following commands inside the container:

cd /opt/scripts/gender_detection
./run_test.sh

8. Data pre-tagging: speech recognition/transcription

To run the experiments on Librispeech test-clean and test-other data with our own Chain model, please run the following command to download Librispeech data inside the container.

cd /opt/scripts/asr_kaldichain
./download_prepare_extract.sh

The test-clean and test-other data will be downloaded inside the container.

In this module, we trained our own ASR model using the Kaldi toolkit, introduced in "The kaldi speech recognition toolkit", specifically using the chain model recipe introduced in "Purely sequence-trained neural networks for ASR based on lattice-free MMI", which can be found in Kaldi's repo. We trained our model on 11 corpora, including free public corpora, purchased corpora, and self-owned corpora.

To run the test on Librispeech test-other and test-clean data with our trained model, please run the following command, inside the container.

cd /opt/scripts/asr_kaldichain
./run_test.sh
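
ASR test results of this kind are usually reported as word error rate (WER). Assuming that is the metric printed by the test script, the sketch below shows how WER is defined: the word-level edit distance between reference and hypothesis divided by the number of reference words. It is a plain-Python illustration, not Kaldi's scoring code, and the function name and sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substitution and one deletion over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```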

9. Data pre-tagging: domain/topic detection

So far we have adopted a topic detection pipeline based on Multi-label Text Classification using BERT, introduced in this webpage. It was developed by the original author and applies BERT to the problem of multi-label text classification. We assembled the original scripts from the repo to replicate Kaggle's Toxic Comment Classification Challenge, which benchmarks BERT's performance on multi-label text classification.

To run the benchmark test, run the following commands inside the container:

cd /opt/scripts/topic_detection
./run_test.sh

UHV-OTS dataset format

A detailed explanation of the UHV-OTS dataset format is attached here.

Sample code to parse a UHV-OTS dataset into Kaldi-style format

The script generate_kaldi_file.py is provided to generate the Kaldi-format files needed to run Kaldi experiments. After you acquire a batch of UHV-OTS-Speech data, run the script as follows:

./generate_kaldi_file.py path-to-batch-data

This repo includes a sample batch in ./sample_dataset; you can run the conversion script on that folder to inspect the generated Kaldi files.
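
For orientation, a Kaldi data directory is built around a few plain-text files keyed by utterance ID, typically wav.scp, text, and utt2spk. The sketch below writes those three files from a list of records. It is not generate_kaldi_file.py; the record fields (utt_id, speaker_id, wav_path, transcript) and the function name are hypothetical, since the UHV-OTS layout is described in the separate format document.

```python
import os
import tempfile


def write_kaldi_files(records, out_dir):
    """Write wav.scp, text and utt2spk for a list of utterance records."""
    os.makedirs(out_dir, exist_ok=True)
    # Kaldi expects each file to be sorted by utterance ID; prefixing the
    # utterance ID with the speaker ID keeps the sort orders consistent.
    ordered = sorted(records, key=lambda rec: rec["utt_id"])
    columns = {"wav.scp": "wav_path", "text": "transcript", "utt2spk": "speaker_id"}
    for file_name, field in columns.items():
        path = os.path.join(out_dir, file_name)
        with open(path, "w", encoding="utf-8") as handle:
            for rec in ordered:
                handle.write(f"{rec['utt_id']} {rec[field]}\n")


# Hypothetical records for illustration only.
records = [
    {"utt_id": "spk2-utt1", "speaker_id": "spk2",
     "wav_path": "/data/spk2_1.wav", "transcript": "good morning"},
    {"utt_id": "spk1-utt1", "speaker_id": "spk1",
     "wav_path": "/data/spk1_1.wav", "transcript": "hello world"},
]
with tempfile.TemporaryDirectory() as tmp:
    write_kaldi_files(records, tmp)
    with open(os.path.join(tmp, "utt2spk"), encoding="utf-8") as handle:
        print(handle.read(), end="")
```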

Speech Annotation Instruction

The detailed annotation guideline is attached here.

License

Software license

The code and pre-trained models of our speech data pre-processing and pre-tagging pipeline are under the Apache 2.0 license to allow reproduction of the results reported in the paper.

Dataset license

The UHV-OTS speech corpora development is an ongoing, long-term Appen project to support commercial and academic research data needs for tasks related to speech processing.

Dataset consumers can visit https://appen.com/off-the-shelf-datasets/ to order existing datasets or contact us to discuss their specific dataset needs. Appen will consolidate those needs and adjust our UHV-OTS delivery pipeline accordingly, to deliver datasets of highest demand.

Appen will donate a number of free speech datasets from the UHV-OTS each year to support academic and open source community research under the CC-BY-SA license. These free datasets will be downloadable from Appen's https://appen.com/open-source-datasets/ website. The first batch of freely available datasets will be released in late 2021.
