A 10000+ hours dataset for Chinese speech recognition

Last update: Jan 03, 2023

Overview

WenetSpeech

Official website | Paper

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

Download

Please visit the official website, read the license, and follow the instructions to download the data.

Benchmark

Toolkit	Dev	Test_Net	Test_Meeting	AIShell-1
Kaldi	9.07	12.83	24.72	5.41
ESPNet	9.70	8.90	15.90	3.90
WeNet	8.88	9.70	15.59	4.61

Description

Creation

All the data are collected from YouTube and Podcast. Each YouTube recording is labeled using optical character recognition (OCR), and each Podcast recording is labeled using automatic speech recognition (ASR). To improve the quality of the corpus, we use a novel end-to-end label error detection method to further validate and filter the data.

Categories

Set	Hours	Confidence	Usage
High Label	10005	>=0.95	Supervised Training
Weak Label	2478	[0.6, 0.95]	Semi-supervised or noisy training
Unlabel	9952	/	Unsupervised training or Pre-training
In Total	22435	/	All above
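
The confidence ranges in the table can be read as a simple routing rule. The sketch below is illustrative only and is not taken from the WenetSpeech tooling: the function name `assign_set` is hypothetical, treating a score of exactly 0.95 as High Label is an assumption (the table lists 0.95 in both ranges), and the table does not say what happens to scores below 0.6.

```python
from typing import Optional

HIGH_THRESHOLD = 0.95  # from the table: High Label is >= 0.95
WEAK_THRESHOLD = 0.6   # from the table: Weak Label starts at 0.6


def assign_set(confidence: Optional[float]) -> str:
    """Map a label-confidence score to a set name (hypothetical helper)."""
    if confidence is None:
        # No transcript and therefore no confidence score.
        return "Unlabel"
    if confidence >= HIGH_THRESHOLD:
        # Assumption: the shared boundary 0.95 goes to High Label.
        return "High Label"
    if confidence >= WEAK_THRESHOLD:
        return "Weak Label"
    # The table does not cover scores below 0.6; flag them explicitly.
    return "Unassigned"


if __name__ == "__main__":
    for score in (1.0, 0.95, 0.8, 0.3, None):
        print(score, "->", assign_set(score))
```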

High Label Data

We classify the High Label data into 10 groups according to domain, speaking style, and scenario.

Domain	YouTube (hours)	Podcast (hours)	Total (hours)
audiobook	0	250.9	250.9
commentary	112.6	135.7	248.3
documentary	386.7	90.5	477.2
drama	4338.2	0	4338.2
interview	324.2	614	938.2
news	0	868	868
reading	0	1110.2	1110.2
talk	204	90.7	294.7
variety	603.3	224.5	827.8
others	144	507.5	651.5
Total	6113	3892	10005
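
As a quick consistency check, the per-domain hours can be summed and compared with the totals row. The sketch below copies the figures from the table. The dictionary name `HOURS` and the percentage breakdown are added for illustration and are not part of the dataset release.

```python
# Hours per domain, copied from the table above: (YouTube, Podcast).
HOURS = {
    "audiobook": (0.0, 250.9),
    "commentary": (112.6, 135.7),
    "documentary": (386.7, 90.5),
    "drama": (4338.2, 0.0),
    "interview": (324.2, 614.0),
    "news": (0.0, 868.0),
    "reading": (0.0, 1110.2),
    "talk": (204.0, 90.7),
    "variety": (603.3, 224.5),
    "others": (144.0, 507.5),
}

youtube_total = sum(youtube for youtube, _ in HOURS.values())
podcast_total = sum(podcast for _, podcast in HOURS.values())
grand_total = youtube_total + podcast_total

print(f"YouTube: {youtube_total:.1f} h")
print(f"Podcast: {podcast_total:.1f} h")
print(f"Total:   {grand_total:.1f} h")

# Share of the High Label data contributed by each domain, largest first.
for domain, (youtube, podcast) in sorted(
    HOURS.items(), key=lambda item: -(item[1][0] + item[1][1])
):
    hours = youtube + podcast
    print(f"{domain:12s} {hours:8.1f} h  {100 * hours / grand_total:5.1f}%")
```

The sums reproduce the totals row (6113, 3892, and 10005 hours). Drama alone accounts for roughly 43% of the High Label data.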

As shown in the following table, we provide three training subsets, S, M, and L, for building ASR systems at different data scales.

Training Subsets	Confidence	Hours
L	[0.95, 1.0]	10005
M	1.0	1000
S	1.0	100
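
The table gives a confidence value and an hour budget for each subset but does not describe how segments were sampled. The sketch below shows one plausible way to build such a subset from a segment list. It is an assumption for illustration, not the authors' procedure, and the `Segment` class, its fields, and `select_subset` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Segment:
    """Hypothetical record for one labeled audio segment."""
    segment_id: str
    duration_sec: float
    confidence: float


def select_subset(
    segments: Iterable[Segment], min_confidence: float, max_hours: float
) -> List[Segment]:
    """Keep segments at or above min_confidence until the hour budget is used."""
    budget_sec = max_hours * 3600.0
    chosen: List[Segment] = []
    used_sec = 0.0
    for seg in segments:
        if seg.confidence < min_confidence:
            continue  # label confidence too low for this subset
        if used_sec + seg.duration_sec > budget_sec:
            continue  # this segment would exceed the hour budget
        chosen.append(seg)
        used_sec += seg.duration_sec
    return chosen


if __name__ == "__main__":
    demo = [
        Segment("a", 1800.0, 1.0),
        Segment("b", 1800.0, 0.96),
        Segment("c", 1800.0, 1.0),
        Segment("d", 1800.0, 1.0),
    ]
    # A one-hour subset restricted to confidence 1.0, in the style of S and M.
    print([seg.segment_id for seg in select_subset(demo, 1.0, 1.0)])  # ['a', 'c']
```

Under these assumptions, an S-style subset would use `min_confidence=1.0` and `max_hours=100`, and an L-style subset would use `min_confidence=0.95` and `max_hours=10005`.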

Evaluation Sets

Evaluation Sets	Hours	Source	Description
DEV	20	Internet	Designed for speech tools that require a cross-validation set during training
TEST_NET	23	Internet	Matched test
TEST_MEETING	15	Real meeting	Mismatched test: a far-field, conversational, spontaneous meeting dataset

Contributors

ACKNOWLEDGEMENTS

WenetSpeech draws heavily on the work of GigaSpeech, and we thank Jiayu Du and Guoguo Chen for their suggestions on this work.
We thank Xi'an Future AI Innovation Center for providing hosting service for WenetSpeech. We also thank MindSpore, a new deep learning computing framework, for supporting this work.
Our gratitude goes to Lianhui Zhang and Yu Mao for collecting some of the YouTube data.
