A 10000+ hours dataset for Chinese speech recognition

Overview

WenetSpeech

A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition

WenetSpeech

Download

Please visit the official website, read the license, and follow the instruction to download the data.

Benchmark

Toolkit Model test_net test_meeting
Kaldi Chain Model
ESPnet Joint CTC/Conformer
WeNet Joint CTC/Conformer

Description

Creation

First, we collect all the data from YouTube and Podcast; Then, OCR is used to label YouTube data, auto trancrition is used to label Podcast data; Finally, a novel end-to-end label error detection method is used to further validate and filter the data.

Categories

In summary, WenetSpeech groups all data into 3 categories, as the following table shows:

Set Hours Confidence Usage
High Label 10005 >=0.95 Supervised Training
Weak Label 2478 [0.6, 0.95] Semi-supervised or noise training
Unlabel 9952 / Unsupervised training or Pre-training
In Total 22435 / All above

High Label Data

All of the data is from Youtube and Podcast, and we tag all the data with its source and domain. We classify the data into 10 groups according to its domain,speaking style, or scenarios.

Domain Youtube Podcast Total
audiobook 0 250.9 250.9
commentary 112.6 135.7 248.3
documentary 386.7 90.5 477.2
drama 4338.2 0 4338.2
interview 324.2 614 938.2
news 0 868 868
reading 0 1110.2 1110.2
talk 204 90.7 294.7
variety 603.3 224.5 827.8
others 144 507.5 651.5
Total 6113 3892 10005

We provide 3 training subsets, namely S, M and L. Subsets S, M are sampled from all the high label data which has the oracle confidence 1.0

Training Subsets Confidence Hours
L [0.95, 1.0] 10005
M 1.0 1000
S 1.0 100

Evaluation Sets

Evaluation Sets Hours Source Description
DEV 20 Internet Specially designed for some speech tools which require cross-validation set in training
TEST_NET 23 Internet Match test
TEST_MEETING 15 Real meeting Mismatch test which is far-field, conversational, and spontaneous meeting speech

Contributors

ACKNOWLEDGEMENTS

  1. WenetSpeech referred a lot of work of GigaSpeech, including metadata design, license design, data encryption, downloading pipeline, and so on. The authors would like to thank Jiayu Du and Guoguo Chen for their suggestions on this work.
  2. The authors would like to thank my college Lianhui Zhang, Yu Mao for collecting some of the YouTube data.
Owner
Production First and Production Ready End-to-End Speech Toolkit
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.

English | 简体中文 | 繁體中文 | 한국어 State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow 🤗 Transformers provides thousands of pretrained models

Hugging Face 77.1k Dec 31, 2022
构建一个多源(公众号、RSS)、干净、个性化的阅读环境

2C 构建一个多源(公众号、RSS)、干净、个性化的阅读环境 作为一名微信公众号的重度用户,公众号一直被我设为汲取知识的地方。随着使用程度的增加,相信大家或多或少会有一个比较头疼的问题——广告问题。 假设你关注的公众号有十来个,若一个公众号两周接一次广告,理论上你会面临二十多次广告,实际上会更多,运

howie.hu 678 Dec 28, 2022
Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine

Semantic search through Wikipedia with the Weaviate vector search engine Weaviate is an open source vector search engine with build-in vectorization a

SeMI Technologies 191 Dec 26, 2022
Sapiens is a human antibody language model based on BERT.

Sapiens: Human antibody language model ____ _ / ___| __ _ _ __ (_) ___ _ __ ___ \___ \ / _` | '_ \| |/ _ \ '

Merck Sharp & Dohme Corp. a subsidiary of Merck & Co., Inc. 13 Nov 20, 2022
Code-autocomplete, a code completion plugin for Python

Code AutoComplete code-autocomplete, a code completion plugin for Python.

xuming 13 Jan 07, 2023
Python library for parsing resumes using natural language processing and machine learning

CVParser Python library for parsing resumes using natural language processing and machine learning. Setup Installation on Linux and Mac OS Follow the

nafiu 0 Jul 29, 2021
Use fastai-v2 with HuggingFace's pretrained transformers

FastHugs Use fastai v2 with HuggingFace's pretrained transformers, see the notebooks below depending on your task: Text classification: fasthugs_seq_c

Morgan McGuire 111 Nov 16, 2022
Convolutional 2D Knowledge Graph Embeddings resources

ConvE Convolutional 2D Knowledge Graph Embeddings resources. Paper: Convolutional 2D Knowledge Graph Embeddings Used in the paper, but do not use thes

Tim Dettmers 586 Dec 24, 2022
Multilingual finetuning of Machine Translation model on low-resource languages. Project for Deep Natural Language Processing course.

Low-resource-Machine-Translation This repository contains the code for the project relative to the course Deep Natural Language Processing. The goal o

Andrea Cavallo 3 Jun 22, 2022
Tools, wrappers, etc... for data science with a concentration on text processing

Rosetta Tools for data science with a focus on text processing. Focuses on "medium data", i.e. data too big to fit into memory but too small to necess

207 Nov 22, 2022
Implementaion of our ACL 2022 paper Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation

Bridging the Data Gap between Training and Inference for Unsupervised Neural Machine Translation This is the implementaion of our paper: Bridging the

hezw.tkcw 20 Dec 12, 2022
A Paper List for Speech Translation

Keyword: Speech Translation, Spoken Language Processing, Natural Language Processing

138 Dec 24, 2022
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
HuggingTweets - Train a model to generate tweets

HuggingTweets - Train a model to generate tweets Create in 5 minutes a tweet generator based on your favorite Tweeter Make my own model with the demo

Boris Dayma 318 Jan 04, 2023
Beyond the Imitation Game collaborative benchmark for enormous language models

BIG-bench 🪑 The Beyond the Imitation Game Benchmark (BIG-bench) will be a collaborative benchmark intended to probe large language models, and extrap

Google 1.3k Jan 01, 2023
The ibet-Prime security token management system for ibet network.

ibet-Prime The ibet-Prime security token management system for ibet network. Features ibet-Prime is an API service that enables the issuance and manag

BOOSTRY 8 Dec 22, 2022
Use the state-of-the-art m2m100 to translate large data on CPU/GPU/TPU. Super Easy!

Easy-Translate is a script for translating large text files in your machine using the M2M100 models from Facebook/Meta AI. We also privide a script fo

Iker García-Ferrero 41 Dec 15, 2022
Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

Covid-19-BOT Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation. This bot uses torc

Neeraj Majhi 2 Nov 05, 2021
Associated Repository for "Translation between Molecules and Natural Language"

MolT5: Translation between Molecules and Natural Language Associated repository for "Translation between Molecules and Natural Language". Table of Con

67 Dec 15, 2022
Translates basic English sentences into the Huna language (hoo-NAH)

huna-translator The Huna Language Translates basic English sentences into the Huna language (hoo-NAH). The Huna constructed language was developed in

Miles Smith 0 Jan 20, 2022