UniSpeech - Large Scale Self-Supervised Learning for Speech

Last update: Dec 15, 2022

Overview

UniSpeech

The family of UniSpeech:

WavLM (arXiv): WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing

UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech-SAT (ICASSP 2022 Submission): Universal Speech Representation Learning with Speaker Aware Pre-Training

Update

[HuggingFace Integration] October 26, 2021: UniSpeech-SAT models are on HuggingFace.
[Model Release] October 13, 2021: UniSpeech-SAT models are released.
[HuggingFace Integration] October 11, 2021: UniSpeech models are on HuggingFace.
[Model Release] June, 2021: UniSpeech v1 models are released.

Pre-trained models

We strongly suggest using our UniSpeech-SAT model for speaker-related tasks, since it performs strongly on a range of speaker-related benchmarks.

Model	Pretraining Dataset	Finetuning Dataset	Download
UniSpeech Large EN	Labeled: 1350 hrs en	-	download
UniSpeech Large Multilingual	Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 353 hrs fr	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 168 hrs es	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 90 hrs it	-	download
UniSpeech Large Multilingual	Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky	-	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 353 hrs fr	1 hr fr	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 168 hrs es	1 hr es	download
UniSpeech Large+	Labeled: 1350 hrs en, Unlabeled: 90 hrs it	1 hr it	download
UniSpeech Large Multilingual	Labeled: 1350 hrs en + 353 hrs fr + 168 hrs es + 90 hrs it, Unlabeled: 17 hrs ky	1 hr ky	download
UniSpeech-SAT Base	960 hrs LibriSpeech	-	download
UniSpeech-SAT Base+	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	download
UniSpeech-SAT Large	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	download
WavLM Base	960 hrs LibriSpeech	-	Azure Storage Google Drive
WavLM Base+	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	Azure Storage Google Drive
WavLM Large	60k hrs Libri-Light + 10k hrs GigaSpeech + 24k hrs VoxPopuli	-	Azure Storage Google Drive

Universal Representation Evaluation on SUPERB

Downstream Task Performance

We also evaluate our models on typical speaker-related benchmarks.

Speaker Verification

Model	Fixed pre-trained model	Vox1-O	Vox1-E	Vox1-H
ECAPA-TDNN	-	0.87	1.12	2.12
HuBERT large	Yes	0.888	0.912	1.853
Wav2Vec2.0 (XLSR)	Yes	0.915	0.945	1.895
UniSpeech-SAT large	Yes	0.771	0.781	1.669
WavLM large	Yes	0.638	0.687	1.457
HuBERT large	No	0.585	0.654	1.342
Wav2Vec2.0 (XLSR)	No	0.564	0.605	1.23
UniSpeech-SAT large	No	0.564	0.561	1.23
WavLM large	No	0.431	0.538	1.154

Our paper for verification
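
As a reading aid for the Speaker Verification table, the following minimal sketch computes how much each model's score drops when the pre-trained model is fine-tuned rather than kept fixed. It assumes the three columns report error rates in percent, so lower is better; the table itself does not state the metric. The names `FROZEN`, `FINE_TUNED`, `relative_reduction`, and `fine_tuning_gains` are illustrative and are not part of the UniSpeech code base.

```python
# Scores copied from the Speaker Verification table.
# Assumption: values are error rates in percent (lower is better).
TRIALS = ("Vox1-O", "Vox1-E", "Vox1-H")

# Rows where the pre-trained model is kept fixed ("Yes").
FROZEN = {
    "HuBERT large": (0.888, 0.912, 1.853),
    "Wav2Vec2.0 (XLSR)": (0.915, 0.945, 1.895),
    "UniSpeech-SAT large": (0.771, 0.781, 1.669),
    "WavLM large": (0.638, 0.687, 1.457),
}

# Rows where the pre-trained model is fine-tuned ("No").
FINE_TUNED = {
    "HuBERT large": (0.585, 0.654, 1.342),
    "Wav2Vec2.0 (XLSR)": (0.564, 0.605, 1.23),
    "UniSpeech-SAT large": (0.564, 0.561, 1.23),
    "WavLM large": (0.431, 0.538, 1.154),
}


def relative_reduction(before, after):
    """Return the relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def fine_tuning_gains():
    """Map each model to its per-trial relative reduction from fine-tuning."""
    gains = {}
    for model, frozen_scores in FROZEN.items():
        tuned_scores = FINE_TUNED[model]
        gains[model] = {
            trial: round(relative_reduction(frozen, tuned), 1)
            for trial, frozen, tuned in zip(TRIALS, frozen_scores, tuned_scores)
        }
    return gains


if __name__ == "__main__":
    for model, per_trial in fine_tuning_gains().items():
        print(model, per_trial)
```

Relative rather than absolute differences are used here because the three trial lists have very different score ranges.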

Speech Separation

Evaluation on LibriCSS

Model	0S	0L	OV10	OV20	OV30	OV40
Conformer (SOTA)	4.5	4.4	6.2	8.5	11	12.6
UniSpeech-SAT base	4.4	4.4	5.4	7.2	9.2	10.5
UniSpeech-SAT large	4.3	4.2	5.0	6.3	8.2	8.8
WavLM base+	4.5	4.4	5.6	7.5	9.4	10.9
WavLM large	4.2	4.1	4.8	5.8	7.4	8.5
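
The LibriCSS table can be summarized programmatically. The sketch below picks the lowest-scoring system in each column and computes a plain average over the six columns. It assumes lower values are better and that averaging across columns is a reasonable summary; neither is stated in the table. `RESULTS`, `best_per_condition`, and `mean_score` are hypothetical names used only for this illustration.

```python
# Scores copied from the LibriCSS table.
# Assumption: lower values are better in every column.
CONDITIONS = ("0S", "0L", "OV10", "OV20", "OV30", "OV40")

RESULTS = {
    "Conformer (SOTA)": (4.5, 4.4, 6.2, 8.5, 11.0, 12.6),
    "UniSpeech-SAT base": (4.4, 4.4, 5.4, 7.2, 9.2, 10.5),
    "UniSpeech-SAT large": (4.3, 4.2, 5.0, 6.3, 8.2, 8.8),
    "WavLM base+": (4.5, 4.4, 5.6, 7.5, 9.4, 10.9),
    "WavLM large": (4.2, 4.1, 4.8, 5.8, 7.4, 8.5),
}


def best_per_condition(results):
    """Return {condition: (model, score)} for the lowest score per column."""
    best = {}
    for index, condition in enumerate(CONDITIONS):
        model = min(results, key=lambda name: results[name][index])
        best[condition] = (model, results[model][index])
    return best


def mean_score(scores):
    """Unweighted mean over all columns of one table row."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    for condition, (model, score) in best_per_condition(RESULTS).items():
        print(f"{condition}: {model} ({score})")
    for model, scores in RESULTS.items():
        print(f"{model}: mean {mean_score(scores):.2f}")
```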

Speaker Diarization

Evaluation on CALLHOME

Model	spk_2	spk_3	spk_4	spk_5	spk_6	spk_all
EEND-vector clustering	7.96	11.93	16.38	21.21	23.1	12.49
EEND-EDA clustering (SOTA)	7.11	11.88	14.37	25.95	21.95	11.84
UniSpeech-SAT large	5.93	10.66	12.9	16.48	23.25	10.92
WavLM Base	6.99	11.12	15.20	16.48	21.61	11.75
WavLM large	6.46	10.69	11.84	12.89	20.70	10.35

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ project.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following papers:

@inproceedings{Wang2021UniSpeech,
  author    = {Chengyi Wang and Yu Wu and Yao Qian and Kenichi Kumatani and Shujie Liu and Furu Wei and Michael Zeng and Xuedong Huang},
  editor    = {Marina Meila and Tong Zhang},
  title     = {UniSpeech: Unified Speech Representation Learning with Labeled and
               Unlabeled Data},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning,
               {ICML} 2021, 18-24 July 2021, Virtual Event},
  series    = {Proceedings of Machine Learning Research},
  volume    = {139},
  pages     = {10937--10947},
  publisher = {{PMLR}},
  year      = {2021},
  url       = {http://proceedings.mlr.press/v139/wang21y.html},
  timestamp = {Thu, 21 Oct 2021 16:06:12 +0200},
  biburl    = {https://dblp.org/rec/conf/icml/0002WQK0WZ021.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{Chen2021WavLM,
  title         = {WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing},
  author        = {Sanyuan Chen and Chengyi Wang and Zhengyang Chen and Yu Wu and Shujie Liu and Zhuo Chen and Jinyu Li and Naoyuki Kanda and Takuya Yoshioka and Xiong Xiao and Jian Wu and Long Zhou and Shuo Ren and Yanmin Qian and Yao Qian and Jian Wu and Michael Zeng and Furu Wei},
  eprint        = {2110.13900},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2021}
}

@article{Chen2021UniSpeechSAT,
  title         = {UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training},
  author        = {Sanyuan Chen and Yu Wu and Chengyi Wang and Zhengyang Chen and Zhuo Chen and Shujie Liu and Jian Wu and Yao Qian and Furu Wei and Jinyu Li and Xiangzhan Yu},
  eprint        = {2110.05752},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  year          = {2021}
}

Contact Information

For help or issues using UniSpeech models, please submit a GitHub issue.

For other communications related to UniSpeech, please contact Yu Wu ([email protected]).


