Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Last update: Oct 14, 2022

Overview

This is an implementation of the paper, together with a training pipeline and a pretrained model built on an open dataset. Audio samples from the paper are available here.

Recipe

This open pipeline uses the Databaker dataset. For dataset preprocessing, please refer to our previous pipeline; only the Databaker dataset is needed here. In addition, you need to run lexicon/build_databaker.py to build the vocabulary, download the lexicon from zdic.net, and encode it with XLM-R. The target directory for the saved data is specified in build_databaker.py and lexicon_utils.py; feel free to change it.

Below are the commands for training and evaluation. They use the default target directories specified in the preprocessing scripts, so please substitute your own paths. The evaluation script can be run concurrently with the training script. You may also use the evaluation script to synthesize samples from pretrained models. Please refer to the help text of each argument for its meaning.

python -m torch.distributed.launch --nproc_per_node=NGPU --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=D:\free_corpus\packed\ --training_languages=zh-cn --eval_languages=zh-cn --training_speakers=databaker --eval_steps=100000:150000 --hparams="input_method=char,multi_speaker=True,use_knowledge_attention=True,remove_space=True,data_format=nlti" --external_embed=D:\free_corpus\packed\embed.zip --vocab=D:\free_corpus\packed\db_vocab.json

python eval.py --model-dir=MODEL_DIR --log-dir=LOG_DIR --data-dir=D:\free_corpus\packed\ --eval_languages=zh-cn --eval_meta=D:\free_corpus\packed\metadata.eval.txt --hparams="input_method=char,multi_speaker=True,use_knowledge_attention=True,remove_space=True,data_format=nlti" --start_step=100000 --vocab=D:\free_corpus\packed\db_vocab.json --external_embed=D:\free_corpus\packed\embed.zip --eval_speakers=databaker
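Because the commands above hard-code the default Windows paths, it can be convenient to assemble the argument list from a single data directory. The following is a minimal sketch and not part of the repository: the helper name `build_eval_command` is hypothetical, and it assumes that the metadata, vocabulary, and embedding files all sit directly under one data directory, as they do in the command above. The flags and file names are copied from that command.

```python
from pathlib import Path


def build_eval_command(data_dir, model_dir, log_dir, start_step=100000):
    """Return the eval.py argument list for one data directory (hypothetical helper)."""
    # Note: Path drops the trailing separator used in the original --data-dir value.
    data_dir = Path(data_dir)
    hparams = ("input_method=char,multi_speaker=True,"
               "use_knowledge_attention=True,remove_space=True,data_format=nlti")
    return [
        "python", "eval.py",
        f"--model-dir={model_dir}",
        f"--log-dir={log_dir}",
        f"--data-dir={data_dir}",
        "--eval_languages=zh-cn",
        f"--eval_meta={data_dir / 'metadata.eval.txt'}",
        f"--hparams={hparams}",
        f"--start_step={start_step}",
        f"--vocab={data_dir / 'db_vocab.json'}",
        f"--external_embed={data_dir / 'embed.zip'}",
        "--eval_speakers=databaker",
    ]


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run to execute it.
    print(" ".join(build_eval_command("data/packed", "MODEL_DIR", "LOG_DIR")))
```

Keeping the arguments as a list avoids shell quoting problems with the comma-separated hparams string.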

In addition, to report CER, you need to create azure_key.json with your own Azure STT subscription, with the content {"subscription": "YOUR_KEY", "region": "YOUR_REGION"}; see utils/transcribe.py. Because the datasets used differ significantly from those in the paper, this implementation is for demonstration only and cannot fully reproduce the results in the paper.
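The key file can be written by hand, but a small script makes the expected shape explicit. This is a minimal sketch: the two field names come from the format quoted above, while the helper names `write_azure_key` and `read_azure_key` are hypothetical. Where the file must be placed is determined by utils/transcribe.py, so the path used here is only an assumption.

```python
import json
import tempfile
from pathlib import Path


def write_azure_key(subscription, region, path="azure_key.json"):
    """Write the key file in the documented two-field format (hypothetical helper)."""
    content = {"subscription": subscription, "region": region}
    Path(path).write_text(json.dumps(content, indent=2), encoding="utf-8")
    return content


def read_azure_key(path="azure_key.json"):
    """Load the key file and check that both expected fields are present."""
    content = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = {"subscription", "region"} - content.keys()
    if missing:
        raise KeyError(f"key file is missing fields: {sorted(missing)}")
    return content


if __name__ == "__main__":
    # Demonstrate a round trip in a temporary directory with placeholder values.
    demo_path = Path(tempfile.mkdtemp()) / "azure_key.json"
    write_azure_key("YOUR_KEY", "YOUR_REGION", demo_path)
    print(read_azure_key(demo_path))
```

Keep the real key file out of version control, since it contains a subscription secret.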

Pretrained Model

The pretrained models on Databaker are available at OneDrive Link; they reach a CER of 4.19%. The files needed to generate speech, including the lexicon texts, lexicon embeddings, the vocabulary file, and the evaluation scripts, are also included to aid fast reproduction.
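For readers unfamiliar with the metric, CER (character error rate) is conventionally the character-level edit distance between the recognized transcript and the reference text, divided by the number of reference characters. The following is a minimal sketch of that standard definition. It is not taken from utils/transcribe.py, whose text normalization and scoring may differ, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edit distance over total reference characters."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_errors / total_chars


if __name__ == "__main__":
    # Toy example: one substituted character out of eleven reference characters.
    refs = ["今天天气很好", "我喜欢音乐"]
    hyps = ["今天天气很好", "我喜欢银乐"]
    print(f"CER = {cer(refs, hyps):.2%}")
```

Pooling errors over the whole corpus, rather than averaging per-utterance rates, keeps short utterances from dominating the score.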

Owner

Mutian He
