A fast Text-to-Speech (TTS) model. Works well for English, Mandarin Chinese, Japanese, Korean, Russian and Tibetan (the languages tested so far).

Last update: Dec 19, 2022

Overview

Simplified Chinese | English

Parallel Speech Synthesis

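What's New

2021/04/20 Merged the wavegan branch into main and deleted the wavegan branch.
2021/04/13 Created the encoder branch for developing the voice style transfer module.
2021/04/13 The softdtw branch supports training the model with the SoftDTW loss.
2021/04/09 ~~wavegan branch (deleted)~~ Provides PWG / MelGAN / Multi-band MelGAN vocoders.
2021/04/05 Added support for ParallelText2Mel + MelGAN vocoder.
[ Key info ] Speed benchmarks, synthesized samples, web demo, known issues, get in touch ...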

Directory Structure

.
|--- config/      # Configuration files
     |--- default.yaml
     |--- ...
|--- datasets/    # Data processing
|--- encoder/     # Speaker (voiceprint) encoder
     |--- voice_encoder.py
     |--- ...
|--- helpers/     # Helper classes
     |--- trainer.py
     |--- synthesizer.py
     |--- ...
|--- logdir/      # Training output directory
|--- losses/      # Loss functions
|--- models/      # Synthesis models
     |--- layers.py
     |--- duration.py
     |--- parallel.py
|--- pretrained/  # Pretrained models (LJSpeech dataset)
|--- samples/     # Synthesized samples
|--- utils/       # General utilities
|--- vocoder/     # Vocoders
     |--- melgan.py
     |--- ...
|--- wandb/       # Wandb output directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py  # Data preparation script
|--- README.md
|--- README_en.md
|--- requirements.txt    # Dependencies
|--- synthesize.py       # Synthesis script
|--- train-duration.py   # Training scripts
|--- train-parallel.py

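Synthesized Samples

Some synthesized samples are available here.

Pretrained Models

Some pretrained models are available here.

Quick Start

Step (1): Clone the repository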

$ git clone https://github.com/atomicoo/ParallelTTS.git

Step (2): Install the dependencies

$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt

Step (3): Synthesize speech

$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/

To synthesize speech in another language, pass the corresponding configuration file with --config.

How to Train

Step (1): Prepare the data

$ python prepare-dataset.py

Use --config to specify a configuration file. The default, default.yaml, targets the LJSpeech dataset.

Step (2): Train the alignment model

$ python train-duration.py

Step (3): Extract durations

$ python extract-duration.py

Use --ground_truth to specify whether to generate ground-truth spectrograms with the alignment model.

Step (4): Train the synthesis model

$ python train-parallel.py

Use --ground_truth to specify whether to train the model on ground-truth spectrograms.

Training Logs

If you use TensorBoardX, run the following command:

$ tensorboard --logdir logdir/[DIR]/

Wandb (Weights & Biases) is strongly recommended: just add the --enable_wandb option to the training commands above.
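
The four training steps and the logging option above can be chained from a small driver script. The following is a minimal sketch, not part of the repository: the helper names `build_pipeline` and `run_pipeline` are hypothetical, and it assumes that every script accepts --config the way prepare-dataset.py does and that --enable_wandb applies only to the two train-*.py scripts. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from typing import List, Optional

# Script names taken from the training steps above, in the order they run.
PIPELINE_SCRIPTS = [
    "prepare-dataset.py",
    "train-duration.py",
    "extract-duration.py",
    "train-parallel.py",
]


def build_pipeline(config: Optional[str] = None,
                   enable_wandb: bool = False) -> List[List[str]]:
    """Build one command (as an argument list) per pipeline step."""
    commands = []
    for script in PIPELINE_SCRIPTS:
        cmd = [sys.executable, script]
        if config is not None:
            # Assumption: all four scripts accept --config.
            cmd += ["--config", config]
        if enable_wandb and script.startswith("train-"):
            # Assumption: only the training scripts take --enable_wandb.
            cmd.append("--enable_wandb")
        commands.append(cmd)
    return commands


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline as soon as a step fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(build_pipeline(enable_wandb=True), dry_run=True)
```

Run it from the repository root so the relative script names resolve, and pass `dry_run=False` once the printed commands look right.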

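Datasets

LJSpeech: English, female, 22050 Hz, about 24 hours
LibriSpeech: English, multi-speaker (only the train-clean-100 subset is used), 16000 Hz, about 1000 hours in total
JSUT: Japanese, female, 48000 Hz, about 10 hours
BiaoBei: Mandarin, female, 48000 Hz, about 12 hours
KSS: Korean, female, 44100 Hz, about 12 hours
RuLS: Russian, multi-speaker (only audio from a single speaker is used), 16000 Hz, about 98 hours in total
TWLSpeech (not public, lower quality): Tibetan, female (multiple speakers with similar timbre), 16000 Hz, about 23 hours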
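
The datasets above are recorded at different sample rates, so audio usually has to be resampled to the rate a model and vocoder expect. The sketch below only works out the exact resampling ratios; it performs no audio processing. The 22050 Hz target is an assumption for illustration (the known-issues notes mention it only in connection with the Korean model and the LJSpeech vocoder), and the function names are hypothetical.

```python
import math
from fractions import Fraction

# Native sample rates as listed in the dataset table above.
SAMPLE_RATES_HZ = {
    "LJSpeech": 22050,
    "LibriSpeech": 16000,
    "JSUT": 48000,
    "BiaoBei": 48000,
    "KSS": 44100,
    "RuLS": 16000,
    "TWLSpeech": 16000,
}

# Assumption: a single 22050 Hz target rate, chosen for illustration only.
TARGET_HZ = 22050


def resample_ratio(source_hz: int, target_hz: int = TARGET_HZ) -> Fraction:
    """Exact up/down ratio (target / source) in lowest terms."""
    return Fraction(target_hz, source_hz)


def resampled_length(num_samples: int, source_hz: int,
                     target_hz: int = TARGET_HZ) -> int:
    """Number of samples after resampling, rounded up."""
    return math.ceil(num_samples * resample_ratio(source_hz, target_hz))


if __name__ == "__main__":
    for name, rate in SAMPLE_RATES_HZ.items():
        ratio = resample_ratio(rate)
        print(f"{name}: {rate} Hz -> {TARGET_HZ} Hz, "
              f"up={ratio.numerator}, down={ratio.denominator}")
```

The numerator and denominator are the integer up and down factors that a polyphase resampler typically takes as input.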

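Quality Evaluation

TODO: to be added.

Speed Benchmarks

Training speed: On the LJSpeech dataset with a batch size of 64, the model can be trained on a single GTX 1080 GPU with 8 GB of memory. After about 8 hours (about 300 epochs) it can synthesize fairly high-quality speech.

Synthesis speed: The tests below were run on an Intel Core i7-8550U CPU and an NVIDIA GeForce MX150 GPU. Each synthesized clip is about 8 seconds long (about 20 words).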

Batch size	Spec (GPU)	Audio (GPU)	Spec (CPU)	Audio (CPU)
1	0.042	0.218	0.100	2.004
2	0.046	0.453	0.209	3.922
4	0.053	0.863	0.407	7.897
8	0.062	2.386	0.878	14.599

Note that the tests were not repeated and averaged, so the results are for reference only.
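
One way to read the table is as a real-time factor (RTF): elapsed time divided by the duration of the audio produced, where values below 1 mean faster than real time. The sketch below computes it per column. It assumes the table values are in seconds (the source does not state the unit) and uses the stated approximate clip length of 8 seconds, so the results are rough estimates. The function name is hypothetical.

```python
# Approximate length of each synthesized clip, as stated above.
CLIP_SECONDS = 8.0

# Timings copied from the table above; assumed to be in seconds.
TIMINGS = {
    1: {"spec_gpu": 0.042, "audio_gpu": 0.218, "spec_cpu": 0.100, "audio_cpu": 2.004},
    2: {"spec_gpu": 0.046, "audio_gpu": 0.453, "spec_cpu": 0.209, "audio_cpu": 3.922},
    4: {"spec_gpu": 0.053, "audio_gpu": 0.863, "spec_cpu": 0.407, "audio_cpu": 7.897},
    8: {"spec_gpu": 0.062, "audio_gpu": 2.386, "spec_cpu": 0.878, "audio_cpu": 14.599},
}


def real_time_factor(elapsed_seconds: float, batch_size: int,
                     clip_seconds: float = CLIP_SECONDS) -> float:
    """Elapsed time divided by the total duration of audio in the batch."""
    return elapsed_seconds / (batch_size * clip_seconds)


if __name__ == "__main__":
    for batch_size, row in TIMINGS.items():
        rtfs = ", ".join(
            f"{name}={real_time_factor(elapsed, batch_size):.4f}"
            for name, elapsed in row.items()
        )
        print(f"batch {batch_size}: {rtfs}")
```

Under these assumptions every column stays below 1, including audio generation on the CPU.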

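Known Issues

In the wavegan branch, the vocoder code is taken from ParallelWaveGAN. Because the acoustic feature extraction methods are incompatible, the features must be converted; the conversion code is available here.
The Mandarin model takes pinyin sequences as text input. Because the original BiaoBei pinyin sequences contain no punctuation and the alignment model is not fully trained, the rhythm of the synthesized speech is somewhat off.
No dedicated vocoder was trained for the Korean model. It directly uses the LJSpeech vocoder (also 22050 Hz), which may slightly affect the quality of the synthesized speech.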

Owner

Atomicoo
