🐤 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

Last update: Jan 09, 2023

Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

This is the repository for our paper, 🐤 Nix-TTS (submitted to INTERSPEECH 2022). We release the pretrained models, an interactive demo, and audio samples below.

[📄 Paper Link] [🤗 Interactive Demo] [📢 Audio Samples]

Abstract: We propose Nix-TTS, a lightweight neural TTS (Text-to-Speech) model obtained by applying knowledge distillation to a powerful yet large generative TTS teacher model. Distilling a TTS model might sound unintuitive given the generative and disjointed nature of TTS architectures, but pre-trained TTS models can be simplified into encoder and decoder structures, where the former encodes text into a latent representation and the latter decodes that latent into speech data. We devise a framework to distill each component in a non end-to-end fashion. Nix-TTS is end-to-end (vocoder-free) with only 5.23M parameters, a reduction of up to 82% relative to the teacher model. It achieves over 3.26x and 8.36x inference speedup on an Intel-i7 CPU and a Raspberry Pi respectively, while still retaining fair voice naturalness and intelligibility compared to the teacher model.

Getting Started with Nix-TTS

Clone the nix-tts repository and move to its directory

git clone https://github.com/rendchevi/nix-tts.git
cd nix-tts

Install the dependencies

Install the Python dependencies. We recommend Python >= 3.8.

pip install -r requirements.txt

Install espeak on your device (used for text tokenization).

sudo apt-get install espeak

If that does not work, follow the official espeak installation instructions.
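Before moving on, it can help to confirm that the espeak binary is actually visible on your PATH. The snippet below is a minimal sketch using only the Python standard library; the helper name find_espeak is hypothetical and is not part of the Nix-TTS code base.

```python
import shutil


def find_espeak():
    """Return the full path of the `espeak` executable, or None if it is not on PATH.

    Hypothetical helper for illustration; it only checks that the binary can be
    found, not that Nix-TTS itself is correctly configured.
    """
    return shutil.which("espeak")


if __name__ == "__main__":
    location = find_espeak()
    if location is None:
        print("espeak was not found on PATH; install it before running Nix-TTS.")
    else:
        print(f"espeak found at: {location}")
```

If the check reports that espeak is missing even after installation, the binary is probably installed somewhere that is not on your PATH.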

Download your chosen pre-trained model here.

Model	Num. of Params	Faster than real-time^* (CPU Intel-i7)	Faster than real-time^* (RasPi Model 3B)
Nix-TTS (ONNX)	5.23 M	11.9x	0.50x
Nix-TTS w/ Stochastic Duration (ONNX)	6.03 M	10.8x	0.50x

^* We report how many times faster than real-time each model runs, computed as the inverse of the Real Time Factor (RTF). The complete speedup table for all models is given in the paper.
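To make the footnote concrete, here is a minimal sketch of how the inverse RTF can be computed, assuming RTF is defined in the usual way as synthesis time divided by the duration of the generated audio. The function names and the timings in the example are hypothetical illustrations, not measurements from the paper.

```python
def audio_duration_seconds(num_samples: int, sample_rate: int = 22050) -> float:
    """Duration of a mono waveform, given its sample count and sample rate."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the synthesized audio."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds: float, audio_seconds: float) -> float:
    """Inverse of RTF: values above 1 mean faster than real-time."""
    return 1.0 / real_time_factor(synthesis_seconds, audio_seconds)


if __name__ == "__main__":
    # Illustrative numbers only: 2 seconds of audio that took 4 seconds to generate.
    duration = audio_duration_seconds(num_samples=44100, sample_rate=22050)
    print(f"{speedup_over_real_time(4.0, duration):.2f}x")  # prints "0.50x"
```

Read this way, the 0.50x entries in the table mean that on the Raspberry Pi Model 3B synthesis takes about twice as long as the audio it produces, while 11.9x on the Intel-i7 CPU means the audio is generated roughly twelve times faster than it plays back.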

And running Nix-TTS is as easy as:

from nix.models.TTS import NixTTSInference
from IPython.display import Audio

# Initiate Nix-TTS
nix = NixTTSInference(model_dir = "<path_to_the_downloaded_model>")
# Tokenize input text
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
# Convert text to raw speech
xw = nix.vocalize(c, c_length)

# Listen to the generated speech
Audio(xw[0,0], rate = 22050)
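Outside a notebook, IPython's Audio widget is not available, so you may want to write the waveform to disk instead. The following is a minimal sketch using only the standard library. It assumes that xw[0,0] is a one-dimensional sequence of float samples roughly in the range [-1, 1] at 22050 Hz (the rate used above); the helper name save_wav and the output file name are hypothetical.

```python
import struct
import wave


def save_wav(samples, path, sample_rate=22050):
    """Write mono float samples to a 16-bit PCM WAV file.

    Assumes samples lie roughly in [-1, 1]; values outside that range are clipped.
    Returns the number of frames written.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        # Scale to the signed 16-bit range and pack as little-endian.
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2


# Example usage with the output of the snippet above (assumed shape: batch, channel, time):
# save_wav(xw[0, 0], "nix_tts_output.wav", sample_rate=22050)
```

The sample loop is written in plain Python for portability; for long utterances a vectorized conversion would be faster.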

Acknowledgement

This research is fully and exclusively funded by Kata.ai, where the authors work as part of the Kata.ai Research Team.
Some of the more complex parts of our model, as mentioned in the paper, are adapted from the original implementations of VITS and Comprehensive-Transformer-TTS.
