This program uses trial auth token of Azure Cognitive Services to do speech synthesis for you.

Overview

🗣️ aspeak

GitHub stars GitHub issues GitHub forks GitHub license PyPI version

A simple text-to-speech client using azure TTS API(trial). 😆

TL;DR: This program uses trial auth token of Azure Cognitive Services to do speech synthesis for you.

You can try the Azure TTS API online: https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech

Installation

$ pip install --upgrade aspeak

Limitations

Since we are using Azure Cognitive Services, there are some limitations:

Quota Free (F0)3
Max number of transactions per certain time period per Speech service resource
Real-time API. Prebuilt neural voices and custom neural voices. 20 transactions per 60 seconds
Adjustable No4
HTTP-specific quotas
Max audio length produced per request 10 min
Max total number of distinct <voice> and <audio> tags in SSML 50
Websocket specific quotas
Max audio length produced per turn 10 min
Max total number of distinct <voice> and <audio> tags in SSML 50
Max SSML message size per turn 64 KB

This table is copied from Azure Cognitive Services documentation

And the limitations may be subject to change. The table above might become outdated in the future. Please refer to the latest Azure Cognitive Services documentation for the latest information.

Attention: If the result audio is longer than 10 minutes, the audio will be truncated to 10 minutes and the program will not report an error.

Using aspeak as a Python library

See DEVELOP.md for more details. You can find examples in src/examples.

Usage

usage: aspeak [-h] [-V | -L | -Q | [-t [TEXT] [-p PITCH] [-r RATE] [-S STYLE] [-R ROLE] [-d STYLE_DEGREE] | -s [SSML]]]
              [-f FILE] [-e ENCODING] [-o OUTPUT_PATH] [-l LOCALE] [-v VOICE]
              [--mp3 [-q QUALITY] | --ogg [-q QUALITY] | --webm [-q QUALITY] | --wav [-q QUALITY] | -F FORMAT] 

This program uses trial auth token of Azure Cognitive Services to do speech synthesis for you

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -L, --list-voices     list available voices, you can combine this argument with -v and -l
  -Q, --list-qualities-and-formats
                        list available qualities and formats
  -t [TEXT], --text [TEXT]
                        Text to speak. Left blank when reading from file/stdin
  -s [SSML], --ssml [SSML]
                        SSML to speak. Left blank when reading from file/stdin
  -f FILE, --file FILE  Text/SSML file to speak, default to `-`(stdin)
  -e ENCODING, --encoding ENCODING
                        Text/SSML file encoding, default to "utf-8"(Not for stdin!)
  -o OUTPUT_PATH, --output OUTPUT_PATH
                        Output file path, wav format by default
  --mp3                 Use mp3 format for output. (Only works when outputting to a file)
  --ogg                 Use ogg format for output. (Only works when outputting to a file)
  --webm                Use webm format for output. (Only works when outputting to a file)
  --wav                 Use wav format for output
  -F FORMAT, --format FORMAT
                        Set output audio format (experts only)
  -l LOCALE, --locale LOCALE
                        Locale to use, default to en-US
  -v VOICE, --voice VOICE
                        Voice to use
  -q QUALITY, --quality QUALITY
                        Output quality, default to 0

Options for --text:
  -p PITCH, --pitch PITCH
                        Set pitch, default to 0. Valid values include floats(will be converted to percentages), percentages such as 20% and -10%, absolute values like 300Hz, and
                        relative values like -20Hz, +2st and string values like x-low. See the documentation for more details.
  -r RATE, --rate RATE  Set speech rate, default to 0. Valid values include floats(will be converted to percentages), percentages like -20%, floats with postfix "f" (e.g. 2f means
                        doubling the default speech rate), and string values like x-slow. See the documentation for more details.
  -S STYLE, --style STYLE
                        Set speech style, default to "general"
  -R {Girl,Boy,YoungAdultFemale,YoungAdultMale,OlderAdultFemale,OlderAdultMale,SeniorFemale,SeniorMale}, --role {Girl,Boy,YoungAdultFemale,YoungAdultMale,OlderAdultFemale,OlderAdultMale,SeniorFemale,SeniorMale}
                        Specifies the speaking role-play. This only works for some Chinese voices!
  -d {values in range 0.01-2 (inclusive)}, --style-degree {values in range 0.01-2 (inclusive)}
                        Specifies the intensity of the speaking style.This only works for some Chinese voices!

Attention: If the result audio is longer than 10 minutes, the audio will be truncated to 10 minutes and the program will not report an error. Unreasonable high/low values for pitch
and rate will be clipped to reasonable values by Azure Cognitive Services.Please refer to the documentation for other limitations at
https://github.com/kxxt/aspeak/blob/main/README.md#limitations
  • If you don't specify -o, we will use your default speaker.
  • If you don't specify -t or -s, we will assume -t is provided.
  • You must specify voice if you want to use special options for --text.

Special Note for Pitch and Rate

  • rate: The speaking rate of the voice.
    • If you use a float value (say 0.5), the value will be multiplied by 100% and become 50.00%.
    • You can use the following values as well: x-slow, slow, medium, fast, x-fast, default.
    • You can also use percentage values directly: +10%.
    • You can also use a relative float value (with f postfix), 1.2f:
      • According to the Azure documentation,
      • A relative value, expressed as a number that acts as a multiplier of the default.
      • For example, a value of 1f results in no change in the rate. A value of 0.5f results in a halving of the rate. A value of 3f results in a tripling of the rate.
  • pitch: The pitch of the voice.
    • If you use a float value (say -0.5), the value will be multiplied by 100% and become -50.00%.
    • You can also use the following values as well: x-low, low, medium, high, x-high, default.
    • You can also use percentage values directly: +10%.
    • You can also use a relative value, (e.g. -2st or +80Hz):
      • According to the Azure documentation,
      • A relative value, expressed as a number preceded by "+" or "-" and followed by "Hz" or "st" that specifies an amount to change the pitch.
      • The "st" indicates the change unit is semitone, which is half of a tone (a half step) on the standard diatonic scale.
    • You can also use an absolute value: e.g. 600Hz

Note: Unreasonable high/low values will be clipped to reasonable values by Azure Cognitive Services.

About Custom Style Degree and Role

According to the Azure documentation , style degree specifies the intensity of the speaking style. It is a floating point number between 0.01 and 2, inclusive.

At the time of writing, style degree adjustments are supported for Chinese (Mandarin, Simplified) neural voices.

According to the Azure documentation , role specifies the speaking role-play. The voice acts as a different age and gender, but the voice name isn't changed.

At the time of writing, role adjustments are supported for these Chinese (Mandarin, Simplified) neural voices: zh-CN-XiaomoNeural, zh-CN-XiaoxuanNeural, zh-CN-YunxiNeural, and zh-CN-YunyeNeural.

Examples

Speak "Hello, world!" to default speaker.

$ aspeak -t "Hello, world"

List all available voices.

$ aspeak -L

List all available voices for Chinese.

$ aspeak -L -l zh-CN

Get information about a voice.

$ aspeak -L -v en-US-SaraNeural
Output
Microsoft Server Speech Text to Speech Voice (en-US, SaraNeural)
Display Name: Sara
Local Name: Sara @ en-US
Locale: English (United States)
Gender: Female
ID: en-US-SaraNeural
Styles: ['cheerful', 'angry', 'sad']
Voice Type: Neural
Status: GA

Save synthesized speech to a file.

$ aspeak -t "Hello, world" -o output.wav

If you prefer mp3/ogg/webm, you can use --mp3/--ogg/--webm option.

$ aspeak -t "Hello, world" -o output.mp3 --mp3
$ aspeak -t "Hello, world" -o output.ogg --ogg
$ aspeak -t "Hello, world" -o output.webm --webm

List available quality levels and formats

$ aspeak -Q
Output
Available qualities:
Qualities for wav:
-2: Riff8Khz16BitMonoPcm
-1: Riff16Khz16BitMonoPcm
 0: Riff24Khz16BitMonoPcm
 1: Riff24Khz16BitMonoPcm
Qualities for mp3:
-3: Audio16Khz32KBitRateMonoMp3
-2: Audio16Khz64KBitRateMonoMp3
-1: Audio16Khz128KBitRateMonoMp3
 0: Audio24Khz48KBitRateMonoMp3
 1: Audio24Khz96KBitRateMonoMp3
 2: Audio24Khz160KBitRateMonoMp3
 3: Audio48Khz96KBitRateMonoMp3
 4: Audio48Khz192KBitRateMonoMp3
Qualities for ogg:
-1: Ogg16Khz16BitMonoOpus
 0: Ogg24Khz16BitMonoOpus
 1: Ogg48Khz16BitMonoOpus
Qualities for webm:
-1: Webm16Khz16BitMonoOpus
 0: Webm24Khz16BitMonoOpus
 1: Webm24Khz16Bit24KbpsMonoOpus

Available formats:
- Riff8Khz16BitMonoPcm
- Riff16Khz16BitMonoPcm
- Audio16Khz128KBitRateMonoMp3
- Raw24Khz16BitMonoPcm
- Raw48Khz16BitMonoPcm
- Raw16Khz16BitMonoPcm
- Audio24Khz160KBitRateMonoMp3
- Ogg24Khz16BitMonoOpus
- Audio16Khz64KBitRateMonoMp3
- Raw8Khz8BitMonoALaw
- Audio24Khz16Bit48KbpsMonoOpus
- Ogg16Khz16BitMonoOpus
- Riff8Khz8BitMonoALaw
- Riff8Khz8BitMonoMULaw
- Audio48Khz192KBitRateMonoMp3
- Raw8Khz16BitMonoPcm
- Audio24Khz48KBitRateMonoMp3
- Raw24Khz16BitMonoTrueSilk
- Audio24Khz16Bit24KbpsMonoOpus
- Audio24Khz96KBitRateMonoMp3
- Webm24Khz16BitMonoOpus
- Ogg48Khz16BitMonoOpus
- Riff48Khz16BitMonoPcm
- Webm24Khz16Bit24KbpsMonoOpus
- Raw8Khz8BitMonoMULaw
- Audio16Khz16Bit32KbpsMonoOpus
- Audio16Khz32KBitRateMonoMp3
- Riff24Khz16BitMonoPcm
- Raw16Khz16BitMonoTrueSilk
- Audio48Khz96KBitRateMonoMp3
- Webm16Khz16BitMonoOpus

Increase/Decrease audio qualities

# Less than default quality.
$ aspeak -t "Hello, world" -o output.mp3 --mp3 -q=-1
# Best quality for mp3
$ aspeak -t "Hello, world" -o output.mp3 --mp3 -q=3

Read text from file and speak it.

$ cat input.txt | aspeak

or

$ aspeak -f input.txt

with custom encoding:

$ aspeak -f input.txt -e gbk

Read from stdin and speak it.

$ aspeak

or (more verbose)

$ aspeak -f -

maybe you prefer:

$ aspeak -l zh-CN << EOF
我能吞下玻璃而不伤身体。
EOF

Speak Chinese.

$ aspeak -t "你好,世界!" -l zh-CN

Use a custom voice.

$ aspeak -t "你好,世界!" -v zh-CN-YunjianNeural

Custom pitch, rate and style

$ aspeak -t "你好,世界!" -v zh-CN-XiaoxiaoNeural -p 1.5 -r 0.5 -S sad
$ aspeak -t "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=-10% -r=+5% -S cheerful
$ aspeak -t "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+40Hz -r=1.2f -S fearful
$ aspeak -t "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=high -r=x-slow -S calm
$ aspeak -t "你好,世界!" -v zh-CN-XiaoxiaoNeural -p=+1st -r=-7% -S lyrical

Advanced Usage

Use a custom audio format for output

Note: When outputing to default speaker, using a non-wav format may lead to white noises.

$ aspeak -t "Hello World" -F Riff48Khz16BitMonoPcm -o high-quality.wav

About This Application

  • I found Azure TTS can synthesize nearly authentic human voice, which is very interesting 😆 .
  • I wrote this program to learn Azure Cognitive Services.
  • And I use this program daily, because espeak and festival outputs terrible 😨 audio.
    • But I respect 🙌 their maintainers' work, both are good open source software and they can be used off-line.
  • I hope you like it ❤️ .

Alternative Applications

Comments
  • Could not extract token from webpage

    Could not extract token from webpage

    Raise this error today

    def _get_auth_token() -> str:
        """
        Get a trial auth token from the trial webpage.
        """
        response = requests.get(TRAIL_URL)
        if response.status_code != 200:
            raise errors.TokenRetrievalError(status_code=response.status_code)
        text = response.text
    
        # We don't need bs4, because a little of regex is enough.
    
        match = re.search(r'\s+var\s+localizedResources\s+=\s+\{((.|\n)*?)\};', text, re.M)
        retrieval_error = errors.TokenRetrievalError(message='Could not extract token from webpage.',
                                                     status_code=response.status_code)
        if match is None:
            raise retrieval_error
        token = re.search(r'\s+token:\s*"([^"]+)"', match.group(1), re.M)
        if token is None:
            raise retrieval_error
        return token.group(1)
    

    https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/#overview works normal in browser in same machine.

    opened by ArthurDavidson 16
  • Running in python script

    Running in python script

    How can I run this in my python script, just that without text being typed in, there would be used translated_text instead: def speak_paste(): try: spoken_text = driver1.find_element_by_xpath("/html/body/div/div[2]/div[3]/span").text test_str = (spoken_text) res = " ".join(lookp_dict.get(ele, ele) for ele in test_str.split()) pyperclip.copy(res) translator = deepl.Translator('') result = translator.translate_text((res), target_lang="ru", formality="less", preserve_formatting="1") translated_text = result.text This is basically using subtitle text to translate it with deepl, but I want just to pass this translated text to azure tts to synthesize it.

    opened by Funktionar 11
  • Faster audio output/processing

    Faster audio output/processing

    Possible to use this in real-time communications? Compared with just azure it's slower and I have the deepl API to talk with foreigners. I'd like to get the audio within 200 ms and output it to a sound device, if it's feasible.

    opened by Funktionar 9
  • need help

    need help

    need help. C:\Users\公司>aspeak Traceback (most recent call last): File "c:\users\公司\appdata\local\programs\python\python37\lib\runpy.py", line 193, in run_module_as_main "main", mod_spec) File "c:\users\公司\appdata\local\programs\python\python37\lib\runpy.py", line 85, in run_code exec(code, run_globals) File "C:\Users\公司\AppData\Local\Programs\Python\Python37\Scripts\aspeak.exe_main.py", line 5, in File "c:\users\公司\appdata\local\programs\python\python37\lib\site-packages\aspeak_main.py", line 1, in from .cli import main File "c:\users\公司\appdata\local\programs\python\python37\lib\site-packages\aspeak\cli_init_.py", line 1, in from .main import main File "c:\users\公司\appdata\local\programs\python\python37\lib\site-packages\aspeak\cli\main.py", line 11, in from .parser import parser File "c:\users\公司\appdata\local\programs\python\python37\lib\site-packages\aspeak\cli\parser.py", line 5, in from .value_parsers import pitch, rate, format File "c:\users\公司\appdata\local\programs\python\python37\lib\site-packages\aspeak\cli\value_parsers.py", line 26 if (result := try_parse_float(arg)) and result[0]: ^ SyntaxError: invalid syntax

    opened by jkcox2016 3
  • Speech synthesis canceled: CancellationReason.Error WebSocket operation failed. Internal error: 3. Error details: WS_ERROR_UNDERLYING_IO_ERROR USP state: 4. Received audio size: 1998720 bytes.

    Speech synthesis canceled: CancellationReason.Error WebSocket operation failed. Internal error: 3. Error details: WS_ERROR_UNDERLYING_IO_ERROR USP state: 4. Received audio size: 1998720 bytes.

    Speech synthesis canceled: CancellationReason.Error WebSocket operation failed. Internal error: 3. Error details: WS_ERROR_UNDERLYING_IO_ERROR USP state: 4. Received audio size: 1998720 bytes.

    opened by RouderSky 1
  • RuntimeError in ssml_to_speech_async

    RuntimeError in ssml_to_speech_async

    An RuntimeError occurred while calling ssml_to_speech_async of instance of SpeechToFileService.

    CRITICAL: Traceback (most recent call last):
      ......
      File "E:\***\tts.py", line 17, in tts
        return provider.ssml_to_speech_async(ssml,path=path)  # type: ignore
      File "F:\ProgramData\Miniconda3\lib\site-packages\aspeak\api\api.py", line 110, in wrapper
        self._setup_synthesizer(kwargs['path'])
      File "F:\ProgramData\Miniconda3\lib\site-packages\aspeak\api\api.py", line 139, in _setup_synthesizer
        self._synthesizer = speechsdk.SpeechSynthesizer(self._config, self._output)
      File "F:\ProgramData\Miniconda3\lib\site-packages\azure\cognitiveservices\speech\speech.py", line 1598, in __init__
        self._impl = self._get_impl(impl.SpeechSynthesizer, speech_config, audio_config,
      File "F:\ProgramData\Miniconda3\lib\site-packages\azure\cognitiveservices\speech\speech.py", line 1703, in _get_impl
        _impl = synth_type._from_config(speech_config._impl, None if audio_config is None else audio_config._impl)
    RuntimeError: Exception with an error code: 0x8 (SPXERR_FILE_OPEN_FAILED)
    [CALL STACK BEGIN]
    
        > pal_string_to_wstring
    
        - pal_string_to_wstring
    
        - synthesizer_create_speech_synthesizer_from_config
    
        - synthesizer_create_speech_synthesizer_from_config
    
        - 00007FFE37F772C4 (SymFromAddr() error: 试图访问无效的地址。)
    
        - 00007FFE37FC76A8 (SymFromAddr() error: 试图访问无效的地址。)
    
        - 00007FFE37FC87A8 (SymFromAddr() error: 试图访问无效的地址。)
    
        - PyArg_CheckPositional
    
        - Py_NewReference
    
        - PyEval_EvalFrameDefault
    
        - Py_NewReference
    
        - PyEval_EvalFrameDefault
    
        - PyFunction_Vectorcall
    
        - PyFunction_Vectorcall
    
        - PyMem_RawStrdup
    
        - Py_NewReference
    
    
    
    [CALL STACK END]
    

    tts.py

    from aspeak import SpeechToFileService,AudioFormat,FileFormat
    
    provider=None
    fmt=AudioFormat(FileFormat.MP3,-1)
    
    def init():
        global provider
        provider=SpeechToFileService(locale="zh-CN",audio_format=fmt)
    
    def tts(ssml:str,path:str):
        if provider is None:
            init()
        return provider.ssml_to_speech_async(ssml,path=path)  # type: ignore
    

    The thing is, this error seemed to occurred randomly, and only when I created(Finished) over 20 ssml_to_speech_async instance does it occurs. This error seems can't be catch through try

    bug wontfix upstream 
    opened by flt6 1
  • Use correct function at ssml_to_speech_async

    Use correct function at ssml_to_speech_async

    I updated to the latest version, and found both example for ssml_to_speech_async and my own code couldn't run correctly. I found in api/api.py file function ssml_to_speech_async uses speak_ssml not speak_ssml_async, and I think this is the reason.

    opened by flt6 1
  • Can you add an option for output file name same to text?

    Can you add an option for output file name same to text?

    for example, if the text is a simple text, the output name would be same.

    aspeak -t "Hello" -ot --mp3 then the output would be "Hello.mp3"

    enhancement wontfix 
    opened by bk111 1
  • Error: Speech synthesis canceled: CancellationReason.Error

    Error: Speech synthesis canceled: CancellationReason.Error

    when i run: aspeak -l zh-CN -f 从美丽的室友开始.txt -o 从美丽的室友开始.wav the txt file has 90k chars and 279KB. and then it shows: Error: Speech synthesis canceled: CancellationReason.Error Connection was closed by the remote host. Error code: 1007. Error details: Websocket message size cannot exceed 65536 bytes USP state: 3. Received audio size: 0 bytes. then I only used the first paragraph(4k letters and 12.8KB),and it works well.

    imo,it may cause by too large file.

    opened by BonexP 1
  • Error: 'gbk' codec can't decode byte 0xac in position 2: illegal multibyte sequence

    Error: 'gbk' codec can't decode byte 0xac in position 2: illegal multibyte sequence

    aspeak -f 296.txt -l zh-cn Error: 'gbk' codec can't decode byte 0xac in position 2: illegal multibyte sequence While 296.txt is UTF-8 text file.

    UTF8 is more popular than GBK nowadays.

    opened by danangua 1
  • Failed to synthesize speech!

    Failed to synthesize speech!

    Failed to synthesize speech!

    result = speech.text_to_speech(
                str, voice='en-US-JennyNeural', style='excited')
            print(result)
    

    the result gets None

    opened by 2010hexi 0
  • ERROR: Cannot install aspeak because these package versions have conflicting dependencies.

    ERROR: Cannot install aspeak because these package versions have conflicting dependencies.

    To fix this you could try to:

    1. loosen the range of package versions you've specified
    2. remove package versions to allow pip attempt to solve the dependency conflict

    ERROR: Cannot install aspeak==0.1, aspeak==0.1.1, aspeak==0.2.0, aspeak==0.2.1, aspeak==0.3.0, aspeak==0.3.1, aspeak==0.3.2, aspeak==1.0.0, aspeak==1.1.0, aspeak==1.1.1, aspeak==1.1.2, aspeak==1.1.3, aspeak==1.1.4, aspeak==1.2.0, aspeak==1.3.0, aspeak==1.3.1, aspeak==1.4.0, aspeak==1.4.1, aspeak==1.4.2, aspeak==2.0.0, aspeak==2.0.1, aspeak==2.1.0, aspeak==3.0.0, aspeak==3.0.1 and aspeak==3.0.2 because these package versions have conflicting dependencies.

    opened by qirenzhidao 1
  • ImportError from urllib3: cannot import name 'Mapping' from 'collections'

    ImportError from urllib3: cannot import name 'Mapping' from 'collections'

    /Users/meeia ~ aspeak -t "Hello, world"
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/3.10/bin/aspeak", line 5, in <module>
        from aspeak.__main__ import main
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/aspeak/__main__.py", line 1, in <module>
        from .cli import main
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/aspeak/cli/__init__.py", line 1, in <module>
        from .main import main
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/aspeak/cli/main.py", line 8, in <module>
        from .voices import list_voices
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/aspeak/cli/voices.py", line 1, in <module>
        import requests
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/requests/__init__.py", line 43, in <module>
        import urllib3
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/__init__.py", line 8, in <module>
        from .connectionpool import (
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connectionpool.py", line 29, in <module>
        from .connection import (
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/connection.py", line 39, in <module>
        from .util.ssl_ import (
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/__init__.py", line 3, in <module>
        from .connection import is_connection_dropped
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/connection.py", line 3, in <module>
        from .wait import wait_for_read
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/wait.py", line 1, in <module>
        from .selectors import (
      File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/urllib3/util/selectors.py", line 14, in <module>
        from collections import namedtuple, Mapping
    ImportError: cannot import name 'Mapping' from 'collections' (/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/collections/__init__.py)
    /Users/meeia ~
    
    
    wontfix upstream 
    opened by cnmeeia 3
Releases(v3.0.2)
Owner
Levi Zim
Developer / Student / AI / Data Science :octocat: Telegram channel: t.me/kxxtchannel :octocat: PGP: 0x57670CCFA42CCF0A :octocat: Reading: t.me/kxxt_read
Levi Zim
ERISHA is a mulitilingual multispeaker expressive speech synthesis framework. It can transfer the expressivity to the speaker's voice for which no expressive speech corpus is available.

ERISHA: Multilingual Multispeaker Expressive Text-to-Speech Library ERISHA is a multilingual multispeaker expressive speech synthesis framework. It ca

Ajinkya Kulkarni 42 Feb 10, 2022
Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

Microsoft 17.2k Oct 23, 2022
Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit

CNTK Chat Windows build status Linux build status The Microsoft Cognitive Toolkit (https://cntk.ai) is a unified deep learning toolkit that describes

Microsoft 17k Feb 11, 2021
Code for the paper Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration

IMAGINE: Language as a Cognitive Tool to Imagine Goals in Curiosity Driven Exploration This repo contains the code base of the paper Language as a Cog

Flowers Team 25 Aug 1, 2022
RCD: Relation Map Driven Cognitive Diagnosis for Intelligent Education Systems

RCD: Relation Map Driven Cognitive Diagnosis for Intelligent Education Systems This is our implementation for the paper: Weibo Gao, Qi Liu*, Zhenya Hu

BigData Lab @USTC  中科大大数据实验室 9 Jul 13, 2022
TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Prediction.

TalkNet 2 [WIP] TalkNet 2: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis with Explicit Pitch and Duration Predictio

Rishikesh (ऋषिकेश) 67 Oct 20, 2022
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis Jungil Kong, Jaehyeon Kim, Jaekyoung Bae In our paper, we p

Rishikesh (ऋषिकेश) 28 Oct 14, 2022
PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

WaveGrad2 - PyTorch Implementation PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. Status (202

Keon Lee 56 Sep 15, 2022
PyTorch Implementation of NCSOFT's FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

FastPitchFormant - PyTorch Implementation PyTorch Implementation of FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis. Qu

Keon Lee 58 Sep 1, 2022
A Flow-based Generative Network for Speech Synthesis

WaveGlow: a Flow-based Generative Network for Speech Synthesis Ryan Prenger, Rafael Valle, and Bryan Catanzaro In our recent paper, we propose WaveGlo

NVIDIA Corporation 2k Oct 23, 2022
PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

VAENAR-TTS - PyTorch Implementation PyTorch Implementation of VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis.

Keon Lee 65 Aug 28, 2022
PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Daft-Exprt - PyTorch Implementation PyTorch Implementation of Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis The

Keon Lee 44 Oct 14, 2022
Official implementation of deep Gaussian process (DGP)-based multi-speaker speech synthesis with PyTorch.

Multi-speaker DGP This repository provides official implementation of deep Gaussian process (DGP)-based multi-speaker speech synthesis with PyTorch. O

sarulab-speech 24 Sep 7, 2022
PyTorch implementation of Tacotron speech synthesis model.

tacotron_pytorch PyTorch implementation of Tacotron speech synthesis model. Inspired from keithito/tacotron. Currently not as much good speech quality

Ryuichi Yamamoto 277 Sep 20, 2022
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

Ryuichi Yamamoto 1.7k Oct 17, 2022
A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis

WaveGlow A PyTorch implementation of the WaveGlow: A Flow-based Generative Network for Speech Synthesis Quick Start: Install requirements: pip install

Yuchao Zhang 204 Jul 14, 2022
Implementation of "StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis"

StrengthNet Implementation of "StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis" https://arxiv.org/abs/2110

RuiLiu 59 Sep 22, 2022
PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Lip to Speech Synthesis with Visual Context Attentional GAN This repository contains the PyTorch implementation of the following paper: Lip to Speech

null 5 Jul 25, 2022
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis

Bilateral Denoising Diffusion Models (BDDMs) This is the official PyTorch implementation of the following paper: BDDM: BILATERAL DENOISING DIFFUSION M

null 153 Oct 14, 2022