Simple, hackable offline speech to text - using the VOSK-API.

Overview

Nerd Dictation

Offline Speech to Text for Desktop Linux. - See demo video.

This is a utility that provides simple access speech to text for using in Linux without being tied to a desktop environment.

Simple
This is a single file Python script with minimal dependencies.
Hackable
User configuration lets you manipulate text using Python string operations.
Zero Overhead
As this relies on manual activation there are no background processes.

Dictation is accessed manually with begin/end commands.

This uses the excellent vosk-api.

Usage

It is suggested to bind begin/end/cancel to shortcut keys.

nerd-dictation begin
nerd-dictation end

For details on how this can be used, see: nerd-dictation --help and nerd-dictation begin --help.

Features

Specific features include:

Numbers as Digits

Optional conversion from numbers to digits.

So Three million five hundred and sixty second becomes 3,000,562nd.

A series of numbers (such as reciting a phone number) is also supported.

So Two four six eight becomes 2,468.

Time Out
Optionally end speech to text early when no speech is detected for a given number of seconds. (without an explicit call to end which is otherwise required).
Output Type
Output can simulate keystroke events (default) or simply print to the standard output.
User Configuration Script
User configuration is just a Python script which can be used to manipulate text using Python's full feature set.

See nerd-dictation begin --help for details on how to access these options.

Dependencies

  • Python 3.
  • The VOSK-API.
  • parec command (for recording from pulse-audio).
  • xdotool command to simulate keyboard input.

Install

pip3 install vosk
git clone https://github.com/ideasman42/nerd-dictation.git
cd nerd-dictation
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 model

To test dictation:

./nerd-dictation begin --vosk-model-dir=./model &
# Start speaking.
./nerd-dictation end
  • Reminder that it's up to you to bind begin/end/cancel to actions you can easily access (typically key shortcuts).

  • To avoid having to pass the --vosk-model-dir argument, copy the model to the default path:

    mkdir -p ~/.config/nerd-dictation
    mv ./model ~/.config/nerd-dictation

Hint

Once this is working properly you may wish to download one of the larger language models for more accurate dictation. They are available here.

Configuration

This is an example of a trivial configuration file which simply makes the input text uppercase.

# ~/.config/nerd-dictation/nerd-dictation.py
def nerd_dictation_process(text):
    return text.upper()

A more comprehensive configuration is included in the examples/ directory.

Hints

  • The processing function can be used to implement your own actions using keywords of your choice. Simply return a blank string if you have implemented your own text handling.
  • Context sensitive actions can be implemented using command line utilities to access the active window.

Paths

Local Configuration
~/.config/nerd-dictation/nerd-dictation.py
Language Model

~/.config/nerd-dictation/model

Note that --vosk-model-dir=PATH can be used to override the default.

Command Line Arguments

Output of nerd-dictation --help

usage:

nerd-dictation [-h]  ...

This is a utility that activates text to speech in Linux. While it could use any system currently it uses the VOSK-API.

positional arguments:

begin: Begin dictation.
end: End dictation.
cancel: Cancel dictation.
optional arguments:
-h, --help show this help message and exit

Subcommand: begin

usage:

nerd-dictation begin [-h] [--cookie FILE_PATH] [--vosk-model-dir DIR]
                     [--pulse-device-name IDENTIFIER]
                     [--sample-rate HZ] [--defer-output] [--continuous]
                     [--timeout SECONDS] [--idle-time SECONDS]
                     [--delay-exit SECONDS]
                     [--punctuate-from-previous-timeout SECONDS]
                     [--full-sentence] [--numbers-as-digits]
                     [--numbers-use-separator] [--output OUTPUT_METHOD]
                     [- ...]

This creates the directory used to store internal data, so other commands such as sync can be performed.

optional arguments:
-h, --help show this help message and exit
--cookie FILE_PATH
  Location for writing a temporary cookie (this file is monitored to begin/end dictation).
--vosk-model-dir DIR
  Path to the VOSK model, see: https://alphacephei.com/vosk/models
--pulse-device-name IDENTIFIER
  The name of the pulse-audio device to use for recording. See the output of "pactl list sources" to find device names (using the identifier following "Name:").
--sample-rate HZ
  The sample rate to use for recording (in Hz). Defaults to 44100.
--defer-output

When enabled, output is deferred until exiting.

This prevents text being typed during speech (implied with --output=STDOUT)

--continuous Enable this option, when you intend to keep the dictation process enabled for extended periods of time. without this enabled, the entirety of this dictation session will be processed on every update. Only used when --defer-output is disabled.
--timeout SECONDS
  Time out recording when no speech is processed for the time in seconds. This can be used to avoid having to explicitly exit (zero disables).
--idle-time SECONDS
  Time to idle between processing audio from the recording. Setting to zero is the most responsive at the cost of high CPU usage. The default value is 0.1 (processing 10 times a second), which is quite responsive in practice (the maximum value is clamped to 0.5)
--delay-exit SECONDS
  The time to continue running after an exit request. this can be useful so "push to talk" setups can be released while you finish speaking (zero disables).
--punctuate-from-previous-timeout SECONDS
  The time-out in seconds for detecting the state of dictation from the previous recording, this can be useful so punctuation it is added before entering the dictation(zero disables).
--full-sentence
  Capitalize the first character. This is also used to add either a comma or a full stop when dictation is performed under the --punctuate-from-previous-timeout value.
--numbers-as-digits
  Convert numbers into digits instead of using whole words.
--numbers-use-separator
  Use a comma separators for numbers.
--output OUTPUT_METHOD
 

Method used to at put the result of speech to text.

  • SIMULATE_INPUT simulate keystrokes (default).
  • STDOUT print the result to the standard output. Be sure only to handle text from the standard output as the standard error may be used for reporting any problems that occur.
- ... End argument parsing.
This can be used for user defined arguments which configuration scripts may read from the sys.argv.

Subcommand: end

usage:

nerd-dictation end [-h] [--cookie FILE_PATH]

This ends dictation, causing the text to be typed in.

optional arguments:
-h, --help show this help message and exit
--cookie FILE_PATH
  Location for writing a temporary cookie (this file is monitored to begin/end dictation).

Subcommand: cancel

usage:

nerd-dictation cancel [-h] [--cookie FILE_PATH]

This cancels dictation.

optional arguments:
-h, --help show this help message and exit
--cookie FILE_PATH
  Location for writing a temporary cookie (this file is monitored to begin/end dictation).

Details

  • Typing in results will never press enter/return.
  • Pulse audio is used for recording.
  • Recording and speech to text a performed in parallel.

Examples

Store the result of speech to text as a variable in the shell:

SPEECH="$(nerd-dictation begin --timeout=1.0 --output=STDOUT)"

Example Configurations

These are example configurations you may use as a reference.

Other Software

  • Elograf - nerd-dictation GUI front-end that runs as a tray icon.

Limitations

  • Text from VOSK is all lower-case, while the user configuration can be used to set the case of common words like I this isn't very convenient (see the example configuration for details).

  • For some users the delay in start up may be noticeable on systems with slower hard disks especially when running for the 1st time (a cold start).

    This is a limitation with the choice not to use a service that runs in the background. Recording begins before any the speech-to-text components are loaded to mitigate this problem.

Further Work

  • And a general solution to capitalize words (proper nouns for example).
  • Wayland support (this should be quite simple to support and mainly relies on a replacement for xdotool).
  • Add a setup.py for easy installation on uses systems.
  • Possibly other speech to text engines (only if they provide some significant benefits).
  • Possibly support Windows & macOS.
Owner
Campbell Barton
Campbell Barton
Creating an LSTM model to generate music

Music-Generation Creating an LSTM model to generate music music-generator Used to create basic sin wave sounds music-ai Contains the functions to conv

Jerin Joseph 2 Dec 02, 2021
OpenAI CLIP text encoders for multiple languages!

Multilingual-CLIP OpenAI CLIP text encoders for any language Colab Notebook · Pre-trained Models · Report Bug Overview OpenAI recently released the pa

Fredrik Carlsson 481 Dec 30, 2022
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
The simple project to separate mixed voice (2 clean voices) to 2 separate voices.

Speech Separation The simple project to separate mixed voice (2 clean voices) to 2 separate voices. Result Example (Clisk to hear the voices): mix ||

vuthede 31 Oct 30, 2022
Using Bert as the backbone model for lime, designed for NLP task explanation (sentence pair text classification task)

Lime Comparing deep contextualized model for sentences highlighting task. In addition, take the classic explanation model "LIME" with bert-base model

JHJu 2 Jan 18, 2022
A simple command line tool for text to image generation, using OpenAI's CLIP and a BigGAN

artificial intelligence cosmic love and attention fire in the sky a pyramid made of ice a lonely house in the woods marriage in the mountains lantern

Phil Wang 2.3k Jan 01, 2023
Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense.

PythonTextObfuscator Takes a string and puts it through different languages in Google Translate a requested amount of times, returning nonsense. Requi

2 Aug 29, 2022
TextFlint is a multilingual robustness evaluation platform for natural language processing tasks,

TextFlint is a multilingual robustness evaluation platform for natural language processing tasks, which unifies general text transformation, task-specific transformation, adversarial attack, sub-popu

TextFlint 587 Dec 20, 2022
This repository contains the code for running the character-level Sandwich Transformers from our ACL 2020 paper on Improving Transformer Models by Reordering their Sublayers.

Improving Transformer Models by Reordering their Sublayers This repository contains the code for running the character-level Sandwich Transformers fro

Ofir Press 53 Sep 26, 2022
Simple, hackable offline speech to text - using the VOSK-API.

Simple, hackable offline speech to text - using the VOSK-API.

Campbell Barton 844 Jan 07, 2023
Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec

Wake Wake: Context-Sensitive Automatic Keyword Extraction Using Word2vec Abstract استخراج خودکار کلمات کلیدی متون کوتاه فارسی با استفاده از word2vec ب

Omid Hajipoor 1 Dec 17, 2021
Applied Natural Language Processing in the Enterprise - An O'Reilly Media Publication

Applied Natural Language Processing in the Enterprise This is the companion repo for Applied Natural Language Processing in the Enterprise, an O'Reill

Applied Natural Language Processing in the Enterprise 95 Jan 05, 2023
Dense Passage Retriever - is a set of tools and models for open domain Q&A task.

Dense Passage Retrieval Dense Passage Retrieval (DPR) - is a set of tools and models for state-of-the-art open-domain Q&A research. It is based on the

Meta Research 1.1k Jan 07, 2023
Python library to make development of portfolio analysis faster and easier

Trafalgar Python library to make development of portfolio analysis faster and easier Installation 🔥 For the moment, Trafalgar is still in beta develo

Santosh Passoubady 641 Jan 01, 2023
precise iris segmentation

PI-DECODER Introduction PI-DECODER, a decoder structure designed for Precise Iris Segmentation and Location. The decoder structure is shown below: Ple

8 Aug 08, 2022
AI and Machine Learning workflows on Anthos Bare Metal.

Hybrid and Sovereign AI on Anthos Bare Metal Table of Contents Overview Terraform as IaC Substrate ABM Cluster on GCE using Terraform TensorFlow ResNe

Google Cloud Platform 8 Nov 26, 2022
spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines

spaCy-wrap: For Wrapping fine-tuned transformers in spaCy pipelines spaCy-wrap is minimal library intended for wrapping fine-tuned transformers from t

Kenneth Enevoldsen 32 Dec 29, 2022
Pretrain CPM - 大规模预训练语言模型的预训练代码

CPM-Pretrain 版本更新记录 为了促进中文自然语言处理研究的发展,本项目提供了大规模预训练语言模型的预训练代码。项目主要基于DeepSpeed、Megatron实现,可以支持数据并行、模型加速、流水并行的代码。 安装 1、首先安装pytorch等基础依赖,再安装APEX以支持fp16。 p

Tsinghua AI 37 Dec 06, 2022
This repository contains the codes for LipGAN. LipGAN was published as a part of the paper titled "Towards Automatic Face-to-Face Translation".

LipGAN Generate realistic talking faces for any human speech and face identity. [Paper] | [Project Page] | [Demonstration Video] Important Update: A n

Rudrabha Mukhopadhyay 438 Dec 31, 2022
Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP

Wikipedia-Utils: Preprocessing Wikipedia Texts for NLP This repository maintains some utility scripts for retrieving and preprocessing Wikipedia text

Masatoshi Suzuki 44 Oct 19, 2022