Speech recognition tool to convert audio to text transcripts, for Linux and Raspberry Pi.

Overview

Spchcat

Description

spchcat is a command-line tool that reads in audio from .WAV files, a microphone, or system audio inputs and converts any speech found into text. It runs locally on your machine, with no web API calls or network activity, and is open source. It is built on top of Coqui's speech to text library, TensorFlow, KenLM, and data from Mozilla's Common Voice project.

It supports multiple languages thanks to Coqui's library of models. The accuracy of the recognized text will vary widely depending on the language, since some have only small amounts of training data. You can help improve future models by contributing your voice.

Installation

x86

On Debian-based x86 Linux systems like Ubuntu you should be able to install the latest .deb package by downloading and double-clicking it. Other distributions are currently unsupported. The tool requires PulseAudio, which is already present on most desktop systems and can be installed manually if it is missing.

There's a notebook you can run in Colab at notebooks/install.ipynb that shows all installation steps.
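As a rough pre-install check, the snippet below reports whether a `pulseaudio` binary is on the PATH. It is only a sketch: the package file name in the comments is borrowed from the installer section of this README and may not match the file you downloaded, and installing through `apt` rather than by double-clicking is an assumption about your setup.

```shell
# Report whether the pulseaudio binary is on PATH
if command -v pulseaudio >/dev/null 2>&1; then
  pulse_status="found"
else
  pulse_status="missing"
fi
echo "PulseAudio: $pulse_status"

# To install from a terminal instead of double-clicking
# (assumed file name, adjust to your download):
#   sudo apt install ./spchcat_0.0-2_amd64.deb
# If PulseAudio is missing on Debian or Ubuntu:
#   sudo apt install pulseaudio
```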

Raspberry Pi

To install on a Raspberry Pi, download the latest .deb installer package and either double-click on it from the desktop, or run dpkg -i ~/Downloads/spchcat_0.0-2_armhf.deb from the terminal. It will take several minutes to unpack all the language files. This version has only been tested on a Raspberry Pi 4 running the latest release of Raspbian, released October 30th 2021. It's expected to fail on Raspberry Pi 1 and Raspberry Pi Zero boards, due to their CPU architecture.
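Before installing, it can help to check which CPU architecture the board reports. The sketch below assumes the unsupported boards identify themselves as ARMv6 (`armv6l` from `uname -m`), which is the case for the Pi 1 and Pi Zero; the function name `pi_arch_ok` is made up for this example.

```shell
# pi_arch_ok [ARCH]: succeed unless the architecture string is ARMv6.
# ARCH defaults to the output of `uname -m`.
pi_arch_ok() {
  arch=${1:-$(uname -m)}
  case "$arch" in
    armv6*)
      echo "$arch: ARMv6 board, the package is expected to fail here" >&2
      return 1
      ;;
    *)
      echo "$arch: not ARMv6"
      return 0
      ;;
  esac
}

if pi_arch_ok; then
  echo "Architecture check passed"
fi
```

Passing this check does not guarantee the package works, since only the Raspberry Pi 4 has been tested.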

Usage

After installation, you should be able to run it with no arguments to start capturing audio from the default microphone source, with the results output to the terminal:

spchcat

After you've run the command, start speaking, and you should see the words you're saying appear. The speech recognition is still a work in progress, and the accuracy will depend a lot on the noise levels, your accent, and the complexity of the words, but hopefully you should see something close enough to be useful for simple note taking or other purposes.

System Audio

If you don't have a microphone attached, or want to transcribe audio coming from another program, you can set the --source argument to 'system'. This will attempt to listen to the audio that your machine is playing, including any videos or songs, and transcribe any speech found.

spchcat --source=system
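If system capture produces no text, one thing worth checking is whether PulseAudio exposes a monitor source for your output device. This README doesn't describe how system capture is implemented, so treating a monitor source as a prerequisite is an assumption. The helper name `has_monitor_source` is hypothetical; it just scans the tab-separated listing printed by `pactl list short sources`.

```shell
# has_monitor_source: read `pactl list short sources` output on stdin and
# succeed if any source name (second tab-separated field) ends in ".monitor"
has_monitor_source() {
  awk -F '\t' '$2 ~ /\.monitor$/ { found = 1 } END { exit !found }'
}

# Only query PulseAudio when pactl is installed
if command -v pactl >/dev/null 2>&1; then
  if pactl list short sources 2>/dev/null | has_monitor_source; then
    echo "monitor source found"
  else
    echo "no monitor source listed"
  fi
fi
```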

WAV Files

One of the most common audio file formats is WAV. If you don't have any to test with, you can download Coqui's test set to try this option out. If you need to convert files from another format like '.mp3', I recommend using FFmpeg. As with the other source options, spchcat will attempt to find any speech in the files and convert it into a transcript. You don't have to explicitly set the --source argument: as long as file names are present on the command line, file input will be the default.

spchcat audio/8455-210777-0068.wav 

If you're using the audio file from the test set, you should see output like the following:

TensorFlow: v2.3.0-14-g4bdd3955115
 Coqui STT: v1.1.0-0-gf3605e23
your power is sufficient i said 

You can also specify a folder instead of a single filename, and all .wav files within that directory will be transcribed.
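For files in another format, the FFmpeg conversion mentioned above might look like the sketch below. The 16 kHz mono target is an assumption based on what Coqui's models are generally trained on; this README doesn't state which sample rates spchcat accepts. The helper names `wav_name` and `to_wav` and the file name in the usage comment are made up for illustration.

```shell
# wav_name FILE: print FILE with its extension replaced by .wav
wav_name() {
  printf '%s\n' "${1%.*}.wav"
}

# to_wav FILE: convert FILE to a 16 kHz mono WAV next to the original.
# Requires ffmpeg on the PATH.
to_wav() {
  ffmpeg -loglevel error -i "$1" -ar 16000 -ac 1 "$(wav_name "$1")"
}

# Example with a hypothetical file name:
#   to_wav talk.mp3 && spchcat talk.wav
```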

Language Support

So far this documentation has assumed you're using American English, but the tool will default to looking for the language your system has been configured to use. It first looks for the one specified in the LANG environment variable. If no model for that language is found, it will default back to 'en_US'. You can override this by setting the --language argument on the command line, for example:

spchcat --language=de_DE

This works independently of --source and other options, so you can transcribe microphone, system audio, or files in any of the supported languages. Some languages have very little training data, so their accuracy may suffer. If you don't care about country-specific variants, you can also just specify the language part of the code, for example --language=en. This will pick any model that supports the language, regardless of country. The same thing happens if a particular language and country pair isn't found: the tool will log a warning and fall back to any country that supports the language. For example, if 'en_GB' is specified but only 'en_US' is present, 'en_US' will be used.
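The lookup order described above can be sketched as a small shell function. It illustrates the documented behaviour and is not the tool's actual implementation. The function name `pick_language`, the idea that each model lives in a sub-folder named after its code, and the stripping of an encoding suffix such as `.UTF-8` from LANG are all assumptions.

```shell
# pick_language CODE [MODELS_DIR]: print the model code that would be used
pick_language() {
  want=${1%%.*}                    # drop an encoding suffix such as ".UTF-8"
  dir=${2:-/etc/spchcat/models}    # assumed layout: one sub-folder per code
  if [ -d "$dir/$want" ]; then     # exact match, e.g. de_DE
    printf '%s\n' "$want"
    return 0
  fi
  lang=${want%%_*}                 # language part only, e.g. "en"
  for candidate in "$dir/$lang"_*; do
    if [ -d "$candidate" ]; then
      case "$want" in              # warn only when a country was requested
        *_*) echo "warning: no model for $want, using ${candidate##*/}" >&2 ;;
      esac
      printf '%s\n' "${candidate##*/}"
      return 0
    fi
  done
  printf 'en_US\n'                 # final fallback
}

pick_language "${LANG:-en_US}"
```

With only `en_US` and `de_DE` installed, `pick_language en_GB` prints `en_US` after a warning, and `pick_language de` prints `de_DE`.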

Code Language
am_ET Amharic
bn_IN Bengali
br_FR Breton
ca_ES Catalan
cnh_MM Hakha-Chin
cs_CZ Czech
cv_RU Chuvash
cy_GB Welsh
de_DE German
dv_MV Dhivehi
el_GR Greek
en_US English
et_EE Estonian
eu_ES Basque
fi_FI Finnish
fr_FR French
fy_NL Frisian
ga_IE Irish
hu_HU Hungarian
id_ID Indonesian
it_IT Italian
ka_GE Georgian
ky_KG Kyrgyz
lg_UG Luganda
lt_LT Lithuanian
lv_LV Latvian
mn_MN Mongolian
mt_MT Maltese
nl_NL Dutch
or_IN Odia
pt_PT Portuguese
rm_CH Romansh-Sursilvan
ro_RO Romanian
ru_RU Russian
rw_RW Kinyarwanda
sah_RU Sakha
sb_DE Upper-Sorbian
sl_SI Slovenian
sw_KE Swahili-Congo
ta_IN Tamil
th_TH Thai
tr_TR Turkish
tt_RU Tatar
uk_UK Ukrainian
wo_SN Wolof
yo_NG Yoruba

All of these models have been collected by Coqui, and contributed by organizations like Inclusive Technology for Marginalized Languages or by individuals. All follow the conventions of Coqui's STT library, so custom models could potentially be used, but training and deploying those is outside the scope of this document. The models themselves are provided under a variety of open source licenses, which can be inspected in their source folders (typically inside /etc/spchcat/models/).

Saving Output

By default spchcat writes any recognized text to the terminal, but it's designed to behave like a normal Unix command-line tool, so the output can also be written to a file using redirection like this:

spchcat audio/8455-210777-0068.wav > /tmp/transcript.txt

If you then run cat /tmp/transcript.txt (or open it in an editor) you should see 'your power is sufficient i said'. You can also pipe the output to another command. Unfortunately you can't pipe audio into the tool from another executable.

There is one subtle difference between writing to a file and to the terminal. The transcription itself can take some time to settle into a final form, especially when waiting for long words to finish, so when it's being run live in a terminal you'll often see the last couple of words change. This isn't useful when writing to a file, so instead the output is finalized before it's written. This can introduce a small delay when writing live microphone or system audio input.
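Redirection also makes it easy to keep one transcript per recording. The loop below is a sketch under a few assumptions: the function name `transcribe_dir` and the `SPCHCAT` override variable are invented for this example, and only the recognized text is expected on standard output, as the redirection example above suggests.

```shell
# transcribe_dir IN_DIR OUT_DIR: write OUT_DIR/<name>.txt for each
# IN_DIR/<name>.wav. Set SPCHCAT to use a different binary path.
transcribe_dir() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for wav in "$in_dir"/*.wav; do
    [ -e "$wav" ] || continue            # no .wav files matched the glob
    name=$(basename "$wav" .wav)
    "${SPCHCAT:-spchcat}" "$wav" > "$out_dir/$name.txt"
  done
}

# Example with hypothetical folder names:
#   transcribe_dir audio transcripts
```

Passing the folder itself to spchcat, as described earlier, is simpler when a single combined transcript is enough.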

Build from Source

Tool

It's possible to build all dependencies from source, but I recommend downloading binary versions of Coqui's STT, TensorFlow Lite, and KenLM libraries from github.com/coqui-ai/STT/releases/download/v1.1.0/native_client.tflite.Linux.tar.xz. Extract this archive to a folder, and then from inside a folder containing this repo run the following to build the spchcat tool itself:

make spchcat LINK_PATH_STT=-L../STT_download

You should replace ../STT_download with the path to the Coqui library folder. After this you should see a spchcat executable binary in the repo folder. Because it relies on shared libraries, you'll need to specify a path to these too using LD_LIBRARY_PATH unless you have copies in system folders.

LD_LIBRARY_PATH=../STT_download ./spchcat
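The download and extraction step could be scripted roughly as follows. The URL is the one given above with an `https://` prefix added; the function name `fetch_stt` and the use of `curl` and `tar` are assumptions, and the sketch presumes the archive unpacks its libraries directly into the target folder.

```shell
STT_URL=https://github.com/coqui-ai/STT/releases/download/v1.1.0/native_client.tflite.Linux.tar.xz
STT_DIR=${STT_DIR:-../STT_download}

# fetch_stt: download the prebuilt library archive and unpack it into STT_DIR
fetch_stt() {
  archive=${STT_URL##*/}                 # file name part of the URL
  curl -L --fail -o "$archive" "$STT_URL" &&
    mkdir -p "$STT_DIR" &&
    tar -xJf "$archive" -C "$STT_DIR"
}

# Typical sequence, run from the repo folder:
#   fetch_stt && make spchcat LINK_PATH_STT=-L"$STT_DIR"
```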

Models

The previous step only built the executable binary itself, but for the complete tool you also need data files for each language. If you have the gh GitHub command line tool you can run the download_models.py script to fetch Coqui's releases into the build/models folder in your local repo. You can then run your locally-built tool against these models using the --languages_dir option:

LD_LIBRARY_PATH=../STT_download ./spchcat --languages_dir=build/models/
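While developing, typing the library path and model folder on every run gets tedious, so a small wrapper can help. This is only a convenience sketch: the function name `spchcat_local` and the three override variables are invented here, and the default paths simply mirror the commands above.

```shell
STT_DIR=${STT_DIR:-../STT_download}      # folder holding the Coqui shared libraries
MODELS_DIR=${MODELS_DIR:-build/models/}  # folder filled by download_models.py
SPCHCAT_BIN=${SPCHCAT_BIN:-./spchcat}    # locally built executable

# spchcat_local [ARGS...]: run the local build with both paths set,
# keeping any LD_LIBRARY_PATH entries that already exist
spchcat_local() {
  LD_LIBRARY_PATH="$STT_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    "$SPCHCAT_BIN" --languages_dir="$MODELS_DIR" "$@"
}

# Example:
#   spchcat_local --language=de_DE audio/8455-210777-0068.wav
```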

Installer

After you have the tool built and the model data downloaded, create_deb_package.sh will attempt to package them into a Debian installer archive. It will take several minutes to run, and the result ends up in spchcat_0.0-2_amd64.deb.

Release Process

There's a notebook at notebooks/build.pynb that runs through all the steps needed to download dependencies and data, build the executable, and create the final package. These steps are run inside an Ubuntu 18.04 Docker image to create the binaries that are released.

sudo docker run -it -v`pwd`:/spchcat ubuntu:bionic bash

Contributors

Tool code written by Pete Warden, heavily based on Coqui's STT example. It's a pretty thin wrapper on top of Coqui's speech to text library, so the Coqui team should get credit for their amazing work. It also relies on TensorFlow, KenLM, data from Mozilla's Common Voice project, and all the contributors to Coqui's model zoo.

License

Tool code is licensed under the Mozilla Public License Version 2.0; see LICENSE in this folder.

All other libraries and model data are released under their own licenses; see the relevant folders for more details.

Comments
  • How can I use downloaded models?

    <From a user email, added here for posterity>

    I really need to use spchcat with the Spanish model (es_ES). I see the model in Coqui's GitHub, but it is not included in your .deb package. How can I rebuild the package to include Spanish? Or are you planning a new version that includes it?

    opened by petewarden 1
  • Running on Standby Mode for File Input

    I am looking to use spchcat for on-demand .wav file transcription. However, I need the model to be preloaded and waiting for intermittent transcription of input .wav files. Are there plans to add such a feature?

    Environment

    uname -a
    Linux raspberrypi 5.15.32-v7+ #1538 SMP Thu Mar 31 19:38:48 BST 2022 armv7l GNU/Linux
    
    opened by kwokyto 0
  • Use feature test to expose `setenv`

    As per the man page, setenv requires _POSIX_C_SOURCE >= 200112L to be defined before including the appropriate header file (stdlib.h). As the other included header files include some standard headers transitively, this needs to go above all includes.

    opened by msbit 0
  • Use float literals for `TEST_FLTEQ`

    When using the TEST_FLTEQ macro, pass float literals for the comparison argument, to avoid errors of the form:

    error: absolute value function 'fabsf' given an argument of
    type 'double' but has parameter of type 'float' which may cause
    truncation of value [-Werror,-Wabsolute-value]
    
    opened by msbit 0
  • Avoid possible infinite loop due to chunk ordering

    Properly re-read the chunk ID when iterating through subsequent chunks. This avoids an infinite loop in the case where the data chunk doesn't immediately follow the fmt chunk.

    opened by msbit 0
  • Not working on a Raspberry Pi

    I am trying to get spchcat working on my raspberry pi. When running the command it is printing in the console this:

    TensorFlow: v2.3.0-14-g4bdd3955115
     Coqui STT: v1.1.0-0-gf3605e23
    

    and shortly after the console clears and displays eddie with no audio input? When I speak nothing appears in the console.

    opened by MiniMinnoww 2
Releases(v0.0.2-rpi-alpha)
Owner
Pete Warden