PyTorch implementations of neural network models for keyword spotting

Honk: CNNs for Keyword Spotting

Honk is a PyTorch reimplementation of Google's TensorFlow convolutional neural networks for keyword spotting, which accompanies the recent release of their Speech Commands Dataset. For more details, please consult our writeup:

Honk is useful for building on-device speech recognition capabilities for interactive intelligent agents. Our code can be used to identify simple commands (e.g., "stop" and "go") and be adapted to detect custom "command triggers" (e.g., "Hey Siri!").

Check out this video for a demo of Honk in action!

Demo Application

Use the instructions below to run the demo application (shown in the above video) yourself!

Currently, PyTorch has official support for only Linux and OS X. Thus, Windows users will not be able to run this demo easily.

To deploy the demo, run the following commands:

  • If you do not have PyTorch, please see the website.
  • Install Python dependencies: pip install -r requirements.txt
  • Install GLUT (OpenGL Utility Toolkit) through your package manager (e.g. apt-get install freeglut3-dev)
  • Fetch the data and models: ./fetch_data.sh
  • Start the PyTorch server: python .
  • Run the demo: python utils/speech_demo.py

If you need to adjust options, like turning off CUDA, please edit config.json.

Additional notes for Mac OS X:

  • GLUT is already installed on Mac OS X, so that step isn't needed.
  • If you have issues installing pyaudio, this may be the issue.

Server

Setup and deployment

python . deploys the web service for identifying whether audio contains the command word. By default, config.json is used for configuration, but that can be changed with --config=<file_name>. If the server is behind a firewall, one workflow is to create an SSH tunnel and use port forwarding with the port specified in config (default 16888).
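When working through a tunnel, it can help to confirm that the forwarded port actually answers before starting the demo. The following is a minimal sketch, not part of Honk: the helper name is_port_open is hypothetical, and it assumes the tunnel forwards local port 16888 to the server (for example with ssh -L 16888:localhost:16888 user@host, where user@host is a placeholder).

```python
import socket


def is_port_open(host, port=16888, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example: with a tunnel in place, the server should appear on localhost.
# is_port_open("127.0.0.1", 16888)
```

This only checks that something is listening on the port; it does not verify that the listener is the Honk server.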

In our honk-models repository, there are several pre-trained models for Caffe2 (ONNX) and PyTorch. The fetch_data.sh script fetches these models and extracts them to the model directory. You may specify which model and backend to use in the config file's model_path and backend, respectively. Specifically, backend can be either caffe2 or pytorch, depending on what format model_path is in. Note that, in order to run our ONNX models, the packages onnx and onnx_caffe2 must be present on your system; they are not listed in requirements.txt.
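Switching backends amounts to editing two keys in the config file. As a rough sketch, assuming config.json is a flat JSON object with top-level backend and model_path keys (the helper set_model_backend is hypothetical and not part of Honk):

```python
import json


def set_model_backend(config_path, backend, model_path):
    """Update the backend and model_path keys of a JSON config file in place."""
    if backend not in ("caffe2", "pytorch"):
        raise ValueError("backend must be 'caffe2' or 'pytorch'")
    with open(config_path) as f:
        config = json.load(f)
    # Only these two keys are touched; every other setting is preserved.
    config["backend"] = backend
    config["model_path"] = model_path
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config


# Example (paths as used elsewhere in this README):
# set_model_backend("config.json", "caffe2", "model/google-speech-dataset-full.onnx")
```

Editing the file by hand works just as well; the helper only adds a check that the backend name is one of the two supported values.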

Raspberry Pi (RPi) Infrastructure Setup

Unfortunately, getting the libraries to work on the RPi, especially librosa, isn't as straightforward as running a few commands. We outline our process, which may or may not work for you.

  1. Obtain an RPi, preferably an RPi 3 Model B running Raspbian. Specifically, we used this version of Raspbian Stretch.
  2. Install dependencies: sudo apt-get install -y protobuf-compiler libprotoc-dev python-numpy python-pyaudio python-scipy python-sklearn
  3. Install Protobuf: pip install protobuf
  4. Install ONNX without dependencies: pip install --no-deps onnx
  5. Follow the official instructions for installing Caffe2 on Raspbian. This process takes about two hours. You may need to add the caffe2 module path to the PYTHONPATH environment variable. For us, this was accomplished by export PYTHONPATH=$PYTHONPATH:/home/pi/caffe2/build
  6. Install the ONNX extension for Caffe2: pip install onnx-caffe2
  7. Install further requirements: pip install -r requirements_rpi.txt
  8. Install librosa: pip install --no-deps resampy librosa
  9. Try importing librosa: python -c "import librosa". It should throw an error regarding numba, since we haven't installed it.
  10. We haven't found a way to easily install numba on the RPi, so we need to remove it from resampy. For our setup, we needed to remove numba and @numba.jit from /home/pi/.local/lib/python2.7/site-packages/resampy/interpn.py
  11. All dependencies should now be installed. We should try deploying an ONNX model.
  12. Fetch the models and data: ./fetch_data.sh
  13. In config.json, change backend to caffe2 and model_path to model/google-speech-dataset-full.onnx.
  14. Deploy the server: python . If there are no errors, you have successfully deployed the model, accessible via port 16888 by default.
  15. Run the speech commands demo: python utils/speech_demo.py. You'll need a working microphone and speakers. If you're interacting with your RPi remotely, you can run the speech demo locally and specify the remote endpoint --server-endpoint=http://[RPi IP address]:16888.
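Because the steps above install packages in several different ways, a quick import check can show which ones are still missing before you deploy. This is a sketch under assumptions: the helper missing_modules is hypothetical, and the module names in REQUIRED are our reading of the packages named in the steps (for instance, the protobuf package is imported as google.protobuf).

```python
import importlib

# Top-level module names inferred from the steps above; adjust to your setup.
REQUIRED = ["numpy", "scipy", "sklearn", "pyaudio", "google.protobuf",
            "onnx", "caffe2", "onnx_caffe2", "resampy", "librosa"]


def missing_modules(names):
    """Return the subset of names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The module, or one of its own dependencies, is not importable.
            missing.append(name)
    return missing


# Example: print(missing_modules(REQUIRED))
```

An empty list means every module imported; it does not guarantee that the installed versions are compatible with each other.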

Utilities

QA client

Unfortunately, the QA client has no support for the general public yet, since it requires a custom QA service. However, it can still be used to retarget the command keyword.

python client.py runs the QA client. You may retarget a keyword by running python client.py --mode=retarget. Please note that text-to-speech may not work well on Linux distros; in this case, please supply IBM Watson credentials via --watson-username and --watson--password. You can view all the options by running python client.py -h.

Training and evaluating the model

CNN models. python -m utils.train --type [train|eval] trains or evaluates the model. It expects all training examples to follow the same format as the Speech Commands Dataset. The recommended workflow is to download the dataset and add custom keywords, since the dataset already contains many useful audio samples and background noise.

Residual models. We recommend the following hyperparameters for training any of our res{8,15,26}[-narrow] models on the Speech Commands Dataset:

python -m utils.train --wanted_words yes no up down left right on off stop go --dev_every 1 --n_labels 12 --n_epochs 26 --weight_decay 0.00001 --lr 0.1 0.01 0.001 --schedule 3000 6000 --model res{8,15,26}[-narrow]
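The --lr and --schedule values in this command pair up as a piecewise-constant learning-rate schedule: three rates separated by two boundaries. The sketch below illustrates that pairing. It assumes the boundaries are training-step counts and that the rate changes exactly at each boundary; how utils.train counts and applies them is not confirmed here, and lr_at_step is a hypothetical helper.

```python
from bisect import bisect_right


def lr_at_step(step, lrs=(0.1, 0.01, 0.001), schedule=(3000, 6000)):
    """Return the learning rate in effect at a given training step."""
    if len(lrs) != len(schedule) + 1:
        raise ValueError("need exactly one more learning rate than boundaries")
    # bisect_right counts how many boundaries are <= step,
    # which is the index of the rate to use.
    return lrs[bisect_right(schedule, step)]
```

Under these assumptions, steps 0 to 2999 use 0.1, steps 3000 to 5999 use 0.01, and every later step uses 0.001.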

For more information about our deep residual models, please see our paper:

The following command-line options are available:

| option | input format | default | description |
|---|---|---|---|
| --audio_preprocess_type | {MFCCs, PCEN} | MFCCs | type of audio preprocess to use |
| --batch_size | [1, n) | 100 | the mini-batch size to use |
| --cache_size | [0, inf) | 32768 | number of items in audio cache, consumes around 32 KB * n |
| --conv1_pool | [1, inf) [1, inf) | 2 2 | the width and height of the pool filter |
| --conv1_size | [1, inf) [1, inf) | 10 4 | the width and height of the conv filter |
| --conv1_stride | [1, inf) [1, inf) | 1 1 | the width and length of the stride |
| --conv2_pool | [1, inf) [1, inf) | 1 1 | the width and height of the pool filter |
| --conv2_size | [1, inf) [1, inf) | 10 4 | the width and height of the conv filter |
| --conv2_stride | [1, inf) [1, inf) | 1 1 | the width and length of the stride |
| --data_folder | string | /data/speech_dataset | path to data |
| --dev_every | [1, inf) | 10 | dev interval in terms of epochs |
| --dev_pct | [0, 100] | 10 | percentage of total set to use for dev |
| --dropout_prob | [0.0, 1.0) | 0.5 | the dropout rate to use |
| --gpu_no | [-1, n] | 1 | the gpu to use |
| --group_speakers_by_id | {true, false} | true | whether to group speakers across train/dev/test |
| --input_file | string | | the path to the model to load |
| --input_length | [1, inf) | 16000 | the length of the audio |
| --lr | (0.0, inf) | {0.1, 0.001} | the learning rate to use |
| --type | {train, eval} | train | the mode to use |
| --model | string | cnn-trad-pool2 | one of cnn-trad-pool2, cnn-tstride-{2,4,8}, cnn-tpool{2,3}, cnn-one-fpool3, cnn-one-fstride{4,8}, res{8,15,26}[-narrow], cnn-trad-fpool3, cnn-one-stride1 |
| --momentum | [0.0, 1.0) | 0.9 | the momentum to use for SGD |
| --n_dct_filters | [1, inf) | 40 | the number of DCT bases to use |
| --n_epochs | [0, inf) | 500 | number of epochs |
| --n_feature_maps | [1, inf) | {19, 45} | the number of feature maps to use for the residual architecture |
| --n_feature_maps1 | [1, inf) | 64 | the number of feature maps for conv net 1 |
| --n_feature_maps2 | [1, inf) | 64 | the number of feature maps for conv net 2 |
| --n_labels | [1, n) | 4 | the number of labels to use |
| --n_layers | [1, inf) | {6, 13, 24} | the number of convolution layers for the residual architecture |
| --n_mels | [1, inf) | 40 | the number of Mel filters to use |
| --no_cuda | switch | false | whether to use CUDA |
| --noise_prob | [0.0, 1.0] | 0.8 | the probability of mixing with noise |
| --output_file | string | model/google-speech-dataset.pt | the file to save the model to |
| --seed | (-inf, inf) | 0 | the seed to use |
| --silence_prob | [0.0, 1.0] | 0.1 | the probability of picking silence |
| --test_pct | [0, 100] | 10 | percentage of total set to use for testing |
| --timeshift_ms | [0, inf) | 100 | time in milliseconds to shift the audio randomly |
| --train_pct | [0, 100] | 80 | percentage of total set to use for training |
| --unknown_prob | [0.0, 1.0] | 0.1 | the probability of picking an unknown word |
| --wanted_words | string1 string2 ... stringn | command random | the desired target words |
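One way to read --group_speakers_by_id together with --train_pct, --dev_pct, and --test_pct is that all clips from one speaker land in the same partition. The sketch below shows a common way to achieve this by hashing the speaker ID into a stable bucket. It is an illustration only, not Honk's actual implementation: assign_split is a hypothetical helper, and it assumes file names of the form <speaker_id>_nohash_<n>.wav, as used by the Speech Commands Dataset.

```python
import hashlib
import os


def assign_split(path, train_pct=80, dev_pct=10, test_pct=10):
    """Map a file to 'train', 'dev', or 'test' based on its speaker ID."""
    if train_pct + dev_pct + test_pct != 100:
        raise ValueError("percentages must sum to 100")
    # Assumed naming: everything before "_nohash_" identifies the speaker.
    speaker_id = os.path.basename(path).split("_nohash_")[0]
    # Hash the speaker ID into a stable bucket in [0, 100).
    digest = hashlib.sha1(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < train_pct:
        return "train"
    if bucket < train_pct + dev_pct:
        return "dev"
    return "test"
```

Because the bucket depends only on the speaker ID, the assignment is reproducible across runs and a speaker never appears in more than one partition. The resulting proportions are only approximately 80/10/10 and depend on how many clips each speaker contributed.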

JavaScript-based Keyword Spotting

Honkling is a JavaScript implementation of Honk. With Honkling, it is possible to implement various web applications with in-browser keyword spotting functionality.

Keyword Spotting Data Generator

In order to improve the flexibility of Honk and Honkling, we provide a program that constructs a dataset from YouTube videos. Details can be found in the keyword_spotting_data_generator folder.

Recording audio

You may do the following to record sequential audio and save it in the same format as the Speech Commands Dataset:

python -m utils.record

Press Return to record, the up arrow to undo, and "q" to finish. Recording halts automatically after one second of silence.

Several options are available:

--output-begin-index: Starting sequence number
--output-prefix: Prefix of the output audio sequence
--post-process: How the audio samples should be post-processed. One or more of "trim" and "discard_true".

Post-processing consists of trimming or discarding "useless" audio. Trimming is self-explanatory: the audio recordings are trimmed to the loudest window of x milliseconds, specified by --cutoff-ms. Discarding "useless" audio (discard_true) uses a pre-trained model to determine which samples are confusing, discarding correctly labeled ones. The pre-trained model and correct label are defined by --config and --correct-label, respectively.

For example, consider python -m utils.record --post-process trim discard_true --correct-label no --config config.json. In this case, the utility records a sequence of speech snippets, trims them to one second, and finally discards those not labeled "no" by the model in config.json.
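Trimming to the loudest window can be pictured as sliding a fixed-length window over the samples and keeping the position with the most energy. The following is a minimal pure-Python sketch of that idea, not the code used by utils.record: the helper names are hypothetical, loudness is taken to be the sum of squared amplitudes, and a 16 kHz sample rate is assumed.

```python
def ms_to_samples(cutoff_ms, sample_rate=16000):
    """Convert a window length in milliseconds to a number of samples."""
    return sample_rate * cutoff_ms // 1000


def loudest_window(samples, window_len):
    """Return (start, end) indices of the window_len-sample slice with the most energy."""
    n = len(samples)
    if window_len <= 0:
        raise ValueError("window_len must be positive")
    if n <= window_len:
        # The clip is already short enough; keep all of it.
        return 0, n
    # Energy of the first window.
    energy = sum(s * s for s in samples[:window_len])
    best_energy, best_start = energy, 0
    # Slide one sample at a time, updating the running sum in constant time.
    for start in range(1, n - window_len + 1):
        energy += samples[start + window_len - 1] ** 2 - samples[start - 1] ** 2
        if energy > best_energy:
            best_energy, best_start = energy, start
    return best_start, best_start + window_len
```

For a one-second trim at 16 kHz, loudest_window(samples, ms_to_samples(1000)) returns the slice bounds to keep. When several windows tie for loudest, this sketch keeps the earliest one.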

Listening to sound level

python manage_audio.py listen

This assists in setting sane values for --min-sound-lvl for recording.

Generating contrastive examples

python manage_audio.py generate-contrastive --directory [directory] generates contrastive examples from all .wav files in [directory] using phonetic segmentation.

Trimming audio

The Speech Commands Dataset contains one-second-long snippets of audio.

python manage_audio.py trim --directory [directory] trims all .wav files in [directory] to their loudest one-second window. The careful user should manually check all audio samples using an audio editor like Audacity.
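Before listening to each clip by hand, a short script can flag files whose length is not one second. This sketch uses only Python's standard wave module and assumes uncompressed PCM .wav files; the helper names and the 10 ms tolerance are our own choices, not part of manage_audio.py.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration is the frame count divided by frames per second.
        return wav.getnframes() / float(wav.getframerate())


def is_one_second(path, tolerance=0.01):
    """True if the clip lasts one second, within the given tolerance in seconds."""
    return abs(wav_duration_seconds(path) - 1.0) <= tolerance
```

A length check catches trimming mistakes but says nothing about what the clip contains, so listening to the samples is still worthwhile.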

Owner
Castorini
Deep learning for natural language processing and information retrieval at the University of Waterloo