The codebase for Data-driven general-purpose voice activity detection.

Overview

Data driven GPVAD

Repository for the work in TASLP 2021 Voice activity detection in the wild: A data-driven approach using teacher-student training.

Framework

Sample predictions against other methods

Samples_1

Samples_2

Samples_3

Samples_4

Noise robustness

Speech

Background

Speech

Results

Our best model trained on the SRE (V3) dataset obtains the following results:

Precision Recall F1 AUC FER Event-F1
aurora_clean 96.844 95.102 95.93 98.66 3.06 74.8
aurora_noisy 90.435 92.871 91.544 97.63 6.68 54.45
dcase18 89.202 88.362 88.717 95.2 10.82 57.85

Usage

We provide most of our pretrained models in this repository, including:

  1. Both teachers (T_1, T_2)
  2. Unbalanced audioset pretrained model
  3. Voxceleb 2 pretrained model
  4. Our best submission (SRE V3 trained)

To download and run evaluation just do:

git clone https://github.com/RicherMans/Datadriven-VAD
cd Datadriven-VAD
pip3 install -r requirements.txt
python3 forward.py -w example/example.wav

Running this will print:

|   index | event_label   |   onset |   offset | filename            |
|--------:|:--------------|--------:|---------:|:--------------------|
|       0 | Speech        |    0.28 |     0.94 | example/example.wav |
|       1 | Speech        |    1.04 |     2.22 | example/example.wav |

Predicting voice activity

We support single file and filelist-batching in our script. Obtaining VAD predictions is easy:

python3 forward.py -w example/example.wav

Or if one prefers to do that batch_wise, first prepare a filelist: find . -type f -name *.wav > wavlist.txt' And then just run:

python3 forward.py -l wavlist

Extra parameters

  • -model adjusts the pretrained model. Can be one of t1,t2,v2,a2,a2_v2,sre. Refer to the paper for each respective model. By default we use sre.
  • -soft instead of predicting human-readable timestamps, the model is now outputting the raw probabilities.
  • -hard instead of predicting human-readable timestamps, the model is now outputting the post-processed 0-1 flags indicating speech. Please note this is different from the paper, which thresholded the soft probabilities without post-processing.
  • -th adjusts the threshold. If a single threshold is passed (e.g., -th 0.5), we utilize simple binearization. Otherwise use the default double threshold with -th 0.5 0.1.
  • -o outputs the results into a new folder.

Training from scratch

If you intend to rerun our work, prepare some data and extract log-Mel spectrogram features. Say, you have downloaded the balanced subset of AudioSet and stored all files in a folder data/balanced/. Then:

cd data;
mkdir hdf5 csv_labels;
find balanced -type f > wavs.txt;
python3 extract_features.py wavs.txt -o hdf5/balanced.h5
h5ls -r hdf5/balanced.h5 | awk -F[/' '] 'BEGIN{print "filename","hdf5path"}NR>1{print $2,"hdf5/balanced.h5"}'> csv_labels/balanced.csv

The input for our label prediction script is a csv file with exactly two columns, filename and hdf5path.

An example csv_labels/balanced.csv would be:

filename hdf5path
--PJHxphWEs_30.000.wav hdf5/balanced.h5                                                                                          
--ZhevVpy1s_50.000.wav hdf5/balanced.h5                                                                                          
--aE2O5G5WE_0.000.wav hdf5/balanced.h5                                                                                           
--aO5cdqSAg_30.000.wav hdf5/balanced.h5                                                                                          

After feature extraction, proceed to predict labels:

mkdir -p softlabels/{hdf5,csv};
python3 prepare_labels.py --pre ../pretrained_models/teacher1/model.pth csv_labels/balanced.csv softlabels/hdf5/balanced.h5 softlabels/csv/balanced.csv

Lastly, just train:

cd ../; #Go to project root
# Change config accoringly with input data
python3 run.py train configs/example.yaml

Citation

If youre using this work, please cite it in your publications.

@article{Dinkel2021,
author = {Dinkel, Heinrich and Wang, Shuai and Xu, Xuenan and Wu, Mengyue and Yu, Kai},
doi = {10.1109/TASLP.2021.3073596},
issn = {2329-9290},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
pages = {1542--1555},
title = {{Voice Activity Detection in the Wild: A Data-Driven Approach Using Teacher-Student Training}},
url = {https://ieeexplore.ieee.org/document/9405474/},
volume = {29},
year = {2021}
}

and

@inproceedings{Dinkel2020,
  author={Heinrich Dinkel and Yefei Chen and Mengyue Wu and Kai Yu},
  title={{Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={3665--3669},
  doi={10.21437/Interspeech.2020-0995},
  url={http://dx.doi.org/10.21437/Interspeech.2020-0995}
}
Owner
Heinrich Dinkel
日新月异
Heinrich Dinkel
An unofficial personal implementation of UM-Adapt, specifically to tackle joint estimation of panoptic segmentation and depth prediction for autonomous driving datasets.

Semisupervised Multitask Learning This repository is an unofficial and slightly modified implementation of UM-Adapt[1] using PyTorch. This code primar

Abhinav Atrishi 11 Nov 25, 2022
This repo is the official implementation for Multi-Scale Adaptive Graph Neural Network for Multivariate Time Series Forecasting

1 MAGNN This repo is the official implementation for Multi-Scale Adaptive Graph Neural Network for Multivariate Time Series Forecasting. 1.1 The frame

SZJ 12 Nov 08, 2022
Official PyTorch implementation of "AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks"

AASIST This repository provides the overall framework for training and evaluating audio anti-spoofing systems proposed in 'AASIST: Audio Anti-Spoofing

Clova AI Research 56 Jan 02, 2023
labelpix is a graphical image labeling interface for drawing bounding boxes

Welcome to labelpix 👋 labelpix is a graphical image labeling interface for drawing bounding boxes. 🏠 Homepage Install pip install -r requirements.tx

schissmantics 26 May 24, 2022
MlTr: Multi-label Classification with Transformer

MlTr: Multi-label Classification with Transformer This is official implement of "MlTr: Multi-label Classification with Transformer". Abstract The task

程星 38 Nov 08, 2022
PyTorch implementation of Value Iteration Networks (VIN): Clean, Simple and Modular. Visualization in Visdom.

VIN: Value Iteration Networks This is an implementation of Value Iteration Networks (VIN) in PyTorch to reproduce the results.(TensorFlow version) Key

Xingdong Zuo 215 Dec 07, 2022
💡 Type hints for Numpy

Type hints with dynamic checks for Numpy! (❒) Installation pip install nptyping (❒) Usage (❒) NDArray nptyping.NDArray lets you define the shape and

Ramon Hagenaars 377 Dec 28, 2022
Manage the availability of workspaces within Frappe/ ERPNext (sidebar) based on user-roles

Workspace Permissions Manage the availability of workspaces within Frappe/ ERPNext (sidebar) based on user-roles. Features Configure foreach workspace

Patrick.St. 18 Sep 26, 2022
TensorFlow Metal Backend on Apple Silicon Experiments (just for fun)

tf-metal-experiments TensorFlow Metal Backend on Apple Silicon Experiments (just for fun) Setup This is tested on M1 series Apple Silicon SOC only. Te

Timothy Liu 161 Jan 03, 2023
(AAAI2020)Grapy-ML: Graph Pyramid Mutual Learning for Cross-dataset Human Parsing

Grapy-ML: Graph Pyramid Mutual Learning for Cross-dataset Human Parsing This repository contains pytorch source code for AAAI2020 oral paper: Grapy-ML

54 Aug 04, 2022
StocksMA is a package to facilitate access to financial and economic data of Moroccan stocks.

Creating easier access to the Moroccan stock market data What is StocksMA ? StocksMA is a package to facilitate access to financial and economic data

Salah Eddine LABIAD 28 Jan 04, 2023
Image Completion with Deep Learning in TensorFlow

Image Completion with Deep Learning in TensorFlow See my blog post for more details and usage instructions. This repository implements Raymond Yeh and

Brandon Amos 1.3k Dec 23, 2022
Official code repository for A Simple Long-Tailed Rocognition Baseline via Vision-Language Model.

This is the official code repository for A Simple Long-Tailed Rocognition Baseline via Vision-Language Model.

peng gao 42 Nov 26, 2022
Spatial Single-Cell Analysis Toolkit

Single-Cell Image Analysis Package Scimap is a scalable toolkit for analyzing spatial molecular data. The underlying framework is generalizable to spa

Laboratory of Systems Pharmacology @ Harvard 30 Nov 08, 2022
2021-AIAC-QQ-Browser-Hyperparameter-Optimization-Rank6

2021-AIAC-QQ-Browser-Hyperparameter-Optimization-Rank6

Aigege 8 Mar 31, 2022
Latent Network Models to Account for Noisy, Multiply-Reported Social Network Data

VIMuRe Latent Network Models to Account for Noisy, Multiply-Reported Social Network Data. If you use this code please cite this article (preprint). De

6 Dec 15, 2022
World Models with TensorFlow 2

World Models This repo reproduces the original implementation of World Models. This implementation uses TensorFlow 2.2. Docker The easiest way to hand

Zac Wellmer 234 Nov 30, 2022
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech

STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Keon Lee, Ky

Keon Lee 114 Dec 12, 2022
CLIPImageClassifier wraps clip image model from transformers

CLIPImageClassifier CLIPImageClassifier wraps clip image model from transformers. CLIPImageClassifier is initialized with the argument classes, these

Jina AI 6 Sep 12, 2022