Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemic Analysis

Overview

TDY-CNN for Text-Independent Speaker Verification

Official implementation of

  • Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemic Analysis
    by Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST
    arXiv

This paper was accepted at ICASSP 2022.

This code was written mainly with reference to voxceleb_trainer, the official implementation of the paper 'In defence of metric learning for speaker recognition'.

Temporal Dynamic Convolutional Neural Network (TDY-CNN)

TDY-CNN efficiently applies adaptive convolution that depends on the time bin by changing the computation order as follows:

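The aggregation can be written as (notation reconstructed from the description below; see the paper for the exact formulation):

y(f, t) = \sigma\left( \sum_{k=1}^{K} \pi_k(t) \left[ (\tilde{W}_k * x)(f, t) + \tilde{b}_k \right] \right), \qquad 0 \le \pi_k(t) \le 1, \quad \sum_{k=1}^{K} \pi_k(t) = 1
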
where x and y are the input and output of the TDY-CNN module, which depend on the frequency feature f and time feature t of the time-frequency domain input. The k-th basis kernel is convolved with the input and the k-th bias is added, and the results are aggregated using attention weights that depend on the time bin. K is the number of basis kernels, and σ is the ReLU activation function. Each attention weight has a value between 0 and 1, and the weights over the basis kernels on a single time bin sum to 1 because they are produced by a softmax.
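
A minimal PyTorch sketch of this computation order is given below: the input is convolved with each of the K basis kernels once, and the K outputs are then aggregated per time bin with softmax attention weights. The module name TDYConv2d, the attention network, and all hyperparameters are illustrative assumptions rather than the repository's actual implementation; refer to the model referenced by --model TDy_ResNet34_half for the real layer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TDYConv2d(nn.Module):
    """Illustrative temporal dynamic convolution over a (batch, channel, freq, time) input."""
    def __init__(self, in_ch, out_ch, kernel_size=3, n_basis=6):
        super().__init__()
        self.n_basis = n_basis
        self.kernel_size = kernel_size
        # K basis kernels and biases
        self.weight = nn.Parameter(
            0.01 * torch.randn(n_basis, out_ch, in_ch, kernel_size, kernel_size))
        self.bias = nn.Parameter(torch.zeros(n_basis, out_ch))
        # Attention network producing pi_k(t): pool over frequency,
        # then two pointwise convolutions along the time axis (an assumed design).
        hidden = max(in_ch // 2, 1)
        self.attention = nn.Sequential(
            nn.Conv1d(in_ch, hidden, 1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, n_basis, 1),
        )

    def forward(self, x):
        # x: (B, C_in, F, T)
        B, C, F_dim, T = x.shape
        # pi_k(t): softmax over the K basis kernels, separately for every time bin
        pooled = x.mean(dim=2)                                  # (B, C_in, T)
        pi = F.softmax(self.attention(pooled), dim=1)           # (B, K, T)
        # Convolve the input with every basis kernel first ...
        w = self.weight.view(-1, C, self.kernel_size, self.kernel_size)
        y = F.conv2d(x, w, padding=self.kernel_size // 2)       # (B, K*C_out, F, T)
        y = y.view(B, self.n_basis, -1, F_dim, T)
        y = y + self.bias.view(1, self.n_basis, -1, 1, 1)
        # ... then aggregate the K results with time-dependent attention weights
        y = (pi.view(B, self.n_basis, 1, 1, T) * y).sum(dim=1)  # (B, C_out, F, T)
        return F.relu(y)

For example, TDYConv2d(1, 32)(torch.randn(8, 1, 64, 200)) returns a tensor of shape (8, 32, 64, 200), i.e. the time-frequency resolution is preserved.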

Requirements and versions used

Python 3.7.10 is used with the following libraries (an example pip command is given after the list):

  • pytorch == 1.8.1
  • torchaudio == 0.8.1
  • numpy == 1.19.2
  • scipy == 1.5.3
  • scikit-learn == 0.23.2
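
If you install with pip rather than conda, the equivalent packages can be installed as follows (a sketch; the PyPI package names for PyTorch differ slightly from the names listed above):

pip install torch==1.8.1 torchaudio==0.8.1 numpy==1.19.2 scipy==1.5.3 scikit-learn==0.23.2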

Dataset

We used the VoxCeleb1 & 2 datasets in this paper. You can download the datasets by referring to VoxCeleb1 and VoxCeleb2.

Training

You can train a model and save it to the exps folder by running:

python trainSpeakerNet.py --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/TDY_CNN_ResNet34 --nPerSpeaker 2 --batch_size 400

This implementation also supports accelerated training via distributed training and mixed-precision training.

  • Use the --distributed flag to enable distributed training and the --mixedprec flag to enable mixed-precision training; an example command combining both is shown below.
    • GPU indices should be set before training: os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3' in trainSpeakerNet.py.
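
As a usage sketch, the two flags can simply be appended to the training command above (assuming GPU indices are set as described):

python trainSpeakerNet.py --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/TDY_CNN_ResNet34 --nPerSpeaker 2 --batch_size 400 --distributed --mixedprec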

Results

Network | # Params | EER (%) | C_det
TDY-VGG-M | 71.2M | 3.04 | 0.237
TDY-ResNet-34(×0.25) | 13.3M | 1.58 | 0.116
TDY-ResNet-34(×0.5) | 51.9M | 1.48 | 0.118

  • This figure is a low-dimensional t-SNE projection of frame-level speaker embeddings of speakers MHRM0 and FDAS1 using (a) the baseline model ResNet-34(×0.25) and (b) TDY-ResNet-34(×0.25). The left column shows embeddings for different speakers, and the right column shows embeddings for different phoneme classes.

  • Embeddings from TDY-ResNet-34(×0.25) are closely clustered regardless of phoneme group, which shows that the temporal dynamic model extracts consistent speaker information regardless of phonemes.

Pretrained models

Pretrained models are provided in the pretrained_model folder.

For example, you can verify an EER of 1.4786% with TDY-ResNet-34(×0.5) by running the following script:

python trainSpeakerNet.py --eval --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/test --eval_frames 400 --initial_model pretrained_model/pretrained_TDy_ResNet34_half.model

Citation

@article{kim2021tdycnn,
  title={Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemic Analysis},
  author={Kim, Seong-Hu and Nam, Hyeonuk and Park, Yong-Hwa},
  journal={arXiv preprint arXiv:2110.03213},
  year={2021}
}

Please contact Seong-Hu Kim at [email protected] for any query.
