Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis

Overview

TDY-CNN for Text-Independent Speaker Verification

Official implementation of

  • Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis
    by Seong-Hu Kim, Hyeonuk Nam, Yong-Hwa Park @ Human Lab, Mechanical Engineering Department, KAIST
    arXiv

Accepted paper in ICASSP 2022.

This code was written mainly with reference to VoxCeleb_trainer of paper 'In defence of metric learning for speaker recognition'.

Temporal Dynamic Convolutional Neural Network (TDY-CNN)

TDY-CNN efficiently applies adaptive convolution depending on time bins by changing the computation order as follows:

where x and y are input and output of TDY-CNN module which depends on frequency feature f and time feature t in time-frequency domain data. k-th basis kernel is convoluted with input and k-th bias is added. The results are aggregated using the attention weights which depends on time bins. K is the number of basis kernels, and σ is an activation function ReLU. The attention weight has a value between 0 and 1, and the sum of all basis kernels on a single time bin is 1 as the weights are processed by softmax.

Requirements and versions used

Python version of 3.7.10 is used with following libraries

  • pytorch == 1.8.1
  • pytorchaudio == 0.8.1
  • numpy == 1.19.2
  • scipy == 1.5.3
  • scikit-learn == 0.23.2

Dataset

We used VoxCeleb1 & 2 dataset in this paper. You can download the dataset by reffering to VoxCeleb1 and VoxCeleb1.

Training

You can train and save model in exps folder by running:

python trainSpeakerNet.py --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/TDY_CNN_ResNet34 --nPerSpeaker 2 --batch_size 400

This implementation also provides accelerating training with distributed training and mixed precision training.

  • Use --distributed flag to enable distributed training and --mixedprec flag to enable mixed precision training.
    • GPU indices should be set before training : os.environ['CUDA_VISIBLE_DEVICES'] ='0,1,2,3' in trainSpeakernet.py.

Results:

Network #Parm EER (%) C_det (%)
TDY-VGG-M 71.2M 3.04 0.237
TDY-ResNet-34(×0.25) 13.3M 1.58 0.116
TDY-ResNet-34(×0.5) 51.9M 1.48 0.118

  • This result is low-dimensional t-SNE projection of frame-level speaker embed-dings of MHRM0 and FDAS1 using (a) baseline model ResNet-34(×0.25) and (b) TDY-ResNet-34(×0.25). Left column represents embeddings for different speakers, and right column represents em-beddings for different phoneme classes.

  • Embeddings by TDY-ResNet-34(×0.25) are closely gathered regardless of phoneme groups. It shows that the temporal dynamic model extracts consistent speaker information regardless of phonemes.

Pretrained models

There are pretrained models in folder pretrained_model.

For example, you can check 1.4786 of EER by running following script using TDY-ResNet-34(×0.5).

python trainSpeakerNet.py --eval --model TDy_ResNet34_half --log_input True --encoder_type AVG --trainfunc softmaxproto --save_path exps/test --eval_frames 400 --initial_model pretrained_model/pretrained_TDy_ResNet34_half.model

Citation

@article{kim2021tdycnn,
  title={Temporal Dynamic Convolutional Neural Network for Text-Independent Speaker Verification and Phonemetic Analysis},
  author={Kim, Seong-Hu and Nam, Hyeonuk and Park, Yong-Hwa},
  journal={arXiv preprint arXiv:2110.03213},
  year={2021}
}

Please contact Seong-Hu Kim at [email protected] for any query.

Owner
Seong-Hu Kim
Seong-Hu Kim
Hierarchical Uniform Manifold Approximation and Projection

HUMAP Hierarchical Manifold Approximation and Projection (HUMAP) is a technique based on UMAP for hierarchical non-linear dimensionality reduction. HU

Wilson Estécio Marcílio Júnior 160 Jan 06, 2023
Code for our paper A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization,

FSRA This repository contains the dataset link and the code for our paper A Transformer-Based Feature Segmentation and Region Alignment Method For UAV

Dmmm 32 Dec 18, 2022
scikit-learn: machine learning in Python

scikit-learn is a Python module for machine learning built on top of SciPy and is distributed under the 3-Clause BSD license. The project was started

scikit-learn 52.5k Jan 08, 2023
Semantic segmentation models, datasets and losses implemented in PyTorch.

Semantic Segmentation in PyTorch Semantic Segmentation in PyTorch Requirements Main Features Models Datasets Losses Learning rate schedulers Data augm

Yassine 1.3k Jan 07, 2023
Demonstration of the Model Training as a CI/CD System in Vertex AI

Model Training as a CI/CD System This project demonstrates the machine model training as a CI/CD system in GCP platform. You will see more detailed wo

Chansung Park 19 Dec 28, 2022
Simulation of Self Driving Car

In this repository, the code to use Udacity's self driving car simulator as a testbed for training an autonomous car are provided.

Shyam Das Shrestha 1 Nov 21, 2021
Stratified Transformer for 3D Point Cloud Segmentation (CVPR 2022)

Stratified Transformer for 3D Point Cloud Segmentation Xin Lai*, Jianhui Liu*, Li Jiang, Liwei Wang, Hengshuang Zhao, Shu Liu, Xiaojuan Qi, Jiaya Jia

DV Lab 195 Jan 01, 2023
HyperPose is a library for building high-performance custom pose estimation applications.

HyperPose is a library for building high-performance custom pose estimation applications.

TensorLayer Community 1.2k Jan 04, 2023
Why Are You Weird? Infusing Interpretability in Isolation Forest for Anomaly Detection

Why, hello there! This is the supporting notebook for the research paper — Why Are You Weird? Infusing Interpretability in Isolation Forest for Anomal

2 Dec 14, 2021
Package for extracting emotions from social media text. Tailored for financial data.

EmTract: Extracting Emotions from Social Media Text Tailored for Financial Contexts EmTract is a tool that extracts emotions from social media text. I

13 Nov 17, 2022
FastyAPI is a Stack boilerplate optimised for heavy loads.

FastyAPI A FastAPI based Stack boilerplate for heavy loads. Explore the docs » View Demo · Report Bug · Request Feature Table of Contents About The Pr

Ali Chaayb 47 Dec 27, 2022
An onlinel learning to rank python codebase.

OLTR Online learning to rank python codebase. The code related to Pairwise Differentiable Gradient Descent (ranker/PDGDLinearRanker.py) is copied from

ielab 5 Jul 18, 2022
Leibniz is a python package which provide facilities to express learnable partial differential equations with PyTorch

Leibniz is a python package which provide facilities to express learnable partial differential equations with PyTorch

Beijing ColorfulClouds Technology Co.,Ltd. 16 Aug 07, 2022
Tensorflow implementation of DeepLabv2

TF-deeplab This is a Tensorflow implementation of DeepLab, compatible with Tensorflow 1.2.1. Currently it supports both training and testing the ResNe

Chenxi Liu 21 Sep 27, 2022
LQM - Improving Object Detection by Estimating Bounding Box Quality Accurately

Improving Object Detection by Estimating Bounding Box Quality Accurately Abstract Object detection aims to locate and classify object instances in ima

IM Lab., POSTECH 0 Sep 28, 2022
Deep Learning for Computer Vision final project

Deep Learning for Computer Vision final project

grassking100 1 Nov 30, 2021
Implementations of the algorithms in the paper Approximative Algorithms for Multi-Marginal Optimal Transport and Free-Support Wasserstein Barycenters

Implementations of the algorithms in the paper Approximative Algorithms for Multi-Marginal Optimal Transport and Free-Support Wasserstein Barycenters

Johannes von Lindheim 3 Oct 29, 2022
Fewshot-face-translation-GAN - Generative adversarial networks integrating modules from FUNIT and SPADE for face-swapping.

Few-shot face translation A GAN based approach for one model to swap them all. The table below shows our priliminary face-swapping results requiring o

768 Dec 24, 2022
Keras Implementation of The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation by (Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio)

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation: Work In Progress, Results can't be replicated yet with the m

Yad Konrad 196 Aug 30, 2022
Out-of-boundary View Synthesis towards Full-frame Video Stabilization

Out-of-boundary View Synthesis towards Full-frame Video Stabilization Introduction | Update | Results Demo | Introduction This repository contains the

25 Oct 10, 2022