PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Overview

Lip to Speech Synthesis with Visual Context Attentional GAN

This repository contains the PyTorch implementation of the following paper:

Lip to Speech Synthesis with Visual Context Attentional GAN
Minsu Kim, Joanna Hong, and Yong Man Ro
[Paper] [Demo Video]

Preparation

Requirements

  • python 3.7
  • pytorch 1.6 ~ 1.8
  • torchvision
  • torchaudio
  • ffmpeg
  • av
  • tensorboard
  • scikit-image
  • pillow
  • librosa
  • pystoi
  • pesq
  • scipy

Datasets

Download

GRID dataset (video normal) can be downloaded from the below link.

For data preprocessing, download the face landmark of GRID from the below link.

Preprocessing

After download the dataset, preprocess the dataset with the following scripts in ./preprocess.
It supposes the data directory is constructed as

Data_dir
├── subject
|   ├── video
|   |   └── xxx.mpg
  1. Extract frames
    Extract_frames.py extract images and audio from the video.
python Extract_frames.py --Grid_dir "Data dir of GRID_corpus" --Out_dir "Output dir of images and audio of GRID_corpus"
  1. Align faces and audio processing
    Preprocess.py aligns faces and generates videos, which enables cropping the video lip-centered during training.
python Preprocess.py \
--Data_dir "Data dir of extracted images and audio of GRID_corpus" \
--Landmark "Downloaded landmark dir of GRID" \
--Output_dir "Output dir of processed data"

Training the Model

The speaker setting (different subject) can be selected by subject argument. Please refer to below examples.
To train the model, run following command:

# Data Parallel training example using 4 GPUs for multi-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 88 \
--epochs 500 \
--subject 'overlap' \
--eval_step 720 \
--dataparallel \
--gpu 0,1,2,3
# 1 GPU training example for GRID for unseen-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 22 \
--epochs 500 \
--subject 'unseen' \
--eval_step 1000 \
--gpu 0

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint_dir: directory for saving checkpoints
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --epochs: number of epochs
  • --augmentations: whether performing augmentation
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • --lr: learning rate
  • --eval_step: steps for performing evaluation
  • --window_size: number of frames to be used for training
  • Refer to train.py for the other training parameters

The evaluation during training is performed for a subset of the validation dataset due to the heavy time costs of waveform conversion (griffin-lim).
In order to evaluate the entire performance of the trained model run the test code (refer to "Testing the Model" section).

check the training logs

tensorboard --logdir='./runs/logs to watch' --host='ip address of the server'

The tensorboard shows the training and validation loss, evaluation metrics, generated mel-spectrogram, and audio

Testing the Model

To test the model, run following command:

# Dataparallel test example for multi-speaker setting in GRID
python test.py \
--grid 'enter_the_processed_data_path' \
--checkpoint 'enter_the_checkpoint_path' \
--batch_size 100 \
--subject 'overlap' \
--save_mel \
--save_wav \
--dataparallel \
--gpu 0,1

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --save_mel: whether to save the 'mel_spectrogram' and 'spectrogram' in .npz format
  • --save_wav: whether to save the 'waveform' in .wav format
  • --gpu: gpu number for training
  • Refer to test.py for the other parameters

Test Automatic Speech Recognition (ASR) results of generated results: WER

Transcription (Ground-truth) of GRID dataset can be downloaded from the below link.

move to the ASR_model directory

cd ASR_model/GRID

To evaluate the WER, run following command:

# test example for multi-speaker setting in GRID
python test.py \
--data 'enter_the_generated_data_dir (mel or wav) (ex. ./../../test/spec_mel)' \
--gtpath 'enter_the_downloaded_transcription_path' \
--subject 'overlap' \
--gpu 0

Descriptions of training parameters are as follows:

  • --data: Data for evaluation (wav or mel(.npz))
  • --wav : whether the data is waveform or not
  • --batch_size: batch size
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • Refer to ./ASR_model/GRID/test.py for the other parameters

Pre-trained ASR model checkpoint

Below lists are the pre-trained ASR model to evaluate the generated speech.
WER shows the original performances of the model on ground-truth audio.

Setting WER
GRID (constrained-speaker) 0.83 %
GRID (multi-speaker) 1.67 %
GRID (unseen-speaker) 0.37 %
LRW 1.54 %

Put the checkpoints in ./ASR_model/GRID/data for GRID, and in ./ASR_model/LRW/data for LRW.

Citation

If you find this work useful in your research, please cite the paper:

@article{kim2021vcagan,
  title={Lip to Speech Synthesis with Visual Context Attentional GAN},
  author={Kim, Minsu and Hong, Joanna and Ro, Yong Man},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}
TLDR; Train custom adaptive filter optimizers without hand tuning or extra labels.

AutoDSP TLDR; Train custom adaptive filter optimizers without hand tuning or extra labels. About Adaptive filtering algorithms are commonplace in sign

Jonah Casebeer 48 Sep 19, 2022
Fast, Attemptable Route Planner for Navigation in Known and Unknown Environments

FAR Planner uses a dynamically updated visibility graph for fast replanning. The planner models the environment with polygons and builds a global visi

Fan Yang 346 Dec 30, 2022
[CVPR 2020] Interpreting the Latent Space of GANs for Semantic Face Editing

InterFaceGAN - Interpreting the Latent Space of GANs for Semantic Face Editing Figure: High-quality facial attributes editing results with InterFaceGA

GenForce: May Generative Force Be with You 1.3k Dec 29, 2022
Pytorch Implementation of Continual Learning With Filter Atom Swapping (ICLR'22 Spolight) Paper

Continual Learning With Filter Atom Swapping Pytorch Implementation of Continual Learning With Filter Atom Swapping (ICLR'22 Spolight) Paper If find t

11 Aug 29, 2022
PyTorch code for Composing Partial Differential Equations with Physics-Aware Neural Networks

FInite volume Neural Network (FINN) This repository contains the PyTorch code for models, training, and testing, and Python code for data generation t

Cognitive Modeling 20 Dec 18, 2022
optimization routines for hyperparameter tuning

Hyperopt: Distributed Hyperparameter Optimization Hyperopt is a Python library for serial and parallel optimization over awkward search spaces, which

Marc Claesen 398 Nov 09, 2022
CTRL-C: Camera calibration TRansformer with Line-Classification

CTRL-C: Camera calibration TRansformer with Line-Classification This repository contains the official code and pretrained models for CTRL-C (Camera ca

57 Nov 14, 2022
Implementation of the final project of the course DDA6309 Probabilistic Graphical Model

Task-aware Joint CWS and POS (TCwsPos) This is the implementation of the final project of the course DDA6309 Probabilistic Graphical Models, The Chine

Peng 1 Dec 26, 2021
Image process framework based on plugin like imagej, it is esay to glue with scipy.ndimage, scikit-image, opencv, simpleitk, mayavi...and any libraries based on numpy

Introduction ImagePy is an open source image processing framework written in Python. Its UI interface, image data structure and table data structure a

ImagePy 1.2k Dec 29, 2022
TextureGAN in Pytorch

TextureGAN This code is our PyTorch implementation of TextureGAN [Project] [Arxiv] TextureGAN is a generative adversarial network conditioned on sketc

Patsorn 147 Dec 14, 2022
Breast Cancer Detection 🔬 ITI "AI_Pro" Graduation Project

BreastCancerDetection - This program is designed to predict two severity of abnormalities associated with breast cancer cells: benign and malignant. Mammograms from MIAS is preprocessed and features

6 Nov 29, 2022
Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth [Paper]

Global-Local Path Networks for Monocular Depth Estimation with Vertical CutDepth [Paper] Downloads [Downloads] Trained ckpt files for NYU Depth V2 and

98 Jan 01, 2023
CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum

CO-PILOT CO-PILOT: COllaborative Planning and reInforcement Learning On sub-Task curriculum, NeurIPS 2021, Shuang Ao, Tianyi Zhou, Guodong Long, Qingh

Shuang Ao 1 Feb 18, 2022
Official repository for the paper "GN-Transformer: Fusing AST and Source Code information in Graph Networks".

GN-Transformer AST This is the official repository for the paper "GN-Transformer: Fusing AST and Source Code information in Graph Networks". Data Prep

Cheng Jun-Yan 10 Nov 26, 2022
offical implement of our Lifelong Person Re-Identification via Adaptive Knowledge Accumulation in CVPR2021

LifelongReID Offical implementation of our Lifelong Person Re-Identification via Adaptive Knowledge Accumulation in CVPR2021 by Nan Pu, Wei Chen, Yu L

PeterPu 76 Dec 08, 2022
Codes for “A Deeply Supervised Attention Metric-Based Network and an Open Aerial Image Dataset for Remote Sensing Change Detection”

DSAMNet The pytorch implementation for "A Deeply-supervised Attention Metric-based Network and an Open Aerial Image Dataset for Remote Sensing Change

Mengxi Liu 41 Dec 14, 2022
68 keypoint annotations for COFW test data

68 keypoint annotations for COFW test data This repository contains manually annotated 68 keypoints for COFW test data (original annotation of CFOW da

31 Dec 06, 2022
Twin-deep neural network for semi-supervised learning of materials properties

Deep Semi-Supervised Teacher-Student Material Synthesizability Prediction Citation: Semi-supervised teacher-student deep neural network for materials

MLEG 3 Dec 14, 2022
Tensorflow2.0 🍎🍊 is delicious, just eat it! 😋😋

How to eat TensorFlow2 in 30 days ? 🔥 🔥 Click here for Chinese Version(中文版) 《10天吃掉那只pyspark》 🚀 github项目地址: https://github.com/lyhue1991/eat_pyspark

lyhue1991 9.7k Jan 01, 2023
Shuffle Attention for MobileNetV3

SA-MobileNetV3 Shuffle Attention for MobileNetV3 Train Run the following command for train model on your own dataset: python train.py --dataset mnist

Sajjad Aemmi 36 Dec 28, 2022