PyTorch implementation of Lip to Speech Synthesis with Visual Context Attentional GAN (NeurIPS2021)

Overview

Lip to Speech Synthesis with Visual Context Attentional GAN

This repository contains the PyTorch implementation of the following paper:

Lip to Speech Synthesis with Visual Context Attentional GAN
Minsu Kim, Joanna Hong, and Yong Man Ro
[Paper] [Demo Video]

Preparation

Requirements

  • python 3.7
  • pytorch 1.6 ~ 1.8
  • torchvision
  • torchaudio
  • ffmpeg
  • av
  • tensorboard
  • scikit-image
  • pillow
  • librosa
  • pystoi
  • pesq
  • scipy

Datasets

Download

GRID dataset (video normal) can be downloaded from the below link.

For data preprocessing, download the face landmark of GRID from the below link.

Preprocessing

After download the dataset, preprocess the dataset with the following scripts in ./preprocess.
It supposes the data directory is constructed as

Data_dir
├── subject
|   ├── video
|   |   └── xxx.mpg
  1. Extract frames
    Extract_frames.py extract images and audio from the video.
python Extract_frames.py --Grid_dir "Data dir of GRID_corpus" --Out_dir "Output dir of images and audio of GRID_corpus"
  1. Align faces and audio processing
    Preprocess.py aligns faces and generates videos, which enables cropping the video lip-centered during training.
python Preprocess.py \
--Data_dir "Data dir of extracted images and audio of GRID_corpus" \
--Landmark "Downloaded landmark dir of GRID" \
--Output_dir "Output dir of processed data"

Training the Model

The speaker setting (different subject) can be selected by subject argument. Please refer to below examples.
To train the model, run following command:

# Data Parallel training example using 4 GPUs for multi-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 88 \
--epochs 500 \
--subject 'overlap' \
--eval_step 720 \
--dataparallel \
--gpu 0,1,2,3
# 1 GPU training example for GRID for unseen-speaker setting in GRID
python train.py \
--grid 'enter_the_processed_data_path' \
--checkpoint_dir 'enter_the_path_to_save' \
--batch_size 22 \
--epochs 500 \
--subject 'unseen' \
--eval_step 1000 \
--gpu 0

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint_dir: directory for saving checkpoints
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --epochs: number of epochs
  • --augmentations: whether performing augmentation
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • --lr: learning rate
  • --eval_step: steps for performing evaluation
  • --window_size: number of frames to be used for training
  • Refer to train.py for the other training parameters

The evaluation during training is performed for a subset of the validation dataset due to the heavy time costs of waveform conversion (griffin-lim).
In order to evaluate the entire performance of the trained model run the test code (refer to "Testing the Model" section).

check the training logs

tensorboard --logdir='./runs/logs to watch' --host='ip address of the server'

The tensorboard shows the training and validation loss, evaluation metrics, generated mel-spectrogram, and audio

Testing the Model

To test the model, run following command:

# Dataparallel test example for multi-speaker setting in GRID
python test.py \
--grid 'enter_the_processed_data_path' \
--checkpoint 'enter_the_checkpoint_path' \
--batch_size 100 \
--subject 'overlap' \
--save_mel \
--save_wav \
--dataparallel \
--gpu 0,1

Descriptions of training parameters are as follows:

  • --grid: Dataset location (grid)
  • --checkpoint : saved checkpoint where the training is resumed from
  • --batch_size: batch size
  • --dataparallel: Use DataParallel
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --save_mel: whether to save the 'mel_spectrogram' and 'spectrogram' in .npz format
  • --save_wav: whether to save the 'waveform' in .wav format
  • --gpu: gpu number for training
  • Refer to test.py for the other parameters

Test Automatic Speech Recognition (ASR) results of generated results: WER

Transcription (Ground-truth) of GRID dataset can be downloaded from the below link.

move to the ASR_model directory

cd ASR_model/GRID

To evaluate the WER, run following command:

# test example for multi-speaker setting in GRID
python test.py \
--data 'enter_the_generated_data_dir (mel or wav) (ex. ./../../test/spec_mel)' \
--gtpath 'enter_the_downloaded_transcription_path' \
--subject 'overlap' \
--gpu 0

Descriptions of training parameters are as follows:

  • --data: Data for evaluation (wav or mel(.npz))
  • --wav : whether the data is waveform or not
  • --batch_size: batch size
  • --subject: different speaker settings, s# is speaker specific training, overlap for multi-speaker setting, unseen for unseen-speaker setting, four for four speaker training
  • --gpu: gpu number for training
  • Refer to ./ASR_model/GRID/test.py for the other parameters

Pre-trained ASR model checkpoint

Below lists are the pre-trained ASR model to evaluate the generated speech.
WER shows the original performances of the model on ground-truth audio.

Setting WER
GRID (constrained-speaker) 0.83 %
GRID (multi-speaker) 1.67 %
GRID (unseen-speaker) 0.37 %
LRW 1.54 %

Put the checkpoints in ./ASR_model/GRID/data for GRID, and in ./ASR_model/LRW/data for LRW.

Citation

If you find this work useful in your research, please cite the paper:

@article{kim2021vcagan,
  title={Lip to Speech Synthesis with Visual Context Attentional GAN},
  author={Kim, Minsu and Hong, Joanna and Ro, Yong Man},
  journal={Advances in Neural Information Processing Systems},
  volume={34},
  year={2021}
}
code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

Facebook Research 94 Oct 26, 2022
Code for paper Adaptively Aligned Image Captioning via Adaptive Attention Time

Adaptively Aligned Image Captioning via Adaptive Attention Time This repository includes the implementation for Adaptively Aligned Image Captioning vi

Lun Huang 45 Aug 27, 2022
Training deep models using anime, illustration images.

animeface deep models for anime images. Datasets anime-face-dataset Anime faces collected from Getchu.com. Based on Mckinsey666's dataset. 63.6K image

Tomoya Sawada 61 Dec 25, 2022
Pairwise model for commonlit competition

Pairwise model for commonlit competition To run: - install requirements - create input directory with train_folds.csv and other competition data - cd

abhishek thakur 45 Aug 31, 2022
MetaBalance: High-Performance Neural Networks for Class-Imbalanced Data

This repository is the official PyTorch implementation of Meta-Balance. Find the paper on arxiv MetaBalance: High-Performance Neural Networks for Clas

Arpit Bansal 20 Oct 18, 2021
FPGA: Fast Patch-Free Global Learning Framework for Fully End-to-End Hyperspectral Image Classification

FPGA & FreeNet Fast Patch-Free Global Learning Framework for Fully End-to-End Hyperspectral Image Classification by Zhuo Zheng, Yanfei Zhong, Ailong M

Zhuo Zheng 92 Jan 03, 2023
Estimating Example Difficulty using Variance of Gradients

Estimating Example Difficulty using Variance of Gradients This repository contains source code necessary to reproduce some of the main results in the

Chirag Agarwal 48 Dec 26, 2022
Semi Supervised Learning for Medical Image Segmentation, a collection of literature reviews and code implementations.

Semi-supervised-learning-for-medical-image-segmentation. Recently, semi-supervised image segmentation has become a hot topic in medical image computin

Healthcare Intelligence Laboratory 1.3k Jan 03, 2023
PyTorch implementation of 'Gen-LaneNet: a generalized and scalable approach for 3D lane detection'

(pytorch) Gen-LaneNet: a generalized and scalable approach for 3D lane detection Introduction This is a pytorch implementation of Gen-LaneNet, which p

Yuliang Guo 233 Jan 06, 2023
I tried to apply the CAM algorithm to YOLOv4 and it worked.

YOLOV4:You Only Look Once目标检测模型在pytorch当中的实现 2021年2月7日更新: 加入letterbox_image的选项,关闭letterbox_image后网络的map得到大幅度提升。 目录 性能情况 Performance 实现的内容 Achievement

55 Dec 05, 2022
This repository contains several image-to-image translation models, whcih were tested for RGB to NIR image generation. The models are Pix2Pix, Pix2PixHD, CycleGAN and PointWise.

RGB2NIR_Experimental This repository contains several image-to-image translation models, whcih were tested for RGB to NIR image generation. The models

5 Jan 04, 2023
PyTorch implementation of DUL (Data Uncertainty Learning in Face Recognition, CVPR2020)

PyTorch implementation of DUL (Data Uncertainty Learning in Face Recognition, CVPR2020)

Mouxiao Huang 20 Nov 15, 2022
Cross-view Transformers for real-time Map-view Semantic Segmentation (CVPR 2022 Oral)

Cross View Transformers This repository contains the source code and data for our paper: Cross-view Transformers for real-time Map-view Semantic Segme

Brady Zhou 363 Dec 25, 2022
PyTorch code of "SLAPS: Self-Supervision Improves Structure Learning for Graph Neural Networks"

SLAPS-GNN This repo contains the implementation of the model proposed in SLAPS: Self-Supervision Improves Structure Learning for Graph Neural Networks

60 Dec 22, 2022
Vector Neurons: A General Framework for SO(3)-Equivariant Networks

Vector Neurons: A General Framework for SO(3)-Equivariant Networks Created by Congyue Deng, Or Litany, Yueqi Duan, Adrien Poulenard, Andrea Tagliasacc

Congyue Deng 332 Dec 29, 2022
"Projelerle Yapay Zeka Ve Bilgisayarlı Görü" Kitabımın projeleri

"Projelerle Yapay Zeka Ve Bilgisayarlı Görü" Kitabımın projeleri Bu Github Reposundaki tüm projeler; kaleme almış olduğum "Projelerle Yapay Zekâ ve Bi

Ümit Aksoylu 4 Aug 03, 2022
Implementation of Perceiver, General Perception with Iterative Attention in TensorFlow

Perceiver This Python package implements Perceiver: General Perception with Iterative Attention by Andrew Jaegle in TensorFlow. This model builds on t

Rishit Dagli 84 Oct 15, 2022
Development Kit for the SoccerNet Challenge

SoccerNetv2-DevKit Welcome to the SoccerNet-V2 Development Kit for the SoccerNet Benchmark and Challenge. This kit is meant as a help to get started w

Silvio Giancola 117 Dec 30, 2022
GDSC-ML Team Interview Task

GDSC-ML-Team---Interview-Task Task 1 : Clean or Messy room In this task we have to classify the given test images as clean or messy. - Link for datase

Aayush. 1 Jan 19, 2022
A simple code to perform canny edge contrast detection on images.

CECED-Canny-Edge-Contrast-Enhanced-Detection A simple code to perform canny edge contrast detection on images. A simple code to process images using c

Happy N. Monday 3 Feb 15, 2022