Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss (ATVGnet)


By Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu.

University of Rochester.

Table of Contents

  1. Introduction
  2. Citation
  3. Running
  4. Model
  5. Results
  6. Disclaimer and known issues

Introduction

This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. This code can be applied directly to the LRW and GRID datasets. The outputs from the model are visualized here: the first is the synthesized landmark from the AT-net; the rest are the attention map, motion map, and final results from the VG-net.

(figures: AT-net landmark output and VG-net attention, motion, and final results)

Citation

If you use any of the code, models, or ideas from this repo in your research, please cite:

@inproceedings{chen2019hierarchical,
  title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
  author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7832--7841},
  year={2019}
}

Running

  1. This code is tested under Python 2.7. The model we provide is trained on LRW; however, it also works well on GRID, VoxCeleb, and other datasets, so you can directly compare your own model against it on those datasets. We treat this as a fair comparison.

  2. PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch).

  3. Install the required packages: pip install -r requirement.txt

  4. Download the pretrained ATnet and VGnet weights from Google Drive and put them under the model folder.

  5. Run the demo code: python demo.py (a sample invocation follows the flag list below)

    • -device_ids: gpu id
    • -cuda: using cuda or not
    • -vg_model: pretrained VGnet weight
    • -at_model: pretrained ATnet weight
    • -lstm: use lstm or not
    • -p: input example image
    • -i: input audio file
    • -sample_dir: folder to save the outputs
    • ...
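
    A hypothetical invocation (the paths, weight filenames, and flag values here are illustrative; check the argparse setup in demo.py for the exact syntax and defaults):

      python demo.py -cuda True -device_ids 0 \
        -at_model model/atnet.pth -vg_model model/vgnet.pth \
        -p example/image.jpg -i example/audio.wav \
        -lstm True -sample_dir results/
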
  6. Download and unzip the training data from LRW

  7. Preprocess the data: extract facial landmarks and crop the images with dlib (a minimal sketch follows).
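
    The repository's own preprocessing code is not released yet (see Disclaimer), so this is only a minimal sketch of the dlib-based step above; the 68-point predictor file, the crop margins, and the 128x128 output size are assumptions rather than the repo's exact settings.

      import cv2
      import dlib
      import numpy as np

      # dlib's stock face detector and 68-point landmark predictor; the .dat
      # file must be downloaded separately from http://dlib.net/files/
      detector = dlib.get_frontal_face_detector()
      predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

      def landmarks_and_crop(image_path, size=128):  # output size is assumed
          img = cv2.imread(image_path)
          gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
          faces = detector(gray, 1)  # upsample once to catch small faces
          if not faces:
              return None, None
          shape = predictor(gray, faces[0])
          pts = np.array([(p.x, p.y) for p in shape.parts()], dtype=np.int32)
          x, y, w, h = cv2.boundingRect(pts)  # tight box around the landmarks
          crop = img[max(0, y - h // 4):y + h, x:x + w]  # headroom above brows
          return pts, cv2.resize(crop, (size, size))
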

  8. Train the ATnet model: python atnet.py (a sample command follows the flag list below)

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -lstm: use lstm or not
    • -sample_dir: folder to save visualized images during training
    • ...
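
    A hypothetical training command (directory names and values are illustrative; see the argparse defaults in atnet.py):

      python atnet.py -device_ids 0 -batch_size 16 \
        -model_dir checkpoints/atnet/ -lstm True -sample_dir samples/atnet/
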
  9. Test the model: python atnet_test.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • -lstm: use lstm or not
    • ...
  10. Train the VGnet: python vgnet.py

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -sample_dir: folder to save visualized images during training
    • ...
  11. Test the VGnet: python vgnet_test.py (a sample command follows the flag list below)

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • ...
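
    A hypothetical test command (the weight filename and paths are illustrative; see the argparse defaults in vgnet_test.py):

      python vgnet_test.py -device_ids 0 -batch_size 16 \
        -model_name model/vgnet.pth -sample_dir results/vgnet/
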

Model

  1. Overall ATVGnet model

  2. Regression-based discriminator network

Results

  1. Result visualization on different datasets

  2. Results compared with other SOTA methods

  3. Studies of image robustness with respect to landmark accuracy

  4. Quantitative results

Disclaimer and known issues

  1. These codes are implemented in PyTorch.
  2. In this paper, we train on LRW and GRID separately.
  3. The models are sensitive to the input images; please use the correct preprocessing code.
  4. The data preprocessing code is not finished yet and will be released soon. In the meantime, you can try the model with your own images.
  5. If you want to train these models with this version of PyTorch without modifications, note that:
    • You need at least 12 GB of GPU memory.
    • There might be some other untested issues.
  6. There is other interesting and useful research on audio-to-landmark generation; please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.

Todos

  • Release training data

License

MIT

Owner
Lele Chen
I am a Ph.D. candidate at the University of Rochester, supervised by Prof. Chenliang Xu.