Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

Last update: Dec 28, 2022

Overview

NÜWA - Pytorch (wip)

Implementation of NÜWA, a state-of-the-art attention network for text-to-video synthesis, in Pytorch. This repository will be populated if Microsoft does not open-source the code by the end of December. It may also contain an extension into video and audio, using a dual decoder approach.

Citations

@misc{wu2021nuwa,
    title   = {N\"UWA: Visual Synthesis Pre-training for Neural visUal World creAtion}, 
    author  = {Chenfei Wu and Jian Liang and Lei Ji and Fan Yang and Yuejian Fang and Daxin Jiang and Nan Duan},
    year    = {2021},
    eprint  = {2111.12417},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

Question about generated videos?

The generated videos contain a lot of negative numbers and very small decimals (like 5e-1), even though the loss decreases normally during training. Is that normal? How can I make the result visible?
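One possible explanation, offered as an assumption rather than something confirmed by the repository, is that the output is a float array in a normalized range rather than 0 to 255 pixel values. Under that assumption, a minimal sketch of rescaling the values before saving or displaying them could look like this. The helper name to_uint8_frames is hypothetical and not part of nuwa-pytorch.

```python
import numpy as np

def to_uint8_frames(video):
    """Min-max rescale a float video array to uint8 in [0, 255].

    `video` is assumed to be a float array of any shape, for example
    (frames, channels, height, width). This helper is hypothetical and
    not part of nuwa-pytorch.
    """
    video = np.asarray(video, dtype=np.float64)
    lo, hi = video.min(), video.max()
    if hi == lo:
        # A constant array has no contrast to stretch; map it to black.
        return np.zeros(video.shape, dtype=np.uint8)
    scaled = (video - lo) / (hi - lo)  # now in [0, 1]
    return np.round(scaled * 255).astype(np.uint8)

# Example: values with negatives and small decimals, as in the question.
frames = np.array([[-1.0, -0.5], [0.5, 1.0]])
pixels = to_uint8_frames(frames)
```

If the output range is known in advance (for example exactly -1 to 1), a fixed mapping such as (x + 1) / 2 is usually preferable to min-max scaling, because it keeps brightness consistent across videos.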

opened by Fitzwong 0

Why does the video not pass through the encoder?

Hi lucidrains! Thanks for providing a great repo that makes the NUWA paper easier to understand.
I have a question. In the NUWA paper, the inputs of the Encoder are the caption tokens (caption condition) and the video tokens (3DNA condition). So, as I understand it, the video token sequence should fully self-attend in the Encoder, and the Encoder outputs then condition the Decoder. The Decoder in your repo has causal self-attention and text conditioning, as expected. But by the paper's definition, the condition contains both the text condition and the 3DNA condition, and both condition the Decoder. Is my understanding right? I am just curious about the conditioning in the NUWA paper. The Encoder in your repo is only a text encoder, so the video does not pass through the Encoder to condition the Decoder.

Looking forward to your reply! Thanks!

opened by Wang-Xiaodong1899 0

Questions about function forward() in NUWA please.

I'm confused that, in the forward() function of class NUWA, the ground-truth video is fed to the transformer to calculate the output video, which is different from what generate() does.

frame_embeddings = self.video_transformer(
    frame_embeddings,  # calculated from ground-truth video
    context = text_embeds,
    context_mask = text_mask
)

So when training NUWA, the loss comes from the logits. But the logits depend not only on the text but also on the ground-truth video (only one transformer layer, unlike the auto-regressive model in the generate function). Is that some kind of cheating during training? Or should I generate the logits the same way as in generate(), and then calculate the loss to train?
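What the question describes is commonly called teacher forcing. In a standard autoregressive setup it is not cheating, provided the self-attention is causal: the output at each position can only depend on that position and earlier ones, so the model never sees the token it is asked to predict once the targets are shifted by one step. The toy below is a sketch of that property only. It is not the repository's video_transformer, and it uses no learned projections.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention with a causal mask.

    x: (seq_len, dim) array. Queries, keys and values are all x itself,
    which keeps the toy small; a real layer would use learned projections.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                        # (n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))  # stand-in for ground-truth frame embeddings

out_full = causal_self_attention(tokens)

# Replace everything after position 2 with different values.
changed = tokens.copy()
changed[3:] = rng.normal(size=(3, 4))
out_changed = causal_self_attention(changed)

# Positions 0..2 never saw the later tokens, so their outputs are identical.
same_prefix = np.allclose(out_full[:3], out_changed[:3])
```

Because of this property, one masked forward pass over the ground-truth sequence yields the same per-position predictions that a step-by-step loop would produce when fed the ground-truth prefix, which is why training can run in parallel while generation has to run token by token.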

opened by Fitzwong 1

Type of dataset for training VQ-GAN

Hi,

First, thanks a lot for the amazing work! I have one question regarding the training of the VQ-GAN: do you recommend training it on a dataset similar to the one the nuwa model will be trained on? What I mean is, if I want to train nuwa to generate sport videos based on text, do I also need to train the VQ-GAN on a sport dataset?

Thanks a lot

opened by antonibigata 0
Pseudocode for 3DNA?

me no comprendai le complex einops 😢

Can someone give the 3DNA pseudocode to illustrate what's going on 🤗

(Also how did lucidrains bang out thousands of lines of code in a few weeks - is he confirmed to be human? 🤔)
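
As a rough illustration of the idea behind 3D nearby attention, here is a loop-based sketch in which each token in a (time, height, width) grid attends only to tokens within a small 3D neighbourhood. This is a simplification under stated assumptions: it is non-causal, uses the raw tokens as queries, keys and values with no learned projections, and is not the vectorized einops implementation in the repository. The function name nearby_attention_3d and the extent parameter are hypothetical.

```python
import numpy as np

def nearby_attention_3d(x, extent=(1, 1, 1)):
    """Loop-based local attention over a (T, H, W, D) grid of tokens.

    Each position attends only to positions within `extent` steps of it
    along time, height and width. Queries, keys and values are all x,
    with no learned projections, to keep the sketch short.
    """
    T, H, W, D = x.shape
    et, eh, ew = extent
    out = np.zeros_like(x)
    for t in range(T):
        for h in range(H):
            for w in range(W):
                # Clip the neighbourhood at the borders of the grid.
                block = x[max(t - et, 0):t + et + 1,
                          max(h - eh, 0):h + eh + 1,
                          max(w - ew, 0):w + ew + 1]
                keys = block.reshape(-1, D)              # (neighbours, D)
                scores = keys @ x[t, h, w] / np.sqrt(D)
                weights = np.exp(scores - scores.max())  # stable softmax
                weights = weights / weights.sum()
                out[t, h, w] = weights @ keys
    return out

video_tokens = np.random.default_rng(0).normal(size=(3, 4, 4, 8))
attended = nearby_attention_3d(video_tokens, extent=(1, 1, 1))
```

A causal variant for a decoder would additionally drop neighbours that come later in the flattened (time, height, width) order. The point of the local window is cost: each token attends to at most (2*et+1) * (2*eh+1) * (2*ew+1) neighbours instead of all T * H * W tokens.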

opened by neel04 4

Releases(0.7.7a)

0.7.7a(Aug 14, 2022)
0.7.7(Aug 14, 2022)
0.7.6(Apr 28, 2022)
0.7.5(Apr 28, 2022)
0.7.4(Apr 27, 2022)
0.7.3(Apr 22, 2022)
0.7.2(Apr 7, 2022)
0.7.1(Mar 24, 2022)
0.7.0(Mar 24, 2022)
0.6.4(Mar 15, 2022)
0.6.3(Mar 15, 2022)
0.6.2(Mar 15, 2022)
0.6.1(Mar 15, 2022)
0.6.0(Mar 15, 2022)
0.5.15(Mar 12, 2022)
0.5.14(Mar 12, 2022)
0.5.12(Mar 12, 2022)
0.5.11(Mar 12, 2022)
0.5.10(Mar 11, 2022)
0.5.9(Mar 11, 2022)
0.5.8(Mar 11, 2022)
0.5.7(Mar 11, 2022)
0.5.6(Mar 11, 2022)
0.5.5(Mar 11, 2022)
0.5.4(Mar 11, 2022)
0.5.3(Mar 10, 2022)
0.5.2(Mar 10, 2022)
0.5.1(Mar 10, 2022)
0.5.0(Mar 10, 2022)
0.4.33(Mar 10, 2022)

Owner

Phil Wang

Working with Attention. It's all we need

