Making a music video with Wav2CLIP and VQGAN-CLIP

Last update: Dec 26, 2022

Related tags

Deep Learning music2video

Overview

music2video Overview

A repo for making a music video with Wav2CLIP and VQGAN-CLIP.

The base code was derived from VQGAN-CLIP The CLIP embedding for audio was derived from Wav2CLIP

Environment:

Tested on Ubuntu 20.04
GPU: Nvidia RTX 3090
Typical VRAM requirements:
- 24 GB for a 900x900 image
- 10 GB for a 512x512 image
- 8 GB for a 380x380 image

Set up

This example uses Anaconda to manage virtual Python environments.

Create a new virtual Python environment for VQGAN-CLIP:

conda create --name vqgan python=3.9
conda activate vqgan

Install Pytorch in the new enviroment:

Note: This installs the CUDA version of Pytorch, if you want to use an AMD graphics card, read the AMD section below.

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Install other required Python packages:

pip install ftfy regex tqdm omegaconf pytorch-lightning IPython kornia imageio imageio-ffmpeg einops torch_optimizer wav2clip

Or use the requirements.txt file, which includes version numbers.

Clone required repositories:

git clone 'https://github.com/nerdyrodent/VQGAN-CLIP'
cd VQGAN-CLIP
git clone 'https://github.com/openai/CLIP'
git clone 'https://github.com/CompVis/taming-transformers'

Note: In my development environment both CLIP and taming-transformers are present in the local directory, and so aren't present in the requirements.txt or vqgan.yml files.

As an alternative, you can also pip install taming-transformers and CLIP.

You will also need at least 1 VQGAN pretrained model. E.g.

mkdir checkpoints

curl -L -o checkpoints/vqgan_imagenet_f16_16384.yaml -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fconfigs%2Fmodel.yaml&dl=1' #ImageNet 16384
curl -L -o checkpoints/vqgan_imagenet_f16_16384.ckpt -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fckpts%2Flast.ckpt&dl=1' #ImageNet 16384

Note that users of curl on Microsoft Windows should use double quotes.

The download_models.sh script is an optional way to download a number of models. By default, it will download just 1 model.

See https://github.com/CompVis/taming-transformers#overview-of-pretrained-models for more information about VQGAN pre-trained models, including download links.

By default, the model .yaml and .ckpt files are expected in the checkpoints directory. See https://github.com/CompVis/taming-transformers for more information on datasets and models.

Run

To generate video from music, specify your music as shown in the example below:

python generate.py -vid -i 200 -vl 5 -o outputs/output.png -ap "music_sample/meeting_easy.wav" -gid 0

python generate.py -vid -i 200 -vl 5 -o outputs2/output.png -ap "music_sample/merry_go_round.wav" -gid 0

Citations

@misc{unpublished2021clip,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal},
    year   = {2021}
}

@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{wu2021wav2clip,
  title={Wav2CLIP: Learning Robust Audio Representations From CLIP},
  author={Wu, Ho-Hsiang and Seetharaman, Prem and Kumar, Kundan and Bello, Juan Pablo},
  journal={arXiv preprint arXiv:2110.11499},
  year={2021}
}

Making a music video with Wav2CLIP and VQGAN-CLIP

Related tags

Overview

music2video Overview

Set up

Run

Citations

Owner

Joel Jang | 장요엘

Implementation of the GVP-Transformer, which was used in the paper "Learning inverse folding from millions of predicted structures" for de novo protein design alongside Alphafold2

A TikTok-like recommender system for GitHub repositories based on Gorse

A denoising diffusion probabilistic model synthesises galaxies that are qualitatively and physically indistinguishable from the real thing.

Cooperative multi-agent reinforcement learning for high-dimensional nonequilibrium control

LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation (NeurIPS2021 Benchmark and Dataset Track)

Code + pre-trained models for the paper Keeping Your Eye on the Ball Trajectory Attention in Video Transformers

Using pytorch to implement unet network for liver image segmentation.

Human4D Dataset tools for processing and visualization

Original code for "Zero-Shot Domain Adaptation with a Physics Prior"

Official tensorflow implementation for CVPR2020 paper “Learning to Cartoonize Using White-box Cartoon Representations”

1st ranked 'driver careless behavior detection' for AI Online Competition 2021, hosted by MSIT Korea.

Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Code for ACL'2021 paper WARP 🌀 Word-level Adversarial ReProgramming

Deep generative modeling for time-stamped heterogeneous data, enabling high-fidelity models for a large variety of spatio-temporal domains.

Subgraph Based Learning of Contextual Embedding

Unsupervised Feature Ranking via Attribute Networks.

KE-Dialogue: Injecting knowledge graph into a fully end-to-end dialogue system.

Rewrite ultralytics/yolov5 v6.0 opencv inference code based on numpy, no need to rely on pytorch

Bayesian Generative Adversarial Networks in Tensorflow

Optimized Gillespie algorithm for simulating Stochastic sPAtial models of Cancer Evolution (OG-SPACE)