Making a music video with Wav2CLIP and VQGAN-CLIP

Last update: Dec 26, 2022

music2video Overview

A repo for making a music video with Wav2CLIP and VQGAN-CLIP.

The base code was derived from VQGAN-CLIP. The CLIP embedding for audio was derived from Wav2CLIP.

Environment:

Tested on Ubuntu 20.04
GPU: Nvidia RTX 3090
Typical VRAM requirements:
- 24 GB for a 900x900 image
- 10 GB for a 512x512 image
- 8 GB for a 380x380 image
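
The figures above can serve as a rough lookup when choosing an output size. The sketch below is only an illustration: the helper name largest_size_for_vram is hypothetical, and its thresholds are just the three data points listed above, not measured limits for other GPUs or settings.

```python
# Hypothetical helper; thresholds are the three data points from the list above.
# Each entry is (required VRAM in GB, side of the square image in pixels).
VRAM_TABLE_GB = [(24, 900), (10, 512), (8, 380)]


def largest_size_for_vram(vram_gb):
    """Return the largest listed image side that fits in vram_gb, or None."""
    for required_gb, side in VRAM_TABLE_GB:  # ordered from largest to smallest
        if vram_gb >= required_gb:
            return side
    return None


if __name__ == "__main__":
    for gb in (24, 12, 8, 6):
        print(gb, "GB ->", largest_size_for_vram(gb))
```

A card that falls between two entries gets the smaller size, which is the conservative choice given that the list only gives typical requirements.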

Set up

This example uses Anaconda to manage virtual Python environments.

Create a new virtual Python environment for VQGAN-CLIP:

conda create --name vqgan python=3.9
conda activate vqgan

Install PyTorch in the new environment:

Note: This installs the CUDA version of PyTorch. If you want to use an AMD graphics card, see the AMD instructions in the upstream VQGAN-CLIP README.

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Install other required Python packages:

pip install ftfy regex tqdm omegaconf pytorch-lightning IPython kornia imageio imageio-ffmpeg einops torch_optimizer wav2clip

Or use the requirements.txt file, which includes version numbers.
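
After installing, it can help to confirm that the packages are visible in the active environment. The following is a minimal sketch, not part of the repo: the function name missing_modules is hypothetical, and the import names in the list are assumptions about how each pip package is imported (for example, pytorch-lightning as pytorch_lightning).

```python
import importlib.util

# Assumed import names for the packages installed above.
REQUIRED_MODULES = [
    "torch", "torchvision", "torchaudio", "ftfy", "regex", "tqdm",
    "omegaconf", "pytorch_lightning", "IPython", "kornia", "imageio",
    "imageio_ffmpeg", "einops", "torch_optimizer", "wav2clip",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only looks the modules up without importing them, so it runs quickly and does not initialize the GPU.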

Clone required repositories:

git clone 'https://github.com/nerdyrodent/VQGAN-CLIP'
cd VQGAN-CLIP
git clone 'https://github.com/openai/CLIP'
git clone 'https://github.com/CompVis/taming-transformers'

Note: In my development environment both CLIP and taming-transformers are present in the local directory, and so aren't present in the requirements.txt or vqgan.yml files.

As an alternative, you can also pip install taming-transformers and CLIP.

You will also need at least one pretrained VQGAN model, for example:

mkdir checkpoints

curl -L -o checkpoints/vqgan_imagenet_f16_16384.yaml -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fconfigs%2Fmodel.yaml&dl=1' #ImageNet 16384
curl -L -o checkpoints/vqgan_imagenet_f16_16384.ckpt -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fckpts%2Flast.ckpt&dl=1' #ImageNet 16384

Note that users of curl on Microsoft Windows should use double quotes.

The download_models.sh script is an optional way to download a number of models. By default, it will download just 1 model.

See https://github.com/CompVis/taming-transformers#overview-of-pretrained-models for more information about VQGAN pre-trained models, including download links.

By default, the model .yaml and .ckpt files are expected in the checkpoints directory. See https://github.com/CompVis/taming-transformers for more information on datasets and models.
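
Because each model needs both a .yaml and a .ckpt file, an interrupted download can leave the checkpoints directory incomplete. The sketch below lists the models for which both files are present. The helper name find_model_pairs is hypothetical, and it assumes the two files share the same base name, as in the curl example above.

```python
from pathlib import Path


def find_model_pairs(checkpoint_dir="checkpoints"):
    """Return sorted model names that have both a .yaml and a .ckpt file."""
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return []
    configs = {p.stem for p in directory.glob("*.yaml")}
    weights = {p.stem for p in directory.glob("*.ckpt")}
    # A model is usable only when its config and its weights are both present.
    return sorted(configs & weights)


if __name__ == "__main__":
    pairs = find_model_pairs()
    print("Complete models:", pairs if pairs else "none found")
```

This only checks that the files exist; it does not verify that a checkpoint downloaded completely.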

Run

To generate video from music, specify your music as shown in the example below:

python generate.py -vid -i 200 -vl 5 -o outputs/output.png -ap "music_sample/meeting_easy.wav" -gid 0

python generate.py -vid -i 200 -vl 5 -o outputs2/output.png -ap "music_sample/merry_go_round.wav" -gid 0
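
When processing several tracks, the two commands above can be generated from a list of audio files. This is a sketch under stated assumptions: build_command is a hypothetical helper, it reuses only the flags shown in the examples above, and the parameter names (iterations, vl, gpu_id) are illustrative labels for the -i, -vl and -gid values rather than names taken from the script.

```python
from pathlib import Path


def build_command(audio_path, output_dir, iterations=200, vl=5, gpu_id=0):
    """Build the generate.py argument list used in the examples above.

    Parameter names are illustrative labels for the -i, -vl and -gid values.
    """
    return [
        "python", "generate.py", "-vid",
        "-i", str(iterations),
        "-vl", str(vl),
        "-o", (Path(output_dir) / "output.png").as_posix(),
        "-ap", str(audio_path),
        "-gid", str(gpu_id),
    ]


if __name__ == "__main__":
    tracks = ["music_sample/meeting_easy.wav", "music_sample/merry_go_round.wav"]
    for index, track in enumerate(tracks, start=1):
        # Print the command instead of running it; pass the list to
        # subprocess.run(..., check=True) to execute it.
        print(" ".join(build_command(track, f"outputs{index}")))
```

Giving each track its own output directory, as the second example above does with outputs2, keeps one run from overwriting another.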

Citations

@misc{unpublished2021clip,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford and Ilya Sutskever and Jong Wook Kim and Gretchen Krueger and Sandhini Agarwal},
    year   = {2021}
}

@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{wu2021wav2clip,
  title={Wav2CLIP: Learning Robust Audio Representations From CLIP},
  author={Wu, Ho-Hsiang and Seetharaman, Prem and Kumar, Kundan and Bello, Juan Pablo},
  journal={arXiv preprint arXiv:2110.11499},
  year={2021}
}

Owner

Joel Jang | 장요엘
