Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Overview

Wav2CLIP

🚧 WIP 🚧

Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP πŸ“„ πŸ”—

Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification, and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.

Installation

pip install wav2clip

Usage

Clip-Level Embeddings

import wav2clip

model = wav2clip.get_model()
embeddings = wav2clip.embed_audio(audio, model)

Frame-Level Embeddings

import wav2clip

model = wav2clip.get_model(frame_length=16000, hop_length=16000)
embeddings = wav2clip.embed_audio(audio, model)
Comments
  • request of projection layer weight

    request of projection layer weight

    Hi @hohsiangwu , Thanks for great work! Request pre-trained weights of image_transform (MLP layer) for audio-image-language joint embedding space.

    Currently, only audio encoders seem to exist in the get_model function. Is there any big problem if I use CLIP embedding (text or image) without projection layer?

    opened by SeungHeonDoh 2
  • Initial checkin for accessing pre-trained model via pip install

    Initial checkin for accessing pre-trained model via pip install

    I am considering using the release feature of GitHub to host model weights, once the url is added to MODEL_WEIGHTS_URL, and the repository is made public, we should be able to model = torch.hub.load('descriptinc/lyrebird-wav2clip', 'wav2clip', pretrained=True)

    opened by hohsiangwu 1
  • Adding VQGAN-CLIP with modification to generate audio

    Adding VQGAN-CLIP with modification to generate audio

    • Adding a working snapshot of original generate.py from https://github.com/nerdyrodent/VQGAN-CLIP/
    • Modify to add audio related params and functions
    • Add scripts to generate image and video with options for conditioning and interpolation
    opened by hohsiangwu 0
  • Supervised scenario no transform

    Supervised scenario no transform

    In the supervise scenario in the __init__.py the transform flag is not set to True, so the model doesn't contain the MLP layer after training. I'm wondering how you train the MLP layer when using as pretrained.

    opened by alirezadir 0
  • Integrated into VQGAN+CLIP 3D Zooming notebook

    Integrated into VQGAN+CLIP 3D Zooming notebook

    Dear researchers,

    I integrated Wav2CLIP into a VQGAN+CLIP animation notebook.

    It is available on colab here: https://colab.research.google.com/github/pollinations/hive/blob/main/notebooks/2%20Text-To-Video/1%20CLIP-Guided%20VQGAN%203D%20Turbo%20Zoom.ipynb

    I'm part of a team creating an open-source generative art platform called Pollinations.AI. It's also possible to use through our frontend if you are interested. https://pollinations.ai/p/QmT7yt67DF3GF4wd2vyw6bAgN3QZx7Xpnoyx98YWEsEuV7/create

    Here is an example output: https://user-images.githubusercontent.com/5099901/168467451-f633468d-e596-48f5-8c2c-2dc54648ead3.mp4

    opened by voodoohop 0
  • The details concerning loading raw audio files

    The details concerning loading raw audio files

    Hi !

    I haved imported the wave2clip as a package, however when testing, the inputs for the model to extract features are not original audio files. Thus can you provided the details to load the audio files to processed data for the model?

    opened by jinx2018 0
  • torch version

    torch version

    Hi, thanks for sharing the wonderful work! I encountered some issues during pip installing it, so may I ask what is the torch version you used? I cannot find the requirement of this project. Thanks!

    opened by annahung31 0
  • Error when importing after fresh installation on colab

    Error when importing after fresh installation on colab

    What CUDA and Python versions have you tested the pip package in? After installation on a fresh collab I receive the following error:


    OSError Traceback (most recent call last) in () ----> 1 import wav2clip

    7 frames /usr/local/lib/python3.7/dist-packages/wav2clip/init.py in () 2 import torch 3 ----> 4 from .model.encoder import ResNetExtractor 5 6

    /usr/local/lib/python3.7/dist-packages/wav2clip/model/encoder.py in () 4 from torch import nn 5 ----> 6 from .resnet import BasicBlock 7 from .resnet import ResNet 8

    /usr/local/lib/python3.7/dist-packages/wav2clip/model/resnet.py in () 3 import torch.nn as nn 4 import torch.nn.functional as F ----> 5 import torchaudio 6 7

    /usr/local/lib/python3.7/dist-packages/torchaudio/init.py in () ----> 1 from torchaudio import _extension # noqa: F401 2 from torchaudio import ( 3 compliance, 4 datasets, 5 functional,

    /usr/local/lib/python3.7/dist-packages/torchaudio/_extension.py in () 25 26 ---> 27 _init_extension()

    /usr/local/lib/python3.7/dist-packages/torchaudio/_extension.py in _init_extension() 19 # which depends on libtorchaudio and dynamic loader will handle it for us. 20 if path.exists(): ---> 21 torch.ops.load_library(path) 22 torch.classes.load_library(path) 23 # This import is for initializing the methods registered via PyBind11

    /usr/local/lib/python3.7/dist-packages/torch/_ops.py in load_library(self, path) 108 # static (global) initialization code in order to register custom 109 # operators with the JIT. --> 110 ctypes.CDLL(path) 111 self.loaded_libraries.add(path) 112

    /usr/lib/python3.7/ctypes/init.py in init(self, name, mode, handle, use_errno, use_last_error) 362 363 if handle is None: --> 364 self._handle = _dlopen(self._name, mode) 365 else: 366 self._handle = handle

    OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory

    opened by janzuiderveld 0
Releases(v0.1.0-alpha)
Owner
Descript
Descript
Code accompanying "Adaptive Methods for Aggregated Domain Generalization"

Adaptive Methods for Aggregated Domain Generalization (AdaClust) Official Pytorch Implementation of Adaptive Methods for Aggregated Domain Generalizat

Xavier Thomas 15 Sep 20, 2022
Real-ESRGAN aims at developing Practical Algorithms for General Image Restoration.

Real-ESRGAN Colab Demo for Real-ESRGAN . Portable Windows executable file. You can find more information here. Real-ESRGAN aims at developing Practica

Xintao 17.2k Jan 02, 2023
A 1.3B text-to-image generation model trained on 14 million image-text pairs

minDALL-E on Conceptual Captions minDALL-E, named after minGPT, is a 1.3B text-to-image generation model trained on 14 million image-text pairs for no

Kakao Brain 604 Dec 14, 2022
Official implementation of Meta-StyleSpeech and StyleSpeech

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang This is an official code

min95 168 Dec 28, 2022
The code for paper "Learning Implicit Fields for Generative Shape Modeling".

implicit-decoder The tensorflow code for paper "Learning Implicit Fields for Generative Shape Modeling", Zhiqin Chen, Hao (Richard) Zhang. Project pag

Zhiqin Chen 353 Dec 30, 2022
Code for Blind Image Decomposition (BID) and Blind Image Decomposition network (BIDeN).

arXiv, porject page, paper Blind Image Decomposition (BID) Blind Image Decomposition is a novel task. The task requires separating a superimposed imag

64 Dec 20, 2022
Code for "LoFTR: Detector-Free Local Feature Matching with Transformers", CVPR 2021

LoFTR: Detector-Free Local Feature Matching with Transformers Project Page | Paper LoFTR: Detector-Free Local Feature Matching with Transformers Jiami

ZJU3DV 1.4k Jan 04, 2023
PixelPick This is an official implementation of the paper "All you need are a few pixels: semantic segmentation with PixelPick."

PixelPick This is an official implementation of the paper "All you need are a few pixels: semantic segmentation with PixelPick." [Project page] [Paper

Gyungin Shin 59 Sep 25, 2022
Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR 2022)

Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds (CVPR2022)[paper] Authors: Chenhang He, Ruihuang Li, Shuai Li, L

Billy HE 141 Dec 30, 2022
[ICCV21] Official implementation of the "Social NCE: Contrastive Learning of Socially-aware Motion Representations" in PyTorch.

Social-NCE + CrowdNav Website | Paper | Video | Social NCE + Trajectron | Social NCE + STGCNN This is an official implementation for Social NCE: Contr

VITA lab at EPFL 125 Dec 23, 2022
Implementation of trRosetta and trDesign for Pytorch, made into a convenient package

trRosetta - Pytorch (wip) Implementation of trRosetta and trDesign for Pytorch, made into a convenient package

Phil Wang 67 Dec 17, 2022
Addition of pseudotorsion caclulation eta, theta, eta', and theta' to barnaba package

Addition to Original Barnaba Code: This is modified version of Barnaba package to calculate RNA pseudotorsion angles eta, theta, eta', and theta'. Ple

Mandar Kulkarni 1 Jan 11, 2022
Learning the Beauty in Songs: Neural Singing Voice Beautifier; ACL 2022 (Main conference); Official code

Learning the Beauty in Songs: Neural Singing Voice Beautifier Jinglin Liu, Chengxi Li, Yi Ren, Zhiying Zhu, Zhou Zhao Zhejiang University ACL 2022 Mai

Jinglin Liu 257 Dec 30, 2022
πŸ… Top 5% in 제2회 μ—°κ΅¬κ°œλ°œνŠΉκ΅¬ 인곡지λŠ₯ κ²½μ§„λŒ€νšŒ AI SPARK μ±Œλ¦°μ§€

AI_SPARK_CHALLENG_Object_Detection 제2회 μ—°κ΅¬κ°œλ°œνŠΉκ΅¬ 인곡지λŠ₯ κ²½μ§„λŒ€νšŒ AI SPARK μ±Œλ¦°μ§€ πŸ… Top 5% in mAP(0.75) (443λͺ… 쀑 13λ“±, mAP: 0.98116) λŒ€νšŒ μ„€λͺ… Edge ν™˜κ²½μ—μ„œμ˜ κ°€μΆ• Object Dete

3 Sep 19, 2022
Quasi-Dense Similarity Learning for Multiple Object Tracking, CVPR 2021 (Oral)

Quasi-Dense Tracking This is the offical implementation of paper Quasi-Dense Similarity Learning for Multiple Object Tracking. We present a trailer th

ETH VIS Research Group 327 Dec 27, 2022
Code & Data for Enhancing Photorealism Enhancement

Enhancing Photorealism Enhancement Stephan R. Richter, Hassan Abu AlHaija, Vladlen Koltun Paper | Website (with side-by-side comparisons) | Video (Pap

Intelligent Systems Lab Org 1.1k Dec 31, 2022
Model Serving Made Easy

The easiest way to build Machine Learning APIs BentoML makes moving trained ML models to production easy: Package models trained with any ML framework

BentoML 4.4k Jan 08, 2023
[LREC] MMChat: Multi-Modal Chat Dataset on Social Media

MMChat This repo contains the code and data for the LREC2022 paper MMChat: Multi-Modal Chat Dataset on Social Media. Dataset MMChat is a large-scale d

Silver 47 Jan 03, 2023
FairMOT - A simple baseline for one-shot multi-object tracking

FairMOT - A simple baseline for one-shot multi-object tracking

Yifu Zhang 3.6k Jan 08, 2023
source code for https://arxiv.org/abs/2005.11248 "Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics"

Accelerating Antimicrobial Discovery with Controllable Deep Generative Models and Molecular Dynamics This work will be published in Nature Biomedical

International Business Machines 71 Nov 15, 2022