Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Overview

Wav2CLIP

🚧 WIP 🚧

Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP 📄 🔗

Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and generation, and show that Wav2CLIP can outperform several publicly available pre-trained audio representation algorithms. Wav2CLIP projects audio into a shared embedding space with images and text, which enables multimodal applications such as zero-shot classification, and cross-modal retrieval. Furthermore, Wav2CLIP needs just ~10% of the data to achieve competitive performance on downstream tasks compared with fully supervised models, and is more efficient to pre-train than competing methods as it does not require learning a visual model in concert with an auditory model. Finally, we demonstrate image generation from Wav2CLIP as qualitative assessment of the shared embedding space. Our code and model weights are open sourced and made available for further applications.

Installation

pip install wav2clip

Usage

Clip-Level Embeddings

import wav2clip

model = wav2clip.get_model()
embeddings = wav2clip.embed_audio(audio, model)

Frame-Level Embeddings

import wav2clip

model = wav2clip.get_model(frame_length=16000, hop_length=16000)
embeddings = wav2clip.embed_audio(audio, model)
Comments
  • request of projection layer weight

    request of projection layer weight

    Hi @hohsiangwu , Thanks for great work! Request pre-trained weights of image_transform (MLP layer) for audio-image-language joint embedding space.

    Currently, only audio encoders seem to exist in the get_model function. Is there any big problem if I use CLIP embedding (text or image) without projection layer?

    opened by SeungHeonDoh 2
  • Initial checkin for accessing pre-trained model via pip install

    Initial checkin for accessing pre-trained model via pip install

    I am considering using the release feature of GitHub to host model weights, once the url is added to MODEL_WEIGHTS_URL, and the repository is made public, we should be able to model = torch.hub.load('descriptinc/lyrebird-wav2clip', 'wav2clip', pretrained=True)

    opened by hohsiangwu 1
  • Adding VQGAN-CLIP with modification to generate audio

    Adding VQGAN-CLIP with modification to generate audio

    • Adding a working snapshot of original generate.py from https://github.com/nerdyrodent/VQGAN-CLIP/
    • Modify to add audio related params and functions
    • Add scripts to generate image and video with options for conditioning and interpolation
    opened by hohsiangwu 0
  • Supervised scenario no transform

    Supervised scenario no transform

    In the supervise scenario in the __init__.py the transform flag is not set to True, so the model doesn't contain the MLP layer after training. I'm wondering how you train the MLP layer when using as pretrained.

    opened by alirezadir 0
  • Integrated into VQGAN+CLIP 3D Zooming notebook

    Integrated into VQGAN+CLIP 3D Zooming notebook

    Dear researchers,

    I integrated Wav2CLIP into a VQGAN+CLIP animation notebook.

    It is available on colab here: https://colab.research.google.com/github/pollinations/hive/blob/main/notebooks/2%20Text-To-Video/1%20CLIP-Guided%20VQGAN%203D%20Turbo%20Zoom.ipynb

    I'm part of a team creating an open-source generative art platform called Pollinations.AI. It's also possible to use through our frontend if you are interested. https://pollinations.ai/p/QmT7yt67DF3GF4wd2vyw6bAgN3QZx7Xpnoyx98YWEsEuV7/create

    Here is an example output: https://user-images.githubusercontent.com/5099901/168467451-f633468d-e596-48f5-8c2c-2dc54648ead3.mp4

    opened by voodoohop 0
  • The details concerning loading raw audio files

    The details concerning loading raw audio files

    Hi !

    I haved imported the wave2clip as a package, however when testing, the inputs for the model to extract features are not original audio files. Thus can you provided the details to load the audio files to processed data for the model?

    opened by jinx2018 0
  • torch version

    torch version

    Hi, thanks for sharing the wonderful work! I encountered some issues during pip installing it, so may I ask what is the torch version you used? I cannot find the requirement of this project. Thanks!

    opened by annahung31 0
  • Error when importing after fresh installation on colab

    Error when importing after fresh installation on colab

    What CUDA and Python versions have you tested the pip package in? After installation on a fresh collab I receive the following error:


    OSError Traceback (most recent call last) in () ----> 1 import wav2clip

    7 frames /usr/local/lib/python3.7/dist-packages/wav2clip/init.py in () 2 import torch 3 ----> 4 from .model.encoder import ResNetExtractor 5 6

    /usr/local/lib/python3.7/dist-packages/wav2clip/model/encoder.py in () 4 from torch import nn 5 ----> 6 from .resnet import BasicBlock 7 from .resnet import ResNet 8

    /usr/local/lib/python3.7/dist-packages/wav2clip/model/resnet.py in () 3 import torch.nn as nn 4 import torch.nn.functional as F ----> 5 import torchaudio 6 7

    /usr/local/lib/python3.7/dist-packages/torchaudio/init.py in () ----> 1 from torchaudio import _extension # noqa: F401 2 from torchaudio import ( 3 compliance, 4 datasets, 5 functional,

    /usr/local/lib/python3.7/dist-packages/torchaudio/_extension.py in () 25 26 ---> 27 _init_extension()

    /usr/local/lib/python3.7/dist-packages/torchaudio/_extension.py in _init_extension() 19 # which depends on libtorchaudio and dynamic loader will handle it for us. 20 if path.exists(): ---> 21 torch.ops.load_library(path) 22 torch.classes.load_library(path) 23 # This import is for initializing the methods registered via PyBind11

    /usr/local/lib/python3.7/dist-packages/torch/_ops.py in load_library(self, path) 108 # static (global) initialization code in order to register custom 109 # operators with the JIT. --> 110 ctypes.CDLL(path) 111 self.loaded_libraries.add(path) 112

    /usr/lib/python3.7/ctypes/init.py in init(self, name, mode, handle, use_errno, use_last_error) 362 363 if handle is None: --> 364 self._handle = _dlopen(self._name, mode) 365 else: 366 self._handle = handle

    OSError: libcudart.so.10.2: cannot open shared object file: No such file or directory

    opened by janzuiderveld 0
Releases(v0.1.0-alpha)
Owner
Descript
Descript
MEND: Model Editing Networks using Gradient Decomposition

MEND: Model Editing Networks using Gradient Decomposition Setup Environment This codebase uses Python 3.7.9. Other versions may work as well. Create a

Eric Mitchell 141 Dec 02, 2022
Robocop is your personal mini voice assistant made using Python.

Robocop-VoiceAssistant To use this project, you should have python installed in your system. If you don't have python installed, install it beforehand

Sohil Khanduja 3 Feb 26, 2022
Auto-updating data to assist in investment to NEPSE

Symbol Ratios Summary Sector LTP Undervalued Bonus % MEGA Strong Commercial Banks 368 5 10 JBBL Strong Development Banks 568 5 10 SIFC Strong Finance

Amit Chaudhary 16 Nov 01, 2022
Metrics to evaluate quality and efficacy of synthetic datasets.

An Open Source Project from the Data to AI Lab, at MIT Metrics for Synthetic Data Generation Projects Website: https://sdv.dev Documentation: https://

The Synthetic Data Vault Project 129 Jan 03, 2023
Hardware accelerated, batchable and differentiable optimizers in JAX.

JAXopt Installation | Examples | References Hardware accelerated (GPU/TPU), batchable and differentiable optimizers in JAX. Installation JAXopt can be

Google 621 Jan 08, 2023
PyTorch implementation code for the paper MixCo: Mix-up Contrastive Learning for Visual Representation

How to Reproduce our Results This repository contains PyTorch implementation code for the paper MixCo: Mix-up Contrastive Learning for Visual Represen

opcrisis 46 Dec 15, 2022
Federated_learning codes used for the the paper "Evaluation of Federated Learning Aggregation Algorithms" and "A Federated Learning Aggregation Algorithm for Pervasive Computing: Evaluation and Comparison"

Federated Distance (FedDist) This is the code accompanying the Percom2021 paper "A Federated Learning Aggregation Algorithm for Pervasive Computing: E

GETALP 8 Jan 03, 2023
Multiple types of NN model optimization environments. It is possible to directly access the host PC GUI and the camera to verify the operation. Intel iHD GPU (iGPU) support. NVIDIA GPU (dGPU) support.

mtomo Multiple types of NN model optimization environments. It is possible to directly access the host PC GUI and the camera to verify the operation.

Katsuya Hyodo 24 Mar 02, 2022
PyTorch framework for Deep Learning research and development.

Accelerated DL & RL PyTorch framework for Deep Learning research and development. It was developed with a focus on reproducibility, fast experimentati

Catalyst-Team 29 Jul 13, 2022
[arXiv] What-If Motion Prediction for Autonomous Driving ❓🚗💨

WIMP - What If Motion Predictor Reference PyTorch Implementation for What If Motion Prediction [PDF] [Dynamic Visualizations] Setup Requirements The W

William Qi 96 Dec 29, 2022
Pytorch reimplement of the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction" ACL2020. The original code is written in keras.

CasRel-pytorch-reimplement Pytorch reimplement of the paper "A Novel Cascade Binary Tagging Framework for Relational Triple Extraction" ACL2020. The o

longlongman 170 Dec 01, 2022
Simple ONNX operation generator. Simple Operation Generator for ONNX.

sog4onnx Simple ONNX operation generator. Simple Operation Generator for ONNX. https://github.com/PINTO0309/simple-onnx-processing-tools Key concept V

Katsuya Hyodo 6 May 15, 2022
CarND-LaneLines-P1 - Lane Finding Project for Self-Driving Car ND

Finding Lane Lines on the Road Overview When we drive, we use our eyes to decide where to go. The lines on the road that show us where the lanes are a

Udacity 769 Dec 27, 2022
GrailQA: Strongly Generalizable Question Answering

GrailQA is a new large-scale, high-quality KBQA dataset with 64,331 questions annotated with both answers and corresponding logical forms in different syntax (i.e., SPARQL, S-expression, etc.). It ca

OSU DKI Lab 76 Dec 21, 2022
dataset for ECCV 2020 "Motion Capture from Internet Videos"

Motion Capture from Internet Videos Motion Capture from Internet Videos Junting Dong*, Qing Shuai*, Yuanqing Zhang, Xian Liu, Xiaowei Zhou, Hujun Bao

ZJU3DV 98 Dec 07, 2022
REBEL: Relation Extraction By End-to-end Language generation

REBEL: Relation Extraction By End-to-end Language generation This is the repository for the Findings of EMNLP 2021 paper REBEL: Relation Extraction By

Babelscape 222 Jan 06, 2023
This repository contains code accompanying the paper "An End-to-End Chinese Text Normalization Model based on Rule-Guided Flat-Lattice Transformer"

FlatTN This repository contains code accompanying the paper "An End-to-End Chinese Text Normalization Model based on Rule-Guided Flat-Lattice Transfor

THUHCSI 74 Nov 28, 2022
Code to accompany the paper "Finding Bipartite Components in Hypergraphs", which is published in NeurIPS'21.

Finding Bipartite Components in Hypergraphs This repository contains code to accompany the paper "Finding Bipartite Components in Hypergraphs", publis

Peter Macgregor 5 May 06, 2022
TDmatch is a Python library developed to perform matching tasks in three categories:

TDmatch TDmatch is a Python library developed to perform matching tasks in three categories: Text to Data which matches tuples of a table to text docu

Naser Ahmadi 5 Aug 11, 2022
ERISHA is a mulitilingual multispeaker expressive speech synthesis framework. It can transfer the expressivity to the speaker's voice for which no expressive speech corpus is available.

ERISHA: Multilingual Multispeaker Expressive Text-to-Speech Library ERISHA is a multilingual multispeaker expressive speech synthesis framework. It ca

Ajinkya Kulkarni 43 Nov 27, 2022