Localizing-Visual-Sounds-the-Hard-Way

Code and Dataset for "Localizing Visual Sounds the Hard Way".

The repo contains code and our pre-trained model.

Environment

Python 3.6.8
PyTorch 1.3.0

Flickr-SoundNet

We provide the pretrained model here.

To test the model, download the test data and ground truth from Learning to Localize Sound Source.

Then run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --gt_path "path to ground truth" --testset "flickr"

VGG-Sound Source

We provide the pretrained model here.

To test the model, run

python test.py --data_path "path to downloaded data with structure below/" --summaries_dir "path to pretrained models" --testset "vggss"

(Note: some ground-truth bounding boxes were updated recently, so all results on VGG-SS differ by 2~3% in IoU.)
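
For readers unfamiliar with the IoU figure mentioned in the note, here is a minimal sketch of a pixel-wise IoU between a binary prediction mask and a ground-truth bounding box. It is a generic illustration, not necessarily the exact metric computed by test.py; the function names `box_to_mask` and `mask_iou` and the (x_min, y_min, x_max, y_max) box format are assumptions made for this example.

```python
def box_to_mask(box, height, width):
    """Rasterize a box into a binary mask stored as a list of rows.

    Assumption: box is (x_min, y_min, x_max, y_max) in pixels,
    with x_max and y_max exclusive.
    """
    x_min, y_min, x_max, y_max = box
    return [
        [1 if (x_min <= x < x_max and y_min <= y < y_max) else 0
         for x in range(width)]
        for y in range(height)
    ]


def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks of the same size."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Two empty masks have no overlap to measure; report 0.0 by convention.
    return intersection / union if union else 0.0


if __name__ == "__main__":
    pred = box_to_mask((0, 0, 2, 2), height=4, width=4)
    gt = box_to_mask((1, 1, 3, 3), height=4, width=4)
    print(mask_iou(pred, gt))
```

In the example, each box covers 4 pixels and they share 1 pixel, so the union is 7 pixels and the IoU is 1/7. A change of a few pixels in a ground-truth box shifts this ratio, which is why updated boxes change reported IoU numbers.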

The test data for both datasets should be placed in the following structure.

data path
│
├───frames
│   │   image001.jpg
│   │   image002.jpg
│   │
└───audio
    │   audio011.wav
    │   audio012.wav
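
Before running test.py, it can help to confirm that the data path follows the layout above. The following is a minimal sketch, not part of the repository: the helper name `check_data_layout` is hypothetical, and the `.jpg` and `.wav` extensions are assumed from the example file names in the tree.

```python
from pathlib import Path


def check_data_layout(data_path):
    """Verify that data_path contains 'frames' and 'audio' subdirectories.

    Returns two sorted lists of file names: the .jpg files under frames/
    and the .wav files under audio/. Raises FileNotFoundError if either
    subdirectory is missing.
    """
    root = Path(data_path)
    frames_dir = root / "frames"
    audio_dir = root / "audio"

    missing = [d.name for d in (frames_dir, audio_dir) if not d.is_dir()]
    if missing:
        raise FileNotFoundError(
            "Missing subdirectories under {}: {}".format(root, ", ".join(missing))
        )

    # Extensions are assumptions taken from the example tree.
    frames = sorted(p.name for p in frames_dir.glob("*.jpg"))
    audio = sorted(p.name for p in audio_dir.glob("*.wav"))
    return frames, audio
```

Calling `check_data_layout("path to downloaded data")` returns the frame and audio file names it found, so empty lists indicate that the files are in the wrong place or use different extensions.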

Citation

@InProceedings{Chen21,
              title        = "Localizing Visual Sounds the Hard Way",
              author       = "Honglie Chen and Weidi Xie and Triantafyllos Afouras and Arsha Nagrani and Andrea Vedaldi and Andrew Zisserman",
              booktitle    = "CVPR",
              year         = "2021"}
