MUSIC-AVQA, CVPR2022 (ORAL)

Last update: Dec 23, 2022

Overview

Audio-Visual Question Answering (AVQA)

PyTorch code accompanying our CVPR 2022 paper:

Learning to Answer Questions in Dynamic Audio-Visual Scenarios (Oral Presentation)

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen and Di Hu

Resources: [Paper], [Supplementary], [Poster], [Video]

Project Homepage: https://gewu-lab.github.io/MUSIC-AVQA/

What is the Audio-Visual Question Answering Task?

We focus on the audio-visual question answering (AVQA) task, which aims to answer questions about different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes.

MUSIC-AVQA Dataset

The large-scale MUSIC-AVQA dataset of musical performances contains 45,867 question-answer pairs distributed over 9,288 videos totaling more than 150 hours. All QA pair types are divided into 3 modal scenarios, which contain 9 question types and 33 question templates. Finally, because our AVQA task is posed as an open-ended problem, all 42 kinds of answers constitute a candidate set for selection.
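
Because answers are selected from a fixed set, answer prediction can be treated as classification over that set. The sketch below illustrates the idea with a toy answer list. The function names `build_answer_vocab` and `pick_answer` and the example answers are illustrative assumptions, not code or data from this repository.

```python
def build_answer_vocab(answers):
    """Map each distinct answer string to a class index (sorted for determinism)."""
    return {ans: idx for idx, ans in enumerate(sorted(set(answers)))}


def pick_answer(scores, vocab):
    """Return the answer whose class index has the highest score."""
    if len(scores) != len(vocab):
        raise ValueError("expected one score per candidate answer")
    index_to_answer = {idx: ans for ans, idx in vocab.items()}
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return index_to_answer[best_index]


# Toy answers for illustration only (hypothetical, not read from the dataset).
vocab = build_answer_vocab(["yes", "no", "two", "yes", "piano"])
prediction = pick_answer([0.1, 0.2, 0.6, 0.1], vocab)
```

With the full dataset, the vocabulary built this way would hold the 42 answer kinds mentioned above.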

QA examples

Model Overview

To solve the AVQA problem, we propose a spatio-temporal grounding model that achieves scene understanding and reasoning over the audio and visual modalities. An overview of the proposed framework is illustrated in the figure below.

Requirements

Python 3.6+
PyTorch 1.6.0
tensorboardX
ffmpeg
numpy

Usage

Clone this repo

git clone https://github.com/GeWu-Lab/MUSIC-AVQA_CVPR2022.git

Download data

Annotations (QA pairs, etc.)
- Available for download here.
- The annotation files are stored in JSON format. Each annotation file uses seven different keywords. See the Project Homepage for more details.
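
To get a quick look at an annotation file before training, a small loader such as the one below may help. It is a minimal sketch that assumes each file holds a JSON list of objects, one per QA pair. That layout, the function name `summarize_annotations`, and the placeholder path are assumptions, so adjust them to the files you actually download.

```python
import json
from collections import Counter


def summarize_annotations(path):
    """Load a JSON annotation file and count how often each key appears.

    Assumes the file holds a JSON list of objects (one per QA pair).
    """
    with open(path, "r", encoding="utf-8") as f:
        entries = json.load(f)
    key_counts = Counter(key for entry in entries for key in entry)
    return len(entries), dict(key_counts)


# Example call; "annotations.json" is a placeholder path:
# num_entries, key_counts = summarize_annotations("annotations.json")
```

If every entry is complete, each key's count should equal the number of entries.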
Features
- We use VGGish, ResNet18, and ResNet (2+1)D to extract audio, 2D frame-level, and 3D snippet-level features, respectively.
- The audio and visual features of the videos in the MUSIC-AVQA dataset can be downloaded from Baidu Drive (password: cvpr):
  - VGGish feature shape: [T, 128] Download (112.7M)
  - ResNet18 feature shape: [T, 512] Download (972.6M)
  - R(2+1)D feature shape: [T, 512] Download (973.9M)
- The features are in the ./data/feats folder.
- The 14x14 features are too large to share, but they can be extracted from the raw video frames.
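
After downloading, it can be useful to confirm that each feature array matches the shapes listed above. The following is a sketch only: it assumes one NumPy-loadable array per video, and the names `EXPECTED_DIMS` and `check_feature` as well as the dictionary keys are hypothetical.

```python
import numpy as np

# Feature widths as stated in the list above; the key names are illustrative.
EXPECTED_DIMS = {"vggish": 128, "resnet18": 512, "r2plus1d": 512}


def check_feature(array, kind):
    """Validate that a per-video feature array has shape [T, D] for its kind."""
    expected = EXPECTED_DIMS[kind]
    if array.ndim != 2 or array.shape[1] != expected:
        raise ValueError(f"{kind}: expected [T, {expected}], got {list(array.shape)}")
    return array.shape[0]  # number of time steps T


# Stand-in array; a real feature would be loaded from disk (assumed format).
num_steps = check_feature(np.zeros((60, 128), dtype=np.float32), "vggish")
```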
Download videos frames
- Raw videos: Available at Baidu Drive (password: cvpr):
  - Real videos (36.67GB)
  - Synthetic videos (11.59GB)
  Note: Please move all downloaded videos into a single folder, for example a new folder named MUSIC-AVQA-Videos, which should then contain all 9,288 real and synthetic videos.
- Raw video frames (1fps): Available at Baidu Drive (14.84GB) (password: cvpr).
- Download raw videos in the MUSIC-AVQA dataset. The downloaded videos will be in the /data/video folder.
- Pandas and ffmpeg libraries are required.
Data pre-processing

Extract audio waveforms from videos. The extracted audio files will be in the ./data/audio folder. The moviepy library is used to read the videos and extract the audio.
```
python feat_script/extract_audio_cues/extract_audio.py
```
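
For readers who prefer not to install moviepy, the same step can be approximated with the ffmpeg command line, which is already listed under Requirements. This is a substitute technique, not what the repository script does. The 16 kHz mono output and the function names are assumptions made for this sketch.

```python
import os
import subprocess


def build_audio_command(video_path, out_dir, sample_rate=16000):
    """Build an ffmpeg command that extracts a mono WAV track from a video."""
    stem = os.path.splitext(os.path.basename(video_path))[0]
    wav_path = os.path.join(out_dir, stem + ".wav")
    return [
        "ffmpeg", "-y",           # overwrite an existing output file
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mono (assumed, not taken from the script)
        "-ar", str(sample_rate),  # sample rate (assumed, not taken from the script)
        wav_path,
    ]


def extract_audio(video_path, out_dir):
    """Run the command; requires the ffmpeg binary on PATH."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(build_audio_command(video_path, out_dir), check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the command before running it over the whole dataset.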
Extract video frames from videos. The extracted frames will be in the data/frames folder.
```
python feat_script/extract_visual_frames/extract_frames_adaptive_script.py
```
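
The provided raw frames are sampled at 1 fps. The logic of the adaptive script is not documented here, so the sketch below only shows plain uniform sampling: which frame indices to keep for a given source frame rate. The function name and parameters are hypothetical.

```python
def sample_frame_indices(num_frames, video_fps, target_fps=1.0):
    """Return indices of the frames to keep when sampling at target_fps."""
    if num_frames <= 0 or video_fps <= 0 or target_fps <= 0:
        return []
    # Source frames per kept frame; never below 1 to avoid duplicate indices.
    step = max(1.0, video_fps / target_fps)
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# A 3-second clip at 30 fps keeps one frame per second.
kept = sample_frame_indices(num_frames=90, video_fps=30)
```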

Feature extraction

Audio feature. TensorFlow 1.4 and a VGGish model pretrained on AudioSet are required. The feature files can also be downloaded here (password: cvpr).

python feat_script/extract_audio_feat/audio_feature_extractor.py

2D visual feature. Pretrained models library is required.

python feat_script/eatract_visual_feat/extract_rgb_feat.py

3D visual feature.

python feat_script/eatract_visual_feat/extract_3d_feat.py

14x14 visual feature.

python feat_script/extract_visual_feat_14x14/extract_14x14_feat.py

Baseline Model

Training

python net_grd_baseline/main_qa_grd_baseline.py --mode train

Testing

python net_grd_baseline/main_qa_grd_baseline.py --mode test

Our Audio-Visual Spatial-Temporal Model

We provide trained models and you can quickly test the results. Test results may vary slightly on different machines.

python net_grd_avst/main_avst.py --mode test \
	--audio_dir "path to your audio features" \
	--video_res14x14_dir "path to your visual res14x14 features"

Audio-Visual grounding generation

python grounding_gen/main_grd_gen.py

Training

python net_grd_avst/main_avst.py --mode train \
	--audio_dir "path to your audio features" \
	--video_res14x14_dir "path to your visual res14x14 features"

Testing

python net_grd_avst/main_avst.py --mode test \
	--audio_dir "path to your audio features" \
	--video_res14x14_dir "path to your visual res14x14 features"
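
As a note on flag syntax, a Python script that parses its options with argparse accepts a value either as `--flag value` or as `--flag=value`. Spaces around `=` would make the shell pass `=` and the path as separate arguments. The sketch below shows how the three flags used in the commands above could be declared. The defaults and `required` settings are assumptions, and the actual `main_avst.py` may define more options or different defaults.

```python
import argparse


def build_parser():
    """Parser for the three flags shown in the commands above (sketch)."""
    parser = argparse.ArgumentParser(description="AVQA train/test entry point (sketch)")
    parser.add_argument("--mode", choices=["train", "test"], default="train")
    parser.add_argument("--audio_dir", type=str, required=True,
                        help="folder with audio features")
    parser.add_argument("--video_res14x14_dir", type=str, required=True,
                        help="folder with 14x14 visual features")
    return parser


# Placeholder paths for illustration only.
args = build_parser().parse_args(
    ["--mode", "test", "--audio_dir", "feats/audio",
     "--video_res14x14_dir", "feats/res14x14"]
)
```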

Results

Audio-visual video question answering results of different methods on the test set of MUSIC-AVQA. The top-2 results are highlighted. Please see the citations in the [Paper] for comparison methods.
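
To reproduce a comparable breakdown from your own predictions, a per-type score can be computed as sketched below. This assumes results are reported as answer accuracy in percent, which is not stated in this section. The record layout and the question-type labels in the example are placeholders.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Compute accuracy (%) per question type and overall.

    Each record is a tuple (question_type, predicted_answer, true_answer).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, predicted, truth in records:
        total[qtype] += 1
        correct[qtype] += int(predicted == truth)
    result = {q: 100.0 * correct[q] / total[q] for q in total}
    result["overall"] = 100.0 * sum(correct.values()) / max(1, sum(total.values()))
    return result


# Placeholder records; the type labels are illustrative, not the dataset's own.
scores = accuracy_by_type([
    ("audio", "yes", "yes"),
    ("audio", "no", "yes"),
    ("visual", "two", "two"),
])
```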
Visualized spatio-temporal grounding results

We provide several visualized spatial grounding results. The heatmap indicates the location of the sounding source. Through the spatial grounding results, the sounding objects are visually captured, which can facilitate spatial reasoning.

First, create the ./grounding_gen/models_grd_vis/ directory.
```
python grounding_gen/main_grd_gen_vis.py
```

Citation

If you find this work useful, please consider citing it.


@ARTICLE{Li2022Learning,
  title   = {Learning to Answer Questions in Dynamic Audio-Visual Scenarios},
  author  = {Guangyao Li and Yake Wei and Yapeng Tian and Chenliang Xu and Ji-Rong Wen and Di Hu},
  journal = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
}

Acknowledgement

This research was supported by Public Computing Cloud, Renmin University of China.

License

This project is released under the GNU General Public License v3.0.
