The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"

Overview

Hierarchical Token Semantic Audio Transformer

Introduction

The Code Repository for "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection", in ICASSP 2022.

In this paper, we devise a model, HTS-AT, by combining a swin transformer with a token-semantic module and adapt it in to audio classification and sound event detection tasks. HTS-AT is an efficient and light-weight audio transformer with a hierarchical structure and has only 30 million parameters. It achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models.

HTS-AT Architecture

Classification Results on AudioSet, ESC-50, and Speech Command V2 (mAP)

HTS-AT ClS Result

Localization/Detection Results on DESED dataset (F1-Score)

HTS-AT Localization Result

Getting Started

Install Requirments

pip install -r requirements.txt

Download and Processing Datasets

  • config.py
change the varible "dataset_path" to your audioset address
change the variable "desed_folder" to your DESED address
change the classes_num to 527
./create_index.sh # 
// remember to change the pathes in the script
// more information about this script is in https://github.com/qiuqiangkong/audioset_tagging_cnn

python main.py save_idc 
// count the number of samples in each class and save the npy files
Open the jupyter notebook at esc-50/prep_esc50.ipynb and process it
Open the jupyter notebook at scv2/prep_scv2.ipynb and process it
python conver_desed.py 
// will produce the npy data files

Set the Configuration File: config.py

The script config.py contains all configurations you need to assign to run your code. Please read the introduction comments in the file and change your settings. For the most important part: If you want to train/test your model on AudioSet, you need to set:

dataset_path = "your processed audioset folder"
dataset_type = "audioset"
balanced_data = True
loss_type = "clip_bce"
sample_rate = 32000
hop_size = 320 
classes_num = 527

If you want to train/test your model on ESC-50, you need to set:

dataset_path = "your processed ESC-50 folder"
dataset_type = "esc-50"
loss_type = "clip_ce"
sample_rate = 32000
hop_size = 320 
classes_num = 50

If you want to train/test your model on Speech Command V2, you need to set:

dataset_path = "your processed SCV2 folder"
dataset_type = "scv2"
loss_type = "clip_bce"
sample_rate = 16000
hop_size = 160
classes_num = 35

If you want to test your model on DESED, you need to set:

resume_checkpoint = "Your checkpoint on AudioSet"
heatmap_dir = "localization results output folder"
test_file = "output heatmap name"
fl_local = True
fl_dataset = "Your DESED npy file"

Train and Evaluation

Notice: Our model is run on DDP mode and requires at least two GPU cards. If you want to use a single GPU for training and evaluation, you need to mannually change sed_model.py and main.py

All scripts is run by main.py:

Train: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py train

Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test

Ensemble Test: CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py esm_test 
// See config.py for settings of ensemble testing

Weight Average: python main.py weight_average
// See config.py for settings of weight averaging

Localization on DESED

CUDA_VISIBLE_DEVICES=1,2,3,4 python main.py test
// make sure that fl_local=True in config.py
python fl_evaluate.py
// organize and gather the localization results
fl_evaluate_f1.ipynb
// Follow the notebook to produce the results

Model Checkpoints:

We provide the model checkpoints on three datasets (and additionally DESED dataset) in this link. Feel free to download and test it.

Citing

@inproceedings{htsat-ke2022,
  author = {Ke Chen and Xingjian Du and Bilei Zhu and Zejun Ma and Taylor Berg-Kirkpatrick and Shlomo Dubnov},
  title = {HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection},
  booktitle = {{ICASSP} 2022}
}

Our work is based on Swin Transformer, which is a famous image classification transformer model.

Owner
Knut(Ke) Chen
ORZ: { godfather: sweetdum, ufo: zgg, dragon sister: lzl, morning king: corner café }
Knut(Ke) Chen
ConvMAE: Masked Convolution Meets Masked Autoencoders

ConvMAE ConvMAE: Masked Convolution Meets Masked Autoencoders Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1, 1 Shanghai AI Laboratory, 2 M

Alpha VL Team of Shanghai AI Lab 345 Jan 08, 2023
Learning Modified Indicator Functions for Surface Reconstruction

Learning Modified Indicator Functions for Surface Reconstruction In this work, we propose a learning-based approach for implicit surface reconstructio

4 Apr 18, 2022
OpenFed: A Comprehensive and Versatile Open-Source Federated Learning Framework

OpenFed: A Comprehensive and Versatile Open-Source Federated Learning Framework Introduction OpenFed is a foundational library for federated learning

25 Dec 12, 2022
PyTorch Personal Trainer: My framework for deep learning experiments

Alex's PyTorch Personal Trainer (ptpt) (name subject to change) This repository contains my personal lightweight framework for deep learning projects

Alex McKinney 8 Jul 14, 2022
Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Network Pruning That Matters: A Case Study on Retraining Variants (ICLR 2021)

Duong H. Le 18 Jun 13, 2022
Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neural Network

ild-cnn This is supplementary material for the manuscript: "Lung Pattern Classification for Interstitial Lung Diseases Using a Deep Convolutional Neur

22 Nov 05, 2022
Alphabetical Letter Recognition

BayeesNetworks-Image-Classification Alphabetical Letter Recognition In these demo we are using "Bayees Networks" Our database is composed by Learning

Mohammed Firass 4 Nov 30, 2021
Style-based Neural Drum Synthesis with GAN inversion

Style-based Drum Synthesis with GAN Inversion Demo TensorFlow implementation of a style-based version of the adversarial drum synth (ADS) from the pap

Sound and Music Analysis (SoMA) Group 29 Nov 19, 2022
Defending against Model Stealing via Verifying Embedded External Features

Defending against Model Stealing Attacks via Verifying Embedded External Features This is the official implementation of our paper Defending against M

20 Dec 30, 2022
A Python Library for Graph Outlier Detection (Anomaly Detection)

PyGOD is a Python library for graph outlier detection (anomaly detection). This exciting yet challenging field has many key applications, e.g., detect

PyGOD Team 757 Jan 04, 2023
Code and experiments for "Deep Neural Networks for Rank Consistent Ordinal Regression based on Conditional Probabilities"

corn-ordinal-neuralnet This repository contains the orginal model code and experiment logs for the paper "Deep Neural Networks for Rank Consistent Ord

Raschka Research Group 14 Dec 27, 2022
Adversarial Framework for (non-) Parametric Image Stylisation Mosaics

Fully Adversarial Mosaics (FAMOS) Pytorch implementation of the paper "Copy the Old or Paint Anew? An Adversarial Framework for (non-) Parametric Imag

Zalando Research 120 Dec 24, 2022
RaceBERT -- A transformer based model to predict race and ethnicty from names

RaceBERT -- A transformer based model to predict race and ethnicty from names Installation pip install racebert Using a virtual environment is highly

Prasanna Parasurama 3 Nov 02, 2022
Neural Articulated Radiance Field

Neural Articulated Radiance Field NARF Neural Articulated Radiance Field Atsuhiro Noguchi, Xiao Sun, Stephen Lin, Tatsuya Harada ICCV 2021 [Paper] [Co

Atsuhiro Noguchi 144 Jan 03, 2023
Scenic: A Jax Library for Computer Vision and Beyond

Scenic Scenic is a codebase with a focus on research around attention-based models for computer vision. Scenic has been successfully used to develop c

Google Research 1.6k Dec 27, 2022
Contrastive Learning with Non-Semantic Negatives

Contrastive Learning with Non-Semantic Negatives This repository is the official implementation of Robust Contrastive Learning Using Negative Samples

39 Jul 31, 2022
Pytorch library for end-to-end transformer models training and serving

Pytorch library for end-to-end transformer models training and serving

Mikhail Grankin 768 Jan 01, 2023
A spherical CNN for weather forecasting

DeepSphere-Weather - Deep Learning on the sphere for weather/climate applications. The code in this repository provides a scalable and flexible framew

DeepSphere 47 Dec 25, 2022
Code for the paper: Hierarchical Reinforcement Learning With Timed Subgoals, published at NeurIPS 2021

Hierarchical reinforcement learning with Timed Subgoals (HiTS) This repository contains code for reproducing experiments from our paper "Hierarchical

Autonomous Learning Group 21 Dec 03, 2022
Official PyTorch Implementation for InfoSwap: Information Bottleneck Disentanglement for Identity Swapping

InfoSwap: Information Bottleneck Disentanglement for Identity Swapping Code usage Please check out the user manual page. Paper Gege Gao, Huaibo Huang,

Grace Hešeri 56 Dec 20, 2022