Wav2Vec for speech recognition, classification, and audio classification

Last update: Dec 15, 2022

Overview

Soxan

In Persian, it is called "Sokhan" (سخن).

This repository contains models, scripts, and notebooks for applying Wav2Vec 2.0 in your research. The sections below show how to train speech tasks on your own dataset and how to use the pretrained models.

How to train

I'm only at the beginning of covering all the possible speech tasks. To start, the training script is demonstrated on the speech emotion recognition problem.

Training - Notebook

Task	Notebook
Speech Emotion Recognition (Wav2Vec 2.0)
Speech Emotion Recognition (Hubert)
Audio Classification (Wav2Vec 2.0)

Training - CMD

# --model_mode accepts "wav2vec2" or "hubert"
python3 run_wav2vec_clf.py \
    --pooling_mode="mean" \
    --model_name_or_path="lighteternal/wav2vec2-large-xlsr-53-greek" \
    --model_mode="wav2vec2" \
    --output_dir=/path/to/output \
    --cache_dir=/path/to/cache/ \
    --train_file=/path/to/train.csv \
    --validation_file=/path/to/dev.csv \
    --test_file=/path/to/test.csv \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --gradient_accumulation_steps=2 \
    --learning_rate=1e-4 \
    --num_train_epochs=5.0 \
    --evaluation_strategy="steps" \
    --save_steps=100 \
    --eval_steps=100 \
    --logging_steps=100 \
    --save_total_limit=2 \
    --do_eval \
    --do_train \
    --fp16 \
    --freeze_feature_extractor
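
The batch-related flags in the command above interact: with gradient accumulation, the optimizer steps once every `gradient_accumulation_steps` batches. The sketch below works out the effective batch size and a rough number of optimizer steps for those flag values. The dataset size (1000 examples) and the single-device setup are assumptions chosen for illustration, and the step count is a simple ceiling approximation, not the Trainer's exact bookkeeping.

```python
import math


def training_schedule(num_train_examples, per_device_batch_size,
                      gradient_accumulation_steps, num_epochs,
                      save_steps, num_devices=1):
    """Estimate batch size and step counts from the training flags."""
    # Examples consumed per optimizer step
    effective_batch_size = (
        per_device_batch_size * gradient_accumulation_steps * num_devices
    )
    # Optimizer steps needed to see the whole training set once (approximate)
    steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
    total_steps = int(steps_per_epoch * num_epochs)
    return {
        "effective_batch_size": effective_batch_size,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Checkpoints written over the run; --save_total_limit caps how many are kept
        "checkpoints_written": total_steps // save_steps,
    }


# Flag values from the command above; 1000 training examples is a hypothetical size.
schedule = training_schedule(
    num_train_examples=1000,
    per_device_batch_size=4,
    gradient_accumulation_steps=2,
    num_epochs=5.0,
    save_steps=100,
)
print(schedule)
```

Under these assumptions, each optimizer step sees 8 examples, so raising `--gradient_accumulation_steps` is one way to get a larger effective batch without using more GPU memory per step.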

Prediction

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchaudio
from transformers import AutoConfig, Wav2Vec2FeatureExtractor
from src.models import Wav2Vec2ForSpeechClassification, HubertForSpeechClassification

model_name_or_path = "path/to/your-pretrained-model"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
config = AutoConfig.from_pretrained(model_name_or_path)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name_or_path)
sampling_rate = feature_extractor.sampling_rate

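# Load exactly one of the two classes, matching the architecture of your checkpoint.
# for wav2vec
model = Wav2Vec2ForSpeechClassification.from_pretrained(model_name_or_path).to(device)

# for hubert (uncomment and use instead of the line above)
# model = HubertForSpeechClassification.from_pretrained(model_name_or_path).to(device)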


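def speech_file_to_array_fn(path, sampling_rate):
    # torchaudio.load returns a (channels, frames) tensor and the file's sample rate
    speech_array, _sampling_rate = torchaudio.load(path)
    # Resample to the rate the feature extractor expects
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    # squeeze() drops the channel axis, so this assumes mono input
    speech = resampler(speech_array).squeeze().numpy()
    return speech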


def predict(path, sampling_rate):
    speech = speech_file_to_array_fn(path, sampling_rate)
    inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt", padding=True)
    inputs = {key: inputs[key].to(device) for key in inputs}

    with torch.no_grad():
        logits = model(**inputs).logits

    scores = F.softmax(logits, dim=1).detach().cpu().numpy()[0]
    outputs = [{"Emotion": config.id2label[i], "Score": f"{score * 100:.1f}%"}
               for i, score in enumerate(scores)]
    return outputs


path = "/path/to/disgust.wav"
outputs = predict(path, sampling_rate)

Output:

[
    {'Emotion': 'anger', 'Score': '0.0%'},
    {'Emotion': 'disgust', 'Score': '99.2%'},
    {'Emotion': 'fear', 'Score': '0.1%'},
    {'Emotion': 'happiness', 'Score': '0.3%'},
    {'Emotion': 'sadness', 'Score': '0.5%'}
]
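
Often only the most likely label is needed, not the full list. A minimal sketch for picking it out of a list shaped like the output above follows; `top_prediction` is a hypothetical helper, not part of the repository, and it assumes each "Score" is a percentage string ending in "%".

```python
def top_prediction(outputs):
    """Return the entry with the highest score from a predict()-style list."""
    # "99.2%" -> 99.2, so entries can be compared numerically
    return max(outputs, key=lambda item: float(item["Score"].rstrip("%")))


# The sample output shown above
outputs = [
    {"Emotion": "anger", "Score": "0.0%"},
    {"Emotion": "disgust", "Score": "99.2%"},
    {"Emotion": "fear", "Score": "0.1%"},
    {"Emotion": "happiness", "Score": "0.3%"},
    {"Emotion": "sadness", "Score": "0.5%"},
]

best = top_prediction(outputs)
print(best["Emotion"], best["Score"])  # disgust 99.2%
```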

Demos

Demo	Link
Speech To Text With Emotion Recognition (Persian) - soon	huggingface.co/spaces/m3hrdadfi/speech-text-emotion

Models

Dataset	Model
ShEMO: a large-scale validated database for Persian speech emotion detection	m3hrdadfi/wav2vec2-xlsr-persian-speech-emotion-recognition
ShEMO: a large-scale validated database for Persian speech emotion detection	m3hrdadfi/hubert-base-persian-speech-emotion-recognition
ShEMO: a large-scale validated database for Persian speech emotion detection	m3hrdadfi/hubert-base-persian-speech-gender-recognition
Speech Emotion Recognition (Greek) (AESDD)	m3hrdadfi/hubert-large-greek-speech-emotion-recognition
Speech Emotion Recognition (Greek) (AESDD)	m3hrdadfi/hubert-base-greek-speech-emotion-recognition
Speech Emotion Recognition (Greek) (AESDD)	m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition
Eating Sound Collection	m3hrdadfi/wav2vec2-base-100k-eating-sound-collection
GTZAN Dataset - Music Genre Classification	m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres
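
The Prediction code needs a different model class for Wav2Vec 2.0 and Hubert checkpoints. The sketch below guesses which one applies from the model name. `infer_model_mode` is a hypothetical helper, and the rule it uses (the name after the slash starts with "hubert" or "wav2vec2") is an assumption read off the names in the table above, not something the repository guarantees.

```python
# Model identifiers copied from the table above
MODELS = [
    "m3hrdadfi/wav2vec2-xlsr-persian-speech-emotion-recognition",
    "m3hrdadfi/hubert-base-persian-speech-emotion-recognition",
    "m3hrdadfi/hubert-base-persian-speech-gender-recognition",
    "m3hrdadfi/hubert-large-greek-speech-emotion-recognition",
    "m3hrdadfi/hubert-base-greek-speech-emotion-recognition",
    "m3hrdadfi/wav2vec2-xlsr-greek-speech-emotion-recognition",
    "m3hrdadfi/wav2vec2-base-100k-eating-sound-collection",
    "m3hrdadfi/wav2vec2-base-100k-gtzan-music-genres",
]


def infer_model_mode(model_id):
    """Guess the architecture ("wav2vec2" or "hubert") from a model identifier."""
    # Drop the owner prefix, e.g. "m3hrdadfi/"
    name = model_id.split("/")[-1]
    for mode in ("hubert", "wav2vec2"):
        if name.startswith(mode):
            return mode
    raise ValueError(f"Cannot infer model mode from {model_id!r}")


for model_id in MODELS:
    print(f"{infer_model_mode(model_id):9s} {model_id}")
```

A "hubert" result would correspond to `HubertForSpeechClassification` and a "wav2vec2" result to `Wav2Vec2ForSpeechClassification` in the Prediction code.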

Owner

Mehrdad Farahani
