CPC-big and k-means clustering for zero-resource speech processing

Last update: Nov 23, 2022

Overview

Contrastive Predictive Coding

This repository provides the CPC-big model and k-means checkpoints used in the paper "Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing".

Contrastive predictive coding (CPC) aims to learn representations of speech by distinguishing future observations from a set of negative examples. Previous work has shown that linear classifiers trained on CPC features can accurately predict speaker and phone labels. However, it is unclear how the features actually capture speaker and phonetic information, and whether it is possible to normalize out the irrelevant details (depending on the downstream task). In this paper, we first show that the per-utterance mean of CPC features captures speaker information to a large extent. Concretely, we find that comparing means performs well on a speaker verification task. Next, probing experiments show that standardizing the features effectively removes speaker information. Based on this observation, we propose a speaker normalization step to improve acoustic unit discovery using k-means clustering of CPC features. Finally, we show that a language model trained on the resulting units achieves some of the best results in the ZeroSpeech 2021 Challenge.
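The two observations above (the per-utterance mean carries speaker information, and standardizing removes it) can be illustrated with a minimal sketch. It runs on synthetic arrays standing in for CPC features. The helper names `speaker_normalize` and `mean_similarity` are hypothetical, and the use of cosine similarity to compare means is an assumption rather than the paper's confirmed scoring method.

```python
import numpy as np

def speaker_normalize(features):
    """Standardize each feature dimension over one utterance (zero mean, unit variance)."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # Guard against division by zero for constant dimensions.
    return (features - mean) / np.maximum(std, 1e-8)

def mean_similarity(features_a, features_b):
    """Cosine similarity between the per-utterance means (assumed scoring rule)."""
    a = features_a.mean(axis=0)
    b = features_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Synthetic stand-in for CPC features: (frames, dims) plus a shared "speaker" offset.
speaker_offset = rng.normal(size=(1, 16))
utt_a = rng.normal(size=(200, 16)) + speaker_offset
utt_b = rng.normal(size=(150, 16)) + speaker_offset

similarity = mean_similarity(utt_a, utt_b)  # high when both share the offset
normalized = speaker_normalize(utt_a)       # per-utterance mean is now ~0
```

After normalization the per-utterance mean is zero in every dimension, so the offset that made the two utterances look alike is no longer available to a downstream clustering step.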

Basic Usage

import torch, torchaudio
from sklearn.preprocessing import StandardScaler

# Load model checkpoints
cpc = torch.hub.load("bshall/cpc:main", "cpc").cuda()
kmeans = torch.hub.load("bshall/cpc:main", "kmeans50")

# Load audio
wav, sr = torchaudio.load("path/to/wav")
assert sr == 16000
wav = wav.unsqueeze(0).cuda()

x = cpc.encode(wav).squeeze().cpu().numpy()  # Encode
x = StandardScaler().fit_transform(x)  # Speaker normalize
codes = kmeans.predict(x)  # Discretize

Note that the encode function is stateful: it keeps the hidden state of the LSTM from previous calls, so the features returned for one input can depend on audio encoded earlier.
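The discretization step in the usage example maps each normalized frame to a cluster index. As a rough sketch of what such a `predict` call does, assuming the checkpoint behaves like a standard k-means model that assigns each frame to its nearest centroid, the following uses toy 2-D centroids in place of the released ones. The function name `assign_codes` is hypothetical.

```python
import numpy as np

def assign_codes(features, centroids):
    """Return the index of the nearest centroid (squared Euclidean) for each frame."""
    # Pairwise squared distances with shape (frames, clusters).
    distances = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return distances.argmin(axis=1)

# Toy data: three centroids and four frames in two dimensions.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
frames = np.array([[0.2, -0.1], [4.8, 5.3], [-4.9, 4.7], [0.1, 0.3]])

codes = assign_codes(frames, centroids)
print(codes.tolist())  # [0, 1, 2, 0]
```

With the real checkpoints, the features would have the CPC feature dimension and there would be 50 or 100 centroids, but the assignment rule sketched here would be the same under the stated assumption.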

Encode an Audio Dataset

Clone the repo and use the encode.py script:

usage: encode.py [-h] in_dir out_dir

Encode an audio dataset using CPC-big (with speaker normalization and discretization).

positional arguments:
  in_dir      Path to the directory to encode.
  out_dir     Path to the output directory.

optional arguments:
  -h, --help  show this help message and exit
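For readers who want to write their own batch loop instead of using the script, the sketch below shows one way to pair every audio file under an input directory with an output path that mirrors the directory layout. This is not taken from encode.py: the function name `plan_outputs`, the `*.wav` pattern, and the `.npy` output suffix are all assumptions made for illustration.

```python
from pathlib import Path

def plan_outputs(in_dir, out_dir, pattern="*.wav", suffix=".npy"):
    """Pair each audio file under in_dir with a mirrored output path under out_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    pairs = []
    for in_path in sorted(in_dir.rglob(pattern)):
        # Keep the relative layout and swap the extension (assumed output format).
        out_path = (out_dir / in_path.relative_to(in_dir)).with_suffix(suffix)
        pairs.append((in_path, out_path))
    return pairs
```

Each pair could then be passed through the encode, normalize, and discretize steps from Basic Usage, creating `out_path.parent` before saving.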

Releases

v0.1 (Oct 15, 2021)
This release contains:

The checkpoint for the CPC-big model (adapted from https://download.zerospeech.com/)

k-means with 50 clusters (trained on normalized CPC features)

k-means with 100 clusters (trained on normalized CPC features)

cpc-d7475380.pt (36.08 MB)
kmeans100-c7eda98e.pt (200.79 KB)
kmeans50-89accca9.pt (100.79 KB)

Owner

Benjamin van Niekerk