Overview

python_speech_features

This library provides common speech features for ASR including MFCCs and filterbank energies. If you are not sure what MFCCs are and would like to know more, have a look at this MFCC tutorial.

Project Documentation

To cite, please use: James Lyons et al. (2020, January 14). jameslyons/python_speech_features: release v0.6.1 (Version 0.6.1). Zenodo. http://doi.org/10.5281/zenodo.3607820

Installation

This project is on PyPI.

To install from PyPI:

pip install python_speech_features

From this repository:

git clone https://github.com/jameslyons/python_speech_features
cd python_speech_features
python setup.py develop

Usage

Supported features:

  • Mel Frequency Cepstral Coefficients
  • Filterbank Energies
  • Log Filterbank Energies
  • Spectral Subband Centroids

Example use
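
A minimal sketch of typical use, assuming a mono 16-bit PCM WAV file. The helper names `read_wav_mono16` and `compute_features` are made up for this example; only `mfcc` and `logfbank` come from the library, and both are called with their default parameters.

```python
import wave

import numpy as np


def read_wav_mono16(path):
    """Read a mono 16-bit PCM WAV file and return (samplerate, samples)."""
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
            raise ValueError("expected mono 16-bit PCM audio")
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    # WAV stores samples as little-endian signed 16-bit integers.
    return rate, np.frombuffer(raw, dtype="<i2")


def compute_features(path):
    """Return (MFCCs, log filterbank energies) for one WAV file."""
    # Imported here so the WAV helper stays usable without the library.
    from python_speech_features import logfbank, mfcc

    rate, sig = read_wav_mono16(path)
    mfcc_feat = mfcc(sig, rate)        # one row per frame, 13 columns by default
    fbank_feat = logfbank(sig, rate)   # one row per frame, 26 columns by default
    return mfcc_feat, fbank_feat
```

With a hypothetical path such as `file.wav`, `mfcc_feat, fbank_feat = compute_features("file.wav")` gives two arrays with one row per frame.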

From here you can write the features to a file etc.

MFCC Features

The default parameters should work fairly well for most cases. If you want to change the MFCC parameters, the following parameters are supported:

def mfcc(signal, samplerate=16000, winlen=0.025, winstep=0.01, numcep=13,
         nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97,
         ceplifter=22, appendEnergy=True)
| Parameter | Description |
| --- | --- |
| signal | The audio signal from which to compute features. Should be an N*1 array. |
| samplerate | The sample rate of the signal we are working with. |
| winlen | The length of the analysis window in seconds. Default is 0.025s (25 milliseconds). |
| winstep | The step between successive windows in seconds. Default is 0.01s (10 milliseconds). |
| numcep | The number of cepstral coefficients to return. Default is 13. |
| nfilt | The number of filters in the filterbank. Default is 26. |
| nfft | The FFT size. Default is 512. |
| lowfreq | Lowest band edge of mel filters, in Hz. Default is 0. |
| highfreq | Highest band edge of mel filters, in Hz. Default is samplerate/2. |
| preemph | Apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97. |
| ceplifter | Apply a lifter to final cepstral coefficients. 0 is no lifter. Default is 22. |
| appendEnergy | If this is true, the zeroth cepstral coefficient is replaced with the log of the total frame energy. |
| returns | A numpy array of size (NUMFRAMES by numcep) containing features. Each row holds 1 feature vector. |
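
NUMFRAMES in the return shape depends on the signal length, winlen and winstep. A rough sketch of that relationship, assuming the frame length and step are rounded half up to whole samples and the last partial frame is zero-padded. That rule is an assumption about the library's framing, and `round_half_up` and `num_frames` are names made up for this example.

```python
import math
from decimal import ROUND_HALF_UP, Decimal


def round_half_up(x):
    """Round to the nearest integer, with .5 always rounding up."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def num_frames(num_samples, samplerate=16000, winlen=0.025, winstep=0.01):
    """Estimate how many frames a signal of num_samples samples yields."""
    frame_len = round_half_up(winlen * samplerate)    # 400 samples at 16 kHz
    frame_step = round_half_up(winstep * samplerate)  # 160 samples at 16 kHz
    if num_samples <= frame_len:
        return 1
    # Assumed rule: a trailing partial frame is padded rather than dropped.
    return 1 + math.ceil((num_samples - frame_len) / frame_step)


one_second = num_frames(16000)  # one second of 16 kHz audio
```

Under these assumptions, one second of 16 kHz audio gives 99 frames, so `mfcc` with the defaults would return a 99 by 13 array.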

Filterbank Features

These features are raw filterbank energies. For most applications you will want the logarithm of these features. The default parameters should work fairly well for most cases. If you want to change the fbank parameters, the following parameters are supported:

def fbank(signal, samplerate=16000, winlen=0.025, winstep=0.01,
          nfilt=26, nfft=512, lowfreq=0, highfreq=None, preemph=0.97)
| Parameter | Description |
| --- | --- |
| signal | The audio signal from which to compute features. Should be an N*1 array. |
| samplerate | The sample rate of the signal we are working with. |
| winlen | The length of the analysis window in seconds. Default is 0.025s (25 milliseconds). |
| winstep | The step between successive windows in seconds. Default is 0.01s (10 milliseconds). |
| nfilt | The number of filters in the filterbank. Default is 26. |
| nfft | The FFT size. Default is 512. |
| lowfreq | Lowest band edge of mel filters, in Hz. Default is 0. |
| highfreq | Highest band edge of mel filters, in Hz. Default is samplerate/2. |
| preemph | Apply preemphasis filter with preemph as coefficient. 0 is no filter. Default is 0.97. |
| returns | A numpy array of size (NUMFRAMES by nfilt) containing features. Each row holds 1 feature vector. The second return value is the energy in each frame (total energy, unwindowed). |
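
Since most applications want the logarithm of these energies, here is a small sketch of taking it by hand while guarding against zero energies. The `log_energies` helper and the sample values are made up for illustration; the library also lists log filterbank energies among its supported features.

```python
import numpy as np


def log_energies(feat):
    """Natural log of filterbank energies, guarding against log(0)."""
    feat = np.asarray(feat, dtype=float)
    eps = np.finfo(float).eps
    # Replace non-positive energies with a tiny positive value first.
    return np.log(np.where(feat <= 0, eps, feat))


# Made-up energies: 2 frames by 3 filters (real output is NUMFRAMES by nfilt).
energies = np.array([[1.0, 10.0, 0.0],
                     [100.0, 0.5, 2.0]])
log_feat = log_energies(energies)
```

The result keeps the (NUMFRAMES by nfilt) shape, and silent bands map to a large negative number instead of minus infinity.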

Reference

sample english.wav obtained from:

wget http://voyager.jpl.nasa.gov/spacecraft/audio/english.au
sox english.au -e signed-integer english.wav
Comments
  • Question about hamming window length

    When using mfcc, the window parameter can be set to numpy.hamming, but numpy.hamming is a function that takes an int as the number of points in the output window (see the numpy.hamming documentation). However, frame_len is used in sigproc.py. Could you please explain how np.hamming works in mfcc? What if I want to pass a specific window length? Thank you!

    opened by leeeeeeo 14
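
A sketch of how a window function passed as a callable can be applied, assuming it is called once with the frame length in samples (which is what the mention of frame_len in the question suggests). `apply_window` is a hypothetical helper, not the library's code.

```python
import numpy as np

samplerate, winlen = 16000, 0.025
frame_len = int(round(winlen * samplerate))   # 400 samples per frame


def apply_window(frames, winfunc):
    """Multiply each row of `frames` by winfunc(frame_len)."""
    return frames * winfunc(frames.shape[1])


frames = np.ones((3, frame_len))              # three dummy frames
windowed = apply_window(frames, np.hamming)   # np.hamming(400) is built inside
```

Under that assumption the window length follows the frame length, so changing winlen is what changes the number of window points.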
  • obtain the noise data

    Hi, my aim is to get the noise data from an audio file recorded in a classroom; that is, I want to remove the teacher's voice. What should I do?

    opened by YueWenWu 7
  • How to ignore the NFFT warning

    I don't want to increase NFFT because I think the distortion is acceptable. Is there any way to ignore the annoying warnings?

    I have tried using the warnings module, but it doesn't work.

    opened by igo312 5
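
One possible explanation, stated as an assumption rather than a confirmed detail of the library: if the message is emitted through Python's logging module instead of the warnings module, filters from the warnings module have no effect on it. The sketch below uses stand-in messages to show how raising the root logger level hides such records.

```python
import io
import logging

# Capture root-logger output in memory so the effect is visible.
stream = io.StringIO()
root = logging.getLogger()
root.addHandler(logging.StreamHandler(stream))
root.setLevel(logging.WARNING)

logging.warning("stand-in for the NFFT message")    # recorded

root.setLevel(logging.ERROR)                        # hide WARNING and below
logging.warning("stand-in for the NFFT message, now hidden")

captured = stream.getvalue()
```

Raising the root level also hides warnings from other code, so it is a blunt tool. When the message comes from a frame that is longer than the FFT size, passing a larger nfft removes it at the source.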
  • Why there's big difference using 16k and 44.1k sample rate

    Hi: I recorded some WAV files at a 44.1k sample rate and then converted them to 16k with sox. I then used this Python script to calculate the MFCC features of the 44.1k file and the 16k file, but found that the results were completely different. For the same file, whether at 44.1k or 16k, I think the result should be the same. Isn't that so?

    opened by robotnc 5
  • What if the frame length is greater than NFFT?

    I'm not an expert in this kind of stuff, so I'm sorry if this is a waste of time.

    From the numpy.fft.rfft documentation [in our case: n=NFFT, input=frame]: "Number of points along transformation axis in the input to use. If n is smaller than the length of the input, the input is cropped. If it is larger, the input is padded with zeros. If n is not given, the length of the input along the axis specified by axis is used."

    Isn't this cropping something we want to avoid? As far as I've seen, there is no check in the code on how the frame size compares to NFFT.

    opened by janluke 4
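
The cropping and padding behaviour quoted above can be checked on a toy frame. The `next_power_of_two` helper is a hypothetical way to pick an FFT size that is at least as long as the frame; it is not part of the library.

```python
import numpy as np

frame = np.arange(8, dtype=float)        # a toy 8-sample frame

cropped = np.fft.rfft(frame, n=4)        # n < len(frame): only frame[:4] is used
padded = np.fft.rfft(frame, n=16)        # n > len(frame): zero-padded to 16


def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    nfft = 1
    while nfft < n:
        nfft *= 2
    return nfft


frame_len = 400                          # 0.025 s at 16000 Hz
nfft = next_power_of_two(frame_len)      # large enough that nothing is cropped
```

With the default 25 ms window, a 16 kHz signal fits inside an FFT size of 512, while a 44.1 kHz signal (about 1103 samples per frame) would need 2048 to avoid cropping.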
  • inconsistent result with HTK

    Hi,

    I tried to compare the MFCC features generated using HTK, and those generated by python_speech_features. Unfortunately, somehow they always mismatch.

    Below is the configuration I used for HTK

    SOURCEFORMAT = NIST
    TARGETKIND = MFCC_0
    TARGETRATE = 100000
    SAVECOMPRESSED = F
    SAVEWITHCRC = F
    WINDOWSIZE = 250000
    USEHAMMING = F
    PREEMCOEF = 0.97
    NUMCHANS = 26
    CEPLIFTER = 22
    NUMCEPS = 12
    ENORMALISE = F
    

    The configuration for python_speech_features is default.

    I also tried adding USEPOWER = F/T, and the features obtained are still very different (in fact, for the file TIMITcorpus/TIMIT/TRAIN/DR8/FBCG1/SX442, I got 358 frames from HTK but only 354 frames from python_speech_features).

    Any insight? I'm a newbie in speech recognition and may have made some silly mistakes.

    opened by zym1010 4
  • Troubles when porting

    Hi, I am trying to port this algorithm to JavaScript and I am running into the following:

    feat = numpy.dot(pspec,fb.T)
    

    (https://github.com/jameslyons/python_speech_features/blob/master/features/base.py#L56)

    The issue I am running into is that pspec and fb here should have the same dimensions, but for some reason they don't. Is there something in the algorithm, some kind of balance between parameters for example, which should cause these two arrays to have the same dimensions?

    opened by mauritslamers 4
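
For a matrix product like the one in the question, the two arrays only need to agree on their inner dimension. A shape-only sketch with random stand-in data, assuming the power spectrum has one row per frame and the filterbank one row per filter, each with nfft/2 + 1 columns:

```python
import numpy as np

numframes, nfft, nfilt = 99, 512, 26
nbins = nfft // 2 + 1                      # 257 bins from a real FFT of size 512

pspec = np.random.rand(numframes, nbins)   # stand-in power spectrum, one row per frame
fb = np.random.rand(nfilt, nbins)          # stand-in filterbank, one row per filter

# (99, 257) times (257, 26) gives (99, 26): NUMFRAMES by nfilt.
feat = np.dot(pspec, fb.T)
```

When porting, the thing to check is that the FFT size used for the spectrum matches the one used to build the filterbank, so both arrays have the same number of columns.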
  • [Question:] How to capture intensity or perceived loudness of a given audio file at regular intervals

    If you are playing a song on your laptop, as you increase the volume from 0 to 100, the audio becomes louder and louder.

    Say I have an .mp3 or .wav file. How do I capture this perceived loudness/intensity at regular intervals (maybe every 0.1 seconds) in the audio using python speech features?

    Any advice is appreciated.

    Thanks, Vivek

    opened by StanSilas 3
  • can't get same result as compute-mfcc-feats.

    compute-mfcc-feats --window-type=hamming --dither=0.0 --use-energy=false --sample-frequency=8000 --num-mel-bins=40 --num-ceps=40 --low-freq=40 --raw-energy=false --remove-dc-offset=false --high-freq=3800 scp:wav.scp ark,scp:feats.ark,feats.scp

    mfcc(signal=sig, samplerate=rate, winlen=0.025, winstep=0.01, numcep=40, nfilt=40, lowfreq=40, highfreq=3800, appendEnergy=False, winfunc = lambda x: np.hamming(x) )

    Is there some difference?

    opened by bjtommychen 3
  • inconsistency with librosa

    I compared the MFCCs from librosa with those from the python_speech_features package and got totally different results.

    Which one is correct? librosa list of first frame coefficients:

    [-395.07433842032867, -7.1149347948192963e-14, 3.5772469223901538e-14, -1.7476140989485184e-14, 3.1665300829452658e-14, -4.4214136625668904e-14, 6.7157035631648599e-14, 1.5013974158050108e-14, 2.9512326634271699e-14, 7.2275398398734558e-14, -1.5043753316598812e-13, -2.2358383003147776e-14, 1.6209256159527285e-13]

    python_speech_features list of first frame coefficients:

    [-169.91598446684722, 1.3219891974654943, 0.22216979881740945, -0.7368248288464827, 0.26268194306407788, 1.8470757480486224, 3.2670900572694435, 2.3726120692753563, 1.4983949546889608, 0.67862219561000914, -0.44705590991616034, 0.39184067109778226, -0.48048214059101707]

    import librosa
    import python_speech_features
    from scipy.signal.windows import hann
    
    n_mfcc = 13
    n_mels = 40
    n_fft = 512 # in librosa, win_length is assumed to be equal to n_fft implicitly
    hop_length = 160
    fmin = 0
    fmax = None
    y, sr = librosa.load(librosa.util.example_audio_file())
    sr = 16000  # fake sample rate just to make the point
    
    # librosa
    mfcc_librosa = librosa.feature.mfcc(y=y, sr=sr, n_fft=n_fft,
                                        n_mfcc=n_mfcc, n_mels=n_mels,
                                        hop_length=hop_length,
                                        fmin=fmin, fmax=fmax)
    
    # python_speech_features
    # no preemph nor ceplifter in librosa, so setting to zero
    # librosa default stft window is hann
    mfcc_speech = python_speech_features.mfcc(signal=y, samplerate=sr, winlen=n_fft / sr, winstep=hop_length / sr,
                                              numcep=n_mfcc, nfilt=n_mels, nfft=n_fft, lowfreq=fmin, highfreq=fmax,
                                              preemph=0, ceplifter=0, appendEnergy=False, winfunc=hann)
    
    
    print(list(mfcc_librosa[:, 0]))
    print(list(mfcc_speech[0, :]))
    
    opened by chananshgong 3
  • Filterbank=80

    It works fine for filterbank=40. But when I try 80, the third filterbank output is a constant value, like this: -36.04365,-36.04365,-36.04365,-36.04365,-36.04365,-36.04365

    I have attached an image showing the speech, spectrogram, and log filterbank for 80 filters.

    opened by madhavsund 3
  • Use another augmented assignment statement

    Some source code analysis tools can help to find opportunities for improving software components. I propose to increase the usage of augmented assignment statements accordingly.

    diff --git a/python_speech_features/sigproc.py b/python_speech_features/sigproc.py
    index a786c4f..b8729ea 100644
    --- a/python_speech_features/sigproc.py
    +++ b/python_speech_features/sigproc.py
    @@ -84,7 +84,7 @@ def deframesig(frames, siglen, frame_len, frame_step, winfunc=lambda x: numpy.on
                                                    indices[i, :]] + win + 1e-15  # add a little bit so it is never zero
             rec_signal[indices[i, :]] = rec_signal[indices[i, :]] + frames[i, :]
     
    -    rec_signal = rec_signal / window_correction
    +    rec_signal /= window_correction
         return rec_signal[0:siglen]
     
     
    
    opened by elfring 0
  • High CPU Utilization

    I observe that extracting MFCC or MFB features uses almost all of the CPU at 100% capacity. I am sure that extracting these features doesn't require so much computation.

    I am processing only one file at a time and not using any parallelization. Here is the code I am using:

    import glob
    import numpy as np
    import scipy.io as sio
    import scipy.io.wavfile
    from python_speech_features import *
    filelist = glob.glob("/home/divraj/scribetech/dataset/voxceleb1/test/wav/*/*/*.wav")
    for file in filelist:
    	sr, audio = sio.wavfile.read(file)
    	features, energies = fbank(audio, samplerate=16000, nfilt=40, winlen=0.025, winfunc=np.hamming)
    

    What is the reason for high CPU utilization?

    opened by divyeshrajpura4114 2
  • viseme generation

    @jameslyons, thank you so much for this repo. Hi everyone, I am trying to produce visemes from audio. Can you please guide me on how I can generate visemes using this repo, or point me to any other repo or relevant work that I could extend towards my goal? I would really appreciate any help.

    opened by AhmadManzoor 0
  • [Question:] inverse fbank back to wav

    Hi, thanks for the library. I use it to compute the fbank features, do some processing on them, and then get a new set of features. Is there a way to convert the new fbank features back to WAV? I have the original starting file, so theoretically it should be possible.

    opened by matdtr 0
  • Minor issue on round vs. floor

    In this line:

    https://github.com/jameslyons/python_speech_features/blob/9a2d76c6336d969d51ad3aa0d129b99297dcf55e/python_speech_features/base.py#L169

    I think you are assuming that np.floor(t+1) = np.round(t), but that is not true. I think you want:

    bin = numpy.round((nfft)*mel2hz(melpoints)/samplerate)
    bin = numpy.floor((nfft+0.5)*mel2hz(melpoints)/samplerate)

    This is a minor point because they often give the same and it doesn't matter in practice. I just found this point a little confusing in your write-up.

    Thanks for the blogpost and this code!

    opened by keithchugg 0
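
A small numeric check of the point raised above, assuming the linked line computes the bin as floor((nfft + 1) * f / samplerate). The frequencies are arbitrary values in Hz chosen for illustration.

```python
import numpy as np

nfft, samplerate = 512, 16000
freqs = np.array([20.0, 100.0, 1000.0, 4000.0])     # arbitrary band edges in Hz

floor_bins = np.floor((nfft + 1) * freqs / samplerate)
round_bins = np.round(nfft * freqs / samplerate)

# True where the two rules pick a different FFT bin.
differs = floor_bins != round_bins
```

Here only the 20 Hz edge lands in a different bin (0 versus 1), which matches the remark that the two forms often give the same result.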
Releases(0.6.1)
Owner
James Lyons