kapre: Keras Audio Preprocessors

Overview

Kapre

Keras Audio Preprocessors - compute STFT, ISTFT, Melspectrogram, and others on GPU real-time.

Tested on Python 3.6 and 3.7

Why Kapre?

vs. Pre-computation

  • You can optimize DSP parameters
  • Your model deployment becomes much simpler and consistent.
  • Your code and model has less dependencies

vs. Your own implementation

  • Quick and easy!
  • Consistent with 1D/2D tensorflow batch shapes
  • Data format agnostic (channels_first and channels_last)
  • Less error prone - Kapre layers are tested against Librosa (stft, decibel, etc) - which is (trust me) trickier than you think.
  • Kapre layers have some extended APIs from the default tf.signals implementation such as..
    • A perfectly invertible STFT and InverseSTFT pair
    • Mel-spectrogram with more options
  • Reproducibility - Kapre is available on pip with versioning

Workflow with Kapre

  1. Preprocess your audio dataset. Resample the audio to the right sampling rate and store the audio signals (waveforms).
  2. In your ML model, add Kapre layer e.g. kapre.time_frequency.STFT() as the first layer of the model.
  3. The data loader simply loads audio signals and feed them into the model
  4. In your hyperparameter search, include DSP parameters like n_fft to boost the performance.
  5. When deploying the final model, all you need to remember is the sampling rate of the signal. No dependency or preprocessing!

Installation

pip install kapre

API Documentation

Please refer to Kapre API Documentation at https://kapre.readthedocs.io

One-shot example

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU, GlobalAveragePooling2D, Dense, Softmax
from kapre import STFT, Magnitude, MagnitudeToDecibel
from kapre.composed import get_melspectrogram_layer, get_log_frequency_spectrogram_layer

# 6 channels (!), maybe 1-sec audio signal, for an example.
input_shape = (44100, 6)
sr = 44100
model = Sequential()
# A STFT layer
model.add(STFT(n_fft=2048, win_length=2018, hop_length=1024,
               window_name=None, pad_end=False,
               input_data_format='channels_last', output_data_format='channels_last',
               input_shape=input_shape))
model.add(Magnitude())
model.add(MagnitudeToDecibel())  # these three layers can be replaced with get_stft_magnitude_layer()
# Alternatively, you may want to use a melspectrogram layer
# melgram_layer = get_melspectrogram_layer()
# or log-frequency layer
# log_stft_layer = get_log_frequency_spectrogram_layer() 

# add more layers as you want
model.add(Conv2D(32, (3, 3), strides=(2, 2)))
model.add(BatchNormalization())
model.add(ReLU())
model.add(GlobalAveragePooling2D())
model.add(Dense(10))
model.add(Softmax())

# Compile the model
model.compile('adam', 'categorical_crossentropy') # if single-label classification

# train it with raw audio sample inputs
# for example, you may have functions that load your data as below.
x = load_x() # e.g., x.shape = (10000, 6, 44100)
y = load_y() # e.g., y.shape = (10000, 10) if it's 10-class classification
# then..
model.fit(x, y)
# Done!

Citation

Please cite this paper if you use Kapre for your work.

@inproceedings{choi2017kapre,
  title={Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras},
  author={Choi, Keunwoo and Joo, Deokjin and Kim, Juho},
  booktitle={Machine Learning for Music Discovery Workshop at 34th International Conference on Machine Learning},
  year={2017},
  organization={ICML}
}
Comments
  • Migrated functions to tf.keras

    Migrated functions to tf.keras

    This PR addresses #52 by removing the dependency on keras and switching to tensorflow.keras

    Proposed version is 0.1.6 because of pull request #56

    In particular, #56 keeps the dependency on keras with from keras.utils.conv_utils import conv_output_length

    opened by douglas125 27
  • Spectrogram?

    Spectrogram?

    I have an older version of Kapre that has time_frequency.Spectrogram, which is trainable.

    However, the new version of Kapre doesn't have Spectrogram anymore. Why?

    opened by turian 8
  • Melspectrogram cant be set 'trainable_fb=False'

    Melspectrogram cant be set 'trainable_fb=False'

    Melspectrogram cant be set 'trainable_fb=False',after I set trainable_fb=False,trainable_kernel=False,but seems like it doesnot work.It is still trainable.

    opened by zhh6 8
  • htk=true for mel frequencies

    htk=true for mel frequencies

    We noticed the current implenetation of the mel_frequencies function (based on Librosa) doesn't include the htk=True option, which is handy when training CNNs because then the frequency scale is fully logarithmic which, in principle, makes more sense for frequency invariant convolutional filters.

    What was the motivation for removing this? Any chance it can be added?

    opened by justinsalamon 8
  • Added parallel STFT implementation

    Added parallel STFT implementation

    Hi!

    As I comented in #98, I added a parallel STFT implementation based on the map_fn function following the indications of @zaccharieramzi here.

    I've added a use_parallel_stft param (disabled by default) that allows to use this function. I've put this param in as many functions as I can. I also added test cases for every function I can, including an specific test that checks that the result of the tf.signal.stft is equals to the result of the parallel_stft function.

    Hope this could serve us well meanwhile tensorflow resolves its issues with the fft implementation.

    opened by JPery 7
  • Amplitude-to-decibel conversion produces different results on different batches

    Amplitude-to-decibel conversion produces different results on different batches

    Related to #16, I found another issue that contributes to different prediction results depending on batch size (and the batches themselves). In particular, it occurs when using converting spectrograms to decibels.

    https://github.com/keunwoochoi/kapre/blob/master/kapre/backend_keras.py#L17

    The maximum is taken over the entire tensor, instead of per example in the batch. This results in different normalization when the examples in a batch are different.

    bug 
    opened by auroracramer 7
  • Inverse Spectrogram and Mel-Spectrogram Layer?

    Inverse Spectrogram and Mel-Spectrogram Layer?

    Namaste!

    kapre has become an integral part of all my audio Deep Learning experiments. Powerful! Thanks for providing such a great software!

    I was thinking... I guess it would make sense to have layers for inverse spectrogram and inverse mel-spectrogram. Thinking about Autoencoders, this would be even more powerful. I know that reconstructing samples from spectrograms is not the best, but it is possible to a certain degree.

    What do you think about that feature request?

    Best, Tristan

    opened by AI-Guru 7
  • Hey! The input is too short!

    Hey! The input is too short!

    Hi,

    I'm encountering an assertion problem when calling your code with a Tensorflow backend.

    input_shape = (44100,1)

    Could this be a be a problem with "channels_first" / "channels_last"?

    Best, Alex

    opened by slychief 6
  • Pip?

    Pip?

    It seems you were on pip, but are no longer. Is there anything I could do to help get kapre back on there? We want to use this library in a commercial application, and for our process pip packages are easier to support than a git repository.

    opened by ff-rfeather 6
  • trainable_stft error

    trainable_stft error

    Following your example but missing layer definition trainable_stft or something, can you provide example with error resolution?

    `# 6 channels (!), maybe 1-sec audio signal
    input_shape = (6, 44100) 
    sr = 44100
    model = Sequential()
    model.add(Melspectrogram(n_dft=512, n_hop=256, input_shape=src_shape,
                             border_mode='same', sr=sr, n_mels=128,
                             fmin=0.0, fmax=sr/2, power=1.0,
                             return_decibel=False, trainable_fb=False,
                             trainable_kernel=False
                             name='trainable_stft'))`
    
      File "<ipython-input-24-cea5588ddf1e>", line 13
        name='trainable_stft'))
           ^
    SyntaxError: invalid syntax
    
    opened by sildeag 6
  • `STFT` layer output shape deviates from `STFTTflite` layer in batch dimension

    `STFT` layer output shape deviates from `STFTTflite` layer in batch dimension

    Use Case

    I want to convert a STFT layer in my model to a STFTTflite to deploy it to my mobile device. In the documentation I found that another dimension is added to account for complex numbers. But I also encountered a behaviour that is not documented.

    Expected Behaviour

    input_shape = (2048, 1)  # mono signal
    
    model = keras.models.Sequential()  # TFLite incompatible model
    model.add(kapre.STFT(n_fft=1024, hop_length=512, input_shape=input_shape))
    
    tflite_model = keras.models.Sequential()  # TFLite compatible model
    tflite_model.add(kapre.STFTTflite(n_fft=1024, hop_length=512, input_shape=input_shape))
    

    model has the output shape (None, 3, 513, 1). Therefore, tflite_model should have the output shape (None, 3, 513, 1, 2).

    Observed Behaviour

    The output shape of tflite_model is (1, 3, 513, 1, 2) instead of (None, 3, 513, 1, 2).

    Problem Solution

    • If this behaviour is unwanted:
      • Change the model output format so that the batch dimension is correctly shaped.
    • Otherwise:
      • Explain in the documentation why the batch dimension is shaped to 1.
      • Explain in the documentation how to include this layer into models which expect the batch dimension to be shaped None.
    opened by PhilippMatthes 5
  • Problem incorporating SpecAugument in the training process

    Problem incorporating SpecAugument in the training process

    Hi,

    I'm trying to add a SpecAug layer in the training process of a CNN using the code below:

    
    CLIP_DURATION = 5 
    SAMPLING_RATE = 41000
    NUM_CHANNELS = 1
    
    INPUT_SHAPE = ((CLIP_DURATION * SAMPLING_RATE), NUM_CHANNELS)
    
    melgram = get_melspectrogram_layer(input_shape = INPUT_SHAPE,
                              n_fft = 2048,
                              hop_length = 512,
                              return_decibel=True,
                              n_mels = 40,
                              mel_f_min = 500,
                              mel_f_max = 15000,
                              input_data_format='channels_last',
                              output_data_format='channels_last')
    
    spec_augment = SpecAugment(freq_mask_param=5,
                              time_mask_param=10,
                              n_freq_masks=2,
                              n_time_masks=3,
                              mask_value=-100)   
    
    model = Sequential()
    model.add(melgram)
    model.add(spec_augment)
    

    The CNN summary looks like this:

    Model: "sequential_2"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     melspectrogram (Sequential)  (None, 397, 40, 1)       0         
                                                                     
     spec_augment_1 (SpecAugment  (None, 397, 40, 1)       0         
     )                                                               
                                                                     
    =================================================================
    Total params: 0
    Trainable params: 0
    Non-trainable params: 0
    _________________________________________________________________
    

    Compiling and fitting the model

    model.compile(loss = 'sparse_categorical_crossentropy', optimizer='adam', metrics = 'accuracy')
    
    early_stop = EarlyStopping(monitor='loss', patience=5)
    
    reduce_LR = ReduceLROnPlateau(monitor="val_loss",factor=0.1,patience=4)
    
    checkpointer = ModelCheckpoint(filepath = 'saved_models/bird_song_classification.hdf5')
    
    model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs = 50, batch_size = 32, callbacks = [early_stop, checkpointer, reduce_LR])
    

    Then I get the following error:

    Epoch 1/50
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    [<ipython-input-35-e58a056ab523>](https://localhost:8080/#) in <module>
          7 checkpointer = ModelCheckpoint(filepath = 'saved_models/bird_song_classification.hdf5')
          8 
    ----> 9 model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs = 50, batch_size = 32, callbacks = [early_stop, checkpointer, reduce_LR])
    
    6 frames
    [/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py](https://localhost:8080/#) in tf___apply_masks_to_axis(self, x, axis, mask_param, n_masks)
         78                 try:
         79                     do_return = True
    ---> 80                     retval_ = ag__.converted_call(ag__.ld(tf).where, (ag__.ld(mask), ag__.ld(self).mask_value, ag__.ld(x)), None, fscope)
         81                 except:
         82                     do_return = False
    
    TypeError: in user code:
    
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1051, in train_function  *
            return step_function(self, iterator)
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1040, in step_function  **
            outputs = model.distribute_strategy.run(run_step, args=(data,))
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1030, in run_step  **
            outputs = model.train_step(data)
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 889, in train_step
            y_pred = self(x, training=True)
        File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
            raise e.with_traceback(filtered_tb) from None
        File "/tmp/__autograph_generated_filepzvfxhgz.py", line 63, in tf__call
            ag__.if_stmt((ag__.ld(training) in (None, False)), if_body_2, else_body_2, get_state_2, set_state_2, ('do_return', 'retval_'), 2)
        File "/tmp/__autograph_generated_filepzvfxhgz.py", line 58, in else_body_2
            retval_ = ag__.converted_call(ag__.ld(tf).map_fn, (), dict(elems=ag__.ld(x), fn=ag__.ld(self)._apply_spec_augment, dtype=ag__.ld(tf).float32, fn_output_signature=ag__.ld(tf).float32), fscope)
        File "/tmp/__autograph_generated_filef27o6c1f.py", line 44, in tf___apply_spec_augment
            ag__.if_stmt((ag__.ld(self).n_time_masks >= 1), if_body_1, else_body_1, get_state_1, set_state_1, ('x',), 1)
        File "/tmp/__autograph_generated_filef27o6c1f.py", line 39, in if_body_1
            x = ag__.converted_call(ag__.ld(self)._apply_masks_to_axis, (ag__.ld(x),), dict(axis=ag__.ld(time_axis), mask_param=ag__.ld(self).time_mask_param, n_masks=ag__.ld(self).n_time_masks), fscope)
        File "/tmp/__autograph_generated_file3vip8w4x.py", line 80, in tf___apply_masks_to_axis
            retval_ = ag__.converted_call(ag__.ld(tf).where, (ag__.ld(mask), ag__.ld(self).mask_value, ag__.ld(x)), None, fscope)
    
        TypeError: Exception encountered when calling layer "spec_augment_1" (type SpecAugment).
        
        in user code:
        
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 299, in call  *
                elems=x, fn=self._apply_spec_augment, dtype=tf.float32, fn_output_signature=tf.float32
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 273, in _apply_spec_augment  *
                x = self._apply_masks_to_axis(
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 254, in _apply_masks_to_axis  *
                return tf.where(mask, self.mask_value, x)
        
            TypeError: Input 'e' of 'SelectV2' Op has type float32 that does not match type int32 of argument 't'.
        
        
        Call arguments received by layer "spec_augment_1" (type SpecAugment):
          • x=tf.Tensor(shape=(None, 397, 40, 1), dtype=float32)
          • training=True
          • kwargs=<class 'inspect._empty'>
    

    The shape of X_train is

    (2182, 205000, 1)
    

    I'm using Tensorflow 2.9.2, and Python 3.7.15

    When I remove the SpecAug layer everything runs fine. I've tested using only the melspec + a mobile net at the end and it runs smooth. The problem is apparently related to SpecAug layer.

    Do you have any idea what could be going wrong here? I appreciate any guidance related to the problem. Best regards.

    opened by nnbuainain 2
  • Full-integer quantization and kapre layers

    Full-integer quantization and kapre layers

    I am training a model which includes the mel-spectrogram block from get_melspectrogram_layer() right after the input layer. Training goes well, and I am able to change the specific mel-spec-layers to their TFLite-counterparts (STFTTflite, MagnitudeTflite) afterwards. I have checked also that the model performs as well as before.

    The model also perfoms as expected when converting the model to .tflite using dynamic range quantization. However, when using full-integer quantization, the model loses its accuracy (see (https://www.tensorflow.org/lite/performance/post_training_quantization#integer_only).

    I suppose the mel-spec starts to significantly differ as in full-integer quantization, the input values are projected to new range (int8). Is there any way to make it work with full-integer quantization?

    I guess I need to separate the mel-spec-layer from the model as a preprocessing step in order to succeed with full-integer quantization, i.e., apply the input quantization to the output values of mel-spec layer. But then I would have to deploy two models to the edge device, where the input goes first to the mel-spec-block and then to the rest of the model (?).

    I am using TensorFlow 2.7.0 and kapre 0.3.7.

    Here is my code for testing the tflite-model:

    preds = []
    # Test and evaluate the TFLite-converted model on unseen test data
    for i, sample in enumerate(X_test_full_scaled):
        X = sample
        
        if input_details['dtype'] == np.int8:
            input_scale, input_zero_point = input_details["quantization"]
            X = sample / input_scale + input_zero_point
        
        X = X.reshape((1, 8000, 1)).astype(input_details["dtype"])
        
        interpreter.set_tensor(input_index, X)
        interpreter.invoke()
        pred = interpreter.get_tensor(output_index)
        
        output_scale, output_zero_point = output_details['quantization']
        if output_details['dtype'] == np.int8:
            pred = pred.astype(np.float32)
            pred = (pred - output_zero_point) * output_scale
        
        pred = np.argmax(pred, axis=1)[0]
        preds.append(pred)
    
    preds = np.array(preds)
    
    opened by eppane 3
  • Calling Magnitude() and Phase() simultaneously

    Calling Magnitude() and Phase() simultaneously

    Hi,

    I am looking to call Magnitude() and Phase() simultaneously for the same STFT input and concatenate the magnitude and phase before feeding into the convolution layers in my CNN sequential Keras model.

    Is this possible?

    Best,

    Yang

    opened by HsuanYang-Wang 1
  • about kapre.utiils

    about kapre.utiils

    Hi, when i used "from kapre.utils import Normalization2D", I met this error which said No module named 'kapre.utils'. I see your package, and found that there is surely no utils.py. I am wondering how to slove it.

    Best wishes, Daisy

    opened by YiningWang2 1
  • Function missing in updated version

    Function missing in updated version

    I noticed there is a functon "kapre.utils.Normalization2D" in the old version, while I cannot find it in the updated version. Why? Is there have any alternative functions?

    opened by v3551G 1
  • trainable DSP parameters

    trainable DSP parameters

    hello contributers and community.

    I love your repo! It's eases so much for me! Although having the precomputation in the model is already great I'd like to know how you can optimize DSP parameters. It looks like that this is a feature from old versions (e.g. 0.2) and by default I dont see any trainable params in this layer.

    Could you please state if this is still available and how to use it?

    happy hacking Paul

    opened by bytosaur 2
Releases(Kapre-0.3.7)
Owner
Keunwoo Choi
MIR, machine learning, music recommendation.
Keunwoo Choi
We built this fully functioning Music player in Python. The music player allows you to play/pause and switch to different songs easily.

We built this fully functioning Music player in Python. The music player allows you to play/pause and switch to different songs easily.

1 Nov 19, 2021
TwitterMusicBot - A Twitter bot with Spotify integration.

A Twitter Music Bot 🤖 🎵 🎶 I created this project to learn more about APIs, so it only works for student purposes. Initially, delving into the Spoti

Gustavo Oliveira 2 Jan 02, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.1k Dec 31, 2022
DCL - An easy to use diacritic library used for diacritic and accent manipulation.

Diacritics Library This library is used for adding, and removing diacritics from strings. Getting started Start by importing the module: import dcl DC

Kreus Amredes 6 Jun 03, 2022
Spotifyd - An open source Spotify client running as a UNIX daemon.

Spotifyd An open source Spotify client running as a UNIX daemon. Spotifyd streams music just like the official client, but is more lightweight and sup

8.5k Jan 09, 2023
Learn chords with your MIDI keyboard !

miditeach miditeach is a music learning tool that can be used to practice your chords skills with a midi keyboard 🎹 ! Features Midi keyboard input se

Alexis LOUIS 3 Oct 20, 2021
Full LAKH MIDI dataset converted to MuseNet MIDI output format (9 instruments + drums)

LAKH MuseNet MIDI Dataset Full LAKH MIDI dataset converted to MuseNet MIDI output format (9 instruments + drums) Bonus: Choir on Channel 10 Please CC

Alex 6 Nov 20, 2022
Terminal-based music player written in Python for the best music in the world 🎵 🎧 💻

audius-terminal-player Terminal-based music player written in Python for the best music in the world 🎵 🎧 💻 Browse and listen to Audius from the com

Audius 21 Jul 23, 2022
Read music meta data and length of MP3, OGG, OPUS, MP4, M4A, FLAC, WMA and Wave files with python 2 or 3

tinytag tinytag is a library for reading music meta data of MP3, OGG, OPUS, MP4, M4A, FLAC, WMA and Wave files with python Install pip install tinytag

Tom Wallroth 577 Dec 26, 2022
Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features

MediumVC MediumVC is an utterance-level method towards any-to-any VC. Before that, we propose SingleVC to perform A2O tasks(Xi → Ŷi) , Xi means utter

谷下雨 47 Dec 25, 2022
Pythonic bindings for FFmpeg's libraries.

PyAV PyAV is a Pythonic binding for the FFmpeg libraries. We aim to provide all of the power and control of the underlying library, but manage the gri

PyAV 1.8k Jan 03, 2023
Python module for handling audio metadata

Mutagen is a Python module to handle audio metadata. It supports ASF, FLAC, MP4, Monkey's Audio, MP3, Musepack, Ogg Opus, Ogg FLAC, Ogg Speex, Ogg The

Quod Libet 1.1k Dec 31, 2022
Sound-Equalizer- This is a Sound Equalizer GUI App Using Python's PyQt5

Sound-Equalizer- This is a Sound Equalizer GUI App Using Python's PyQt5. It gives you the ability to play, pause, and Equalize any one-channel wav audio file and play 3 different instruments.

Mustafa Megahed 1 Jan 10, 2022
A simple voice detection system which can be applied practically for designing a device with capability to detect a baby’s cry and automatically turning on music

Auto-Baby-Cry-Detection-with-Music-Player A simple voice detection system which can be applied practically for designing a device with capability to d

2 Dec 15, 2021
GiantMIDI-Piano is a classical piano MIDI dataset contains 10,854 MIDI files of 2,786 composers

GiantMIDI-Piano is a classical piano MIDI dataset contains 10,854 MIDI files of 2,786 composers

Bytedance Inc. 1.3k Jan 04, 2023
A voice control utility for Spotify

Spotify Voice Control A voice control utility for Spotify · Report Bug · Request

Shoubhit Dash 27 Jan 01, 2023
Audio fingerprinting and recognition in Python

dejavu Audio fingerprinting and recognition algorithm implemented in Python, see the explanation here: How it works Dejavu can memorize audio by liste

Will Drevo 6k Jan 06, 2023
A library for augmenting annotated audio data

muda A library for Musical Data Augmentation. muda package implements annotation-aware musical data augmentation, as described in the muda paper. The

Brian McFee 214 Nov 22, 2022
Voicefixer aims at the restoration of human speech regardless how serious its degraded.

Voicefixer aims at the restoration of human speech regardless how serious its degraded.

Leo 324 Dec 26, 2022
Welcome to Nexus. Your personal virtual assistant

AI Voice Assistant Welcome to Nexus voice assistant Description Have you ever heard of voice assistants like Cortana, Siri, Google assistant, and Alex

Mustafah Zacs 1 Jan 10, 2022