kapre: Keras Audio Preprocessors

Overview

Kapre

Keras Audio Preprocessors - compute STFT, ISTFT, Melspectrogram, and others on GPU real-time.

Tested on Python 3.6 and 3.7

Why Kapre?

vs. Pre-computation

  • You can optimize DSP parameters
  • Your model deployment becomes much simpler and consistent.
  • Your code and model has less dependencies

vs. Your own implementation

  • Quick and easy!
  • Consistent with 1D/2D tensorflow batch shapes
  • Data format agnostic (channels_first and channels_last)
  • Less error prone - Kapre layers are tested against Librosa (stft, decibel, etc) - which is (trust me) trickier than you think.
  • Kapre layers have some extended APIs from the default tf.signals implementation such as..
    • A perfectly invertible STFT and InverseSTFT pair
    • Mel-spectrogram with more options
  • Reproducibility - Kapre is available on pip with versioning

Workflow with Kapre

  1. Preprocess your audio dataset. Resample the audio to the right sampling rate and store the audio signals (waveforms).
  2. In your ML model, add Kapre layer e.g. kapre.time_frequency.STFT() as the first layer of the model.
  3. The data loader simply loads audio signals and feed them into the model
  4. In your hyperparameter search, include DSP parameters like n_fft to boost the performance.
  5. When deploying the final model, all you need to remember is the sampling rate of the signal. No dependency or preprocessing!

Installation

pip install kapre

API Documentation

Please refer to Kapre API Documentation at https://kapre.readthedocs.io

One-shot example

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU, GlobalAveragePooling2D, Dense, Softmax
from kapre import STFT, Magnitude, MagnitudeToDecibel
from kapre.composed import get_melspectrogram_layer, get_log_frequency_spectrogram_layer

# 6 channels (!), maybe 1-sec audio signal, for an example.
input_shape = (44100, 6)
sr = 44100
model = Sequential()
# A STFT layer
model.add(STFT(n_fft=2048, win_length=2018, hop_length=1024,
               window_name=None, pad_end=False,
               input_data_format='channels_last', output_data_format='channels_last',
               input_shape=input_shape))
model.add(Magnitude())
model.add(MagnitudeToDecibel())  # these three layers can be replaced with get_stft_magnitude_layer()
# Alternatively, you may want to use a melspectrogram layer
# melgram_layer = get_melspectrogram_layer()
# or log-frequency layer
# log_stft_layer = get_log_frequency_spectrogram_layer() 

# add more layers as you want
model.add(Conv2D(32, (3, 3), strides=(2, 2)))
model.add(BatchNormalization())
model.add(ReLU())
model.add(GlobalAveragePooling2D())
model.add(Dense(10))
model.add(Softmax())

# Compile the model
model.compile('adam', 'categorical_crossentropy') # if single-label classification

# train it with raw audio sample inputs
# for example, you may have functions that load your data as below.
x = load_x() # e.g., x.shape = (10000, 6, 44100)
y = load_y() # e.g., y.shape = (10000, 10) if it's 10-class classification
# then..
model.fit(x, y)
# Done!

Citation

Please cite this paper if you use Kapre for your work.

@inproceedings{choi2017kapre,
  title={Kapre: On-GPU Audio Preprocessing Layers for a Quick Implementation of Deep Neural Network Models with Keras},
  author={Choi, Keunwoo and Joo, Deokjin and Kim, Juho},
  booktitle={Machine Learning for Music Discovery Workshop at 34th International Conference on Machine Learning},
  year={2017},
  organization={ICML}
}
Comments
  • Migrated functions to tf.keras

    Migrated functions to tf.keras

    This PR addresses #52 by removing the dependency on keras and switching to tensorflow.keras

    Proposed version is 0.1.6 because of pull request #56

    In particular, #56 keeps the dependency on keras with from keras.utils.conv_utils import conv_output_length

    opened by douglas125 27
  • Spectrogram?

    Spectrogram?

    I have an older version of Kapre that has time_frequency.Spectrogram, which is trainable.

    However, the new version of Kapre doesn't have Spectrogram anymore. Why?

    opened by turian 8
  • Melspectrogram cant be set 'trainable_fb=False'

    Melspectrogram cant be set 'trainable_fb=False'

    Melspectrogram cant be set 'trainable_fb=False',after I set trainable_fb=False,trainable_kernel=False,but seems like it doesnot work.It is still trainable.

    opened by zhh6 8
  • htk=true for mel frequencies

    htk=true for mel frequencies

    We noticed the current implenetation of the mel_frequencies function (based on Librosa) doesn't include the htk=True option, which is handy when training CNNs because then the frequency scale is fully logarithmic which, in principle, makes more sense for frequency invariant convolutional filters.

    What was the motivation for removing this? Any chance it can be added?

    opened by justinsalamon 8
  • Added parallel STFT implementation

    Added parallel STFT implementation

    Hi!

    As I comented in #98, I added a parallel STFT implementation based on the map_fn function following the indications of @zaccharieramzi here.

    I've added a use_parallel_stft param (disabled by default) that allows to use this function. I've put this param in as many functions as I can. I also added test cases for every function I can, including an specific test that checks that the result of the tf.signal.stft is equals to the result of the parallel_stft function.

    Hope this could serve us well meanwhile tensorflow resolves its issues with the fft implementation.

    opened by JPery 7
  • Amplitude-to-decibel conversion produces different results on different batches

    Amplitude-to-decibel conversion produces different results on different batches

    Related to #16, I found another issue that contributes to different prediction results depending on batch size (and the batches themselves). In particular, it occurs when using converting spectrograms to decibels.

    https://github.com/keunwoochoi/kapre/blob/master/kapre/backend_keras.py#L17

    The maximum is taken over the entire tensor, instead of per example in the batch. This results in different normalization when the examples in a batch are different.

    bug 
    opened by auroracramer 7
  • Inverse Spectrogram and Mel-Spectrogram Layer?

    Inverse Spectrogram and Mel-Spectrogram Layer?

    Namaste!

    kapre has become an integral part of all my audio Deep Learning experiments. Powerful! Thanks for providing such a great software!

    I was thinking... I guess it would make sense to have layers for inverse spectrogram and inverse mel-spectrogram. Thinking about Autoencoders, this would be even more powerful. I know that reconstructing samples from spectrograms is not the best, but it is possible to a certain degree.

    What do you think about that feature request?

    Best, Tristan

    opened by AI-Guru 7
  • Hey! The input is too short!

    Hey! The input is too short!

    Hi,

    I'm encountering an assertion problem when calling your code with a Tensorflow backend.

    input_shape = (44100,1)

    Could this be a be a problem with "channels_first" / "channels_last"?

    Best, Alex

    opened by slychief 6
  • Pip?

    Pip?

    It seems you were on pip, but are no longer. Is there anything I could do to help get kapre back on there? We want to use this library in a commercial application, and for our process pip packages are easier to support than a git repository.

    opened by ff-rfeather 6
  • trainable_stft error

    trainable_stft error

    Following your example but missing layer definition trainable_stft or something, can you provide example with error resolution?

    `# 6 channels (!), maybe 1-sec audio signal
    input_shape = (6, 44100) 
    sr = 44100
    model = Sequential()
    model.add(Melspectrogram(n_dft=512, n_hop=256, input_shape=src_shape,
                             border_mode='same', sr=sr, n_mels=128,
                             fmin=0.0, fmax=sr/2, power=1.0,
                             return_decibel=False, trainable_fb=False,
                             trainable_kernel=False
                             name='trainable_stft'))`
    
      File "<ipython-input-24-cea5588ddf1e>", line 13
        name='trainable_stft'))
           ^
    SyntaxError: invalid syntax
    
    opened by sildeag 6
  • `STFT` layer output shape deviates from `STFTTflite` layer in batch dimension

    `STFT` layer output shape deviates from `STFTTflite` layer in batch dimension

    Use Case

    I want to convert a STFT layer in my model to a STFTTflite to deploy it to my mobile device. In the documentation I found that another dimension is added to account for complex numbers. But I also encountered a behaviour that is not documented.

    Expected Behaviour

    input_shape = (2048, 1)  # mono signal
    
    model = keras.models.Sequential()  # TFLite incompatible model
    model.add(kapre.STFT(n_fft=1024, hop_length=512, input_shape=input_shape))
    
    tflite_model = keras.models.Sequential()  # TFLite compatible model
    tflite_model.add(kapre.STFTTflite(n_fft=1024, hop_length=512, input_shape=input_shape))
    

    model has the output shape (None, 3, 513, 1). Therefore, tflite_model should have the output shape (None, 3, 513, 1, 2).

    Observed Behaviour

    The output shape of tflite_model is (1, 3, 513, 1, 2) instead of (None, 3, 513, 1, 2).

    Problem Solution

    • If this behaviour is unwanted:
      • Change the model output format so that the batch dimension is correctly shaped.
    • Otherwise:
      • Explain in the documentation why the batch dimension is shaped to 1.
      • Explain in the documentation how to include this layer into models which expect the batch dimension to be shaped None.
    opened by PhilippMatthes 5
  • Problem incorporating SpecAugument in the training process

    Problem incorporating SpecAugument in the training process

    Hi,

    I'm trying to add a SpecAug layer in the training process of a CNN using the code below:

    
    CLIP_DURATION = 5 
    SAMPLING_RATE = 41000
    NUM_CHANNELS = 1
    
    INPUT_SHAPE = ((CLIP_DURATION * SAMPLING_RATE), NUM_CHANNELS)
    
    melgram = get_melspectrogram_layer(input_shape = INPUT_SHAPE,
                              n_fft = 2048,
                              hop_length = 512,
                              return_decibel=True,
                              n_mels = 40,
                              mel_f_min = 500,
                              mel_f_max = 15000,
                              input_data_format='channels_last',
                              output_data_format='channels_last')
    
    spec_augment = SpecAugment(freq_mask_param=5,
                              time_mask_param=10,
                              n_freq_masks=2,
                              n_time_masks=3,
                              mask_value=-100)   
    
    model = Sequential()
    model.add(melgram)
    model.add(spec_augment)
    

    The CNN summary looks like this:

    Model: "sequential_2"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     melspectrogram (Sequential)  (None, 397, 40, 1)       0         
                                                                     
     spec_augment_1 (SpecAugment  (None, 397, 40, 1)       0         
     )                                                               
                                                                     
    =================================================================
    Total params: 0
    Trainable params: 0
    Non-trainable params: 0
    _________________________________________________________________
    

    Compiling and fitting the model

    model.compile(loss = 'sparse_categorical_crossentropy', optimizer='adam', metrics = 'accuracy')
    
    early_stop = EarlyStopping(monitor='loss', patience=5)
    
    reduce_LR = ReduceLROnPlateau(monitor="val_loss",factor=0.1,patience=4)
    
    checkpointer = ModelCheckpoint(filepath = 'saved_models/bird_song_classification.hdf5')
    
    model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs = 50, batch_size = 32, callbacks = [early_stop, checkpointer, reduce_LR])
    

    Then I get the following error:

    Epoch 1/50
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    [<ipython-input-35-e58a056ab523>](https://localhost:8080/#) in <module>
          7 checkpointer = ModelCheckpoint(filepath = 'saved_models/bird_song_classification.hdf5')
          8 
    ----> 9 model.fit(X_train, y_train, validation_data = (X_val, y_val), epochs = 50, batch_size = 32, callbacks = [early_stop, checkpointer, reduce_LR])
    
    6 frames
    [/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py](https://localhost:8080/#) in tf___apply_masks_to_axis(self, x, axis, mask_param, n_masks)
         78                 try:
         79                     do_return = True
    ---> 80                     retval_ = ag__.converted_call(ag__.ld(tf).where, (ag__.ld(mask), ag__.ld(self).mask_value, ag__.ld(x)), None, fscope)
         81                 except:
         82                     do_return = False
    
    TypeError: in user code:
    
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1051, in train_function  *
            return step_function(self, iterator)
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1040, in step_function  **
            outputs = model.distribute_strategy.run(run_step, args=(data,))
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1030, in run_step  **
            outputs = model.train_step(data)
        File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 889, in train_step
            y_pred = self(x, training=True)
        File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
            raise e.with_traceback(filtered_tb) from None
        File "/tmp/__autograph_generated_filepzvfxhgz.py", line 63, in tf__call
            ag__.if_stmt((ag__.ld(training) in (None, False)), if_body_2, else_body_2, get_state_2, set_state_2, ('do_return', 'retval_'), 2)
        File "/tmp/__autograph_generated_filepzvfxhgz.py", line 58, in else_body_2
            retval_ = ag__.converted_call(ag__.ld(tf).map_fn, (), dict(elems=ag__.ld(x), fn=ag__.ld(self)._apply_spec_augment, dtype=ag__.ld(tf).float32, fn_output_signature=ag__.ld(tf).float32), fscope)
        File "/tmp/__autograph_generated_filef27o6c1f.py", line 44, in tf___apply_spec_augment
            ag__.if_stmt((ag__.ld(self).n_time_masks >= 1), if_body_1, else_body_1, get_state_1, set_state_1, ('x',), 1)
        File "/tmp/__autograph_generated_filef27o6c1f.py", line 39, in if_body_1
            x = ag__.converted_call(ag__.ld(self)._apply_masks_to_axis, (ag__.ld(x),), dict(axis=ag__.ld(time_axis), mask_param=ag__.ld(self).time_mask_param, n_masks=ag__.ld(self).n_time_masks), fscope)
        File "/tmp/__autograph_generated_file3vip8w4x.py", line 80, in tf___apply_masks_to_axis
            retval_ = ag__.converted_call(ag__.ld(tf).where, (ag__.ld(mask), ag__.ld(self).mask_value, ag__.ld(x)), None, fscope)
    
        TypeError: Exception encountered when calling layer "spec_augment_1" (type SpecAugment).
        
        in user code:
        
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 299, in call  *
                elems=x, fn=self._apply_spec_augment, dtype=tf.float32, fn_output_signature=tf.float32
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 273, in _apply_spec_augment  *
                x = self._apply_masks_to_axis(
            File "/usr/local/lib/python3.7/dist-packages/kapre/augmentation.py", line 254, in _apply_masks_to_axis  *
                return tf.where(mask, self.mask_value, x)
        
            TypeError: Input 'e' of 'SelectV2' Op has type float32 that does not match type int32 of argument 't'.
        
        
        Call arguments received by layer "spec_augment_1" (type SpecAugment):
          • x=tf.Tensor(shape=(None, 397, 40, 1), dtype=float32)
          • training=True
          • kwargs=<class 'inspect._empty'>
    

    The shape of X_train is

    (2182, 205000, 1)
    

    I'm using Tensorflow 2.9.2, and Python 3.7.15

    When I remove the SpecAug layer everything runs fine. I've tested using only the melspec + a mobile net at the end and it runs smooth. The problem is apparently related to SpecAug layer.

    Do you have any idea what could be going wrong here? I appreciate any guidance related to the problem. Best regards.

    opened by nnbuainain 2
  • Full-integer quantization and kapre layers

    Full-integer quantization and kapre layers

    I am training a model which includes the mel-spectrogram block from get_melspectrogram_layer() right after the input layer. Training goes well, and I am able to change the specific mel-spec-layers to their TFLite-counterparts (STFTTflite, MagnitudeTflite) afterwards. I have checked also that the model performs as well as before.

    The model also perfoms as expected when converting the model to .tflite using dynamic range quantization. However, when using full-integer quantization, the model loses its accuracy (see (https://www.tensorflow.org/lite/performance/post_training_quantization#integer_only).

    I suppose the mel-spec starts to significantly differ as in full-integer quantization, the input values are projected to new range (int8). Is there any way to make it work with full-integer quantization?

    I guess I need to separate the mel-spec-layer from the model as a preprocessing step in order to succeed with full-integer quantization, i.e., apply the input quantization to the output values of mel-spec layer. But then I would have to deploy two models to the edge device, where the input goes first to the mel-spec-block and then to the rest of the model (?).

    I am using TensorFlow 2.7.0 and kapre 0.3.7.

    Here is my code for testing the tflite-model:

    preds = []
    # Test and evaluate the TFLite-converted model on unseen test data
    for i, sample in enumerate(X_test_full_scaled):
        X = sample
        
        if input_details['dtype'] == np.int8:
            input_scale, input_zero_point = input_details["quantization"]
            X = sample / input_scale + input_zero_point
        
        X = X.reshape((1, 8000, 1)).astype(input_details["dtype"])
        
        interpreter.set_tensor(input_index, X)
        interpreter.invoke()
        pred = interpreter.get_tensor(output_index)
        
        output_scale, output_zero_point = output_details['quantization']
        if output_details['dtype'] == np.int8:
            pred = pred.astype(np.float32)
            pred = (pred - output_zero_point) * output_scale
        
        pred = np.argmax(pred, axis=1)[0]
        preds.append(pred)
    
    preds = np.array(preds)
    
    opened by eppane 3
  • Calling Magnitude() and Phase() simultaneously

    Calling Magnitude() and Phase() simultaneously

    Hi,

    I am looking to call Magnitude() and Phase() simultaneously for the same STFT input and concatenate the magnitude and phase before feeding into the convolution layers in my CNN sequential Keras model.

    Is this possible?

    Best,

    Yang

    opened by HsuanYang-Wang 1
  • about kapre.utiils

    about kapre.utiils

    Hi, when i used "from kapre.utils import Normalization2D", I met this error which said No module named 'kapre.utils'. I see your package, and found that there is surely no utils.py. I am wondering how to slove it.

    Best wishes, Daisy

    opened by YiningWang2 1
  • Function missing in updated version

    Function missing in updated version

    I noticed there is a functon "kapre.utils.Normalization2D" in the old version, while I cannot find it in the updated version. Why? Is there have any alternative functions?

    opened by v3551G 1
  • trainable DSP parameters

    trainable DSP parameters

    hello contributers and community.

    I love your repo! It's eases so much for me! Although having the precomputation in the model is already great I'd like to know how you can optimize DSP parameters. It looks like that this is a feature from old versions (e.g. 0.2) and by default I dont see any trainable params in this layer.

    Could you please state if this is still available and how to use it?

    happy hacking Paul

    opened by bytosaur 2
Releases(Kapre-0.3.7)
Owner
Keunwoo Choi
MIR, machine learning, music recommendation.
Keunwoo Choi
Python game programming in Jupyter notebooks.

Jupylet Jupylet is a Python library for programming 2D and 3D games, graphics, music and sound synthesizers, interactively in a Jupyter notebook. It i

Nir Aides 178 Dec 09, 2022
Deep learning transformer model that generates unique music sequences.

music-ai Deep learning transformer model that generates unique music sequences. Abstract In 2017, a new state-of-the-art was published for natural lan

xacer 6 Nov 19, 2022
GiantMIDI-Piano is a classical piano MIDI dataset contains 10,854 MIDI files of 2,786 composers

GiantMIDI-Piano is a classical piano MIDI dataset contains 10,854 MIDI files of 2,786 composers

Bytedance Inc. 1.3k Jan 04, 2023
Cobra is a highly-accurate and lightweight voice activity detection (VAD) engine.

On-device voice activity detection (VAD) powered by deep learning.

Picovoice 88 Dec 16, 2022
SomaFM Plugin for Kodi

SomaFM XBMC Plugin This description is a bit outdated. You can simply install this addon by browsing the official repositories from within Kodi. Insta

7 Jan 21, 2022
Jarvis From Basic to Advance - make a voice assistant similar to JARVIS (in iron man movie)

JARVIS (Basic to Advance) This was my attempt to make a voice assistant similar to JARVIS (in iron man movie) Let's be honest, it's not as intelligent

codesempai 17 Dec 25, 2022
Open Sound Strip, Sequence or Record in Audacity

Audacity Tools For Blender Sound editing in Blender Video Sequence Editor with Audacity integrated. Send/receive the full edited sequence or single st

64 Dec 31, 2022
Inner ear models for Python

cochlea cochlea is a collection of inner ear models. All models are easily accessible as Python functions. They take sound signal as input and return

98 Jan 05, 2023
Reading list for research topics in sound event detection

Sound event detection aims at processing the continuous acoustic signal and converting it into symbolic descriptions of the corresponding sound events present at the auditory scene.

Soham 64 Jan 05, 2023
Terminal-based audio-to-text converter

att Terminal-based audio-to-text converter Project description A terminal-based audio-to-text converter written in python, enabling you to convert .wa

Sven Eschlbeck 4 Dec 15, 2022
AudioDVP:Photorealistic Audio-driven Video Portraits

AudioDVP This is the official implementation of Photorealistic Audio-driven Video Portraits. Major Requirements Ubuntu = 18.04 PyTorch = 1.2 GCC =

232 Jan 03, 2023
Python wrapper around sox.

pysox Python wrapper around sox. Read the Docs here. This library was presented in the following paper: R. M. Bittner, E. J. Humphrey and J. P. Bello,

Rachel Bittner 446 Dec 07, 2022
Python audio and music signal processing library

madmom Madmom is an audio signal processing library written in Python with a strong focus on music information retrieval (MIR) tasks. The library is i

Institute of Computational Perception 1k Dec 26, 2022
In this project we can see how we can generate automatic music using character RNN.

Automatic Music Genaration Table of Contents Project Description Approach towards the problem Limitations Libraries Used Summary Applications Referenc

Pronay Ghosh 2 May 27, 2022
extract unpack asset file (form unreal engine 4 pak) with extenstion *.uexp which contain awb/acb (cri/cpk like) sound or music resource

Uexp2Awb extract unpack asset file (form unreal engine 4 pak) with extenstion .uexp which contain awb/acb (cri/cpk like) sound or music resource. i ju

max 6 Jun 22, 2022
Omniscient Mozart, being able to transcribe everything in the music, including vocal, drum, chord, beat, instruments, and more.

OMNIZART Omnizart is a Python library that aims for democratizing automatic music transcription. Given polyphonic music, it is able to transcribe pitc

MCTLab 1.3k Jan 08, 2023
Python tools for the corpus analysis of popular music.

CATCHY Corpus Analysis Tools for Computational Hook discovery Python tools for the corpus analysis of popular music recordings. The tools can be used

Jan VB 20 Aug 20, 2022
Using python to generate a bat script of repetitive lines of code that differ in some way but can sort out a group of audio files according to their common names

Batch Sorting Using python to generate a bat script of repetitive lines of code that differ in some way but can sort out a group of audio files accord

David Mainoo 1 Oct 29, 2021
Gammatone-based spectrograms, using gammatone filterbanks or Fourier transform weightings.

Gammatone Filterbank Toolkit Utilities for analysing sound using perceptual models of human hearing. Jason Heeris, 2013 Summary This is a port of Malc

Jason Heeris 188 Dec 14, 2022
music library manager and MusicBrainz tagger

beets Beets is the media library management system for obsessive music geeks. The purpose of beets is to get your music collection right once and for

beetbox 11.3k Dec 31, 2022