Audio Classification with CNN-LSTM Networks

January 1, 2023

Synopsis

In this project, I aim to classify one-second audio clips of the spoken digits “one”, “two”, “three”, …, “zero”. The data comes from the TensorFlow Speech Recognition Challenge; however, I deviate slightly from the competition in that I restrict the target classes to the ten digit words listed above.

I trained 3 models on the data:

  1. A baseline CNN model; a minimal PyTorch sketch follows the summary below.
================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
================================================================================
BaseModel                                [128, 10]                 --
├─convblock: 1-1                         [128, 16, 65, 23]         --
│    └─Sequential: 2-1                   [128, 16, 65, 23]         --
│    │    └─Conv2d: 3-1                  [128, 16, 130, 46]        160
│    │    └─ReLU: 3-2                    [128, 16, 130, 46]        --
│    │    └─MaxPool2d: 3-3               [128, 16, 65, 23]         --
├─convblock: 1-2                         [128, 32, 33, 12]         --
│    └─Sequential: 2-2                   [128, 32, 33, 12]         --
│    │    └─Conv2d: 3-4                  [128, 32, 67, 25]         4,640
│    │    └─ReLU: 3-5                    [128, 32, 67, 25]         --
│    │    └─MaxPool2d: 3-6               [128, 32, 33, 12]         --
├─convblock: 1-3                         [128, 64, 17, 7]          --
│    └─Sequential: 2-3                   [128, 64, 17, 7]          --
│    │    └─Conv2d: 3-7                  [128, 64, 35, 14]         18,496
│    │    └─ReLU: 3-8                    [128, 64, 35, 14]         --
│    │    └─MaxPool2d: 3-9               [128, 64, 17, 7]          --
├─convblock: 1-4                         [128, 128, 9, 4]          --
│    └─Sequential: 2-4                   [128, 128, 9, 4]          --
│    │    └─Conv2d: 3-10                 [128, 128, 19, 9]         73,856
│    │    └─ReLU: 3-11                   [128, 128, 19, 9]         --
│    │    └─MaxPool2d: 3-12              [128, 128, 9, 4]          --
├─Flatten: 1-5                           [128, 4608]               --
├─Linear: 1-6                            [128, 10]                 46,090
├─Softmax: 1-7                           [128, 10]                 --
================================================================================
Total params: 143,242
Trainable params: 143,242
Non-trainable params: 0
Total mult-adds (G): 3.90
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 207.40
Params size (MB): 0.57
Estimated Total Size (MB): 210.86
================================================================================
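The baseline stacks four Conv2d → ReLU → MaxPool blocks on a single-channel Mel-spectrogram and ends with a linear classifier. Below is a minimal PyTorch sketch that reproduces the shapes and parameter counts printed above (kernel size 3, padding 2, pooling factor 2); module and argument names are illustrative, not necessarily the ones used in the repository:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Conv2d(k=3, padding=2) + ReLU + MaxPool2d(2), matching the summary above.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.block(x)

class BaseModel(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            ConvBlock(1, 16), ConvBlock(16, 32),
            ConvBlock(32, 64), ConvBlock(64, 128),
        )
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(128 * 9 * 4, n_classes)  # 4608 -> 10
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):  # x: (batch, 1, 128, 44) Mel-spectrogram
        x = self.flatten(self.features(x))
        return self.softmax(self.fc(x))

# Shape check: BaseModel()(torch.randn(128, 1, 128, 44)).shape -> (128, 10)
```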
  2. A CRNN model, with an LSTM following the 1-D convolution blocks; a minimal sketch follows the summary below.
================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
================================================================================
CRNN                                     [128, 10]                 --
├─convbloc: 1-1                          [128, 128, 22]            --
│    └─Sequential: 2-1                   [128, 128, 22]            --
│    │    └─Conv1d: 3-1                  [128, 128, 44]            82,048
│    │    └─BatchNorm1d: 3-2             [128, 128, 44]            256
│    │    └─ReLU: 3-3                    [128, 128, 44]            --
│    │    └─MaxPool1d: 3-4               [128, 128, 22]            --
├─convbloc: 1-2                          [128, 128, 11]            --
│    └─Sequential: 2-2                   [128, 128, 11]            --
│    │    └─Conv1d: 3-5                  [128, 128, 22]            82,048
│    │    └─BatchNorm1d: 3-6             [128, 128, 22]            256
│    │    └─ReLU: 3-7                    [128, 128, 22]            --
│    │    └─MaxPool1d: 3-8               [128, 128, 11]            --
├─convbloc: 1-3                          [128, 256, 5]             --
│    └─Sequential: 2-3                   [128, 256, 5]             --
│    │    └─Conv1d: 3-9                  [128, 256, 11]            164,096
│    │    └─BatchNorm1d: 3-10            [128, 256, 11]            512
│    │    └─ReLU: 3-11                   [128, 256, 11]            --
│    │    └─MaxPool1d: 3-12              [128, 256, 5]             --
├─LSTM: 1-4                              [128, 256, 96]            39,552
├─Flatten: 1-5                           [128, 24576]              --
├─Sequential: 1-6                        [128, 64]                 --
│    └─Linear: 2-4                       [128, 64]                 1,572,928
│    └─ReLU: 2-5                         [128, 64]                 --
├─Linear: 1-7                            [128, 10]                 650
├─Softmax: 1-8                           [128, 10]                 --
================================================================================
Total params: 1,942,346
Trainable params: 1,942,346
Non-trainable params: 0
Total mult-adds (G): 2.42
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 48.31
Params size (MB): 7.77
Estimated Total Size (MB): 58.96
================================================================================
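In the CRNN the 128 Mel bins are treated as channels of a 1-D signal: three Conv1d → BatchNorm → ReLU → MaxPool blocks shrink the time axis, and the result is passed through an LSTM before the fully connected head. The sketch below is one plausible reading of the summary above (kernel size 5, padding 2; the LSTM reads the 256 output channels as time steps with 5 features each, which matches the printed 39,552 LSTM parameters); names are illustrative:

```python
import torch
import torch.nn as nn

class ConvBlock1d(nn.Module):
    # Conv1d(k=5, padding=2) + BatchNorm1d + ReLU + MaxPool1d(2).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
            nn.BatchNorm1d(out_ch),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )

    def forward(self, x):
        return self.block(x)

class CRNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.convs = nn.Sequential(
            ConvBlock1d(128, 128),  # 128 Mel bins as input channels
            ConvBlock1d(128, 128),
            ConvBlock1d(128, 256),
        )
        # Each of the 256 conv channels becomes a time step with 5 features.
        self.lstm = nn.LSTM(input_size=5, hidden_size=96, batch_first=True)
        self.flatten = nn.Flatten()
        self.head = nn.Sequential(nn.Linear(256 * 96, 64), nn.ReLU())
        self.out = nn.Linear(64, n_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):        # x: (batch, 128, 44)
        x = self.convs(x)        # (batch, 256, 5)
        x, _ = self.lstm(x)      # (batch, 256, 96)
        x = self.head(self.flatten(x))
        return self.softmax(self.out(x))
```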
  3. A parallel CNN-LSTM model, where the input goes through four CNN blocks and an LSTM block in parallel and the two branch outputs are then concatenated; a minimal sketch follows the summary below.
================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
================================================================================
ParallelNet                              [128, 10]                 --
├─CNNBLock: 1-1                          [128, 16, 63, 22]         --
│    └─Sequential: 2-1                   [128, 16, 63, 22]         --
│    │    └─Conv2d: 3-1                  [128, 16, 126, 44]        64
│    │    └─BatchNorm2d: 3-2             [128, 16, 126, 44]        32
│    │    └─ReLU: 3-3                    [128, 16, 126, 44]        --
│    │    └─MaxPool2d: 3-4               [128, 16, 63, 22]         --
├─CNNBLock: 1-2                          [128, 32, 30, 11]         --
│    └─Sequential: 2-2                   [128, 32, 30, 11]         --
│    │    └─Conv2d: 3-5                  [128, 32, 61, 22]         1,568
│    │    └─BatchNorm2d: 3-6             [128, 32, 61, 22]         64
│    │    └─ReLU: 3-7                    [128, 32, 61, 22]         --
│    │    └─MaxPool2d: 3-8               [128, 32, 30, 11]         --
├─CNNBLock: 1-3                          [128, 64, 14, 5]          --
│    └─Sequential: 2-3                   [128, 64, 14, 5]          --
│    │    └─Conv2d: 3-9                  [128, 64, 28, 11]         6,208
│    │    └─BatchNorm2d: 3-10            [128, 64, 28, 11]         128
│    │    └─ReLU: 3-11                   [128, 64, 28, 11]         --
│    │    └─MaxPool2d: 3-12              [128, 64, 14, 5]          --
├─CNNBLock: 1-4                          [128, 64, 3, 1]           --
│    └─Sequential: 2-4                   [128, 64, 3, 1]           --
│    │    └─Conv2d: 3-13                 [128, 64, 12, 5]          12,352
│    │    └─BatchNorm2d: 3-14            [128, 64, 12, 5]          128
│    │    └─ReLU: 3-15                   [128, 64, 12, 5]          --
│    │    └─MaxPool2d: 3-16              [128, 64, 3, 1]           --
├─Flatten: 1-5                           [128, 192]                --
├─RNNBlock: 1-6                          [128, 32, 256]            --
│    └─MaxPool2d: 2-5                    [128, 1, 32, 22]          --
│    └─LSTM: 2-6                         [128, 32, 256]            155,648
├─Flatten: 1-7                           [128, 8192]               --
├─Linear: 1-8                            [128, 10]                 83,850
├─Softmax: 1-9                           [128, 10]                 --
================================================================================
Total params: 260,042
Trainable params: 260,042
Non-trainable params: 0
Total mult-adds (G): 1.30
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 326.25
Params size (MB): 1.04
Estimated Total Size (MB): 330.17
================================================================================
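The parallel model runs two branches over the same Mel-spectrogram: a CNN branch of four blocks with (3×1) kernels that produces a compact 192-dimensional feature vector, and an RNN branch that downsamples the input with a max-pool and feeds it to a bidirectional LSTM; the flattened branch outputs are concatenated and classified. The sketch below is consistent with the shapes and parameter counts printed above (kernel, pooling, and hidden sizes are inferred; names are illustrative):

```python
import torch
import torch.nn as nn

class CNNBlock(nn.Module):
    # (3x1) Conv2d + BatchNorm2d + ReLU + MaxPool2d, matching the summary above.
    def __init__(self, in_ch, out_ch, pool):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1)),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.MaxPool2d(pool),
        )

    def forward(self, x):
        return self.block(x)

class ParallelNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # CNN branch
        self.cnn = nn.Sequential(
            CNNBlock(1, 16, pool=2),
            CNNBlock(16, 32, pool=2),
            CNNBlock(32, 64, pool=2),
            CNNBlock(64, 64, pool=4),
        )
        # RNN branch: downsample the spectrogram, then run a bidirectional LSTM.
        self.pool = nn.MaxPool2d((4, 2))
        self.lstm = nn.LSTM(input_size=22, hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(192 + 32 * 256, n_classes)  # concatenated branches
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):                    # x: (batch, 1, 128, 44)
        cnn_out = self.flatten(self.cnn(x))  # (batch, 192)
        seq = self.pool(x).squeeze(1)        # (batch, 32, 22)
        rnn_out, _ = self.lstm(seq)          # (batch, 32, 256)
        rnn_out = self.flatten(rnn_out)      # (batch, 8192)
        both = torch.cat([cnn_out, rnn_out], dim=1)
        return self.softmax(self.fc(both))
```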

Model performance

Validation and test accuracy for each model, along with the number of epochs it was trained for, is listed in the table below.

| Model | # Epochs | Validation Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | 3 | 26.4 | 27.6 |
| CRNN | 3 | 56.94 | 56.62 |
| Parallel CNN-LSTM | 6 | 84.92 | 84.29 |

Classification report for Parallel CNN-LSTM:

| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| zero | 0.91 | 0.87 | 0.89 | 250 |
| one | 0.71 | 0.90 | 0.79 | 248 |
| two | 0.78 | 0.81 | 0.80 | 264 |
| three | 0.81 | 0.91 | 0.86 | 267 |
| four | 0.91 | 0.76 | 0.83 | 253 |
| five | 0.86 | 0.73 | 0.79 | 271 |
| six | 0.95 | 0.89 | 0.92 | 244 |
| seven | 0.81 | 0.92 | 0.85 | 239 |
| eight | 0.91 | 0.82 | 0.86 | 257 |
| nine | 0.84 | 0.83 | 0.83 | 259 |
| Accuracy | | | 0.84 | 2552 |
| Macro avg | 0.85 | 0.84 | 0.84 | 2552 |
| Weighted avg | 0.85 | 0.84 | 0.84 | 2552 |
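The report above follows the format of scikit-learn's classification_report. A minimal sketch of producing such a report, using placeholder labels and predictions rather than the project's actual outputs:

```python
import numpy as np
from sklearn.metrics import classification_report

labels = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

# Placeholder test labels and predictions; the real ones come from the model.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=2552)
y_pred = np.where(rng.random(2552) < 0.84, y_true, rng.integers(0, 10, size=2552))

print(classification_report(y_true, y_pred, target_names=labels, digits=2))
```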

Final training was done for a small number of epochs: GPU access was limited by a GPU error, and validation accuracy began to decrease when training continued beyond that point.

Some observations

Audio can be treated as a spatio-temporal data type: handled as a raw sequence it has temporal structure, while converted to a spectrogram (here a Mel-spectrogram) it is essentially an image, and images carry spatial information.

Audio can therefore be processed both with sequential models (RNNs, LSTMs, and the rest of that family) and with convolutional models.
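For reference, turning a one-second clip into a Mel-spectrogram with torchaudio looks roughly like this. The file path and STFT parameters are illustrative (they are not specified in this post); with these settings a 1 s, 16 kHz clip gives a spectrogram of about 128 × 44, in line with the input shapes in the summaries above:

```python
import torchaudio

# Hypothetical path into the Speech Commands data; any 1 s, 16 kHz clip works.
waveform, sample_rate = torchaudio.load("data/one/0a7c2a8d_nohash_0.wav")

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,  # 16 kHz for this dataset
    n_fft=1024,
    hop_length=372,           # illustrative; yields ~44 frames for a 1 s clip
    n_mels=128,               # 128 Mel bins, the "height" of the image
)

mel = to_mel(waveform)                                # (1, 128, ~44)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel)  # optional dB scaling
print(log_mel.shape)
```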

GitHub: word-classification-with-pytorch
Report: Link