Audio Classification with CNN-LSTM networks
January 1, 2023
Synopsis
In this project, I classify one-second audio clips of the spoken words “one”, “two”, “three”, …, “zero”. The data comes from the TensorFlow Speech Recognition Challenge. I deviate slightly from the competition in one respect: I restrict the target classes to the words listed above.
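Restricting the dataset to these classes can be sketched as below. This is a minimal sketch, assuming the clips sit in one folder per spoken word (`<root>/<word>/<clip>.wav`); the function name `list_digit_clips` and the label order are illustrative and not taken from the project code.

```python
from pathlib import Path

# Target words; the order here (and so the integer labels) is an assumption.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
LABEL_TO_INDEX = {word: i for i, word in enumerate(DIGITS)}


def list_digit_clips(root):
    """Return (path, label_index) pairs for .wav files whose parent folder is a digit word."""
    clips = []
    for wav in sorted(Path(root).glob("*/*.wav")):
        word = wav.parent.name
        if word in LABEL_TO_INDEX:  # skip every non-digit class
            clips.append((wav, LABEL_TO_INDEX[word]))
    return clips
```

Calling `list_digit_clips` on the folder that holds the per-word directories returns only the clips for the ten digit words, each paired with an integer label.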
I trained 3 models on the data:
- A baseline CNN model
 
================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
================================================================================
BaseModel                                [128, 10]                 --
├─convblock: 1-1                         [128, 16, 65, 23]         --
│    └─Sequential: 2-1                   [128, 16, 65, 23]         --
│    │    └─Conv2d: 3-1                  [128, 16, 130, 46]        160
│    │    └─ReLU: 3-2                    [128, 16, 130, 46]        --
│    │    └─MaxPool2d: 3-3               [128, 16, 65, 23]         --
├─convblock: 1-2                         [128, 32, 33, 12]         --
│    └─Sequential: 2-2                   [128, 32, 33, 12]         --
│    │    └─Conv2d: 3-4                  [128, 32, 67, 25]         4,640
│    │    └─ReLU: 3-5                    [128, 32, 67, 25]         --
│    │    └─MaxPool2d: 3-6               [128, 32, 33, 12]         --
├─convblock: 1-3                         [128, 64, 17, 7]          --
│    └─Sequential: 2-3                   [128, 64, 17, 7]          --
│    │    └─Conv2d: 3-7                  [128, 64, 35, 14]         18,496
│    │    └─ReLU: 3-8                    [128, 64, 35, 14]         --
│    │    └─MaxPool2d: 3-9               [128, 64, 17, 7]          --
├─convblock: 1-4                         [128, 128, 9, 4]          --
│    └─Sequential: 2-4                   [128, 128, 9, 4]          --
│    │    └─Conv2d: 3-10                 [128, 128, 19, 9]         73,856
│    │    └─ReLU: 3-11                   [128, 128, 19, 9]         --
│    │    └─MaxPool2d: 3-12              [128, 128, 9, 4]          --
├─Flatten: 1-5                           [128, 4608]               --
├─Linear: 1-6                            [128, 10]                 46,090
├─Softmax: 1-7                           [128, 10]                 --
================================================================================
Total params: 143,242
Trainable params: 143,242
Non-trainable params: 0
Total mult-adds (G): 3.90
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 207.40
Params size (MB): 0.57
Estimated Total Size (MB): 210.86
================================================================================
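The output shapes and parameter counts in the baseline summary can be reproduced with a little arithmetic. The sketch below assumes details the summary does not state directly: a single-channel 128×44 input, 3×3 kernels with padding 2, and 2×2 max pooling. These values were chosen because they match the reported numbers; the helper names are illustrative.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output length of a convolution along one axis (PyTorch convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_params(c_in, c_out, kh, kw):
    """Kernel weights plus one bias per output channel."""
    return c_out * c_in * kh * kw + c_out


# Assumed input: batch of 128 float32 spectrograms, each 1 x 128 x 44.
input_mb = 128 * 1 * 128 * 44 * 4 / 1e6

h, w, c_in = 128, 44, 1
total = 0
shapes = []
for c_out in (16, 32, 64, 128):
    # Assumed 3x3 convolution with padding 2: each axis grows by 2.
    h, w = conv_out(h, 3, padding=2), conv_out(w, 3, padding=2)
    total += conv2d_params(c_in, c_out, 3, 3)
    # 2x2 max pooling halves each axis, flooring odd sizes.
    h, w = h // 2, w // 2
    shapes.append((c_out, h, w))
    c_in = c_out

flat = c_in * h * w        # features after Flatten
total += flat * 10 + 10    # final Linear layer to 10 classes

print(shapes)              # per-block (channels, height, width) after pooling
print(flat, total, round(input_mb, 2))
```

Under these assumptions the block shapes, the 4,608 flattened features, the 143,242 total parameters, and the 2.88 MB input size all agree with the summary.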
- A CRNN model, with an LSTM following a stack of CNN blocks
 
================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
================================================================================
CRNN                                     [128, 10]                 --
├─convbloc: 1-1                          [128, 128, 22]            --
│    └─Sequential: 2-1                   [128, 128, 22]            --
│    │    └─Conv1d: 3-1                  [128, 128, 44]            82,048
│    │    └─BatchNorm1d: 3-2             [128, 128, 44]            256
│    │    └─ReLU: 3-3                    [128, 128, 44]            --
│    │    └─MaxPool1d: 3-4               [128, 128, 22]            --
├─convbloc: 1-2                          [128, 128, 11]            --
│    └─Sequential: 2-2                   [128, 128, 11]            --
│    │    └─Conv1d: 3-5                  [128, 128, 22]            82,048
│    │    └─BatchNorm1d: 3-6             [128, 128, 22]            256
│    │    └─ReLU: 3-7                    [128, 128, 22]            --
│    │    └─MaxPool1d: 3-8               [128, 128, 11]            --
├─convbloc: 1-3                          [128, 256, 5]             --
│    └─Sequential: 2-3                   [128, 256, 5]             --
│    │    └─Conv1d: 3-9                  [128, 256, 11]            164,096
│    │    └─BatchNorm1d: 3-10            [128, 256, 11]            512
│    │    └─ReLU: 3-11                   [128, 256, 11]            --
│    │    └─MaxPool1d: 3-12              [128, 256, 5]             --
├─LSTM: 1-4                              [128, 256, 96]            39,552
├─Flatten: 1-5                           [128, 24576]              --
├─Sequential: 1-6                        [128, 64]                 --
│    └─Linear: 2-4                       [128, 64]                 1,572,928
│    └─ReLU: 2-5                         [128, 64]                 --
├─Linear: 1-7                            [128, 10]                 650
├─Softmax: 1-8                           [128, 10]                 --
================================================================================
Total params: 1,942,346
Trainable params: 1,942,346
Non-trainable params: 0
Total mult-adds (G): 2.42
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 48.31
Params size (MB): 7.77
Estimated Total Size (MB): 58.96
================================================================================
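The CRNN parameter counts can be checked the same way. This is a sketch under assumptions inferred from the numbers rather than stated in the post: the 128 Mel bins act as Conv1d input channels, each Conv1d uses kernel size 5 with padding 2, and the LSTM is a single unidirectional layer with hidden size 96. The LSTM formula follows PyTorch's layout (four gates, each with input weights, recurrent weights, and two bias vectors).

```python
def conv1d_params(c_in, c_out, kernel):
    """Kernel weights plus one bias per output channel."""
    return c_out * c_in * kernel + c_out


def lstm_params(input_size, hidden_size, bidirectional=False):
    """Parameter count of a single-layer LSTM in PyTorch's layout."""
    per_direction = 4 * (hidden_size * (input_size + hidden_size) + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)


# Time axis: a "same" convolution (kernel 5, padding 2) keeps the length,
# then MaxPool1d(2) halves it, flooring odd lengths.
length = 44
for _ in range(3):
    length = (length + 2 * 2 - 5) + 1
    length //= 2

conv_counts = [conv1d_params(128, 128, 5),
               conv1d_params(128, 128, 5),
               conv1d_params(128, 256, 5)]
bn_counts = [2 * 128, 2 * 128, 2 * 256]      # BatchNorm1d: scale and shift per channel
lstm_count = lstm_params(input_size=length, hidden_size=96)
fc1 = 256 * 96 * 64 + 64                     # Linear after Flatten
fc2 = 64 * 10 + 10                           # output layer

total = sum(conv_counts) + sum(bn_counts) + lstm_count + fc1 + fc2
print(length, conv_counts, lstm_count, total)
```

The LSTM count only matches with an input size of 5, the length left on the pooled time axis. Note also that the first dense layer holds about 1.57 million of the 1.94 million parameters.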
- A Parallel CNN-LSTM model, where the inputs pass through 4 CNN blocks and an LSTM block in parallel, and the two outputs are then concatenated
 
================================================================================
Layer (type:depth-idx)                   Output Shape              Param #
================================================================================
ParallelNet                              [128, 10]                 --
├─CNNBLock: 1-1                          [128, 16, 63, 22]         --
│    └─Sequential: 2-1                   [128, 16, 63, 22]         --
│    │    └─Conv2d: 3-1                  [128, 16, 126, 44]        64
│    │    └─BatchNorm2d: 3-2             [128, 16, 126, 44]        32
│    │    └─ReLU: 3-3                    [128, 16, 126, 44]        --
│    │    └─MaxPool2d: 3-4               [128, 16, 63, 22]         --
├─CNNBLock: 1-2                          [128, 32, 30, 11]         --
│    └─Sequential: 2-2                   [128, 32, 30, 11]         --
│    │    └─Conv2d: 3-5                  [128, 32, 61, 22]         1,568
│    │    └─BatchNorm2d: 3-6             [128, 32, 61, 22]         64
│    │    └─ReLU: 3-7                    [128, 32, 61, 22]         --
│    │    └─MaxPool2d: 3-8               [128, 32, 30, 11]         --
├─CNNBLock: 1-3                          [128, 64, 14, 5]          --
│    └─Sequential: 2-3                   [128, 64, 14, 5]          --
│    │    └─Conv2d: 3-9                  [128, 64, 28, 11]         6,208
│    │    └─BatchNorm2d: 3-10            [128, 64, 28, 11]         128
│    │    └─ReLU: 3-11                   [128, 64, 28, 11]         --
│    │    └─MaxPool2d: 3-12              [128, 64, 14, 5]          --
├─CNNBLock: 1-4                          [128, 64, 3, 1]           --
│    └─Sequential: 2-4                   [128, 64, 3, 1]           --
│    │    └─Conv2d: 3-13                 [128, 64, 12, 5]          12,352
│    │    └─BatchNorm2d: 3-14            [128, 64, 12, 5]          128
│    │    └─ReLU: 3-15                   [128, 64, 12, 5]          --
│    │    └─MaxPool2d: 3-16              [128, 64, 3, 1]           --
├─Flatten: 1-5                           [128, 192]                --
├─RNNBlock: 1-6                          [128, 32, 256]            --
│    └─MaxPool2d: 2-5                    [128, 1, 32, 22]          --
│    └─LSTM: 2-6                         [128, 32, 256]            155,648
├─Flatten: 1-7                           [128, 8192]               --
├─Linear: 1-8                            [128, 10]                 83,850
├─Softmax: 1-9                           [128, 10]                 --
================================================================================
Total params: 260,042
Trainable params: 260,042
Non-trainable params: 0
Total mult-adds (G): 1.30
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 326.25
Params size (MB): 1.04
Estimated Total Size (MB): 330.17
================================================================================
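The summary does not show the concatenation as a layer, but the sizes imply it: the final Linear layer's parameter count only works out if its input is the CNN branch's flattened features joined with the LSTM branch's flattened features. Here is a minimal sketch of that arithmetic; `concat_branches` is a hypothetical helper, not the project's code.

```python
cnn_features = 64 * 3 * 1     # Flatten: 1-5, output of the CNN branch
lstm_features = 32 * 256      # Flatten: 1-7, output of the LSTM branch

fused_features = cnn_features + lstm_features
classifier_params = fused_features * 10 + 10   # Linear to 10 classes


def concat_branches(cnn_vec, lstm_vec):
    """Join two per-example feature vectors end to end."""
    return list(cnn_vec) + list(lstm_vec)


# Placeholder vectors stand in for one example's branch outputs.
example = concat_branches([0.0] * cnn_features, [0.0] * lstm_features)
print(fused_features, classifier_params, len(example))
```

The fused width of 8,384 features gives 83,850 parameters for the classifier, which is the figure reported for `Linear: 1-8`.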
Model performance
The table below lists each model's accuracy on the validation and test sets, along with the number of epochs it was trained for.
| Model | # Epochs | Validation Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | 3 | 26.4 | 27.6 |
| CRNN | 3 | 56.94 | 56.62 |
| Parallel CNN-LSTM | 6 | 84.92 | 84.29 |
Classification report for Parallel CNN-LSTM:
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| zero | 0.91 | 0.87 | 0.89 | 250 |
| one | 0.71 | 0.90 | 0.79 | 248 |
| two | 0.78 | 0.81 | 0.80 | 264 |
| three | 0.81 | 0.91 | 0.86 | 267 |
| four | 0.91 | 0.76 | 0.83 | 253 |
| five | 0.86 | 0.73 | 0.79 | 271 |
| six | 0.95 | 0.89 | 0.92 | 244 |
| seven | 0.81 | 0.92 | 0.85 | 239 |
| eight | 0.91 | 0.82 | 0.86 | 257 |
| nine | 0.84 | 0.83 | 0.83 | 259 |
| Accuracy | | | 0.84 | 2552 |
| Macro avg | 0.85 | 0.84 | 0.84 | 2552 |
| Weighted avg | 0.85 | 0.84 | 0.84 | 2552 |
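The summary rows of the report follow from the per-class rows. The sketch below recomputes them from the two-decimal values in the table, so small differences from the reported F1-scores are expected, since the original report would have been computed from unrounded counts. Function and variable names are illustrative.

```python
# (precision, recall, reported F1, support) per class, copied from the table.
report = {
    "zero":  (0.91, 0.87, 0.89, 250),
    "one":   (0.71, 0.90, 0.79, 248),
    "two":   (0.78, 0.81, 0.80, 264),
    "three": (0.81, 0.91, 0.86, 267),
    "four":  (0.91, 0.76, 0.83, 253),
    "five":  (0.86, 0.73, 0.79, 271),
    "six":   (0.95, 0.89, 0.92, 244),
    "seven": (0.81, 0.92, 0.85, 239),
    "eight": (0.91, 0.82, 0.86, 257),
    "nine":  (0.84, 0.83, 0.83, 259),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


rows = list(report.values())
total_support = sum(s for _, _, _, s in rows)

# Macro average: plain mean over classes, ignoring class size.
macro_precision = sum(p for p, _, _, _ in rows) / len(rows)
macro_recall = sum(r for _, r, _, _ in rows) / len(rows)

# Support-weighted recall equals overall accuracy.
weighted_recall = sum(r * s for _, r, _, s in rows) / total_support

# Largest gap between recomputed and reported per-class F1.
max_f1_gap = max(abs(f1_score(p, r) - f1) for p, r, f1, _ in rows)

print(total_support, round(macro_precision, 2), round(macro_recall, 2),
      round(weighted_recall, 2))
```

The table also shows where the model struggles: "one" has the lowest precision (0.71) while "five" has the lowest recall (0.73).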
Final training ran for only a small number of epochs, for two reasons: a GPU error left me without GPU access, and validation accuracy decreased when training continued past that point.
Some observations
Audio can be treated as a kind of spatio-temporal data. Handled as a sequence, it has temporal structure; converted to a spectrogram (here, a Mel-spectrogram), it is effectively an image, and images carry spatial information.
Thus audio can be processed with sequential models (RNN, LSTM, and others in that family) as well as convolutional models.
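The two views can be made concrete with a placeholder spectrogram. This is a sketch under assumptions: the 128×44 shape is inferred from the model summaries, and the sample rate and hop length are just one combination that yields 44 frames for a one-second clip. They happen to be librosa's defaults, but the post does not say which settings were used.

```python
def n_frames(n_samples, hop_length):
    """Frame count of a centered STFT: one frame per hop, plus one."""
    return 1 + n_samples // hop_length


# Assumed: 1 second at 22,050 Hz, hop length 512, 128 Mel bins.
frames = n_frames(22050, 512)
n_mels = 128

# Placeholder Mel-spectrogram laid out as [n_mels][frames].
spec = [[0.0] * frames for _ in range(n_mels)]

# Image view for a 2D CNN: 1 channel x 128 rows x 44 columns.
image_view = [spec]

# Sequence view for an RNN: 44 time steps, each a vector of 128 Mel values.
sequence_view = [list(step) for step in zip(*spec)]

print(len(image_view), len(image_view[0]), len(image_view[0][0]))
print(len(sequence_view), len(sequence_view[0]))
```

Both views hold the same numbers; only the axis that the model treats as "position" changes.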
Links
GitHub: word-classification-with-pytorch
Report: Link