Audio Classification with CNN-LSTM networks
January 1, 2023
In this project, I classify one-second audio clips of the spoken digits “zero”, “one”, “two”, …, “nine”. The data comes from the TensorFlow Speech Recognition Challenge. I deviate slightly from the competition in the target classes: I keep only the ten digit words listed above.
I trained 3 models on the data:
- A baseline CNN model
Layer (type:depth-idx) Output Shape Param #
BaseModel [128, 10] --
├─convblock: 1-1 [128, 16, 65, 23] --
│ └─Sequential: 2-1 [128, 16, 65, 23] --
│ │ └─Conv2d: 3-1 [128, 16, 130, 46] 160
│ │ └─ReLU: 3-2 [128, 16, 130, 46] --
│ │ └─MaxPool2d: 3-3 [128, 16, 65, 23] --
├─convblock: 1-2 [128, 32, 33, 12] --
│ └─Sequential: 2-2 [128, 32, 33, 12] --
│ │ └─Conv2d: 3-4 [128, 32, 67, 25] 4,640
│ │ └─ReLU: 3-5 [128, 32, 67, 25] --
│ │ └─MaxPool2d: 3-6 [128, 32, 33, 12] --
├─convblock: 1-3 [128, 64, 17, 7] --
│ └─Sequential: 2-3 [128, 64, 17, 7] --
│ │ └─Conv2d: 3-7 [128, 64, 35, 14] 18,496
│ │ └─ReLU: 3-8 [128, 64, 35, 14] --
│ │ └─MaxPool2d: 3-9 [128, 64, 17, 7] --
├─convblock: 1-4 [128, 128, 9, 4] --
│ └─Sequential: 2-4 [128, 128, 9, 4] --
│ │ └─Conv2d: 3-10 [128, 128, 19, 9] 73,856
│ │ └─ReLU: 3-11 [128, 128, 19, 9] --
│ │ └─MaxPool2d: 3-12 [128, 128, 9, 4] --
├─Flatten: 1-5 [128, 4608] --
├─Linear: 1-6 [128, 10] 46,090
├─Softmax: 1-7 [128, 10] --
Total params: 143,242
Trainable params: 143,242
Non-trainable params: 0
Total mult-adds (G): 3.90
Input size (MB): 2.88
Forward/backward pass size (MB): 207.40
Params size (MB): 0.57
Estimated Total Size (MB): 210.86
- A CRNN model, in which an LSTM follows a stack of 1-D convolutional blocks
Layer (type:depth-idx) Output Shape Param #
CRNN [128, 10] --
├─convbloc: 1-1 [128, 128, 22] --
│ └─Sequential: 2-1 [128, 128, 22] --
│ │ └─Conv1d: 3-1 [128, 128, 44] 82,048
│ │ └─BatchNorm1d: 3-2 [128, 128, 44] 256
│ │ └─ReLU: 3-3 [128, 128, 44] --
│ │ └─MaxPool1d: 3-4 [128, 128, 22] --
├─convbloc: 1-2 [128, 128, 11] --
│ └─Sequential: 2-2 [128, 128, 11] --
│ │ └─Conv1d: 3-5 [128, 128, 22] 82,048
│ │ └─BatchNorm1d: 3-6 [128, 128, 22] 256
│ │ └─ReLU: 3-7 [128, 128, 22] --
│ │ └─MaxPool1d: 3-8 [128, 128, 11] --
├─convbloc: 1-3 [128, 256, 5] --
│ └─Sequential: 2-3 [128, 256, 5] --
│ │ └─Conv1d: 3-9 [128, 256, 11] 164,096
│ │ └─BatchNorm1d: 3-10 [128, 256, 11] 512
│ │ └─ReLU: 3-11 [128, 256, 11] --
│ │ └─MaxPool1d: 3-12 [128, 256, 5] --
├─LSTM: 1-4 [128, 256, 96] 39,552
├─Flatten: 1-5 [128, 24576] --
├─Sequential: 1-6 [128, 64] --
│ └─Linear: 2-4 [128, 64] 1,572,928
│ └─ReLU: 2-5 [128, 64] --
├─Linear: 1-7 [128, 10] 650
├─Softmax: 1-8 [128, 10] --
Total params: 1,942,346
Trainable params: 1,942,346
Non-trainable params: 0
Total mult-adds (G): 2.42
Input size (MB): 2.88
Forward/backward pass size (MB): 48.31
Params size (MB): 7.77
Estimated Total Size (MB): 58.96
- A parallel CNN-LSTM model, in which the input passes through a stack of CNN blocks (four in the summary below) and an LSTM block in parallel, and the two outputs are then concatenated
Layer (type:depth-idx) Output Shape Param #
ParallelNet [128, 10] --
├─CNNBLock: 1-1 [128, 16, 63, 22] --
│ └─Sequential: 2-1 [128, 16, 63, 22] --
│ │ └─Conv2d: 3-1 [128, 16, 126, 44] 64
│ │ └─BatchNorm2d: 3-2 [128, 16, 126, 44] 32
│ │ └─ReLU: 3-3 [128, 16, 126, 44] --
│ │ └─MaxPool2d: 3-4 [128, 16, 63, 22] --
├─CNNBLock: 1-2 [128, 32, 30, 11] --
│ └─Sequential: 2-2 [128, 32, 30, 11] --
│ │ └─Conv2d: 3-5 [128, 32, 61, 22] 1,568
│ │ └─BatchNorm2d: 3-6 [128, 32, 61, 22] 64
│ │ └─ReLU: 3-7 [128, 32, 61, 22] --
│ │ └─MaxPool2d: 3-8 [128, 32, 30, 11] --
├─CNNBLock: 1-3 [128, 64, 14, 5] --
│ └─Sequential: 2-3 [128, 64, 14, 5] --
│ │ └─Conv2d: 3-9 [128, 64, 28, 11] 6,208
│ │ └─BatchNorm2d: 3-10 [128, 64, 28, 11] 128
│ │ └─ReLU: 3-11 [128, 64, 28, 11] --
│ │ └─MaxPool2d: 3-12 [128, 64, 14, 5] --
├─CNNBLock: 1-4 [128, 64, 3, 1] --
│ └─Sequential: 2-4 [128, 64, 3, 1] --
│ │ └─Conv2d: 3-13 [128, 64, 12, 5] 12,352
│ │ └─BatchNorm2d: 3-14 [128, 64, 12, 5] 128
│ │ └─ReLU: 3-15 [128, 64, 12, 5] --
│ │ └─MaxPool2d: 3-16 [128, 64, 3, 1] --
├─Flatten: 1-5 [128, 192] --
├─RNNBlock: 1-6 [128, 32, 256] --
│ └─MaxPool2d: 2-5 [128, 1, 32, 22] --
│ └─LSTM: 2-6 [128, 32, 256] 155,648
├─Flatten: 1-7 [128, 8192] --
├─Linear: 1-8 [128, 10] 83,850
├─Softmax: 1-9 [128, 10] --
Total params: 260,042
Trainable params: 260,042
Non-trainable params: 0
Total mult-adds (G): 1.30
Input size (MB): 2.88
Forward/backward pass size (MB): 326.25
Params size (MB): 1.04
Estimated Total Size (MB): 330.17
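As a sanity check on the three summaries above, the parameter counts can be reproduced with a little arithmetic. This is a minimal sketch, not the project's code: the kernel sizes, padding, pooling factors, LSTM sizes, and the 128 x 44 input shape are all assumptions inferred from the output shapes and parameter counts in the summaries, and the helper names are hypothetical.

```python
# Parameter-count and shape arithmetic. All hyperparameters below are
# inferred from the summaries (assumptions), not read from the model code.

def conv_params(c_in, c_out, kernel_elems):
    """Weights plus biases of a convolution with `kernel_elems` taps per filter."""
    return c_in * c_out * kernel_elems + c_out

def lstm_params(input_size, hidden_size, bidirectional=False):
    """Single-layer LSTM: 4 gates, each with input weights, recurrent
    weights, and two bias vectors (the PyTorch parameterisation)."""
    per_direction = 4 * (hidden_size * (input_size + hidden_size) + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv_out(size, kernel, padding=0):
    """Output length of a stride-1 convolution along one axis."""
    return size + 2 * padding - kernel + 1

# Baseline: assumed 3x3 Conv2d (padding 2) followed by 2x2 max-pooling, four times.
h, w = 128, 44  # assumed input: 128 Mel bands x 44 frames
for _ in range(4):
    h, w = conv_out(h, 3, 2) // 2, conv_out(w, 3, 2) // 2
baseline_flat = 128 * h * w  # 128 channels after the last block
baseline_total = (
    sum(conv_params(i, o, 9) for i, o in [(1, 16), (16, 32), (32, 64), (64, 128)])
    + linear_params(baseline_flat, 10)
)

# CRNN: assumed Conv1d with kernel 5, BatchNorm1d (2 params per channel),
# then an LSTM with input size 5 and hidden size 96.
crnn_total = (
    sum(conv_params(i, o, 5) + 2 * o for i, o in [(128, 128), (128, 128), (128, 256)])
    + lstm_params(5, 96)
    + linear_params(256 * 96, 64)
    + linear_params(64, 10)
)

# ParallelNet: assumed 3x1 Conv2d + BatchNorm2d, a bidirectional LSTM
# (input size 22, hidden size 128), and a linear head on the
# concatenated 192 + 8192 features.
parallel_total = (
    sum(conv_params(i, o, 3) + 2 * o for i, o in [(1, 16), (16, 32), (32, 64), (64, 64)])
    + lstm_params(22, 128, bidirectional=True)
    + linear_params(192 + 32 * 256, 10)
)

print(baseline_flat, baseline_total, crnn_total, parallel_total)
# prints: 4608 143242 1942346 260042
```

Under these assumptions the totals agree with the "Total params" lines of the three summaries, and the 83,850 parameters of the final ParallelNet layer are consistent with the CNN and LSTM branches being concatenated before the classifier.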
Model performance
The table below lists the validation and test accuracy of each model, along with the number of epochs it was trained for.
Model | # Epochs | Validation Accuracy (%) | Test Accuracy (%) |
Baseline | 3 | 26.40 | 27.60 |
CRNN | 3 | 56.94 | 56.62 |
Parallel CNN-LSTM | 6 | 84.92 | 84.29 |
Classification report for Parallel CNN-LSTM:
Class | Precision | Recall | F1-score | Support |
zero | 0.91 | 0.87 | 0.89 | 250 |
one | 0.71 | 0.90 | 0.79 | 248 |
two | 0.78 | 0.81 | 0.80 | 264 |
three | 0.81 | 0.91 | 0.86 | 267 |
four | 0.91 | 0.76 | 0.83 | 253 |
five | 0.86 | 0.73 | 0.79 | 271 |
six | 0.95 | 0.89 | 0.92 | 244 |
seven | 0.81 | 0.92 | 0.85 | 239 |
eight | 0.91 | 0.82 | 0.86 | 257 |
nine | 0.84 | 0.83 | 0.83 | 259 |
------------ | --------- | ------ | -------- | ------- |
Accuracy | | | 0.84 | 2552 |
Macro avg | 0.85 | 0.84 | 0.84 | 2552 |
Weighted avg | 0.85 | 0.84 | 0.84 | 2552 |
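The summary rows of the report follow directly from the per-class rows. The sketch below recomputes them from the rounded precision, recall, and support values in the table. Because the inputs are already rounded to two decimals, a recomputed F1-score can differ from the reported one by about 0.01. The variable and function names are illustrative only.

```python
# Per-class (precision, recall, support), copied from the report above.
report = {
    "zero": (0.91, 0.87, 250),
    "one": (0.71, 0.90, 248),
    "two": (0.78, 0.81, 264),
    "three": (0.81, 0.91, 267),
    "four": (0.91, 0.76, 253),
    "five": (0.86, 0.73, 271),
    "six": (0.95, 0.89, 244),
    "seven": (0.81, 0.92, 239),
    "eight": (0.91, 0.82, 257),
    "nine": (0.84, 0.83, 259),
}

def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

total_support = sum(s for _, _, s in report.values())

# Macro average: every class counts equally.
macro_precision = sum(p for p, _, _ in report.values()) / len(report)
macro_recall = sum(r for _, r, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_precision = sum(p * s for p, _, s in report.values()) / total_support

f1_scores = {name: f1(p, r) for name, (p, r, _) in report.items()}

print(total_support, round(macro_precision, 2), round(macro_recall, 2))
```

The class supports are close to one another (239 to 271), which is why the macro and weighted averages in the report coincide to two decimals.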
The final training runs covered only a small number of epochs, partly because a GPU error left me without GPU access, and partly because validation accuracy started to decrease when training continued beyond that point.
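Stopping once validation accuracy starts to fall is commonly automated as early stopping. The helper below is a generic sketch of that idea, not the procedure used in this project; the function name `best_epoch` and the `patience` parameter are hypothetical.

```python
def best_epoch(val_accuracies, patience=2):
    """Return (epoch_index, accuracy) of the best epoch seen before stopping.

    Scanning stops once `patience` consecutive epochs fail to improve on
    the best validation accuracy so far. Epoch indices are zero-based.
    """
    best_idx, best_acc, stale = -1, float("-inf"), 0
    for idx, acc in enumerate(val_accuracies):
        if acc > best_acc:
            # New best: remember it and reset the no-improvement counter.
            best_idx, best_acc, stale = idx, acc, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_idx, best_acc
```

In a training loop, the same logic would typically run after each epoch, saving a checkpoint whenever a new best is found so that the best weights can be restored afterwards.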
Some observations
Audio can be treated as a kind of spatio-temporal data. Handled as a raw sequence, it has temporal structure; converted to a spectrogram (here, a Mel-spectrogram), it is effectively an image, and images carry spatial information.
Audio can therefore be processed both with sequential models (RNNs, LSTMs, and others in that family) and with convolutional models.
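To make the two views concrete, the sketch below works out the shapes involved for a one-second clip. The sample rate, hop length, and number of Mel bands are assumptions (they happen to be common library defaults) chosen because they reproduce the 128 x 44 input and the 2.88 MB input size reported in the model summaries; the post itself does not state the preprocessing settings.

```python
# Assumed preprocessing settings, NOT confirmed by the post.
SAMPLE_RATE = 22050  # samples per second after loading
HOP_LENGTH = 512     # samples between successive spectrogram frames
N_MELS = 128         # number of Mel bands
CLIP_SECONDS = 1
BATCH_SIZE = 128     # batch dimension shown in the model summaries

n_samples = SAMPLE_RATE * CLIP_SECONDS
# With centred framing, a short-time transform yields one frame per hop, plus one.
n_frames = 1 + n_samples // HOP_LENGTH

waveform_shape = (n_samples,)           # temporal view: a 1-D sequence
spectrogram_shape = (N_MELS, n_frames)  # spatial view: a 2-D "image"

# Size of one float32 batch of spectrograms, in units of 10^6 bytes.
input_mb = BATCH_SIZE * N_MELS * n_frames * 4 / 1e6

print(waveform_shape, spectrogram_shape, round(input_mb, 2))
# prints: (22050,) (128, 44) 2.88
```

A sequential model can consume the spectrogram frame by frame along the 44-step time axis, while a convolutional model treats the whole 128 x 44 array as a single-channel image.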
GitHub: word-classification-with-pytorch
Report: Link