Audio Classification with CNN-LSTM networks
January 1, 2023
Synopsis
In this project, I aim to classify one-second audio clips of the spoken words “one”, “two”, “three”, …, “zero”. The data comes from the TensorFlow Speech Recognition Challenge. However, I have deviated slightly from the competition in terms of the target classes, truncating them to the ten digit words listed above.
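Narrowing the Speech Commands data down to the ten digit classes amounts to keeping only the matching sub-folders. A minimal sketch, assuming the dataset's one-folder-per-word layout; `digit_clips` is a hypothetical helper, not taken from the original code:

```python
from pathlib import Path

# The ten target words kept from the full Speech Commands label set
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def digit_clips(data_root):
    """Collect (path, label) pairs for the ten digit classes only,
    assuming one sub-folder of .wav files per word under data_root."""
    root = Path(data_root)
    return [(p, w)
            for w in DIGIT_WORDS
            for p in sorted((root / w).glob("*.wav"))]
```

Any non-digit word folders (e.g. "cat", "bed") are simply never visited, so no explicit filtering step is needed.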
I trained three models on the data:
- A baseline CNN model
```
================================================================================
Layer (type:depth-idx) Output Shape Param #
================================================================================
BaseModel [128, 10] --
├─convblock: 1-1 [128, 16, 65, 23] --
│ └─Sequential: 2-1 [128, 16, 65, 23] --
│ │ └─Conv2d: 3-1 [128, 16, 130, 46] 160
│ │ └─ReLU: 3-2 [128, 16, 130, 46] --
│ │ └─MaxPool2d: 3-3 [128, 16, 65, 23] --
├─convblock: 1-2 [128, 32, 33, 12] --
│ └─Sequential: 2-2 [128, 32, 33, 12] --
│ │ └─Conv2d: 3-4 [128, 32, 67, 25] 4,640
│ │ └─ReLU: 3-5 [128, 32, 67, 25] --
│ │ └─MaxPool2d: 3-6 [128, 32, 33, 12] --
├─convblock: 1-3 [128, 64, 17, 7] --
│ └─Sequential: 2-3 [128, 64, 17, 7] --
│ │ └─Conv2d: 3-7 [128, 64, 35, 14] 18,496
│ │ └─ReLU: 3-8 [128, 64, 35, 14] --
│ │ └─MaxPool2d: 3-9 [128, 64, 17, 7] --
├─convblock: 1-4 [128, 128, 9, 4] --
│ └─Sequential: 2-4 [128, 128, 9, 4] --
│ │ └─Conv2d: 3-10 [128, 128, 19, 9] 73,856
│ │ └─ReLU: 3-11 [128, 128, 19, 9] --
│ │ └─MaxPool2d: 3-12 [128, 128, 9, 4] --
├─Flatten: 1-5 [128, 4608] --
├─Linear: 1-6 [128, 10] 46,090
├─Softmax: 1-7 [128, 10] --
================================================================================
Total params: 143,242
Trainable params: 143,242
Non-trainable params: 0
Total mult-adds (G): 3.90
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 207.40
Params size (MB): 0.57
Estimated Total Size (MB): 210.86
================================================================================
```
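The summary above pins the baseline down almost completely. Here is a minimal PyTorch sketch that reproduces the same output shapes and parameter count; the kernel size and padding are inferred from the shapes and parameter counts, so treat this as a reconstruction rather than the original code:

```python
import torch
import torch.nn as nn

def convblock(c_in, c_out):
    # kernel_size=3 with padding=2 matches the summary's
    # "grow by 2, then halve" spatial pattern
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=2),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class BaseModel(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.blocks = nn.Sequential(
            convblock(1, 16), convblock(16, 32),
            convblock(32, 64), convblock(64, 128),
        )
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(128 * 9 * 4, n_classes)  # 4608 -> 10
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # x: [B, 1, 128, 44] mel-spectrograms (mel bins x time frames)
        return self.softmax(self.fc(self.flatten(self.blocks(x))))
```

With a `[128, 1, 128, 44]` input this yields exactly the 143,242 parameters reported above.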
- A CRNN model, in which an LSTM follows a stack of 1-D convolutional blocks
```
================================================================================
Layer (type:depth-idx) Output Shape Param #
================================================================================
CRNN [128, 10] --
├─convbloc: 1-1 [128, 128, 22] --
│ └─Sequential: 2-1 [128, 128, 22] --
│ │ └─Conv1d: 3-1 [128, 128, 44] 82,048
│ │ └─BatchNorm1d: 3-2 [128, 128, 44] 256
│ │ └─ReLU: 3-3 [128, 128, 44] --
│ │ └─MaxPool1d: 3-4 [128, 128, 22] --
├─convbloc: 1-2 [128, 128, 11] --
│ └─Sequential: 2-2 [128, 128, 11] --
│ │ └─Conv1d: 3-5 [128, 128, 22] 82,048
│ │ └─BatchNorm1d: 3-6 [128, 128, 22] 256
│ │ └─ReLU: 3-7 [128, 128, 22] --
│ │ └─MaxPool1d: 3-8 [128, 128, 11] --
├─convbloc: 1-3 [128, 256, 5] --
│ └─Sequential: 2-3 [128, 256, 5] --
│ │ └─Conv1d: 3-9 [128, 256, 11] 164,096
│ │ └─BatchNorm1d: 3-10 [128, 256, 11] 512
│ │ └─ReLU: 3-11 [128, 256, 11] --
│ │ └─MaxPool1d: 3-12 [128, 256, 5] --
├─LSTM: 1-4 [128, 256, 96] 39,552
├─Flatten: 1-5 [128, 24576] --
├─Sequential: 1-6 [128, 64] --
│ └─Linear: 2-4 [128, 64] 1,572,928
│ └─ReLU: 2-5 [128, 64] --
├─Linear: 1-7 [128, 10] 650
├─Softmax: 1-8 [128, 10] --
================================================================================
Total params: 1,942,346
Trainable params: 1,942,346
Non-trainable params: 0
Total mult-adds (G): 2.42
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 48.31
Params size (MB): 7.77
Estimated Total Size (MB): 58.96
================================================================================
```
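A sketch of the CRNN reconstructed from the summary: the mel bins serve as `Conv1d` channels, and the LSTM then reads the 256 channel maps as time steps of length-5 feature vectors (the kernel size, padding, and LSTM hidden size are inferred from the parameter counts, so they are assumptions):

```python
import torch
import torch.nn as nn

def conv_bloc(c_in, c_out):
    # kernel_size=5, padding=2 keeps the length, then the pool halves it
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
        nn.BatchNorm1d(c_out),
        nn.ReLU(),
        nn.MaxPool1d(2),
    )

class CRNN(nn.Module):
    def __init__(self, n_mels=128, n_classes=10):
        super().__init__()
        self.blocks = nn.Sequential(
            conv_bloc(n_mels, 128), conv_bloc(128, 128), conv_bloc(128, 256),
        )
        # hidden_size=96 over length-5 inputs matches the 39,552 LSTM params
        self.lstm = nn.LSTM(input_size=5, hidden_size=96, batch_first=True)
        self.flatten = nn.Flatten()
        self.head = nn.Sequential(nn.Linear(256 * 96, 64), nn.ReLU())
        self.out = nn.Linear(64, n_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        # x: [B, 128, 44] mel-spectrogram, mel bins as channels
        x = self.blocks(x)            # [B, 256, 5]
        x, _ = self.lstm(x)           # [B, 256, 96]: channels act as time steps
        x = self.flatten(x)           # [B, 24576]
        return self.softmax(self.out(self.head(x)))
```

The 1,572,928-parameter `Linear` on the flattened LSTM output dominates the model, which is why this network is an order of magnitude larger than the baseline.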
- A parallel CNN-LSTM model, in which the input passes through four CNN blocks and an LSTM block in parallel, and the two branches' outputs are then concatenated
```
================================================================================
Layer (type:depth-idx) Output Shape Param #
================================================================================
ParallelNet [128, 10] --
├─CNNBLock: 1-1 [128, 16, 63, 22] --
│ └─Sequential: 2-1 [128, 16, 63, 22] --
│ │ └─Conv2d: 3-1 [128, 16, 126, 44] 64
│ │ └─BatchNorm2d: 3-2 [128, 16, 126, 44] 32
│ │ └─ReLU: 3-3 [128, 16, 126, 44] --
│ │ └─MaxPool2d: 3-4 [128, 16, 63, 22] --
├─CNNBLock: 1-2 [128, 32, 30, 11] --
│ └─Sequential: 2-2 [128, 32, 30, 11] --
│ │ └─Conv2d: 3-5 [128, 32, 61, 22] 1,568
│ │ └─BatchNorm2d: 3-6 [128, 32, 61, 22] 64
│ │ └─ReLU: 3-7 [128, 32, 61, 22] --
│ │ └─MaxPool2d: 3-8 [128, 32, 30, 11] --
├─CNNBLock: 1-3 [128, 64, 14, 5] --
│ └─Sequential: 2-3 [128, 64, 14, 5] --
│ │ └─Conv2d: 3-9 [128, 64, 28, 11] 6,208
│ │ └─BatchNorm2d: 3-10 [128, 64, 28, 11] 128
│ │ └─ReLU: 3-11 [128, 64, 28, 11] --
│ │ └─MaxPool2d: 3-12 [128, 64, 14, 5] --
├─CNNBLock: 1-4 [128, 64, 3, 1] --
│ └─Sequential: 2-4 [128, 64, 3, 1] --
│ │ └─Conv2d: 3-13 [128, 64, 12, 5] 12,352
│ │ └─BatchNorm2d: 3-14 [128, 64, 12, 5] 128
│ │ └─ReLU: 3-15 [128, 64, 12, 5] --
│ │ └─MaxPool2d: 3-16 [128, 64, 3, 1] --
├─Flatten: 1-5 [128, 192] --
├─RNNBlock: 1-6 [128, 32, 256] --
│ └─MaxPool2d: 2-5 [128, 1, 32, 22] --
│ └─LSTM: 2-6 [128, 32, 256] 155,648
├─Flatten: 1-7 [128, 8192] --
├─Linear: 1-8 [128, 10] 83,850
├─Softmax: 1-9 [128, 10] --
================================================================================
Total params: 260,042
Trainable params: 260,042
Non-trainable params: 0
Total mult-adds (G): 1.30
================================================================================
Input size (MB): 2.88
Forward/backward pass size (MB): 326.25
Params size (MB): 1.04
Estimated Total Size (MB): 330.17
================================================================================
```
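A sketch of the parallel model reconstructed from the summary: the CNN branch convolves with `(3, 1)` kernels (i.e. along the mel axis only, as the shapes and parameter counts imply), while the RNN branch downsamples the spectrogram and runs a bidirectional LSTM; the branch outputs are concatenated before the classifier. Kernel and pool sizes here are inferred, not confirmed by the original code:

```python
import torch
import torch.nn as nn

def cnn_block(c_in, c_out, pool):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=(3, 1)),  # convolve along mel axis only
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(pool),
    )

class ParallelNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        # CNN branch: [B, 1, 128, 44] -> [B, 192]
        self.cnn = nn.Sequential(
            cnn_block(1, 16, 2), cnn_block(16, 32, 2),
            cnn_block(32, 64, 2), cnn_block(64, 64, 4),
            nn.Flatten(),
        )
        # RNN branch: downsample, then a bidirectional LSTM over 32 time steps
        self.pool = nn.MaxPool2d((4, 2))   # [B, 1, 128, 44] -> [B, 1, 32, 22]
        self.lstm = nn.LSTM(input_size=22, hidden_size=128,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(192 + 32 * 256, n_classes)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        a = self.cnn(x)                    # [B, 192]
        s = self.pool(x).squeeze(1)        # [B, 32, 22]
        b, _ = self.lstm(s)                # [B, 32, 256]
        b = b.flatten(1)                   # [B, 8192]
        return self.softmax(self.fc(torch.cat([a, b], dim=1)))
```

Concatenating a 192-dim spatial feature with an 8192-dim temporal feature gives the 8384-dim input to the final `Linear`, matching the 83,850 parameters reported above.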
Model performance
Model performance on the validation and test sets for each model, along with the number of epochs it was trained for, is listed in the table below.
| Model | # Epochs | Validation Accuracy (%) | Test Accuracy (%) |
|---|---|---|---|
| Baseline | 3 | 26.40 | 27.60 |
| CRNN | 3 | 56.94 | 56.62 |
| Parallel CNN-LSTM | 6 | 84.92 | 84.29 |
Classification report for Parallel CNN-LSTM:
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| zero | 0.91 | 0.87 | 0.89 | 250 |
| one | 0.71 | 0.90 | 0.79 | 248 |
| two | 0.78 | 0.81 | 0.80 | 264 |
| three | 0.81 | 0.91 | 0.86 | 267 |
| four | 0.91 | 0.76 | 0.83 | 253 |
| five | 0.86 | 0.73 | 0.79 | 271 |
| six | 0.95 | 0.89 | 0.92 | 244 |
| seven | 0.81 | 0.92 | 0.85 | 239 |
| eight | 0.91 | 0.82 | 0.86 | 257 |
| nine | 0.84 | 0.83 | 0.83 | 259 |
| Accuracy | | | 0.84 | 2552 |
| Macro avg | 0.85 | 0.84 | 0.84 | 2552 |
| Weighted avg | 0.85 | 0.84 | 0.84 | 2552 |
Final training was run for only a small number of epochs, partly because GPU access was limited by a GPU error, and partly because validation accuracy started to decrease when training continued beyond that point.
Some observations
Audio can be treated as a spatio-temporal data type: handled as raw sequence data, it has temporal structure; converted to a spectrogram (here, a Mel-spectrogram), it is essentially an image, and images carry spatial structure.
Audio can therefore be processed both with sequential models (RNNs, LSTMs, and the rest of that family) and with convolutional models.
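This dual view can be made concrete in a few lines of PyTorch: the same batch of mel-spectrograms feeds a 2-D convolution when shaped as images, and an LSTM when shaped as sequences (the shapes and layer sizes here are illustrative, not taken from the project's code):

```python
import torch
import torch.nn as nn

# A batch of mel-spectrograms: [batch, mel bins, time frames]
spec = torch.randn(8, 128, 44)

image_view = spec.unsqueeze(1)         # [8, 1, 128, 44] - channel dim for Conv2d
sequence_view = spec.permute(0, 2, 1)  # [8, 44, 128]    - time-major for an LSTM

conv = nn.Conv2d(1, 16, kernel_size=3, padding=1)
lstm = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)

feat_spatial = conv(image_view)         # [8, 16, 128, 44]: local time-frequency patterns
feat_temporal, _ = lstm(sequence_view)  # [8, 44, 64]: one hidden state per time frame
```

The three models above are essentially different ways of combining these two views of the same tensor.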
Links
GitHub: word-classification-with-pytorch
Report: Link