This question is in line with the question posted here but with a slight nuance of the CNN. Using the feature extraction definition:
max_pad_len = 174 n_mels