How to apply CNN to Short-time Fourier Transform?


Question


So I have code that returns a Short-Time Fourier Transform (STFT) spectrum of a .wav file. I want to be able to take, say, a millisecond of that spectrum and train a CNN on it.

I'm not quite sure how I would implement that. I know how to format image data to feed into a CNN, and how to train the network, but I'm lost on how to take the STFT data and divide it into small time frames.

The FFT code (sorry for the long code):

import numpy as np
from scipy.io import wavfile
from skimage import util

rate, audio = wavfile.read('scale_a_lydian.wav')

audio = np.mean(audio, axis=1)  # average the channels -> mono

N = audio.shape[0]  # total number of samples
L = N / rate        # audio length in seconds

M = 1024

# Audio is 44.1 kHz, i.e. 44100 samples / second.
# The window function takes 1024 samples, or ~0.023 seconds of audio
# (1024 / 44100), and the window shifts 100 samples each step,
# so there end up being roughly (N - 1024) / 100 + 1 slices in total.

slices = util.view_as_windows(audio, window_shape=(M,), step=100) #slices overlap

win = np.hanning(M + 1)[:-1]  # periodic Hann window
slices = slices * win  # each slice is 1024 windowed samples (~0.023 s of audio)

slices = slices.T  # transpose so that each column is one 1024-sample slice


# FFT each slice (column); slicing [:M // 2 + 1:-1] keeps the negative-frequency
# half in reverse order, whose magnitudes for a real signal equal those of the
# positive-frequency bins 1..510, lowest frequency first
spectrum = np.fft.fft(slices, axis=0)[:M // 2 + 1:-1]

spectrum = np.abs(spectrum)  # take the magnitude of each frequency bin

# spectrum is (frequency bins) x (slices); transpose to (slices) x (bins),
# take the first row (the first slice), then transpose back to (bins) x 1 --
# the spectrum of a single ~0.023 s window
# (equivalent to spectrum[:, :1])

spectrum2 = spectrum.T
spectrum2 = spectrum2[:1]
spectrum2 = spectrum2.T
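
For the actual goal of feeding short time spans to a CNN, here is a minimal sketch of one way to carve the full spectrum array into fixed-width chunks of columns; the chunk width and the dB scaling are illustrative assumptions, not part of the original code:

# With a 100-sample hop at 44.1 kHz, each column of `spectrum` advances
# ~2.3 ms, so e.g. 44 columns cover roughly 0.1 s of audio.
def spectrogram_chunks(spec, frames_per_example=44):
    """Yield (bins, frames_per_example) chunks; the remainder is dropped."""
    log_spec = 20 * np.log10(spec / np.max(spec) + 1e-10)  # dB scale
    n_frames = log_spec.shape[1]
    for start in range(0, n_frames - frames_per_example + 1, frames_per_example):
        yield log_spec[:, start:start + frames_per_example]

# Stack into a (num_examples, bins, frames, 1) array for a 2-D CNN
examples = np.stack(list(spectrogram_chunks(spectrum)))[..., np.newaxis]

Overlapping training examples can be produced by passing a smaller step to the range() call.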

The following code plots the resulting spectrum:

import matplotlib.pyplot as plt

N = spectrum2.shape[0]  # number of frequency bins (510), not audio samples
L = N / rate            # only used to set the x-extent of the plot below

f, ax = plt.subplots(figsize=(4.8, 2.4))

S = np.abs(spectrum2)             # already a magnitude, so this is a no-op
S = 20 * np.log10(S / np.max(S))  # convert to decibels relative to the peak

ax.imshow(S, origin='lower', cmap='viridis',
          extent=(0, L, 0, rate / 2 / 1000))
ax.axis('tight')
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]')
plt.show()
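
For comparison, a sketch of plotting the full spectrogram (every slice, not just the first column), with the time axis derived from the 100-sample hop; the small epsilon guarding the log is an illustrative choice:

S_full = 20 * np.log10(spectrum / np.max(spectrum) + 1e-10)  # dB scale

f, ax = plt.subplots(figsize=(4.8, 2.4))
# x-extent: n_slices * hop / rate seconds; y-extent: 0 .. Nyquist, in kHz
ax.imshow(S_full, origin='lower', cmap='viridis', aspect='auto',
          extent=(0, spectrum.shape[1] * 100 / rate, 0, rate / 2 / 1000))
ax.set_ylabel('Frequency [kHz]')
ax.set_xlabel('Time [s]')
plt.show()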

(Feel free to correct any theoretical errors that I put in the comments)

So, from what I understand, I have a NumPy array (spectrum) in which each column is one slice with 510 samples (cut roughly in half, because the other half of each FFT slice is redundant for a real signal), and each row corresponds to one frequency bin?

The above code extracts a single slice of the spectrum. Since each slice spans 1024 samples at 44100 Hz, that is about 0.023 s of audio rather than 0.01 s, but a short frame like this is exactly what I need. Is this right, or am I not thinking about it correctly?


Answer 1:


I would suggest using Librosa for loading the audio and doing some pre-processing in just one line of code. You will want all your audio files to have the same sampling rate, and you may also want to cut out a specific interval of the audio. You can load the audio like this:

import librosa

# resample to 16 kHz and load 30 seconds of audio starting at the 10-second mark
y, sr = librosa.load(audiofile, offset=10.0, duration=30.0, sr=16000)

So you'll have your time series as y. From here I would use this nice implementation of a CNN on audio, where the author uses his own library to compute mel-spectrograms on the GPU. You just need to feed your y to the network; see here for how it's done. Alternatively, you can remove the first layer of that network, pre-compute your mel-spectrograms, and save them somewhere; these would then be your inputs to the network. See here
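
If you go the pre-compute route, a minimal sketch of producing mel-spectrograms with librosa's standard feature function; the n_fft, hop_length, and n_mels values here are illustrative assumptions, not taken from the linked code:

import librosa
import numpy as np

# Compute a mel-spectrogram from the loaded time series y at sampling rate sr
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, as CNNs usually expect

np.save('mel_spectrogram.npy', mel_db)  # save for later use as a network input

The saved arrays can then be loaded and batched exactly like image data.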

Other resources: Audio Classification: A Convolutional Neural Network Approach



Source: https://stackoverflow.com/questions/56293721/how-to-apply-cnn-to-short-time-fourier-transform
