Find sound effect inside an audio file

Backend · Unresolved · 4 answers · 1024 views
攒了一身酷 2021-01-15 03:12

I have a number of 3-hour MP3 files, and roughly every 15 minutes a distinct 1-second sound effect is played, which signals the beginning of a new chapter.

Is it possible to detect each time this sound effect is played, so I can get the timestamp for the start of each chapter?

4 Answers
  •  萌比男神i
    2021-01-15 03:52

    To follow up on the answers by @jonnor and @paul-john-leonard: they are both correct; by using frames (FFT) I was able to do Audio Event Detection.

    I've written up the full source code at:

    https://github.com/craigfrancis/audio-detect

    Some notes though:

    • To create the templates, I used ffmpeg:

      ffmpeg -ss 13.15 -i source.mp4 -t 0.8 -acodec copy -y templates/01.mp4;

    • I decided to use librosa.core.stft, but I needed to write my own implementation of the stft function for the 3-hour file I'm analysing, as it's far too big to hold in memory.
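A chunked STFT can be sketched in plain NumPy by windowing each frame and taking a real FFT; the `frame_stft` helper below is a hypothetical stand-in for such a custom implementation (it mirrors librosa's layout of 1 + n_fft/2 frequency bins, but without librosa's centre padding):

```python
import numpy as np

def frame_stft(block, n_fft=2048, hop_length=512):
    """Compute a magnitude STFT over one in-memory block of samples.

    Returns an array of shape (1 + n_fft // 2, n_frames), the same
    layout librosa.core.stft produces for its default n_fft of 2048.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(block) - n_fft) // hop_length
    spec = np.empty((n_fft // 2 + 1, n_frames))
    for i in range(n_frames):
        start = i * hop_length
        spec[:, i] = np.abs(np.fft.rfft(block[start:start + n_fft] * window))
    return spec

# One second of a 440 Hz tone at 22050 Hz, processed as a single block.
sr = 22050
t = np.arange(sr) / sr
spec = frame_stft(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (1025, 40)
```

A long file would be fed through this block by block (with n_fft - hop_length samples of overlap carried between blocks), so only one block is ever held in memory.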

    • When using stft I initially tried a hop_length of 64 rather than the default (512), assuming that would give me more data to work with... the theory might be true, but 64 was far too detailed, and caused the matching to fail most of the time.
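To see why the hop length matters at this scale, a quick calculation (assuming a 22050 Hz sample rate, and ignoring centre padding) shows how many frames a 3-hour file produces at each hop:

```python
sr = 22050
samples = 3 * 60 * 60 * sr            # 3 hours of mono audio
frames_default = 1 + samples // 512   # hop_length = 512
frames_small = 1 + samples // 64      # hop_length = 64
print(frames_default, frames_small)   # 465118 3720938
```

The smaller hop produces eight times as many frames to score against every template, without making each frame any more distinctive.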

    • I still have no idea how to get cross-correlation between frame and template to work (via numpy.correlate). Instead I took the results per frame (the 1025 frequency bins, i.e. n_fft/2 + 1 for the default n_fft of 2048, not 1024, which relate to the frequencies found) and did a very simple average difference check, then ensured that average was below a certain threshold (my test case worked at 0.15; the main files I'm using this on required 0.55, presumably because they had been compressed quite a bit more):

      hz_score = abs(source[0:1025,x] - template[2][0:1025,y])
      hz_score = sum(hz_score)/float(len(hz_score))
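As a possible alternative to the average-difference check (not the author's method), a cosine similarity between frame spectra is scale-invariant, so a louder or differently compressed copy of the same frame still scores near 1.0; `frame_similarity` below is a hypothetical helper operating on the same 1025-bin magnitude frames:

```python
import numpy as np

def frame_similarity(a, b):
    """Cosine similarity between two magnitude-spectrum frames (1.0 = same shape)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Hypothetical spectra: a scaled copy of a frame still scores 1.0,
# while an unrelated frame scores noticeably lower.
rng = np.random.default_rng(0)
frame = rng.random(1025)
other = rng.random(1025)
same = frame_similarity(frame, frame * 0.5)
diff = frame_similarity(frame, other)
```

Because the score is normalised, one threshold might work across files with different compression levels, rather than 0.15 for one set and 0.55 for another.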

    • When checking these scores, it's really useful to show them on a graph. I often used something like the following:

      import matplotlib.pyplot as plt
      plt.figure(figsize=(30, 5))
      plt.axhline(y=hz_match_required_start, color='y')

      x = 0
      debug = []
      while x < source_length:
          # hz_score for frame x is calculated as above
          debug.append(hz_score)
          if x == mark_frame:
              plt.axvline(x=len(debug), ymin=0.1, ymax=1, color='r')
          x += 1

      plt.plot(debug)
      plt.show()

    • When you create the template, you need to trim off any leading silence (to avoid bad matching), plus an extra ~5 frames (the compression / re-encoding process seems to alter the start)... likewise, remove the last 2 frames (each frame includes a bit of data from its surroundings, and the last one in particular can be a bit off).
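The leading-silence trim can be sketched as a simple RMS-energy threshold over fixed-size frames (a simplified stand-in for librosa.effects.trim; the `threshold` value is an assumption you would tune per template):

```python
import numpy as np

def trim_leading_silence(y, frame=512, threshold=0.01):
    """Drop samples from the start until a frame's RMS energy reaches threshold."""
    for start in range(0, len(y) - frame, frame):
        rms = np.sqrt(np.mean(y[start:start + frame] ** 2))
        if rms >= threshold:
            return y[start:]
    return y[:0]  # the whole signal was silence

# Half a second of silence followed by a one-second tone: the silence is removed.
sr = 22050
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
y = np.concatenate([np.zeros(sr // 2), tone])
trimmed = trim_leading_silence(y)
```

The extra ~5 frames mentioned above would then be dropped on top of this, since re-encoding smears the onset.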

    • When you start finding a match, it may be fine for the first few frames and then fail... you will probably need to try again a frame or two later. I found it easier to have a process that supported multiple templates (slight variations on the sound). It would check each template's first testable frame (e.g. the 6th), and if that matched, add the template to a list of potential matches. Then, as it progressed through the next frames of the source, it compared them against the next frames of each candidate template, until every frame in the template had been matched (or the candidate failed).
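The multi-template tracking described above can be sketched as a list of candidates, each remembering which template frame it expects next; `frames_match` is a hypothetical stand-in for the per-frame score check, and plain integers stand in for spectra in the toy example:

```python
def find_matches(source, templates, frames_match, start_frame=5):
    """Scan source frames; return the indices where a whole template matched.

    Each candidate is a tuple (template, next_template_frame, start_index).
    """
    candidates, hits = [], []
    for x, frame in enumerate(source):
        # Advance surviving candidates, drop the ones that stop matching.
        survivors = []
        for tpl, pos, start in candidates:
            if frames_match(frame, tpl[pos]):
                if pos + 1 == len(tpl):
                    hits.append(start)            # every template frame matched
                else:
                    survivors.append((tpl, pos + 1, start))
        candidates = survivors
        # Open a new candidate whenever a template's first testable frame matches.
        for tpl in templates:
            if frames_match(frame, tpl[start_frame]):
                candidates.append((tpl, start_frame + 1, x))
    return hits

# Toy data: the template's testable tail (frames 5..7) appears twice in the source.
template = [1, 2, 3, 4, 5, 6, 7, 8]
source = [0, 0, 6, 7, 8, 0, 0, 6, 7, 8, 0]
hits = find_matches(source, [template], lambda a, b: a == b)
```

In the real setting `frames_match` would be the average-difference (or similarity) check against a threshold, and several template variants would be passed in at once.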
