I record a daily 2 minutes radio broadcast from Internet. There\'s always the same starting and ending jingle. Since the radio broadcast exact time may vary from more or les
I wonder if you could use a Hough transform. You would start by cataloging each step of the opening sequence. Let's say you use 10 ms steps and the opening sequence is 50 ms long. You compute some metric on each step and get
1 10 1 17 5
Now go through your audio and analyze each 10 ms step for the same metric. Call this array have_audio
8 10 8 7 5 1 10 1 17 6 2 10...
Now create a new empty array that's the same length as have_audio
. Call it start_votes
. It will contain "votes" for the start of the opening sequence. If you see a 1, you may be in the 1st or 3rd step of the opening sequence, so you have 1 vote for the opening sequence starting 1 step ago and 1 vote for the opening sequence starting 3 steps ago. If you see a 10, you have 1 vote for the opening sequence starting 2 steps ago, a 17 votes for 4 step ago, and so on.
So for that example have_audio
, your votes
will look like
2 0 0 1 0 4 0 0 0 0 0 1 ...
You have a lot of votes at position 6, so there's a good chance the opening sequence starts there.
You could improve performance by not bothering to analyze the entire opening sequence. If the opening sequence is 10 seconds long, you could just search for the first 5 seconds.