python webrtc voice activity detection is wrong

问题

I need to do voice activity detection as a step to classify audio files.

Basically, I need to know with certainty if a given audio has spoken language.

I am using py-webrtcvad, which I found in git-hub and is scarcely documented:

https://github.com/wiseman/py-webrtcvad

Thing is, when I try it on my own audio files, it works fine with the ones that have speech but keeps yielding false positives when I feed it with other types of audio (like music or bird sound), even if I set aggressiveness at 3.

Audios are 8000 sample/hz

The only thing I changed to the source code was the way I pass the arguments to main function (excluding sys.args).

def main(file, agresividad):

    audio, sample_rate = read_wave(file)
    vad = webrtcvad.Vad(int(agresividad))
    frames = frame_generator(30, audio, sample_rate)
    frames = list(frames)
    segments = vad_collector(sample_rate, 30, 300, vad, frames)
    for i, segment in enumerate(segments):
        path = 'chunk-%002d.wav' % (i,)
        print(' Writing %s' % (path,))
        write_wave(path, segment, sample_rate)

if __name__ == '__main__':

    file = 'myfilename.wav'
    agresividad = 3 #aggressiveness
    main(file, agresividad)

回答1:

I'm seeing the same thing. I'm afraid that's just the extent to which it works. Speech detection is a difficult task and webrtcvad wants to be light on resources so there's only so much you can do. If you need more accuracy then you would need different packages/methods that will necessarily take more computing power.

On aggressiveness, you're right that even on 3 there are still a lot of false positives. I'm also seeing false negatives however so one trick I'm using is running three instances of the detector, one for each aggressiveness setting. Then instead of classifying a frame 0 or 1 I give it the value of the highest aggressiveness that still said it was speech. In other words each sample now has a score of 0 to 3 with 0 meaning even the least strict detector said it wasn't speech and 3 meaning even the strictest setting said it was. I get a little bit more resolution like that and even with the false positives it is good enough for me.

来源：https://stackoverflow.com/questions/51463146/python-webrtc-voice-activity-detection-is-wrong

标签

python

audio

webrtc

speech-recognition

voice-recognition