Gender detection of the speaker from wave data of the audio

I would like to add gender detection to a news video translator app I'm working on, so that the app can switch between a male and a female voice to match the speaker on screen. I'm not expecting 100% accuracy. I used EZAudio to obtain waveform data for a period of audio and used the average RMS value to set a threshold (cutOff) between male and female. Initially cutOff = 3.3.

    - (void)setInitialVoiceGenderDetectionParameters:(NSArray *)arrayAudioDetails
    {
        // Average RMS value over a period of audio, say 5 sec
        float initialMaleAvg = ((ConvertedTextDetails *)[arrayAudioDetails firstObject]).audioAverageRMS;
        // MaleVector is the parameter used to change the threshold for different news clippings
        float initialMaleVector = initialMaleAvg * 80;
        // Below 5.3 use the value itself, above 23 use half of it, otherwise fall back to 5.3
        if (initialMaleVector < 5.3f) {
            cutOff = initialMaleVector;
        } else if (initialMaleVector > 23.0f) {
            cutOff = initialMaleVector / 2;
        } else {
            cutOff = 5.3f;
        }
    }

Initially adjustValue = -0.9 and tanCutOff = 0.45. The values 5.3 and 23, along with cutOff, adjustValue and tanCutOff, were obtained from rigorous testing. The tan of the values is used to magnify the differences between them.

    - (BOOL)checkGenderWithPeekRMS:(float)pRMS andAverageRMS:(float)aRMS
    {
        // pRMS is the peak RMS value in the audio snippet and aRMS is the average RMS value
        BOOL male = NO;
        if (tan(pRMS) < tanCutOff)
        {
            if (pRMS / aRMS > cutOff)
            {
                cutOff = cutOff + adjustValue;
                NSLog(@"FEMALE....");
                male = NO;
            }
            else
            {
                NSLog(@"MALE....");
                male = YES;
                cutOff = cutOff - adjustValue;
            }
        }
        else
        {
            NSLog(@"FEMALE.");
            male = NO;
        }

        return male;
    }

adjustValue is used to recalibrate the threshold each time a news video is translated, since each video has a different noise level. But I know this method is noob-ish. What can I do to create a stable threshold? Or how can I normalise each audio snippet?

Alternative or more efficient ways to determine gender from audio wave data are also welcome.

Edit: Following Nikolay's suggestion, I researched gender recognition using CMU Sphinx. Can anybody suggest how I can extract MFCC features and feed them into a GMM/SVM classifier using OpenEars (CMU Sphinx for the iOS platform)?

Accurate gender identification can be implemented with a GMM classifier over MFCC features. You can read about it here:

AGE AND GENDER RECOGNITION FOR TELEPHONE APPLICATIONS BASED ON GMM SUPERVECTORS AND SUPPORT VECTOR MACHINES

To date, I am not aware of an open source implementation of this, though many of the components are available in open source speech recognition toolkits like CMUSphinx.

Accurate gender identification can be implemented by training a GMM classifier on MFCC features of male and female speech. Here is how one can go about it:

1. Collect a training set for each gender.
2. Extract MFCC features from all the audio of each gender (Python implementations such as scikit-talkbox are available).
3. Train a GMM for each gender on the features extracted from its training audio.

For details, here is an open source Python implementation of the same approach. The following tutorial evaluates the code on a subset extracted from Google's AudioSet, which was released this year (2017):

https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/

Source: https://stackoverflow.com/questions/30397126/gender-detection-of-the-speaker-from-wave-data-of-the-audio
