Speaker Recognition using MARF

Submitted by 笑着哭i on 2019-12-03 17:05:33

First, in my experience, using the FFT algorithm won't give you the best results: try LPC in MARF instead.

Second: MARF assumes what speech people call a "closed set", which means it will always return a result, even if the speaker is not known to the system. You'd have to decide how trustworthy the response is based on a distance threshold.
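To make the threshold idea concrete, here is a minimal sketch in plain Java. It does not use MARF's API: `OpenSetDecision`, `identify` and `maxDistance` are hypothetical names, and it assumes you already have one distance per enrolled speaker, where a smaller distance means a closer match. The threshold value itself has to be tuned on your own data.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of MARF: turns a closed-set "nearest speaker"
// answer into an open-set decision by rejecting matches that are too far away.
final class OpenSetDecision {

    // distances: speaker id -> distance to the test sample (smaller = closer).
    // maxDistance: assumed rejection threshold, to be tuned on your own data.
    static Optional<String> identify(Map<String, Double> distances, double maxDistance) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, Double> e : distances.entrySet()) {
            if (e.getValue() < bestDistance) {
                bestDistance = e.getValue();
                best = e.getKey();
            }
        }
        // Accept the closest speaker only if it lies within the threshold;
        // otherwise report "unknown speaker" as an empty Optional.
        return (best != null && bestDistance <= maxDistance)
                ? Optional.of(best)
                : Optional.empty();
    }
}
```

An empty result then means "closest match was too far away, treat the speaker as unknown".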

Also make sure the sliding window (Hamming window) size is set according to your file's sample rate: e.g. a window of 512 samples at a sample rate of 22050 Hz yields a window of ca. 23 ms, which in my experience returned the best results on a data set of 500 speakers.

Since 22050 Hz means that many samples per second, finding the number of samples for the desired length of around 25 ms is easy for any sample rate: sample rate / 1000 * 25
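The arithmetic above, as a small Java sketch. `WindowLength` and its method names are made up for illustration and are not MARF classes; the result is rounded to the nearest whole sample.

```java
// Hypothetical helper for converting between window duration and sample count.
final class WindowLength {

    // Number of samples (per channel) covering `millis` ms at `sampleRateHz`,
    // i.e. sample rate / 1000 * millis, rounded to the nearest integer.
    static int samplesFor(int sampleRateHz, int millis) {
        return (int) Math.round(sampleRateHz * (double) millis / 1000.0);
    }

    // Duration in milliseconds covered by `samples` samples at `sampleRateHz`.
    static double millisFor(int sampleRateHz, int samples) {
        return samples * 1000.0 / sampleRateHz;
    }
}
```

For example, `WindowLength.samplesFor(22050, 25)` gives 551 samples, and `WindowLength.millisFor(22050, 512)` is roughly 23.2 ms.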

Please note that the FFT algorithm used in MARF requires a window size that is exactly a power of 2 (256 / 512 / 1024 / ...).

That's not required for the LPC algorithm (though a power of 2 may be slightly more efficient for the processor, since powers of 2 are all it knows :-))
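If the window size you computed is not a power of 2 (551 samples for roughly 25 ms at 22050 Hz, for instance), you can snap it to the closest one. The following is only a sketch with hypothetical names (`PowerOfTwoWindow` is not a MARF class); picking the nearest power of 2 rather than the next larger one is my own assumption.

```java
// Hypothetical helper for picking an FFT-friendly window size.
final class PowerOfTwoWindow {

    // True when n is 1, 2, 4, 8, ... (exactly one bit set).
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    // Closest power of two to n; ties go to the larger value.
    // n must be between 1 and 2^30 so the upper candidate fits in an int.
    static int nearestPowerOfTwo(int n) {
        if (n < 1 || n > (1 << 30)) {
            throw new IllegalArgumentException("n must be in [1, 2^30]");
        }
        int lower = Integer.highestOneBit(n); // largest power of two <= n
        if (lower == n) {
            return n;
        }
        int upper = lower << 1;               // smallest power of two > n
        return (n - lower < upper - n) ? lower : upper;
    }
}
```

With these definitions, 551 snaps down to 512 and 200 snaps up to 256.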

Oh, and don't forget that if you're using a stereo file, the window has to be twice as long... but I would advise using a mono file: a multichannel file adds no value for voice processing, it just takes longer and is less precise.
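If all you have is a stereo recording, one simple way to get a mono signal is to average the two channels. This is a sketch under assumptions: the samples are already decoded into interleaved 16-bit PCM values (left, right, left, right, ...), and `StereoToMono` is a hypothetical helper, not something MARF provides.

```java
// Hypothetical helper: averages left/right pairs of interleaved 16-bit PCM.
final class StereoToMono {

    // interleaved: L0, R0, L1, R1, ... ; returns one sample per L/R pair.
    static short[] downmix(short[] interleaved) {
        if (interleaved.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of samples");
        }
        short[] mono = new short[interleaved.length / 2];
        for (int i = 0; i < mono.length; i++) {
            int left = interleaved[2 * i];
            int right = interleaved[2 * i + 1];
            // The sum is computed as int, so it cannot overflow before halving.
            mono[i] = (short) ((left + right) / 2);
        }
        return mono;
    }
}
```

The resulting array is half as long, so the window size computed for the sample rate applies directly.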

A word on sample rate: the selected sample rate should be at least twice the highest frequency you're interested in. Usually, people consider that the highest frequency for voice is 4000 Hz and thus select a sample rate of 8000 Hz. Please note that this is not entirely correct: "s" and "sh" sounds reach higher frequencies. It's true that you don't need those frequencies to understand what the speaker is saying, but when extracting a voice print, it might be useful to use a broader spectrum. My preference goes to 22050 Hz. Some vocal password packages don't allow you to go below 11000 Hz.
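The "twice the highest frequency" rule is just a factor of two in either direction. A tiny sketch, with hypothetical names:

```java
// Hypothetical helper illustrating the "twice the highest frequency" rule.
final class Nyquist {

    // Highest frequency a given sample rate can represent.
    static double highestRepresentableHz(double sampleRateHz) {
        return sampleRateHz / 2.0;
    }

    // Lowest sample rate able to capture a given highest frequency.
    static double minimumSampleRateHz(double highestFrequencyHz) {
        return 2.0 * highestFrequencyHz;
    }
}
```

So 8000 Hz covers content up to 4000 Hz, while 22050 Hz covers content up to 11025 Hz, which leaves room for the "s" and "sh" sounds.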

A word on bit depth: 8 bits vs 16 bits. While the sample rate determines the precision in time, the bit depth determines the precision of the amplitude: 8 bits gives you 256 values, 16 bits gives you 65536 values.

Needless to say, you should use 16 bits for voice biometrics :-)

For reference, an audio CD uses 44100Hz / 16 bit
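Putting the recommendations on channels, sample rate and bit depth together, a quick sanity check on a recording could look like the sketch below. `RecordingCheck` is a hypothetical helper and the minimum sample rate is a parameter you choose yourself; in a real program the three values could come from `javax.sound.sampled.AudioFormat` (`getSampleRate()`, `getSampleSizeInBits()`, `getChannels()`).

```java
// Hypothetical helper that checks a recording against the advice above.
final class RecordingCheck {

    // Number of distinct amplitude values for a bit depth: 2^bitDepth.
    static long amplitudeLevels(int bitDepth) {
        if (bitDepth < 1 || bitDepth > 62) {
            throw new IllegalArgumentException("bitDepth must be in [1, 62]");
        }
        return 1L << bitDepth;
    }

    // True for mono recordings with at least 16 bits per sample and a
    // sample rate of at least minSampleRateHz (an assumed, tunable limit).
    static boolean followsAdvice(float sampleRateHz, int bitDepth, int channels,
                                 float minSampleRateHz) {
        return channels == 1 && bitDepth >= 16 && sampleRateHz >= minSampleRateHz;
    }
}
```

A 22050 Hz / 16 bit / mono file passes with a minimum of 11000 Hz, whereas an unmodified CD track (stereo) or an 8000 Hz / 8 bit telephone-style recording does not.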

About vText: as I said earlier, the Fourier transform (FFT) is not something I've found to be usable on large data sets. It lacks precision.

Here it looks like something goes wrong when delegating calculations to MathLab. Without the code it's, imho, nearly impossible to give you more info.

Don't hesitate to ask for clarification on anything I said; I might be taking some things for granted without realizing they're not that clear :-)

FWIW, I just wrote a speaker recognition tool in Java called Recognito. I don't believe it's much better than MARF in terms of recognition capabilities, but it's definitely easier on the user for the initial steps, uses a licensing model which doesn't require your software to be open source, and supports calls from multiple concurrent threads.

In case you want to give Recognito a shot : https://github.com/amaurycrickx/recognito
