How “ok google” technology is implemented [closed]

I've read a little about speech/voice recognition, and I wonder how it works. For instance, how is "ok Google" on Android, and similar features, implemented?

I would like to know how it works: how do you pick out and analyze a word in a continuous feed to decide whether it is a keyword? If I think of it as a continuous text feed, one approach would be to isolate a chunk of a given length and then search it for the keyword. An audio feed is harder to reason about, because there is no pure silence between words, and isolating a chunk of fixed length doesn't guarantee that the keyword isn't cut off at the beginning or end of the chunk. How is this handled?

And finally, if you know of any libraries (C/C++ if possible) that can do this, I'd be glad to use them to implement a "keyword spotter".

Thank you.

Keyword spotting is usually implemented with dynamic programming: you search for the best chunk of audio containing the keyword, considering all possible start points and all possible end points. You need to model both the keyword and the alternatives. At every moment in time you score both the keyword and "other sounds", and once the probability of the keyword exceeds the probability of other speech, you raise a detection. The false alarm rate is controlled by a threshold. You do not need to handle silence specially, because it is covered by the "other speech" model. The algorithm is covered in detail in the following thesis:

http://eprints.qut.edu.au/37254/
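To make the idea concrete, here is a minimal sketch of keyword-versus-filler spotting with dynamic programming. It is not taken from the thesis or from any library. It assumes you already have, for every audio frame, a log-likelihood for each state of a left-to-right keyword model and one for the "other speech" (filler) model. The names `Frame`, `spot_keyword`, `keyword_loglik` and `filler_loglik` are made up for this illustration. Because a path may enter the first keyword state at any frame with a score of zero, the recursion implicitly covers all possible start points, and checking the last state at every frame covers all possible end points.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-frame input: acoustic scores produced by some front end.
struct Frame {
    std::vector<double> keyword_loglik;  // one log-likelihood per keyword state
    double filler_loglik;                // log-likelihood of "other speech"
};

// Returns the frame indices at which the keyword is judged to have ended.
// threshold is compared against the accumulated log-likelihood ratio
// (keyword path versus filler) and controls the false alarm rate.
std::vector<std::size_t> spot_keyword(const std::vector<Frame>& frames,
                                      std::size_t num_states,
                                      double threshold) {
    assert(num_states > 0);
    const double neg_inf = -std::numeric_limits<double>::infinity();
    // best[s]: best accumulated log-likelihood ratio of any path that is in
    // keyword state s at the current frame, over all possible start frames.
    std::vector<double> best(num_states, neg_inf);
    std::vector<std::size_t> hits;

    for (std::size_t t = 0; t < frames.size(); ++t) {
        const Frame& f = frames[t];
        assert(f.keyword_loglik.size() == num_states);
        // Go from the last state down to the first, so best[s - 1] still
        // holds the value from the previous frame when state s is updated.
        for (std::size_t s = num_states; s-- > 0;) {
            const double stay = best[s];
            // State 0 can be entered from the filler at any time (ratio 0).
            const double enter = (s == 0) ? 0.0 : best[s - 1];
            best[s] = std::max(stay, enter)
                      + f.keyword_loglik[s] - f.filler_loglik;
        }
        if (best[num_states - 1] > threshold) {
            hits.push_back(t);
            // Start over so one utterance does not fire repeatedly.
            std::fill(best.begin(), best.end(), neg_inf);
        }
    }
    return hits;
}
```

Calling `spot_keyword(frames, 3, 5.0)` on a stream returns the frames where a three-state keyword path beat the filler by more than 5 in the log domain. Raising the threshold trades missed detections for fewer false alarms. A real system would also normalize by duration and use proper transition probabilities, which this sketch leaves out.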

For an implementation of keyword spotting, you can check pocketsphinx and the pocketsphinx Android demo. pocketsphinx is a C library that can spot words in a continuous stream. You can find the tutorial here:

http://cmusphinx.sourceforge.net/wiki/tutorialpocketsphinx

To spot a keyword from the microphone, you can try something as simple as:

  pocketsphinx_continuous -inmic yes -keyphrase "ok google" -kws_threshold 1e-20

The original "Ok Google" technology is described in the following publication:

Small-Footprint Keyword Spotting Using Deep Neural Networks, by Guoguo Chen, Carolina Parada, and Georg Heigold

https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenSemester2201314/chen2014small.pdf

It is pretty advanced technology, and more importantly, it requires a lot of specific data for training.
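As far as I understand that paper, the network outputs a posterior for each label on every frame (a filler label plus the keyword's word units), and a separate post-processing step smooths those posteriors over a short window and combines their recent maxima into a single confidence score. Below is a rough sketch of that post-processing step, based on my reading and not on Google's code. The function name `keyword_confidence`, the data layout, and the window parameters are assumptions for illustration. The network itself is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// posteriors[t][i]: network output at frame t for label i.
// Assumed layout: label 0 is filler, labels 1..n-1 are the keyword's units.
// Returns a confidence in [0, 1] for the keyword ending around frame j.
double keyword_confidence(const std::vector<std::vector<double>>& posteriors,
                          std::size_t j,
                          std::size_t w_smooth,   // smoothing window, frames
                          std::size_t w_max) {    // search window, frames
    assert(j < posteriors.size() && w_smooth > 0 && w_max > 0);
    const std::size_t num_labels = posteriors[j].size();
    assert(num_labels >= 2);

    // First frame of the window over which the maxima are taken.
    const std::size_t h_max = (j + 1 >= w_max) ? j + 1 - w_max : 0;
    double product = 1.0;

    for (std::size_t i = 1; i < num_labels; ++i) {
        double best = 0.0;
        for (std::size_t k = h_max; k <= j; ++k) {
            // Smoothed posterior: mean of the raw posteriors over up to
            // w_smooth frames ending at frame k.
            const std::size_t h_smooth =
                (k + 1 >= w_smooth) ? k + 1 - w_smooth : 0;
            double sum = 0.0;
            for (std::size_t m = h_smooth; m <= k; ++m) {
                sum += posteriors[m][i];
            }
            best = std::max(best,
                            sum / static_cast<double>(k - h_smooth + 1));
        }
        product *= best;
    }
    // Geometric mean of the per-label maxima.
    return std::pow(product, 1.0 / static_cast<double>(num_labels - 1));
}
```

You would call this on every new frame and report a detection when the returned value exceeds a threshold tuned on held-out data. Note that this score does not check that the units appear in the right order, and the triple loop is written for clarity, not speed.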

Source: https://stackoverflow.com/questions/28952997/how-ok-google-technology-is-implemented

Tags

c++

audio

voice-recognition