Speech recognition response is poor in sphinx4

Submitted by 为君一笑 on 2019-11-30 07:43:55

The most common reasons for bad recognition accuracy are:

  1. A sample-rate mismatch in the incoming audio. The input must be a 16 kHz, 16-bit, mono, little-endian file. If the source has a different sample rate, you need to resample it.

  2. Regions of pure digital silence (all-zero samples) in audio files decoded from MP3 break the decoder. You can apply dither, which introduces a small amount of random noise, to solve this problem.

  3. A mismatch of the acoustic model. You can use acoustic model adaptation to improve accuracy.

  4. A mismatch of the language model. You can create your own language model to match the vocabulary you are trying to decode.
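
For points 1 and 2, here is a minimal sketch using only the standard Java Sound API. The class name `AudioPrep` and its methods are made up for illustration and are not part of sphinx4. Whether `AudioSystem` can actually do the sample-rate conversion depends on the converters installed in your JVM; if it throws `IllegalArgumentException`, resample with an external tool instead. The dither here is a simple +/-1 LSB random offset, which is my assumption of "small random noise", not sphinx4's own implementation.

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.File;
import java.util.Random;

// Hypothetical helper, not a sphinx4 class.
final class AudioPrep {
    // 16 kHz, 16-bit, mono, signed PCM, little-endian
    static final AudioFormat TARGET = new AudioFormat(16000f, 16, 1, true, false);

    // True if the format already matches what the decoder expects.
    static boolean matchesTarget(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getSampleRate() == 16000f
                && f.getSampleSizeInBits() == 16
                && f.getChannels() == 1
                && !f.isBigEndian();
    }

    // Opens a file and, if needed, asks Java Sound to convert it to TARGET.
    // Throws IllegalArgumentException when no suitable converter is installed.
    static AudioInputStream openAsTarget(File in) throws Exception {
        AudioInputStream src = AudioSystem.getAudioInputStream(in);
        return matchesTarget(src.getFormat())
                ? src
                : AudioSystem.getAudioInputStream(TARGET, src);
    }

    // Adds a random offset of -1, 0 or +1 to every 16-bit little-endian
    // sample in place, so all-zero regions are no longer exactly zero.
    static void dither(byte[] pcm, Random rnd) {
        for (int i = 0; i + 1 < pcm.length; i += 2) {
            int s = (short) ((pcm[i] & 0xff) | (pcm[i + 1] << 8));
            s += rnd.nextInt(3) - 1;
            s = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, s));
            pcm[i] = (byte) s;
            pcm[i + 1] = (byte) (s >> 8);
        }
    }
}
```

Checking `matchesTarget` on your input before decoding is a cheap way to rule out reason 1 before you start looking at the models.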

You can get more information from the tutorial:

http://cmusphinx.sourceforge.net/wiki/tutorial

To get more detailed help, you can always provide the audio samples you are trying to decode. They help developers analyze the problem. It is also helpful to provide the actual results you are getting from the decoder and what you expected.

CMU Sphinx is working very well for me. Just to share some knowledge, my setup is:

  • Linux OS of course.
  • I record 32 kHz .wav files that I later pass to the Recognizer as the audioFileDataSource to get the speech converted to text.
  • Trigram Language Model (SimpleNGramModel class)
  • My Language Model is a custom one that I generated with the words/phrases I wanted. I used the CMU Cam Toolkit version 2 (docs available at http://svr-www.eng.cam.ac.uk/~prc14/toolkit_documentation.html) to generate my own trigram.arpa files.
  • My Acoustic Model is wsj (TiedStateAcousticModel class) and wsjLoader (Sphinx3Loader class) with the WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.jar (for some reason, this works better for me than the 16kHz model) and its dictionary.
  • I use a Live FrontEnd with melFilterBank (tuned up to the acoustic model parameters) and liveCMN.

I think the key is to generate the appropriate trigram.arpa files using the tools.
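
To give an idea of what goes into such a file, here is a rough sketch of the raw trigram counts that a language model toolkit starts from. It is not a replacement for the toolkit: a real trigram.arpa file also contains unigrams, bigrams, smoothed log probabilities and backoff weights. The class name `TrigramCounts` and the example phrases are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
final class TrigramCounts {
    // Counts word trigrams per sentence, using <s> and </s> as
    // sentence-start and sentence-end markers.
    static Map<String, Integer> count(String[] sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            String[] w = ("<s> " + sentence.trim() + " </s>").split("\\s+");
            for (int i = 0; i + 2 < w.length; i++) {
                String trigram = w[i] + " " + w[i + 1] + " " + w[i + 2];
                counts.merge(trigram, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] corpus = {"turn on the light", "turn off the light"};
        // Print each trigram with the number of times it was seen.
        count(corpus).forEach((t, n) -> System.out.println(n + "\t" + t));
    }
}
```

The point is that the model only knows word sequences that appear in your corpus, so the corpus should contain the phrases you actually expect people to say.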

You will have to tune your sphinx config properties as needed; there is no magic bullet for that. Some of the ones that helped me are the speechClassifierThreshold (44) and the speechMarkerTrailer (77).

Hope it helps or at least gives you some ideas.
