Java voice recognition for very small dictionary

I have MP3 audio files that contain voicemails that are left by a computer.

The message content is always in the same format and is left by the same computer voice, with only a slight variation in content:

"You sold 4 cars today" (where the 4 can be anything from 0 to 9).

I have been trying to set up Sphinx, but the out-of-the-box models did not work very well.

I then tried to write my own acoustic model and haven't had much better success yet (30% unrecognized is my best).

I am wondering if voice recognition might be overkill for this task since I have exactly ONE voice, an expected audio pattern and a very limited dictionary that would need to be recognized.

I have access to each of the ten sounds (spoken numbers) that I would need to search for in the message.

Is there a non-VR approach to finding sounds in an audio file? (I can convert MP3 to another format if necessary.)

Update: My solution to this task follows

After working with Nikolay directly, I learned that the answer to my original question is irrelevant since the desired results may be achieved (with 100% accuracy) using Sphinx4 and a JSGF grammar.

1: Since the speech in my audio files is very limited, I created a JSGF grammar (salesreport.gram) to describe it. All of the information I needed to create the following grammar was available on this JSpeech Grammar Format page.

#JSGF V1.0;

grammar salesreport;

public <salesreport> = (<intro> | <sales> | <closing>)+;

<intro> = this is your automated automobile sales report;

<sales> = you sold <digit> cars today;

<closing> = thank you for using this system;

<digit> = zero | one | two | three | four | five | six | seven | eight | nine;

NOTE: Sphinx does not support JSGF tags in the grammar. If necessary, a regular expression may be used to extract specific information (the number of sales in my case).
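The regular expression mentioned in the note could look roughly like the sketch below. It assumes the hypothesis string contains the words of the <sales> rule exactly as written in the grammar; the class and method names (SalesExtractor, extractSales) are made up for illustration and are not part of Sphinx.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Sphinx: pulls the digit out of a recognized hypothesis.
class SalesExtractor {

    // Digit words in the same order as the <digit> rule, so the list index is the numeric value.
    private static final List<String> DIGITS = Arrays.asList(
            "zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine");

    // Mirrors the <sales> rule: "you sold <digit> cars today".
    private static final Pattern SALES = Pattern.compile(
            "you sold (" + String.join("|", DIGITS) + ") cars today");

    // Returns the number of sales, or an empty OptionalInt if the phrase is not present.
    static OptionalInt extractSales(String hypothesis) {
        Matcher matcher = SALES.matcher(hypothesis.toLowerCase(Locale.ROOT));
        if (!matcher.find()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(DIGITS.indexOf(matcher.group(1)));
    }

    public static void main(String[] args) {
        String hypothesis = "this is your automated automobile sales report "
                + "you sold four cars today thank you for using this system";
        System.out.println(extractSales(hypothesis));
    }
}
```

Keeping the digit words in a list means the same list builds the pattern and maps the matched word back to a number.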

2: It is very important that your audio files are properly formatted. The default sample rate for Sphinx is 16 kHz (16 kHz means 16000 samples are collected every second). I converted my MP3 audio files to WAV format using FFmpeg.

ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav

Unfortunately, FFmpeg renders this solution OS dependent. I am still looking for a way to convert the files using Java and will update this post if/when I find it.
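This does not remove the dependency on FFmpeg, but the same conversion can at least be started from Java with ProcessBuilder. The sketch below assumes an ffmpeg executable is on the PATH; the class and method names are hypothetical, and the added -y flag (overwrite the output file without asking) is my own addition to the command above.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the ffmpeg command line shown above.
class FfmpegConverter {

    // Builds the argument list: 16-bit little-endian PCM, mono, 16000 Hz.
    static List<String> buildCommand(String executable, File input, File output) {
        return Arrays.asList(
                executable, "-y",
                "-i", input.getPath(),
                "-acodec", "pcm_s16le",
                "-ac", "1",
                "-ar", "16000",
                output.getPath());
    }

    // Runs ffmpeg and fails loudly if it returns a non-zero exit code.
    static void convert(File input, File output) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand("ffmpeg", input, output))
                .inheritIO() // show ffmpeg's own output on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("ffmpeg exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: FfmpegConverter input.mp3 output.wav");
            return;
        }
        convert(new File(args[0]), new File(args[1]));
    }
}
```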

Although it was not required to complete this task, I found Audacity helpful for working with audio files. It includes many utilities for working with the audio files (checking sample rate and bandwidth, file format conversion, etc).

3: Since telephone audio has a maximum bandwidth (the range of frequencies included in the audio) of 8kHz, I used the Sphinx en-us-8khz acoustic model.

4: I generated my dictionary, salesreport.dic, using lmtool.

5: Using the files mentioned in the previous steps and the following code (modified version of Nikolay's example), my speech is recognized with 100% accuracy every time.

public String parseAudio(File voiceFile) throws FileNotFoundException, IOException
{
    String retVal = null;
    StringBuilder resultSB = new StringBuilder();

    Configuration configuration = new Configuration();

    configuration.setAcousticModelPath("file:acoustic_models/en-us-8khz");
    configuration.setDictionaryPath("file:salesreport.dic");
    configuration.setGrammarPath("file:salesreportResources/");
    configuration.setGrammarName("salesreport");
    configuration.setUseGrammar(true);

    StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
    try (InputStream stream = new FileInputStream(voiceFile))
    {
        recognizer.startRecognition(stream);

        SpeechResult result;

        while ((result = recognizer.getResult()) != null)
        {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
            resultSB.append(result.getHypothesis()
                    + " ");
        }

        recognizer.stopRecognition();
    }

    return resultSB.toString().trim();
}

The accuracy on a task like this should be 100%. Here is a code sample to use with the grammar:

public class TranscriberDemoGrammar {

    public static void main(String[] args) throws Exception {
        System.out.println("Loading models...");

        Configuration configuration = new Configuration();

        configuration.setAcousticModelPath("file:en-us-8khz");
        configuration.setDictionaryPath("cmu07a.dic");
        configuration.setGrammarPath("file:./");
        configuration.setGrammarName("digits");
        configuration.setUseGrammar(true);

        StreamSpeechRecognizer recognizer =
            new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream(new File("file.wav"));
        recognizer.startRecognition(stream);

        SpeechResult result;

        while ((result = recognizer.getResult()) != null) {

            System.out.format("Hypothesis: %s\n",
                              result.getHypothesis());
        }

        recognizer.stopRecognition();
    }
}

You also need to make sure that both the sample rate and the audio bandwidth match the decoder configuration:

http://cmusphinx.sourceforge.net/wiki/faq#qwhat_is_sample_rate_and_how_does_it_affect_accuracy
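One way to check the sample rate from Java before decoding is to read the WAV header with javax.sound.sampled. This is only a sketch: the class and method names are hypothetical, the expected values (16-bit signed little-endian PCM, mono) are taken from the ffmpeg command in the question, and it checks the declared format only, not the actual audio bandwidth.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical pre-flight check for files that will be passed to the recognizer.
class WavFormatCheck {

    // True if the format is 16-bit signed little-endian PCM, mono, at the expected sample rate.
    static boolean matches(AudioFormat format, float expectedSampleRate) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleSizeInBits() == 16
                && format.getChannels() == 1
                && !format.isBigEndian()
                && format.getSampleRate() == expectedSampleRate;
    }

    // Reads only the file header; the audio data itself is not decoded.
    static AudioFormat readFormat(File wavFile)
            throws IOException, UnsupportedAudioFileException {
        return AudioSystem.getAudioFileFormat(wavFile).getFormat();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: WavFormatCheck file.wav");
            return;
        }
        AudioFormat format = readFormat(new File(args[0]));
        System.out.println(format + " -> matches 16000 Hz: " + matches(format, 16000f));
    }
}
```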

First of all, Sphinx only works with WAVE files. For a very limited vocabulary, Sphinx should give good results when using a JSGF grammar file (though not as good in dictation mode). The main issue I found is that it does not provide a confidence score (this is currently bugged). You might want to check three other alternatives:

SpeechRecognizer from the Windows platform. It provides easy-to-use recognition with confidence scores and supports grammars. It is C#, but you could build a native wrapper or a custom server.
Google Speech API is an online speech recognition engine, free for up to 50 requests per day. There are several APIs for it, but I like JARVIS. Be careful though: there is no official support or documentation for it, and Google might close this engine whenever they want (they already have in the past). Of course, there is also a privacy issue (is it okay to send this audio data to a third party?).
I recently came across ISpeech and got good results with it. It provides its own Java wrapper API, free for mobile apps. It has the same privacy issue as the Google API.

I myself chose to go with the first option and built a speech recognition service in a custom HTTP server. I found it to be the most effective way to tackle speech recognition from Java until the Sphinx scoring issue is fixed.

Source: https://stackoverflow.com/questions/25507189/java-voice-recognition-for-very-small-dictionary
