Question
I am currently working on a project for Android using Tesseract OCR. I was hoping to fine-tune the results given to the user by adding a dictionary. According to http://code.google.com/p/tesseract-ocr/wiki/FAQ , the best way to go about this would be to:
Replace tessdata/eng.user-words with your own word list, in the same format - UTF8 text, one word per line.
However, there is no eng.user-words file in the tessdata folder, and I assume that if I just create a text file with my dictionary in it, it will never be used.
Has anybody had a similar experience and knows what to do? Any advice would be a great help.
Answer 1:
If you're using Tesseract 3 (which I assume you are), you'll have to rebuild your eng.traineddata file. I chose to replace the word-dawg file completely to try to get better results (i.e., the words I'm detecting are always the same).
You'll need the combine_tessdata and wordlist2dawg executables, which are in the training directory once you compile Tesseract.
Unpack everything (I did this just to back up my eng.word-dawg; you'll also need the unicharset later):
./combine_tessdata -u eng.traineddata traineddat_backup/.
Create a text file containing your word list (wordlistfile), in UTF-8 with one word per line.
Create an eng.word-dawg from it:
./wordlist2dawg wordlistfile eng.word-dawg traineddat_backup/.unicharset
Replace the word-dawg file inside eng.traineddata:
./combine_tessdata -o eng.traineddata eng.word-dawg
That should be it.
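For convenience, the steps above could be wrapped in a small script along these lines. This is only a sketch, not a tested recipe: the function names clean_wordlist and rebuild_dictionary and the TRAINING_DIR variable are made up for illustration, and it assumes that combine_tessdata -u takes an output path prefix as its last argument (here traineddat_backup/., inferred from the unicharset path used above). Sorting and de-duplicating the word list is optional tidying, not something the tools are known to require.

```shell
#!/usr/bin/env bash
# Sketch only: rebuild the word dictionary inside eng.traineddata (Tesseract 3.x).
# Assumes combine_tessdata and wordlist2dawg live in $TRAINING_DIR (default: .).

# Hypothetical helper: normalise a raw word list to one word per line.
# Strips carriage returns and surrounding whitespace, drops blank lines,
# and removes duplicate entries.
clean_wordlist() {
  local raw="$1" out="$2"
  tr -d '\r' < "$raw" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v '^$' \
    | LC_ALL=C sort -u > "$out"
}

# Hypothetical helper: run the three tool invocations from the answer.
# Must be called from the directory that contains eng.traineddata.
rebuild_dictionary() {
  local wordlist="$1"
  local training_dir="${TRAINING_DIR:-.}"
  # Assumed output prefix; yields files such as traineddat_backup/.unicharset
  local backup_prefix="traineddat_backup/."

  mkdir -p traineddat_backup || return 1

  # 1. Unpack eng.traineddata (keeps a backup of the original components).
  "$training_dir/combine_tessdata" -u eng.traineddata "$backup_prefix" || return 1

  # 2. Build the new word DAWG from the word list and the unpacked unicharset.
  "$training_dir/wordlist2dawg" "$wordlist" eng.word-dawg \
    "${backup_prefix}unicharset" || return 1

  # 3. Overwrite the word-dawg component inside eng.traineddata.
  "$training_dir/combine_tessdata" -o eng.traineddata eng.word-dawg || return 1
}
```

A typical call, run from the tessdata directory, might be: clean_wordlist raw_words.txt wordlistfile && rebuild_dictionary wordlistfile. Since the last step modifies eng.traineddata in place, it is worth keeping a copy of the original file somewhere safe first.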
Source: https://stackoverflow.com/questions/9568165/custom-dictionary-for-tesseract