Question
I'm using the tesseract-ocr package on Ubuntu Linux. I have been using it for a while, and I think that to improve the OCR accuracy I only need a subset of the characters in the alphabet. The characters I need are:
0123456789abcdefghijklmnopqrstuvwxyz
and only those, not even capital letters. Can anybody give me a hand with telling tesseract to match against only this subset of characters?
Thanks,
Answer 1:
From the python-tesseract project page:
import tesseract
api = tesseract.TessBaseAPI()
api.Init(".","eng",tesseract.OEM_DEFAULT)
api.SetVariable("tessedit_char_whitelist", "0123456789abcdefghijklmnopqrstuvwxyz")
api.SetPageSegMode(tesseract.PSM_AUTO)
So just set your own collection of characters in api.SetVariable.
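As a small sketch of that, the character set from the question can be built with Python's standard string module instead of being typed by hand. The name WHITELIST is illustrative only, and the commented-out call assumes the api object from the snippet above:

```python
import string

# Digits followed by lowercase ASCII letters; no capitals, no punctuation.
WHITELIST = string.digits + string.ascii_lowercase

# With the python-tesseract object shown above, this would be passed as:
# api.SetVariable("tessedit_char_whitelist", WHITELIST)
print(WHITELIST)
```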
From the tesseract-ocr project FAQ:
Tesseract 2.03: Use
TessBaseAPI::SetVariable("tessedit_char_whitelist", "0123456789");
BEFORE calling an Init function, or put this in a text file called tessdata/configs/digits:
tessedit_char_whitelist 0123456789
and then your command line becomes:
tesseract image.tif outputbase nobatch digits
Warning: Until the old and new config variables get merged, you must have the nobatch parameter too.
Tesseract 3: A digits config file is already created, so just run a tesseract command like this:
tesseract imagename outputbase digits
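For the character set in the question, the same config-file approach could be scripted roughly as follows. This is a sketch and not something from the FAQ: the helper names, the config name alnum_lower and the configs directory are all hypothetical, and it assumes the tesseract binary is on the PATH and finds config names in its tessdata/configs directory as described above. The command follows the Tesseract 3 form; the 2.03 form would also need nobatch.

```python
import os
import string
import subprocess

# Characters from the question: digits plus lowercase letters.
WHITELIST = string.digits + string.ascii_lowercase


def write_whitelist_config(configs_dir, name="alnum_lower", chars=WHITELIST):
    """Write a config file holding a single tessedit_char_whitelist line.

    configs_dir is assumed to be your installation's tessdata/configs
    directory; the default file name is a made-up example.
    """
    path = os.path.join(configs_dir, name)
    with open(path, "w") as f:
        f.write("tessedit_char_whitelist " + chars + "\n")
    return path


def build_command(image, outputbase, config_name="alnum_lower"):
    """Build the argument list: tesseract <image> <outputbase> <config>."""
    return ["tesseract", image, outputbase, config_name]


def run_ocr(image, outputbase, config_name="alnum_lower"):
    """Run the tesseract CLI; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_command(image, outputbase, config_name))
```

Calling write_whitelist_config once with the path to your own tessdata/configs directory, then run_ocr("image.tif", "outputbase"), would be the equivalent of the digits example with the larger character set.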
Answer 2:
What you're looking for is the Tesseract whitelist. If you're using Python and working with the API, I think this should work (found on the Tesseract Google Group):
api.SetVariable("tessedit_char_whitelist", "abcdefghijklmnopqrstuvwxyz0123456789 ");
Note, I'm not sure which version of Tesseract this is for.
Source: https://stackoverflow.com/questions/15512193/tesseract-use-subset-of-letters