问题
I want to use tesseract
to recognize only numbers. The problem is that I have mixture of numbers & letters and when I use SetVariable("tessedit_char_whitelist", "0123456789")
for every symbol tesseract returns wrong digit.
Can I set a threshold value so that tesseract
omits the symbols with low resemblance?
NOTE: I set tesseract
to recognize only digits so there is no confusion between O and 0.
回答1:
Recognizing only numbers is actually answered on the tesseract FAQ page. See that page for more info, but if you have the version 3 package, the config files are already set up. You just specify on the commandline:
tesseract image.tif outputbase nobatch digits
As for the threshold value, I'm not sure which you mean. If your input is an unusual font, perhaps you might retrain with a sample of your input. An alternative is to change tesseract's pruning threshold. Both options are also mentioned in the FAQ.
回答2:
For tesseract 3, the command is simpler tesseract imagename outputbase digits
according to FAQ. But it doesn't work for me very well.
I turn to try different psm
options and find -psm 6
works best for my case.
man tesseract
for details.
回答3:
For tesseract 3, i try to create config file according FAQ.
BEFORE calling an Init function or put this in a text file called tessdata/configs/digits
:
tessedit_char_whitelist 0123456789
then, it works by using the command: tesseract imagename outputbase digits
回答4:
If one want to match 0-9
tesseract myimage.png stdout -c tessedit_char_whitelist=0123456789
Or if one almost wants to match 0-9, but with one or more different characters
tesseract myimage.png stdout -c tessedit_char_whitelist=01234ABCDE
回答5:
I made it a bit different (with tess-two). Maybe it will be useful for somebody.
So you need to initialize first the API.
TessBaseAPI baseApi = new TessBaseAPI();
baseApi.init(datapath, language, ocrEngineMode);
Then set the following variables
baseApi.setPageSegMode(TessBaseAPI.PageSegMode.PSM_SINGLE_LINE);
baseApi.setVariable(TessBaseAPI.VAR_CHAR_BLACKLIST, "!?@#$%&*()<>_-+=/:;'\"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz");
baseApi.setVariable(TessBaseAPI.VAR_CHAR_WHITELIST, ".,0123456789");
baseApi.setVariable("classify_bln_numeric_mode", "1");
In this way the engine will check only the numbers.
回答6:
You can instruct tesseract to use only digits, and if that is not accurate enough then best chance of getting better results is to go trough training process: http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/
回答7:
This feature is not supported in version 4. You can still use it via -c tessedit_char_whitelist=0123456789 with "--oem 0" which reverts to the old model.
There is a bounty to fix this issue.
Possible workarounds:
As stated by @amitdo
- Using the --oem 0 option (the legacy engine will be used)
- Retraining (fine tuning) #751 (comment)
- Post-processing #751 (comment)
回答8:
add "--psm 7 -c tessedit_char_whitelist=0123456789'" works for me when the image contain's only 1 line.
回答9:
What I do is to recognize everything, and when I have the text, I take out all the characters except numbers
//This replaces all except numbers from 0 to 9
recognizedText = recognizedText.replaceAll("[^0-9]+", " ");
This works pretty well for me.
来源:https://stackoverflow.com/questions/4944830/how-to-make-tesseract-to-recognize-only-numbers-when-they-are-mixed-with-letter