How do I improve the accuracy of the OCR text from Tesseract?

Question

I created a basic app that recognizes text using the Tesseract API from Google and integrated it with my camera app. It works, but the accuracy is a problem: sometimes the text comes back as a random set of characters, and I would guess the accuracy is about 50 percent.

Further, when it tries to scan more than four words in an image, the app crashes.

String ocrText = baseApi.getUTF8Text();
baseApi.end();

where baseApi is the object of the Tesseract API class.

Do I need to use a different data structure to save the recognized text or is there some other reason why more than four words don't get recognized?

Answer 1:

The Tesseract API class provides an isValidWord method that checks whether a string is a valid word. You can use it to check the recognized text and discard tokens that are not real words, which improves the quality of the output.

I am developing with Tess4J, which is a Java JNA wrapper for tesseract-ocr, and it gives quite good results after this check.
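
As a rough illustration of the check described above, here is a minimal sketch of filtering OCR output by word validity. The class name OcrWordFilter, the method keepValidWords, and the small dictionary are all hypothetical; the Predicate is only a stand-in for whatever valid-word check your Tesseract wrapper exposes, so the example runs without Tesseract installed.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class OcrWordFilter {

    // Keeps only the whitespace-separated tokens that the validator accepts.
    // The validator is an assumption: plug in your wrapper's valid-word check.
    static String keepValidWords(String ocrText, Predicate<String> isValidWord) {
        return Arrays.stream(ocrText.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .filter(isValidWord)
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Stand-in dictionary, used here only so the sketch is self-contained.
        Set<String> dictionary = Set.of("hello", "world");
        Predicate<String> validator = w -> dictionary.contains(w.toLowerCase());

        // The garbage token in the middle is dropped.
        System.out.println(keepValidWords("hello xq#7z world", validator));
    }
}
```

Note that this kind of filter only removes garbage tokens after the fact. It does not make the recognition itself any better, and it will also drop legitimate words that the dictionary does not know, such as names or numbers.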

Inaccurate results might be due to the text size. The Tesseract FAQ says: "Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi."
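
To put that figure in pixel terms: one point is 1/72 inch, so 10pt text at 300dpi is about 42 pixels tall (nominal font size, not the height of an individual letter). The sketch below computes this and estimates how much an image would need to be upscaled to reach that size. The class and method names are hypothetical, and treating 10pt x 300dpi as a target to scale towards is my own assumption based on the quote, not something the quoted text prescribes.

```java
class TextSizeCheck {

    // Nominal text height in pixels for a given point size and resolution
    // (1 pt = 1/72 inch).
    static double pixelHeight(double pointSize, double dpi) {
        return pointSize / 72.0 * dpi;
    }

    // Factor by which an image would need to be enlarged so that its text
    // matches the 10pt x 300dpi reference size. Returns 1.0 when the text
    // is already at least that large. Using this reference as a target is
    // an assumption made for illustration.
    static double upscaleFactor(double pointSize, double dpi) {
        double target = pixelHeight(10.0, 300.0);
        double actual = pixelHeight(pointSize, dpi);
        return Math.max(1.0, target / actual);
    }

    public static void main(String[] args) {
        System.out.printf("10pt at 300dpi: %.1f px%n", pixelHeight(10.0, 300.0));
        System.out.printf("Scale needed for 10pt at 150dpi: %.2f%n", upscaleFactor(10.0, 150.0));
    }
}
```

Upscaling cannot restore detail that the camera never captured, so where possible it is better to capture at a higher resolution or move the camera closer to the text.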

Further, whether more than four words can be detected depends on many factors: the kind of test image (and how many features it has), the size of the image, the platform, and so on.

Source: https://stackoverflow.com/questions/11301343/how-do-i-improve-the-accuracy-of-the-ocr-text-from-tesseract

Tags

java

android

android-ndk

ocr

tesseract