How do I improve the accuracy of the OCR text from Tesseract?

守給你的承諾、 submitted on 2019-12-18 17:54:19

Question


I created a basic text-recognition app using the Tesseract API from Google and integrated it with my camera app. It works, but accuracy is a problem: the text is sometimes recognized as a random set of characters, and I estimate the accuracy at about 50 percent.

Further, when it tries to scan more than four words in an image, the app crashes.

// Read the recognized text, then release the engine's resources.
String ocrText = baseApi.getUTF8Text();
baseApi.end();

where baseApi is an instance of the Tesseract API class.

Do I need to use a different data structure to save the recognized text or is there some other reason why more than four words don't get recognized?


Answer 1:


The Tesseract API class provides an isValidWord method that checks whether a string is a valid word. You can use it to check the recognized characters and discard output that fails the check, which can improve the accuracy of the final text.

I am developing with Tess4J, which is a Java JNA wrapper for tesseract-ocr, and it gives quite good results after this kind of checking.
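As a rough illustration of that checking step, here is a minimal sketch in plain Java. The class and method names (OcrWordFilter, keepValidWords, validRatio) are hypothetical, and the Predicate is only a stand-in for whatever validity check your wrapper exposes, such as isValidWord; its exact signature is not shown here, so adapt the call to your library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper: keeps only the OCR tokens that a validity check accepts. */
final class OcrWordFilter {
    private OcrWordFilter() {}

    /** Splits OCR output on whitespace and trims punctuation from each token. */
    static List<String> tokenize(String ocrText) {
        List<String> words = new ArrayList<>();
        if (ocrText == null || ocrText.trim().isEmpty()) {
            return words; // nothing was recognized
        }
        for (String token : ocrText.trim().split("\\s+")) {
            // Drop leading and trailing ASCII punctuation, e.g. "hello," -> "hello".
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Returns the tokens accepted by isValid (assumed stand-in for isValidWord). */
    static List<String> keepValidWords(String ocrText, Predicate<String> isValid) {
        List<String> kept = new ArrayList<>();
        for (String word : tokenize(ocrText)) {
            if (isValid.test(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    /** Fraction of tokens accepted; 0.0 when there are no tokens. */
    static double validRatio(String ocrText, Predicate<String> isValid) {
        List<String> all = tokenize(ocrText);
        if (all.isEmpty()) {
            return 0.0;
        }
        return (double) keepValidWords(ocrText, isValid).size() / all.size();
    }
}
```

A low validRatio could be used as a signal to ask the user to retake the photo rather than showing garbage text. Note that filtering only hides bad output; it does not make the recognition itself better.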

Inaccurate results might be due to the text size. The reference on this says: "Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi."
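To make that threshold concrete, the sketch below converts a point size and resolution into a nominal text height in pixels and estimates how much an image would need to be enlarged to reach the equivalent of 10pt at 300dpi. The class and method names are hypothetical, and treating the quoted figure as a hard target is an assumption; it only illustrates the arithmetic (1 point = 1/72 inch).

```java
/** Hypothetical helper for the text-size arithmetic behind the quoted guideline. */
final class OcrTextSize {
    static final double POINTS_PER_INCH = 72.0;
    // Assumed target, taken from the quoted "10pt x 300dpi" figure.
    static final double MIN_POINT_SIZE = 10.0;
    static final double REFERENCE_DPI = 300.0;

    private OcrTextSize() {}

    /** Nominal height in pixels of text with the given point size at the given dpi. */
    static double pixelHeight(double pointSize, double dpi) {
        if (pointSize <= 0 || dpi <= 0) {
            throw new IllegalArgumentException("pointSize and dpi must be positive");
        }
        return pointSize / POINTS_PER_INCH * dpi;
    }

    /** Factor by which to enlarge the image; 1.0 if the text is already large enough. */
    static double upscaleFactor(double pointSize, double dpi) {
        double target = pixelHeight(MIN_POINT_SIZE, REFERENCE_DPI);
        double actual = pixelHeight(pointSize, dpi);
        return Math.max(1.0, target / actual);
    }
}
```

For example, OcrTextSize.upscaleFactor(10, 150) suggests enlarging a 150dpi scan of 10pt text by about a factor of two. Enlarging cannot restore detail the camera never captured, so capturing at a higher resolution or moving closer to the text is usually the better fix when it is possible.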

Further, whether more than four words can be detected depends on many factors: the kind of test image (and how many features it contains), the size of the image, the platform, and so on.



Source: https://stackoverflow.com/questions/11301343/how-do-i-improve-the-accuracy-of-the-ocr-text-from-tesseract
