Character Recognition using tesseract

Posted by 大城市里の小女人 on 2019-12-03 21:04:00
karlphillip

There are several things you could try:

  • To improve accuracy, improve the quality of the image the OCR engine receives, which means preprocessing the images before feeding them to Tesseract. I suggest investigating OpenCV for this purpose.

  • The main problem with OCR engines is that they are not as good at recognizing characters as we are, so things that are not text sometimes get mistakenly identified as text. To prevent this, it's best to detect the areas of text and send only those to Tesseract instead of sending the full image, as you are doing with image #2.

  • Another way to extract the text regions of an image is to isolate them with this technique.

  • When you get the results from Tesseract, you can improve them by comparing the resulting text to a dictionary.
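
To illustrate the dictionary idea from the last point, here is a minimal sketch using Python's standard `difflib` module. The word list, the sample OCR text, and the `correct_tokens` helper are all made up for the example, and the 0.75 cutoff is just a guess you would need to tune for your own material.

```python
import difflib

def correct_tokens(tokens, vocabulary, cutoff=0.75):
    """Replace each OCR token with its closest vocabulary word, if one is close enough."""
    corrected = []
    for token in tokens:
        # A word list cannot validate numbers, so purely numeric tokens are left alone.
        if token.isdigit():
            corrected.append(token)
            continue
        # get_close_matches returns at most n candidates scoring >= cutoff, best first.
        matches = difflib.get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected

vocabulary = ["invoice", "total", "amount", "date"]  # hypothetical word list
ocr_output = "lnvoice tota1 amount 2019".split()     # hypothetical Tesseract output
print(correct_tokens(ocr_output, vocabulary))
```

Tokens with no close match are passed through unchanged, so a small vocabulary will not destroy text it does not know about.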

Some possible improvements:

  1. The resolution should be at least 300 dpi.
  2. Make the illumination more even. There are several dark areas that might affect the results.
  3. Try rescaling your characters a little. Currently they are different sizes, and some of the letters are even distorted.
  4. Pre-process the image by thresholding and binarization.

You can do the above in your own code, or Fred's ImageMagick Scripts might help.
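
If you go the "own code" route for the thresholding step, a rough sketch could look like the one below. It uses Otsu's method to pick a global threshold, which is my choice for the illustration and not something prescribed above. It is written in plain Python on a list-of-rows grayscale image so it runs without any libraries; the function names and the tiny sample image are invented. A single global threshold will struggle with the uneven illumination mentioned in point 2, so treat it as a starting point only.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 8-bit grayscale values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += histogram[t]      # number of pixels with value <= t
        sum_bg += t * histogram[t]
        if weight_bg == 0:
            continue                   # nothing on the dark side yet
        weight_fg = total - weight_bg  # number of pixels with value > t
        if weight_fg == 0:
            break                      # nothing left on the bright side
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance (up to a constant factor); Otsu maximizes this.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold

def binarize(image, threshold):
    """Map every pixel above the threshold to 255 (white) and the rest to 0 (black)."""
    return [[255 if value > threshold else 0 for value in row] for row in image]

image = [[10, 12, 200], [11, 210, 205]]  # made-up 2x3 grayscale image
threshold = otsu_threshold(value for row in image for value in row)
print(threshold, binarize(image, threshold))
```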

I'm not sure whether this post is useful to you, because my answer is not about Tesseract. It is about high accuracy, though, so I thought you might be interested in seeing how a paid OCR SDK solution performs.

These are the recognition results from ABBYY Cloud OCR SDK without any additional settings.

Disclaimer: I work for ABBYY.

You can try ScanTailor (http://scantailor.sourceforge.net/; it also has a CLI) to binarize, deskew, and dewarp images. Scaling images up might also improve recognition, because Tesseract's recognition profiles were optimized for at least 300 DPI.
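
To make the scaling arithmetic concrete, here is a small sketch that works out the pixel size an image needs to reach 300 DPI. The `upscale_size` helper and the 96 DPI example are hypothetical, and it assumes you actually know the DPI of the scan; the resulting size would then be passed to whatever resize function your image library provides.

```python
def upscale_size(width, height, current_dpi, target_dpi=300):
    """Return the (width, height) in pixels needed to reach target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    # Never shrink: images already at or above the target keep their size.
    scale = max(1.0, target_dpi / current_dpi)
    return round(width * scale), round(height * scale)

# Example: a 640x480 screenshot assumed to be 96 DPI.
print(upscale_size(640, 480, current_dpi=96))
```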

Another possibility is to train Tesseract on fonts that are characteristic of your material (more on this can be found here: https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3).

I don't think that dictionary lookup will help here, because you have mostly numbers.
