How to give OCR software the best chance of success?

旧城冷巷雨未停 submitted on 2019-12-04 17:15:04

Shameless plug and disclaimer: my company packages Tesseract for use in .NET

Tesseract is an OK OCR engine. It can miss a lot and is readily confused by non-text content. The best thing you can do for it is to make sure it sees only text. The next best thing is to give it either a sanely binarized image (an adaptive or dynamic threshold will get you there) or a grayscale image, and let it do the binarization itself.
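To make "adaptive threshold" concrete, here is a minimal sketch in plain Python, assuming the image is already a list of rows of 0-255 grayscale values. The function name `adaptive_threshold` and the defaults `block=15` and `offset=10` are my own illustrative choices, not anything Tesseract prescribes. In practice you would use an image library's equivalent.

```python
def adaptive_threshold(gray, block=15, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes 0 (ink) when it is darker than the mean of its
    block x block neighbourhood minus `offset`; otherwise 255 (paper).
    `block` and `offset` are illustrative defaults, tune them per image.
    """
    h, w = len(gray), len(gray[0])

    # Integral image with a zero border:
    # integ[y][x] holds the sum of all pixels in gray[0:y][0:x].
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum

    r = block // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            # Sum of the neighbourhood in O(1) from the integral image.
            total = integ[y1][x1] - integ[y0][x1] - integ[y1][x0] + integ[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            if gray[y][x] < mean - offset:
                out[y][x] = 0
    return out
```

Because each pixel is compared with its own neighbourhood rather than one global cutoff, uneven lighting across the image is less likely to turn background into ink or wash out the characters.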

  1. Train Tesseract to recognize your font
  2. Make the image extra clean, with enough free space around the characters
  3. Profit :)
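For the "free space around characters" part, a rough sketch of padding a tightly cropped image with a blank border before handing it to the OCR engine. It assumes a grayscale image stored as a list of rows with a white (255) background. The name `add_margin` and the 10-pixel default are arbitrary choices for illustration.

```python
def add_margin(gray, margin=10, fill=255):
    """Return a copy of `gray` surrounded by `margin` pixels of `fill`.

    `gray` is a list of rows of 0-255 ints; `fill=255` assumes dark
    text on a white background.
    """
    width = len(gray[0]) + 2 * margin
    top = [[fill] * width for _ in range(margin)]
    bottom = [[fill] * width for _ in range(margin)]
    middle = [[fill] * margin + list(row) + [fill] * margin for row in gray]
    return top + middle + bottom
```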

Here are a few real-world examples.

  • First image is the original image (cropped power meter numbers)
  • Second image is the image slightly cleaned up in GIMP, around 50% OCR accuracy in Tesseract
  • Third image is the completely cleaned image: 100% recognized by OCR without any training!

Even under the best conditions, OCR variants (misreadings such as O for 0) will sneak up on you. Your best option is to design your tests to be aware of them.
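One way to make a test aware of such variants is to compare strings only after folding commonly confused characters onto a single representative. A small sketch follows. The `CONFUSABLE` table and the name `ocr_equal` are hypothetical and deliberately short, so extend the table with whatever confusions you actually observe.

```python
# Hypothetical confusion table: each key is folded onto its look-alike.
CONFUSABLE = str.maketrans({
    "O": "0", "o": "0",
    "I": "1", "l": "1", "|": "1",
    "S": "5",
    "B": "8",
})


def ocr_equal(expected, actual):
    """True if the two strings match once look-alike characters are folded."""
    return expected.translate(CONFUSABLE) == actual.translate(CONFUSABLE)
```

With this, a check such as `ocr_equal("B0ILER 15", "BOILER l5")` passes, while a genuinely different reading such as `ocr_equal("ABC", "ABD")` still fails.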

To distinguish between 0 and O, one simple solution is to choose a font that renders the two differently (e.g., one whose 0 has a slash or dot through its middle). Would that be acceptable in your application?

Another solution is to apply a dictionary-based step after the character-by-character analysis of the text - feeding the recognized text into some form of spell-checker or validator to differentiate between difficult characters.

For instance, a round symbol followed by other digits is most likely a zero, while the same symbol followed by letters is most likely a capital O. It's a trivial example, but it shows why context is necessary for a more reliable OCR system.
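Both ideas can be sketched as a post-processing pass over the recognized text. This is only an illustration under my own assumptions: `fix_zero_oh` applies the digit/letter context rule per token, and `snap_to_vocab` uses the standard library's `difflib.get_close_matches` as a stand-in for a real spell-checker. Both function names and the vocabulary are made up for the example.

```python
import difflib
import re


def fix_zero_oh(text):
    """Resolve 0/O confusion from the other characters in each token."""
    def fix(match):
        tok = match.group(0)
        rest = re.sub(r"[0Oo]", "", tok)  # the unambiguous characters
        if rest and rest.isdigit():
            # Surrounded by digits only: treat every O/o as a zero.
            return re.sub(r"[Oo]", "0", tok)
        if rest and rest.isalpha():
            # Surrounded by letters only: treat every 0 as a capital O.
            return tok.replace("0", "O")
        return tok  # mixed or fully ambiguous token: leave untouched
    return re.sub(r"[A-Za-z0-9]+", fix, text)


def snap_to_vocab(word, vocab, cutoff=0.6):
    """Replace `word` with its closest entry in `vocab`, if one is close enough."""
    matches = difflib.get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

For example, `fix_zero_oh("Meter 1OO5 reads C0DE")` gives `"Meter 1005 reads CODE"`, and `snap_to_vocab("rneter", ["meter", "total", "kwh"])` gives `"meter"`. Tokens that mix letters and digits, such as `A01`, are left alone because context alone cannot settle them.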
