Improve OCR accuracy from scanned documents

Posted by a 夏天 on 2021-02-07 10:31:23

Question


I'm scanning a lot of A3 documents using a standard Brother A3 Multifunction and then using FineReader Pro to OCR the images.

However, I'm getting a lot of errors in the recognized characters, and lots of strange non-alphanumeric characters.

Can someone give me any tips for programmatically improving the OCR accuracy, either by pre-processing the scanned images or by post-processing the recognized text?


Edit: Here is a sample PDF. It includes some of the images that give me the poorest results.


Answer 1:


Do you have a sample image you can post somewhere? Then we can quickly tell you what is causing most of your problems. FineReader is one of the better OCR engines out there, so there are definitely reasons why you are getting poor results.

It could be related to poor contrast and threshold settings, image skew, dirty rollers in the scanner, complex and coloured backgrounds, dithered backgrounds, font sizes that are too small, a scanning DPI that is too low, etc.

After seeing the attached image there are a few small issues.

  1. There are lots of dirty specks on the page background. FineReader seems to do a reasonable job with this on your images.
  2. There is some slight skew, but that is not causing any problems.
  3. FineReader is getting confused by the bold, tall Arial-type font used for the column headers.
  4. A big problem seems to be the bottom region of the pages, where the contrast is poor and the image is fuzzy. This seems to be a problem with the scanner but could be due to printing problems.
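
If you do want to try removing the specks from item 1 yourself before OCR, a small median filter is one common approach. The following is only a sketch in plain Python, assuming the page is already loaded as a grayscale list of rows (0 = black, 255 = white); the function name `median_filter_3x3` is made up for illustration and is not part of FineReader or any library.

```python
from statistics import median


def median_filter_3x3(gray):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    `gray` is a list of rows of integers in the range 0-255.
    Isolated specks disappear because they are outvoted by their
    neighbours; strokes a few pixels wide survive.
    """
    h, w = len(gray), len(gray[0])
    out = [row[:] for row in gray]
    for y in range(h):
        for x in range(w):
            # Collect the neighbourhood, clipped at the image border.
            neighbours = [
                gray[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = int(median(neighbours))
    return out


# A white page with one black speck in the middle.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0
cleaned = median_filter_3x3(page)
```

Be careful with this on small fonts: a median filter can also erase very thin strokes (around one pixel wide), so it is worth comparing OCR results with and without it.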

The printing is quite poor and I am guessing it is a scan from a newspaper. Most of your errors are due to scanning issues so it would be hard to programmatically improve the results.

Firstly, I would try scanning the image in grayscale using a slightly higher resolution and see if that helps. FineReader works well with grayscale images. If you have to have a B/W image then see if the scanner driver includes a setting for dynamic thresholding and turn it on.
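
If the scanner driver has no dynamic thresholding option, you could scan in grayscale and binarize the image yourself. Below is a minimal sketch of one simple variant (comparing each pixel to the mean of its local window), written in plain Python on a list of rows. The function name and the `window` and `offset` defaults are assumptions chosen for illustration, not settings taken from any scanner driver or from FineReader.

```python
def adaptive_threshold(gray, window=15, offset=10):
    """Binarize a grayscale image (list of rows, values 0-255).

    A pixel becomes black (0) if it is darker than the mean of its
    window x window neighbourhood minus `offset`, otherwise white (255).
    """
    h, w = len(gray), len(gray[0])
    # Integral image with a zero border:
    # integ[y][x] holds the sum of all pixels above and to the left of (y, x).
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integ[y + 1][x + 1] = integ[y][x + 1] + row_sum

    r = window // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            # Sum of the local window from four integral-image lookups.
            total = (integ[y1][x1] - integ[y0][x1]
                     - integ[y1][x0] + integ[y0][x0])
            mean = total / ((y1 - y0) * (x1 - x0))
            if gray[y][x] < mean - offset:
                out[y][x] = 0
    return out


# A page whose background gets darker towards the bottom,
# with one "ink" pixel near the top and one near the bottom.
page = [[200 - 2 * y for _ in range(40)] for y in range(40)]
page[5][5] = 100
page[35][35] = 40
binary = adaptive_threshold(page)
```

The point of the local comparison is that a page with a darkening bottom edge, like the one described above, is judged against its own neighbourhood. A single global cut-off would either turn the dim bottom rows black or lose faint text at the top.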

Your images would not be an easy task for any OCR engine. You will get better results if you can improve the scanning. Page 3 has a lot of noise in the bottom right corner.

What version of FineReader are you using? FR10 would probably give better results than previous versions.



Source: https://stackoverflow.com/questions/4658407/improve-ocr-accuracy-from-scanned-documents
