Increase Accuracy of text recognition through pytesseract & PIL

Question

I am trying to extract text from an image. Because the quality and size of the image are poor, the results are inaccurate. I tried a few enhancements and other things with PIL, but they only made the image quality worse.

Can someone suggest enhancements to the image that would give better results? A few example images:

Answer 1:

In the provided example image the text looks visually quite good, so the question is why OCR gives inaccurate results.

To illustrate the conclusions in the rest of this answer, let's run the given image through Tesseract. Below is the result of Tesseract OCR:

"fhpgearedmomrs©gmachom"

Now let's enlarge the image to four times its size and apply thresholding. I did the resizing and thresholding manually in Gimp, but with an appropriate resampling method and threshold value it can certainly be automated in PIL, so that the enhanced image looks similar to the one I got:

Running the improved image through Tesseract OCR gives the following text:

"fhpgearedmotors©gmail.com"

This demonstrates that enlarging an image can help achieve 100% accuracy on the provided example image.
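The manual Gimp steps described above could be automated with PIL roughly as follows. This is only a minimal sketch: the function name `enlarge_and_threshold`, the Lanczos resampling filter, and the threshold of 128 are assumptions chosen for illustration, not the settings used in Gimp, and they will likely need tuning per image.

```python
from PIL import Image


def enlarge_and_threshold(img, scale=4, threshold=128):
    """Upscale a PIL image and binarize it to pure black and white.

    `scale` and `threshold` are illustrative defaults, not values
    taken from the original answer's Gimp session.
    """
    # Work on a single grayscale channel so one threshold applies.
    gray = img.convert("L")
    new_size = (gray.width * scale, gray.height * scale)
    # Lanczos is one reasonable resampling choice for enlarging text.
    big = gray.resize(new_size, Image.LANCZOS)
    # Pixels brighter than the threshold become white, the rest black.
    return big.point(lambda p: 255 if p > threshold else 0)
```

The returned image can then be passed to `pytesseract.image_to_string(...)` in place of the original, for example after loading the source with `Image.open(...)`.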

It may seem strange that enlarging an image improves OCR accuracy, but OCR was developed to convert scans of printed media into text and by design expects 300 dpi images of the text. This explains why some OCR programs do not resize the text themselves to improve their results and do badly on small fonts: they expect a higher dpi resolution of the image, which can be achieved by enlarging it.

Here is an excerpt from the Tesseract FAQ on github.com that supports the statement above:

[There is a minimum text size for reasonable accuracy. You have to consider resolution as well as point size. Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi. A quick check is to count the pixels of the x-height of your characters. (X-height is the height of the lower case x.) At 10pt x 300dpi x-heights are typically about 20 pixels, although this can vary dramatically from font to font. Below an x-height of 10 pixels, you have very little chance of accurate results, and below about 8 pixels, most of the text will be "noise removed".]
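The x-height figure in the excerpt suggests a simple way to pick an enlargement factor: measure the x-height of the text in pixels and scale until it reaches roughly 20 pixels. The helper below is a hypothetical sketch of that arithmetic; the name `suggested_scale` and the choice to round up to a whole factor are assumptions, and the 20-pixel target is only the "typical" value quoted above, which varies from font to font.

```python
import math

# Typical x-height at 10pt x 300dpi according to the FAQ excerpt.
TARGET_X_HEIGHT_PX = 20


def suggested_scale(measured_x_height_px, target=TARGET_X_HEIGHT_PX):
    """Return a whole-number enlargement factor that brings the
    measured x-height up to at least `target` pixels."""
    if measured_x_height_px <= 0:
        raise ValueError("x-height must be a positive number of pixels")
    # Never suggest shrinking: text already large enough gets factor 1.
    return max(1, math.ceil(target / measured_x_height_px))


print(suggested_scale(5))   # 4
print(suggested_scale(8))   # 3
print(suggested_scale(25))  # 1
```

With an x-height of about 5 pixels, for example, this gives a factor of 4, the same enlargement used in the experiment earlier in this answer.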

Source: https://stackoverflow.com/questions/43382174/increase-accuracy-of-text-recognition-through-pytesseract-pil

Tags

python-3.x

python-imaging-library

ocr

tesseract

pytesser