image processing to improve tesseract OCR accuracy

鱼传尺愫 2020-11-22 14:41

I've been using tesseract to convert documents into text. The quality of the documents ranges wildly, and I'm looking for tips on what sort of image processing might impr

13 answers
  •  余生分开走
    2020-11-22 15:12

    The Tesseract documentation contains some good details on how to improve the OCR quality via image processing steps.

    To some degree, Tesseract automatically applies them. It is also possible to tell Tesseract to write an intermediate image for inspection, i.e. to check how well the internal image processing works (search for tessedit_write_images in the above reference).
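
    A rough sketch of that inspection step follows. The helper name run_with_debug_image and the file names page.png and result are made up for illustration, and the name of the image that gets written (tessinput.tif, as far as I know) may differ between versions:

```shell
#!/bin/sh
# Sketch: ask Tesseract to save the image it feeds to the recognizer.
# run_with_debug_image, page.png and result are hypothetical names.

run_with_debug_image() {
    input=$1     # image to recognize
    outbase=$2   # output base name (Tesseract appends the extension)

    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found in PATH" >&2
        return 127
    fi

    # -c sets a config variable; tessedit_write_images makes Tesseract
    # dump its pre-processed input image next to the normal output.
    tesseract "$input" "$outbase" -c tessedit_write_images=true
}

# Only run when Tesseract is installed and the sample page exists.
if command -v tesseract >/dev/null 2>&1 && [ -f page.png ]; then
    run_with_debug_image page.png result
fi
```

    If the dumped image already looks clean (sharp, black text on a white background), extra pre-processing probably will not gain much.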

    More importantly, the new neural network system in Tesseract 4 yields much better OCR results - in general and especially for images with some noise. It is enabled with --oem 1, e.g. as in:

    $ tesseract --oem 1 -l deu page.png result pdf
    

    (This example selects the German language.)

    So it makes sense to first test how far you get with the new Tesseract LSTM mode before adding custom image pre-processing steps.
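
    One way to run that test is to recognize the same page with the legacy engine (--oem 0) and the LSTM engine (--oem 1) and compare the text. This is only a sketch: compare_engines and the result_oem0/result_oem1 names are invented here, and it assumes the installed language data still contains the legacy models, otherwise the --oem 0 run will fail:

```shell
#!/bin/sh
# Sketch: compare legacy (--oem 0) and LSTM (--oem 1) output for one page.
# compare_engines and the result_oemN names are hypothetical.
# Assumption: the traineddata for the language includes legacy models;
# if not, the --oem 0 run fails and only the LSTM text is written.

compare_engines() {
    lang=$1    # language code, e.g. deu
    input=$2   # image to recognize

    for oem in 0 1; do
        # "txt" is the config that selects plain-text output.
        tesseract --oem "$oem" -l "$lang" "$input" "result_oem$oem" txt ||
            echo "engine mode $oem failed for $input" >&2
    done
}

# Only run when Tesseract is installed and the sample page exists.
if command -v tesseract >/dev/null 2>&1 && [ -f page.png ]; then
    compare_engines deu page.png
    # diff exits non-zero when the files differ; that is expected here.
    diff result_oem0.txt result_oem1.txt || true
fi
```

    If the LSTM output is already good enough, you can skip the custom pre-processing entirely.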
