Tesseract receipt scanning advice needed

暗喜 2020-12-23 10:46

I have struggled off and on again with Tesseract for various OCR projects and I found a use case today which I thought would be a slam dunk for it but after many hours I am

2 Answers
  •  被撕碎了的回忆
    2020-12-23 11:07

    Text recognition on receipts is one of the hardest problems for OCR to handle.

    The reasons are numerous:

    • receipts are printed on cheap paper with cheap printers: the goal is low cost, not readability!
    • they contain a large amount of dense text (especially Walmart receipts)
    • existing OCR engines are almost exclusively trained on non-receipt data (books, documents, etc.)
    • receipt structure, which is something between tabular and freeform, is hard for any layout engine to handle.

    Your best bet is to perform the following:

    • Analyse the input images. If they are hard to read by eye, they will be hard for Tesseract to read as well.
    • Perform additional image preprocessing. Image scaling (0.5x, 1.5x, 2x) sometimes helps a lot. Removing existing noise also helps.
    • Tesseract training. It's not that hard to do :)
    • Postprocess the OCR result to recover the layout.
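The preprocessing step above can be sketched roughly as follows. This is only an illustration on a grayscale image stored as a list of rows of 0-255 integers; in a real pipeline you would normally use an imaging library instead. The function names (`scale_nearest`, `otsu_threshold`, `binarize`) are made up for this sketch, and Otsu binarisation is just one possible cleanup step that I picked as an example, not something prescribed above.

```python
def scale_nearest(img, factor):
    """Resize a grayscale image (list of rows of 0-255 ints) by nearest-neighbour sampling."""
    src_h, src_w = len(img), len(img[0])
    dst_h = max(1, round(src_h * factor))
    dst_w = max(1, round(src_w * factor))
    return [
        [img[min(src_h - 1, int(y / factor))][min(src_w - 1, int(x / factor))]
         for x in range(dst_w)]
        for y in range(dst_h)
    ]


def otsu_threshold(img):
    """Return the threshold that maximises between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]            # pixels with value <= t
        if w_bg == 0:
            continue
        w_fg = total - w_bg        # pixels with value > t
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(img, threshold):
    """Map pixels <= threshold to black (0) and the rest to white (255)."""
    return [[0 if p <= threshold else 255 for p in row] for row in img]


# Try the scale factors mentioned above and binarise each variant;
# each result would then be passed to the OCR engine and the best output kept.
image = [[30, 40, 200, 210],
         [35, 45, 205, 215]]
variants = {f: binarize(scale_nearest(image, f), otsu_threshold(image))
            for f in (0.5, 1.5, 2.0)}
```

Which scale factor works best depends on the receipt, so it is usually worth running OCR on every variant and comparing the results.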

    Layout is best recovered by analysing the geometry of the results, not by regexes, because regexes break down when the OCR output contains errors. With geometry, for example, you find a good candidate for the UPC number, draw a line through the centers of its characters, and the price that line passes through is the one that belongs to that UPC.
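Here is a minimal sketch of that geometric idea, assuming the OCR engine gives you per-character center coordinates for the UPC candidate and a center point for each price candidate. The helper names (`fit_line`, `price_on_line`), the `max_dist` tolerance, and the sample coordinates are all hypothetical; they are not part of Tesseract or any other library.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b through a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        # All points share one x coordinate; fall back to a horizontal line.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def price_on_line(upc_char_centers, price_candidates, max_dist):
    """Return the text of the price whose center lies closest to the UPC baseline.

    price_candidates is a list of (text, cx, cy) tuples.
    Returns None if no candidate is within max_dist pixels of the line.
    """
    a, b = fit_line(upc_char_centers)
    best = None
    for text, cx, cy in price_candidates:
        # Perpendicular distance from the point to the line y = a*x + b.
        dist = abs(a * cx + b - cy) / (1 + a * a) ** 0.5
        if dist <= max_dist and (best is None or dist < best[0]):
            best = (dist, text)
    return best[1] if best else None


# A slightly skewed receipt: the text line rises 0.05 px per px to the right.
upc_centers = [(x, 100 + 0.05 * x) for x in range(10, 130, 10)]  # 12 digits
prices = [("12.48", 400, 95.0), ("3.97", 400, 120.5), ("1.00", 400, 145.0)]

matched = price_on_line(upc_centers, prices, max_dist=5)
```

In this made-up example the UPC sits at roughly y = 103, so matching purely on the nearest y coordinate would pick "12.48", while following the fitted line across the skewed receipt picks "3.97". That is the reason for fitting a line through the character centers rather than comparing row positions directly.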

    Also, some commercial solutions have customisations for receipt scanning, and can even run very fast on mobile devices.

    The company I work with, MicroBlink, has an OCR module for mobile devices. If you're on iOS, you can easily try it using CocoaPods:

    pod try PPBlinkOCR
    
