Tesseract receipt scanning advice needed

暗喜 2020-12-23 10:46

I have struggled on and off with Tesseract for various OCR projects, and today I found a use case that I thought would be a slam dunk for it, but after many hours I am

2 answers
  •  忘掉有多难
    2020-12-23 11:24

    I ended up fully fleshing this out and am pretty happy with the results, so I thought I would post it in case anyone else ever finds it useful.

    I did not have to do any image splitting and instead used a regex since the Wal-mart receipts are so predictable.

    I am on Windows, so I created a PowerShell script to run the conversion commands and the regex find and replace:

    # -----------------------------------------------------------------
    # Script: ParseReceipt.ps1
    # Author: Jim Sanders
    # Date: 7/27/2015
    # Keywords: tesseract OCR ImageMagick CSV
    # Comments:
    #   Used to convert a Wal-mart receipt image to a CSV file
    # -----------------------------------------------------------------
    param(
        [Parameter(Mandatory=$true)] [string]$image
    ) # end param
    
    # create output and temporary files based on input name
    $base = (Get-ChildItem -Filter $image -File).BaseName
    $csvOutfile = $base + ".txt"
    $upscaleImage = $base + "_150.png"
    $ocrFile = $base + "_ocr"
    
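    # upscale by 150% to ensure OCR works consistently
    # (this is ImageMagick's convert, not the Windows convert.exe; ImageMagick 7 also provides it as "magick")
    convert $image -resize 150% $upscaleImage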
    
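    # perform the OCR to a temporary file
    # (-psm 6 treats the image as a single uniform block of text; Tesseract 4 and later spell this flag --psm)
    tesseract $upscaleImage -psm 6 $ocrFile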
    
    # column headers for the CSV
    $newline = "Description,UPC,Type,Cost,TaxType`n"
    $newline | Out-File $csvOutfile
    
    # read in the OCR file and write back out the CSV (Tesseract automatically adds .txt to the file name)
    $lines = Get-Content "$ocrFile.txt"
    
    Foreach ($line in $lines) {
        # Wrap the 12-digit UPC code and the price in commas, then strip the whitespace next to those commas, giving us our 5 columns for CSV
        $newline = $line -replace '\s\d{12}\s',',$&,' -replace '.\d+\.\d{2}.',',$&,' -replace ',\s',',' -replace '\s,',','
        $newline | Out-File -Append $csvOutfile
    }
    
    # clean up temporary files
    del $upscaleImage
    del "$ocrFile.txt"
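
    The regex step is easier to follow on a single line. Here is a minimal sketch that runs the same four replacements one at a time. The sample line is made up for illustration (it is not from a real receipt) and assumes the OCR output has the shape description, 12-digit UPC, type letter, price, tax letter:

    ```powershell
    # Hypothetical OCR line, invented for illustration
    $line = 'GV WHT BREAD 007874237231 F 1.48 N'

    # 1. wrap the 12-digit UPC (plus the whitespace around it) in commas
    $step1 = $line -replace '\s\d{12}\s', ',$&,'      # GV WHT BREAD, 007874237231 ,F 1.48 N

    # 2. wrap the price (plus one character on either side) in commas
    $step2 = $step1 -replace '.\d+\.\d{2}.', ',$&,'   # GV WHT BREAD, 007874237231 ,F, 1.48 ,N

    # 3. drop whitespace that follows a comma
    $step3 = $step2 -replace ',\s', ','               # GV WHT BREAD,007874237231 ,F,1.48 ,N

    # 4. drop whitespace that precedes a comma
    $csvLine = $step3 -replace '\s,', ','             # GV WHT BREAD,007874237231,F,1.48,N

    $csvLine
    ```

    Note that the price pattern expects a character after the two decimal digits, so a line that ends right after the price (no tax letter) would pass through this step unchanged.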
    

    The resulting file needs to be opened in Excel and then run through the Text to Columns feature so that Excel won't ruin the UPC codes by auto-converting them to numbers. This is a well-known problem that I won't dive into; there are a multitude of ways to handle it, and I settled on this slightly more manual one.

    I would have been happiest to end up with a simple .csv I could double-click, but I couldn't find a great way to do that without mangling the UPC codes even more, for example by wrapping them in this format:

     "=""12345"""
    

    That does work, but I wanted the UPC code to be just the digits, stored as text in Excel, in case I am later able to do a lookup against the Wal-mart API.
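
    For anyone who does want the double-clickable .csv and can live with that trade-off, the wrapped format is just the Excel formula ="12345" quoted as a CSV field. A minimal sketch of building it follows; the helper name ConvertTo-ExcelTextField is hypothetical and not part of the script above, and it assumes the value itself contains no double quotes (true for a UPC made of digits):

    ```powershell
    # Hypothetical helper: turn a value into an Excel text formula, ="value",
    # then quote it as a CSV field (wrap in quotes, double the embedded quotes).
    function ConvertTo-ExcelTextField {
        param(
            [Parameter(Mandatory = $true)] [string]$Value
        )
        $formula = '="' + $Value + '"'                    # ="12345"
        return '"' + $formula.Replace('"', '""') + '"'    # "=""12345"""
    }

    ConvertTo-ExcelTextField -Value '12345'
    ```

    The cell then displays 12345 as text, but its underlying content is a formula rather than the bare digits, which is the drawback mentioned above.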

    Anyway, here is how they look after importing and some quick formatting:

    https://s3.postimg.cc/b6cjsb4bn/Receipt_Excel.png

    I still need to do some garbage cleaning on the rows that aren't line items, but that only takes a few seconds, so it doesn't bother me too much.
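
    If that cleanup ever gets tedious, it could probably be scripted by keeping only the rows that contain a UPC. This is only a sketch of one possible filter, not something the script above does, and the sample rows are made up for illustration:

    ```powershell
    # Hypothetical OCR output: two line items plus rows that are not items
    $lines = @(
        'ST# 1234 OP# 00001234 TE# 05 TR# 01234',
        'GV WHT BREAD 007874237231 F 1.48 N',
        'BANANAS 000000004011 F 0.62 N',
        'SUBTOTAL 2.10',
        'TOTAL 2.10'
    )

    # Keep only rows that contain a 12-digit number set off by whitespace,
    # the same pattern the script uses to find the UPC
    $items = @($lines | Where-Object { $_ -match '\s\d{12}\s' })

    $items
    ```

    Rows where the OCR misreads a digit of the UPC would be dropped by this filter, so a quick look at the discarded rows is still worthwhile.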

    Thanks for the nudge in the right direction, @RevJohn. I would not have thought to try simply scaling the image, but that made all the difference in the world with Tesseract!
