Can OCR software reliably read values from a table?

暗喜 2020-12-04 18:41

Would OCR Software be able to reliably translate an image such as the following into a list of values?
[Image: Table]

8 answers
  •  粉色の甜心
    2020-12-04 19:09

    If you always have solid borders in your table, you can try this solution:

    1. Locate the horizontal and vertical lines on each page (long runs of black pixels)
    2. Segment the image into cells using the line coordinates
    3. Clean up each cell (remove borders, threshold to black and white)
    4. Perform OCR on each cell
    5. Assemble results into a 2D array
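The five steps above can be sketched in plain Python. This is only an illustration under several assumptions: the page is already thresholded into a grid of 0/1 pixels (1 = black), a "line" is any row or column whose longest run of black pixels covers at least 90% of the image, and `ocr_cell` is a hypothetical callback standing in for whatever OCR engine you use on each cropped cell. None of these names come from the original answer.

```python
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]  # assumed format: 1 = black pixel, 0 = white pixel


def longest_run(values: Sequence[int]) -> int:
    """Length of the longest consecutive run of black (non-zero) pixels."""
    best = run = 0
    for v in values:
        run = run + 1 if v else 0
        best = max(best, run)
    return best


def find_line_positions(image: Image, min_fraction: float = 0.9) -> Tuple[List[int], List[int]]:
    """Step 1: rows and columns that contain a long run of black pixels."""
    height, width = len(image), len(image[0])
    rows = [y for y in range(height)
            if longest_run(image[y]) >= min_fraction * width]
    cols = [x for x in range(width)
            if longest_run([image[y][x] for y in range(height)]) >= min_fraction * height]
    return rows, cols


def gaps(positions: List[int]) -> List[Tuple[int, int]]:
    """Spans (start, end-exclusive) between neighbouring lines.

    Adjacent positions (a thick border) produce no span.
    """
    return [(a + 1, b) for a, b in zip(positions, positions[1:]) if b - a > 1]


def table_to_grid(image: Image, ocr_cell: Callable[[Image], str]) -> List[List[str]]:
    """Steps 2 to 5: crop each cell, run OCR on it, collect a 2D array."""
    rows, cols = find_line_positions(image)
    grid = []
    for top, bottom in gaps(rows):
        row_out = []
        for left, right in gaps(cols):
            # Cropping strictly between the lines also drops the borders (step 3).
            cell = [line[left:right] for line in image[top:bottom]]
            row_out.append(ocr_cell(cell))
        grid.append(row_out)
    return grid
```

On real scans the lines are rarely perfectly straight or unbroken, so the 90% threshold and the exact-row test would likely need tuning or a more tolerant line detector.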

    Otherwise, if your document has a borderless table, you can try the following approach:

    Optical Character Recognition is pretty amazing stuff, but it isn’t always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like “^” on each cell boundary – something the OCR would still recognize and that I could use later to split the resulting strings.
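The last step of that idea, splitting the OCR output back into cells, might look like the sketch below. It assumes the marker was drawn only between neighbouring cells (not on the outer edges), that the OCR engine returns one text line per table row, and that "^" never occurs in the real data. The function name `split_cells` is made up for this example.

```python
from typing import List


def split_cells(ocr_text: str, marker: str = "^") -> List[List[str]]:
    """Split OCR output into rows of cells using the drawn marker character."""
    rows = []
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        # Collapse the whitespace inside each cell and trim its edges.
        cells = [" ".join(cell.split()) for cell in line.split(marker)]
        rows.append(cells)
    return rows


text = "Name ^ Qty ^ Unit price\nWidget A ^ 3 ^ 4.50\n"
table = split_cells(text)
# table[1][0] is "Widget A": the multi-word cell survives as one value
```

If the OCR engine misreads the marker as another character, the row will come back with too few cells, so checking that every row has the same length is a cheap sanity test.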

    I found all of this information at this link by searching Google for "OCR to table". The author published a full algorithm using Python and Tesseract, both of which are open source.

    If you want to see what Tesseract can do, you could try this site:

    http://www.free-ocr.com/
