Can OCR software reliably read values from a table?

暗喜 2020-12-04 18:41

Would OCR Software be able to reliably translate an image such as the following into a list of values?
[Image: Table]

8 Answers

粉色の甜心 (original poster)

2020-12-04 19:09
If your tables always have solid borders, you can try this approach:
1. Locate the horizontal and vertical lines on each page (long runs of black pixels)
2. Segment the image into cells using the line coordinates
3. Clean up each cell (remove borders, threshold to black and white)
4. Perform OCR on each cell
5. Assemble results into a 2D array
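The steps above can be sketched roughly as follows. This is only a minimal illustration in plain Python, not the linked author's code: it assumes the page is already thresholded into a 2D list of 0/1 pixels (1 = black), that border lines span at least 80% of the page, and that `ocr` is a hypothetical callable you supply that turns a cropped cell into a string (for example, a wrapper that converts the cell to an image and passes it to `pytesseract.image_to_string`). The function names are made up for this example.

```python
def longest_run(pixels):
    """Length of the longest consecutive run of black (truthy) pixels."""
    best = current = 0
    for p in pixels:
        current = current + 1 if p else 0
        best = max(best, current)
    return best


def find_lines(binary, axis, min_fraction=0.8):
    """Indices of rows (axis=0) or columns (axis=1) containing a long black run."""
    lines = binary if axis == 0 else list(zip(*binary))
    return [i for i, line in enumerate(lines)
            if longest_run(line) >= min_fraction * len(line)]


def group_runs(indices):
    """Merge consecutive indices (a thick line) into (start, end) pairs."""
    runs = []
    for i in indices:
        if runs and i == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], i)
        else:
            runs.append((i, i))
    return runs


def segment_cells(binary):
    """2D list of cell boxes (top, bottom, left, right); bottom/right exclusive.

    Boxes lie strictly between neighbouring lines, so borders are excluded.
    """
    h_lines = group_runs(find_lines(binary, axis=0))
    v_lines = group_runs(find_lines(binary, axis=1))
    grid = []
    for (_, top_end), (bottom_start, _) in zip(h_lines, h_lines[1:]):
        row = []
        for (_, left_end), (right_start, _) in zip(v_lines, v_lines[1:]):
            row.append((top_end + 1, bottom_start, left_end + 1, right_start))
        grid.append(row)
    return grid


def table_to_array(binary, ocr):
    """Crop every cell, run the supplied OCR callable on it, return a 2D array."""
    result = []
    for row in segment_cells(binary):
        result.append([
            ocr([line[left:right] for line in binary[top:bottom]]).strip()
            for (top, bottom, left, right) in row
        ])
    return result
```

On real scans you would probably want a tolerance for slightly skewed or broken lines, which this sketch does not handle.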
Otherwise, if your document has a borderless table, you can try the approach described here:

Optical Character Recognition is pretty amazing stuff, but it isn’t always perfect. To get the best possible results, it helps to use the cleanest input you can. In my initial experiments, I found that performing OCR on the entire document actually worked pretty well as long as I removed the cell borders (long horizontal and vertical lines). However, the software compressed all whitespace into a single empty space. Since my input documents had multiple columns with several words in each column, the cell boundaries were getting lost. Retaining the relationship between cells was very important, so one possible solution was to draw a unique character, like “^” on each cell boundary – something the OCR would still recognize and that I could use later to split the resulting strings.
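To illustrate the last step of that idea, here is a small sketch of splitting the OCR output back into cells. It assumes the marker ("^" here) was drawn only on the boundaries between adjacent cells, that each table row comes back as one line of text, and that the marker never appears in the table's own data. The function name `split_ocr_rows` is made up for this example.

```python
def split_ocr_rows(ocr_text, delimiter="^"):
    """Split OCR output into rows of cells using the drawn boundary marker."""
    rows = []
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between table rows
        # Whitespace inside a cell is kept; padding around the marker is dropped.
        rows.append([cell.strip() for cell in line.split(delimiter)])
    return rows
```

For example, `split_ocr_rows("Name ^ Qty\nWidget A ^ 3")` gives `[["Name", "Qty"], ["Widget A", "3"]]`, so multi-word cells such as "Widget A" stay together even though the OCR collapsed the column spacing.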

I found all this information at this link by searching Google for "OCR to table". The author published a full algorithm using Python and Tesseract, both open-source tools!

If you want to see what Tesseract can do, you could try this site:

http://www.free-ocr.com/