Can OCR software reliably read values from a table?

暗喜 2020-12-04 18:41

Would OCR software be able to reliably translate an image such as the following into a list of values?
[Image: table]

8 answers
  • 2020-12-04 19:14

    You could try another approach. With Tesseract (or other OCR engines) you can get the coordinates of each word. You can then group those words by their vertical and horizontal coordinates to recover rows and columns, for example by telling an ordinary space between words apart from the wider gap between columns. It takes some practice to get good results, but it is possible. With this method you can detect tables even when they use invisible separators (no ruled lines). The word coordinates are a solid base for table recognition.
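A minimal sketch of the grouping idea, not a definitive implementation. It assumes word boxes in pixel coordinates; the `Word` tuple, the function names, and both thresholds (`y_tolerance`, `gap_threshold`) are hypothetical and would need tuning per image. `words_from_image` assumes the `pytesseract` wrapper is installed and uses its `image_to_data` call; the grouping functions themselves need only the standard library.

```python
from collections import namedtuple

# Hypothetical container for one OCR word box (pixel coordinates).
Word = namedtuple("Word", "text left top width height")


def words_from_image(image):
    """Collect word boxes from an image (assumes pytesseract is installed)."""
    import pytesseract

    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    boxes = zip(data["text"], data["left"], data["top"],
                data["width"], data["height"])
    # Tesseract also reports empty entries for blocks and lines; skip them.
    return [Word(t, l, tp, w, h) for t, l, tp, w, h in boxes if t.strip()]


def group_into_rows(words, y_tolerance=10):
    """Group words whose vertical centers are within y_tolerance pixels
    of the first word placed in the current row."""
    rows = []
    for w in sorted(words, key=lambda w: w.top + w.height / 2):
        cy = w.top + w.height / 2
        if rows and abs(cy - rows[-1]["cy"]) <= y_tolerance:
            rows[-1]["words"].append(w)
        else:
            rows.append({"cy": cy, "words": [w]})
    # Within each row, order words left to right.
    return [sorted(r["words"], key=lambda w: w.left) for r in rows]


def split_into_cells(row, gap_threshold=25):
    """Start a new cell when the horizontal gap between neighboring words
    exceeds gap_threshold; smaller gaps are treated as ordinary spaces."""
    cells = [[row[0]]]
    for prev, cur in zip(row, row[1:]):
        gap = cur.left - (prev.left + prev.width)
        if gap > gap_threshold:
            cells.append([cur])
        else:
            cells[-1].append(cur)
    return [" ".join(w.text for w in cell) for cell in cells]


def words_to_table(words, y_tolerance=10, gap_threshold=25):
    """Turn a flat list of word boxes into a list of rows of cell strings."""
    rows = group_into_rows(words, y_tolerance)
    return [split_into_cells(r, gap_threshold) for r in rows]
```

Fixed thresholds are the simplest choice here; deriving them from the median word height or clustering the left edges of words into column positions would likely be more robust on real scans.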

  • 2020-12-04 19:15

    Here are the basic steps that have worked for me. Tools needed include Tesseract, Python, OpenCV, and ImageMagick if you need to do any rotation of images to correct skew.

    1. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.
    2. Use OpenCV to find and extract tables.
    3. Use OpenCV to find and extract each cell from the table.
    4. Use OpenCV to crop and clean up each cell so that there is no noise that will confuse OCR software.
    5. Use Tesseract to OCR each cell.
    6. Combine the extracted text of each cell into the format you need.
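Step 6 can be sketched with the standard library alone. This is an illustrative sketch, not the code from the package mentioned below: `cells_to_csv` is a hypothetical helper that assumes the OCR text of each cell has already been collected into a list of rows.

```python
import csv
import io


def cells_to_csv(rows):
    """Join per-cell OCR text into CSV, one table row per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # OCR output often carries stray whitespace or newlines; trim it.
        writer.writerow(cell.strip() for cell in row)
    return buf.getvalue()
```

Using `csv.writer` rather than joining with commas matters for cells that themselves contain commas, such as spreadsheet formulas, because the writer quotes them.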

    The code for each of these steps is extensive, but if you want to use a Python package, it's as simple as the following.

    pip3 install table_ocr
    python3 -m table_ocr.demo https://raw.githubusercontent.com/eihli/image-table-ocr/master/resources/test_data/simple.png
    

    That package and demo module will turn the following table into CSV output.

    Cell,Format,Formula
    B4,Percentage,None
    C4,General,None
    D4,Accounting,None
    E4,Currency,"=PMT(B4/12,C4,D4)"
    F4,Currency,=E4*C4
    

    If you need to make any changes to get the code to work for table borders with different widths, there are extensive notes at https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html
