Extract individual fields from a table image to Excel with OCR

醉酒成梦 2020-12-09 23:25

I have scanned images which have tables as shown in this image:

I am trying to extract each box separately and perform OCR but when I try to detect horizont

3 Answers
  •  轮回少年
    2020-12-09 23:58

    This is a function that uses tesseract-ocr for layout detection. You can try different RIL levels and PSM values. For more details, have a look here: https://github.com/sirfz/tesserocr

    import os
    import platform
    from typing import List, Tuple
    
    from tesserocr import PyTessBaseAPI, iterate_level, RIL
    
    system = platform.system()
    if system == 'Linux':
        tessdata_folder_default = ''
    elif system == 'Windows':
        tessdata_folder_default = r'C:\Program Files (x86)\Tesseract-OCR\tessdata'
    else:
        raise NotImplementedError
    
    # this tesseract specific env variable takes precedence for tessdata folder location selection
    # especially important for windows, as we don't know if we're running 32 or 64bit tesseract
    tessdata_folder = os.getenv('TESSDATA_PREFIX', tessdata_folder_default)
    
    
    def get_layout_boxes(input_image,  # PIL image object
                         level: RIL,
                         include_text: bool,
                         include_boxes: bool,
                         language: str,
                         psm: int,
                         tessdata_path='') -> List[Tuple]:
        """
        Get the coordinates of image components. Also returns the text if include_text is True.
        :param input_image: input PIL image
        :param level: page iterator level, please see the "RIL" enum
        :param include_text: if True, return the text of each box
        :param include_boxes: if True, return the coordinates of each box
        :param language: language for OCR
        :param psm: page segmentation mode, e.g. PSM.AUTO, which is 3
        :param tessdata_path: the path to the tessdata folder
        :return: list of tuples: [((x1, y1, x2, y2), text), ...]
        """
        assert any((include_text, include_boxes)), (
            'include_text and include_boxes cannot both be False.')
    
        if not tessdata_path:
            tessdata_path = tessdata_folder
    
        try:
            with PyTessBaseAPI(path=tessdata_path, lang=language) as api:
                api.SetImage(input_image)
    
                api.SetPageSegMode(psm)
                api.Recognize()
                page_iterator = api.GetIterator()
                data = []
                for pi in iterate_level(page_iterator, level):
                    bounding_box = pi.BoundingBox(level)
                    if bounding_box is not None:
                        text = pi.GetUTF8Text(level) if include_text else None
                        box = bounding_box if include_boxes else None
                        data.append((box, text))
                return data
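        except RuntimeError as err:
            # tesserocr raises RuntimeError, for example, when the API cannot be
            # initialised because the tessdata path or language is wrong
            print(f'Please specify correct path to tessdata. ({err})')
            return []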
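
To get from the list returned by the function above to something Excel can open, the boxes still have to be arranged into rows and columns. The following is only a minimal sketch of one way to do that, not part of tesserocr: the names group_boxes_into_rows, write_rows_to_csv and y_tolerance are hypothetical, it assumes the function was called with include_boxes=True so every item is ((x1, y1, x2, y2), text), and it assumes the scan is not skewed, so boxes on the same table row have similar vertical centres. It writes CSV with the standard library rather than a native .xlsx file.

```python
import csv
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def group_boxes_into_rows(items: Sequence[Tuple[Box, Optional[str]]],
                          y_tolerance: int = 10) -> List[List[str]]:
    """Group (box, text) pairs into rows of text, ordered left to right.

    A box starts a new row when its vertical centre is more than
    y_tolerance pixels away from the centre of the first box in the
    current row. y_tolerance is an assumption to tune per scan.
    """
    # Sort by vertical centre so boxes of the same row end up adjacent.
    ordered = sorted(items, key=lambda item: (item[0][1] + item[0][3]) / 2)

    rows: List[List[Tuple[int, str]]] = []
    row_centre = None
    for (x1, y1, x2, y2), text in ordered:
        centre = (y1 + y2) / 2
        if row_centre is None or abs(centre - row_centre) > y_tolerance:
            rows.append([])
            row_centre = centre
        # OCR text may carry trailing whitespace or be None; normalise it.
        rows[-1].append((x1, (text or '').strip()))

    # Within each row, order the cells by their left edge.
    return [[text for _, text in sorted(row, key=lambda cell: cell[0])]
            for row in rows]


def write_rows_to_csv(rows: List[List[str]], path: str) -> None:
    """Write the rows to a CSV file; utf-8-sig helps Excel detect UTF-8."""
    with open(path, 'w', newline='', encoding='utf-8-sig') as handle:
        csv.writer(handle).writerows(rows)
```

With a PIL image loaded as image, a call might look like write_rows_to_csv(group_boxes_into_rows(get_layout_boxes(image, RIL.WORD, True, True, 'eng', 3)), 'table.csv'). Word-level boxes will split multi-word cells into several columns, so a coarser RIL level or a merge step based on horizontal gaps may be needed for real tables.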
    
