Extracting Text Using PDFMiner and PyPDF2 Merges Columns

陌清茗 2020-12-29 14:07

I am trying to parse the text of a PDF file using PDFMiner, but the extracted text gets merged. I am using the PDF file from the following link.

PDF File

I am go

3 answers
  •  轮回少年
    2020-12-29 15:11

    The solution provided by @hlindblo gave pretty good results. To further group the extracted text chunks by page and paragraph, I used the following simple code.

    from collections import OrderedDict

    # device.rows is assumed to come from the converter in @hlindblo's solution
    grouped_text = OrderedDict()
    for (page_nb, x_min, y_min, x_max, y_max, text) in device.rows:
        # bucket x_min to control the level of aggregation --> x_min might be slightly different between chunks
        x_bucket = round(x_min) // 10
        page = grouped_text.setdefault(page_nb, {})  # create the page entry on first use, no fixed page limit
        if x_bucket in page:
            page[x_bucket] += " " + text
        else:
            page[x_bucket] = text
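
To try the grouping step without a PDF, here is a minimal self-contained sketch. The `group_rows` helper, the `bucket_width` parameter, and the sample rows are all made up for illustration. The only assumption carried over from the snippet above is that each row is a `(page_nb, x_min, y_min, x_max, y_max, text)` tuple.

```python
from collections import defaultdict


def group_rows(rows, bucket_width=10):
    """Group text chunks by page number, then by a bucket of their left edge.

    `rows` is assumed to be an iterable of
    (page_nb, x_min, y_min, x_max, y_max, text) tuples.
    """
    grouped = defaultdict(dict)
    for page_nb, x_min, y_min, x_max, y_max, text in rows:
        # Chunks whose rounded left edge falls into the same bucket are joined.
        bucket = round(x_min) // bucket_width
        if bucket in grouped[page_nb]:
            grouped[page_nb][bucket] += " " + text
        else:
            grouped[page_nb][bucket] = text
    return grouped


# Hypothetical rows standing in for device.rows: a two-column first page
# and a single chunk on the second page.
sample_rows = [
    (0, 72.1, 700.0, 290.0, 712.0, "Left column, line 1"),
    (0, 310.4, 700.0, 540.0, 712.0, "Right column, line 1"),
    (0, 71.8, 686.0, 290.0, 698.0, "Left column, line 2"),
    (1, 72.0, 700.0, 290.0, 712.0, "Second page"),
]

grouped = group_rows(sample_rows)
for page_nb, columns in grouped.items():
    for bucket, text in sorted(columns.items()):
        print(page_nb, bucket, text)
```

Note that fixed buckets can still split a column: left edges of 69.4 and 70.1 round to 69 and 70 and land in buckets 6 and 7. A larger `bucket_width` makes this less likely, at the cost of merging columns that sit close together.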
    
