Extracting text with PDFMiner and PyPDF2 merges columns


陌清茗 2020-12-29 14:07

I am trying to parse the text of a PDF file using PDFMiner, but text from separate columns gets merged in the extracted output. I am using the PDF file from the following link.

PDF File

I am go

3 answers

轮回少年 (original poster)

2020-12-29 15:11

The solution provided by @hlindblo gave pretty good results. To further group the extracted text chunks by page and by paragraph, here is the simple code I used.

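from collections import OrderedDict

# `device` comes from @hlindblo's solution (not shown here); each entry of
# device.rows is a tuple (page_nb, x_min, y_min, x_max, y_max, text).
grouped_text = OrderedDict()
for (page_nb, x_min, y_min, x_max, y_max, text) in device.rows:
    # Bin x_min in steps of 10 so chunks whose left edges differ slightly
    # fall into the same group; change the divisor to tune the aggregation.
    x_bin = round(x_min) // 10
    # Create the page entry on first use instead of pre-filling 1000 pages.
    page_groups = grouped_text.setdefault(page_nb, {})
    if x_bin in page_groups:
        page_groups[x_bin] += " " + text
    else:
        page_groups[x_bin] = text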
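
To try the grouping without a PDF at hand, here is a minimal self-contained sketch. It assumes rows shaped like the tuples unpacked above; the function name `group_rows`, the `bin_width` parameter, and the sample coordinates are made up for illustration and are not part of PDFMiner or of @hlindblo's solution.

```python
from collections import OrderedDict


def group_rows(rows, bin_width=10):
    """Group text chunks by page, then by binned left edge (x_min)."""
    grouped = OrderedDict()
    for page_nb, x_min, y_min, x_max, y_max, text in rows:
        # Chunks whose rounded x_min falls in the same bin share one group.
        x_bin = round(x_min) // bin_width
        page_groups = grouped.setdefault(page_nb, {})
        if x_bin in page_groups:
            page_groups[x_bin] += " " + text
        else:
            page_groups[x_bin] = text
    return grouped


# Hypothetical rows: (page_nb, x_min, y_min, x_max, y_max, text)
sample_rows = [
    (1, 72.0, 700.0, 290.0, 712.0, "Left column, line 1."),
    (1, 310.2, 700.0, 540.0, 712.0, "Right column, line 1."),
    (1, 72.4, 686.0, 290.0, 698.0, "Left column, line 2."),
    (1, 309.8, 686.0, 540.0, 698.0, "Right column, line 2."),
]

grouped = group_rows(sample_rows)
for page_nb, columns in grouped.items():
    for x_bin in sorted(columns):
        print(page_nb, x_bin, columns[x_bin])
```

Text is concatenated in the order the rows arrive, so the reading order within a group depends on how the rows were emitted. Left edges that straddle a bin boundary (for example 309.4 and 310.2 with a width of 10) still land in different groups, so the bin width may need tuning for each document.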
