I am trying to parse the text of a PDF file using PDFMiner, but the extracted text gets merged. I am using the PDF file from the following link.
PDF File
I am go
The solution provided by @hlindblo gave pretty good results. To further group the extracted text chunks by page and paragraph, here is the simple code I used.
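from collections import OrderedDict

# `device` is the converter object from @hlindblo's solution; device.rows holds
# tuples of (page_nb, x_min, y_min, x_max, y_max, text).
grouped_text = OrderedDict()
for p in range(1000):  # assumes the document has at most 1000 pages
    grouped_text[p] = {}

for (page_nb, x_min, y_min, x_max, y_max, text) in device.rows:
    # Bucket x_min into bins of width 10 to control the level of aggregation,
    # since x_min can differ slightly between lines of the same block.
    x_min = round(x_min) // 10
    try:
        grouped_text[page_nb][x_min] += " " + text
    except KeyError:  # first chunk seen for this page and x_min bin
        grouped_text[page_nb][x_min] = text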
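To try the grouping idea without a PDF at hand, here is a minimal self-contained sketch of the same logic. The function name `group_rows`, the `bin_width` parameter, and the `sample_rows` data are all made up for illustration; the rows only assume the same `(page_nb, x_min, y_min, x_max, y_max, text)` layout as `device.rows` above. It uses a `defaultdict` so there is no need to pre-create 1000 pages.

```python
from collections import defaultdict


def group_rows(rows, bin_width=10):
    """Group text chunks by page, then by binned left x-coordinate.

    `rows` is assumed to be an iterable of
    (page_nb, x_min, y_min, x_max, y_max, text) tuples.
    """
    grouped = defaultdict(dict)
    for page_nb, x_min, y_min, x_max, y_max, text in rows:
        # Chunks whose x_min falls in the same bin are treated as one block.
        key = round(x_min) // bin_width
        if key in grouped[page_nb]:
            grouped[page_nb][key] += " " + text
        else:
            grouped[page_nb][key] = text
    # Convert to plain dicts so the result is easy to print and compare.
    return {page: dict(cols) for page, cols in grouped.items()}


# Hypothetical rows, not taken from any real PDF.
sample_rows = [
    (0, 72.0, 700.0, 300.0, 712.0, "First line of the left column"),
    (0, 72.4, 686.0, 300.0, 698.0, "second line of the left column"),
    (0, 320.1, 700.0, 540.0, 712.0, "Right column text"),
    (1, 72.0, 700.0, 300.0, 712.0, "Text on the next page"),
]

result = group_rows(sample_rows)
for page, columns in result.items():
    for x_bin, text in columns.items():
        print(page, x_bin, text)
```

Note that this groups by left edge only, so two blocks that start at the same x position on one page end up merged. A larger `bin_width` merges more aggressively, a smaller one splits more.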