How to scrape tables in thousands of PDF files?

北荒 2020-12-29 10:17

I have about 1,500 PDFs, each consisting of a single page and all sharing the same structure (see http://files.newsnetz.ch/extern/interactive/downloads/BAG_15m_kzh_2012_de.pdf).

1 answer
  • 2020-12-29 10:44

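    I didn't know this before, but less can read PDF files. Strictly speaking this works through an input preprocessor such as lesspipe, which typically hands the file to pdftotext, so it depends on how your system is set up. I was able to extract the table data from your example PDF with this script: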

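    import re
    import subprocess

    # less only renders PDFs when a preprocessor such as lesspipe is configured.
    output = subprocess.check_output(["less", "BAG_15m_kzh_2012_de.pdf"], text=True)

    # Data rows start with a number followed by a dot.
    re_data_prefix = re.compile(r"^[0-9]+[.].*$")
    # A field is a run of words separated by single spaces;
    # two or more spaces mark the boundary between fields.
    re_data_fields = re.compile(r"(([^ ]+[ ]?)+)")
    for line in output.splitlines():
        if re_data_prefix.match(line):
            print([l[0].strip() for l in re_data_fields.findall(line)])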
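
Since the question is about roughly 1,500 files rather than one, here is a minimal sketch of how the same row-matching idea might be applied to a whole directory. It is not the approach tested above: it assumes the `pdftotext` command from poppler-utils is installed and calls it directly with `-layout` instead of going through less. The function names (`parse_rows`, `pdf_to_text`, `scrape_directory`) are made up for illustration, and the heuristic that columns are separated by at least two spaces has not been checked against the real files.

```python
import csv
import re
import subprocess
from pathlib import Path

# Assumption: data rows start with a number and a dot (optionally indented),
# and columns are separated by runs of two or more spaces.
ROW_PREFIX = re.compile(r"^\s*[0-9]+[.]")
COLUMN_GAP = re.compile(r" {2,}")


def parse_rows(text):
    """Return the table rows found in text as lists of stripped fields."""
    rows = []
    for line in text.splitlines():
        if ROW_PREFIX.match(line):
            rows.append(COLUMN_GAP.split(line.strip()))
    return rows


def pdf_to_text(pdf_path):
    """Extract layout-preserving text; needs pdftotext (poppler-utils)."""
    return subprocess.check_output(
        ["pdftotext", "-layout", str(pdf_path), "-"], text=True
    )


def scrape_directory(pdf_dir, csv_path):
    """Write one CSV row per table row, prefixed with the source file name."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            for row in parse_rows(pdf_to_text(pdf_path)):
                writer.writerow([pdf_path.name] + row)
```

Calling `scrape_directory("pdfs", "tables.csv")` (both paths are placeholders) would then collect every matching row into a single CSV. Keeping the file name in the first column makes it possible to trace a row back to its PDF when the heuristic misfires.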
    