How to extract PDF fields from a filled out form in Python?

北恋 2020-12-02 07:10

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

  • The pdfminer demo: it di
6 Answers
  •  旧时难觅i
    2020-12-02 07:35

    Quick and dirty 2-minute job: use PDFMiner's dumppdf.py to convert the PDF to XML, then grab all of the fields.

    import os
    from pprint import pprint
    from xml.etree import ElementTree

    def main():
        print("Calling dumppdf.py")
        # dumppdf.py ships with PDFMiner; -a dumps every object as XML.
        os.system("dumppdf.py -a FILE.pdf > out.xml")

        # Preprocess the file to eliminate bad XML.
        print("Screening the file")
        with open("out.xml") as src, open("output.xml", "w") as dst:
            for line in src:
                # Some formatting data contains character references that break the parser.
                dst.write(line.replace("&#", "Invalid_XML"))

        print("Opening XML output")
        tree = ElementTree.parse("output.xml")
        lastnode = ""
        lastnode2 = ""
        fields = {}  # field name -> field value
        entry = {}

        for node in tree.iter():  # Run through the tree.
            # <key>T</key> means the next node holds the field name.
            if node.tag == "key" and node.text == "T":
                lastnode = node.tag + node.text
            elif lastnode == "keyT":
                for child in node.iter():
                    entry["ID"] = child.text
                lastnode = ""

            # <key>V</key> means the next node holds the field value.
            if node.tag == "key" and node.text == "V":
                lastnode2 = node.tag + node.text
            elif lastnode2 == "keyV":
                for child in node.iter():
                    if child.tag == "string":
                        if "ID" in entry:
                            entry["Value"] = child.text
                            fields[entry["ID"]] = entry["Value"]
                            entry = {}
                lastnode2 = ""

        pprint(fields)

    if __name__ == "__main__":
        main()
    

    It isn't pretty, just a simple proof of concept. I need to implement it for a system I'm working on so I will be cleaning it up, but I thought I would post it in case anyone finds it useful.
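
    As one possible cleanup of the parsing step, here is a minimal sketch that pairs the T (field name) and V (field value) entries per dictionary instead of tracking the previous node. It assumes the dumppdf.py output nests alternating <key> and <value> elements directly under each <dict>, with string values wrapped in <string>. The function name extract_fields and the SAMPLE data are hypothetical and made up for illustration; they are not part of PDFMiner.

```python
from xml.etree import ElementTree


def extract_fields(xml_text):
    """Return {field name: field value} for each <dict> that has string T and V entries."""
    root = ElementTree.fromstring(xml_text)
    fields = {}
    for dict_el in root.iter("dict"):
        children = list(dict_el)
        pairs = {}
        # Assumption: <key> and <value> alternate as direct children of <dict>.
        for key_el, value_el in zip(children[0::2], children[1::2]):
            if key_el.tag != "key" or value_el.tag != "value":
                continue
            string_el = value_el.find("string")
            if string_el is not None:
                pairs[key_el.text] = string_el.text or ""
        # Only dictionaries carrying both a name and a value count as filled fields.
        if "T" in pairs and "V" in pairs:
            fields[pairs["T"]] = pairs["V"]
    return fields


# Hypothetical sample shaped like dumppdf.py output; not taken from a real form.
SAMPLE = """<pdf>
<object id="1"><dict size="2">
<key>T</key><value><string size="4">name</string></value>
<key>V</key><value><string size="5">Alice</string></value>
</dict></object>
<object id="2"><dict size="1">
<key>T</key><value><string size="5">empty</string></value>
</dict></object>
</pdf>"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE))  # {'name': 'Alice'}
```

    Fields without a V entry (left blank in the form) are skipped, and values stored as something other than a string, such as a checkbox state, would need extra handling.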
