I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.
I've tried:
A quick-and-dirty two-minute job: use PDFMiner to convert the PDF to XML, then grab all of the fields.
from xml.etree import ElementTree
from pprint import pprint
import os


def main():
    print("Calling dumppdf.py")
    os.system("dumppdf.py -a FILE.pdf > out.xml")

    # Preprocess the file to eliminate bad XML.
    print("Screening the file")
    with open("out.xml") as src, open("output.xml", "w") as dst:  # "w" overwrites
        for line in src:
            # Some formatting data in the dump is not valid XML.
            # NOTE: "&#" is an assumed pattern; adjust it to whatever breaks the parser.
            line = line.replace("&#", "Invalid_XML")
            dst.write(line)

    print("Opening XML output")
    tree = ElementTree.parse("output.xml")
    lastnode = ""
    lastnode2 = ""
    fields = {}  # field name -> field value
    entry = {}

    for node in tree.iter():  # Run through the tree.
        # A <key>T</key> node means the next node holds the field name.
        if node.tag == "key" and node.text == "T":
            lastnode = node.tag + node.text
        elif lastnode == "keyT":
            for child in node.iter():
                entry["ID"] = child.text
            lastnode = ""

        # A <key>V</key> node means the next node holds the field value.
        if node.tag == "key" and node.text == "V":
            lastnode2 = node.tag + node.text
        elif lastnode2 == "keyV":
            for child in node.iter():
                if child.tag == "string":
                    if "ID" in entry:
                        entry["Value"] = child.text
                        fields[entry["ID"]] = entry["Value"]
                        entry = {}
            lastnode2 = ""

    pprint(fields)


if __name__ == "__main__":
    main()
It isn't pretty, just a simple proof of concept. I need to implement it for a system I'm working on so I will be cleaning it up, but I thought I would post it in case anyone finds it useful.
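For anyone picking this up, here is a rough sketch of how the cleanup might look. It is a different technique from the proof of concept: instead of tracking the previous node, it pairs up the keys and values inside each dictionary. The names `screen_xml` and `extract_fields` are made up for illustration. It rests on two assumptions: that the bad data is numeric character references such as `&#0;`, and that dumppdf.py writes each PDF dictionary as a `<dict>` element whose children alternate between `<key>` and `<value>`.

```python
import re
from xml.etree import ElementTree

# Matches decimal numeric character references such as "&#0;" or "&#38;".
_CHAR_REF = re.compile(r"&#(\d+);")


def _is_valid_xml_char(code):
    """Return True if the code point is allowed in an XML 1.0 document."""
    return (
        code in (0x9, 0xA, 0xD)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )


def screen_xml(text, placeholder="Invalid_XML"):
    """Replace only the character references an XML parser would reject."""
    def fix(match):
        code = int(match.group(1))
        return match.group(0) if _is_valid_xml_char(code) else placeholder

    return _CHAR_REF.sub(fix, text)


def extract_fields(xml_text):
    """Map each field name (key T) to its string value (key V).

    Assumption: every PDF dictionary is a <dict> element whose direct
    children alternate between <key> and <value>.
    """
    root = ElementTree.fromstring(screen_xml(xml_text))
    fields = {}
    for pdf_dict in root.iter("dict"):
        children = list(pdf_dict)
        keys = [c for c in children if c.tag == "key"]
        values = [c for c in children if c.tag == "value"]
        pairs = {k.text: v for k, v in zip(keys, values)}

        name = pairs.get("T")
        value = pairs.get("V")
        if name is None or value is None:
            continue  # not a filled-in form field

        name_node = name.find("string")
        value_node = value.find("string")
        if name_node is not None and value_node is not None:
            fields[name_node.text] = value_node.text
    return fields
```

Passing the contents of out.xml to `extract_fields` should give a plain dict of field names to values. Screening only the invalid references keeps legitimate ones such as `&#38;` intact, which a blanket string replace would mangle.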