How to extract PDF fields from a filled out form in Python?

北恋 2020-12-02 07:10

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

  • The pdfminer demo: it di
6 Answers
  •  庸人自扰
    2020-12-02 07:28

    You should be able to do it with pdfminer, but it will require some delving into the internals of pdfminer and some knowledge of the PDF format (forms, of course, but also PDF's internal structures such as "dictionaries" and "indirect objects").

    This example might help you on your way (I think it will work only on simple cases, with no nested fields etc...)

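    import sys
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    filename = sys.argv[1]

    # Keep the file open while reading: pdfminer resolves objects lazily.
    with open(filename, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument(parser)
        # The catalog's AcroForm dictionary lists the top-level form fields.
        fields = resolve1(resolve1(doc.catalog['AcroForm'])['Fields'])
        for i in fields:
            # Each entry may be an indirect object; resolve1 dereferences it.
            field = resolve1(i)
            # 'T' is the field's partial name, 'V' its value.
            name, value = field.get('T'), field.get('V')
            print('{0}: {1}'.format(name, value))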
    

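    EDIT: forgot to mention: if you need to provide a password, pass it to doc.initialize(). (In pdfminer.six, the password is instead passed to the constructor, as in PDFDocument(parser, password=...).)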
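
The example above only looks at top-level fields. For forms with nested fields, a recursive walk over each field's 'Kids' array is one possible extension. The following is a minimal sketch, not part of the original answer: the helper names `decode_pdf_string`, `walk_fields` and `extract_form_fields` are hypothetical, it assumes pdfminer.six on Python 3, it approximates PDFDocEncoding with latin-1, and it ignores values inherited from parent fields. Kids that have no 'T' entry are treated as widget annotations of the field rather than as child fields.

```python
def decode_pdf_string(value):
    """Best-effort conversion of a PDF string (bytes) to str.

    Non-bytes values (None, names, numbers) are returned unchanged.
    """
    if isinstance(value, bytes):
        # Text strings are UTF-16BE when they start with a byte order mark;
        # otherwise latin-1 is used here as an approximation of PDFDocEncoding.
        if value.startswith(b'\xfe\xff'):
            return value[2:].decode('utf-16-be', errors='replace')
        return value.decode('latin-1')
    return value


def walk_fields(fields, resolve=lambda obj: obj, parent_name=''):
    """Yield (fully qualified name, value) pairs, descending into 'Kids'.

    `resolve` dereferences indirect objects; with pdfminer, pass resolve1.
    """
    for ref in fields:
        field = resolve(ref)
        partial = decode_pdf_string(resolve(field.get('T')))
        if partial is None:
            name = parent_name
        elif parent_name:
            name = parent_name + '.' + partial
        else:
            name = partial
        kids = resolve(field.get('Kids')) or []
        # Only kids that carry their own 'T' are treated as child fields.
        child_fields = [kid for kid in kids if 'T' in resolve(kid)]
        if child_fields:
            yield from walk_fields(child_fields, resolve, name)
        else:
            yield name, decode_pdf_string(resolve(field.get('V')))


def extract_form_fields(path):
    """Return a dict of field names to values for the PDF at `path`."""
    # Imported here so the helpers above work without pdfminer installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        acroform = resolve1(doc.catalog['AcroForm'])
        # dict() consumes the generator while the file is still open.
        return dict(walk_fields(resolve1(acroform['Fields']), resolve1))
```

Calling `extract_form_fields('form.pdf')` (the file name is a placeholder) would then return names such as 'address.city' for nested fields. Checkbox and radio button values are PDF names rather than strings, so they are passed through undecoded.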
