How to extract PDF fields from a filled out form in Python?

北恋 2020-12-02 07:10

I'm trying to use Python to process some PDF forms that were filled out and signed using Adobe Acrobat Reader.

I've tried:

  • The pdfminer demo: it di
6 answers
  •  攒了一身酷
    2020-12-02 07:26

    Python 3.6+:

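    pip install PyPDF2==1.26.0

    (The code below relies on the PyPDF2 1.x API, including the private helpers _checkKids and _buildField; PdfFileReader was removed in PyPDF2 3.0.)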

    # -*- coding: utf-8 -*-
    
    from collections import OrderedDict
    from PyPDF2 import PdfFileReader
    
    
    def _getFields(obj, tree=None, retval=None, fileobj=None):
        """
        Extracts field data if this PDF contains interactive form fields.
        The *tree* and *retval* parameters are for recursive use.
    
        :param fileobj: A file object (usually a text file) to write
            a report to on all interactive form fields found.
        :return: A dictionary where each key is a field name, and each
            value is a :class:`Field` object. By
            default, the mapping name is used for keys.
        :rtype: dict, or ``None`` if form data could not be located.
        """
        fieldAttributes = {'/FT': 'Field Type', '/Parent': 'Parent', '/T': 'Field Name', '/TU': 'Alternate Field Name',
                           '/TM': 'Mapping Name', '/Ff': 'Field Flags', '/V': 'Value', '/DV': 'Default Value'}
        if retval is None:
            retval = OrderedDict()
            catalog = obj.trailer["/Root"]
            # get the AcroForm tree
            if "/AcroForm" in catalog:
                tree = catalog["/AcroForm"]
            else:
                return None
        if tree is None:
            return retval
    
        obj._checkKids(tree, retval, fileobj)
        for attr in fieldAttributes:
            if attr in tree:
                # Tree is a field
                obj._buildField(tree, retval, fileobj, fieldAttributes)
                break
    
        if "/Fields" in tree:
            fields = tree["/Fields"]
            for f in fields:
                field = f.getObject()
                obj._buildField(field, retval, fileobj, fieldAttributes)
    
        return retval
    
    
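    def get_form_fields(infile):
        # Keep the file open while PyPDF2 reads objects from it, then close it.
        with open(infile, 'rb') as f:
            fields = _getFields(PdfFileReader(f))
            # _getFields returns None when the PDF has no /AcroForm dictionary.
            if fields is None:
                return OrderedDict()
            # Map each field name to its value ('/V'), or '' if it was left empty.
            return OrderedDict((k, v.get('/V', '')) for k, v in fields.items())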
    
    
    
    if __name__ == '__main__':
        from pprint import pprint
    
        pdf_file_name = 'FormExample.pdf'
    
        pprint(get_form_fields(pdf_file_name))
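
The values returned by get_form_fields are raw PDF values: text fields come back as strings, while checkboxes and radio buttons usually come back as PDF names such as '/Yes' or '/Off'. A minimal sketch of a post-processing step follows. The helper name normalize_field_values is hypothetical (it is not part of PyPDF2), and it assumes every value can be treated as a string, so multi-select list values would need extra handling.

```python
from collections import OrderedDict


def normalize_field_values(fields):
    """Turn raw form values into plain Python values (hypothetical helper)."""
    cleaned = OrderedDict()
    for name, value in fields.items():
        text = str(value).strip()
        if text == '/Off':
            # '/Off' is the standard state of an unchecked checkbox or radio button.
            cleaned[name] = False
        elif text.startswith('/'):
            # Any other PDF name is the selected state; its spelling varies per form,
            # so keep the state name rather than guessing a boolean.
            cleaned[name] = text[1:]
        elif text == '':
            # Field was left empty.
            cleaned[name] = None
        else:
            cleaned[name] = text
    return cleaned
```

It could be applied as normalize_field_values(get_form_fields('FormExample.pdf')), which keeps the original field order.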
    
