pyPdf for IndirectObject extraction

不思量自难忘° 2020-12-08 23:07

Following this example, I can list all the elements in a PDF file:

import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
list(pdf.pages) # Process all t         


        
3 Answers
  •  天命终不由人
    2020-12-08 23:52

    An IndirectObject refers to an actual object (it's like a link or alias so that the total size of the PDF can be reduced when the same content appears in multiple places). The getObject method will give you the actual object.
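
To make the idea of following references concrete, here is a minimal sketch of a helper that walks a structure and replaces every reference with its target. The helper name `resolve`, its `max_depth` parameter, and the `StubRef` class are all hypothetical; `StubRef` only stands in for pyPdf's IndirectObject so that the example runs without a PDF file. The one thing it assumes about the real objects is that they expose a `getObject()` method.

```python
def resolve(obj, max_depth=3):
    """Return obj with references replaced by the objects they point to.

    Anything that has a getObject() method is treated as a reference.
    max_depth bounds the recursion because PDF object graphs are cyclic:
    a page's /Parent points back at the node that lists it in /Kids.
    """
    if hasattr(obj, "getObject"):
        obj = obj.getObject()
    if max_depth <= 0:
        return obj
    if isinstance(obj, dict):
        return dict((key, resolve(value, max_depth - 1))
                    for key, value in obj.items())
    if isinstance(obj, (list, tuple)):
        return [resolve(value, max_depth - 1) for value in obj]
    return obj


class StubRef(object):
    """Hypothetical stand-in for an IndirectObject (demonstration only)."""

    def __init__(self, table, idnum):
        self.table = table
        self.idnum = idnum

    def getObject(self):
        return self.table[self.idnum]


# A tiny object table shaped like the catalog / pages / page chain of a PDF.
objects = {}
objects[3] = {"/Type": "/Pages", "/Count": 1, "/Kids": [StubRef(objects, 2)]}
objects[2] = {"/Type": "/Page", "/Parent": StubRef(objects, 3)}
catalog = {"/Type": "/Catalog", "/Pages": StubRef(objects, 3)}

shallow = resolve(catalog, max_depth=2)  # follows /Pages, leaves /Kids alone
deep = resolve(catalog, max_depth=4)     # also follows the entry in /Kids
```

The depth limit is a deliberate choice here: without it, the /Parent and /Kids links would send the helper around the same loop forever.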

    If the object is a text object, then just doing a str() or unicode() on the object should get you the data inside of it.

    Alternatively, pyPdf caches the objects it has already resolved in the resolvedObjects attribute, a dictionary keyed first by generation number and then by object number. For example, a PDF that contains this object:

    13 0 obj
    << /Type /Catalog /Pages 3 0 R >>
    endobj
    

    can be read like this:

    >>> import pyPdf
    >>> pdf = pyPdf.PdfFileReader(open("pdffile.pdf"))
    >>> pages = list(pdf.pages)
    >>> pdf.resolvedObjects
    {0: {2: {'/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 595.2756, 841.8898]}, 3: {'/Kids': [IndirectObject(2, 0)], '/Count': 1, '/Type': '/Pages', '/MediaBox': [0, 0, 595.2756, 841.8898]}, 4: {'/Filter': '/FlateDecode'}, 5: 147, 6: {'/ColorSpace': {'/Cs1': IndirectObject(7, 0)}, '/ExtGState': {'/Gs2': IndirectObject(9, 0), '/Gs1': IndirectObject(10, 0)}, '/ProcSet': ['/PDF', '/Text'], '/Font': {'/F1.0': IndirectObject(8, 0)}}, 13: {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}}}
    >>> pdf.resolvedObjects[0][13]
    {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}
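
Since the printed dictionary is nested as generation, then object number, it can be searched with ordinary dictionary code. The sketch below looks up objects by their /Type entry; `iter_objects`, `find_by_type`, and `sample` are hypothetical names, and `sample` is a trimmed copy of the output above in which references are written as plain strings so the example runs without pyPdf.

```python
def iter_objects(resolved):
    """Yield (idnum, generation, obj) from a {generation: {idnum: obj}} dict."""
    for generation in sorted(resolved):
        for idnum in sorted(resolved[generation]):
            yield idnum, generation, resolved[generation][idnum]


def find_by_type(resolved, type_name):
    """Return (idnum, generation) pairs of dictionaries whose /Type matches."""
    return [
        (idnum, generation)
        for idnum, generation, obj in iter_objects(resolved)
        if isinstance(obj, dict) and obj.get("/Type") == type_name
    ]


# Trimmed copy of the output above; references are plain strings here.
sample = {
    0: {
        2: {"/Type": "/Page", "/Parent": "3 0 R", "/Contents": "4 0 R"},
        3: {"/Type": "/Pages", "/Count": 1, "/Kids": ["2 0 R"]},
        4: {"/Filter": "/FlateDecode"},
        5: 147,
        13: {"/Type": "/Catalog", "/Pages": "3 0 R"},
    }
}

page_ids = find_by_type(sample, "/Page")
catalog_ids = find_by_type(sample, "/Catalog")
```

With a real file, the same functions could presumably be given `pdf.resolvedObjects` directly, keeping in mind that only objects that have already been read will appear in it.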
    
