pyPdf for IndirectObject extraction

不思量自难忘° 2020-12-08 23:07

Following this example, I can list all elements in a PDF file:

import pyPdf
pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
list(pdf.pages) # Process all t         


        
3 Answers
  • 2020-12-08 23:43

    Jehiah's method is good if you need to search the whole document for the object. My guess (from looking at the PDF) is that it is always in the same place (the first page, in the 'MC0' property), so a much simpler way to find the string would be:

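    import pyPdf
    # open in binary mode; PDF files are not plain text
    pdf = pyPdf.PdfFileReader(open("file.pdf", "rb"))
    # walk from the first page down to the object, then read its data
    pdf.getPage(0)['/Resources']['/Properties']['/MC0']['/MYOBJECT'].getData()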
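
The chained lookup above raises a KeyError as soon as one level is missing. A minimal sketch of a more forgiving walk, using plain dictionaries as a stand-in for the page object so that it runs without pyPdf (the helper name get_path and the sample data are hypothetical, not part of pyPdf):

```python
def get_path(obj, keys, default=None):
    """Follow `keys` through nested dictionaries; return `default` if any level is missing."""
    current = obj
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical stand-in for pdf.getPage(0); a real page holds pyPdf objects.
page = {'/Resources': {'/Properties': {'/MC0': {'/MYOBJECT': 'payload'}}}}

found = get_path(page, ['/Resources', '/Properties', '/MC0', '/MYOBJECT'])
missing = get_path(page, ['/Resources', '/Properties', '/MC1', '/MYOBJECT'])
```

Here found holds the stored value and missing is None. With a real page, the object that comes back would still need getData() called on it, as in the answer.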
    
  • 2020-12-08 23:46

    Each element in pdf.pages is a dictionary, so assuming the object is on page 1, pdf.pages[0]['/MYOBJECT'] should be the element you want.

    You can try printing it on its own, or poke at it with help() and dir() at a Python prompt to learn more about how to get the string you want.

    Edit:

    After receiving a copy of the PDF, I found the object at pdf.resolvedObjects[0][558]['/Resources']['/Properties']['/MC0']['/MYOBJECT'], and the value can be retrieved via getData().

    The following function gives a more generic way to solve this by recursively looking for the key in question:

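    import pyPdf

    pdf = pyPdf.PdfFileReader(open('file.pdf', 'rb'))
    # walk all pages so that their objects are cached in pdf.resolvedObjects
    pages = list(pdf.pages)

    def findInDict(needle, haystack):
        """Return the first value stored under `needle`, searching nested dictionaries."""
        for key in haystack.keys():
            try:
                value = haystack[key]
            except Exception:
                # skip entries that cannot be read
                continue
            if key == needle:
                return value
            # descend into nested dictionaries (plain dicts or pyPdf DictionaryObjects)
            if isinstance(value, (dict, pyPdf.generic.DictionaryObject)):
                x = findInDict(needle, value)
                if x is not None:
                    return x
        return None

    answer = findInDict('/MYOBJECT', pdf.resolvedObjects).getData()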
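
The function above only descends into dictionaries, so a key nested inside an array (for example under an entry of '/Kids') would be missed. A sketch of a variant that also walks lists, shown on plain Python data so that it runs without pyPdf; the name find_key and the sample structure are hypothetical:

```python
def find_key(needle, haystack):
    """Depth-first search for `needle` in nested dicts and lists; None if absent."""
    if isinstance(haystack, dict):
        if needle in haystack:
            return haystack[needle]
        children = haystack.values()
    elif isinstance(haystack, (list, tuple)):
        children = haystack
    else:
        return None
    for child in children:
        result = find_key(needle, child)
        if result is not None:
            return result
    return None


# Hypothetical data shaped like pdf.resolvedObjects: {generation: {object number: object}}
objects = {0: {3: {'/Kids': [{'/Resources': {'/Properties': {'/MC0': {'/MYOBJECT': 'payload'}}}}]}}}

answer = find_key('/MYOBJECT', objects)
absent = find_key('/NOPE', objects)
```

Real PDFs contain reference cycles (a page's '/Parent' points back to the node that lists the page under '/Kids'), so a version that follows indirect references would also need to remember which objects it has already visited.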
    
  • 2020-12-08 23:52

    An IndirectObject refers to an actual object (it's like a link or alias so that the total size of the PDF can be reduced when the same content appears in multiple places). The getObject method will give you the actual object.

    If the object is a text object, then calling str() or unicode() on it should give you the data inside it.

    Alternatively, pyPdf stores the objects in the resolvedObjects attribute. For example, a PDF that contains this object:

    13 0 obj
    << /Type /Catalog /Pages 3 0 R >>
    endobj
    

    Can be read with this:

    >>> import pyPdf
    >>> pdf = pyPdf.PdfFileReader(open("pdffile.pdf", "rb"))
    >>> pages = list(pdf.pages)
    >>> pdf.resolvedObjects
    {0: {2: {'/Parent': IndirectObject(3, 0), '/Contents': IndirectObject(4, 0), '/Type': '/Page', '/Resources': IndirectObject(6, 0), '/MediaBox': [0, 0, 595.2756, 841.8898]}, 3: {'/Kids': [IndirectObject(2, 0)], '/Count': 1, '/Type': '/Pages', '/MediaBox': [0, 0, 595.2756, 841.8898]}, 4: {'/Filter': '/FlateDecode'}, 5: 147, 6: {'/ColorSpace': {'/Cs1': IndirectObject(7, 0)}, '/ExtGState': {'/Gs2': IndirectObject(9, 0), '/Gs1': IndirectObject(10, 0)}, '/ProcSet': ['/PDF', '/Text'], '/Font': {'/F1.0': IndirectObject(8, 0)}}, 13: {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}}}
    >>> pdf.resolvedObjects[0][13]
    {'/Type': '/Catalog', '/Pages': IndirectObject(3, 0)}
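
To make the link-like behaviour of an IndirectObject concrete, here is a toy model that runs without pyPdf. Ref and resolve are hypothetical names, and the table is a trimmed copy of two entries from the resolvedObjects output above, keyed by generation number and then by object number:

```python
from collections import namedtuple

# Hypothetical stand-in for pyPdf's IndirectObject(idnum, generation)
Ref = namedtuple('Ref', ['idnum', 'generation'])

# Same layout as the output above: {generation: {object number: object}}
resolved = {
    0: {
        3: {'/Kids': [Ref(2, 0)], '/Count': 1, '/Type': '/Pages'},
        13: {'/Type': '/Catalog', '/Pages': Ref(3, 0)},
    }
}


def resolve(obj, table):
    """Return the object a Ref points to; pass anything else through unchanged."""
    while isinstance(obj, Ref):
        obj = table[obj.generation][obj.idnum]
    return obj


catalog = resolve(Ref(13, 0), resolved)
pages_node = resolve(catalog['/Pages'], resolved)
```

Following '/Pages' from the catalog lands on object 3, the '/Pages' node. In pyPdf itself, the getObject method mentioned above plays the role of resolve.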
    