Retrieve page numbers from document with pyPDF

逝去的感伤 2020-12-28 15:51

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page

5 answers
  •  暖寄归人
    2020-12-28 16:12

    For full documentation, see Adobe's 978-page PDF Reference. :-)

    More specifically, the PDF file contains metadata that indicates how the PDF's physical pages are mapped to logical page numbers and how page numbers should be formatted. This is where you go for canonical results. Example 2 of this page shows how this looks in the PDF markup. You'll have to fish that out, parse it, and perform a mapping yourself.

    In PyPDF, to get at this information, try, as a starting point:

    pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
    

    By the way, when you see an IndirectObject instance, you can call its getObject() method to retrieve the actual object being pointed to.
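    To make the "perform a mapping yourself" step concrete, here is a rough sketch of how the /Nums array could be turned into one label per physical page. It assumes the array has already been resolved into plain Python values (getObject() called wherever needed) and that it is a flat list of alternating start indices and label dictionaries using the /S (style), /P (prefix) and /St (start value) keys described in the PDF Reference. The function names are made up for illustration and are not part of PyPDF. Note that /PageLabels is a number tree, so some files may store the ranges under /Kids rather than directly in /Nums; this sketch does not handle that case.

```python
def to_roman(n):
    """Convert a positive integer to an uppercase Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """Letter style: A..Z, then AA..ZZ, then AAA..ZZZ, and so on."""
    repeat, index = divmod(n - 1, 26)
    return chr(ord("A") + index) * (repeat + 1)


def format_number(n, style):
    """Format n according to a /S style name; no style means no numeric part."""
    if style == "/D":
        return str(n)
    if style == "/R":
        return to_roman(n)
    if style == "/r":
        return to_roman(n).lower()
    if style == "/A":
        return to_letters(n)
    if style == "/a":
        return to_letters(n).lower()
    return ""


def page_labels(nums, page_count):
    """Return one label per physical page (None where no range covers it).

    nums is assumed to be [start_index, label_dict, start_index, label_dict, ...]
    with 0-based start indices in ascending order.
    """
    ranges = [(nums[i], nums[i + 1]) for i in range(0, len(nums), 2)]
    labels = [None] * page_count
    for k, (start, entry) in enumerate(ranges):
        # A range runs until the next range starts, or to the last page.
        end = ranges[k + 1][0] if k + 1 < len(ranges) else page_count
        prefix = entry.get("/P", "")
        first = entry.get("/St", 1)
        style = entry.get("/S")
        for page_index in range(start, min(end, page_count)):
            number = first + (page_index - start)
            labels[page_index] = prefix + format_number(number, style)
    return labels
```

    With the labels in hand, the pages of each input can be sorted by label before merging, as long as the labels in the file are actually trustworthy.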

    Your alternative is, as you say, to check the text objects and try to figure out which one is the page number. You could use extractText() of the page object for this, but you'll get one string back and have to fish the page number out of that. (And of course the page number might be Roman or alphabetic instead of numeric, and some pages may not be numbered.) Instead, have a look at how extractText() actually does its job (PyPDF is written in Python, after all) and use it as the basis for a routine that checks each text object on the page individually to see if it looks like a page number. Be wary of TOC/index pages that have lots of page numbers on them!
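    As a rough illustration of the "does it look like a page number" check, here is a small heuristic that works on a list of text fragments. It assumes you have already collected the individual text objects of a page as plain strings, for example with a routine adapted from extractText(); that collection step is not shown. The names looks_like_page_number and guess_page_label are hypothetical. The heuristic accepts short decimal numbers and Roman numerals only, and gives up when a page has zero or several candidates, which is a crude way to avoid being fooled by TOC/index pages.

```python
import re

# Up to four digits; longer runs are more likely years, IDs or data.
DECIMAL = re.compile(r"^\d{1,4}$")

# Well-formed Roman numerals; the lookahead rejects the empty string.
ROMAN = re.compile(
    r"^(?=[ivxlcdm]+$)m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$",
    re.IGNORECASE,
)


def looks_like_page_number(fragment):
    """True if the whole fragment is a short decimal or a Roman numeral."""
    text = fragment.strip()
    return bool(DECIMAL.match(text) or ROMAN.match(text))


def guess_page_label(fragments):
    """Return the single candidate on the page, or None if ambiguous.

    Pages with no candidate (unnumbered) or several candidates
    (tables of contents, indexes, tables of figures) yield None.
    """
    candidates = [f.strip() for f in fragments if looks_like_page_number(f)]
    if len(candidates) != 1:
        return None
    return candidates[0]
```

    Expect false positives: a fragment consisting only of "I" or "mix" also passes the Roman numeral test, so it is worth cross-checking the guesses against neighbouring pages before reordering anything.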
