Retrieve page numbers from document with pyPDF

后端未结

关注

 5  1647

逝去的感伤 2020-12-28 15:51

At the moment I\'m looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I\'m looking into scraping each page for its page

5条回答

予麋鹿 (楼主)

2020-12-28 16:18

The answer by kindall is very good. However, since a working code sample was requested later (by dreamer) and since I had the same problem today, I would like to add some notes.

pdf structure is not uniform; there are rather few things you can rely on, hence any working code sample is very unlikely to work for everyone. A very good explanation can be found in this answer.
As explained by kindall, you will most likely need to explore what pdf you are dealing with.

Like so:

import sys
import PyPDF2 as pyPdf

"""Open your pdf"""
pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))

"""Explore the /PageLabels (if it exists)"""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]
    print(page_label_type)
except:
    print("No /PageLabel object")

"""Select the item that is most likely to contain the information you desire; e.g.
       {'/Nums': [0, IndirectObject(42, 0)]}
   here, we only have "/Num". """

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
    print(page_label_type)
except:
    print("No /PageLabel object")

"""If you see a list, like
       [0, IndirectObject(42, 0)]
   get the correct item from it"""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1]
    print(page_label_type)
except:
    print("No /PageLabel object")

"""If you then have an indirect object, like
       IndirectObject(42, 0)
   use getObject()"""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()
    print(page_label_type)
except:
    print("No /PageLabel object")

"""Now we have e.g.
       {'/S': '/r', '/St': 21}
   meaning roman numerals, starting with page 21, i.e. xxi. We can now also obtain the two variables directly."""

try:
    page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
    print(page_label_type)
    start_page = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
    print(start_page)
except:
    print("No /PageLabel object")

As you can see from the ISO pdf 1.7 specification (relevant section here) there are lots of possibilities of how to label pages. As a simple working example consider this script that will at least deal with decimal (arabic) and with roman numerals:

Script:

import sys
import PyPDF2 as pyPdf

def arabic_to_roman(arabic):
    roman = ''
    while arabic >= 1000:
      roman += 'm'
      arabic -= 1000
    diffs = [900, 500, 400, 300, 200, 100, 90, 50, 40, 30, 20, 10, 9, 5, 4, 3, 2, 1]
    digits = ['cm', 'd', 'cd', 'ccc', 'cc', 'c', 'xc', 'l', 'xl', 'xxx', 'xx', 'x', 'ix', 'v', 'iv', 'iii', 'ii', 'i']
    for i in range(len(diffs)):
      if arabic >= diffs[i]:
        roman += digits[i]
        arabic -= diffs[i]
    return(roman)

def get_page_labels(pdf):
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
    except:
        page_label_type = "/D"
    try:
        page_start = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
    except:
        page_start = 1
    page_count = pdf.getNumPages()
    ##or, if you feel fancy, do:
    #page_count = pdf.trailer["/Root"]["/Pages"]["/Count"]
    page_stop = page_start + page_count 

    if page_label_type == "/D":
        page_numbers = list(range(page_start, page_stop))
        for i in range(len(page_numbers)):
            page_numbers[i] = str(page_numbers[i])
    elif page_label_type == '/r':
        page_numbers_arabic = range(page_start, page_stop)
        page_numbers = []
        for i in range(len(page_numbers_arabic)):
            page_numbers.append(arabic_to_roman(page_numbers_arabic[i]))

    print(page_label_type)
    print(page_start)
    print(page_count)
    print(page_numbers)

pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
get_page_labels(pdf)

0 讨论(0)

查看其它5个回答