Retrieve page numbers from document with pyPDF

逝去的感伤 2020-12-28 15:51

At the moment I'm looking into doing some PDF merging with pyPdf, but sometimes the inputs are not in the right order, so I'm looking into scraping each page for its page

5 answers
  •  予麋鹿
    予麋鹿 (original poster)
    2020-12-28 16:18

    The answer by kindall is very good. However, since a working code sample was requested later (by dreamer) and since I had the same problem today, I would like to add some notes.

    1. PDF structure is not uniform, and there is little you can rely on, so any working code sample is unlikely to work for every file. A very good explanation can be found in this answer.

    2. As explained by kindall, you will most likely need to explore what kind of PDF you are dealing with.

    Like so:

    import sys
    import PyPDF2 as pyPdf
    
    """Open your pdf (PdfFileReader is the PyPDF2 1.x/2.x name; PyPDF2 3.0 replaced it with PdfReader)"""
    pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
    
    """Explore the /PageLabels (if it exists)"""
    
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]
        print(page_label_type)
    except:
        print("No /PageLabel object")
    
    """Select the item that is most likely to contain the information you desire; e.g.
           {'/Nums': [0, IndirectObject(42, 0)]}
       here, we only have "/Nums". """
    
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"]
        print(page_label_type)
    except:
        print("No /PageLabel object")
    
    """If you see a list, like
           [0, IndirectObject(42, 0)]
       get the correct item from it"""
    
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1]
        print(page_label_type)
    except:
        print("No /PageLabel object")
    
    """If you then have an indirect object, like
           IndirectObject(42, 0)
       use getObject()"""
    
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()
        print(page_label_type)
    except:
        print("No /PageLabel object")
    
    """Now we have e.g.
           {'/S': '/r', '/St': 21}
       meaning lowercase roman numerals, with the numbering starting at 21, i.e. xxi. We can now also obtain the two values directly."""
    
    try:
        page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
        print(page_label_type)
        start_page = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
        print(start_page)
    except:
        print("No /PageLabel object")
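
The step-by-step exploration above can be condensed into one small helper. This is only a sketch: `dig` is a hypothetical name, not part of PyPDF2, and it assumes that indirect objects expose `getObject()` as they do in the PyPDF2 1.x/2.x code above. The `trailer` value is a plain Python stand-in for `pdf.trailer`, so the sketch runs without a PDF file.

```python
def dig(obj, *keys):
    """Follow keys/indices through nested objects; return None if a step is missing."""
    for key in keys:
        # Assumption: indirect objects can be resolved with getObject();
        # plain dicts and lists have no such method and are used as-is.
        if hasattr(obj, "getObject"):
            obj = obj.getObject()
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return None
    if hasattr(obj, "getObject"):
        obj = obj.getObject()
    return obj


# Plain Python stand-in for pdf.trailer, mirroring the example values above.
trailer = {"/Root": {"/PageLabels": {"/Nums": [0, {"/S": "/r", "/St": 21}]}}}

print(dig(trailer, "/Root", "/PageLabels", "/Nums", 1, "/S"))   # /r
print(dig(trailer, "/Root", "/PageLabels", "/Nums", 1, "/St"))  # 21
print(dig(trailer, "/Root", "/Outlines"))                       # None
```

With a real file you would pass `pdf.trailer` instead of the stand-in; a `None` result then simply means the document has no such entry.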
    
    3. As you can see from the ISO PDF 1.7 specification (relevant section here), there are many ways to label pages. As a simple working example, consider this script, which at least deals with decimal (arabic) and lowercase roman numerals:

    Script:

    import sys
    import PyPDF2 as pyPdf
    
    def arabic_to_roman(arabic):
        roman = ''
        while arabic >= 1000:
            roman += 'm'
            arabic -= 1000
        diffs = [900, 500, 400, 300, 200, 100, 90, 50, 40, 30, 20, 10, 9, 5, 4, 3, 2, 1]
        digits = ['cm', 'd', 'cd', 'ccc', 'cc', 'c', 'xc', 'l', 'xl', 'xxx', 'xx', 'x', 'ix', 'v', 'iv', 'iii', 'ii', 'i']
        for i in range(len(diffs)):
            if arabic >= diffs[i]:
                roman += digits[i]
                arabic -= diffs[i]
        return roman
    
    def get_page_labels(pdf):
        # Only the first /Nums entry is read, so a single label range is assumed.
        try:
            page_label_type = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/S"]
        except Exception:
            page_label_type = "/D"  # no label style found: assume decimal
        try:
            page_start = pdf.trailer["/Root"]["/PageLabels"]["/Nums"][1].getObject()["/St"]
        except Exception:
            page_start = 1  # no /St entry: numbering starts at 1
        page_count = pdf.getNumPages()
        ##or, if you feel fancy, do:
        #page_count = pdf.trailer["/Root"]["/Pages"]["/Count"]
        page_stop = page_start + page_count
    
        if page_label_type == "/D":
            page_numbers = [str(n) for n in range(page_start, page_stop)]
        elif page_label_type == "/r":
            page_numbers = [arabic_to_roman(n) for n in range(page_start, page_stop)]
        else:
            # Other styles (/R, /a, /A) are not handled by this simple example.
            raise NotImplementedError("unsupported page label style: %s" % page_label_type)
    
        print(page_label_type)
        print(page_start)
        print(page_count)
        print(page_numbers)
        return page_numbers
    
    pdf = pyPdf.PdfFileReader(open(sys.argv[1], "rb"))
    get_page_labels(pdf)
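
The script above reads only the first `/Nums` entry. As a rough sketch of how several label ranges and the other numbering styles from the specification (`/R`, `/a`, `/A`, plus the `/P` prefix) might be expanded, the following works on plain Python values only. The names `to_roman`, `to_letters`, `format_number` and `labels_from_nums` are hypothetical, and the sketch assumes that `/Nums` is a flat list whose entries have already been resolved to plain dicts (a real file may store indirect objects or a nested number tree instead).

```python
def to_roman(n):
    """Lowercase roman numeral for a positive integer."""
    pairs = [(1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"),
             (90, "xc"), (50, "l"), (40, "xl"), (10, "x"), (9, "ix"),
             (5, "v"), (4, "iv"), (1, "i")]
    out = ""
    for value, digit in pairs:
        while n >= value:
            out += digit
            n -= value
    return out


def to_letters(n):
    """a..z for 1..26, then aa..zz for 27..52, and so on."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_number(style, n):
    """Render the numeric part of a label for the given /S style."""
    if style == "/D":
        return str(n)
    if style == "/r":
        return to_roman(n)
    if style == "/R":
        return to_roman(n).upper()
    if style == "/a":
        return to_letters(n)
    if style == "/A":
        return to_letters(n).upper()
    return ""  # no /S entry: the label has no numeric part


def labels_from_nums(nums, page_count):
    """Expand a flat [index, dict, index, dict, ...] list into one label per page."""
    ranges = sorted(zip(nums[0::2], nums[1::2]), key=lambda r: r[0])
    labels = []
    for page in range(page_count):
        # Pick the last range that starts at or before this page.
        current = None
        for start, label in ranges:
            if start <= page:
                current = (start, label)
        if current is None:
            labels.append(str(page + 1))  # assumption: fall back to decimal
            continue
        start, label = current
        number = label.get("/St", 1) + (page - start)
        labels.append(label.get("/P", "") + format_number(label.get("/S"), number))
    return labels


nums = [0, {"/S": "/r"}, 3, {"/S": "/D"}, 6, {"/S": "/D", "/P": "A-", "/St": 8}]
print(labels_from_nums(nums, 8))
# ['i', 'ii', 'iii', '1', '2', '3', 'A-8', 'A-9']
```

Page indices in `/Nums` are zero-based, so the range starting at index 3 labels the fourth page of the document.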
    
