How to extract the title of a PDF document from within a script for renaming?

长情又很酷 2021-02-01 21:34

I have thousands of PDF files on my computer, named a0001.pdf through a3621.pdf, and each one has a title inside it; e.g. "aluminum carbona

6 answers
  •  暗喜 (original poster)
    2021-02-01 22:05

    You can use the pdfminer library to parse the PDFs. The document's info property holds the metadata, including the Title. Here is what a sample info looks like:

    [{'CreationDate': "D:20170110095753+05'30'", 'Producer': 'PDF-XChange Printer V6 (6.0 build 317.1) [Windows 10 Enterprise x64 (Build 10586)]', 'Creator': 'PDF-XChange Office Addin', 'Title': 'Python Basics'}]
    

    Then we can read the Title with an ordinary dictionary lookup. Here is the complete code, which iterates over all the files and renames them:

    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    import os

    for i in range(1, 3622):
        # Build the names a0001.pdf ... a3621.pdf (zero-padded to four digits)
        file_name = "a" + str(i).zfill(4) + ".pdf"
        with open(file_name, 'rb') as fp:
            parser = PDFParser(fp)
            doc = PDFDocument(parser)
            if not doc.info:
                continue  # this file has no "Info" metadata
            metadata = doc.info[0]  # The "Info" metadata
            print(metadata)
            title = metadata.get("Title")
        if title:
            # Newer pdfminer releases may return the value as bytes
            if isinstance(title, bytes):
                title = title.decode("utf-8", errors="replace")
            # Rename only after the file has been closed
            os.rename(file_name, title + ".pdf")
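    One thing the loop above does not handle is a title that is not a valid file name (for example, one containing "/" or ":"), or two PDFs that share the same title. The following is a minimal sketch of one way to deal with both. The helper name title_to_filename, the "untitled" fallback, and the " (2)" suffix scheme are illustrative choices, not part of pdfminer, and the byte decoding is an assumption about how the title was stored.

```python
import re


def title_to_filename(title, taken, fallback="untitled"):
    """Turn a PDF title into a safe, unique file name ending in .pdf.

    `taken` is a set of lower-cased names already handed out; it is
    updated in place. All names here are hypothetical helpers.
    """
    if isinstance(title, bytes):
        # Assumption: a byte-order mark means UTF-16, otherwise try UTF-8
        if title.startswith((b"\xfe\xff", b"\xff\xfe")):
            title = title.decode("utf-16", errors="replace")
        else:
            title = title.decode("utf-8", errors="replace")

    # Replace characters that are not allowed in file names on
    # Windows or POSIX systems, then trim stray spaces and dots
    stem = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title).strip(" .")
    stem = stem or fallback

    # Add a counter when the name has already been used
    candidate = stem + ".pdf"
    n = 2
    while candidate.lower() in taken:
        candidate = f"{stem} ({n}).pdf"
        n += 1
    taken.add(candidate.lower())
    return candidate
```

    To use it, create taken = set() once before the loop and pass title_to_filename(title, taken) as the new name to os.rename. Running the loop once with a print in place of os.rename is a cheap way to check the planned names before touching any files.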
    
