Check whether a PDF-File is valid with Python

借酒劲吻你 2020-12-08 10:50

I receive a file via an HTTP upload and need to be sure it's a PDF file. The programming language is Python, but this should not matter.

I thought of the follow

7 answers
  •  南方客 (OP)
    2020-12-08 11:30

    Here is a solution using pdfminer.six, which can be installed with pip install pdfminer.six:

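    from pdfminer.high_level import extract_text
    
    def is_pdf(path_to_file):
        # Parsing the whole file succeeds only if pdfminer can read it as a PDF.
        try:
            extract_text(path_to_file)
            return True
        except Exception:
            # Any parse or I/O error is treated as "not a valid PDF".
            return False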
    

    You can also use filetype (pip install filetype):

    import filetype
    
    def is_pdf(path_to_file):
        # guess() returns None when the type cannot be determined.
        kind = filetype.guess(path_to_file)
        return kind is not None and kind.mime == 'application/pdf'
    

    Neither of these solutions is ideal.

    1. The problem with the filetype solution is that it only inspects the file's leading signature bytes. It will tell you if the file identifies itself as a PDF, but not whether the PDF is readable: a corrupt or truncated PDF still passes.
    2. The pdfminer solution should only return True if the PDF is actually readable. But it is a big library and seems like overkill for such a simple function.

    I've started another thread here asking how to check if a file is a valid PDF without using a library (or using a smaller one).
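
    As a rough idea of what a library-free check might look like, here is a minimal sketch using only the standard library. It assumes that a "%PDF-" header at the very start of the file and a "%%EOF" marker within the last kilobyte are good enough as a sanity check. The function name looks_like_pdf and the tail_size parameter are made up for this example. This is stricter than filetype in that it notices truncated files, but it still does not prove the PDF can be parsed, and some readers tolerate junk bytes before the header, which this sketch rejects.

```python
import os

def looks_like_pdf(path_to_file, tail_size=1024):
    """Cheap sanity check: '%PDF-' at the start and '%%EOF' near the end.

    Hypothetical helper; it does not parse the PDF structure at all.
    """
    with open(path_to_file, "rb") as f:
        # A PDF starts with a header such as b"%PDF-1.7".
        if f.read(5) != b"%PDF-":
            return False
        # Read at most the last tail_size bytes to look for the end marker.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_size))
        tail = f.read()
    return b"%%EOF" in tail
```

    For an upload, this could be used as a fast first filter, with the pdfminer check above run only on files that pass it.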
