Check whether a PDF-File is valid with Python


I receive a file via an HTTP upload and need to be sure it's a PDF file. The programming language is Python, but this should not matter.

I thought of the follow


7 Answers

隐瞒了意图╮

2020-12-08 11:18
The two most commonly used PDF libraries for Python are:
- pyPdf
- ReportLab
Both are pure Python, so they should be easy to install and should work cross-platform.

With pyPdf it would probably be as simple as doing:
```
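from pyPdf import PdfFileReader

# The constructor raises an exception if the file cannot be parsed as a PDF.
# open() replaces the Python 2-only file() builtin.
doc = PdfFileReader(open("upload.pdf", "rb"))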
```
This should be enough, but doc will now have getDocumentInfo() and getNumPages() methods (also exposed as the documentInfo and numPages properties) if you want to do further checking.

As Carl answered, pdftotext is also a good solution, and would probably be faster on very large documents (especially ones with many cross-references). However, it might be a little slower on small PDFs because of the overhead of forking a new process.
情歌与酒

2020-12-08 11:22

If you're on a Linux or OS X box, you could use pdftotext (part of Xpdf). If you pass a non-PDF to pdftotext, it will certainly bark at you, and you can use commands.getstatusoutput (subprocess.getstatusoutput in Python 3) to capture the output and parse it for these warnings.
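As a rough sketch of the pdftotext approach, the exit status can be checked instead of parsing the warnings. The function name `pdftotext_accepts` and its `tool` and `timeout` parameters are made up for illustration, and the sketch assumes that pdftotext is on the PATH and exits with a non-zero status when it cannot open the file as a PDF (which is what its man page describes).

```python
import subprocess

def pdftotext_accepts(path, tool="pdftotext", timeout=30.0):
    """Return True if `tool` exits with status 0 when asked to convert `path`.

    Raises FileNotFoundError if the tool itself is not installed.
    """
    try:
        result = subprocess.run(
            [tool, path, "-"],          # "-" sends the extracted text to stdout
            stdout=subprocess.DEVNULL,  # the text itself is not needed here
            stderr=subprocess.PIPE,     # warnings stay available in result.stderr
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a conversion that never finishes as a failed check
        return False
    return result.returncode == 0
```

A damaged file that pdftotext can still open may pass this check with only warnings on stderr, so it is a coarse filter rather than a full validation.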

If you're looking for a platform-independent solution, you might be able to make use of pyPdf.

Edit: It's not elegant, but it looks like pyPdf's PdfFileReader will throw an IOError(22) if you attempt to load a non-PDF.

清酒与你

2020-12-08 11:25

I ran into the same problem, but I was not required to use a programming language for this task. I tried pyPdf, but it was not reliable for me because it hangs indefinitely on some corrupted files.

However, I have found this software useful so far.

Good luck with it.

https://sourceforge.net/projects/corruptedpdfinder/
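If the check does have to stay in Python, one way to guard against a parser that hangs on corrupted input is to run it in a child process with a timeout. The following is only a sketch: `check_with_timeout`, `DEFAULT_CHECK`, and the parameters are hypothetical names, and the default check merely looks for the `%PDF-` marker as a stand-in for whatever library call is actually at risk of hanging.

```python
import subprocess
import sys

# Stand-in check executed in the child process. In practice this would call
# the PDF library that is prone to hanging; here it only looks for "%PDF-".
DEFAULT_CHECK = """
import sys
with open(sys.argv[1], "rb") as f:
    sys.exit(0 if b"%PDF-" in f.read(1024) else 1)
"""

def check_with_timeout(path, check_source=DEFAULT_CHECK, timeout=10.0):
    """Run `check_source` on `path` in a child interpreter.

    Returns False if the child exits non-zero or does not finish in time.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", check_source, path],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing keeps hanging
        return False
    return result.returncode == 0
```

Starting an interpreter per file is slow, so this is better suited to occasional uploads than to bulk scanning.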

南方客

2020-12-08 11:30
Here is a solution using pdfminer.six, which can be installed with pip install pdfminer.six:
```
from pdfminer.high_level import extract_text

def is_pdf(path_to_file):
    try:
        # Parsing raises an exception if the file is not a readable PDF
        extract_text(path_to_file)
        return True
    except Exception:
        return False
```
You can also use filetype (pip install filetype):
```
import filetype

def is_pdf(path_to_file):
    # guess() returns None when the type cannot be determined
    kind = filetype.guess(path_to_file)
    return kind is not None and kind.mime == 'application/pdf'
```
Neither of these solutions is ideal.
1. The problem with the filetype solution is that it doesn't tell you if the PDF itself is readable or not. It will tell you if the file is a PDF, but it could be a corrupt PDF.
2. The pdfminer solution should only return True if the PDF is actually readable. But it is a big library and seems like overkill for such a simple function.
I've started another thread here asking how to check if a file is a valid PDF without using a library (or using a smaller one).
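For the library-free route, a minimal sketch using only the standard library could look for the `%PDF-` header near the start of the file and the `%%EOF` marker near the end. The name `looks_like_pdf` and the 1024-byte `window` are assumptions made for illustration. This does not parse the cross-reference table or any objects, so a file that passes can still be corrupt.

```python
import os

def looks_like_pdf(path, window=1024):
    """Cheap structural check: '%PDF-' near the start and '%%EOF' near the end."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(window)
        # Jump to the last `window` bytes (or the start, for small files)
        f.seek(max(0, size - window))
        tail = f.read(window)
    return b"%PDF-" in head and b"%%EOF" in tail
```

It sits between the two solutions above: stricter than a pure magic-number check because a truncated upload usually loses its `%%EOF` marker, but far weaker than actually parsing the document.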
独厮守ぢ

2020-12-08 11:31
Update 2020

It looks like pdfminer.six is a maintained project (the others, including the one below, seem dead).

ReportLab is another one (mistakenly marked as dead by me)

Original answer

Since apparently neither PyPdf ~~nor ReportLab~~ is available anymore, the current solution I found (as of 2015) is to use PyPDF2 and catch exceptions (and possibly analyze the result of getDocumentInfo()):
```
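import PyPDF2

# Create a file that is deliberately not a PDF
with open("testfile.txt", "w") as f:
    f.write("hello world!")

try:
    # The context manager closes the handle even if parsing fails
    with open("testfile.txt", "rb") as f:
        PyPDF2.PdfFileReader(f)
except PyPDF2.utils.PdfReadError:
    print("invalid PDF file")
else:
    pass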
```
不知归路

2020-12-08 11:33
In a project of mine I need to check the MIME type of uploaded files. I simply use the file command like this:
```
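from subprocess import Popen, PIPE

# `file` is the uploaded file object; only its first 1024 bytes are inspected.
# Passing the arguments as a list avoids going through a shell.
proc = Popen(["/usr/bin/file", "-b", "--mime", "-"], stdout=PIPE, stdin=PIPE)
filetype = proc.communicate(file.read(1024))[0].strip()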
```
You might of course want to move the actual command into a configuration file, since the command-line options vary among operating systems (e.g. on a Mac).

If you just need to know whether it's a PDF or not and do not need to process it anyway, I think the file command is a faster solution than a library. Doing it by hand is of course also possible, but the file command may give you more flexibility if you want to check for different types.
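For comparison, doing it by hand might look like the sketch below, which matches the first bytes of the upload against a small table of well-known signatures. The names `SIGNATURES` and `sniff_mime` are made up for this example and the table is illustrative rather than exhaustive. The file command knows far more formats.

```python
import io

# Leading byte signatures for a few common upload types (not exhaustive)
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]

def sniff_mime(stream, default="application/octet-stream"):
    """Guess a MIME type from the first bytes of a seekable binary stream."""
    start = stream.tell()
    head = stream.read(16)
    stream.seek(start)  # rewind so the caller can still read the whole upload
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return default

# Usage with an in-memory stand-in for the uploaded file
upload = io.BytesIO(b"%PDF-1.7\n...")
print(sniff_mime(upload))  # application/pdf
```

Like the file command, this only identifies the type from its header; it says nothing about whether the rest of the PDF is intact.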