Question
Does a library exist that will remove "owner" passwords from PDF documents so that the text can then be programmatically extracted from them? Something like PDF Technologies' Password Recovery tool, but callable from the command line or from Python. A GUI is not really useful to me, since there are so many documents.
Please, no comments on the legality of the process. The PDFs in question are owned, and the text needs to be extracted in order to form keyword clouds for the document set.
Answer 1:
I do not know of any Python libraries, but for batch removal of passwords from PDF documents, my colleagues have had good experience with PwdRemover (not free).
Answer 2:
Here are two other (open-source) tools for command-line processing:
QPDF: A Content-Preserving PDF Transformation System:
qpdf --password=PASSWORD --decrypt SECURED.pdf UNSECURED.pdf
pdftk - the pdf toolkit:
pdftk SECURED.pdf input_pw PASSWORD output UNSECURED.pdf
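Since the question asks for something callable from Python over a large document set, the qpdf command above could be wrapped roughly like this. This is only a minimal sketch: it assumes qpdf is installed and on your PATH, the function names and directory layout are made up for illustration, and the empty default password assumes the files carry only an owner password (no user password), which may not hold for every document.

```python
import subprocess
from pathlib import Path


def build_qpdf_command(src, dst, password=""):
    """Return the argument list for the qpdf call shown above."""
    return ["qpdf", f"--password={password}", "--decrypt", str(src), str(dst)]


def decrypt_directory(in_dir, out_dir, password=""):
    """Decrypt every *.pdf in in_dir into out_dir; return the files that failed.

    Hypothetical helper: assumes all files share the same password.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(in_dir).glob("*.pdf")):
        cmd = build_qpdf_command(src, out_dir / src.name, password)
        result = subprocess.run(cmd, capture_output=True, text=True)
        # qpdf exits with 0 on success and 3 when it finished with warnings.
        if result.returncode not in (0, 3):
            failed.append(src)
    return failed
```

Calling decrypt_directory("secured", "unsecured") (both directory names are placeholders) would write the decrypted copies next to the originals' names and hand back the list of files qpdf could not process, so those can be inspected separately.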
Answer 3:
If you've forgotten the password or the employee who encrypted the documents has since left the company, you can use PDFCrack to recover the password(s).
Source: https://stackoverflow.com/questions/1750716/does-a-library-exist-to-remove-passwords-from-pdfs-programmatically