Manipulating PDF file

Question

I would like to read a PDF file as text (PostScript), add new objects to the file structure, and save the result as a new PDF. But if I just copy the PDF's PostScript content and paste it into a newly created PDF file (with encoding='ansi'), the file doesn't work.

I am fairly sure this is an encoding issue, but I don't know what I should do to end up with a valid PDF file after manipulating the original PostScript content.

Here is the piece of code that didn't work for me:

pdf_file = open('Input.pdf', 'r', encoding='ansi').read()
pdf_file_bytes = bytearray(pdf_file, 'ansi')
pdf_file = open('Output_bytes.pdf', 'wb').write(pdf_file_bytes)

And as I said, the output PDF is not valid!

Answer 1:

First problem: the content of a PDF file is PDF, not PostScript.

Secondly, PDF is a binary file format, so if you copy and paste it as text, any kind of translation (such as CR/LF conversion) will break it.

You haven't said what programming language your code uses, though it looks like Python. If it is Python then reading the file as binary instead of text might help.
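
To illustrate the binary-mode suggestion, here is a minimal Python sketch. The sample bytes are a stand-in rather than a real PDF, the helper names `copy_binary` and `read_via_text_mode` are made up for this illustration, and `latin-1` is used in place of `'ansi'` (which, as far as I know, is only available as a codec name on Windows) so that any difference comes from newline translation alone.

```python
import os
import tempfile


def copy_binary(src_path, dst_path):
    """Copy a file byte for byte, with no decoding or newline translation."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return data


def read_via_text_mode(path):
    """Read through text mode, then re-encode, to show how bytes get altered."""
    # latin-1 maps every byte to a code point, so decoding itself is lossless.
    # Default text mode still converts \r\n and lone \r into \n on read.
    with open(path, "r", encoding="latin-1") as f:
        return f.read().encode("latin-1")


# Stand-in bytes (assumption, not a real PDF): CR/LF line ends plus binary data.
sample = b"%PDF-1.4\r\n%\xe2\xe3\xcf\xd3\r\nstream\r\n\x00\x0d\xff\r\nendstream\r\n"

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "Input.pdf")
    dst = os.path.join(tmp, "Output_bytes.pdf")
    with open(src, "wb") as f:
        f.write(sample)

    binary_copy = copy_binary(src, dst)
    text_round_trip = read_via_text_mode(src)
    with open(dst, "rb") as f:
        written = f.read()

print("binary copy identical:", written == sample)
print("text round trip identical:", text_round_trip == sample)
```

The binary copy keeps every byte, while the text-mode round trip silently changes the carriage returns, which is enough to corrupt stream data and shift every byte position after it.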

Answer 2:

PDF is a complex file format consisting of various objects. Unless you understand the low-level syntax of the PDF specification thoroughly, it will be difficult, if not impossible, to arbitrarily replace some bytes with other bytes and still end up with a valid PDF file.

More to the point, what are you trying to accomplish? There may be a high-level way of doing it that doesn't involve manipulating PDF syntax directly, e.g. if you need to modify a font, add an annotation, set the PDF version, etc. Otherwise, if you actually need to modify PDF syntax, you need to use a library capable of dealing with low-level objects.
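
One concrete reason hand edits tend to break things: a PDF's cross-reference table and its `startxref` entry record byte offsets, so inserting bytes earlier in the file leaves those offsets stale. The following is a rough sketch of that effect on a tiny hand-built, PDF-like byte string. The functions `build_minimal_pdf` and `startxref_is_consistent` are hypothetical helpers written for this illustration, and the check is far simpler than what a real PDF reader does.

```python
import re


def build_minimal_pdf():
    """Assemble a tiny PDF-like byte string with a matching xref table."""
    body = b"%PDF-1.4\n"
    objects = [
        b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
        b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n",
    ]
    offsets = []
    for obj in objects:
        offsets.append(len(body))  # byte offset where this object starts
        body += obj
    xref_offset = len(body)
    xref = b"xref\n0 3\n0000000000 65535 f \n"
    for off in offsets:
        xref += b"%010d 00000 n \n" % off
    trailer = (
        b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
        % xref_offset
    )
    return body + xref + trailer


def startxref_is_consistent(data):
    """Check that the startxref value points at the 'xref' keyword."""
    match = re.search(rb"startxref\s+(\d+)", data)
    if match is None:
        return False
    offset = int(match.group(1))
    return data[offset:offset + 4] == b"xref"


original = build_minimal_pdf()

# Naively splice a new object in after the header without updating any offsets.
new_object = b"3 0 obj\n<< /Note (added by hand) >>\nendobj\n"
header_end = original.index(b"\n") + 1
edited = original[:header_end] + new_object + original[header_end:]

print("original consistent:", startxref_is_consistent(original))
print("edited consistent:", startxref_is_consistent(edited))
```

Some readers may try to repair a file like the edited one, but it is no longer structurally consistent, which is why adding objects by hand also means rewriting the offsets that refer to them.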

Source: https://stackoverflow.com/questions/55261941/manipulating-pdf-file

Tags

pdf

ansi

pdf-manipulation