Example PDF language code which helps to study the official PDF specification? [closed]

房东的猫 submitted on 2019-11-26 23:06:43

Question


I am trying to learn the PDF file format.

To this end I downloaded Adobe's PDF specification file, which is huge.

So to help me study the details of PDF, I want to follow its abstract explanations by looking in parallel at some real-world PDF files.

For example, one idea was to create a PDF file (using LaTeX) which has only one page whose entire content is a single character, a.

But when I open this PDF file in a hex editor (or in other tools that can show the internal PDF structure), I find a lot of binary or compressed content inside it. (The original post included a screenshot of this here.)

I simply cannot identify which part of this binary data represents my character a in this PDF.

The same happens with all the real-world PDF files I've tried so far. I simply cannot find any PDF files which contain working example code to help me understand the generic PDF language specification.

  • I would like others to explain to me: is there a practical way to study the PDF specification while at the same time verifying its bits and pieces with real PDF files?

  • I would like to know: which software tools are commonly used by PDF programmers that would help a newbie developer like me to dissect and un-compress existing binary PDF files so their source code can be investigated using a simple text editor? (Note: I'm not asking for a recommendation. In compliance with the SO FAQ I just want to know if such tools do exist, and which names they have.)

  • Is there a resource of freely available PDF files which don't contain binary and/or compressed content? Or how could I create my own such example files?

  • Are there (preferably free) PDF editors/parsers available which can visualize + dissect the raw binary data of PDF files and expose their structure?

I only need a first hook. The entry point, if you will, to the narrow path in the thick jungle of real-world PDF files, which I could then follow along... while using the help of this bushwhacker called 'PDF Specification'.


Answer 1:


The creators of iText (a Java/C# lib to create and manipulate PDFs) published a tool called RUPS.

From the SourceForge page:

RUPS is an abbreviation for Reading and Updating PDF Syntax. RUPS is a tool built on top of iText® that allows you to look inside a PDF document and browse the different PDF objects and content streams. (Updating PDFs isn't possible yet.)




Answer 2:


The way I helped myself to learn PDF syntax was this:

  • Looked for a tool that could decompress PDFs (that is, decompress their internal streams).

  • Found qpdf, Jay Berkenbilt's command-line tool, described as one that "does structural, content-preserving transformations on PDF files".

  • Routinely ran qpdf --qdf input.pdf decompressed-input.pdf.

  • Opened the newly created decompressed-input.pdf in a text editor.

The --qdf mode of the tool transforms the binary and ASCII elements of PDFs in a very useful way, without changing their visual page appearance (and it's very fast):

  1. Decompress previously compressed objects (exposing, for example, the PDF language source code of page element drawing operations).

  2. Also expand object streams (ObjStrm).

  3. Normalize the presentation of arrays, strings etc.

  4. Re-number objects so they start from 1 0 obj and then present them in ascending order in the file.

  5. Repair broken xref entries.

  6. Add comments which contain an object's original identity in the original file.

  7. Add comments for each page.

  8. ...and some more.

Looking at these (now mostly ASCII) files in a normal text editor is much easier than trying to figure out the original binary PDF.
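To make "decompressing the internal streams" concrete, here is a minimal Python sketch of the idea. It is not how qpdf works internally; it is an illustration that assumes the streams are plain FlateDecode (zlib) data, ignores the /Filter and /Length entries, and simply tries to inflate whatever sits between the stream and endstream keywords. The function name inflate_streams is made up for this example.

```python
import re
import zlib

# Everything between a "stream" keyword and the next "endstream".
# The lookbehind keeps the "stream" inside "endstream" from matching.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return the inflated payload of every stream that zlib accepts.

    Assumption: a stream is either zlib-compressed (FlateDecode) or
    something this sketch cannot handle, in which case it is skipped.
    """
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        # A decompress object tolerates the end-of-line bytes that
        # precede "endstream"; they end up in unused_data.
        inflater = zlib.decompressobj()
        try:
            results.append(inflater.decompress(raw))
        except zlib.error:
            # Not zlib data (uncompressed, or another filter): skip it.
            continue
    return results
```

Calling inflate_streams(open("input.pdf", "rb").read()) on a simple file should return a list of byte strings, some of which are page content streams containing operators such as BT, Tf, Tj and ET. For anything beyond a quick look, a real tool such as qpdf is the better choice, since it also handles other filters, object streams and encryption.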




Answer 3:


I would recommend taking a look at a few files using PDF Vole (a tool based on iText, and similar to RUPS).

PDF Vole and RUPS will both allow you to navigate through the structure of a PDF file, inspect the entries on every object, decompress compressed streams, decrypt the file when needed, look at the content of pages and annotations, and track down the relation between objects in the file.

[The original answer showed an example PDF file here, followed by a screenshot of how that file looks in PDF Vole.]

You could also take a look at the class hierarchy of iText itself (which is almost 1-to-1 with the PDF spec) and the book that explains it, iText in Action.
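As a rough illustration of the "relation between objects" that PDF Vole and RUPS let you browse, the following Python sketch builds a small reference graph from a PDF whose objects are stored as plain text (for example, a file written by hand or normalized with qpdf --qdf). It is only an approximation under that assumption: it uses regular expressions rather than a real PDF parser, and the name object_references is invented for this example.

```python
import re

# "N G obj ... endobj" blocks and "N G R" indirect references.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\b(.*?)\bendobj", re.DOTALL)
REF_RE = re.compile(rb"(\d+) (\d+) R\b")


def object_references(pdf_bytes: bytes) -> dict:
    """Map each object number to the sorted object numbers it refers to."""
    graph = {}
    for match in OBJ_RE.finditer(pdf_bytes):
        number = int(match.group(1))
        # Only look at the part before any stream payload, because
        # stream data may contain arbitrary bytes.
        header = match.group(3).split(b"stream", 1)[0]
        graph[number] = sorted(
            {int(ref.group(1)) for ref in REF_RE.finditer(header)}
        )
    return graph
```

For a typical one-page file the result shows the catalog pointing at the page tree, the page tree pointing at the page, and the page pointing back at its parent and on to its content stream, which is the same chain you would click through in a structure browser.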




Answer 4:


If you are trying to generate PDF files via code, then this CodeProject source code might help.

The code along with the Adobe specification should get you going. I don't think there are many shortcuts here. Understanding PostScript is going to take some study!

EDIT: and since PDF's page-description operators are closely related to PostScript (although PDF is not literally compressed PostScript), something like RoPS could be handy too.
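For generating a PDF via code, and for getting a file with no binary or compressed content at all, a small script can also write one by hand. The Python sketch below is one possible way to do it, not the CodeProject code mentioned above. It assumes a single US Letter page, the standard Helvetica font, and ASCII text without parentheses or backslashes (those would need escaping); the function name build_minimal_pdf is made up for this example.

```python
def build_minimal_pdf(text: str = "a") -> bytes:
    """Build a one-page, uncompressed PDF that shows `text` in Helvetica."""
    # Content stream: begin text, select font F1 at 24 pt, move to
    # (72, 720), show the string, end text.
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("ascii")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(content)
        + content
        + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the free object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the result to disk with open("minimal.pdf", "wb").write(build_minimal_pdf("a")) should give a file that can be read side by side with the specification in any text editor: the line containing (a) Tj is where the single character is drawn.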



Source: https://stackoverflow.com/questions/12620113/example-pdf-language-code-which-helps-to-study-the-official-pdf-specification
