I am trying to learn the PDF file format.
To this end I downloaded Adobe's PDF specification file, which is huge.
So to help me study the details of PDF, I want to follow its abstract explanations by looking in parallel at some real-world PDF files.
For example, one idea was to create a PDF file (using LaTeX) with only one page whose entire content is the single character "a".
But when I open this PDF file in a hex editor (or in other tools that can show the internal PDF structure), there is a lot of binary or compressed content inside it. For an example of what I see, look at the screenshot below:

I simply cannot identify which part of this binary represents my character "a" in this PDF.
The same happens with all the real-world PDF files I've tried so far. I simply cannot find any PDF files which contain working example code to help me understand the generic PDF language specification.
I would like others to explain to me: is there a practical way to study the PDF specification while at the same time verifying its bits and pieces with real PDF files?
I would like to know: which software tools are commonly used by PDF programmers that would help a newbie developer like me to dissect and un-compress existing binary PDF files so their source code can be investigated using a simple text editor? (Note: I'm not asking for a recommendation. In compliance with the SO FAQ I just want to know if such tools do exist, and which names they have.)
Is there a resource of freely available PDF files which don't contain binary and/or compressed content? Or how could I create my own such example files?
Are there (preferably free) PDF editors/parsers available which can visualize + dissect the raw binary data of PDF files and expose their structure?
I only need a first hook. The entry point, if you will, to the narrow path in the thick jungle of real-world PDF files, which I then could follow along... while using the help of this bushwhacker called 'PDF Specification'.
The creators of iText (a Java/C# lib to create and manipulate PDFs) published a tool called RUPS.
From the sourceforge page:
RUPS is an abbreviation for Reading and Updating PDF Syntax. RUPS is a tool built on top of iText® that allows you to look inside a PDF document and browse the different PDF objects and content streams. (Updating PDFs isn't possible yet.)
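To get a feel for what "browsing PDF objects" means at the byte level, here is a rough Python sketch. It is not how RUPS works internally; the function name list_objects and the regex-based scan are assumptions made for illustration. It only sees objects stored in plain sight, not objects packed inside compressed object streams, and it reports the first /Type entry found in each object body, which may belong to a nested dictionary.

```python
import re

# An indirect object starts with "<number> <generation> obj" and ends with "endobj".
OBJ_RE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)
# Dictionaries often carry a /Type entry such as /Catalog, /Pages or /Page.
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def list_objects(pdf_bytes):
    """Return (object number, generation, /Type name or None) for each object found."""
    found = []
    for match in OBJ_RE.finditer(pdf_bytes):
        number, generation, body = match.groups()
        type_match = TYPE_RE.search(body)
        type_name = type_match.group(1).decode("ascii") if type_match else None
        found.append((int(number), int(generation), type_name))
    return found
```

Reading a file with open("some.pdf", "rb").read() (the file name is a placeholder) and passing the bytes to list_objects gives a crude table of contents that can be compared against what a structure browser shows.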
The way I helped myself to learn PDF syntax was this:
Looked for a tool that could de-compress PDFs (de-compress the internal streams).
Found qpdf, Jay Berkenbilt's command-line tool, described as: "does structural, content-preserving transformations on PDF files".
Routinely running qpdf --qdf input.pdf decompressed-input.pdf.
Opening the newly created decompressed-input.pdf in a text editor.
The --qdf mode of the tool transforms the binary and ASCII elements of PDFs in a very useful way, without changing their visual page appearance (and it's very fast):
Decompress previously compressed objects (exposing, for example, the PDF language source code of page element drawing operations).
Also expand object streams (ObjStm).
Normalize the presentation of arrays, strings etc.
Re-number objects so they start from 1 0 obj and then present them in ascending order in the file.
Repair broken xref entries.
Add comments which contain an object's original identity in the original file.
Add comments for each page.
...and some more.
Looking at these (now mostly ASCII) files in a normal text editor is much easier than trying to figure out the original binary PDF.
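To make the decompression step listed above concrete: most compressed streams in real-world PDFs use the FlateDecode filter, which is plain zlib/deflate. The following rough Python sketch (not how qpdf works internally) scans a file's bytes for stream ... endstream sections and tries to inflate each one. The name inflate_streams is made up for this illustration, and the regex scan is a simplification: a real parser would follow the xref table and honour each stream's /Length and /Filter entries.

```python
import re
import zlib

# Capture the raw bytes between the "stream" and "endstream" keywords.
# The lookbehind keeps the tail of "endstream" from being taken as a new "stream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the decompressed bytes of every stream that zlib can inflate."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj ignores the end-of-line bytes that precede "endstream".
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate-compressed (or it uses another filter): skip it.
            continue
    return results
```

Passing the bytes of a small PDF to inflate_streams should typically turn up the page's content stream, where the text-showing operators (Tj or TJ) appear. The string operands there may be font-specific character codes rather than readable ASCII, which is one reason a single character can be hard to spot.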
I would recommend taking a look at a few files using PDF Vole (a tool based on iText, and similar to RUPS).
PDF Vole and RUPS will both allow you to navigate through the structure of a PDF file, inspect the entries on every object, decompress compressed streams, decrypt the file when needed, look at the content of pages and annotations, and track down the relation between objects in the file.
For example this file:

Will look like this in PDF Vole:

You could also take a look at the class hierarchy of iText itself (which is almost 1-to-1 with the PDF spec) and the book that explains it, iText in Action.
If you are trying to generate PDF files via code, then this CodeProject source code might help.
The code along with the Adobe specification should get you going. I don't think there are many shortcuts here. Understanding PostScript is going to take some study!
EDIT: and seeing as PDF's page-drawing operators are derived from PostScript (although a PDF is not literally compressed PostScript), something like RoPS could be handy too.
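As a small illustration of generating a PDF via code, here is a minimal Python sketch that assembles a one-page, fully uncompressed PDF showing a single character. It is not taken from the CodeProject article; the function name build_minimal_pdf, the object layout, the Letter page size and the Helvetica font are all choices made for this example. It assumes the text is plain ASCII without parentheses or backslashes, which would need escaping inside a PDF string.

```python
def build_minimal_pdf(text="a"):
    """Assemble a one-page, uncompressed PDF that shows `text` in Helvetica."""
    # Content stream: begin text, select font F1 at 24 pt, move to (72, 720),
    # show the string, end text.
    content = b"BT /F1 24 Tf 72 720 Td (" + text.encode("ascii") + b") Tj ET"
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the result to disk, for example with open("minimal.pdf", "wb").write(build_minimal_pdf("a")), gives a file small enough to read line by line next to the specification: the (a) Tj operation in object 4 is the character on the page.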
Source: https://stackoverflow.com/questions/12620113/example-pdf-language-code-which-helps-to-study-the-official-pdf-specification