How can you find a problem with a programmatically generated PDF? [closed]

Question

My group has been using the iTextSharp library and C#/.NET to generate custom, dynamic PDFs. For the most part, this process is working great for our needs. The one problem we can run into during development/testing is layout issues that can cause the PDF to not open/render correctly in Adobe Reader, especially the newer versions of Acrobat/Reader.

The document will open and display correctly for the first X pages. But if there is an error, the remaining pages in the document will not display.

As mentioned, we are usually able to track this problem down to a layout-type issue with our C#/iText code. We eventually find the error by using the guess and check method, or divide and conquer. It works, but it doesn't feel like the best way to solve these problems.

I was wondering if there are any tools available that could speed up the process of validating a PDF document and could help to point out errors in the document?

Answer 1:

Validating PDF files can be quite a tricky task -- primarily because the tools required to do it properly are very expensive.

Acrobat has a tool (Advanced > Preflight > PDF Analysis > Report PDF syntax issues) that lets you scan a PDF for any syntax issues, but that tool can't be accessed programmatically.

Appligent has a tool called pdfHarmony, which is powered by Adobe's PDF Library and can be accessed programmatically, but it is very expensive (US$2500+). This option would give you the best results if you can afford it.

Another option is 3-Heights PDF Analysis & Repair. I don't know what its quality is like, but it is similarly expensive.

This PDF Validator tool on SourceForge might interest you. However, it only analyzes the document's structure and not the content itself, so corrupt images or content streams won't be picked up.

Unfortunately, due to the difficulty of analyzing PDF files in detail, there aren't really any free tools that can do it properly, but I suppose a tool that checks the document's structure is better than nothing.

Answer 2:

The "cheapest" (and at the same time quite reliable!) way is to use Ghostscript. Let Ghostscript interpret the PDF and see which return value it gives. If it has no problem, the PDF file should be OK. On Windows:

 gswin32c.exe ^
       -o nul ^
       -sDEVICE=nullpage ^
        d:/path/to/file.pdf

The nullpage output device will not create any new file, but Ghostscript will report on stdout/stderr if it encounters an error. Afterwards, check the value of the %errorlevel% pseudo environment variable. On Linux:

 gs \
       -o /dev/null \
       -sDEVICE=nullpage \
        /path/to/file.pdf

(Check the return value with echo $?; a value of 0 means "no problems".)

In case of errors, Ghostscript prints some diagnostic info which may be helpful to you. In any case, you can at least positively identify the files that have NO problems: if Ghostscript can process them, Acrobat (Reader) should have no problem rendering them either.
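If you want to run this check automatically over generated files (for example as part of a test run), the command above can be wrapped in a small script. The following is only a minimal sketch in Python, not something from the original setup: the function names build_gs_command and check_pdf are made up for illustration, and it assumes the Ghostscript executable is on the PATH under the name gs (on Windows you would pass gswin32c.exe or whatever your installation provides).

```python
import subprocess
import sys


def build_gs_command(pdf_path, gs_executable="gs"):
    """Build the nullpage command from the answer as an argument list.

    gs_executable is an assumption: "gs" on Linux, e.g. "gswin32c.exe"
    on Windows.
    """
    # Discard output: "nul" on Windows, "/dev/null" elsewhere.
    null_output = "nul" if sys.platform.startswith("win") else "/dev/null"
    return [
        gs_executable,
        "-o", null_output,
        "-sDEVICE=nullpage",
        str(pdf_path),
    ]


def check_pdf(pdf_path, gs_executable="gs", runner=subprocess.run):
    """Return (ok, diagnostics) for a single PDF.

    ok is True only when Ghostscript exits with return code 0.
    diagnostics is whatever Ghostscript wrote to stdout and stderr.
    The runner parameter exists only so the function can be exercised
    without Ghostscript installed.
    """
    result = runner(
        build_gs_command(pdf_path, gs_executable),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr
```

Calling check_pdf("/path/to/file.pdf") returns a pair: a boolean for the exit status and the text Ghostscript printed. A caller can log the text for any file where the boolean is False, which keeps the messages next to the file that produced them.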

Source: https://stackoverflow.com/questions/3631152/how-can-you-find-a-problem-with-a-programmatically-generated-pdf

Tags

pdf

pdf-generation

itextsharp

ghostscript