Question
For a side project I started using PDFBox to convert a PDF file to an image. The PDF file I am converting is https://bitcoin.org/bitcoin.pdf.
This is the code I am using. It is very simple and just calls PDFToImage, but the output JPG image looks really bad, with lots of commas inserted and some overlapping text.
String pdfPath = "C:\\bitcoin.pdf";
String[] args_2 = {
    "-startPage", "1",
    "-endPage", "1",
    "-outputPrefix", "my_image_2",
    // "-resolution", "1000",
    pdfPath
};
try {
    PDFToImage.main(args_2);
} catch (Exception e) {
    e.printStackTrace();
}
Answer 1:
If you look at the logging output (you may need to activate logging in your environment), you will see many entries like these (generated using PDFBox 1.8.5):
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <S> from <Times New Roman> to the default font
Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <c> from <Arial> to the default font
Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <i> from <Courier New> to the default font
So PDFBox renders the text with fonts other than the ones the PDF specifies. This explains both the many inserted commas and the overlapping text:
- Different fonts may have different encodings. It looks like your sample PDF uses an encoding that has a comma at the code where the default font assumed by PDFBox has a space character.
- Different fonts have different glyph widths. In your sample PDF the differing glyph widths cause the overlapping text.
This results in the garbled rendering described in the question (image not preserved).

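To make the glyph-width point concrete, here is a minimal sketch in plain Java (no PDFBox involved). It assumes a simplified model: the text is positioned using the advance widths of the font the PDF was laid out with, but each glyph is drawn with the width of a substituted fallback font. The width tables and the class name are hypothetical values chosen for illustration, not metrics of any real font, and this is not how PDFBox is implemented internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GlyphWidthMismatch {
    // Hypothetical advance widths (1/1000 text-space units); NOT real font metrics.
    static final Map<Character, Integer> LAYOUT_FONT = Map.of('i', 250, 'l', 250, 'm', 800);
    // Hypothetical fallback font in which every glyph is 600 units wide.
    static final Map<Character, Integer> FALLBACK_FONT = Map.of('i', 600, 'l', 600, 'm', 600);

    /** Returns the indices of glyphs whose drawn extent runs into the next glyph. */
    static List<Integer> overlappingGlyphs(String text) {
        List<Integer> overlaps = new ArrayList<>();
        int x = 0; // pen position, advanced with the widths the document was laid out with
        for (int i = 0; i < text.length() - 1; i++) {
            char c = text.charAt(i);
            int nextStart = x + LAYOUT_FONT.get(c);   // where the next glyph is placed
            int drawnEnd = x + FALLBACK_FONT.get(c);  // where the substituted glyph actually ends
            if (drawnEnd > nextStart) {
                overlaps.add(i);
            }
            x = nextStart;
        }
        return overlaps;
    }

    public static void main(String[] args) {
        // Narrow letters drawn with the wider fallback glyphs spill into their neighbours.
        System.out.println(overlappingGlyphs("mill"));
    }
}
```

In this toy model the narrow letters overlap their neighbours, while the wide "m" leaves a gap instead, which matches the uneven look of text rendered with a substituted font.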
The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox version currently under development, instead. Be aware, though, that the rendering classes have changed.
Using PDFBox 2.0.0-SNAPSHOT
Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT you can render PDFs like this:
// "resource" is the PDF input (e.g. an InputStream), defined elsewhere
PDDocument document = PDDocument.loadNonSeq(resource, null);
PDDocumentCatalog catalog = document.getDocumentCatalog();
@SuppressWarnings("unchecked")
List<PDPage> pages = catalog.getAllPages();
PDFRenderer renderer = new PDFRenderer(document);
for (int i = 0; i < pages.size(); i++)
{
    // render page i (zero-based) and write it as a PNG file
    BufferedImage image = renderer.renderImage(i);
    ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
}
document.close();
The result with this code is: (image not preserved)

Other PDFRenderer.renderImage overloads allow you to explicitly set the desired resolution.
PS: As proposed by Tilman Hausherr, you may want to replace the ImageIO.write call with
ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);
ImageIOUtil is a PDFBox helper class that tries to optimize the selection of the ImageIO writer and to add a DPI attribute to the image file.
If you use a different PDFRenderer.renderImage overload to set a resolution, remember to change the final parameter 72 here accordingly.
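The value 72 comes from PDF user space, which by default uses 72 units per inch, so rendering at scale 1.0 corresponds to 72 DPI. The following sketch (plain Java, no PDFBox calls) shows the arithmetic linking a target DPI, the scale factor, and the resulting pixel size. The class and method names are hypothetical, the US Letter page size is only an example and not a claim about the sample PDF, and the exact rounding a renderer applies may differ slightly.

```java
public class RenderSize {
    // Default PDF user space: 72 units ("points") per inch.
    static final double POINTS_PER_INCH = 72.0;

    /** Scale factor relative to the default 72 DPI rendering. */
    static double scaleForDpi(double dpi) {
        return dpi / POINTS_PER_INCH;
    }

    /** Approximate pixel length of a page side given in points, at the given DPI. */
    static long pixels(double points, double dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        double dpi = 300;
        // Example page size: US Letter, 612 x 792 points (an assumption for illustration).
        System.out.println("scale = " + scaleForDpi(dpi));
        System.out.println(pixels(612, dpi) + " x " + pixels(792, dpi) + " px");
    }
}
```

Whatever DPI you pick for rendering is the same number to pass as the last argument of ImageIOUtil.writeImage, so that the DPI attribute stored in the file matches the pixel data.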
Source: https://stackoverflow.com/questions/24237313/pdfbox-pdf-to-image-generates-overlapping-text