For a side project I started using PDFBox to convert pdf file to image. This is the pdf file I am using to convert to image file https://bitcoin.org/bitcoin.pdf.
Thi
If you look at the logging outputs (maybe you need to activate logging in your environment). you'll see many entries like these (generated using PDFBox 1.8.5):
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <S> from <Times New Roman> to the default font
Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <c> from <Arial> to the default font
Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <i> from <Courier New> to the default font
So PDFBox uses different fonts than the fonts indicated by the PDF for rendering the text of it. This explains both the lots of commas inserted and the overlapping text:
This results in
The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox currently under development, instead. Be aware, though, the classes for rendering have been changed.
Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT you can render PDFs like this:
PDDocument document = PDDocument.loadNonSeq(resource, null);
PDDocumentCatalog catalog = document.getDocumentCatalog();
@SuppressWarnings("unchecked")
List<PDPage> pages = catalog.getAllPages();
PDFRenderer renderer = new PDFRenderer(document);
for (int i = 0; i < pages.size(); i++)
{
BufferedImage image = renderer.renderImage(i);
ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
}
The result with this code is:
Other PDFRenderer.renderImage
overloads allow you to explicitly set the desired resolution.
PS: As proposed by Tilman Hausherr you may want to replace the ImageIO.write
call by
ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);
ImageIOUtil
is a PDFBox helper class which tries to optimize the selection of the ImageIO
writer and to add a DPI attribute to the image file.
If you use a different PDFRenderer.renderImage
overload to set a resolution, remember to change the final parameter 72
here accordingly.