PDFBox pdf to image generates overlapping text

Submitted by 我是研究僧i on 2019-11-29 12:52:40
mkl

If you look at the logging output (you may need to activate logging in your environment), you'll see many entries like these (generated using PDFBox 1.8.5):

Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <S> from <Times New Roman> to the default font
Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <c> from <Arial> to the default font
Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
Warnung: Changing font on <i> from <Courier New> to the default font
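
If these entries do not show up in your environment, the following is a minimal sketch of how to switch them on. It assumes that PDFBox's log calls end up in java.util.logging, which the format of the entries above suggests; if your setup routes logging through log4j or another backend, configure that backend instead. The class name PdfBoxLogging is made up for this illustration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of PDFBox.
class PdfBoxLogging {
    // Held in a static field so the configured logger is not garbage-collected.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Lets every record logged below org.apache.pdfbox reach the console. */
    public static Logger enable() {
        PDFBOX_LOGGER.setLevel(Level.ALL);
        // Do not also pass records to the root handlers, which would print them twice.
        PDFBOX_LOGGER.setUseParentHandlers(false);
        if (PDFBOX_LOGGER.getHandlers().length == 0) {
            Handler console = new ConsoleHandler();
            console.setLevel(Level.ALL);
            PDFBOX_LOGGER.addHandler(console);
        }
        return PDFBOX_LOGGER;
    }

    public static void main(String[] args) {
        enable().warning("logging for org.apache.pdfbox is active");
    }
}
```

Call PdfBoxLogging.enable() once before loading the document; the font substitution warnings are logged while the pages are being rendered.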

So PDFBox renders the text with fonts other than the ones the PDF specifies. This explains both the many inserted commas and the overlapping text:

  1. Different fonts may have different encodings. It looks like your sample PDF uses an encoding which has a comma where the default font assumed by PDFBox has a space character.
  2. Different fonts have different glyph widths. In your sample PDF the text is positioned for the widths of the original fonts, so the differently sized glyphs of the substitute font overlap.
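
To make the second point concrete, here is a small arithmetic sketch. The glyph widths are invented for illustration and are not taken from the sample PDF; the only fact relied on is that simple PDF fonts state glyph widths in thousandths of a text space unit. The class and method names are hypothetical.

```java
// Hypothetical illustration, not PDFBox code.
class GlyphWidthOverlap {
    /** Width of a run of glyphs: widths are in 1/1000 text space units, scaled by the font size. */
    public static float runWidth(int[] glyphWidths, float fontSize) {
        int total = 0;
        for (int w : glyphWidths) {
            total += w;
        }
        return total * fontSize / 1000f;
    }

    /**
     * How far a run drawn with the substitute font spills into the next run,
     * when the next run was positioned using the widths of the original font.
     */
    public static float overlap(int[] declaredWidths, int[] substituteWidths, float fontSize) {
        float spill = runWidth(substituteWidths, fontSize) - runWidth(declaredWidths, fontSize);
        return Math.max(0f, spill);
    }

    public static void main(String[] args) {
        int[] declared = {500, 500, 500, 500};   // invented widths of the font named in the PDF
        int[] substitute = {600, 600, 600, 600}; // invented widths of a wider default font
        System.out.println("overlap in text space units: " + overlap(declared, substitute, 10f));
    }
}
```

With these invented numbers, a four-glyph run at font size 10 is laid out as 20 units wide but drawn 24 units wide, so its last glyphs land on top of whatever text follows.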


The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox version currently under development, instead. Be aware, though, that the rendering classes have changed.

Using PDFBox 2.0.0-SNAPSHOT

Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT you can render PDFs like this:

// Load the PDF with the non-sequential parser (API of the mid-June 2014 snapshot).
PDDocument document = PDDocument.loadNonSeq(resource, null);
PDDocumentCatalog catalog = document.getDocumentCatalog();
@SuppressWarnings("unchecked")
List<PDPage> pages = catalog.getAllPages();

PDFRenderer renderer = new PDFRenderer(document);

// Render each page and write it to its own PNG file.
for (int i = 0; i < pages.size(); i++)
{
    BufferedImage image = renderer.renderImage(i);
    ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
}

// Release the resources held by the document.
document.close();


Other PDFRenderer.renderImage overloads allow you to explicitly set the desired resolution.

PS: As proposed by Tilman Hausherr, you may want to replace the ImageIO.write call with:

    ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);

ImageIOUtil is a PDFBox helper class which tries to optimize the selection of the ImageIO writer and to add a DPI attribute to the image file.

If you use a different PDFRenderer.renderImage overload to set a resolution, remember to change the final parameter 72 here accordingly.
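
The relation between a scale factor and a DPI value can be sketched as follows. It rests on the fact that PDF default user space has 72 units per inch, so a scale of 1 corresponds to 72 DPI. Whether the renderImage overload in your PDFBox version expects a scale factor or a DPI value is something to check in its Javadoc; the helper class and its method names are hypothetical and not part of PDFBox.

```java
// Hypothetical helper, not part of PDFBox.
class RenderResolution {
    /** PDF default user space: 72 units per inch. */
    public static final float UNITS_PER_INCH = 72f;

    /** Scale factor that corresponds to the given resolution (scale 1 equals 72 DPI). */
    public static float scaleForDpi(float dpi) {
        return dpi / UNITS_PER_INCH;
    }

    /** Approximate pixel count for a page dimension given in PDF units. */
    public static int pixelsFor(float pageUnits, float dpi) {
        return Math.round(pageUnits * dpi / UNITS_PER_INCH);
    }

    public static void main(String[] args) {
        float dpi = 300f;
        // A US Letter page measures 612 x 792 units.
        System.out.println("scale:  " + scaleForDpi(dpi));
        System.out.println("width:  " + pixelsFor(612f, dpi) + " px");
        System.out.println("height: " + pixelsFor(792f, dpi) + " px");
    }
}
```

Whatever resolution you render at, pass the same DPI value as the last argument of ImageIOUtil.writeImage so that the metadata written to the file matches the pixel data.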
