PDFBox pdf to image generates overlapping text

[愿得一人] 2020-12-20 08:49

For a side project I started using PDFBox to convert a PDF file to an image. This is the PDF file I am converting: https://bitcoin.org/bitcoin.pdf

Thi

1 answer
  • 2020-12-20 09:04

    If you look at the logging output (you may need to activate logging in your environment), you'll see many entries like these (generated using PDFBox 1.8.5):

    Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
    Warnung: Changing font on <t> from <Century Schoolbook Fett> to the default font
    Jun 16, 2014 8:40:43 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
    Warnung: Changing font on <S> from <Times New Roman> to the default font
    Jun 16, 2014 8:40:46 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
    Warnung: Changing font on <c> from <Arial> to the default font
    Jun 16, 2014 8:40:52 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont drawString
    Warnung: Changing font on <i> from <Courier New> to the default font
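
If no such entries show up in your environment, the following is a minimal sketch of one way to switch them on. It assumes PDFBox's log records end up in java.util.logging (the format of the entries above suggests this, but a different logging backend on your classpath would need its own configuration). The class name PdfBoxLoggingSetup and the sample message are made up for illustration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class PdfBoxLoggingSetup {

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so an unreferenced logger may lose its configuration.
    private static final Logger PDFBOX_LOGGER = Logger.getLogger("org.apache.pdfbox");

    /** Lets WARNING and more severe records below org.apache.pdfbox reach the console. */
    public static Logger enableWarnings() {
        PDFBOX_LOGGER.setLevel(Level.WARNING);

        // Do not stack a second console handler if this is called again.
        for (Handler existing : PDFBOX_LOGGER.getHandlers()) {
            if (existing instanceof ConsoleHandler) {
                return PDFBOX_LOGGER;
            }
        }

        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.WARNING);
        PDFBOX_LOGGER.addHandler(handler);

        // Otherwise the root logger's handlers would print each record a second time.
        PDFBOX_LOGGER.setUseParentHandlers(false);
        return PDFBOX_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = enableWarnings();
        // Sample record to check the setup; this is not PDFBox output.
        logger.warning("PDFBox warnings will now be printed to the console");
    }
}
```

Call enableWarnings() once before loading the document, so that font substitution warnings emitted during rendering are not lost.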
    

    So PDFBox renders the text with fonts other than the ones the PDF specifies. This explains both the many inserted commas and the overlapping text:

    1. different fonts may have different encodings. It looks like your sample PDF uses an encoding which has a comma where the default font assumed by PDFBox has a space character;
    2. different fonts have different glyph widths. In your sample PDF the different glyph widths cause overlapping text.

    This results in: [Image: first page of the sample document rendered using PDFBox 1.8.5]
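
The glyph width effect can be illustrated with a little arithmetic. The sketch below is a simplified model, not PDFBox code: it assumes the position of the next piece of text is computed from the widths of the font named in the PDF, while the glyphs are actually drawn with a substitute font. The width tables (in 1/1000 em units) and the class name GlyphWidthOverlapDemo are hypothetical values chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphWidthOverlapDemo {

    /** Builds a width table: the i-th character of chars gets the i-th value. */
    public static Map<Character, Integer> widths(String chars, int... values) {
        Map<Character, Integer> table = new HashMap<>();
        for (int i = 0; i < chars.length(); i++) {
            table.put(chars.charAt(i), values[i]);
        }
        return table;
    }

    /** Sums the advance widths of all characters in text (unknown characters count as 0). */
    public static int advance(String text, Map<Character, Integer> font) {
        int total = 0;
        for (char c : text.toCharArray()) {
            total += font.getOrDefault(c, 0);
        }
        return total;
    }

    /**
     * Returns how far the drawn text runs past the point where the next
     * piece of text starts, or 0 if it ends before that point.
     */
    public static int overlap(String text,
                              Map<Character, Integer> layoutFont,
                              Map<Character, Integer> drawnFont) {
        int nextStart = advance(text, layoutFont); // position planned for the following text
        int drawnEnd = advance(text, drawnFont);   // where the substituted glyphs really end
        return Math.max(0, drawnEnd - nextStart);
    }

    public static void main(String[] args) {
        // Hypothetical proportional font named in the PDF.
        Map<Character, Integer> proportional = widths("ilm", 278, 278, 833);
        // Hypothetical monospaced substitute used for drawing.
        Map<Character, Integer> substitute = widths("ilm", 600, 600, 600);

        System.out.println(overlap("ill", proportional, substitute)); // prints 966
        System.out.println(overlap("mm", proportional, substitute));  // prints 0
    }
}
```

Narrow glyphs drawn with a wider substitute run into the following text, while wide glyphs drawn with a narrower substitute leave gaps instead.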

    The reason for all this is that PDFBox 1.8.x does not properly support all kinds of fonts for rendering. You might want to try PDFBox 2.0.0-SNAPSHOT, the new PDFBox version currently under development, instead. Be aware, though, that the rendering classes have changed.

    Using PDFBox 2.0.0-SNAPSHOT

    Using the current (mid-June 2014) state of PDFBox 2.0.0-SNAPSHOT you can render PDFs like this:

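    // Written against the mid-June 2014 PDFBox 2.0.0-SNAPSHOT API;
    // resource is the PDF input to convert.
    PDDocument document = PDDocument.loadNonSeq(resource, null);
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    @SuppressWarnings("unchecked")
    List<PDPage> pages = catalog.getAllPages();
    
    PDFRenderer renderer = new PDFRenderer(document);
    
    // Render each page (0-based index) and save it as a PNG file.
    for (int i = 0; i < pages.size(); i++)
    {
        BufferedImage image = renderer.renderImage(i);
        ImageIO.write(image, "png", new File("bitcoin-convertToImage-" + i + ".png"));
    }
    
    // Release the resources held by the document.
    document.close();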
    

    The result with this code is: [Image: first page of the sample document rendered using the current PDFBox 2.0.0-SNAPSHOT]

    Other PDFRenderer.renderImage overloads allow you to explicitly set the desired resolution.

    PS: As proposed by Tilman Hausherr, you may want to replace the ImageIO.write call with:

        ImageIOUtil.writeImage(image, "bitcoin-convertToImage-" + i + ".png", 72);
    

    ImageIOUtil is a PDFBox helper class which tries to optimize the selection of the ImageIO writer and to add a DPI attribute to the image file.

    If you use a different PDFRenderer.renderImage overload to set a resolution, remember to change the final parameter 72 here accordingly.
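
To see why the two values have to match, recall that PDF page dimensions are measured in points of 1/72 inch, so 72 DPI maps one point to one pixel. The sketch below only does that arithmetic and does not call PDFBox; the class name RenderSizeDemo and the US Letter page size (612 x 792 points) are assumptions for illustration, and the exact rounding a renderer applies may differ.

```java
public class RenderSizeDemo {

    /** PDF user space uses 72 units (points) per inch by default. */
    public static final float POINTS_PER_INCH = 72f;

    /** Scale factor relative to the 72 DPI baseline, e.g. 2.0 for 144 DPI. */
    public static float scaleForDpi(float dpi) {
        return dpi / POINTS_PER_INCH;
    }

    /** Approximate pixel length of a page side given in points at the chosen DPI. */
    public static int pixels(float points, float dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    public static void main(String[] args) {
        float widthPt = 612f;  // hypothetical US Letter page width in points
        float heightPt = 792f; // hypothetical US Letter page height in points

        // At 72 DPI one point becomes one pixel.
        System.out.println(pixels(widthPt, 72f) + " x " + pixels(heightPt, 72f));   // prints 612 x 792

        // At 300 DPI the same page needs far more pixels.
        System.out.println(pixels(widthPt, 300f) + " x " + pixels(heightPt, 300f)); // prints 2550 x 3300
    }
}
```

If an image rendered at 300 DPI were saved with a DPI attribute of 72, viewers that honor the attribute would treat the 2550 pixel wide image as about 35 inches wide instead of 8.5.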
