How to determine the location of the actual PDF content with PDFBox?

刺人心 2020-12-19 19:47

We're printing some PDFs from a Java desktop app, using PDFBox, and the PDFs contain too much whitespace (fixing the PDF generator is unfortunately not an option).

1 answer
  • 2020-12-19 20:32

    As you have mentioned in a comment that

    it can be assumed that there is no background or other elements that would need special handling,

    I'll show a basic solution without any such special handling.

    A basic bounding box finder

    To find the bounding box without actually rendering to a bitmap and inspecting the bitmap pixels, one has to scan all the instructions of the content streams of the page and any XObjects referenced from there. One determines the bounding boxes of the stuff drawn by each instruction and eventually combines them to a single box.

    The simple box finder presented here combines them by simply returning the bounding box of their union.
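    As a small illustration of that union step, here is a sketch that uses only java.awt.geom and is independent of PDFBox. The class name UnionBoxDemo and the method union are made up for this example; the sample coordinates are arbitrary.

```java
import java.awt.geom.Rectangle2D;

class UnionBoxDemo {
    /** Returns the smallest rectangle containing all given boxes, or null if there are none. */
    static Rectangle2D union(Rectangle2D... boxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : boxes) {
            if (result == null) {
                // Start from a copy of the first box. Starting from an empty
                // Rectangle2D.Double() would wrongly pull the origin (0, 0) into the union.
                result = new Rectangle2D.Double();
                result.setRect(box);
            } else {
                result.add(box);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical boxes, e.g. one line of text and one image
        Rectangle2D text = new Rectangle2D.Double(72, 700, 200, 12);
        Rectangle2D image = new Rectangle2D.Double(100, 400, 300, 150);
        System.out.println(union(text, image));
    }
}
```

    Using null as the "nothing found yet" value is the same choice the box finder makes in its add methods, and it lets a caller distinguish an empty page from a page with content at the origin.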

    For scanning the instructions of content streams, PDFBox offers a number of classes based on the PDFStreamEngine. The simple box finder is derived from the PDFGraphicsStreamEngine, which extends the PDFStreamEngine with some methods related to vector graphics.

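    import java.awt.Shape;
    import java.awt.geom.AffineTransform;
    import java.awt.geom.GeneralPath;
    import java.awt.geom.Point2D;
    import java.awt.geom.Rectangle2D;
    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.fontbox.util.BoundingBox;
    import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDRectangle;
    import org.apache.pdfbox.pdmodel.font.PDCIDFontType2;
    import org.apache.pdfbox.pdmodel.font.PDFont;
    import org.apache.pdfbox.pdmodel.font.PDSimpleFont;
    import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;
    import org.apache.pdfbox.pdmodel.font.PDType0Font;
    import org.apache.pdfbox.pdmodel.font.PDType3CharProc;
    import org.apache.pdfbox.pdmodel.font.PDType3Font;
    import org.apache.pdfbox.pdmodel.font.PDVectorFont;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
    import org.apache.pdfbox.util.Matrix;
    import org.apache.pdfbox.util.Vector;

    /**
     * Collects the bounding box of everything a page draws (text, bitmaps, paths)
     * without rendering the page.
     */
    public class BoundingBoxFinder extends PDFGraphicsStreamEngine {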
        public BoundingBoxFinder(PDPage page) {
            super(page);
        }
    
        public Rectangle2D getBoundingBox() {
            return rectangle;
        }
    
        //
        // Text
        //
        @Override
        protected void showGlyph(Matrix textRenderingMatrix, PDFont font, int code, String unicode, Vector displacement)
                throws IOException {
            super.showGlyph(textRenderingMatrix, font, code, unicode, displacement);
            Shape shape = calculateGlyphBounds(textRenderingMatrix, font, code);
            if (shape != null) {
                Rectangle2D rect = shape.getBounds2D();
                add(rect);
            }
        }
    
        /**
         * Copy of <code>org.apache.pdfbox.examples.util.DrawPrintTextLocations.calculateGlyphBounds(Matrix, PDFont, int)</code>.
         */
        private Shape calculateGlyphBounds(Matrix textRenderingMatrix, PDFont font, int code) throws IOException
        {
            GeneralPath path = null;
            AffineTransform at = textRenderingMatrix.createAffineTransform();
            at.concatenate(font.getFontMatrix().createAffineTransform());
            if (font instanceof PDType3Font)
            {
                // It is difficult to calculate the real individual glyph bounds for type 3 fonts
                // because these are not vector fonts, the content stream could contain almost anything
                // that is found in page content streams.
                PDType3Font t3Font = (PDType3Font) font;
                PDType3CharProc charProc = t3Font.getCharProc(code);
                if (charProc != null)
                {
                    BoundingBox fontBBox = t3Font.getBoundingBox();
                    PDRectangle glyphBBox = charProc.getGlyphBBox();
                    if (glyphBBox != null)
                    {
                        // PDFBOX-3850: glyph bbox could be larger than the font bbox
                        glyphBBox.setLowerLeftX(Math.max(fontBBox.getLowerLeftX(), glyphBBox.getLowerLeftX()));
                        glyphBBox.setLowerLeftY(Math.max(fontBBox.getLowerLeftY(), glyphBBox.getLowerLeftY()));
                        glyphBBox.setUpperRightX(Math.min(fontBBox.getUpperRightX(), glyphBBox.getUpperRightX()));
                        glyphBBox.setUpperRightY(Math.min(fontBBox.getUpperRightY(), glyphBBox.getUpperRightY()));
                        path = glyphBBox.toGeneralPath();
                    }
                }
            }
            else if (font instanceof PDVectorFont)
            {
                PDVectorFont vectorFont = (PDVectorFont) font;
                path = vectorFont.getPath(code);
    
                if (font instanceof PDTrueTypeFont)
                {
                    PDTrueTypeFont ttFont = (PDTrueTypeFont) font;
                    int unitsPerEm = ttFont.getTrueTypeFont().getHeader().getUnitsPerEm();
                    at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                }
                if (font instanceof PDType0Font)
                {
                    PDType0Font t0font = (PDType0Font) font;
                    if (t0font.getDescendantFont() instanceof PDCIDFontType2)
                    {
                        int unitsPerEm = ((PDCIDFontType2) t0font.getDescendantFont()).getTrueTypeFont().getHeader().getUnitsPerEm();
                        at.scale(1000d / unitsPerEm, 1000d / unitsPerEm);
                    }
                }
            }
            else if (font instanceof PDSimpleFont)
            {
                PDSimpleFont simpleFont = (PDSimpleFont) font;
    
                // these two lines do not always work, e.g. for the TT fonts in file 032431.pdf
                // which is why PDVectorFont is tried first.
                String name = simpleFont.getEncoding().getName(code);
                path = simpleFont.getPath(name);
            }
            else
            {
                // shouldn't happen, please open issue in JIRA
                System.out.println("Unknown font class: " + font.getClass());
            }
            if (path == null)
            {
                return null;
            }
            return at.createTransformedShape(path.getBounds2D());
        }
    
        //
        // Bitmaps
        //
        @Override
        public void drawImage(PDImage pdImage) throws IOException {
            Matrix ctm = getGraphicsState().getCurrentTransformationMatrix();
            for (int x = 0; x < 2; x++) {
                for (int y = 0; y < 2; y++) {
                    add(ctm.transformPoint(x, y));
                }
            }
        }
    
        //
        // Paths
        //
        @Override
        public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) throws IOException {
            addToPath(p0, p1, p2, p3);
        }
    
        @Override
        public void clip(int windingRule) throws IOException {
        }
    
        @Override
        public void moveTo(float x, float y) throws IOException {
            addToPath(x, y);
        }
    
        @Override
        public void lineTo(float x, float y) throws IOException {
            addToPath(x, y);
        }
    
        @Override
        public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) throws IOException {
            addToPath(x1, y1);
            addToPath(x2, y2);
            addToPath(x3, y3);
        }
    
        @Override
        public Point2D getCurrentPoint() throws IOException {
            return null;
        }
    
        @Override
        public void closePath() throws IOException {
        }
    
        @Override
        public void endPath() throws IOException {
            rectanglePath = null;
        }
    
        @Override
        public void strokePath() throws IOException {
            addPath();
        }
    
        @Override
        public void fillPath(int windingRule) throws IOException {
            addPath();
        }
    
        @Override
        public void fillAndStrokePath(int windingRule) throws IOException {
            addPath();
        }
    
        @Override
        public void shadingFill(COSName shadingName) throws IOException {
        }
    
        void addToPath(Point2D... points) {
            Arrays.asList(points).forEach(p -> addToPath(p.getX(), p.getY()));
        }
    
        void addToPath(double newx, double newy) {
            if (rectanglePath == null) {
                rectanglePath = new Rectangle2D.Double(newx, newy, 0, 0);
            } else {
                rectanglePath.add(newx, newy);
            }
        }
    
        void addPath() {
            if (rectanglePath != null) {
                add(rectanglePath);
                rectanglePath = null;
            }
        }
    
        void add(Rectangle2D rect) {
            if (rectangle == null) {
                rectangle = new Rectangle2D.Double();
                rectangle.setRect(rect);
            } else {
                rectangle.add(rect);
            }
        }
    
        void add(Point2D... points) {
            for (Point2D point : points) {
                add(point.getX(), point.getY());
            }
        }
    
        void add(double newx, double newy) {
            if (rectangle == null) {
                rectangle = new Rectangle2D.Double(newx, newy, 0, 0);
            } else {
                rectangle.add(newx, newy);
            }
        }
    
        Rectangle2D rectanglePath = null;
        Rectangle2D rectangle = null;
    }
    

    (BoundingBoxFinder on GitHub)

    As you can see I borrowed the calculateGlyphBounds helper method from a PDFBox example class.

    A usage example

    You can use the BoundingBoxFinder like this to draw a border line along the bounding box rim for a given PDPage pdPage of a PDDocument pdDocument:

    void drawBoundingBox(PDDocument pdDocument, PDPage pdPage) throws IOException {
        BoundingBoxFinder boxFinder = new BoundingBoxFinder(pdPage);
        boxFinder.processPage(pdPage);
        Rectangle2D box = boxFinder.getBoundingBox();
        if (box != null) {
            try (PDPageContentStream canvas = new PDPageContentStream(pdDocument, pdPage, AppendMode.APPEND, true, true)) {
                canvas.setStrokingColor(Color.magenta);
                canvas.addRect((float)box.getMinX(), (float)box.getMinY(), (float)box.getWidth(), (float)box.getHeight());
                canvas.stroke();
            }
        }
    }
    

    (DetermineBoundingBox helper method)

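    The result looks like this: (screenshot not preserved; given the code above, it showed the page with a magenta rectangle drawn around its content)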

    Only a proof-of-concept

    Beware, the BoundingBoxFinder is not very sophisticated. In particular, it does not ignore invisible content such as a white background rectangle, text drawn in rendering mode "invisible", arbitrary content covered by a white filled path, or white parts of bitmap images. Furthermore, it ignores clip paths, unusual blend modes, annotations, and so on.
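    To give an idea of what one such extension could look like, here is a minimal sketch of honoring a clip area: each drawn box is intersected with the current clip bounds before it is merged into the result. It uses only java.awt.geom. The names ClipBoxDemo and clipTo are hypothetical, and the sketch assumes the clip bounds are tracked elsewhere (for example in clip() and across graphics state save and restore), which is not shown.

```java
import java.awt.geom.Rectangle2D;

class ClipBoxDemo {
    /**
     * Restricts a drawn box to the clip bounds.
     * Returns null when no part of the box lies inside the clip.
     */
    static Rectangle2D clipTo(Rectangle2D drawn, Rectangle2D clip) {
        if (clip == null) {
            return drawn; // no clip known: keep the box unchanged
        }
        Rectangle2D.Double visible = new Rectangle2D.Double();
        Rectangle2D.intersect(drawn, clip, visible);
        // For disjoint inputs, intersect() yields a negative width or height.
        // Zero is allowed on purpose so horizontal and vertical lines survive.
        if (visible.getWidth() < 0 || visible.getHeight() < 0) {
            return null;
        }
        return visible;
    }
}
```

    A real clip path can have any shape, so working with its rectangular bounds only gives an upper estimate of the visible area.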

    Extending the class to properly handle those cases is pretty straightforward, but the sum of the code to add would exceed the scope of a Stack Overflow answer.


    For the code in this answer I used the current PDFBox 3.0.0-SNAPSHOT development branch but it should also work out of the box for current 2.x versions.
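    For the printing use case from the question, one possible next step is to turn the bounding box into a crop area: grow it by a small margin and limit it to the page. The following is only a sketch of that arithmetic with java.awt.geom; the names CropBoxMath and cropArea and the margin parameter are assumptions for illustration. In PDFBox, the resulting values could then be used to build a PDRectangle and passed to PDPage.setCropBox, a step that is not shown here.

```java
import java.awt.geom.Rectangle2D;

class CropBoxMath {
    /**
     * Grows the content box by the margin on every side and then limits
     * the result to the page area (for example the media box).
     */
    static Rectangle2D cropArea(Rectangle2D box, Rectangle2D page, double margin) {
        Rectangle2D.Double grown = new Rectangle2D.Double(
                box.getMinX() - margin,
                box.getMinY() - margin,
                box.getWidth() + 2 * margin,
                box.getHeight() + 2 * margin);
        // Keep only the part of the grown box that lies on the page
        return grown.createIntersection(page);
    }
}
```

    The sketch assumes the content box lies on the page and that both rectangles use the same coordinate system as the box returned by getBoundingBox().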
