Text coordinates when stripping from PDFBox

轻奢々 2020-12-06 20:36

I'm trying to extract text with coordinates from a PDF file using PDFBox.

I combined some methods and information found on the internet (Stack Overflow too), but the problem I have th

2 answers
  •  萌比男神i
    2020-12-06 21:38

    The following code worked for me:

        // Definition of font baseline, ascent, descent: https://en.wikipedia.org/wiki/Ascender_(typography)
        //
        // The origin of the text coordinate system is the top-left corner, and Y increases downward.
        // TextPosition.getX() and getY() return the position on the baseline.
        // textPositions is assumed to be the List<TextPosition> passed to PDFTextStripper.writeString().
        TextPosition firstLetter = textPositions.get(0);
        TextPosition lastLetter = textPositions.get(textPositions.size() - 1);
    
        // Looking at LegacyPDFStreamEngine.showGlyph(), ascender and descender heights are calculated like
        // CapHeight: https://stackoverflow.com/a/42021225/14731
        // Point2D.Float (java.awt.geom) stands in for the original Point type here,
        // because java.awt.Point only accepts int coordinates.
        float ascent = firstLetter.getFont().getFontDescriptor().getAscent() / 1000 * firstLetter.getFontSize();
        Point2D topLeft = new Point2D.Float(firstLetter.getX(), firstLetter.getY() - ascent);
    
        float descent = lastLetter.getFont().getFontDescriptor().getDescent() / 1000 * lastLetter.getFontSize();
        // Descent is negative, so subtracting it moves the point downward.
        Point2D bottomRight = new Point2D.Float(lastLetter.getX() + lastLetter.getWidth(),
            lastLetter.getY() - descent);
    

    In other words, we are creating a bounding box from the font's ascender down to its descender.
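
To make the arithmetic concrete, here is a minimal sketch that runs without PDFBox. The class name `GlyphBox`, the method `boundingBox`, and the sample metrics (ascent 750, descent -250, font size 12) are made up for illustration. In real code the values would come from `TextPosition` and the font descriptor as in the snippet above.

```java
import java.util.Arrays;

class GlyphBox {
    /**
     * Returns {left, top, right, bottom} in a top-left-origin system where Y grows downward.
     * ascent and descent are in font units (1/1000 of the font size); descent is normally negative.
     */
    static float[] boundingBox(float startX, float endX, float baselineY,
                               float ascent, float descent, float fontSize) {
        // Moving up from the baseline means subtracting in this coordinate system.
        float top = baselineY - ascent / 1000f * fontSize;
        // Descent is negative, so subtracting it lands below the baseline.
        float bottom = baselineY - descent / 1000f * fontSize;
        return new float[] { startX, top, endX, bottom };
    }

    public static void main(String[] args) {
        // Hypothetical run of text: baseline at y = 100, spanning x = 72 to x = 172.
        float[] box = boundingBox(72f, 172f, 100f, 750f, -250f, 12f);
        System.out.println(Arrays.toString(box)); // prints [72.0, 91.0, 172.0, 103.0]
    }
}
```

With these sample numbers the box is 12 units tall, 9 above the baseline and 3 below it.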

    If you want to render these coordinates with the origin in the bottom-left corner, see https://stackoverflow.com/a/28114320/14731 for more details. You'll need to apply a transform like this:

    contents.transform(new Matrix(1, 0, 0, -1, 0, page.getHeight()));
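
As a rough illustration of what that matrix does, the sketch below applies the same six values by hand using the PDF convention x' = a*x + c*y + e, y' = b*x + d*y + f. The class name `FlipY`, the helper `apply`, and the page height of 792 points are assumptions for the example, not part of PDFBox.

```java
class FlipY {
    /** Applies a PDF-style matrix {a, b, c, d, e, f} to the point (x, y). */
    static float[] apply(float[] m, float x, float y) {
        float newX = m[0] * x + m[2] * y + m[4];
        float newY = m[1] * x + m[3] * y + m[5];
        return new float[] { newX, newY };
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumed page height in points
        // Same values as Matrix(1, 0, 0, -1, 0, pageHeight): X is unchanged, Y becomes pageHeight - y.
        float[] flip = { 1f, 0f, 0f, -1f, 0f, pageHeight };

        // A point 91 units below the top edge ends up 701 units above the bottom edge.
        float[] p = apply(flip, 72f, 91f);
        System.out.println(p[0] + ", " + p[1]); // prints 72.0, 701.0
    }
}
```

Applying the flip twice returns the original point, so the same matrix converts in both directions.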
    
