I'm trying to extract text with coordinates from a PDF file using PDFBox.
I mixed some methods/info found on the internet (Stack Overflow too), but the problem I have th
The following code worked for me:
// Definition of font baseline, ascent, descent: https://en.wikipedia.org/wiki/Ascender_(typography)
//
// The origin of the text coordinate system is the top-left corner where Y increases downward.
// TextPosition.getX(), getY() return the baseline.
TextPosition firstLetter = textPositions.get(0);
TextPosition lastLetter = textPositions.get(textPositions.size() - 1);
// Looking at LegacyPDFStreamEngine.showGlyph(), ascender and descender heights are calculated like
// CapHeight: https://stackoverflow.com/a/42021225/14731
// Use the first letter's own font size so the metric and the scale come from the same glyph.
float ascent = firstLetter.getFont().getFontDescriptor().getAscent() / 1000 * firstLetter.getFontSize();
Point topLeft = new Point(firstLetter.getX(), firstLetter.getY() - ascent);
float descent = lastLetter.getFont().getFontDescriptor().getDescent() / 1000 * lastLetter.getFontSize();
// Descent is negative, so we need to negate it to move downward.
Point bottomRight = new Point(lastLetter.getX() + lastLetter.getWidth(),
lastLetter.getY() - descent);
In other words, we are creating a bounding box from the font's ascender down to its descender.
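To make the arithmetic concrete, here is a minimal sketch of the same calculation without PDFBox, using only plain numbers. The class `TextBounds` and its methods are hypothetical names for illustration, and the sketch assumes the font descriptor reports ascent and descent in 1/1000 em units, as the code above does.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper (not part of PDFBox): builds a text bounding box from font metrics. */
final class TextBounds {
    private TextBounds() {}

    /** Scales a font-descriptor metric (assumed to be in 1/1000 em units) by the font size. */
    static float scaleMetric(float glyphUnits, float fontSize) {
        return glyphUnits / 1000f * fontSize;
    }

    /**
     * Computes a box in a top-left-origin coordinate system (Y grows downward).
     * firstX is the X of the first glyph, lastX and lastWidth describe the last glyph,
     * and baselineY is the shared baseline.
     */
    static Rectangle2D.Float boundingBox(float firstX, float lastX, float lastWidth,
                                         float baselineY, float ascentUnits,
                                         float descentUnits, float fontSize) {
        float ascent = scaleMetric(ascentUnits, fontSize);
        // Descent is negative, so subtracting it moves the bottom edge below the baseline.
        float descent = scaleMetric(descentUnits, fontSize);
        float top = baselineY - ascent;
        float bottom = baselineY - descent;
        float right = lastX + lastWidth;
        return new Rectangle2D.Float(firstX, top, right - firstX, bottom - top);
    }
}
```

For example, with a 12 pt font, ascent 750, descent -250, and a baseline at Y = 500, `TextBounds.boundingBox(100f, 180f, 20f, 500f, 750f, -250f, 12f)` spans from Y = 491 (top) to Y = 503 (bottom).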
If you want to render these coordinates with the origin in the bottom-left corner, see https://stackoverflow.com/a/28114320/14731 for more details. You'll need to apply a transform like this:
contents.transform(new Matrix(1, 0, 0, -1, 0, page.getHeight()));
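If it helps to see what that matrix does to a single point, the following sketch applies the same six values through `java.awt.geom.AffineTransform` instead of PDFBox's `Matrix`. The class `FlipY` and its method names are hypothetical, and the page height of 792 used in the example is only an assumed value (US Letter in points).

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Hypothetical helper: converts between top-left-origin and bottom-left-origin Y coordinates. */
final class FlipY {
    private FlipY() {}

    /** Same six values as Matrix(1, 0, 0, -1, 0, pageHeight): keep X, map Y to pageHeight - Y. */
    static AffineTransform flip(float pageHeight) {
        return new AffineTransform(1, 0, 0, -1, 0, pageHeight);
    }

    /** Maps one point through the flip; applying it twice returns the original point. */
    static Point2D.Float convert(float x, float y, float pageHeight) {
        Point2D.Float out = new Point2D.Float();
        flip(pageHeight).transform(new Point2D.Float(x, y), out);
        return out;
    }
}
```

With a page height of 792, `FlipY.convert(100f, 491f, 792f)` yields (100, 301): a point 491 units below the top edge is 301 units above the bottom edge.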