I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, a PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I expected: the output shows the coordinates of each individual character instead of the whole word.
Here is the code:
    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj() + "," + text.getYDirAdj()
                    + " fs=" + text.getFontSize()
                    + " xscale=" + text.getXScale()
                    + " height=" + text.getHeightDir()
                    + " space=" + text.getWidthOfSpace()
                    + " width=" + text.getWidthDirAdj() + "]" + text.getUnicode());
        }
    }
I am using PDFBox 2.0.
The last method in which PDFBox's PDFTextStripper class still has text with positions (before it is reduced to plain text) is the method

    /**
     * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
     * and just calls {@link #writeString(String)}.
     *
     * @param text The text to write to the stream.
     * @param textPositions The TextPositions belonging to the text.
     * @throws IOException If there is an error when writing the text.
     */
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException

One should intercept here because this method receives pre-processed, in particular sorted, TextPosition objects (if one requested sorting to start with).
(Actually I would have preferred to intercept in the calling method writeLine, which according to the names of its parameters and local variables has all the TextPosition instances of a line and calls writeString once per word; unfortunately, though, the PDFBox developers have declared this method private... well, maybe this will change by the final 2.0.0 release... nudge, nudge. Update: Unfortunately it has not changed in the release... sigh)
Furthermore, it is helpful to use a helper class to wrap sequences of TextPosition instances in a String-like class to make the code clearer.
With this in mind one can search for the variables like this:

    List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException {
        final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
        PDFTextStripper stripper = new PDFTextStripper() {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
                TextPositionSequence word = new TextPositionSequence(textPositions);
                String string = word.toString();

                int fromIndex = 0;
                int index;
                while ((index = string.indexOf(searchTerm, fromIndex)) > -1) {
                    hits.add(word.subSequence(index, index + searchTerm.length()));
                    fromIndex = index + 1;
                }
                super.writeString(text, textPositions);
            }
        };

        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(document);
        return hits;
    }

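A side note on the search loop in findSubwords: because fromIndex only advances by one character after each hit, overlapping occurrences of the search term are reported as well. Here is a minimal stand-alone sketch of just that loop on plain strings. The class name OverlappingSearchSketch and the method findAll are made up for illustration and are not part of PDFBox, and the guard for an empty search term is my own assumption (as far as I can tell, the loop would not terminate without it).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo of the search loop; not part of PDFBox.
class OverlappingSearchSketch {

    // Returns the start index of every occurrence of term in text,
    // including occurrences that overlap an earlier hit.
    static List<Integer> findAll(String text, String term) {
        List<Integer> starts = new ArrayList<>();
        if (term.isEmpty()) {
            return starts; // assumption: an empty term yields no hits
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1) {
            starts.add(index);
            fromIndex = index + 1; // step by one char, not by term.length()
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("Dear ${var1}, see ${var1}", "${var1}")); // [5, 18]
        System.out.println(findAll("aaaa", "aa"));                           // [0, 1, 2]
    }
}
```

If overlapping hits are not wanted, advancing fromIndex by term.length() instead of 1 would skip them.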
with this helper class:

    public class TextPositionSequence implements CharSequence {
        // Wraps the whole list
        public TextPositionSequence(List<TextPosition> textPositions) {
            this(textPositions, 0, textPositions.size());
        }

        // start is inclusive, end is exclusive
        public TextPositionSequence(List<TextPosition> textPositions, int start, int end) {
            this.textPositions = textPositions;
            this.start = start;
            this.end = end;
        }

        @Override
        public int length() {
            return end - start;
        }

        @Override
        public char charAt(int index) {
            TextPosition textPosition = textPositionAt(index);
            String text = textPosition.getUnicode();
            return text.charAt(0);
        }

        @Override
        public TextPositionSequence subSequence(int start, int end) {
            return new TextPositionSequence(textPositions, this.start + start, this.start + end);
        }

        @Override
        public String toString() {
            StringBuilder builder = new StringBuilder(length());
            for (int i = 0; i < length(); i++) {
                builder.append(charAt(i));
            }
            return builder.toString();
        }

        public TextPosition textPositionAt(int index) {
            return textPositions.get(start + index);
        }

        public float getX() {
            return textPositions.get(start).getXDirAdj();
        }

        public float getY() {
            return textPositions.get(start).getYDirAdj();
        }

        public float getWidth() {
            TextPosition first = textPositions.get(start);
            // end is exclusive, so the last element of the sequence is at end - 1
            TextPosition last = textPositions.get(end - 1);
            return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
        }

        final List<TextPosition> textPositions;
        final int start, end;
    }

To merely output their positions, widths, final letters, and final letter positions, you can then use this:

    void printSubwords(PDDocument document, String searchTerm) throws IOException {
        System.out.printf("* Looking for '%s'\n", searchTerm);
        for (int page = 1; page <= document.getNumberOfPages(); page++) {
            List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
            for (TextPositionSequence hit : hits) {
                TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
                System.out.printf(" Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                        page, hit.getX(), hit.getY(), hit.getWidth(),
                        lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
            }
        }
    }

For tests I created a small test file using MS Word:

The output of this test

    @Test
    public void testVariables() throws IOException {
        try (
            InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);
        ) {
            System.out.println("\nVariables.pdf\n-------------\n");
            printSubwords(document, "${var1}");
            printSubwords(document, "${var 2}");
        }
    }

is

    Variables.pdf
    -------------

    * Looking for '${var1}'
     Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
     Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
     Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
     Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
    * Looking for '${var 2}'
     Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
     Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
     Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
     Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

I was a bit surprised that ${var 2} was found at all when it sits on a single line; after all, the PDFBox code made me assume that the writeString method I overrode only receives single words. It looks as if it receives longer parts of the line than mere words...
If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.
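As one example of such an enhancement, a bounding box for a hit can be derived from the first and last glyph plus the tallest glyph height. The following is only a sketch under assumptions: Glyph is a hypothetical stand-in for TextPosition so the snippet runs without PDFBox, its fields are meant to mirror getXDirAdj(), getYDirAdj(), getWidthDirAdj() and getHeightDir(), and I assume that y is the baseline measured from the top of the page and that the hit sits on a single left-to-right line.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; Glyph stands in for PDFBox's TextPosition.
class BoundingBoxSketch {

    static final class Glyph {
        final float x, y, width, height;

        Glyph(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Returns {left, top, width, height} for glyphs in [start, end), end exclusive.
    // Assumes y is the baseline measured from the top of the page,
    // so the top edge of the box is baseline minus the tallest glyph height.
    static float[] bounds(List<Glyph> glyphs, int start, int end) {
        Glyph first = glyphs.get(start);
        Glyph last = glyphs.get(end - 1); // end is exclusive
        float maxHeight = 0f;
        for (int i = start; i < end; i++) {
            maxHeight = Math.max(maxHeight, glyphs.get(i).height);
        }
        float width = last.x + last.width - first.x;
        return new float[] { first.x, first.y - maxHeight, width, maxHeight };
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph(10f, 100f, 6f, 8f),
                new Glyph(16f, 100f, 6f, 10f),
                new Glyph(22f, 100f, 6f, 8f));
        System.out.println(Arrays.toString(bounds(line, 0, 3)));
    }
}
```

In TextPositionSequence itself the same arithmetic would read the values from textPositions.get(start) and textPositions.get(end - 1) instead of the stand-in objects.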
As mentioned, this is not an answer to your question, but below is a skeleton example of how you would do this in iText. This is not to say that the same is not possible in PDFBox.
Basically, you create a RenderListener that accepts the "parse events" as they happen. You pass this listener to PdfReaderContentParser.processContent. In the listener's renderText method you get all the information you need to reconstruct the layout, including x/y coordinates and the text/image/... that make up the content.

    RenderListener listener = new RenderListener() {
        @Override
        public void renderText(TextRenderInfo arg0) {
            LineSegment segment = arg0.getBaseline();
            int x = (int) segment.getStartPoint().get(Vector.I1);
            // smaller Y means closer to the BOTTOM of the page. So we negate the Y to get proper top-to-bottom ordering
            int y = -(int) segment.getStartPoint().get(Vector.I2);
            int endx = (int) segment.getEndPoint().get(Vector.I1);
            log.debug("renderText "+x+".."+endx+"/"+y+": "+arg0.getText());
            ...
        }
        ... // other overrides
    };
    PdfReaderContentParser p = new PdfReaderContentParser(reader);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        log.info("handling page "+i);
        p.processContent(i, listener);
    }
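
To illustrate the comment about negating Y: once the listener has collected the text chunks with their coordinates, they can be put into reading order by sorting on descending y (in PDF coordinates a larger y means higher on the page) and then on ascending x. Below is a small sketch that is independent of iText. Chunk and ReadingOrderSketch are hypothetical names, and it assumes that chunks on the same line share exactly the same integer y, which real documents will not always satisfy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; Chunk stands in for the data gathered in renderText.
class ReadingOrderSketch {

    static final class Chunk {
        final int x, y; // y in PDF coordinates: larger means higher on the page
        final String text;

        Chunk(int x, int y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Sorts top-to-bottom, then left-to-right, and joins the text:
    // a space between chunks on the same line, a newline between lines.
    static String join(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> -c.y)
                .thenComparingInt(c -> c.x));
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : sorted) {
            if (previous != null) {
                out.append(previous.y == c.y ? ' ' : '\n');
            }
            out.append(c.text);
            previous = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList(
                new Chunk(50, 700, "world"),
                new Chunk(10, 700, "Hello"),
                new Chunk(10, 680, "below"))));
    }
}
```

A more tolerant version would group chunks whose y values differ by less than some threshold instead of requiring equality.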