How to search for a specific string or word and its coordinates in a PDF document in Java

Anonymous (unverified), posted 2019-12-03 02:28:01

Question:

I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a couple of examples but did not get the result I wanted: the output shows the coordinates of each individual character rather than of the whole string.

Here is the code:

    @Override
    protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
        for (TextPosition text : textPositions) {
            System.out.println("String[" + text.getXDirAdj() + "," +
                    text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                    text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                    text.getWidthOfSpace() + " width=" +
                    text.getWidthDirAdj() + "]" + text.getUnicode());
        }
    }

I am using PDFBox 2.0.

Answer 1:

The last method in which PDFBox' PDFTextStripper class still has text with positions (before it is reduced to plain text) is the method

    /**
     * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
     * and just calls {@link #writeString(String)}.
     *
     * @param text The text to write to the stream.
     * @param textPositions The TextPositions belonging to the text.
     * @throws IOException If there is an error when writing the text.
     */
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException

One should intercept here because this method receives pre-processed, in particular sorted TextPosition objects (if one requested sorting to start with).

(Actually I would have preferred to intercept in the calling method writeLine which according to the names of its parameters and local variables has all the TextPosition instances of a line and calls writeString once per word; unfortunately, though, PDFBox developers have declared this method private... well, maybe this changes until the final 2.0.0 release... nudge, nudge. Update: Unfortunately it has not changed in the release... sigh)

Furthermore it is helpful to use a helper class to wrap sequences of TextPosition instances in a String-like class to make code clearer.

With this in mind one can search for the variables like this

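    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.apache.pdfbox.text.TextPosition;

    List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
    {
        final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
        PDFTextStripper stripper = new PDFTextStripper()
        {
            @Override
            protected void writeString(String text, List<TextPosition> textPositions) throws IOException
            {
                // Wrap the positions of this chunk so they can be searched like a string.
                TextPositionSequence word = new TextPositionSequence(textPositions);
                String string = word.toString();

                // Collect every occurrence of the search term in this chunk.
                int fromIndex = 0;
                int index;
                while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
                {
                    hits.add(word.subSequence(index, index + searchTerm.length()));
                    fromIndex = index + 1;
                }
                super.writeString(text, textPositions);
            }
        };

        // Restrict extraction to the requested page (page numbers are 1-based).
        stripper.setSortByPosition(true);
        stripper.setStartPage(page);
        stripper.setEndPage(page);
        stripper.getText(document);
        return hits;
    }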

with this helper class

    import java.util.List;
    import org.apache.pdfbox.text.TextPosition;

    public class TextPositionSequence implements CharSequence
    {
        public TextPositionSequence(List<TextPosition> textPositions)
        {
            this(textPositions, 0, textPositions.size());
        }

        public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
        {
            this.textPositions = textPositions;
            this.start = start;
            this.end = end;
        }

        @Override
        public int length()
        {
            return end - start;
        }

        @Override
        public char charAt(int index)
        {
            TextPosition textPosition = textPositionAt(index);
            String text = textPosition.getUnicode();
            return text.charAt(0);
        }

        @Override
        public TextPositionSequence subSequence(int start, int end)
        {
            return new TextPositionSequence(textPositions, this.start + start, this.start + end);
        }

        @Override
        public String toString()
        {
            StringBuilder builder = new StringBuilder(length());
            for (int i = 0; i < length(); i++)
            {
                builder.append(charAt(i));
            }
            return builder.toString();
        }

        public TextPosition textPositionAt(int index)
        {
            return textPositions.get(start + index);
        }

        public float getX()
        {
            return textPositions.get(start).getXDirAdj();
        }

        public float getY()
        {
            return textPositions.get(start).getYDirAdj();
        }

        public float getWidth()
        {
            TextPosition first = textPositions.get(start);
            // end is exclusive, so the last position of the sequence is at end - 1
            TextPosition last = textPositions.get(end - 1);
            return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
        }

        final List<TextPosition> textPositions;
        final int start, end;
    }
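The search loop in findSubwords advances fromIndex by one character rather than by the length of the search term, so overlapping occurrences are reported too. As a minimal stand-alone sketch of just that loop on a plain String (the class name OverlappingSearch and the method findAll are made up for illustration and are not part of PDFBox; the empty-term guard is an addition that the code above does not have):

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingSearch {
    /** Returns the start index of every occurrence of term in text, overlapping ones included. */
    static List<Integer> findAll(String text, String term) {
        List<Integer> starts = new ArrayList<>();
        if (term.isEmpty()) {
            // Added guard (assumption): an empty term matches at every index, so skip it.
            return starts;
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1) {
            starts.add(index);
            fromIndex = index + 1; // step by one so overlapping hits are not skipped
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a ${var1} b ${var1}", "${var1}")); // prints [2, 12]
        System.out.println(findAll("aaa", "aa"));                      // prints [0, 1]
    }
}
```

In findSubwords each start index is turned into a TextPositionSequence via subSequence, which is what carries the coordinates.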

To merely output their positions, widths, final letters, and final letter positions, you can then use this

    void printSubwords(PDDocument document, String searchTerm) throws IOException
    {
        System.out.printf("* Looking for '%s'\n", searchTerm);
        for (int page = 1; page <= document.getNumberOfPages(); page++)
        {
            List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
            for (TextPositionSequence hit : hits)
            {
                TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
                System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                        page, hit.getX(), hit.getY(), hit.getWidth(),
                        lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
            }
        }
    }

For tests I created a small test file using MS Word:

The output of this test

    @Test
    public void testVariables() throws IOException
    {
        try (InputStream resource = getClass().getResourceAsStream("Variables.pdf");
             PDDocument document = PDDocument.load(resource))
        {
            System.out.println("\nVariables.pdf\n-------------\n");
            printSubwords(document, "${var1}");
            printSubwords(document, "${var 2}");
        }
    }

is

    Variables.pdf
    -------------

    * Looking for '${var1}'
      Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
      Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
      Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
      Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18
    * Looking for '${var 2}'
      Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
      Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
      Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
      Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

I was a bit surprised that ${var 2} was found when it sits on a single line. From the PDFBox code I had assumed that the writeString method I overrode only receives single words, but it looks as if it receives longer parts of the line than mere words.

If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.
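As a rough sketch of such an enhancement, the following computes a width and a height for a sub-range of glyphs. To keep it runnable without PDFBox, it uses a hypothetical Glyph class as a stand-in for TextPosition; only the getter names (getXDirAdj, getWidthDirAdj, getHeightDir) are taken from the code above. Using the tallest glyph as the height of the sequence is an assumption of this sketch, not something PDFBox prescribes.

```java
import java.util.Arrays;
import java.util.List;

public class SequenceMetrics {
    /** Hypothetical stand-in for PDFBox's TextPosition so the sketch runs on its own. */
    static final class Glyph {
        private final float x, width, height;
        Glyph(float x, float width, float height) {
            this.x = x;
            this.width = width;
            this.height = height;
        }
        float getXDirAdj() { return x; }
        float getWidthDirAdj() { return width; }
        float getHeightDir() { return height; }
    }

    /** Width from the left edge of the first glyph to the right edge of the last (end is exclusive). */
    static float width(List<Glyph> glyphs, int start, int end) {
        Glyph first = glyphs.get(start);
        Glyph last = glyphs.get(end - 1);
        return last.getXDirAdj() + last.getWidthDirAdj() - first.getXDirAdj();
    }

    /** Height of the tallest glyph in [start, end); an assumed definition of "sequence height". */
    static float height(List<Glyph> glyphs, int start, int end) {
        float max = 0f;
        for (int i = start; i < end; i++) {
            max = Math.max(max, glyphs.get(i).getHeightDir());
        }
        return max;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph(10f, 5f, 8f), new Glyph(15f, 5f, 10f), new Glyph(20f, 6f, 7f));
        System.out.println(width(glyphs, 0, 3));  // prints 16.0
        System.out.println(height(glyphs, 0, 3)); // prints 10.0
    }
}
```

In TextPositionSequence itself, the same loop would run over textPositions from start to end and could be exposed as an additional getHeight() method next to getWidth().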



Answer 2:

As mentioned, this is not an answer to your question, but below is a skeleton example of how you would do this in iText. This is not to say that the same is not possible in PDFBox.

Basically you make a RenderListener that accepts the "parse events" as they happen. You pass this listener to PdfReaderContentParser.processContent. In the listener's renderText method you get all information you need to reconstruct the layout, including x/y coordinates and the text/image/... that make up the content.

    RenderListener listener = new RenderListener() {
        @Override
        public void renderText(TextRenderInfo arg0) {
            LineSegment segment = arg0.getBaseline();
            int x = (int) segment.getStartPoint().get(Vector.I1);
            // smaller Y means closer to the BOTTOM of the page. So we negate the Y to get proper top-to-bottom ordering
            int y = -(int) segment.getStartPoint().get(Vector.I2);
            int endx = (int) segment.getEndPoint().get(Vector.I1);
            log.debug("renderText "+x+".."+endx+"/"+y+": "+arg0.getText());
            ...
        }

        ... // other overrides
    };

    PdfReaderContentParser p = new PdfReaderContentParser(reader);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        log.info("handling page "+i);
        p.processContent(i, listener);
    }
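To illustrate why the Y value is negated, here is a small stand-alone sketch that sorts collected text chunks into reading order (top to bottom, then left to right). The Chunk class and the inReadingOrder method are hypothetical helpers for this illustration and not part of iText; in a real listener the values would come from renderText as shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrder {
    /** Hypothetical record of one renderText event: baseline start point and text. */
    static final class Chunk {
        final int x;
        final int negY; // negated PDF y, so that larger page y (higher up) sorts first
        final String text;
        Chunk(int x, int pdfY, String text) {
            this.x = x;
            this.negY = -pdfY;
            this.text = text;
        }
    }

    /** Returns the chunk texts ordered top-to-bottom, then left-to-right. */
    static List<String> inReadingOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.negY).thenComparingInt(c -> c.x));
        List<String> texts = new ArrayList<>();
        for (Chunk c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = Arrays.asList(
                new Chunk(50, 700, "world"),
                new Chunk(10, 50, "footer"),
                new Chunk(10, 700, "Hello"));
        System.out.println(inReadingOrder(chunks)); // prints [Hello, world, footer]
    }
}
```

Comparing exact integer Y values only groups chunks that share the same baseline; text with slightly varying baselines would need a tolerance, which this sketch leaves out.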

