iTextSharp - How to get the position of word on a page

Posted by 半城伤御伤魂 on 2019-11-27 01:37:29

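Yes, there is. Check out the text.pdf.parser package, specifically LocationTextExtractionStrategy. Actually, that might not do the trick on its own, since what it hands back is the extracted text rather than the positions. You'll probably want to write your own TextExtractionStrategy to feed into PdfTextExtractor (the snippets here use the Java iText method names; iTextSharp capitalizes them .NET-style):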

MyTexExStrat strat = new MyTexExStrat();
PdfTextExtractor.getTextFromPage(reader, pageNum, strat);
// get the strings-n-rects from strat.

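public class MyTexExStrat implements TextExtractionStrategy {
    // Interface methods must be declared public in the implementing class.
    public void beginTextBlock() {}
    public void endTextBlock() {}
    public void renderImage(ImageRenderInfo info) {}
    public void renderText(TextRenderInfo info) {
      // track text and location here.
    }
    // Also required by TextExtractionStrategy: return the text collected so far.
    public String getResultantText() { return ""; }
}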

You'll probably want to look at the source for LocationTextExtractionStrategy to see how it combines text that shares a baseline. You might even just modify LTES to store parallel arrays of strings and rects.
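
As a rough illustration of the "combine text that shares a baseline" idea, here is a minimal sketch in plain Java. It is not how LocationTextExtractionStrategy is actually written. Chunk and BaselineGrouper are hypothetical names, and each Chunk is assumed to hold what you would record in renderText (the string, its left and right x, and its baseline y). It also assumes horizontal, left-to-right text and does not insert spaces between chunks.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BaselineGrouper {
    /** Hypothetical stand-in for the data recorded per renderText() call. */
    static final class Chunk {
        final String text;
        final float left, right, baseline;
        Chunk(String text, float left, float right, float baseline) {
            this.text = text;
            this.left = left;
            this.right = right;
            this.baseline = baseline;
        }
    }

    /** Merges chunks whose baselines differ by at most tolerance into one Chunk per line. */
    static List<Chunk> group(List<Chunk> chunks, float tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y grows upward, so descending baseline puts the top of the page first.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.baseline));

        // Bucket consecutive chunks that sit on (nearly) the same baseline.
        List<List<Chunk>> buckets = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = buckets.isEmpty() ? null : buckets.get(buckets.size() - 1);
            if (last == null || Math.abs(last.get(0).baseline - c.baseline) > tolerance) {
                last = new ArrayList<>();
                buckets.add(last);
            }
            last.add(c);
        }

        // Within each bucket, order left to right and merge text and extent.
        List<Chunk> lines = new ArrayList<>();
        for (List<Chunk> bucket : buckets) {
            bucket.sort(Comparator.comparingDouble((Chunk c) -> c.left));
            StringBuilder text = new StringBuilder();
            float left = Float.MAX_VALUE;
            float right = -Float.MAX_VALUE;
            for (Chunk c : bucket) {
                text.append(c.text);
                left = Math.min(left, c.left);
                right = Math.max(right, c.right);
            }
            lines.add(new Chunk(text.toString(), left, right, bucket.get(0).baseline));
        }
        return lines;
    }
}
```

The tolerance is there because chunks on the same visual line can report slightly different baseline values; what counts as "slightly" depends on your documents.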

PS: to build the rects, you can just get the AscentLine & DescentLine and use their coordinates as the bottom-left and top-right corners:

Vector bottomLeft = info.getDescentLine().getStartPoint();
Vector topRight = info.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(bottomLeft.get(Vector.I1),
                               bottomLeft.get(Vector.I2),
                               topRight.get(Vector.I1),
                               topRight.get(Vector.I2));

Warning: The above code ass-u-mes that the text is horizontal and proceeds from left to right. Rotated text will screw it up, as will vertical text or right-to-left (Arabic, Hebrew) text. For most applications, the above should be fine, but know its limits.
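
If rotated text does matter to you, one possible workaround is to stop trusting any single pair of corners and instead enclose all four: the start and end points of both the descent line and the ascent line. The sketch below uses plain float arrays rather than iText types, and BoundingBox and enclose are hypothetical names. It only yields an axis-aligned box around the text and does nothing about reading order for right-to-left scripts.

```java
final class BoundingBox {
    /**
     * Returns {llx, lly, urx, ury} for the smallest axis-aligned box
     * containing every point. Each point is {x, y}; the intended inputs are
     * the start and end points of the descent line and the ascent line.
     */
    static float[] enclose(float[][] points) {
        float llx = Float.MAX_VALUE;
        float lly = Float.MAX_VALUE;
        float urx = -Float.MAX_VALUE;
        float ury = -Float.MAX_VALUE;
        for (float[] p : points) {
            llx = Math.min(llx, p[0]);
            lly = Math.min(lly, p[1]);
            urx = Math.max(urx, p[0]);
            ury = Math.max(ury, p[1]);
        }
        return new float[] {llx, lly, urx, ury};
    }
}
```

The four resulting values can be passed to the Rectangle constructor in the same order as in the snippet above.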

Good hunting.
