PDFBox - getting word locations (and not only characters')

Posted by 谁都会走 on 2019-12-08 20:18:11

Question


Is it possible to get the locations of words using PDFBox, similar to "processTextPosition"? It seems that processTextPosition is called on single characters only, and the code that merges them into words is part of PDFTextStripper (in the "normalize" method), which does return the location of the text. Is there a method or utility that extracts the location as well? (For those wondering about the motivation: the information is actually a table, and we would like to detect empty cells.) Thanks


Answer 1:


To get words and their x and y positions in text extracted from a PDF file, you will have to extend the PDFTextStripper class and use the custom class to extract the text, e.g.:

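import java.io.IOException;
import java.util.List;

// PDFBox 1.8.x packages; in PDFBox 2.x these classes live in org.apache.pdfbox.text.
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public class CustomPDFTextStripper extends PDFTextStripper {

    public CustomPDFTextStripper() throws IOException {
        super();
    }

    /**
     * Override the default functionality of PDFTextStripper:
     * emit each word together with the position of its first character.
     */
    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        TextPosition firstPosition = textPositions.get(0);
        // getTextPos() is the PDFBox 1.x accessor for the text matrix;
        // PDFBox 2.x replaced it with getTextMatrix().
        writeString(String.format("[%s , %s , %s]",
                firstPosition.getTextPos().getXPosition(),
                firstPosition.getTextPos().getYPosition(),
                text));
    }
}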

Create an instance of CustomPDFTextStripper and extract the text as follows:

PDFTextStripper pdfStripper = new CustomPDFTextStripper();
// document: the PDF file loaded as a PDDocument object
String text = pdfStripper.getText(document);

The resulting text string consists of entries of the form [xposition , yposition , word], separated by the default word separator.
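Since the positions come back embedded in one long string, it may help to turn them into structured records again before doing anything table-related with them. The following is only a minimal sketch using the standard library: WordPositionParser and its Word class are hypothetical names, not part of PDFBox, and the pattern assumes the exact "[%s , %s , %s]" format used in writeString and that no word contains a ']' character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: turns the stripper's
// "[x , y , word]" output back into structured records.
class WordPositionParser {

    /** One extracted word with the position of its first character. */
    static final class Word {
        final float x;
        final float y;
        final String text;

        Word(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Mirrors the "[%s , %s , %s]" format string used in writeString.
    // Assumption: the word itself contains no ']' character.
    private static final Pattern ENTRY =
            Pattern.compile("\\[(\\S+) , (\\S+) , (.*?)\\]");

    /** Collects every "[x , y , word]" entry found in the extracted text. */
    static List<Word> parse(String extracted) {
        List<Word> words = new ArrayList<>();
        Matcher m = ENTRY.matcher(extracted);
        while (m.find()) {
            words.add(new Word(
                    Float.parseFloat(m.group(1)),
                    Float.parseFloat(m.group(2)),
                    m.group(3)));
        }
        return words;
    }
}
```

Calling WordPositionParser.parse(text) on the extracted string yields a list whose x and y values could then be compared against known column and row boundaries, which is one possible way to approach the empty-cell detection mentioned in the question.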



Source: https://stackoverflow.com/questions/12354266/pdfbox-getting-words-locations-and-not-only-characters
