Extract PDF text by columns

时光取名叫无心 2021-01-14 19:41

My question is:

How can I extract text from a PDF file that is laid out in columns, so that the result is separated by these columns?

Background: I wo

2 answers

灰色年华 (original poster)

2021-01-14 20:08

Building on @mkl's answer, I used PDFBox to extract the text column by column.

I found the boundary between the two columns by trial and error:

    // Assumes `document` is an already loaded PDDocument (PDFBox 2.x).
    // Needs: java.awt.Rectangle, org.apache.pdfbox.pdmodel.PDPage,
    //        org.apache.pdfbox.text.PDFTextStripperByArea
    StringBuilder pdfText = new StringBuilder();
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);

    // One region per column, given as (x, y, width, height).
    // These values were found by trial and error for this particular layout.
    Rectangle rectLeft = new Rectangle(10, 60, 320, 820);
    Rectangle rectRight = new Rectangle(330, 60, 320, 820);
    stripper.addRegion("leftColumn", rectLeft);
    stripper.addRegion("rightColumn", rectRight);

    // On each page, read the left column first, then the right column.
    for (PDPage page : document.getPages()) {
        stripper.extractRegions(page);
        pdfText.append(stripper.getTextForRegion("leftColumn"));
        pdfText.append(stripper.getTextForRegion("rightColumn"));
    }
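Since the boundary has to be tuned by hand, it may help to derive both rectangles from a single boundary value, so that only one number changes between attempts. The following is a minimal sketch of that idea, not part of the original answer. `ColumnRegions` and `splitAt` are hypothetical names (they are not PDFBox API), and the body area `(10, 60, 640, 820)` mentioned below is only inferred from the rectangles used above.

```java
import java.awt.Rectangle;

/** Hypothetical helper (not part of PDFBox): splits a body area into two column regions. */
final class ColumnRegions {

    private ColumnRegions() {
    }

    /**
     * Splits {@code body} at the vertical line x = boundaryX.
     * Returns {left, right}; both keep the body's y position and height.
     */
    static Rectangle[] splitAt(Rectangle body, int boundaryX) {
        int bodyRight = body.x + body.width;
        // The boundary must fall strictly inside the body, otherwise one column would be empty.
        if (boundaryX <= body.x || boundaryX >= bodyRight) {
            throw new IllegalArgumentException("boundaryX must lie strictly inside the body area");
        }
        Rectangle left = new Rectangle(body.x, body.y, boundaryX - body.x, body.height);
        Rectangle right = new Rectangle(boundaryX, body.y, bodyRight - boundaryX, body.height);
        return new Rectangle[] {left, right};
    }
}
```

With a body of `new Rectangle(10, 60, 640, 820)` and a boundary of `330`, this yields the same two rectangles as in the answer; the results can then be passed to `stripper.addRegion("leftColumn", ...)` and `stripper.addRegion("rightColumn", ...)` as before.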
