extract PDF text by columns

时光取名叫无心 2021-01-14 19:41

My question is:

How can I extract text from a PDF file that is laid out in columns so that the result is separated by those columns?

Background: I wo

2 answers
  •  灰色年华
    2021-01-14 20:08

    Building on @mkl's answer, I used PDFBox to extract the text column by column.

    I found the boundary between the two columns by trial and error:

        // Assumes PDFBox 2.x and an already-loaded PDDocument named `document`.
        import java.awt.Rectangle;
        import org.apache.pdfbox.pdmodel.PDPage;
        import org.apache.pdfbox.text.PDFTextStripperByArea;

        StringBuilder pdfText = new StringBuilder();
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        stripper.setSortByPosition(true);

        // Column regions as (x, y, width, height); the values were found by trial and error.
        Rectangle rectLeft = new Rectangle(10, 60, 320, 820);
        Rectangle rectRight = new Rectangle(330, 60, 320, 820);
        stripper.addRegion("leftColumn", rectLeft);
        stripper.addRegion("rightColumn", rectRight);

        // Extract both regions from every page, appending the left column before the right one.
        for (PDPage page : document.getPages()) {
            stripper.extractRegions(page);
            pdfText.append(stripper.getTextForRegion("leftColumn"));
            pdfText.append(stripper.getTextForRegion("rightColumn"));
        }
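
If the columns are evenly spaced, the rectangles could also be computed instead of guessed. The following is only a sketch under that assumption: `ColumnRegions` and `splitIntoColumns` are hypothetical names, not part of PDFBox, and the page size, margins, and gutter are values you would have to supply for your own document. It uses only `java.awt.geom.Rectangle2D`, which is the type `addRegion` accepts.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of PDFBox): splits a page into equal-width column regions. */
final class ColumnRegions {

    private ColumnRegions() {}

    /**
     * Assumes all columns have the same width and share the same top and bottom margins.
     *
     * @param pageWidth  page width in points
     * @param pageHeight page height in points
     * @param columns    number of columns
     * @param marginX    left and right page margin
     * @param marginY    top and bottom page margin
     * @param gutter     horizontal gap between adjacent columns
     */
    static List<Rectangle2D> splitIntoColumns(double pageWidth, double pageHeight, int columns,
                                              double marginX, double marginY, double gutter) {
        if (columns < 1) {
            throw new IllegalArgumentException("columns must be >= 1");
        }
        double usableWidth = pageWidth - 2 * marginX - (columns - 1) * gutter;
        double usableHeight = pageHeight - 2 * marginY;
        if (usableWidth <= 0 || usableHeight <= 0) {
            throw new IllegalArgumentException("margins and gutters leave no room for text");
        }
        double columnWidth = usableWidth / columns;

        List<Rectangle2D> regions = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            // Each column starts one column width plus one gutter to the right of the previous one.
            double x = marginX + i * (columnWidth + gutter);
            regions.add(new Rectangle2D.Double(x, marginY, columnWidth, usableHeight));
        }
        return regions;
    }
}
```

For example, calling `ColumnRegions.splitIntoColumns(660, 940, 2, 10, 60, 0)` with an assumed 660 x 940 page yields the same two rectangles as the hard-coded ones above, (10, 60, 320, 820) and (330, 60, 320, 820). Each result could then be registered with `stripper.addRegion(...)` under a name of your choice. Uneven columns would still need manual tuning.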
    
