How to extract text from a PDF file with Apache PDFBox

不知归路 2020-12-08 05:02

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;

5 answers
  •  爱一瞬间的悲伤
    2020-12-08 05:22

    Maven dependency:

        <dependency>
            <groupId>org.apache.pdfbox</groupId>
            <artifactId>pdfbox</artifactId>
            <version>2.0.9</version>
        </dependency>
    

    Then the function to get the PDF text as a String:

    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
    import org.apache.pdfbox.text.PDFTextStripper;

    private static String readPDF(File pdf) throws InvalidPasswordException, IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(pdf)) {
            // encrypted documents are skipped; the caller gets null
            if (document.isEncrypted()) {
                return null;
            }

            PDFTextStripper tStripper = new PDFTextStripper();
            String pdfFileInText = tStripper.getText(document);

            // split on line breaks (\n or \r\n), then rejoin with \n
            String[] lines = pdfFileInText.split("\\r?\\n");
            StringBuilder sb = new StringBuilder();
            for (String line : lines) {
                System.out.println(line);
                sb.append(line).append("\n");
            }
            return sb.toString();
        }
    }
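
    The line-splitting step in the function above does not depend on PDFBox, so it can be tried on its own. Below is a minimal sketch using only the Java standard library; the class name LineSplitDemo and the helper splitLines are made up for illustration and are not part of PDFBox or of the answer's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineSplitDemo {

    // Hypothetical helper: splits text on Unix (\n) or Windows (\r\n) line breaks.
    static List<String> splitLines(String text) {
        return new ArrayList<>(Arrays.asList(text.split("\\r?\\n")));
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(...)
        String extracted = "First line\r\nSecond line\nThird line";
        for (String line : splitLines(extracted)) {
            System.out.println(line);
        }
    }
}
```

    Because the pattern treats the carriage return as optional, text with mixed line endings yields the same list of lines. Note that String.split drops trailing empty strings, so a final line break does not produce an extra empty entry.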
    
