How to extract text from a PDF file with Apache PDFBox

不知归路 2020-12-08 05:02

I would like to extract text from a given PDF file with Apache PDFBox.

I wrote this code:

PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;

5 Answers
  •  我在风中等你
    2020-12-08 05:14

    Using PDFBox 2.0.7, this is how I get the text of a PDF:

    static String getText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            return new PDFTextStripper().getText(doc);
        }
    }
    

    Call it like this:

    try {
        String text = getText(new File("/home/me/test.pdf"));
        System.out.println("Text in PDF: " + text);
    } catch (IOException e) {
        e.printStackTrace();
    }
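
    For completeness, here is a minimal self-contained sketch of the same idea with the imports spelled out, assuming PDFBox 2.0.x (in 3.x, documents are loaded through Loader.loadPDF rather than PDDocument.load). The class name PdfTextExtractor and the page-range parameters are illustrative additions, not part of the snippet above; setStartPage and setEndPage are the PDFTextStripper setters that restrict extraction to a range of pages.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtractor {

    // Hypothetical helper: returns the text of pages firstPage..lastPage
    // (1-based, inclusive). Written against the PDFBox 2.0.x API.
    static String getText(File pdfFile, int firstPage, int lastPage) throws IOException {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage);
            stripper.setEndPage(lastPage);
            return stripper.getText(doc);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PdfTextExtractor <file.pdf>");
            return;
        }
        // Print only the first two pages of the given file.
        System.out.println(getText(new File(args[0]), 1, 2));
    }
}
```

    Leaving the page range unset extracts the whole document, which is what the shorter getText above does.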
    

    Since user oivemaria asked in the comments:

    You can use PDFBox in your application by adding it to your dependencies in build.gradle:

    dependencies {
      // 'implementation' replaces the 'compile' configuration, which Gradle 7 removed
      implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
    }
    

    Here's more on dependency management using Gradle.


    If you want to preserve the PDF's layout in the extracted text, give PDFLayoutTextStripper a try.
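
    PDFLayoutTextStripper is a separate third-party class, not part of PDFBox itself. As a rough sketch, assuming PDFBox 2.0.x, the helper below accepts any PDFTextStripper, so a subclass such as PDFLayoutTextStripper could be passed in once it is on the classpath (that it can be used as a drop-in PDFTextStripper is an assumption here). The example itself uses only PDFBox's built-in setSortByPosition(true), which orders text by its position on the page; that is a different and weaker technique than layout preservation. The class name PositionSortedExtractor is hypothetical.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PositionSortedExtractor {

    // Hypothetical helper: extracts text with whatever stripper the caller
    // supplies, so a PDFTextStripper subclass can be swapped in.
    static String getText(File pdfFile, PDFTextStripper stripper) throws IOException {
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            return stripper.getText(doc);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PositionSortedExtractor <file.pdf>");
            return;
        }
        PDFTextStripper stripper = new PDFTextStripper();
        // Order text by position on the page instead of content-stream order.
        stripper.setSortByPosition(true);
        System.out.println(getText(new File(args[0]), stripper));
    }
}
```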
