I would like to extract text from a given PDF file with Apache PDFBox.
I wrote this code:
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
Using PDFBox 2.0.7, this is how I get the text of a PDF:
static String getText(File pdfFile) throws IOException {
    // try-with-resources closes the document even if extraction throws
    try (PDDocument doc = PDDocument.load(pdfFile)) {
        return new PDFTextStripper().getText(doc);
    }
}
Call it like this:
try {
    String text = getText(new File("/home/me/test.pdf"));
    System.out.println("Text in PDF: " + text);
} catch (IOException e) {
    e.printStackTrace();
}
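If you only need some of the pages, something along these lines might work. It's a minimal sketch, assuming PDFBox 2.0.x as above; the class name `PageRangeText` and the `firstPage`/`lastPage` parameters are made up for illustration. It uses `setStartPage` and `setEndPage` on `PDFTextStripper` (both 1-based and inclusive) plus `setSortByPosition(true)`, which orders text by its position on the page rather than by its order in the content stream. Note that in PDFBox 3.x `PDDocument.load` was replaced by `Loader.loadPDF`, so the loading line would need to change there.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

class PageRangeText {

    // Hypothetical helper (not part of PDFBox): returns the text of pages
    // firstPage..lastPage, both 1-based and inclusive.
    static String getText(File pdfFile, int firstPage, int lastPage) throws IOException {
        // Assumes PDFBox 2.0.x, where PDDocument.load(File) is available.
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(firstPage);
            stripper.setEndPage(lastPage);
            // Order text by position on the page instead of content-stream order.
            stripper.setSortByPosition(true);
            return stripper.getText(doc);
        }
    }
}
```

For example, `PageRangeText.getText(new File("/home/me/test.pdf"), 2, 3)` would return only the text of pages 2 and 3.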
Since user oivemaria asked in the comments:
You can use PDFBox in your application by adding it to your dependencies in build.gradle:
dependencies {
    // Gradle 3.4+; on older Gradle versions use `compile` instead of `implementation`
    implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.7'
}
The Gradle user guide has more on dependency management.
If you want to keep the PDF's layout in the extracted text, give the third-party PDFLayoutTextStripper a try.