I would like to extract text from a given PDF file with Apache PDFBox.
I wrote this code:
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
Maven dependency:
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.9</version>
</dependency>
Then the function to get the PDF text as a String:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

private static String readPDF(File pdf) throws InvalidPasswordException, IOException {
    // try-with-resources closes the document even if extraction fails
    try (PDDocument document = PDDocument.load(pdf)) {
        if (!document.isEncrypted()) {
            // sort-by-position is set on the stripper that is actually used
            PDFTextStripper tStripper = new PDFTextStripper();
            tStripper.setSortByPosition(true);
            String pdfFileInText = tStripper.getText(document);
            // split into lines on \n or \r\n
            String[] lines = pdfFileInText.split("\\r?\\n");
            List<String> pdfLines = new ArrayList<>();
            StringBuilder sb = new StringBuilder();
            for (String line : lines) {
                System.out.println(line);
                pdfLines.add(line);
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
    // encrypted documents are skipped
    return null;
}
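
To check the line handling separately from PDFBox, here is a minimal sketch of just the split-and-rejoin step from readPDF, using only the standard library. The class name LineSplitDemo and the methods toLines and normalize are hypothetical names chosen for illustration; they are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: mirrors the split/rejoin loop in readPDF.
final class LineSplitDemo {

    // Splits text on "\n" or "\r\n" and returns the individual lines.
    static List<String> toLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\r?\\n")) {
            lines.add(line);
        }
        return lines;
    }

    // Rejoins the lines, terminating each one with a single "\n".
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder();
        for (String line : toLines(text)) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Mixed Windows and Unix line endings, as extracted text may contain.
        String sample = "first\r\nsecond\nthird";
        System.out.println(toLines(sample));
        System.out.print(normalize(sample));
    }
}
```

One thing this makes visible: String.split drops trailing empty strings, so blank lines at the very end of the extracted text are lost, while blank lines in the middle are kept.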