I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch
I may be late to answer, but you should have a look at Apache Tika. It uses the PDFBox parser internally to parse PDFs.
You just need to add tika-app-latest*.jar to your classpath.
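import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.xml.sax.SAXException;

public String parseToStringExample() throws IOException, SAXException, TikaException
{
    Tika tika = new Tika();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
        return tika.parseToString(stream); // returns the PDF's extracted text
    }
}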
This would be a much cleaner solution. For more details on Tika usage, see: https://tika.apache.org/1.12/api/
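If the goal is to keep a batch running when one input turns out to be invalid, one option is to wrap the extraction call so that a failure on a single file is reported and skipped instead of propagating. The sketch below is only an illustration using the standard library: `SafeExtract`, `TextExtractor`, and `tryExtract` are hypothetical names, not part of Tika or PDFBox, and the extractor you pass in is assumed to be something like `in -> new Tika().parseToString(in)`.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

final class SafeExtract {

    /** Hypothetical stand-in for any extractor, e.g. in -> new Tika().parseToString(in). */
    @FunctionalInterface
    interface TextExtractor {
        String extract(InputStream in) throws Exception;
    }

    /**
     * Runs the extractor on one file. Returns the text if extraction succeeds,
     * or an empty Optional if the extractor (or opening the file) throws.
     */
    static Optional<String> tryExtract(Path file, TextExtractor extractor) {
        try (InputStream in = Files.newInputStream(file)) {
            return Optional.ofNullable(extractor.extract(in));
        } catch (Exception e) {
            // Assumption: a broken input should be logged and skipped, not abort the batch.
            System.err.println("Skipping " + file + ": " + e);
            return Optional.empty();
        }
    }
}
```

A caller could then loop over the input files and process only the results that are present. Note that this catches exceptions only; errors such as OutOfMemoryError would still propagate.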