I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch
I was using some of the suggestions I found here and on other sites/posts for determining whether a pdf was valid or not. I purposely corrupted a pdf file, and unfortunately, many of the solutions did not detect that the file was corrupted.
Eventually, after tinkering around with different methods in the API, I tried this:
PDDocument.load(file).getPage(0).getContents().toString();
This did not throw an exception, but it did output this:
WARN [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015
Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but it appeared that the API I was implementing already handled them in their own way.
To get around this, I decided to try parsing the files using the class that gave the warm statement (COSParser). I found that there was a subclass, called PDFParser, which inherited a method called "setLenient", which was the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).
I then implemented the following:
RandomAccessFile accessFile = new RandomAccessFile(file, "r");
PDFParser parser = new PDFParser(accessFile);
parser.setLenient(false);
parser.parse();
This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!