How can I determine if a file is a PDF file?


暖寄归人 2020-12-24 11:57

I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch

13 answers

死守一世寂寞 (original poster)

2020-12-24 12:26
I tried several of the suggestions I found here and in other posts for determining whether a PDF is valid. I deliberately corrupted a PDF file, and unfortunately many of the solutions did not detect the corruption.
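
The post does not say which checks were tried. Purely as an illustration, one quick check that is often suggested is to look for the `%PDF-` header at the start of the file. The sketch below does that with the JDK only (Java 11+ for `readNBytes`); the class and method names are hypothetical and not part of PDFBox.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: a header-only check using just the JDK.
class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file begins with the bytes "%PDF-". */
    static boolean hasPdfHeader(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            // Reads at most MAGIC.length bytes; a shorter file yields a shorter array.
            byte[] head = in.readNBytes(MAGIC.length);
            return Arrays.equals(head, MAGIC);
        }
    }
}
```

A call such as `PdfHeaderCheck.hasPdfHeader(file.toPath())` only inspects the first five bytes, so a file whose header is intact but whose body is damaged still passes. That is consistent with the experience described above: a check of this kind can rule out files that are plainly not PDFs, but it cannot confirm that a PDF is readable.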

Eventually, after tinkering around with different methods in the API, I tried this:
```
PDDocument.load(file).getPage(0).getContents().toString();
```
This did not throw an exception, but it did output this:
```
 WARN  [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015
```
Personally, I wanted an exception to be thrown if the file was corrupted so that I could handle it myself, but it appeared that the API I was using already handled such errors in its own way.

To get around this, I decided to parse the files with the class that logged the warning (COSParser). I found that it has a subclass, PDFParser, which inherits a method called "setLenient", and that turned out to be the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).

I then implemented the following:
```
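import org.apache.pdfbox.io.RandomAccessFile; // PDFBox 2.0.x class, not java.io.RandomAccessFile
import org.apache.pdfbox.pdfparser.PDFParser;

// With lenient parsing switched off, parse() throws instead of applying workarounds.
RandomAccessFile accessFile = new RandomAccessFile(file, "r");
PDFParser parser = new PDFParser(accessFile);
parser.setLenient(false);
parser.parse();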
```
This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!