How can I determine if a file is a PDF file?

暖寄归人 2020-12-24 11:57

I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch

13 Answers
  •  死守一世寂寞
    2020-12-24 12:26

    I tried several of the suggestions I found here and on other sites for determining whether a PDF is valid. I purposely corrupted a PDF file, and unfortunately many of those solutions did not detect the corruption.
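
    To illustrate why shallow checks fall short, here is a minimal sketch of a signature test that only looks for the "%PDF-" marker at the start of the file. The class and method names are hypothetical (they are not part of PDFBox), the sketch assumes Java 11+ for readNBytes, and it is not necessarily one of the approaches tried above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox: checks only the "%PDF-" signature.
final class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with "%PDF-". */
    static boolean startsWithPdfHeader(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    /** Reads just enough bytes from the file to run the signature test. */
    static boolean startsWithPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithPdfHeader(in.readNBytes(MAGIC.length));
        }
    }
}
```

    A file that is truncated or damaged after the header still passes this test, so it is at best a cheap pre-filter and does not replace actually parsing the document.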

    Eventually, after tinkering around with different methods in the API, I tried this:

    PDDocument.load(file).getPage(0).getContents().toString();
    

    This did not throw an exception, but it did output this:

     WARN  [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015
    

    Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but the API I was using already handled such errors in its own way.

    To get around this, I decided to try parsing the files using the class that logged the warning (COSParser). I found that it has a subclass, PDFParser, which inherits a method called "setLenient", and that turned out to be the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).

    I then implemented the following:

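            // RandomAccessFile here is org.apache.pdfbox.io.RandomAccessFile (PDFBox 2.0.x), not java.io.RandomAccessFile
            RandomAccessFile accessFile = new RandomAccessFile(file, "r");
            PDFParser parser = new PDFParser(accessFile);
            parser.setLenient(false); // strict mode: do not work around malformed input
            parser.parse();           // throws an IOException if the file cannot be parsed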
    

    This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!
