How can I determine if a file is a PDF file?

暖寄归人 2020-12-24 11:57

I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch

13 answers
  • 2020-12-24 12:26

    I was using some of the suggestions I found here and on other sites/posts for determining whether a pdf was valid or not. I purposely corrupted a pdf file, and unfortunately, many of the solutions did not detect that the file was corrupted.

    Eventually, after tinkering around with different methods in the API, I tried this:

    PDDocument.load(file).getPage(0).getContents().toString();
    

    This did not throw an exception, but it did output this:

     WARN  [COSParser:1154] The end of the stream doesn't point to the correct offset, using workaround to read the stream, stream start position: 171, length: 1145844, expected end position: 1146015
    

    Personally, I wanted an exception to be thrown if the file was corrupted so I could handle it myself, but the API I was using already handled such files in its own way.

    To get around this, I decided to try parsing the files using the class that logged the warning (COSParser). I found that it has a subclass, PDFParser, which inherits a method called "setLenient", and that turned out to be the key (https://pdfbox.apache.org/docs/2.0.4/javadocs/org/apache/pdfbox/pdfparser/COSParser.html).

    I then implemented the following:

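            // RandomAccessFile here is org.apache.pdfbox.io.RandomAccessFile, not java.io.RandomAccessFile
            RandomAccessFile accessFile = new RandomAccessFile(file, "r");
            PDFParser parser = new PDFParser(accessFile);
            // strict mode: report structural errors instead of working around them
            parser.setLenient(false);
            parser.parse();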
    

    This threw an Exception for my corrupted file, as I wanted. Hope this helps someone out!
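
    If several input files have to be screened this way, the strict check can be wrapped so that a failing file is skipped instead of stopping the whole run. The following is only a sketch: `StrictFilter`, `ThrowingCheck`, and `keepValid` are hypothetical names, not part of PDFBox, and it assumes the strict parse reports a corrupted file as an `IOException`.

    ```java
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical helper: keeps only the inputs that a throwing check accepts. */
    final class StrictFilter {

        /** Hypothetical callback; its body would be the strict parse shown above. */
        interface ThrowingCheck<T> {
            void check(T input) throws IOException;
        }

        private StrictFilter() {}

        static <T> List<T> keepValid(List<T> inputs, ThrowingCheck<T> check) {
            List<T> valid = new ArrayList<>();
            for (T input : inputs) {
                try {
                    check.check(input); // assumption: throws IOException for a corrupted input
                    valid.add(input);
                } catch (IOException e) {
                    // The check rejected this input; report it and continue with the next one.
                    System.err.println("Skipping invalid input " + input + ": " + e.getMessage());
                }
            }
            return valid;
        }
    }
    ```

    The lambda passed as `check` would contain the four parser lines from above, and only the files returned by `keepValid` would then be handed to PDFTextStripper.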

  • 2020-12-24 12:27

    Here is what I use in my NUnit tests, which must validate multiple versions of PDF files generated by Crystal Reports:

    public static void CheckIsPDF(byte[] data)
        {
            Assert.IsNotNull(data);
            Assert.Greater(data.Length,7); // bytes 5..7 are read below for the version check
    
            // header 
            Assert.AreEqual(data[0],0x25); // %
            Assert.AreEqual(data[1],0x50); // P
            Assert.AreEqual(data[2],0x44); // D
            Assert.AreEqual(data[3],0x46); // F
            Assert.AreEqual(data[4],0x2D); // -
    
            if(data[5]==0x31 && data[6]==0x2E && data[7]==0x33) // version is 1.3 ?
            {                  
                // file terminator
                Assert.AreEqual(data[data.Length-7],0x25); // %
                Assert.AreEqual(data[data.Length-6],0x25); // %
                Assert.AreEqual(data[data.Length-5],0x45); // E
                Assert.AreEqual(data[data.Length-4],0x4F); // O
                Assert.AreEqual(data[data.Length-3],0x46); // F
                Assert.AreEqual(data[data.Length-2],0x20); // SPACE
                Assert.AreEqual(data[data.Length-1],0x0A); // EOL
                return;
            }
    
            if(data[5]==0x31 && data[6]==0x2E && data[7]==0x34) // version is 1.4 ?
            {
                // file terminator
                Assert.AreEqual(data[data.Length-6],0x25); // %
                Assert.AreEqual(data[data.Length-5],0x25); // %
                Assert.AreEqual(data[data.Length-4],0x45); // E
                Assert.AreEqual(data[data.Length-3],0x4F); // O
                Assert.AreEqual(data[data.Length-2],0x46); // F
                Assert.AreEqual(data[data.Length-1],0x0A); // EOL
                return;
            }
    
            Assert.Fail("Unsupported file format");
        }
    
  • 2020-12-24 12:27

    There is a very convenient and simple library for testing PDF content: https://github.com/codeborne/pdf-test

    The API is very simple:

    import java.io.File;
    import org.junit.Test;
    import com.codeborne.pdftest.PDF;
    import static com.codeborne.pdftest.PDF.*;
    import static org.junit.Assert.assertThat;
    
    public class PDFContainsTextTest {
      @Test
      public void canAssertThatPdfContainsText() {
        PDF pdf = new PDF(new File("src/test/resources/50quickideas.pdf"));
        assertThat(pdf, containsText("50 Quick Ideas to Improve your User Stories"));
      }
    }
    
  • 2020-12-24 12:28

    The answer by Roger Keays is wrong, since not all PDF files are version 1.3 or 1.4 and not all of them are terminated by an EOL. The code below drops the version check and the final EOL check, and works for all non-corrupted PDF files:

    public static boolean is_pdf(byte[] data) {
        if (data != null && data.length > 6 // the terminator checks below read up to 7 bytes from the end
                && data[0] == 0x25 && // %
                data[1] == 0x50 && // P
                data[2] == 0x44 && // D
                data[3] == 0x46 && // F
                data[4] == 0x2D) { // -
    
            // version 1.3 file terminator
            if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x33 &&
                    data[data.length - 7] == 0x25 && // %
                    data[data.length - 6] == 0x25 && // %
                    data[data.length - 5] == 0x45 && // E
                    data[data.length - 4] == 0x4F && // O
                    data[data.length - 3] == 0x46 && // F
                    data[data.length - 2] == 0x20 // SPACE
                    //&& data[data.length - 1] == 0x0A// EOL
                    ) {
                return true;
            }
    
            // version 1.4 file terminator
            if (//data[5] == 0x31 && data[6] == 0x2E && data[7] == 0x34 &&
                    data[data.length - 6] == 0x25 && // %
                    data[data.length - 5] == 0x25 && // %
                    data[data.length - 4] == 0x45 && // E
                    data[data.length - 3] == 0x4F && // O
                    data[data.length - 2] == 0x46 // F
                    //&& data[data.length - 1] == 0x0A // EOL
                    ) {
                return true;
            }
        }
        return false;
    }
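
    The fixed offsets above still miss files whose `%%EOF` marker is followed by a different amount of whitespace (for example CR LF). A more tolerant variant could look for the marker anywhere near the end of the data. This is only a sketch: `PdfSniffer` and `looksLikePdf` are hypothetical names, and the 1024-byte window is an assumption based on the tolerance many PDF readers apply. Like the byte checks above, it only tests the header and trailer and cannot detect corruption in the middle of the file.

    ```java
    import java.nio.charset.StandardCharsets;

    /** Heuristic PDF check; the class and method names are hypothetical. */
    final class PdfSniffer {
        private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);
        // Assumption: the marker lies within the last 1024 bytes of the data.
        private static final int TAIL_WINDOW = 1024;

        private PdfSniffer() {}

        static boolean looksLikePdf(byte[] data) {
            if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
                return false;
            }
            if (!matchesAt(data, 0, HEADER)) {
                return false;
            }
            // Scan backwards through the tail so trailing spaces, CR, or LF do not matter.
            int lowest = Math.max(HEADER.length, data.length - TAIL_WINDOW);
            for (int i = data.length - EOF_MARKER.length; i >= lowest; i--) {
                if (matchesAt(data, i, EOF_MARKER)) {
                    return true;
                }
            }
            return false;
        }

        private static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
            if (offset < 0 || offset + pattern.length > data.length) {
                return false;
            }
            for (int i = 0; i < pattern.length; i++) {
                if (data[offset + i] != pattern[i]) {
                    return false;
                }
            }
            return true;
        }
    }
    ```

    A caller would read the file into a byte array first (for example with `java.nio.file.Files.readAllBytes`) and pass it to `looksLikePdf`.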
    
  • 2020-12-24 12:29

    Since you use PDFBox you can simply do:

    PDDocument.load(file);
    

    It will throw an exception if the PDF is corrupted.

    If it succeeds, you can also check whether the PDF is encrypted using .isEncrypted().

  • 2020-12-24 12:34

    Maybe I am too late to answer, but you should have a look at Tika. It uses the PDFBox parser internally to parse PDFs.

    You just need to add tika-app-latest*.jar to your classpath.

    public String parseToStringExample() throws IOException, SAXException, TikaException {
        Tika tika = new Tika();
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
            return tika.parseToString(stream); // returns the text extracted from the PDF
        }
    }
    

    It would be a much cleaner solution. You can find more details on Tika usage here: https://tika.apache.org/1.12/api/
