How can I determine if a file is a PDF file?

暖寄归人 2020-12-24 11:57

I am using PDFBox in Java to extract text from PDF files. Some of the input files are not valid PDFs, and PDFTextStripper halts on them. Is there a clean way to ch

13 answers
  •  粉色の甜心
    2020-12-24 12:34

    I may be late to answer, but you should have a look at Tika. It uses the PDFBox parser internally to parse PDFs.

    You just need to add tika-app-latest*.jar to your classpath.

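    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.Tika;
    import org.apache.tika.exception.TikaException;
    import org.xml.sax.SAXException;

    public String parseToStringExample() throws IOException, SAXException, TikaException {
        Tika tika = new Tika();
        // Loads test.pdf from the classpath, relative to ParsingExample's package
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
            return tika.parseToString(stream); // returns the PDF's text
        }
    }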
    

    This would be a much cleaner solution. For more details on Tika usage, see https://tika.apache.org/1.12/api/
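
    If you only want to filter out obvious non-PDF inputs before handing a file to a parser, a cheap pre-check is to look for the `%PDF-` marker that PDF files begin with. The following is a minimal sketch using only the JDK (Java 11+ for `readNBytes`), not part of Tika or PDFBox; the class name `PdfSniffer` and method `looksLikePdf` are made up for illustration. It only inspects the first five bytes, so a file that passes can still be corrupt further in, and you should still catch the parser's exceptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not a Tika or PDFBox API.
final class PdfSniffer {
    // PDF files start with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the file's first bytes are exactly "%PDF-". */
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes returns fewer bytes if the file is shorter than the marker,
            // in which case the comparison below is simply false.
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        }
    }
}
```

    A call such as `PdfSniffer.looksLikePdf(Paths.get("input.pdf"))` could then gate the text extraction step, with files that fail the check skipped or logged.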
