How can I determine if a file is a PDF file?

暖寄归人 2020-12-24 11:57

I am using PdfBox in Java to extract text from PDF files. Some of the input files provided are not valid and PDFTextStripper halts on these files. Is there a clean way to ch

13 answers

粉色の甜心 (OP)

2020-12-24 12:34
This may be a late answer, but you should have a look at Apache Tika. It uses the PDFBox parser internally to parse PDFs.

You just need to add tika-app-latest*.jar to your classpath:
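```
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.xml.sax.SAXException;

public class ParsingExample {
    public String parseToStringExample() throws IOException, SAXException, TikaException {
        Tika tika = new Tika();
        // "test.pdf" is looked up on the classpath, next to ParsingExample
        try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
            return tika.parseToString(stream); // returns the PDF's text content
        }
    }
}
```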
This would be a much cleaner solution. For more details on Tika usage, see: https://tika.apache.org/1.12/api/
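If you only want a cheap pre-check before handing a file to PDFTextStripper, one option is to look at the first bytes yourself. PDF files normally start with the marker `%PDF-`, so a rough sketch (plain JDK, Java 11+ for `readNBytes`) could look like the following. The class name `PdfSniffer` and the method `looksLikePdf` are made up for illustration and are not part of PDFBox or Tika.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox or Tika.
final class PdfSniffer {
    // The marker that normally opens a PDF file, e.g. "%PDF-1.4".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSniffer() {}

    // Returns true if the file starts with "%PDF-".
    // Assumption: the marker sits at offset 0; files with leading junk bytes are rejected.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes returns fewer bytes if the file is shorter than the marker
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        }
    }
}
```

Note that this only rules out files that are obviously not PDFs. A file can pass the check and still be truncated or corrupt, and some readers tolerate a few stray bytes before the marker, so you would probably still want to catch the exception thrown when parsing fails.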