How to get the text content of files with Tika 1.6?

 ̄綄美尐妖づ submitted on 2019-12-11 13:29:58

Question


Hi, I am trying to get the text content from files of the following types: pdf, txt, doc, docx, and odt. The implementation with Tika previously worked fine, but now it is broken. This is the code:

```

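import java.io.ByteArrayInputStream;
import org.apache.tika.Tika;

public void uploadFile(FileUploadEvent event) throws Exception {
    UploadedFile file = event.getUploadedFile();
    byte[] data = file.getData();
    Tika tika = new Tika();
    // parseToString detects the file type and returns the extracted plain text
    String text = tika.parseToString(new ByteArrayInputStream(data));
    ...
}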

```

Any ideas? Is the implementation wrong?


Answer 1:


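You need to add the tika-parsers dependency. The Tika facade class lives in tika-core, which on its own does not include the format-specific parsers for PDF, Word, or ODT files.

For example, with Maven, add this dependency to your pom.xml: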

```
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.7</version>
</dependency>
```
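
To confirm that the parser classes actually ended up on the runtime classpath (for example inside a deployed web application), a quick check along these lines may help. This is only a minimal sketch: the class name ParserClasspathCheck and the method isOnClasspath are hypothetical names chosen for illustration, and the listed class names are assumed to match the Tika 1.x package layout.

```java
// Minimal sketch: check whether Tika classes are visible at runtime.
// ParserClasspathCheck and isOnClasspath are hypothetical names, not part of Tika.
class ParserClasspathCheck {

    // Returns true if the named class can be located by this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ParserClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed Tika 1.x class names: the first is in tika-core, the others in tika-parsers
        String[] candidates = {
            "org.apache.tika.Tika",
            "org.apache.tika.parser.pdf.PDFParser",
            "org.apache.tika.parser.microsoft.OfficeParser"
        };
        for (String name : candidates) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "missing"));
        }
    }
}
```

If the first entry is reported as found and the parser entries as missing, only tika-core is available, which would be consistent with text extraction no longer working.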

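You can then use the AutoDetectParser:

```
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Wrap the uploaded bytes (data, from the question) in a stream
InputStream is = new ByteArrayInputStream(data);
String text = null;
BodyContentHandler handler = new BodyContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
try {
    // The parser detects the document type and writes the body text to the handler
    parser.parse(is, handler, metadata);
    text = handler.toString();
} catch (TikaException te) {
    System.out.println(te.toString());
} finally {
    is.close();
}
```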


Source: https://stackoverflow.com/questions/27969051/how-to-get-the-text-content-files-with-tika-1-6
