Why does the Tika facade choose EmptyParser?

Posted by 纵然是瞬间 on 2021-01-27 19:10:39

Question


I'm using the Tika facade, following the example of the elasticsearch-mapper-attachments plugin. Here's my test code:

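import java.io.IOException;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

// src is the input to parse (an InputStream for this overload)
Tika tika = new Tika();
Metadata md = new Metadata();

try {
    // Extract at most 100000 characters of text; md is populated as a side effect
    String content = tika.parseToString(src, md, 100000);

    System.out.println("Content length: " + content.length());

    for (String s : md.names()) {
        System.out.println(s + ": " + md.get(s));
    }
}
catch (IOException | TikaException e) {
    // parseToString declares both exceptions
    System.out.println(e);
}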

Here's the output:

Content length: 0
X-Parsed-By: org.apache.tika.parser.EmptyParser
Content-Type: text/html

So the question is: if Tika correctly identifies the input as text/html, why does it use the EmptyParser? And if I'm supposed to pass a parser explicitly, which one should I pass for best results, assuming that autodetection succeeds as it does above?

Thank you.


Answer 1:


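Make sure that tika-parsers is on your classpath! Type detection lives in tika-core, which is why Content-Type is still reported as text/html, but the actual parser implementations ship in the separate tika-parsers artifact. When no parser is registered for the detected type, Tika falls back to EmptyParser, which extracts nothing. If you are using Gradle,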

compile 'org.apache.tika:tika-parsers:1.7'

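will do the trick. (On newer Gradle versions, where the compile configuration has been removed, use implementation instead.)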
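
To confirm whether the parsers are actually visible at runtime, a quick classpath check can help. The following is a minimal sketch using only the JDK; the class TikaClasspathCheck and its method names are hypothetical helpers, not part of Tika. It assumes a Tika 1.x layout, where org.apache.tika.parser.html.HtmlParser ships in tika-parsers and parser implementations are registered through a META-INF/services file.

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Tika. */
final class TikaClasspathCheck {

    // HTML parser class shipped in tika-parsers 1.x (assumed layout).
    static final String HTML_PARSER = "org.apache.tika.parser.html.HtmlParser";

    // Service file that tika-parsers uses to register parser implementations.
    static final String PARSER_SERVICES =
            "META-INF/services/org.apache.tika.parser.Parser";

    /** Returns true if the named class can be loaded, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false,
                    TikaClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Lists every copy of the given resource visible to this class loader. */
    static List<URL> findResources(String resourceName) throws IOException {
        ClassLoader loader = TikaClasspathCheck.class.getClassLoader();
        return Collections.list(loader.getResources(resourceName));
    }

    public static void main(String[] args) throws IOException {
        System.out.println("HtmlParser present: " + isOnClasspath(HTML_PARSER));
        for (URL url : findResources(PARSER_SERVICES)) {
            System.out.println("Parser service file: " + url);
        }
    }
}
```

If the first line reports false and no service file is listed, the parser jar is most likely missing from the runtime classpath, which would match the EmptyParser output shown in the question.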



Source: https://stackoverflow.com/questions/28954805/why-does-the-tika-facade-choose-emptyparser
