How to read large files using Tika?

Posted by 自作多情 on 2019-12-03 23:49:50

Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1, which disables the write limit, as explained in the javadocs.

Your code would then look something like this (inspired by the example):

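import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public String parseToPlainText() throws IOException, SAXException, TikaException {
    // -1 disables the write limit, so large documents are not cut off
    BodyContentHandler handler = new BodyContentHandler(-1);
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    // try-with-resources closes the stream even if parsing fails
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}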
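
To make the write-limit idea concrete without needing Tika on the classpath, here is a minimal sketch of a SAX handler that follows the contract described above: a non-negative limit bounds the collected text, and -1 disables the bound. LimitedTextHandler and its feed helper are hypothetical names written only for illustration. This is a simplified stand-in, not Tika's actual implementation.

```java
import java.util.Arrays;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; not Tika's implementation.
final class LimitedTextHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit; // -1 means "no limit"

    LimitedTextHandler(int writeLimit) {
        if (writeLimit < -1) {
            throw new IllegalArgumentException("writeLimit must be -1 or non-negative");
        }
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit == -1 || buffer.length() + length <= writeLimit) {
            buffer.append(ch, start, length);
        } else {
            // Keep the part that still fits, then signal that the limit was hit.
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new SAXException("write limit of " + writeLimit + " characters reached");
        }
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    // Feeds `chunks` chunks of 1,000 characters and returns how many characters were kept.
    static int feed(LimitedTextHandler handler, int chunks) {
        char[] chunk = new char[1000];
        Arrays.fill(chunk, 'x');
        try {
            for (int i = 0; i < chunks; i++) {
                handler.characters(chunk, 0, chunk.length);
            }
        } catch (SAXException e) {
            // Limit reached: the text collected so far is still available.
        }
        return handler.toString().length();
    }

    public static void main(String[] args) {
        // 250,000 characters go in; the bounded handler stops at its limit.
        System.out.println(feed(new LimitedTextHandler(100_000), 250));
        System.out.println(feed(new LimitedTextHandler(-1), 250));
    }
}
```

With a bounded handler, the text is cut off once the limit is reached and an exception is raised. With -1, everything is kept, which is why memory use becomes the thing to watch for very large files.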

I disagree with @Gagravarr about using a write limit of -1: as far as I can tell, the default that gets selected in the -1 case is in fact exactly 100000.

If I am not wrong, the documentation for Tika's BodyContentHandler (via WriteOutContentHandler) states that:

The internal string buffer is bounded at 100k characters.

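However, a better way to achieve this is to pass a StringWriter as the argument in place of -1, and to read the extracted text back from the writer once parsing has finished:

StringWriter writer = new StringWriter();
BodyContentHandler handler = new BodyContentHandler(writer);
// ... parse as before, then call writer.toString() to get the text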
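
Since the handler here is given a java.io.Writer, the writer does not have to be a StringWriter. For really large files, a file-backed writer may be the safer choice, because the extracted text then never has to sit in memory all at once. The sketch below shows the difference using only the JDK. WriterForwardingHandler, emit, toMemory and toFile are hypothetical names, and the handler is a simplified stand-in for a Writer-backed content handler, not Tika's own class.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a Writer-backed content handler; not Tika's code.
final class WriterForwardingHandler extends DefaultHandler {
    private final Writer out;

    WriterForwardingHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        try {
            out.write(ch, start, length); // the handler itself keeps no copy
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simulates a parser emitting the same piece of text `times` times.
    static void emit(WriterForwardingHandler handler, String piece, int times) {
        char[] ch = piece.toCharArray();
        for (int i = 0; i < times; i++) {
            handler.characters(ch, 0, ch.length);
        }
    }

    // Small output: collect everything in a StringWriter.
    static String toMemory(String piece, int times) {
        StringWriter writer = new StringWriter();
        emit(new WriterForwardingHandler(writer), piece, times);
        return writer.toString();
    }

    // Large output: stream to a file and return the file size in bytes.
    static long toFile(Path target, String piece, int times) {
        try {
            try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                emit(new WriterForwardingHandler(writer), piece, times);
            }
            return Files.size(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(toMemory("abc", 3));
        Path tmp = Files.createTempFile("extracted", ".txt");
        System.out.println(toFile(tmp, "abc", 1_000_000) + " bytes streamed to disk");
        Files.delete(tmp);
    }
}
```

In real Tika code the same idea would mean handing the file-backed writer to BodyContentHandler in place of the StringWriter. Whether that is worth it depends on how much text the documents are expected to produce.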
