How to read large files using Tika?

Anonymous (unverified), submitted on 2019-12-03 03:04:01

Question:

I'm parsing large PDF and Word documents using Tika, but I get the following error message.

Your document contained more than 100000 characters, and so your requested limit has been reached. To receive the full text of the document, increase your limit. (Text up to the limit is however available). 

How can I increase the limit?

Answer 1:

Assuming you're basically following the Tika example for extracting to plain text, all you need to do is create your BodyContentHandler with a write limit of -1, which disables the write limit, as explained in the Javadocs.

Your code would then look something like this (inspired by the example):

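import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// A write limit of -1 disables the limit on extracted characters
BodyContentHandler handler = new BodyContentHandler(-1);

InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
try {
    parser.parse(stream, handler, metadata);
    return handler.toString();
} finally {
    stream.close();
}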
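To make the effect of a write limit more concrete, here is a minimal sketch that uses only the JDK's SAX classes. It is a simplified stand-in, not Tika's actual implementation: the class name `WriteLimitSketch` and the helper `feed` are hypothetical, and the sketch assumes that a negative limit means "no limit", mirroring the -1 convention described above.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Simplified stand-in for a write-limited text handler (NOT Tika's code).
class WriteLimitSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final int writeLimit; // assumption: any negative value means "no limit"

    WriteLimitSketch(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit < 0 || text.length() + length <= writeLimit) {
            text.append(ch, start, length);
            return;
        }
        // Keep the part that still fits, then signal that the limit was reached.
        int remaining = writeLimit - text.length();
        text.append(ch, start, remaining);
        throw new SAXException("write limit of " + writeLimit + " characters reached");
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Hypothetical helper: feeds a string to the handler and
    // returns true if the write limit was hit.
    static boolean feed(WriteLimitSketch handler, String s) {
        try {
            handler.characters(s.toCharArray(), 0, s.length());
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        WriteLimitSketch limited = new WriteLimitSketch(5);
        boolean hit = feed(limited, "0123456789");
        System.out.println(hit + " " + limited);   // true 01234

        WriteLimitSketch unlimited = new WriteLimitSketch(-1);
        hit = feed(unlimited, "0123456789");
        System.out.println(hit + " " + unlimited); // false 0123456789
    }
}
```

As in the error message from the question, the text collected up to the limit stays available in the limited case; only the remainder is dropped.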


Answer 2:

I disagree with @Gagravarr's suggestion to use a write limit of -1; as far as I can tell, the default that gets selected in the -1 case is in fact exactly 100000.

If I am not mistaken, the Tika documentation for BodyContentHandler and WriteOutContentHandler states that:

The internal string buffer is bounded at 100k characters.

However, the best way to achieve this is to pass a StringWriter object as the argument in place of -1.

StringWriter any = new StringWriter(); 

and then

BodyContentHandler handler = new BodyContentHandler(any);


