Tika Parser: Exclude PDF Attachments

Submitted by 折月煮酒 on 2019-12-02 12:36:58

Question


I have a PDF document with attachments (here: joboptions) that should not be extracted by Tika; their contents should not be sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?


Answer 1:


Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. The DocumentSelector is called with the metadata of each embedded document and decides whether that embedded document should be parsed.

Example DocumentSelector:

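import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {

  @Override
  public boolean select(Metadata metadata) {
    // Name of the embedded document; may be null if the parser did not set one.
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    // Parse everything except attachments whose name ends with ".joboptions".
    return resourceName == null || !resourceName.endsWith(".joboptions");
  }
}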

Register it on the ParseContext:

parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
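The selector above hard-codes a single extension and matches it case-sensitively, so an attachment named PRINT.JOBOPTIONS would still be parsed. Below is a minimal sketch of a small name filter that a select(Metadata) implementation could delegate to. The class EmbeddedNameFilter and its accept method are hypothetical helpers written for illustration, not part of Tika, and the sketch uses only the Java standard library.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Tika): decides by extension whether to parse an embedded document. */
final class EmbeddedNameFilter {

  // Extensions to skip, stored in lower case, e.g. ".joboptions".
  private final Set<String> excludedExtensions = new HashSet<>();

  EmbeddedNameFilter(String... extensions) {
    for (String ext : extensions) {
      excludedExtensions.add(ext.toLowerCase(Locale.ROOT));
    }
  }

  /** Returns true if an embedded document with this name should be parsed. */
  boolean accept(String resourceName) {
    if (resourceName == null) {
      return true; // no name available: parse it, as the selector above does
    }
    String lower = resourceName.toLowerCase(Locale.ROOT);
    for (String ext : excludedExtensions) {
      if (lower.endsWith(ext)) {
        return false; // name ends with an excluded extension
      }
    }
    return true;
  }
}
```

Assuming a field such as filter = new EmbeddedNameFilter(".joboptions") in the selector, select(Metadata) could then return filter.accept(metadata.get(Metadata.RESOURCE_NAME_KEY)).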



Answer 2:


@gagravarr, we changed that behavior via TIKA-2096, Tika 1.15. The default is now "extract all embedded documents". To avoid parsing embedded documents call:

parseContext.set(Parser.class, new EmptyParser());

Or implement an EmbeddedDocumentExtractor that does nothing and pass it in via the ParseContext.

If you were using Solr DIH's TikaEntityProcessor, I'd set extractEmbedded to false, but you aren't; and please don't. :)

So, I don't think there's an easy way to turn off parsing of embedded documents only for PDFs, and I'm not sure you'd want to. What if there were an MSWord file attached to a PDF, for example?

If you want to ignore .joboptions, you could use a custom EmbeddedDocumentExtractor.



Source: https://stackoverflow.com/questions/50817271/tika-parser-exclude-pdf-attachments
