Tika Parser: Exclude PDF Attachments

Question

I have a PDF document with attachments (here: joboptions files) that should not be extracted by Tika, so that their contents are not sent to Solr. Is there any way to exclude certain (or all) PDF attachments in the Tika config?

Answer 1:

Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. Tika calls the DocumentSelector with the metadata of each embedded document to decide whether that document should be parsed.

Example DocumentSelector:

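import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {

  @Override
  public boolean select(Metadata metadata) {
    // Name of the embedded document, if the parser supplied one.
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    // Parse everything except attachments whose name ends in ".joboptions".
    return resourceName == null || !resourceName.endsWith(".joboptions");
  }
}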

Register it on the ParseContext:

parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
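
If more than one attachment type needs to be excluded, the name check in select() can be pulled out into a small helper that has no Tika dependency and is easy to unit test. The following is only a sketch: AttachmentNameFilter and shouldParse are hypothetical names, not part of Tika, and unlike the selector above it matches suffixes case-insensitively.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Tika): decides by file name whether an
// embedded document should be parsed.
final class AttachmentNameFilter {

  private final List<String> excludedSuffixes = new ArrayList<>();

  AttachmentNameFilter(Collection<String> suffixes) {
    for (String suffix : suffixes) {
      // Lower-case once so that matching is case-insensitive.
      excludedSuffixes.add(suffix.toLowerCase(Locale.ROOT));
    }
  }

  boolean shouldParse(String resourceName) {
    if (resourceName == null) {
      // No name available: parse, as the selector above does.
      return true;
    }
    String lower = resourceName.toLowerCase(Locale.ROOT);
    for (String suffix : excludedSuffixes) {
      if (lower.endsWith(suffix)) {
        return false;
      }
    }
    return true;
  }
}
```

Inside select(), the body would then reduce to something like filter.shouldParse(metadata.get(Metadata.RESOURCE_NAME_KEY)), with filter created once as a field of the selector.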

Answer 2:

@gagravarr, we changed that behavior in Tika 1.15 (TIKA-2096). The default is now "extract all embedded documents". To avoid parsing embedded documents, call:

parseContext.set(Parser.class, new EmptyParser());

Or subclass EmbeddedDocumentExtractor to do nothing and send that in via the ParseContext.

If you were using Solr DIH's TikaEntityProcessor, I'd set extractEmbedded to false, but you aren't; and please don't. :)

So, I don't think there's an easy way to turn off parsing of embedded documents only for PDFs, and I'm not sure you'd want to. What if there were an MSWord file attached to a PDF, for example?

If you want to ignore .joboptions, you could use a custom EmbeddedDocumentExtractor.
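
A rough sketch of what such an extractor could look like, assuming Tika 1.x (in Tika 2.x the key is TikaCoreProperties.RESOURCE_NAME_KEY instead of Metadata.RESOURCE_NAME_KEY). SkipJobOptionsExtractor is a made-up name; it skips attachments whose name ends in ".joboptions" and hands everything else to Tika's ParsingEmbeddedDocumentExtractor, so an MSWord file attached to a PDF would still be parsed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;

import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name; only the two overridden methods come from Tika.
class SkipJobOptionsExtractor implements EmbeddedDocumentExtractor {

  // Tika's standard extractor does the real work for everything we keep.
  private final EmbeddedDocumentExtractor delegate;

  SkipJobOptionsExtractor(ParseContext context) {
    this.delegate = new ParsingEmbeddedDocumentExtractor(context);
  }

  @Override
  public boolean shouldParseEmbedded(Metadata metadata) {
    String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
    if (name != null && name.toLowerCase(Locale.ROOT).endsWith(".joboptions")) {
      return false; // skip .joboptions attachments
    }
    return delegate.shouldParseEmbedded(metadata);
  }

  @Override
  public void parseEmbedded(InputStream stream, ContentHandler handler,
                            Metadata metadata, boolean outputHtml)
      throws SAXException, IOException {
    delegate.parseEmbedded(stream, handler, metadata, outputHtml);
  }
}
```

It would be registered like the other options: parseContext.set(EmbeddedDocumentExtractor.class, new SkipJobOptionsExtractor(parseContext));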

Source: https://stackoverflow.com/questions/50817271/tika-parser-exclude-pdf-attachments

Tags

pdf

solr

apache-tika