Tika Parser: Exclude PDF Attachments

情深已故 2021-01-23 01:21

There is a PDF document with attachments (here: joboptions) that Tika should not extract, so that their contents are not sent to Solr. Is there any way to exclude certa

2 Answers
  •  星月不相逢
    2021-01-23 01:40

    Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. The selector is called with the metadata of each embedded document and decides whether that document should be parsed.

    Example DocumentSelector:

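    import org.apache.tika.extractor.DocumentSelector;
    import org.apache.tika.metadata.Metadata;

    public class CustomDocumentSelector implements DocumentSelector {

      // Return true to parse an embedded document, false to skip it.
      @Override
      public boolean select(Metadata metadata) {
        // File name of the embedded document; null if no name is recorded.
        String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
        return resourceName == null || !resourceName.endsWith(".joboptions");
      }
    }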
    

    Register it on the ParseContext:

    parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
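
The name check inside select is case-sensitive and covers a single extension. If more flexibility is needed, the rule can be factored into a small helper that select delegates to. The following is only a sketch: the class name AttachmentNameFilter, the method shouldParse, and the extension set are hypothetical and not part of Tika. It deliberately has no Tika dependency so it can be run and tested on its own (Java 9+ for Set.of).

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Tika class: decides from a file name
// whether an embedded document should be parsed.
public class AttachmentNameFilter {

    // Assumed list of extensions to skip; adjust to your documents.
    private static final Set<String> EXCLUDED_EXTENSIONS = Set.of(".joboptions");

    public static boolean shouldParse(String resourceName) {
        // Without a name there is nothing to match, so parse by default,
        // mirroring the null check in the selector above.
        if (resourceName == null) {
            return true;
        }
        // Compare in lower case so ".JOBOPTIONS" is excluded as well.
        String lower = resourceName.toLowerCase(Locale.ROOT);
        for (String extension : EXCLUDED_EXTENSIONS) {
            if (lower.endsWith(extension)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(shouldParse("report.pdf"));        // prints "true"
        System.out.println(shouldParse("Preset.JOBOPTIONS")); // prints "false"
        System.out.println(shouldParse(null));                // prints "true"
    }
}
```

With such a helper, select would reduce to returning AttachmentNameFilter.shouldParse(metadata.get(Metadata.RESOURCE_NAME_KEY)).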
    
