There is a PDF document with attachments (here: joboptions) that should not be extracted by Tika. Their contents should not be sent to Solr. Is there any way to exclude certa
Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. Tika calls the DocumentSelector with the metadata of each embedded document to decide whether that embedded document should be parsed.
Example DocumentSelector:
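import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {
    @Override
    public boolean select(Metadata metadata) {
        // File name of the embedded document; may be null if the attachment has no name.
        // In newer Tika versions this constant may live in TikaCoreProperties instead.
        String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
        // Parse everything except attachments whose name ends with ".joboptions".
        return resourceName == null || !resourceName.endsWith(".joboptions");
    }
}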
Register it on the ParseContext:
parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
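If more than one attachment type has to be skipped, or the file names vary in case (for example ".JOBOPTIONS"), the name check could be moved into a small helper that has no Tika dependency and is easy to unit-test. The following is only a sketch: the class name AttachmentNameFilter and the method shouldParse are made up for illustration and are not part of Tika.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of Tika): decides from the file name alone
// whether an embedded document should be parsed.
final class AttachmentNameFilter {
    private final List<String> excludedSuffixes = new ArrayList<>();

    // Suffixes are stored in lower case so matching is case-insensitive.
    AttachmentNameFilter(String... suffixes) {
        for (String suffix : suffixes) {
            excludedSuffixes.add(suffix.toLowerCase(Locale.ROOT));
        }
    }

    // Returns false only when the name ends with one of the excluded suffixes.
    // A missing name is treated as "parse it", matching the selector above.
    boolean shouldParse(String resourceName) {
        if (resourceName == null) {
            return true;
        }
        String name = resourceName.toLowerCase(Locale.ROOT);
        for (String suffix : excludedSuffixes) {
            if (name.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }
}
```

Inside select(Metadata metadata), the selector would then just return filter.shouldParse(metadata.get(Metadata.RESOURCE_NAME_KEY)), where filter is, for example, new AttachmentNameFilter(".joboptions").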