Tika Parser: Exclude PDF Attachments

情深已故 2021-01-23 01:21

There is a PDF document that has attachments (here: joboptions) that should not be extracted by Tika. The contents should not be sent to Solr. Is there any way to exclude certa

2 Answers

星月不相逢 (original poster)

2021-01-23 01:40
Implement a custom org.apache.tika.extractor.DocumentSelector and set it on the ParseContext. Tika calls the DocumentSelector with the metadata of each embedded document to decide whether that embedded document should be parsed.

Example DocumentSelector:
```
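import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.metadata.Metadata;

public class CustomDocumentSelector implements DocumentSelector {

  // Return true to parse the embedded document, false to skip it.
  @Override
  public boolean select(Metadata metadata) {
    // File name of the embedded document; may be null.
    // Tika 2.x exposes this constant as TikaCoreProperties.RESOURCE_NAME_KEY.
    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    return resourceName == null || !resourceName.endsWith(".joboptions");
  }
}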
```
Register it on the ParseContext:
```
parseContext.set(DocumentSelector.class, new CustomDocumentSelector());
```
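The name check inside select() can also be kept in a small helper so it can be tested without Tika on the classpath. The following is only a sketch under assumptions: AttachmentNameFilter and its list of excluded suffixes are hypothetical names, not part of Tika, and the case-insensitive comparison is an addition that the answer above does not make.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not a Tika class): decides by file name whether an
// embedded document should be parsed.
final class AttachmentNameFilter {

  // Assumed list of suffixes to skip; extend it as needed.
  private static final List<String> EXCLUDED_SUFFIXES = Arrays.asList(".joboptions");

  private AttachmentNameFilter() {}

  static boolean shouldParse(String resourceName) {
    if (resourceName == null) {
      // No name available: parse it, the same default as in the answer above.
      return true;
    }
    // Compare in lower case so "X.JOBOPTIONS" is skipped as well.
    String lower = resourceName.toLowerCase(Locale.ROOT);
    for (String suffix : EXCLUDED_SUFFIXES) {
      if (lower.endsWith(suffix)) {
        return false;
      }
    }
    return true;
  }
}
```

With such a helper, the body of select() would reduce to passing metadata.get(Metadata.RESOURCE_NAME_KEY) to AttachmentNameFilter.shouldParse and returning the result.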